Methods Workshop: Introduction to Text-mining with Python

Mary Chester-Kadwell (CDH Methods Fellow)

Please note this workshop has limited spaces and an application process in place. Application forms should be returned to CDH Learning (learning@cdh.cam.ac.uk) by Monday, 22 February 2021. Successful applicants will be notified by end-of-day Thursday, 25 February 2021. Preparatory material will be released on Thursday, 4 March 2021, one week in advance of the first session.
 
Text-mining is the extraction of information from unstructured text, such as books, newspapers, and manuscript transcriptions. This foundational course is aimed at students and staff who are new to text-mining. It presents a basic introduction to text-mining principles and methods, with coding examples and exercises in Python. To illustrate the process, we will walk through a simple example of collecting, cleaning and analysing a text.
 
If you are interested in attending this course, please fill in the application form. Places will be prioritised for students and staff in the schools of Arts & Humanities, Humanities & Social Sciences, libraries and museums. If you study or work in a STEM department and use humanities or social sciences approaches you are also welcome to apply.
 
Aims
By the end of this course you should be able to:
  • Understand, in broad outline, the different text-mining methods and their uses.
  • Plan a basic text-mining pipeline for your work.
  • Extend your Python and Jupyter Notebook skills to text-mining.
Topics
We will cover:
  • What text-mining is for and what text-mining methods are available (including topic modelling, sentiment analysis, named entity recognition).
  • The text-mining pipeline and its five steps: choosing and collecting text, cleaning and preparing, exploring, analysing, and presenting results.
  • Revision of basic Python:
    • Working with text using strings and manipulating lists of strings;
    • Importing code and calling functions;
    • Using Jupyter Notebooks.
  • Methods for:
    • Harvesting text from the web;
    • Reading from and saving text to files;
    • Working with TEI-XML;
    • Cleaning up text (normalising);
    • Splitting strings into words and sentences (tokens);
    • Removing unwanted words (stopwords);
    • Counting tokens (frequency analysis);
    • Visualising results.
  • Next steps: resources and directions.
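As a rough illustration of the cleaning, tokenising, stopword-removal and counting steps listed above, here is a minimal sketch using only the Python standard library. The sample text, the stopword list and the function names are assumptions made for this example and are not taken from the course materials; real projects often rely on a dedicated library for tokenising and stopwords.

```python
import re
from collections import Counter

# Hypothetical sample text and stopword list, chosen only for illustration.
TEXT = "The cat sat on the mat. The dog sat on the log!"
STOPWORDS = {"the", "on", "a", "an", "and", "of"}


def normalise(text):
    """Lowercase the text and replace every character that is not a-z or whitespace with a space."""
    return re.sub(r"[^a-z\s]", " ", text.lower())


def tokenise(text):
    """Split normalised text into word tokens on whitespace."""
    return text.split()


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stopwords]


def count_tokens(tokens):
    """Count how often each token occurs (frequency analysis)."""
    return Counter(tokens)


tokens = remove_stopwords(tokenise(normalise(TEXT)), STOPWORDS)
frequencies = count_tokens(tokens)

# Print each remaining word with its count, most frequent first.
for word, count in frequencies.most_common():
    print(word, count)
```

This simple normaliser discards accented letters, digits and apostrophes, which is rarely what you want for historical or multilingual sources, so treat it as a starting point rather than a recommended cleaning rule.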
Format
This course takes a ‘flipped classroom’ approach, in which much of the learning is self-paced and done in your own time. Preparatory material is released in the week before the course takes place. The course starts with a 1-hour remote video session to introduce the topics and materials, and ends with another 1-hour remote video session to discuss progress and next steps. Self-paced materials are provided to work through between the two sessions. A chat forum on Moodle will be available for asking and answering questions during the week.
 
Please make sure you can plan time in your schedule to complete the preparatory and self-paced materials in order to get the most out of the course. Time estimates for working through these materials are as follows:
  • Preparatory materials (total: 15 minutes-3 hours):
    • Introductory video: 15 minutes
    • Optional: Installing Python: 1 hour
    • Optional: Revision of basic Python: 1-2 hours
  • Self-paced Jupyter Notebooks (total: 2-4 hours) 
The amount of time you may wish to spend on the self-paced materials depends on your existing experience and personal goals.
 
Prerequisites
We expect you to have some basic knowledge of Python, or coding in another language. At a minimum, we recommend that you have attended the CDH Basics session “First steps in coding and Jupyter Notebooks” and subsequently done some follow-on independent learning in basic Python. Alternatively, you may have equivalent basic coding experience in Python or a different language from another course of study.
 
If you are unsure whether your coding experience is sufficient, please apply anyway and we can talk about it together.
 
Requirements
You will need a laptop or desktop computer to join the sessions and follow the self-paced materials. You will also need Python 3 and Jupyter installed; full installation instructions will be provided in the preparatory materials if you don’t already have them.
Date: 
Thursday, 11 March, 2021 - 11:00 to 12:00
Thursday, 18 March, 2021 - 11:00 to 12:00
Event location: 
Online event
