2–15 March 2022

The Cultural Heritage Data School, led by Cambridge Digital Humanities, is an online intensive application-only teaching programme which aims to bring together participants from the wider Galleries, Libraries, Archives and Museums (GLAM) sector and academia to explore the methods used to create, visualise and analyse digital archives and collections.

The curriculum will be structured around the digital collections and archives pipeline, covering the general principles and applied practices involved in the generation, exploration, visualisation, analysis and preservation of digital collections and archives.

Q&A Session

Watch the recording of the Q&A session to hear more about this year’s Cultural Heritage Data School and the application process.

Programme Content

During the Data School, all participants follow a structured programme covering the following topics. Additional modules and workshops may also be offered and applicants are encouraged to check back here to view the full provisional programme when it is released in January.

Our sessions are suitable for those without prior knowledge of programming languages.

  • Project design, data collection and wrangling
  • Introduction to text-mining with R
  • Digital text markup and TEI
  • Named Entity Recognition with Python
  • Basic Principles of Data Visualisation
  • Data sustainability, preservation and destruction

Please note that the Data School timetable and content may change and are subject to staff availability.


Dates and timings in GMT

Please note, the timetable and content are provisional and may be subject to change.

Date Time Topic
Wednesday 2 March 2022 1:30–2:15pm Introduction and welcome (cancelled)
Wednesday 2 March 2022 2:30–3:30pm Session 1: Digital Research Design and the Project lifecycle (cancelled)
Thursday 3 March 2022 1:00–1:30pm Introduction and welcome
Thursday 3 March 2022 1:30–2:30pm Session 2: Data collection, wrangling and preparation
Thursday 3 March 2022 3:00–4:00pm Session 3: Text mining with R I
Friday 4 March 2022 Self-paced study day
Monday 7 March 2022 1:30–2:30pm Session 4: Digital text mark-up and TEI I
Monday 7 March 2022 3:00–4:00pm Session 5: Named Entity Recognition I
Tuesday 8 March 2022 2:00–4:00pm Public workshop: (Anti)Colonial archives in the digital age
Wednesday 9 March 2022 1:30–2:30pm Session 6: Digital text mark-up and TEI II
Wednesday 9 March 2022 3:00–4:00pm Session 7: Text mining with R II
Wednesday 9 March 2022 4:15–5:15pm Special Session: Automated Topic Detection
Thursday 10 March 2022 1:30–2:30pm Session 8: Basic Principles of Data Visualisation I
Thursday 10 March 2022 3:00–4:00pm Session 9: Basic Principles of Data Visualisation II
Friday 11 March 2022 Self-paced study day
Monday 14 March 2022 1:30–2:30pm Session 10: Named Entity Recognition II
Monday 14 March 2022 3:00–4:00pm Session 11: Data sustainability, preservation and destruction
Tuesday 15 March 2022 1:30–3:00pm Session 12: Closing plenary and next steps



Introduction and welcome
with Dr Anne Alexander
2 March 2022

This session will include a short presentation about CDH, a Q&A session and introductions.

Module 1: Project design, data collection and wrangling
with Dr Anne Alexander
2 & 3 March 2022

This introductory module explores the lifecycle of a digital research project across its stages: design, data capture, transformation, analysis, presentation and preservation. It also introduces tactics for embedding ethical research principles and practices at each stage of the research process. We will discuss the importance of documenting data provenance and look at the practical and ethical challenges of common methods used for bulk data capture, including the use of APIs and working with data collected by others. The second session in the module will introduce the data-cleaning tool OpenRefine and a set of exercises for participants to work through in their own time.

Module 2: Introduction to text-mining with R
with Meng Liu
3 & 9 March 2022

Text mining, also called text data mining or quantitative text analysis, refers to the process of transforming unstructured textual data into a structured format to identify meaningful patterns and to generate new insights. This module serves as an entry point into the world of text mining. We will go over fundamental concepts such as tokenisation, n-grams, the document-feature matrix, bag-of-words and tf-idf, with the aim of building a foundational understanding of quantitative text analysis. We will be using the gutenbergr package to access literary texts from Project Gutenberg as well as a variety of packages handy for text mining (e.g. tidyverse and tidytext; specific installation instructions will be provided in advance). The main output of this module will be a hands-on topic modelling project. Participants will have the opportunity to practise exploring and visualising text data from their favourite author (in Project Gutenberg). No prior knowledge of R is assumed to complete the module. Self-paced study materials and exercises (with code scripts) will be provided to all participants.
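As a rough illustration of the concepts named above, the following minimal sketch works through tokenisation, n-grams, bag-of-words, a document-feature matrix and tf-idf on a toy corpus. It is written in plain Python rather than the R packages used in the module, and is not taken from the course materials: the two sample documents, the function names and the particular tf-idf variant (term frequency multiplied by the natural log of inverse document frequency) are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Toy corpus invented for illustration; not taken from the course materials.
docs = {
    "doc1": "The whale swam. The whale dived.",
    "doc2": "The garden was quiet.",
}

def tokenise(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Bag-of-words: each document becomes a table of term counts, ignoring word order.
bags = {name: Counter(tokenise(text)) for name, text in docs.items()}

# Document-feature matrix: one row per document, one column per vocabulary term.
vocab = sorted(set().union(*bags.values()))
dfm = {name: [bag[term] for term in vocab] for name, bag in bags.items()}

def tf_idf(term, name):
    """Term frequency times log inverse document frequency (one common variant)."""
    bag = bags[name]
    tf = bag[term] / sum(bag.values())
    df = sum(1 for b in bags.values() if term in b)
    return tf * math.log(len(bags) / df)
```

In this sketch a word that appears in every document, such as "the", receives a tf-idf of zero, while a word concentrated in one document, such as "whale", scores higher. That is the intuition behind using tf-idf to surface terms that are distinctive for a text.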

Module 3: Digital text markup and TEI
with Huw Jones
7 & 9 March

The TEI (Text Encoding Initiative, https://tei-c.org/) is a standard for the transcription and description of text-bearing objects, and is very widely used in the digital humanities, from digital editions and manuscript catalogues to text mining and linguistic analysis. This module will take you through the basics of the TEI (what it is and what it can be used for), with a particular focus on uses in research, paths to publication (both web and print) and the use of TEI documents as a dataset for analysis. There will be a chance to create some TEI yourself as well as to look at existing projects and examples. The module will take place over two sessions: an introductory taught session, then a chance to work on TEI records yourself, followed by a review and discussion session.
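To give a sense of what using TEI documents as a dataset can look like, here is a minimal sketch that reads a tiny TEI file with Python's standard library and pulls out the title and the tagged names. The sample document, including the person and place it mentions, is invented for illustration and does not come from the module materials. Only the TEI namespace and the element names (teiHeader, titleStmt, persName, placeName and so on) are part of the TEI standard itself.

```python
import xml.etree.ElementTree as ET

# The TEI namespace, bound to a prefix so it can be used in path expressions.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A tiny invented TEI document: a header with a title, and one marked-up sentence.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample letter</title></titleStmt>
      <publicationStmt><p>Unpublished example.</p></publicationStmt>
      <sourceDesc><p>Invented for illustration.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <p><persName>Jane Doe</persName> wrote from <placeName>Cambridge</placeName>.</p>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(SAMPLE)

# Read the title from the header, then collect every tagged person and place in the body.
title = root.find(".//tei:titleStmt/tei:title", TEI_NS).text
people = [el.text for el in root.iterfind(".//tei:body//tei:persName", TEI_NS)]
places = [el.text for el in root.iterfind(".//tei:body//tei:placeName", TEI_NS)]

print(title, people, places)
```

Because the names are encoded with semantic markup, they can be listed directly from the document structure, with no need to guess them from the raw text.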

Module 4: Named Entity Recognition with Python
with Dr Mary Chester-Kadwell
7 & 14 March

Text mining is the extraction of information from unstructured text, in other words text that has not been encoded with semantic markup. In this module we will look at one way of extracting information from unstructured text by recognising named entities automatically. A named entity is any type of real-world object or concept, such as a person, organisation, location or date. Using the example of letters from the 19th-century botanist John Stevens Henslow, we will introduce how to: recognise and visualise named entities using machine learning; create training data for improving the results; and link named entities to existing knowledge bases.

Participants will be able to choose either a ‘no code’ or a ‘Python’ track for this module. Everyone will join the same virtual sessions, and have access to the same self-paced study materials and exercises, but the suggested directions given will be different depending on which track you choose to follow. For those with experience in Python, the materials include a set of Jupyter notebooks using the spaCy NLP library, but prior knowledge of Python is not required in order to complete the module.

Module 5: Basic Principles of Data Visualisation
with Tobias Lunde
10 March

Data visualisation not only makes it possible to ‘see’ data by representing it graphically, but is also an essential tool for describing, summarising, reasoning about and gaining an understanding of data. This module will provide a brief introduction to the basic principles and rules of thumb of data visualisation, to help you communicate effectively and clearly using graphics. Topics covered will include: different roles and types of graphs; the basics of graph design, colour and typesetting; labelling and providing context for graphs; and pointers for visualising geographic and time series data. It will also look at some common pitfalls that lead to unclear or even ‘deceitful’ graphics and how to avoid these.

The module does not assume any particular software or skill level when it comes to programming or computing; instead, it provides a general overview that will apply to a wide range of settings and software.

Module 6: Data sustainability, preservation and destruction
with Dr Anne Alexander
14 March

Ensuring long-term access to digital data is often a difficult task: both hardware and code decay much more rapidly than many other means of information storage. Digital data created in the 1980s is frequently unreadable, whereas books and manuscripts written in the 980s are still legible. This module explores good practice in data preservation and software sustainability and looks at what you need to do to ensure that the data you don’t want to keep is destroyed.

Closing plenary and next steps
15 March

When and Where

This year’s School will be held online from 2 to 15 March 2022. Data School live sessions are timetabled daily between 1:30pm and 4pm (GMT). To convert this to your timezone you can use this Time Zone Converter. The timetable is still being finalised, but you can view last year’s timetable as an example here.

During the course you will be provided with links to our virtual learning environment (Moodle) where we will publish course content and links to our online video delivery platform for teaching and social interactions.


Cambridge Digital Humanities is committed to democratising access to digital methods and tools and is offering the following subsidised participation fees to encourage applications from those who do not normally have access to this type of training. The fees include all teaching costs.

  • £245* (Standard Rate)
  • £45** (Concessionary Rate)

*Standard rate is applicable if you are part of an organisation with paid staff. However, on the application form you will have the opportunity to apply for a discounted rate if needed (for example if your organisation has a small number of paid staff or limited funding).

**Concessionary rate is for students, unemployed, community projects, unfunded projects, and Global South residents. In addition, a small number of bursaries are available to those who can demonstrate financial need. You can apply for this on the application form.

The deadline for payment is two weeks before the start of the School.


The Cambridge Data Schools are competitive and application-only schools. Places are given to those who we feel can make best use of the classes.

No previous experience of coding is required and there are no specific academic requirements; however, the course content is broadly suitable for those with an undergraduate degree or equivalent professional experience. The School is taught in English. You will need a reliable internet connection to join in, and the ability to download free, open software for use during the School.

While we encourage applications from everyone, we particularly welcome applications from women and black and minority ethnic candidates as they have historically been under-represented in the technology and data science sector. We also strongly welcome applications from outside the UK, provided applicants can attend the live workshop slots during 1:30–4pm GMT. Sessions will not be recorded and therefore live attendance is required.

Places are generally not open to Cambridge University students or staff, who should instead consult the CDH Learning pages for information on year-round workshops. If you would like to request an exception to this rule, please get in touch: dataschool@cdh.cam.ac.uk


The 2022 Cultural Heritage Data School teaching team includes:

Sessions will include live-taught instruction, demonstrations and discussions online, with access to self-paced study materials and support via email-based discussion groups between sessions. Participants will need a laptop or desktop computer and internet access to participate in the sessions. Some sessions will require software installation – full instructions will be provided but please ensure you have access rights to install software on the device you will be using.

How to apply

Fill in the application form by 3 February 2022. You will hear whether your application was successful by 8 February 2022.

The Cultural Heritage Data School is application-only with limited places. During your application you should make best use of the free text sections to explain your current experience, and what you would get out of attending the School.


Heather Stallard (CDH Communications and Events Coordinator)


Cambridge Digital Humanities

Tel: +44 1223 766886
Email: enquiries@crassh.cam.ac.uk