Guides: Text Data Mining (TDM): Free Digital Text Corpora

Free Digital Text Corpora

Text Creation Partnership's Collections

Created in the course of the Text Creation Partnership project undertaken by the University of Michigan Library, the Bodleian Libraries at the University of Oxford, ProQuest, and the Council on Library and Information Resources. Content from Phase I is available for access, distribution, use, or reuse by anyone; restrictions on the content created during Phase II are to be removed on or about January 1, 2021.

Early English Books Online–TCP The EEBO TCP corpus consists of the works represented in the Early English Books Online collections known as Short Title Catalogues I and II (based on the Pollard & Redgrave and Wing short title catalogs), as well as the Thomason Tracts and the Early English Books Tract Supplement collections. The more than 125,000 volumes cover the period from the first book printed in English in 1475 through to 1700 and include works of literature, philosophy, politics, religion, geography, history, mathematics, music, the practical arts, natural science, etc.
Eighteenth-Century Collections Online–TCP Eighteenth Century Collections Online includes significant English-language and foreign-language titles printed in the United Kingdom during the 18th century, and thousands of important works from the Americas. The database contains more than 32 million pages of text and more than 205,000 individual volumes.
Evans Early American Imprints–TCP 5,000 accurately keyed and fully searchable SGML/XML text editions from among the 40,000 titles available in the online Evans Early American Imprints collection.

McGill's .txtLAB texts

Novel450 450 novels in German, French, and English
ContemporaryNovels 1,211 contemporary novels published between 2000 and 2015
Academic Publishing - A collection of institutional affiliations for 5,664 academic articles published in four prestigious journals in the humanities
LIWC (Linguistic Inquiry and Word Count) tables for 25,000+ fiction and non-fiction text documents (the nineteenth century canon, Hathi Trust nineteenth-century documents, the twentieth century repositories of Gutenberg and Amazon, and multiple contemporary literary genres from mysteries to prizewinners) in two separate languages (German and English)
Race and film Character dialogue from 780 Hollywood movies produced between 1970 and 2014. Characters have been labeled by their racial and ethnic identity using IMDB.

1880s Fiction

Sample corpora assembled from Project Gutenberg by students in Alan Liu's English 197 course (Fall 2014, UC Santa Barbara). They can be particularly useful for assignments and individual students' projects:

Adult British Fiction - 1880s (451 works assembled with assistance of Stanford Literary Lab) (metadata spreadsheet)
- Female Corpus (237 works) (metadata)
- Male Corpus (214 works) (metadata)
Children's Fiction - 1880s (135 works of the 1880s additional to adult fiction of the decade assembled by students in UC Santa Barbara's English 197 course, Fall 2014) (metadata spreadsheet)
- American Corpus (33 works)
- European Corpus (102 works) (includes a few Continental European works in addition to the British and American works in this corpus)
- Female Corpus (28 works)
- Male Corpus (107 works)
- British Female Corpus (14 works)
- British Male Corpus (89 works)

Miscellanea

Demo text collections assembled by David Bamman, UC Berkeley School of Information:
- Book summaries (2,000 book summaries from Wikipedia) (zip file)
- Film summaries (2,000 movie summaries from Wikipedia) (zip file)
Eighteenth Century Collections Online texts 2,198 plain-text English documents from Eighteenth Century Collections Online [TCP-ECCO] (zip file)
Shakespeare plays (24 plays from Project Gutenberg assembled by David Bamman, UC Berkeley School of Information) (zip file)
Sunday School Books in Nineteenth Century America Dataset (166 texts) (zip file)
Feeding America: The Historic American Cookbook Dataset (76 texts) (zip file)
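
Several of the collections above are distributed as zip files of plain-text documents. As a minimal sketch of how such an archive might be read into memory for analysis, the Python example below collects every .txt member of a zip file into a dictionary. The function names, the file names, and the assumption that the texts are UTF-8 encoded are illustrative only and do not come from any of the listed collections; the demo archive is made up so that the example runs without a download.

```python
import io
import zipfile


def load_texts_from_zip(source):
    """Return {member name: text} for every .txt file in a zip archive.

    `source` may be a file path or a binary file-like object.
    """
    texts = {}
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            # Skip directories and non-text members (e.g. metadata spreadsheets).
            if name.endswith("/") or not name.lower().endswith(".txt"):
                continue
            raw = archive.read(name)
            # Assumption: files are UTF-8; undecodable bytes are replaced.
            texts[name] = raw.decode("utf-8", errors="replace")
    return texts


def build_demo_archive():
    """Build a tiny in-memory zip (made-up content) so the sketch is runnable."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("corpus/play_one.txt", "To be, or not to be.")
        archive.writestr("corpus/recipe_one.txt", "Take two eggs and beat them well.")
        archive.writestr("corpus/metadata.csv", "file,title\n")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    corpus = load_texts_from_zip(build_demo_archive())
    for name, text in sorted(corpus.items()):
        print(name, len(text.split()), "words")
```

To use this with a downloaded collection, pass the path of the zip file to load_texts_from_zip in place of the demo archive. Real archives may use other encodings or folder layouts, so check a few files by hand first.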

ARTFL: Public Databases expansive collection of French-language resources in the humanities and other fields from the 17th to 20th centuries
All of PLOS More than 200,000 fully Open Access research articles available for text data mining. The corpus of articles and metadata can be accessed via the PLOS API or directly downloaded as a zipped file.
HathiTrust Digital Library provides long-term preservation and access services for public domain and in-copyright content from a variety of sources, including Google, the Internet Archive, Microsoft, and in-house partner institution initiatives.
Internet Archive Books includes plain-text ["full text"] access to 20,000,000 books, issues of magazines, periodicals, etc.
Project Gutenberg out of copyright works that can be downloaded individually as plain text with some limited automated access

Australian National Corpus is a discovery service that collates and provides access to assorted examples of Australian English text, transcriptions, audio and audio-visual materials

English-Corpora.org formerly known as the BYU corpora, created by Mark Davies, Professor of Linguistics at Brigham Young University, and probably the most widely used online corpora. Contains Global Web-Based English (GloWbE), Wikipedia Corpus, Corpus of Contemporary American English (COCA), Corpus of Historical American English (COHA), Hansard Corpus, TIME Magazine Corpus, British National Corpus (BNC), Strathy Corpus (Canada), and many others (no cost for basic access)

Google Books Corpora American, British, Spanish, and other corpora from Google Books

Late 18th-Century Prose about 300,000 words of local English letters on practical subjects, dated 1761-90

Open American National Corpus a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward

Oxford Text Archive large number of texts in 25 languages available in a variety of forms, including plain text; texts are accessed one at a time

Text and Data Mining

Text and data mining refers to the processes by which "text or datasets are crawled by software that recognizes entities, relationships, and action." -- GALE, 2017. It is an important new area for academic researchers, largely because these processes can reveal patterns and trends in large collections and support new conclusions.
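
As a small illustration of what detecting a pattern can mean in practice, the Python sketch below compares word frequencies in two texts and reports the words that are relatively more common in the first. This is a simple frequency comparison, not the entity or relationship recognition described in the definition above, and the function names and the tokenization rule are assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(text):
    """Map each word to its share of all tokens in the text."""
    tokens = tokenize(text)
    total = len(tokens)
    if total == 0:
        return {}
    return {word: n / total for word, n in Counter(tokens).items()}


def distinctive_words(text_a, text_b, top_n=3):
    """Words whose relative frequency in text_a most exceeds that in text_b.

    Ties are broken alphabetically so the result is deterministic.
    """
    freq_a = relative_frequencies(text_a)
    freq_b = relative_frequencies(text_b)
    diffs = {w: f - freq_b.get(w, 0.0) for w, f in freq_a.items()}
    return sorted(diffs, key=lambda w: (-diffs[w], w))[:top_n]


if __name__ == "__main__":
    # Made-up snippets standing in for two documents from a corpus.
    first = "the whale the sea the ship"
    second = "the garden the house the tea"
    print(distinctive_words(first, second))
```

Common function words such as "the" cancel out because they are equally frequent in both texts, which is why only the content words surface. Research projects typically apply more robust measures to much larger samples.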

McGill Library can facilitate access to text corpora for McGill researchers. Assistance can entail helping you locate textual data sources, negotiate access to textual collections for text mining, and, in some cases, purchase or license data. We can also help you find and use tools for managing and analyzing textual data.

Other guides

See the following related guides for:

Network Analysis & Data Visualization for Humanities and Social Sciences
Creating digital books and exhibits as well as interactive maps and timelines: Digital and Multimedia Publishing guide
GIS software & tools: Maps and Geospatial Data guide
Statistical data, software, & visualisation: Numeric Data guide
