Text Data Mining (TDM)

A research guide to help you identify public and licensed text sources for text data mining, as well as tools for text analysis.

Free Digital Text Corpora

Text Creation Partnership's Collections

Created in the course of the Text Creation Partnership project undertaken by the University of Michigan Library, Bodleian Libraries at the University of Oxford, ProQuest, and the Council on Library and Information Resources. The content from Phase I is available for access, distribution, use, or reuse by anyone. Restrictions on the content created during Phase II are scheduled to be removed on or about January 1, 2021.

  • Early English Books Online–TCP The EEBO TCP corpus consists of the works represented in the Early English Books Online collections known as Short Title Catalogues I and II (based on the Pollard & Redgrave and Wing short title catalogs), as well as the Thomason Tracts and the Early English Books Tract Supplement collections. The more than 125,000 volumes cover the period from the first book printed in English in 1475 through to 1700 and include works of literature, philosophy, politics, religion, geography, history, mathematics, music, the practical arts, natural science, etc.
  • Eighteenth-Century Collections Online–TCP Eighteenth Century Collections Online includes significant English-language and foreign-language titles printed in the United Kingdom during the 18th century, and thousands of important works from the Americas. The database contains more than 32 million pages of text and more than 205,000 individual volumes.
  • Evans Early American Imprints–TCP 5,000 accurately keyed and fully searchable SGML/XML text editions from among the 40,000 titles available in the online Evans Early American Imprints collection.
 
McGill's .txtLAB texts
  • Novel450 450 novels in German, French, and English
  • ContemporaryNovels 1,211 contemporary novels published between 2000-2015
  • Academic Publishing - A collection of institutional affiliations for 5,664 academic articles published in four prestigious journals in the humanities
  • LIWC (Linguistic Inquiry and Word Count) tables for 25,000+ fiction and non-fiction documents (the nineteenth-century canon, HathiTrust nineteenth-century documents, the twentieth-century repositories of Gutenberg and Amazon, and multiple contemporary literary genres from mysteries to prizewinners) in two languages (German and English)
  • Race and film Character dialogue from 780 Hollywood movies produced between 1970 and 2014. Characters have been labeled by their racial and ethnic identity using IMDB. 

 

1880s Fiction

Sample corpora assembled from Project Gutenberg by students in Alan Liu's English 197 course (Fall 2014, UC Santa Barbara). They can be particularly useful for assignments and individual students' projects.

 
Miscellanea
  • ARTFL: Public Databases expansive collection of French-language resources in the humanities and other fields from the 17th to 20th centuries

  • All of PLOS More than 200,000 fully Open Access research articles available for text data mining. The corpus of articles and metadata can be accessed via the PLOS API or directly downloaded as a zipped file.

  • HathiTrust Digital Library provides long-term preservation and access services for public domain and in-copyright content from a variety of sources, including Google, the Internet Archive, Microsoft, and in-house partner institution initiatives.

  • Internet Archive Books includes plain-text ["full text"] access to 20,000,000 books, issues of magazines, periodicals, etc.

  • Project Gutenberg out-of-copyright works that can be downloaded individually as plain text, with some limited automated access

  • Australian National Corpus is a discovery service that collates and provides access to assorted examples of Australian English text, transcriptions, audio and audio-visual materials
  • English-Corpora.org formerly known as the BYU corpora, created by Mark Davies, Professor of Linguistics at Brigham Young University, and probably the most widely used collection of online corpora. Contains Global Web-Based English (GloWbE), Wikipedia Corpus, Corpus of Contemporary American English (COCA), Corpus of Historical American English (COHA), Hansard Corpus, TIME Magazine Corpus, British National Corpus (BNC), Strathy Corpus (Canada), and many others (no cost for basic access)
  • Late 18th-Century Prose about 300,000 words of local English letters on practical subjects, dated 1761-90
  • Open American National Corpus a massive electronic collection of American English, including texts of all genres and transcripts of spoken data produced from 1990 onward
  • Oxford Text Archive large number of texts in 25 languages available in a variety of forms, including plain text; texts are accessed one at a time
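
Several of the sources above (for example Project Gutenberg, Internet Archive Books, and the Oxford Text Archive) offer plain text. Once you have such a file, a common first look at it is a word-frequency count. The Python sketch below is only an illustration, not a recommended workflow: the function name `top_words` and the sample sentence are invented for this example, and the tokenizer is a deliberately simple regular expression that recognizes only unaccented Latin letters, so it would need adjusting for French or German texts.

```python
import re
from collections import Counter


def top_words(text: str, n: int = 5) -> list[tuple[str, int]]:
    """Return the n most frequent word tokens in text, lowercased.

    Hypothetical helper for illustration only. A token is a run of
    the letters a-z, optionally followed by an apostrophe and more
    letters (so "don't" stays in one piece).
    """
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens).most_common(n)


if __name__ == "__main__":
    # Made-up sample text; in practice, read a downloaded file instead:
    #   text = open("some_downloaded_book.txt", encoding="utf-8").read()
    sample = "The cat sat on the mat. The mat was flat."
    print(top_words(sample, 2))
```

For real projects, check each source's terms of use before downloading in bulk, and consider a dedicated text-analysis tool for tokenization in languages other than English.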
