The best way to find out if you have access to collections that the library licenses is to contact your liaison librarian or Collection Services with details about what you are looking for. While many of the libraries' databases do not allow text or data mining due to license agreements, we have been working directly with certain vendors to include text and data mining in future agreements, and we can help negotiate access for specific projects. We will do our best to get what you need!
Some databases that McGill currently has text mining rights for include:
If you have a smaller group of material you want to work with, digitizing your own material might make the most sense. If the material is in the Library collection, we can often work with you on the digitization and some of the processing.
Directory of tools used in text analysis and retrieval. Includes reviews of tools and curated lists of the most commonly used ones.
Tool that helps you use the command line to work through common challenges that come up when working with digital primary sources:
- casting: changing one type of data to another (e.g. PDF to TXT for text analysis purposes)
- wrangling: manipulating and navigating data (e.g. removing punctuation, normalizing case)
- getting: grabbing data from various locations (e.g. webscraping all relevant images from portions of a website)
- managing: editing and managing your work with data (e.g. saving command line history)
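As a concrete illustration of the "wrangling" step above (removing punctuation and normalizing case), here is a minimal Python sketch; the function name and approach are illustrative, not part of any particular tool:

```python
import string

def wrangle(text: str) -> list[str]:
    """Normalize case, strip punctuation, and split into word tokens.

    A minimal sketch of common text-wrangling steps; real projects
    may need language-aware tokenization instead.
    """
    lowered = text.lower()  # normalize case
    # remove all ASCII punctuation characters
    cleaned = lowered.translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()  # simple whitespace tokenization

print(wrangle("Hello, World! It's a TEST."))
# ['hello', 'world', 'its', 'a', 'test']
```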
HathiTrust Research Centre
Web-based tools and algorithms enabling computational analysis of public domain works from the HathiTrust digital library. A good place to start if you want to try things out or use it in a class assignment. You need to register for a Research Centre account with your McGill email address.
Easy-to-use web-based program for reading and analyzing digital texts.
A tool designed to collate and compare different versions of digital texts.
A text analysis tool that helps prepare a set of texts for computational stylistics operations, such as word frequency lists and counting functions, with the ultimate purpose of determining the authorship of a disputed literary work or analyzing the style of a work or group of works. Intelligent Archive's functions are primarily statistical.
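The basic statistical operation behind such stylometry tools is comparing relative word frequencies across texts. The sketch below, which is illustrative and not part of Intelligent Archive, computes relative frequencies (per 1,000 tokens) for a small, assumed set of function words:

```python
from collections import Counter

# Illustrative subset of English function words; real stylometric
# studies typically use much larger, carefully chosen word lists.
FUNCTION_WORDS = ["the", "of", "and", "to", "a"]

def relative_frequencies(tokens: list[str], words=FUNCTION_WORDS) -> dict:
    """Relative frequency (per 1,000 tokens) of selected function words."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: 1000 * counts[w] / total for w in words}

sample = "the cat sat on the mat and the dog sat too".split()
print(relative_frequencies(sample))
```

Comparing these frequency profiles between a disputed text and texts of known authorship is one common stylometric approach.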
Citizen science portal that lets you set up projects to crowdsource transcription and annotation, among other things.
Web-based tool used to visualize complex datasets on a map, as a graph, or as a gallery.
Python 3 - Natural Language Toolkit (NLTK)
Platform for building Python programs to work with human language data.
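As a minimal sketch of what working with NLTK looks like (assuming the package is installed, e.g. via `pip install nltk`), here is a simple frequency distribution over a text; whitespace splitting is used instead of NLTK's tokenizers to avoid downloading tokenizer models:

```python
from nltk import FreqDist

# FreqDist counts token occurrences, a common first step in text analysis.
tokens = "the quick brown fox jumps over the lazy dog".lower().split()
fd = FreqDist(tokens)

print(fd.most_common(2))  # most frequent tokens with their counts
print(fd["the"])          # count for a single token
```

From here, NLTK offers tokenizers, stemmers, taggers, and corpora for more substantial language-processing work.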