Text mining for searching and screening the literature

This guide provides an overview of what text mining is and how it is applied in search strategy development and study selection; it includes a list of tools and resources that librarians or other motivated searchers may wish to try.


Active learning: In machine learning, "process of using a classification model's predictions to iteratively select training data."

Automatic term recognition: Term used in natural language processing to describe automatic detection of phrasal terms

Bag of words (BoW): In text mining, the treatment of a document as an unordered collection of single-word tokens, disregarding grammar and word order; closely related to the vector space model (VSM), in which each document is represented as a vector of term counts or weights; still frequently used; an alternative to BoW is natural language processing (NLP), which takes syntax and context into account
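As a minimal sketch of the bag-of-words idea, the Python snippet below counts tokens and shows that two sentences with different meanings produce the same representation once word order is discarded. The function name `bag_of_words` and the whitespace tokenization are illustrative assumptions, not part of any particular tool.

```python
from collections import Counter


def bag_of_words(text):
    """Return token counts for a text, ignoring word order (hypothetical helper)."""
    tokens = text.lower().split()  # naive whitespace tokenization, an assumption
    return Counter(tokens)


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# Both sentences yield identical counts because word order is not stored.
print(a == b)
print(a)
```

Because only counts are kept, any information carried by word order is lost, which is the main limitation noted in the definition above.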

BM25: Term frequency transformation/weighting scheme used in word frequency analysis that captures the diminishing returns of higher term frequencies and so avoids dominance by a single term; often considered a stronger weighting method than TF-IDF
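The diminishing returns mentioned above can be illustrated with the term-frequency part of BM25 alone. This is a simplified sketch: it omits document length normalization and the inverse document frequency factor that a full BM25 score includes, and the value `k1 = 1.2` is an assumed, commonly used default rather than something specified in this guide.

```python
def bm25_tf(tf, k1=1.2):
    """Saturating term-frequency transformation used in BM25.

    Simplified: length normalization and the IDF factor are left out,
    and k1=1.2 is an assumed default.
    """
    return tf * (k1 + 1) / (tf + k1)


# Each extra occurrence adds less than the one before it,
# and the value never exceeds k1 + 1.
for tf in (1, 2, 5, 10, 100):
    print(tf, round(bm25_tf(tf), 3))
```

With raw counts, a term occurring 100 times would weigh 100 times more than a term occurring once; with this transformation the weight levels off below `k1 + 1`.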

C-value: In automatic term recognition, "statistical measure used to evaluate how important a term is in a document or collection of documents" (Ananiadou 2009)

Categorization: Text categorization is the assignment of labels, typically from a predefined set, to a text document; one approach is based on hand coding, another on machine learning (two types of machine learning approaches: classification and clustering)

Classification: Categorization of documents using supervised machine learning; training set and predetermined labels are provided to train a classifier to correctly assign labels to uncategorized documents (Shatkay 2012)

Clustering: Categorization of documents using unsupervised machine learning; goal is to produce clusters of documents that are similar to each other according to some criterion, with different clusters for dissimilar documents (Shatkay 2012)

Collocates: A word’s collocates are words that appear next to or near it (Glanville 2016) 

Concordance: In text mining, concordance tools are used to view words or phrases in context
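A rough sketch of how a concordance (keyword-in-context) view and a collocate count might be produced is shown below. The function names, the sample sentence, and the window size are all illustrative assumptions; dedicated tools handle tokenization and punctuation far more carefully.

```python
from collections import Counter

PUNCT = ".,;:!?"


def concordance(text, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    tokens = text.split()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower().strip(PUNCT) == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, token, right))
    return lines


def collocates(text, keyword, window=2):
    """Count the words that appear within `window` tokens of the keyword."""
    counts = Counter()
    for left, _, right in concordance(text, keyword, window):
        for word in (left + " " + right).split():
            counts[word.lower().strip(PUNCT)] += 1
    return counts


# Made-up sample text for illustration only.
sample = ("Text mining can support screening. "
          "Some tools apply text mining to search strategy development.")

for left, kw, right in concordance(sample, "mining"):
    print(f"{left:>12} [{kw}] {right}")
print(collocates(sample, "mining").most_common(3))
```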

Deep learning: Subcategory of machine learning in which most models are based on artificial neural networks

F-Measure (aka F-Score): Combination/harmonic mean of precision and recall/sensitivity

F1: Popular measure to combine precision and recall, captures tradeoff between them, rewards cases in which precision and recall are fairly similar
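To make the trade-off concrete, the sketch below computes the F-measure from precision and recall. The function name `f_beta` and the example values are illustrative assumptions; with `beta=1` it gives F1, the harmonic mean of the two.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta=1)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Both pairs have the same arithmetic mean (0.5), but F1 rewards the
# balanced pair and penalizes the lopsided one.
print(f_beta(0.5, 0.5))
print(f_beta(0.9, 0.1))
```

Setting `beta` above 1 weights recall more heavily than precision, which may suit screening tasks where missing a relevant study is costly.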

Hasty generalization: Idea that objective search strategies are biased to what is already known and therefore amplify the retrieval of studies similar to what is already known (Hausner 2016), or the bias that can result if a training set is not representative of the population (O'Mara-Eves 2015), aka bias from non-random sampling

Lemmatization: Application of a dictionary that allows a system to consider variations of a term by using the dictionary entries to normalize words by replacing morphological variations with their root (for example, replacing 'gave' and 'give' with 'give'); more sophisticated than stemming but addresses the same issue (Welbers 2016)

Machine learning: Subfield of artificial intelligence that uses statistical techniques to give computers the ability to learn from data without being explicitly programmed to do so (Wikipedia); can be supervised or unsupervised

MetaMap: Maps text to the UMLS Metathesaurus; one of the foundations of NLM's Medical Text Indexer

MetaMap Indexing: Method that applies a ranking function to concepts found by MetaMap

MTI: Medical Text Indexer, the automated and semi-automated indexing system used by the National Library of Medicine as part of its Indexing Initiative. Uses MetaMap and MetaMap Indexing for multi-label text categorization

Multi-label classification: Techniques for assigning multiple labels to an instance (for example, to automatically assign MeSH terms to bibliographic records)

N-gram: In text mining, contiguous sequences of n words, where n is specified by the user. For example, the bi-grams (two-word terms) derived from the phrase "I love my dog" are "I love", "love my", "my dog". The order of words is maintained
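The example in this entry can be reproduced with a few lines of Python. This is a minimal sketch that assumes simple whitespace tokenization; the function name `ngrams` is illustrative.

```python
def ngrams(text, n):
    """Return the contiguous n-word sequences in a text, in order."""
    tokens = text.split()  # naive whitespace tokenization, an assumption
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngrams("I love my dog", 2))  # ['I love', 'love my', 'my dog']
print(ngrams("I love my dog", 3))  # ['I love my', 'love my dog']
```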

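Part-of-speech (POS) tagging: Process of marking up each word in a text with its part of speech (e.g., noun, verb, adjective), based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase/sentence/paragraph (Wikipedia). Used in natural language processing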

Precision: In search strategy development, refers to the number of included studies retrieved (true positives) divided by the total number of studies retrieved (sum of true positives and false positives); also known as positive predictive value in diagnostic testing

Qualitative data analysis software: Allows you to manually code and annotate documents (e.g., QDA Miner, NVivo, Atlas.ti)

Sensitivity: Called recall in computer science and information studies, but sensitivity is the preferred term in medical librarianship given its use in diagnostics in medicine; refers to the number of relevant cases/records retrieved by a search engine or diagnostic tool (true positives) divided by the total number of relevant cases/records (the sum of true positives and false negatives); requires a denominator, which in search filter development is often built by handsearching a set of results from a pre-selected list of journals and years and then classifying them manually as relevant or non-relevant. This reference set may also be referred to as the quasi-gold standard or gold standard.

Stemming: Technique used to reduce words to their root form by removing their endings (comparable to truncation in database searching, e.g., searching for hospital* to retrieve records containing the words hospital, hospitalized, hospitalised, hospitals, etc.)
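The hospital* example can be sketched as simple prefix matching. Note that this illustrates truncation-style matching only; it is not a linguistic stemmer such as the Porter stemmer, and the function name and word list are made up for illustration.

```python
def truncation_match(pattern, words):
    """Return the words that begin with the truncated term (e.g. 'hospital*')."""
    stem = pattern.rstrip("*").lower()
    return [w for w in words if w.lower().startswith(stem)]


# Made-up word list for illustration only.
words = ["hospital", "hospitalized", "hospitalised", "hospitals",
         "hospice", "Hospitality"]

print(truncation_match("hospital*", words))
```

The match on "Hospitality" shows how reducing words to a shared root can also retrieve unrelated terms, one reason lemmatization is described above as the more sophisticated option.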

Specificity: In search filter development and diagnostics, refers to the proportion of non-relevant records/cases correctly identified as such (true negatives divided by the sum of true negatives and false positives); the more false positives, the worse both specificity and precision are, but these two measures are calculated differently
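Precision, sensitivity, and specificity as defined in this glossary can all be computed from the same four counts. The sketch below uses made-up numbers for a hypothetical search tested against a gold standard of 10,000 records; the function name is illustrative, and it does not guard against zero denominators.

```python
def screening_metrics(tp, fp, fn, tn):
    """Compute precision, sensitivity, and specificity from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),    # relevant retrieved / all retrieved
        "sensitivity": tp / (tp + fn),  # relevant retrieved / all relevant
        "specificity": tn / (tn + fp),  # non-relevant not retrieved / all non-relevant
    }


# Hypothetical counts: 100 relevant records among 10,000; the search
# retrieves 1,000 records, 90 of which are relevant.
metrics = screening_metrics(tp=90, fp=910, fn=10, tn=8990)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case sensitivity and specificity are both about 0.9 while precision is only 0.09, which illustrates why specificity and precision are calculated differently even though both worsen as false positives increase.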

Support vector machine (SVM): Machine learning algorithm used for classification tasks

Text mining: Sometimes used interchangeably with text analytics; involves the process of analyzing unstructured text data to identify actionable insights (Kwartler 2017). Text mining often involves data preparation/preprocessing/cleaning to differing degrees depending on the tool being used

Textual mining software: Supports quantitative analysis of unstructured text; generally supports preprocessing and data cleaning (e.g., quanteda, tm package in R, WordStat)

TF-IDF (Term Frequency-Inverse Document Frequency): Weighting scheme used in word frequency analysis for storing text as weighted vectors; words with high frequency receive high weight unless they also have a high document frequency (e.g., stop words); "for high document frequency words, the competing effects cancel each other and give the word a low weight"
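A small sketch of this weighting is given below. It assumes one common variant (raw term count multiplied by the unsmoothed logarithm of the inverse document frequency); libraries differ in the exact formula and smoothing they apply. The function name and the three short documents are invented for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Made-up documents for illustration only.
docs = [
    "the trial of the vaccine",
    "the cohort study",
    "the vaccine study of the cohort",
]

for term, weight in sorted(tf_idf(docs)[0].items()):
    print(f"{term}: {weight:.3f}")
```

Here "the" is frequent in the first document but appears in every document, so its weight is zero, while "trial" appears in only one document and receives the highest weight.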

Liaison Librarian

Genevieve Gore
Liaison Librarian, Schulich Library of Physical Sciences, Life Sciences, and Engineering
On sabbatical from Jan 1 to Jun 30, 2021

McGill Library