Guides: Research Data Management: Data Sharing and Publishing

Active Data Storage vs. Research Data Repository

Options for storing or sharing data during the active phase of a project differ from repositories designed for the long-term preservation and accessibility of research data.

Active data storage is where you store and share data with collaborators or partners during the active phase of your project. These options serve as temporary solutions and are not intended for permanent or long-term storage and sharing. Examples include McGill-licensed Microsoft SharePoint or Microsoft Teams.

Data repositories are used for the long-term archiving and access of research data. Once data is deposited/published, it will remain accessible even after the project concludes. Examples of research data repositories include the McGill Dataverse and the Federated Research Data Repository.

McGill University Dataverse

The McGill Libraries offer an institutional data repository, the McGill University Dataverse, for research data publishing and archiving. McGill faculty, students, and staff are welcome to deposit datasets in the McGill Dataverse repository. All data are stored securely on servers located in Canada. Data can be publicly accessible, available to specific individuals, or private/restricted.

The McGill Dataverse can be found by following this URL: https://borealisdata.ca/dataverse/mcgill

Here are the instructions for creating a draft and submitting the dataset for publication (datasets will be reviewed by the McGill Libraries RDM Specialist before publication):

A few notes:

Any data format can be deposited in the McGill Dataverse collection, but there is a limit on file size (each file must be 5 GB or smaller).
Once a dataset is published, it cannot be unpublished - this action is irreversible.
The default license is CC0 (public domain), meaning you give up all copyright. If you want a different license, for example CC BY, make sure to change this when uploading the dataset draft.
Sensitive data cannot be published in the McGill Dataverse. If your dataset contains information collected from anonymized human participants, contact the RDM specialist (rdm.library@mcgill.ca) with the consent form. The RDM specialist reviews all consent forms prior to granting permissions for dataset deposit.
When you upload the dataset, it is saved as a draft. To publish it, submit the dataset for review; it will be published if no information is missing.
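
Before uploading, it can help to check a dataset folder against the file size limit mentioned in the notes above. The following is a minimal sketch, not an official McGill or Dataverse tool: the function name find_oversized_files is hypothetical, and reading "5 GB" as decimal gigabytes is an assumption.

```python
from pathlib import Path

# Per-file limit from the notes above; treating 1 GB as 1000**3 bytes is an assumption.
MAX_FILE_BYTES = 5 * 1000**3


def find_oversized_files(folder, limit=MAX_FILE_BYTES):
    """Return (path, size_in_bytes) pairs for files under `folder` larger than `limit`."""
    oversized = []
    # rglob("*") walks the folder and all of its subfolders.
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                oversized.append((path, size))
    return oversized
```

Calling find_oversized_files("my_dataset") on a local folder (the folder name is only an example) returns an empty list when every file is within the limit.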

How to deposit:

Create a Dataverse account by logging in: go to the log-in page and select McGill University from the drop-down under "Your Institution". You will either be logged in automatically or be prompted to log in via McGill single sign-on: https://borealisdata.ca/dataverse/mcgill
When you’re logged in, go to the main McGill Dataverse page and you should see an Add Data button (https://borealisdata.ca/dataverse/mcgill). You can create a draft by clicking on that button and filling out the information/uploading files.
Provide a descriptive title for the dataset and enough information in the description for other users to understand where the information comes from, how it was collected, etc.

For training on using Dataverse, please see this series of self-paced online modules: Dataverse 101: A Portage Training Module Series

Additional Data Repositories

A wide variety of additional data repositories and databases are available that archive research data from many subject areas. Coverage varies by discipline.

McGill researchers who wish to look for a domain-specific data repository are encouraged to start with Re3data.org, which provides a comprehensive listing of disciplinary and institutional repositories to host and share research data.

Other places to find lists of data repositories include:

Nature - Recommended data repositories
Open Access Directory (OAD) - List of repositories and databases for open data
PLOS One - Recommended data repositories

The following list names a few reputable general data repositories:

Figshare - a general purpose repository often used in partnership with PLOS publications
Dryad - frequently used for scientific and medical publications
Zenodo - a general purpose repository that integrates with GitHub for archiving and minting DOIs for GitHub repos
ICPSR - a repository commonly used for social science data
Qualitative Data Repository - for qualitative data, typically used for digital humanities and social sciences
FRDR - The Federated Research Data Repository is a Canadian solution for archiving large/big data

Open Research Principles

Open data/research is the practice of research in such a way that others can collaborate and contribute, where research data, lab notes and other research processes are freely available, under terms that enable reuse, redistribution and reproduction of the research and its underlying data and methods. Open research data is data that can be freely accessed, reused, remixed and redistributed, for academic research and teaching purposes and beyond. Ideally, open data have no restrictions on reuse or redistribution, and are appropriately licensed as such. Openly sharing data exposes it to inspection, forming the basis for research verification and reproducibility, and opens up a pathway to wider collaboration.

However, there are also special considerations: not all data can or should be open. For example, to maintain Indigenous Knowledge sovereignty and Indigenous Data sovereignty (see CARE principles below on this page), or to protect the identity of human subjects, limited restrictions on access may be implemented.

Read more about Open research: https://book.fosteropenscience.eu/ (CC-0)

FAIR Data Principles

Since the publication in 2016 of "FAIR Guiding Principles for scientific data management and stewardship" in Scientific Data, the best practice for managing data is to adhere to the FAIR principles. The FAIR principles are a framework for ensuring that data collected by researchers across all disciplines and fields meet specific standards that promote open science and reproducibility of research, and that maximize the benefits of research to academia and society.

The following description of the FAIR principles is taken directly from https://www.go-fair.org/fair-principles/

Findability:

The first step in (re)using data is to find them. Metadata (the description of the data) and data should be easy to find for both humans and computers. This means assigning a persistent identifier (PID) to the data/dataset (usually in the form of a digital object identifier, or DOI). Identifiers consist of an internet link (e.g., a URL that resolves to a web page where the data are located). Identifiers will help others to properly cite your work when reusing your data.
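
As a small illustration of how a DOI works as an internet link, the sketch below turns a DOI into a URL using the public resolver at https://doi.org/. The function name doi_to_url and the loose validation pattern are assumptions made for this example; the pattern is not an official DOI specification check.

```python
import re

DOI_RESOLVER = "https://doi.org/"

# Loose pattern (an assumption): "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes people commonly paste along with a DOI.
KNOWN_PREFIXES = (
    "doi:",
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def doi_to_url(doi):
    """Turn a DOI, with or without a 'doi:' or resolver prefix, into a resolvable link."""
    cleaned = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    if not DOI_PATTERN.match(cleaned):
        raise ValueError(f"Not a recognizable DOI: {doi!r}")
    return DOI_RESOLVER + cleaned
```

For example, doi_to_url("doi:10.1038/sdata.2016.18") returns "https://doi.org/10.1038/sdata.2016.18", the link for the 2016 FAIR principles article in Scientific Data.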

Accessibility:

Once the user finds the required data, they need to know how the data can be accessed, possibly including authentication and authorisation. This does not necessarily mean that data should be open. There are many reasons to restrict access to data (e.g. the data contain personally identifiable information (PII), are proprietary/licensed as intellectual property (IP), or contain other sensitive information). Accessibility essentially means that it should be clear under what conditions access is allowed. The rule with accessibility can be distilled to: "As Open as Possible, as Closed as Necessary."

Interoperability:

Interoperability refers to the ease with which data can be integrated with other/new data. In practice, storing data in open formats makes it easier to later integrate new data. On the other hand, storing data in proprietary formats hinders this effort. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing. This means that, when possible, it's best practice to use standardized vocabularies/variable labels/terms.
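
One way to picture standardized variable labels is a script that renames ad hoc column headers to agreed names and saves the result as plain UTF-8 CSV, an open format. This is only a sketch: the mapping STANDARD_NAMES and the function standardize_csv are hypothetical, and a real project would take its variable names from a discipline-specific vocabulary.

```python
import csv

# Hypothetical mapping from ad hoc column headers to a project's agreed variable names.
STANDARD_NAMES = {
    "Temp (C)": "temperature_celsius",
    "temp": "temperature_celsius",
    "Site": "site_id",
    "date collected": "collection_date",
}


def standardize_csv(source_path, target_path, mapping=STANDARD_NAMES):
    """Copy a CSV file, renaming mapped column headers; returns the new header row."""
    with open(source_path, newline="", encoding="utf-8") as src:
        rows = list(csv.reader(src))
    if not rows:
        raise ValueError("source file is empty")
    # Headers without an entry in the mapping are kept, with surrounding spaces removed.
    header = [mapping.get(name.strip(), name.strip()) for name in rows[0]]
    with open(target_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows[1:])
    return header
```

The data rows are copied unchanged; only the header row is rewritten.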

Reusability:

The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings. In practice, this involves creating a README file with details on how to clean, transform, or manage the data, if applicable. This also involves applying a license to let others know if the data are public domain or if copyright is retained to some degree or completely.
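
A README file can be started from a simple template. The sketch below writes a README.txt into a dataset folder and lists the files it finds there. The template fields and the function name write_readme are assumptions for illustration, not a required standard; adapt them to your discipline and repository.

```python
from pathlib import Path

# Illustrative template; the field names are assumptions, not a required standard.
README_TEMPLATE = """\
Title: {title}
Creator(s): {creators}
Date of data collection: {collection_dates}
License: {license}

Description
{description}

Files
{file_list}

Processing steps (cleaning, transformation)
{processing_steps}
"""


def write_readme(folder, **fields):
    """Write README.txt in `folder`, listing the data files found there."""
    folder = Path(folder)
    names = sorted(
        p.name for p in folder.iterdir() if p.is_file() and p.name != "README.txt"
    )
    # Use the folder listing unless the caller supplies their own file_list.
    fields.setdefault("file_list", "\n".join(f"- {n}" for n in names) or "- (none)")
    # A missing field raises KeyError, so an incomplete README is not written silently.
    text = README_TEMPLATE.format(**fields)
    target = folder / "README.txt"
    target.write_text(text, encoding="utf-8")
    return target
```

Every field in the template except file_list must be passed as a keyword argument, for example title="...", creators="...", and license="CC BY".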

Persistent Identifiers (PIDs) for Data

Persistent identifiers are:

A publishing initiative
A permanent link that points to your data, making the data findable (e.g. DOI)
Your data might move locations (URLs, repositories, etc), or the way we access the internet might change, but the DOI will always be the same
Machine-readable

Persistent identifiers allow for:

Archiving and preserving data (digitally) for the long-term
Linking a data article to a published study (normalizing best practices)
Connecting the PID of a dataset with the DOI of a journal article, and then with subsequent studies that reuse the data, potentially leading to higher citation impact (up to 25%)

Licenses and Terms

For a general overview on copyright issues related to data, please see this Guide to Licensing Open Data from the Open Knowledge Foundation.

The following are typical Creative Commons license templates that are applied to data:

CC0 (Public domain: you unambiguously waive all copyright control over your data in all jurisdictions worldwide. Data released with CC0 can be freely copied, modified, and distributed, even for commercial purposes, without violating copyright.) This is the default license in Dataverse, as one goal of the project is to promote open science best practices.
CC BY (This license lets others distribute, remix, adapt, and build upon your work, even commercially, as long as they credit you for the original creation.)
CC BY-NC (This license lets others remix, adapt, and build upon your work non-commercially, and although their new works must also acknowledge you and be non-commercial, they don’t have to license their derivative works on the same terms.)
CC BY-SA (This license lets others remix, adapt, and build upon your work even for commercial purposes, as long as they credit you and license their new creations under the identical terms)
CC BY-NC-SA (This license lets others remix, adapt, and build upon your work non-commercially, as long as they credit you and license their new creations under the identical terms.)

Not sure what license to select? Creative Commons has a neat tool to help.
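
The license descriptions above can also be read as a small decision table. The following sketch encodes only the five licenses listed in this section; the function name suggest_cc_license and its parameters are hypothetical, and it is not a substitute for the Creative Commons tool or for legal advice.

```python
def suggest_cc_license(require_attribution, allow_commercial=True, share_alike=False):
    """Map three yes/no choices onto the Creative Commons options listed above."""
    if not require_attribution:
        # CC0 waives all conditions, so it cannot be combined with NC or SA terms.
        if not allow_commercial or share_alike:
            raise ValueError("NC and SA conditions are only listed with attribution (BY)")
        return "CC0"
    code = "CC BY"
    if not allow_commercial:
        code += "-NC"
    if share_alike:
        code += "-SA"
    return code
```

For example, suggest_cc_license(True, allow_commercial=False) returns "CC BY-NC", and suggest_cc_license(False) returns "CC0".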

Data Journals

Data journals publish data articles, which are mini-publications about a dataset or database. Similar to the peer review process for the write-up of a journal article or study, the data are peer reviewed (for an example of peer review guidelines for data articles, see the Earth System Science Data journal guide). Data articles can be about data that underlie existing publications or they can be independent publications. Publishing data as their own research product allows you to cite the data easily in subsequent publications, link the data to publications, and potentially receive credit for the data itself in addition to any related studies.

Types of data publications:

Data articles
Data papers
Data notes
Data descriptors

What data can you publish?

Data underlying or linked to another study
Orphan data, dark data, null results
Updating an existing database or creating a database as a resource
Pilot studies/preliminary results
Reporting additional controls
Descriptions of data

A (slightly outdated but still accurate) list of data journals

Additional information on data journals:

"How to publish your data in a data journal" (from the NatureJobs job blog)
"Data papers as a new form of knowledge organization in the field of research data" (Schöpfel et al 2019)
"Data journals: incentivizing data access and documentation within the scholarly communication system" (Walters 2020)

Librarian

Alisa Rod

Contact:

McGill University Libraries
550 Sherbrooke West, West Tower
Montreal, QC H3A 1B9

Research Data Management Specialist