Documentation

Overview

Teaching: 10 min
Exercises: 15 min

Questions

Why should I invest time in good documentation?

How does my target audience influence my documentation strategy?

What are some published examples of good documentation?

Objectives

Describe how documentation is useful to yourself and to others

Evaluate and rank the quality of comments in published notebooks

Evaluate and rank the quality of existing metadata records.

Describe types of metadata directly relevant for research reproducibility.

Overview

Documenting your process, especially as it concerns your data, is a key element of making your research more reproducible. If you do not thoroughly record all the data manipulation steps you used to process data, it will likely be impossible for you, or anyone else, to repeat the analysis in the future (Wilson et al. 2016). Using the Jupyter Notebook for scripting your data processing is powerful because it saves the code (the what) and, interspersed with it, the motivation behind each step (the why).

There is also project-level documentation, which is needed not to understand a particular series of data processing steps but to understand the organization of the project as a whole. Finally, documentation can be used to aid discoverability.

"If you want to slow down your competitors, give them all your data" @ctitusbrown #openscience #titusbuzz
— kelsey wood (@klsywd) January 15, 2016

In this lesson, we will discuss the types and styles for documentation, their utility, and how you might tailor them for different audiences.

Learning objectives

Describe how documentation is useful to yourself and to others
Evaluate and rank the quality of comments in published notebooks
Evaluate and rank the quality of existing metadata records.
Describe the types of and importance of record level metadata.
Describe types of metadata directly relevant for research reproducibility.

Documentation best practices

Consider the target audiences

For yourself: You are your most important collaborator. Remember that your past self doesn’t answer email. Treat your documentation like a digital notebook detailing what you’ve done, why you did it, what worked, and what didn’t work.
For your peers: One step before sharing your notebook with others is to restart and rerun all the cells so that the entire workflow is reproduced in order (e.g., In [1]: is followed by In [2]:). Intersperse the code cells with text cells (in Markdown) documenting the rationale for the workflow and interpreting the results so that a reader understands the context.
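
The "restart and rerun" check described above can also be automated. The following is a minimal sketch, assuming the notebook is stored in the standard .ipynb JSON layout, where each code cell carries an execution_count field. The function names and the example notebook are hypothetical and serve only as illustration.

```python
import json


def load_notebook(path):
    """Read an .ipynb file, which is plain JSON, into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def execution_counts(notebook):
    """Return the execution counts of the non-empty code cells, in document order."""
    counts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        # The source may be stored as one string or as a list of lines.
        if not "".join(cell.get("source", [])).strip():
            continue  # empty code cells never receive a count
        counts.append(cell.get("execution_count"))
    return counts


def ran_in_order(notebook):
    """True if the code cells are numbered 1, 2, 3, ... with no gaps."""
    counts = execution_counts(notebook)
    return counts == list(range(1, len(counts) + 1))


# A tiny in-memory notebook standing in for a real file (hypothetical content).
example = {
    "cells": [
        {"cell_type": "markdown", "source": ["Why we filter the data"]},
        {"cell_type": "code", "execution_count": 1, "source": ["x = 1"]},
        {"cell_type": "code", "execution_count": 2, "source": ["x + 1"]},
    ]
}
print(ran_in_order(example))  # True
```

For a notebook on disk, ran_in_order(load_notebook("analysis.ipynb")) would apply the same check, where the file name is a placeholder. A result of False suggests the cells were run out of order or not at all, and that the notebook should be restarted and rerun before sharing.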

README file

It is important to write a brief overview of your project. A README file is a short file (think 1-pager) in the project’s home directory, and typically is the main entry point for readers to the project, including in particular the code. It should thus answer questions others will commonly have when they come upon the project, including the following:

the purpose of the project, such as which problem it tries to solve and what its scope is
how suitable the project is for reuse, such as what stage of maturity it is in
prerequisites and other dependencies, and how to satisfy or obtain them
where and how to start using it
how to cite it, and/or the terms of reuse
whether contributions are welcome, and if so, how best to make them
whom to contact with questions, and how

A README should be written in plain text, with markup that is easy to read (such as Markdown, Reitz 2016).

Based on the above, a README file should include the following items:

the project’s title
a brief description
a purpose statement
up-to-date contact information
a brief tutorial or how-to
any relevant weblinks
how to cite the project, and its license and/or terms of reuse
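
One way to make sure none of these items is forgotten is to generate a README skeleton and fill it in. The sketch below is only an illustration: the section headings mirror the list above, but the names README_SECTIONS, render_readme, and missing_sections, as well as the example values, are assumptions and not part of any standard.

```python
# Section keys and headings mirroring the checklist above (hypothetical names).
README_SECTIONS = [
    ("description", "Description"),
    ("purpose", "Purpose"),
    ("getting_started", "Getting started"),
    ("links", "Links"),
    ("citation", "How to cite"),
    ("license", "License and terms of reuse"),
    ("contact", "Contact"),
]


def render_readme(title, **sections):
    """Build README text in Markdown; sections left unfilled are marked TODO."""
    lines = [f"# {title}", ""]
    for key, heading in README_SECTIONS:
        lines.append(f"## {heading}")
        lines.append("")
        lines.append(sections.get(key) or "TODO")
        lines.append("")
    return "\n".join(lines)


def missing_sections(**sections):
    """List the headings that have no content yet."""
    return [heading for key, heading in README_SECTIONS if not sections.get(key)]


# Placeholder content for illustration only.
filled = {
    "description": "A placeholder description of the project.",
    "contact": "name@example.org",
}
readme_text = render_readme("Example project", **filled)
print(readme_text)
print("Still to write:", missing_sections(**filled))
```

Writing the returned text to a file named README.md in the project's home directory gives a starting point in which the remaining TODO entries show what is still undocumented.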

Exercise 1

Compare and contrast different research product archives for the quality and value of their documentation, and their corresponding utility for reuse.

MS Salmanpour. (2016). Data set [Data set]. Zenodo. https://doi.org/10.5281/zenodo.193025

Solange Duruz. (2016). Simulated breed for GENMON [Data set]. Zenodo. https://doi.org/10.5281/zenodo.220887

Zichen Wang, Avi Ma’ayan. Zika-RNAseq-Pipeline v0.1. Zenodo; 2016. https://doi.org/10.5281/zenodo.56311

Metadata quality: Good - Better - Best

Metadata are the contextual information required to interpret data (Fig 1) and should be clearly defined and tightly integrated with the data. The importance of metadata for context, reusability, and discovery has been written about at length in guides for data management best practices (Hart et al. Ten Simple Rules for Digital Data Storage. PLoS Comput Biol. 2016;12: e1005097).

Metadata include information about data points, observations (rows, columns), samples, etc. There are also record-level metadata (metadata of research inputs and products as records), typically including the following:

Title
Authors
Description
Keywords

Good metadata are important for reproducible research because they describe the data at various levels, including measurement protocols, observations, and versions of software and other tools, and thus provide the context for interpreting the data, analysis, and results.

Metadata also aid discovery.
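
The record-level fields and the software versions mentioned above can be captured together in a small, machine-readable record. The following is a rough sketch using only the Python standard library. The field layout and the function names are assumptions chosen for illustration; they do not follow Zenodo's or any other repository's schema, and the example values are placeholders.

```python
import json
import platform
from importlib import metadata

# Record-level fields named in this lesson.
RECORD_FIELDS = ("title", "authors", "description", "keywords")


def package_versions(names):
    """Map each package name to its installed version, or None if not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def build_record(title, authors, description, keywords, packages=()):
    """Combine record-level metadata with the software context of the analysis."""
    return {
        "title": title,
        "authors": list(authors),
        "description": description,
        "keywords": list(keywords),
        "software": {
            "python": platform.python_version(),
            "platform": platform.platform(),
            "packages": package_versions(packages),
        },
    }


def missing_fields(record):
    """Return the record-level fields that are absent or empty."""
    return [field for field in RECORD_FIELDS if not record.get(field)]


# Placeholder values for illustration only.
record = build_record(
    title="Example data set",
    authors=["A. Researcher"],
    description="Placeholder description of what the data contain.",
    keywords=["example", "reproducibility"],
    packages=["pip"],
)
print(json.dumps(record, indent=2))
print("Missing fields:", missing_fields(record))
```

Saving such a record next to the data, for instance as a JSON file, keeps the context with the data, and the missing_fields check gives a quick signal of whether a record sits closer to "good" or to "best".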

Exercise 2

This is a continuation of Exercise 1. Rank the following Zenodo records from 1 (most helpful/informative) to 3 (least helpful/informative) for metadata quality.

MS Salmanpour. (2016). Data set [Data set]. Zenodo. https://doi.org/10.5281/zenodo.193025

Solange Duruz. (2016). Simulated breed for GENMON [Data set]. Zenodo. https://doi.org/10.5281/zenodo.220887

Zichen Wang, Avi Ma’ayan. Zika-RNAseq-Pipeline v0.1. Zenodo; 2016. https://doi.org/10.5281/zenodo.56311

Discuss the following questions:

What were the criteria that you used to rank?

What was missing?

What was the most helpful?

What was the most critical piece of information?

Examples for learning what’s possible

Gallery of IPython Notebooks that have been used for scientific research and educational tutorials. Browse the topics that pique your curiosity.

Wang and Ma’ayan’s Zika manuscript

The Python for Bioinformatics textbook, a collection of notebooks

Key Points

Your code tells what you did. Your documentation tells why you did it and why it is important.

Documentation is the key to communicating your workflow and findings with your future self, collaborators, peers, and the general public.

Jupyter Notebooks are powerful because they allow documenting the what (the code) and the why (the motivation and/or interpretation) interspersed with each other.

Good, better, best: Some metadata are already much better than none, and more metadata make better metadata.
