Documentation
Overview
Teaching: 10 min
Exercises: 15 min
Questions
Why should I invest time in good documentation?
How does my target audience influence my documentation strategy?
What are some published examples of good documentation?
Objectives
Describe how documentation is useful to yourself and to others
Evaluate and rank the quality of comments in published notebooks
Evaluate and rank the quality of existing metadata records.
Describe types of metadata directly relevant for research reproducibility.
Overview
Documenting your process, especially as it concerns your data, is a key element of making your research more reproducible. If you do not thoroughly record all the data manipulation steps you used to process data, it will likely be impossible for you, or anyone else, to repeat the analysis in the future (Wilson et al. 2016). Using the Jupyter Notebook for scripting your data processing is powerful because it saves the code (the what) and, interspersed with it, the motivations behind each step (the why).
There is also project-level documentation, which is needed not to understand a particular series of data processing steps but to understand the organization of the project as a whole. Finally, documentation can be used to aid discoverability.
"If you want to slow down your competitors, give them all your data" @ctitusbrown #openscience #titusbuzz
— kelsey wood (@klsywd) January 15, 2016
In this lesson, we will discuss the types and styles of documentation, their utility, and how you might tailor them for different audiences.
Learning objectives
- Describe how documentation is useful to yourself and to others
- Evaluate and rank the quality of comments in published notebooks
- Evaluate and rank the quality of existing metadata records.
- Describe the types of and importance of record level metadata.
- Describe types of metadata directly relevant for research reproducibility.
Documentation best practices
Consider the target audiences
- For yourself: You are your most important collaborator. Remember that your past self doesn’t answer email. Treat your documentation like a digital notebook detailing what you’ve done, why you did it, what worked and what didn’t work.
- For your peers: One step before sharing your notebook with others is to restart and rerun all the cells so that the entire workflow is reproduced in order (e.g., In [1]: is followed by In [2]:). Intersperse the code cells with text cells (in Markdown) documenting the rationale for the workflow and interpreting the results so that a reader understands the context.
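The "restart and rerun" check can also be automated. The following is a minimal sketch, assuming the notebook is stored in the standard JSON-based .ipynb format, where each code cell carries an execution_count field. The function names and the two tiny in-memory notebooks are hypothetical and used only for illustration.

```python
import json


def load_notebook(path):
    """Read an .ipynb file, which is plain JSON, into a dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def executed_in_order(notebook):
    """Return True if the non-empty code cells are numbered 1, 2, 3, ... top to bottom."""
    counts = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # Markdown cells have no execution count
        if not "".join(cell.get("source", "")).strip():
            continue  # empty code cells are never numbered
        counts.append(cell.get("execution_count"))
    return counts == list(range(1, len(counts) + 1))


# Hypothetical stand-ins for real notebooks.
fresh = {"cells": [
    {"cell_type": "markdown", "source": ["Why we filter the data"]},
    {"cell_type": "code", "execution_count": 1, "source": ["x = 1"]},
    {"cell_type": "code", "execution_count": 2, "source": ["x + 1"]},
]}
stale = {"cells": [
    {"cell_type": "code", "execution_count": 5, "source": ["x = 1"]},
    {"cell_type": "code", "execution_count": 3, "source": ["x + 1"]},
]}

print(executed_in_order(fresh))
print(executed_in_order(stale))
```

For a notebook on disk, executed_in_order(load_notebook(path)) performs the same check. A passing result only shows that the cells were last run top to bottom; it does not replace actually restarting the kernel and rerunning everything.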
README file
It is important to write a brief overview of your project. A README file is a short file (think 1-pager) in the project’s home directory, and typically is the main entry point for readers to the project, including in particular the code. It should thus answer questions others will commonly have when they come upon the project, including the following:
- the purpose of the project, such as which problem it tries to solve and what its scope is
- how suitable the project is for reuse, such as what stage of maturity it is in
- prerequisites and other dependencies, and how to satisfy or obtain them
- where and how to start using it
- how to cite it and/or its terms of reuse
- whether contributions are welcome, and if so, how best to make them
- whom to contact with questions, and how
A README should be written in plain text, with markup that is easy to read (such as Markdown; Reitz 2016).
Based on the above, items to include in a README file include the following:
- the project’s title
- a brief description
- a purpose statement
- up-to-date contact information
- a brief tutorial or how-to
- any relevant weblinks
- how to cite the project, plus its license and/or terms of reuse
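One way to make sure none of these items is forgotten is to generate the README from a template. The following is a rough sketch only: the section names mirror the list above, while the function names, the "TODO" placeholder, and the example project are hypothetical.

```python
# Section headings derived from the checklist above (the title is handled separately).
README_SECTIONS = [
    "Description",
    "Purpose",
    "Contact",
    "Getting started",
    "Links",
    "Citation and license",
]


def missing_sections(sections):
    """List the checklist sections that are absent or blank."""
    return [name for name in README_SECTIONS if not sections.get(name, "").strip()]


def render_readme(title, sections):
    """Build a Markdown README, marking unfinished sections with TODO."""
    lines = [f"# {title}", ""]
    for name in README_SECTIONS:
        body = sections.get(name, "").strip() or "TODO"
        lines += [f"## {name}", "", body, ""]
    return "\n".join(lines)


# Hypothetical project used only to illustrate the helpers.
draft = {
    "Description": "Scripts and notebooks for cleaning survey data.",
    "Purpose": "Make the published analysis repeatable from the raw files.",
    "Contact": "Open an issue in the project repository.",
}

print(missing_sections(draft))
print(render_readme("Survey cleaning", draft))
```

Writing the result to a file named README.md in the project's home directory gives readers the entry point described above, and the list returned by missing_sections shows what still needs to be written.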
Exercise 1
Compare and contrast different research product archives for the quality and value of their documentation, and their corresponding utility for reuse.
- MS Salmanpour. (2016). Data set [Data set]. Zenodo. https://doi.org/10.5281/zenodo.193025
- Solange Duruz. (2016). Simulated breed for GENMON [Data set]. Zenodo. https://doi.org/10.5281/zenodo.220887
- Zichen Wang, Avi Ma’ayan. Zika-RNAseq-Pipeline v0.1. Zenodo; 2016. https://doi.org/10.5281/zenodo.56311
Metadata quality: Good - Better - Best
Metadata are the contextual information required to interpret data (Fig 1) and should be clearly defined and tightly integrated with the data. The importance of metadata for context, reusability, and discovery has been written about at length in guides for data management best practices (Hart et al. Ten Simple Rules for Digital Data Storage. PLoS Comput Biol. 2016;12: e1005097).
Metadata include information about data points, observations (rows, columns), samples, etc. There are also record-level metadata (metadata of research inputs and products as records), typically including the following:
- Title
- Authors
- Description
- Keywords
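As a small illustration, a record with these four fields can be represented as a dictionary and checked for gaps before it is deposited. This is a sketch under stated assumptions: the field names follow the list above rather than any particular repository's schema, and the example record and the helper name are hypothetical.

```python
import json

# The four record-level fields listed above, as lowercase keys (an assumption).
REQUIRED_FIELDS = ("title", "authors", "description", "keywords")


def empty_fields(record):
    """Return the required fields that are missing or left empty."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Hypothetical record with a vague title and two empty fields.
record = {
    "title": "Data set",
    "authors": ["A. Researcher"],
    "description": "",
    "keywords": [],
}

print(empty_fields(record))
print(json.dumps(record, indent=2))
```

A check like this only catches empty fields. Whether a title such as "Data set" actually tells a reader anything still needs human judgment.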
Good metadata are important for reproducible research because they describe the data at various levels, including measurement protocols, observations, and versions of software and other tools, and thus provide the context for interpreting the data, analysis, and results.
Metadata also aid discovery.
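Software versions are among the easiest reproducibility metadata to capture automatically. The following sketch uses only the Python standard library (Python 3.8 or later); the function name, the structure of the returned dictionary, and the package names in the example are assumptions chosen for illustration.

```python
import json
import platform
from importlib import metadata


def environment_metadata(packages=()):
    """Collect the Python version, platform, and versions of the named packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # the package is not installed here
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


# Example package names; replace them with what your analysis actually imports.
info = environment_metadata(["numpy", "pandas"])
print(json.dumps(info, indent=2))
```

Saving this output next to the results, or printing it in the last cell of a notebook, records which software versions produced them.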
Exercise 2
This is a continuation of Exercise 1. Rank the following Zenodo records from 1 (most helpful/informative) to 3 (least helpful/informative) for metadata quality.
- MS Salmanpour. (2016). Data set [Data set]. Zenodo. https://doi.org/10.5281/zenodo.193025
- Solange Duruz. (2016). Simulated breed for GENMON [Data set]. Zenodo. https://doi.org/10.5281/zenodo.220887
- Zichen Wang, Avi Ma’ayan. Zika-RNAseq-Pipeline v0.1. Zenodo; 2016. https://doi.org/10.5281/zenodo.56311
Discuss the following questions:
- What were the criteria that you used to rank?
- What was missing?
- What was the most helpful?
- What was the most critical piece of information?
Examples for learning what’s possible
- Gallery of IPython Notebooks that have been used for scientific research and educational tutorials. Browse the topics that pique your curiosity.
- Wang and Ma’ayan’s Zika manuscript
- The Python for Bioinformatics textbook, a collection of notebooks
Key Points
Your code tells what you did. Your documentation tells why you did it and why it is important.
Documentation is the key to communicating your workflow and findings with your future self, collaborators, peers, and the general public.
Jupyter Notebooks are powerful because they allow documenting the what (the code) and the why (the motivation and/or interpretation) interspersed with each other.
Good, better, best: Some metadata are already much better than none; more metadata make better metadata.