Reproducible Research using Jupyter Notebooks

Prerequisites

The course is aimed at graduate students, postdocs, and other researchers who perform computational analysis or work. The material uses basic Python for teaching and illustrating the key concepts. Advanced knowledge of Python is not needed, but some familiarity with Python will aid in absorbing the material.

Workshop Overview

This document provides basic information about Reproducible Science with Jupyter Notebook workshops for instructors:

general outline of a Reproducible Science workshop;
location of materials;
learning goals for each module;
instructor skills needed for each module;
examples of previous workshops.

All of our material is on GitHub with a CC0 copyright waiver: Reproducible Science Curriculum on GitHub

Learning Objectives

The following are the overarching learning objectives for the curriculum.

Understand the value of reproducible research practices for more effective research for current and future you.
Understand the value of reproducible research practices for advancing research as a whole.
Understand what is meant by making your research more reproducible.
Know practices to make your research more reproducible, in particular by using Jupyter Notebooks, and have the skills to do so.
Have the confidence and foundation to continue improving reproducibility of your research.
Understand what is possible and what they can still learn to be more effective with reproducible research.

Workshop outline

A Reproducible Science with Jupyter Notebooks Curriculum workshop currently has seven modules:

Introduction to the workshop
Data and Project Organization
Introduction to the Jupyter notebook
Data Exploration
Automation
Publication
Sharing

I. Workshop Introduction

Goals: Introduce the workshop, including its motivation, agenda, and goals.

Materials
Repository: https://reproducible-science-curriculum.github.io/workshop-introduction-RR-Jupyter/

II. Data and Project Organization

Goals: Students will learn to recognize common data file formats and how to import them into a Jupyter notebook; be able to design and justify a directory structure and file naming convention for a project; be able to move from an empty notebook through exploratory analysis into a more refined script or set of notebooks that communicates results reproducibly.

Instructor’s skills: Good understanding of file organisation in research projects. Understanding of file structure on major operating systems (Windows, Linux/Unix, Mac OS) and the interface/commands for managing files and folders. Understanding of basic file types (binary vs. text). At least a basic overview of how files are stored (and deleted) in different operating systems. Understanding of file and folder naming conventions (names, extensions etc.).

Materials
Repository: https://reproducible-science-curriculum.github.io/organization-RR-Jupyter/
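As an illustration of the directory structure and file naming goals of this module, here is a minimal sketch using only the Python standard library. The folder names in `PROJECT_DIRS`, the project name `my-analysis`, and the date-prefixed naming scheme are assumptions chosen for the example; they are not prescribed by the curriculum.

```python
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical layout; adapt the folder names to your own project.
PROJECT_DIRS = [
    "data/raw",         # original data, never edited by hand
    "data/processed",   # cleaned data produced by code
    "notebooks",
    "scripts",
    "results/figures",
    "docs",
]


def create_project(root: Path) -> list[Path]:
    """Create the folder skeleton under `root` and return the created paths."""
    created = []
    for rel in PROJECT_DIRS:
        path = root / rel
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


def dated_filename(day: date, description: str, extension: str) -> str:
    """Build a sortable file name: ISO date, then a lowercase hyphenated slug."""
    slug = "-".join(description.lower().split())
    return f"{day.isoformat()}_{slug}.{extension.lstrip('.')}"


# Demonstrate in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    for folder in create_project(Path(tmp) / "my-analysis"):
        print(folder.relative_to(tmp))

print(dated_filename(date(2017, 3, 1), "Survey clean", "csv"))
# prints 2017-03-01_survey-clean.csv
```

Keeping raw and processed data in separate folders, and leading file names with an ISO date, are common conventions that make a project easier to navigate and sort; students can be asked to justify their own variations.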

III. Introduction to the Jupyter notebook

Goals: Students will understand the concept, importance, and components of reproducible research; understand the strengths of Jupyter Notebooks as a tool for reproducible research; be able to create and navigate through a Jupyter Notebook containing Markdown and Code cells; and know how to access the broader Jupyter and Python ecosystems and communities.

Instructor’s skills: Familiarity with Jupyter notebooks; familiarity with Markdown; basic Python skills.

Materials
Repository: https://github.com/Reproducible-Science-Curriculum/introduction-RR-Jupyter
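To make the idea of Markdown and Code cells concrete, the sketch below builds a tiny notebook document by hand. It assumes the version 4.4 notebook JSON layout and uses only the standard library; in practice the notebook interface or the `nbformat` package would normally do this for you. The helper names `markdown_cell` and `code_cell` are hypothetical.

```python
import json


def markdown_cell(text: str) -> dict:
    """A Markdown cell holds narrative text."""
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source: str) -> dict:
    """A Code cell holds source code plus (initially empty) outputs."""
    return {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": source,
    }


notebook = {
    "cells": [
        markdown_cell("# My analysis\n\nThis notebook explores the survey data."),
        code_cell("total = 1 + 1\nprint(total)"),
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

# Saving this string to a file ending in .ipynb gives a notebook Jupyter can open.
notebook_json = json.dumps(notebook, indent=1)
print(notebook_json)
```

Showing students that a notebook is plain JSON can help explain why notebooks can be version-controlled and rendered by sites such as GitHub.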

IV. Data Exploration

Goals: Students will be able to assess the structure and cleanliness of their dataset; be able to describe their findings, translate results, and summarize their thought process in a narrative composed of Markdown text and Python code in a Jupyter Notebook; learn practices for modifying raw data to prepare a clean data set in a reproducible and documented way; and be able to assess whether their data is “Tidy” and know how to arrange it into a tidy format.

Instructor’s skills: Facility with tabular data. Understanding of the steps needed to reshape, merge, and subset data. Knowledge of different types of plots, and which types of plots are appropriate for various kinds of data. Familiarity with regular expressions, pandas, and matplotlib is helpful.

Materials
Repository: https://reproducible-science-curriculum.github.io/data-exploration-RR-Jupyter/
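The notion of arranging data into a tidy format can be sketched as follows. The table contents and the column names (`site`, `year`, `count`) are made up for illustration, and the reshaping is written in plain Python so that it runs without extra packages; the workshop materials themselves rely on pandas, where a function such as `melt` performs a similar wide-to-long reshaping.

```python
# Hypothetical "wide" table: one row per site, one column per year.
wide_rows = [
    {"site": "A", "2016": 12, "2017": 15},
    {"site": "B", "2016": 7, "2017": 9},
]


def to_tidy(rows, id_column, variable_name, value_name):
    """Return one row per (id, variable) pair, i.e. the long or tidy layout."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tidy row instead
            tidy.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tidy


tidy_rows = to_tidy(wide_rows, "site", "year", "count")
for row in tidy_rows:
    print(row)
```

In the tidy version each variable (site, year, count) has its own column and each observation has its own row, which is the layout most plotting and summarizing tools expect.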

V. Automation

Goals: Students will learn how to programmatically assemble a manuscript using elements generated by a notebook, including text, headings and figures generated from code and data.

Instructor’s skills: Good understanding of programming concepts, in particular code modularisation, writing and using functions, code reusability and so on. Good understanding of selected software engineering concepts such as project build and automation, code testing, continuous integration and so on. Solid knowledge of Python, Jupyter, and relevant packages (consult the materials for details). Understanding of basic statistical concepts (consult the materials for details).

Materials
Repository: https://reproducible-science-curriculum.github.io/automation-RR-Jupyter/
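One way to picture programmatically assembling a manuscript is a template whose text, headings, and figure references are filled in from values computed by code. The following is only a sketch under stated assumptions: the measurements are invented numbers, the figure path and title are hypothetical, and the curriculum materials may use different tooling.

```python
from statistics import mean
from string import Template

# Invented measurements standing in for results computed in a notebook.
measurements = [4.1, 3.9, 4.4, 4.0]

# Markdown skeleton; every $placeholder is filled in from code.
MANUSCRIPT = Template(
    "# $title\n\n"
    "## Results\n\n"
    "We analysed $n samples; the mean value was $mean_value.\n\n"
    "![$caption]($figure_path)\n"
)


def build_manuscript(values, title, figure_path, caption):
    """Fill the template so numbers in the text always match the data."""
    return MANUSCRIPT.substitute(
        title=title,
        n=len(values),
        mean_value=f"{mean(values):.2f}",
        figure_path=figure_path,
        caption=caption,
    )


manuscript_text = build_manuscript(
    measurements,
    title="Example report",
    figure_path="results/figures/histogram.png",  # hypothetical path
    caption="Distribution of measurements",
)
print(manuscript_text)
```

Because the sample count and mean are inserted by code, rerunning the analysis on updated data regenerates the text without manual copy and paste.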

VI. Publication

Goals: Students will learn how to export their notebooks in a variety of formats for publication; be able to describe the utility of documentation to themselves and others; be able to describe and compose appropriate and descriptive keywords for a given record; be able to define and describe the importance of unique identifiers for data, publication and software; and learn how to select an appropriate license for their research artifacts.

Instructor’s skills: Understanding of requirements for reproducible publication. Understanding of differences between publication and sharing. Understanding the difference between open and restricted access publication. Overview of tools and repositories for publishing research outputs. Knowledge of different licensing models and ability to discuss major differences between the most commonly used licenses in research.

Materials
Repository: https://reproducible-science-curriculum.github.io/publication-RR-Jupyter/

VII. Sharing

Goals: Students will learn how to share their Jupyter notebooks online, both statically (using GitHub) and interactively (using Binder).

Instructor’s skills: Some familiarity with GitHub, understanding of software dependencies and (containerized) environments.

Materials
Repository: https://reproducible-science-curriculum.github.io/sharing-RR-Jupyter/
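For interactive sharing, a public GitHub repository can be opened on mybinder.org through a launch link. The helper below is a small sketch that builds such a link; the function name `binder_url` and the organization and repository names are hypothetical, and the link only works if the repository exists and declares its software dependencies (for example in a `requirements.txt` file).

```python
from urllib.parse import quote


def binder_url(org: str, repo: str, ref: str = "HEAD") -> str:
    """Build a mybinder.org launch link for a GitHub repository.

    `ref` is a branch, tag, or commit; "HEAD" points at the default branch.
    """
    return f"https://mybinder.org/v2/gh/{quote(org)}/{quote(repo)}/{quote(ref)}"


# Hypothetical repository used purely for illustration.
print(binder_url("example-org", "example-repo"))
# prints https://mybinder.org/v2/gh/example-org/example-repo/HEAD
```

Pointing `ref` at a tag or commit rather than a moving branch is one way to give readers the exact version of the notebooks that accompanied a publication.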

Workshops held previously

This curriculum is in pilot status. The initial version of the curriculum (resulting from the Jan 2017 development workshop) has been taught as follows:

Duke University, March 2017 (full 2-day curriculum)
NIH, April 2017 (shortened 1-day version)

The next iteration of the curriculum (resulting from the Jan 2018 development workshop) has been taught at the following workshops:

UC Merced, Jan 2018 (full 2-day curriculum)
NIH, Jan 2018

Ongoing work

These materials are being developed and revised on an ongoing basis. The list of GitHub issues for the Reproducible-Science-Curriculum shows what is currently in progress and what still needs to be done.