Introduction to automation

"Reproducibility is actually all about being as lazy as possible!"

– Hadley Wickham (via Twitter, 2015-05-03)

Disclaimer

Level of lesson

Depending on your previous experience with R Markdown, this lesson might seem too advanced as it provides ways to deal with annoyances you can face when you try to use it within a large project.
If you have never used R Markdown before this is a good opportunity to learn efficient practices right away.

Goals

Main goal: learn some tips and tricks that will make your life easier with R Markdown if you're using it in the context of a research project involving
- multiple sources of raw data
- that are combined to create multiple intermediate datasets that are themselves combined to generate figures and tables.
Start small: You don't need to adopt all of these practices at once.
- Start with only one dataset and one figure, then later learn how to use Make, and then Travis.

Toolkit level and skills

None of these tools are exceedingly hard to learn or to grasp, but trying to learn everything at once might feel overwhelming.
In the process of learning these tools, you will learn skills that you will be able to translate into other components of your research.
- Interested in writing a package in R? Much of the advice and many of the tools included in this lesson will greatly help you get started.
- Want to start using Make and Travis? This will provide an opportunity to learn some basics of UNIX and Linux, which are generally useful skills to have if you are interested in analyzing large datasets (e.g., setting up runs on an HPC cluster).

The issue(s)

Challenges of the reproducible workflow

R Markdown allows you to mix code and prose, which is wonderful and very powerful, but can be difficult to manage if you don't have a good plan to get organized.

Challenge: Develop your analysis at the same time as writing your manuscript and refining your ideas, adjusting the aim of your paper, deciding on the data you are going to include, etc.

A solution

Outline

Demonstrate writing functions to generate the clean version of your data, your figures, your tables and your manuscript.

Why? Having all the content of your manuscript as functions will greatly facilitate the upkeep of your manuscript, as it forces you to be organized.
Additional benefits:
1. Modularity
2. Fewer variables
3. Better documentation
4. Testing

Modularity

By breaking down your analysis into functions, you end up with blocks of code that can interact and depend on each other in explicit ways.
It allows you to avoid repeating yourself, and you will be able to re-use the functions you create for other projects more easily than if your paper only contains scripts.
- House of cards vs. house made of lego.
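
As a minimal sketch of what such building blocks can look like, the code below chains two small functions, where the second depends explicitly on the output of the first. The data, the column names (`country`, `year`, `lifeExp`) and the function names (`clean_data`, `mean_by_country`) are hypothetical and only serve as an illustration.

```r
# Hypothetical raw data; in a real project this would be read from a file.
raw <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2000, 2005, 2000, 2005),
  lifeExp = c(60, NA, 70, 72)
)

# Block 1: keep only the rows without missing values.
clean_data <- function(raw) {
  raw[stats::complete.cases(raw), , drop = FALSE]
}

# Block 2: summarise the cleaned data; it takes the output of clean_data()
# as its input, so the dependency between the two blocks is explicit.
mean_by_country <- function(clean) {
  stats::aggregate(lifeExp ~ country, data = clean, FUN = mean)
}

summary_table <- mean_by_country(clean_data(raw))
summary_table
```

Each block can be swapped out or reused in another project without touching the other, as long as its input and output keep the same structure.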

Fewer variables to worry about…

so you can focus on the important stuff!

Avoid having to track temporary variables: If your manuscript only contains scripts, you will accumulate many variables, and you will have to worry about name conflicts among all the temporary variables that store intermediate versions of your datasets but that you won't need in your analysis.
- Putting everything into functions will hide these variables from your global environment so that you can focus on the important stuff: the inputs and the outputs of your workflow.
Keep track of dependencies easily: Functions that produce the variables, results, or figures you need in your manuscript allow you to track how your variables are related, which dataset depends on which, etc.
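
To illustrate how a function hides its temporary variables, here is a small sketch. The function name `sample_sd` and its internal variables are made up for this example.

```r
sample_sd <- function(values) {
  # These temporary variables only exist while the function is running.
  no_missing <- values[!is.na(values)]
  centered   <- no_missing - mean(no_missing)
  sqrt(sum(centered^2) / (length(no_missing) - 1))
}

result <- sample_sd(c(2, 4, NA, 6))
result

# Neither temporary variable was created in the calling environment,
# so only the input and the output are left to keep track of.
exists("no_missing", inherits = FALSE)
exists("centered", inherits = FALSE)
```

In a script, `no_missing` and `centered` would both stay in the global environment after this calculation, ready to clash with later code that reuses the same names.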

Documenting your code

Ideally, your code should be written so that it's easy to understand and your intentions are clear.
- However, what might seem clear to you now might be clear as mud 6 months from now or even 3 weeks from now.
- Other times, it might not seem very efficient to refactor a piece of code to make it clearer, and you end up with a piece of code that works but is clunky.
- If you thrive on geekiness and/or nerdiness, you might end up over-engineering a part of your code and making it more difficult to understand a few weeks later.
In all of these situations, and even if you think your code is clear and simple, it's important that you document your code and your functions, for your collaborators, and your future self.

Can't I just document my scripts?

- If all your analysis is made up of scripts, with pieces that are repeated in multiple parts of your document, things can get out of hand pretty quickly.

Not only is it more difficult to maintain, because you will have to find and replace the thing that you need to change in multiple places in your code, but managing documentation is also challenging.
- Do you also duplicate your comments where you duplicate parts of your scripts?
- How do you keep the duplicated comments in sync?
Re-organizing your scripts into functions (or organizing your analysis in functions from the beginning) will allow you to explicitly document the dataset or the parameters on which your function, and therefore your results, depend.

Tips for documenting your code

Easiest way: add comments around your functions to explicitly indicate the purpose of each function, what the arguments are supposed to be (class and format) and the kind of output you will get from it.
Document not only the kind of input your function takes, but also the format and structure of the output.
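
A comment block along these lines is one possible way to do it. The function `fahrenheit_to_celsius` is a made-up example, and the layout of the comments is only a suggestion, not a required format.

```r
# Purpose: convert temperatures from degrees Fahrenheit to degrees Celsius.
# Arguments:
#   temp_f: numeric vector of temperatures, in degrees Fahrenheit.
# Output:
#   A numeric vector of the same length as temp_f, in degrees Celsius.
fahrenheit_to_celsius <- function(temp_f) {
  # Fail early if the input is not of the documented class.
  stopifnot(is.numeric(temp_f))
  (temp_f - 32) * 5 / 9
}

fahrenheit_to_celsius(c(32, 212))
```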

Documentation, kicked up a notch

roxygen is a format that allows the documentation of functions, and it can easily be converted into the file formats used by R documentation.

Writing for roxygen is not very different from writing simple comments; you just need to add some keywords to define what will end up in the different sections of the help files.
This is not a strict requirement, and it will not make your analysis more reproducible, but it will be useful down the road if you think you will convert your manuscript into a package (more on this in a sec).
RStudio makes it easy to write roxygen.
- Once you have started writing a function, in the menu choose Code > Insert Roxygen Skeleton or type Ctrl + Alt + Shift + R on your keyboard.
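
As a sketch of what roxygen comments can look like, the function below is documented with the `@param`, `@return` and `@examples` keywords. The function `keep_years` and its arguments are hypothetical. The `#'` lines are ordinary comments as far as R is concerned, so the code runs with or without the roxygen2 package installed.

```r
#' Keep the rows of a dataset that fall within a range of years
#'
#' @param data A data frame with a numeric column named `year`.
#' @param from,to Numeric values giving the first and last year to keep
#'   (both inclusive).
#' @return A data frame with the same columns as `data`, containing only the
#'   rows where `year` is between `from` and `to`.
#' @examples
#' keep_years(data.frame(year = 1990:1995, value = 1:6), from = 1992, to = 1993)
keep_years <- function(data, from, to) {
  stopifnot(is.data.frame(data), "year" %in% names(data))
  data[data$year >= from & data$year <= to, , drop = FALSE]
}

keep_years(data.frame(year = 1990:1995, value = 1:6), from = 1992, to = 1993)
```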

Testing

When you start writing a lot of code for your paper, it becomes easier to introduce bugs.
- For example, if your analysis relies on data that gets updated often, you may want to make sure that all the columns are there, and that they don't include data they should not.
If these issues break something in your analysis, you might be able to find it easily, but more often than not, these issues might produce subtle differences in your results that you may not be able to detect.
If all your code is made up of functions, then you can control the input and test for the output. It is something that would be difficult if not impossible to do if all your analysis is in the form of a long script.
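
For instance, a check on a dataset that gets updated often could be sketched as follows. The function `check_survey`, the expected column names and the rule that counts cannot be negative are all assumptions made for this illustration.

```r
# Hypothetical validation step: stop with an informative message if the data
# does not have the structure the rest of the analysis relies on.
check_survey <- function(data) {
  expected_cols <- c("site", "year", "count")
  missing_cols <- setdiff(expected_cols, names(data))
  if (length(missing_cols) > 0) {
    stop("Missing column(s): ", paste(missing_cols, collapse = ", "))
  }
  if (any(data$count < 0, na.rm = TRUE)) {
    stop("Column 'count' contains negative values.")
  }
  invisible(data)
}

# Controlled input that should pass the check silently.
good <- data.frame(site = c("a", "b"), year = c(2014, 2015), count = c(3, 0))
check_survey(good)

# Controlled input that should fail: the 'count' column has been dropped.
bad <- good[, c("site", "year")]
outcome <- tryCatch(check_survey(bad), error = function(e) conditionMessage(e))
outcome
```

Because the check is a function, it can be run on small, hand-made inputs like `good` and `bad` as well as on the real data.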

Testing, kicked up a notch

The testthat package provides a powerful and easy-to-use framework to build tests for your functions.
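
A minimal sketch of a testthat test is shown below, assuming the testthat package is installed (the test is skipped otherwise). The function under test, `count_per_site`, and its data are hypothetical.

```r
# Hypothetical function under test: total count for each site.
count_per_site <- function(data) {
  stats::aggregate(count ~ site, data = data, FUN = sum)
}

# Only run the test if testthat is available on this machine.
if (requireNamespace("testthat", quietly = TRUE)) {
  library(testthat)

  test_that("count_per_site returns one row per site with summed counts", {
    # Controlled input for which the expected output is known.
    input  <- data.frame(site = c("a", "a", "b"), count = c(1, 2, 5))
    output <- count_per_site(input)

    expect_named(output, c("site", "count"))
    expect_equal(nrow(output), 2)
    expect_equal(output$count, c(3, 5))
  })
}
```

In a project, tests like this one would typically live in their own files rather than next to the function definition.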

Organizing your files

File organization

File organization for this lesson

data-raw: the original data. You shouldn't edit or otherwise alter any of the files in this folder.
data-output: intermediate datasets that will be generated by the analysis.
- We write them to CSV files so we can share them with collaborators.
- If it took a long time to generate the files, we may want to also use them for our analysis. For this example, they are small and can be recreated every time.
fig: the folder where we can store the figures used in the manuscript.
R: our R code (the functions)
- Often easier to keep the prose separated from the code.
- If you have a lot of code (and/or your manuscript is long), it's easier to navigate.
tests: the code to test that our functions are behaving properly and that all our data is included in the analysis.
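
The folder layout above could be set up from R along these lines. This is only a sketch: `project_root` points to a temporary directory so that the code can be run without touching an existing project, and the folder name `my-project`, the dataset and the file name `intermediate.csv` are made up.

```r
# Hypothetical project location inside R's temporary directory.
project_root <- file.path(tempdir(), "my-project")
folders <- c("data-raw", "data-output", "fig", "R", "tests")

# Create each folder (and the project root, if needed).
for (folder in folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Write a made-up intermediate dataset to data-output/ as a CSV file,
# so it can be shared with collaborators.
intermediate <- data.frame(site = c("a", "b"), count = c(3, 5))
csv_path <- file.path(project_root, "data-output", "intermediate.csv")
write.csv(intermediate, csv_path, row.names = FALSE)

list.files(project_root)
```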

What's next?

Today we are going to work on functionalizing a knitr document that is more complex than what we have seen so far, but not quite as complex as a "real" research document might be.

Let's take a look at the example-manuscript folder…