Lack of reproducibility in science causes significant issues
Science retracted (without lead author's consent) a study of how canvassers can sway people's opinions about gay marriage
The original survey data were not made available for independent reproduction of the results (in addition, survey incentives were misrepresented and the sponsorship statement was false)
Two Berkeley grad students attempted to replicate the study and discovered that the data must have been faked.
From the authors of Low Dose Lidocaine for Refractory Seizures in Preterm Neonates (doi:10.1007/s12098-010-0331-7):
The article has been retracted at the request of the authors. After carefully re-examining the data presented in the article, they identified that data of two different hospitals got terribly mixed. The published results cannot be reproduced in accordance with scientific and clinical correctness.
Source: Retraction Watch
Reproducible science accelerates scientific progress.
See an experiment on reproducing reproducible computational research
Day 1
Day 2
This is a two-part exercise:
Part 1: Analyze + document
Part 2: Swap + discuss
Complete the following tasks and write instructions / documentation for your collaborator to reproduce your work starting with the original dataset (data/gapminder-5060.csv).
Download material: http://bit.ly/cbb-retreat -> Releases -> Latest
Visualize life expectancy over time for Canada in the 1950s and 1960s using a line plot.
Something is clearly wrong with this plot! Turns out there's a data error in the data file: life expectancy for Canada in the year 1957 is coded as 999999; it should actually be 69.96. Make this correction.
Visualize life expectancy over time for Canada again, with the corrected data.
Stretch goal: Add lines for Mexico and United States.
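One possible way to approach the correction step is sketched below in base R. It is only an illustration, not the expected solution: the column names country and year are assumptions about the file layout (only lifeExp appears in these slides), the data frame is a toy stand-in for data/gapminder-5060.csv, and every life expectancy value other than 999999 and 69.96 is a made-up placeholder. The helper fix_canada_1957 and the output file name are hypothetical.

```r
# Toy stand-in for data/gapminder-5060.csv. The values 68, 71 and 55 are
# made-up placeholders, NOT real Gapminder figures.
gap_5060 <- data.frame(
  country = c("Canada", "Canada", "Canada", "Mexico"),
  year    = c(1952, 1957, 1962, 1957),
  lifeExp = c(68, 999999, 71, 55)
)

# Correct the miscoded value in a copy, so the raw data stays untouched.
fix_canada_1957 <- function(gap) {
  bad <- gap$country == "Canada" & gap$year == 1957
  stopifnot(sum(bad) == 1)  # expect exactly one matching row
  gap$lifeExp[bad] <- 69.96
  gap
}

gap_5060_fixed <- fix_canada_1957(gap_5060)

# Line plot of life expectancy over time for Canada, written to a PDF.
# A temporary folder is used here; in a real project this would be fig/.
canada <- subset(gap_5060_fixed, country == "Canada")
fig_file <- file.path(tempdir(), "canada-lifeExp.pdf")
pdf(fig_file)
plot(canada$year, canada$lifeExp, type = "l",
     xlab = "Year", ylab = "Life expectancy (years)", main = "Canada")
dev.off()
```

Doing the fix in code, rather than editing the CSV by hand, leaves a record of what was changed and why, which is exactly what a collaborator needs in order to reproduce the result.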
Introduce yourself to your collaborator.
Swap instructions / documentation with your collaborator, and try to reproduce their work, first without talking to each other. If your collaborator does not have the software they need to reproduce your work, we encourage you to either help them install it or walk them through it on your computer in a way that would emulate the experience. (Remember, this could be part of the irreproducibility problem!)
Then, talk to each other about challenges you faced (or didn't face) or why you were or weren't able to reproduce their work.
This exercise:
In a lab setting:
Documentation: difference between binary files (e.g. docx) and text files and why text files are preferred for documentation
Organization: tools to organize your projects so that you don't have a single folder with hundreds of files
Automation: the power of scripting to create automated data analyses
Dissemination: publishing is not the end of your analysis; rather, it is a way station toward your future research and the future research of others
Provenance with results pasted into manuscript:
Life expectancy shouldn't exceed even the most extreme age observed for humans.
if (any(gap_5060$lifeExp > 150)) {
  stop("improbably high life expectancies")
}
Error in eval(expr, envir, enclos): improbably high life expectancies
The testthat package allows us to make this a little more readable:
library(testthat)
expect_that(any(gap_5060$lifeExp > 150), is_false(),
            "improbably high life expectancies")
Error: any(gap_5060$lifeExp > 150) isn't false
improbably high life expectancies
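The same sanity check could also be wrapped in a small function that reports which rows are suspicious. The sketch below uses base R only; the function name check_life_exp, its max_age argument, and the toy data frame are all hypothetical, and the placeholder values (other than 999999) are not real data.

```r
# Toy data; 68 and 71 are made-up placeholders, not real Gapminder values.
gap_toy <- data.frame(
  country = c("Canada", "Canada", "Canada"),
  year    = c(1952, 1957, 1962),
  lifeExp = c(68, 999999, 71)
)

# Stop with an informative message if any life expectancy exceeds max_age.
check_life_exp <- function(gap, max_age = 150) {
  bad <- which(gap$lifeExp > max_age)
  if (length(bad) > 0) {
    stop("improbably high life expectancies in row(s): ",
         paste(bad, collapse = ", "))
  }
  invisible(TRUE)
}

# Capture the error message instead of halting the script.
msg <- tryCatch(
  check_life_exp(gap_toy),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Naming the offending rows in the message makes it quicker to find the problem in the raw file.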
File organization and naming are effective weapons against chaos.
Your data files contain readings from a well plate, one file per well,
using a specific assay run on a certain date, after a certain treatment.
$ ls *Plsmd*
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A01.csv
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A02.csv
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A03.csv
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_B01.csv
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_B02.csv
...
2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_H03.csv
> list.files(pattern = "Plsmd") %>% head
[1] 2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A01.csv
[2] 2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A02.csv
[3] 2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A03.csv
[4] 2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_B01.csv
[5] 2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_B02.csv
[6] 2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_B03.csv
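# flist is assumed to hold the file names listed above
flist <- list.files(pattern = "Plsmd")
# split each file name on "_" and "." into its five components
meta <- stringr::str_split_fixed(flist, "[_\\.]", 5)
colnames(meta) <- c("date", "assay", "experiment", "well", "ext")
meta[, 1:4]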
date assay experiment well
[1,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "A01"
[2,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "A02"
[3,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "A03"
[4,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "B01"
[5,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "B02"
[6,] "2013-06-26" "BRAFASSAY" "Plsmd-CL56-1MutFrac" "B03"
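If stringr is not available, roughly the same metadata table can be built with base R. The following is a sketch under that assumption, using two of the file names shown above; the derived columns plate_row and plate_col are hypothetical additions to illustrate why consistent names pay off.

```r
# Two of the file names from the listing above
flist <- c(
  "2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_A01.csv",
  "2013-06-26_BRAFASSAY_Plsmd-CL56-1MutFrac_B02.csv"
)

# Split on "_" or "."; inside a character class "." is a literal dot
parts <- strsplit(flist, "[_.]")

# Every name must yield exactly five fields for rbind to line up
stopifnot(all(lengths(parts) == 5))

meta <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
names(meta) <- c("date", "assay", "experiment", "well", "ext")

# Because the names are machine readable, typed columns come for free
meta$date      <- as.Date(meta$date)
meta$plate_row <- substr(meta$well, 1, 1)
meta$plate_col <- as.integer(substr(meta$well, 2, 3))

print(meta)
```

This only works because every file follows the same date_assay_experiment_well.ext pattern, with "_" reserved for separating fields and "-" used inside a field.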
Noble, William Stafford. 2009. “A Quick Guide to Organizing Computational Biology Projects.” PLoS Computational Biology 5 (7): e1000424.
|
+-- data-raw/
| |
| +-- gapminder-5060.csv
| +-- gapminder-7080.csv
| +-- ....
|
+-- data-output/
|
+-- fig/
|
+-- R/
| |
| +-- figures.R
| +-- data.R
| +-- utils.R
| +-- dependencies.R
|
+-- tests/
|
+-- manuscript.Rmd
+-- make.R
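A folder skeleton like the one above could be set up with a few lines of base R. This is a minimal sketch: the function name create_skeleton and the project name my-project are hypothetical, and a temporary directory is used so the example does not touch any real project.

```r
# Create the sub-folders of the layout under a given project root.
create_skeleton <- function(root) {
  dirs <- file.path(root, c("data-raw", "data-output", "fig", "R", "tests"))
  for (d in dirs) {
    dir.create(d, recursive = TRUE, showWarnings = FALSE)
  }
  # Return (invisibly) whether each folder now exists
  invisible(dir.exists(dirs))
}

# Demonstration in a temporary location
project_root <- file.path(tempdir(), "my-project")
created <- create_skeleton(project_root)
print(list.files(project_root))
```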
data-raw: the original data. You shouldn't edit or otherwise alter any of the files in this folder.
data-output: intermediate datasets that will be generated by the analysis.
fig: the folder where we can store the figures used in the manuscript.
R: our R code (the functions).
tests: the code to test that our functions are behaving properly and that all our data is included in the analysis.
make_ms <- function() {
  rmarkdown::render("manuscript.Rmd",
                    "html_document")
  invisible(file.exists("manuscript.html"))
}
clean_ms <- function() {
  res <- file.remove("manuscript.html")
  invisible(res)
}
make_all <- function() {
  make_data()
  make_figures()
  make_tests()
  make_ms()
}
clean_all <- function() {
  clean_data()
  clean_figures()
  clean_ms()
}
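The helpers clean_data() and clean_figures() are called above but not shown. One way they might look is sketched here, assuming that generated data are CSV files in data-output/ and figures are PDF or PNG files in fig/; the shared helper clean_dir, the function arguments, and the file patterns are all hypothetical.

```r
# Remove all files in `path` whose names match `pattern`.
clean_dir <- function(path, pattern) {
  files <- list.files(path, pattern = pattern, full.names = TRUE)
  invisible(file.remove(files))
}

clean_data <- function(out_dir = "data-output") {
  clean_dir(out_dir, "\\.csv$")
}

clean_figures <- function(fig_dir = "fig") {
  clean_dir(fig_dir, "\\.(pdf|png)$")
}

# Demonstration in a temporary folder rather than a real project
demo_fig <- file.path(tempdir(), "fig-demo")
dir.create(demo_fig, showWarnings = FALSE)
file.create(file.path(demo_fig, c("plot1.pdf", "plot2.png", "notes.txt")))
clean_figures(demo_fig)
remaining <- list.files(demo_fig)
print(remaining)
```

Restricting deletion to generated file types in generated folders keeps the clean step from ever touching data-raw/.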
testthat includes a function called test_dir that will run tests included in files in a given directory. We can use it to run all the tests in our tests/ folder.
test_dir("tests/")
Let's turn it into a function, so we'll be able to add more functionality to it a little later. We are also going to save it at the root of our working directory in the file called make.R:
## add this to make.R
make_tests <- function() {
  test_dir("tests/")
}
Piwowar & Vision (2013) “Data reuse and the open data citation advantage.” PeerJ, e175
Figure 1: Citation density for papers with and without publicly available microarray data, by year of study publication.
Wicherts et al (2011) “Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results.” PLoS ONE 6(11): e26828
Figure 1. Distribution of reporting errors per paper for papers from which data were shared and from which no data were shared.
Do's
Don'ts
Morin, Andrew, Jennifer Urban, and Piotr Sliz. 2012. “A Quick Guide to Software Licensing for the Scientist-Programmer.” PLoS Computational Biology 8 (7): e1002598.
From the Panton Principles:
[In] the scholarly research community the act of citation is a commonly held community norm when reusing another community member’s work. […] A well functioning community supports its members in their application of norms, whereas licences can only be enforced through court action and thus invite people to ignore them when they are confident that this is unlikely.
Peng, R. D. “Reproducible Research in Computational Science” Science 334, no. 6060 (2011): 1226–1227
The Markdown sources and the HTML are hosted on GitHub: https://github.com/Reproducible-Science-Curriculum/cbb-retreat