Now that we know how to write functions, we can apply this concept to the code in our manuscript.
Let's start with the chunk from the manuscript:

    ## Gathering all the data files
    split_gdp_files <- list.files(path = "data-raw",
                                  pattern = "gdp-percapita\\.csv$",
                                  full.names = TRUE)
    split_gdp_list <- lapply(split_gdp_files, read.csv)
    gdp <- do.call("rbind", split_gdp_list)

To turn this chunk into a function, we wrap it in a function body and put the gdp variable on the last line, so that it becomes the value the function returns:

    gather_gdp_data <- function() {
      split_gdp_files <- list.files(path = "data-raw",
                                    pattern = "gdp-percapita\\.csv$",
                                    full.names = TRUE)
      split_gdp_list <- lapply(split_gdp_files, read.csv)
      gdp <- do.call("rbind", split_gdp_list)
      gdp
    }

Demo 1: Function returns a data.frame with all countries combined
Use the folder where the files are stored and the pattern as arguments: path and pattern, respectively.
This allows the function to be reused in another project where a similar operation (combining many CSV files into a single data.frame) is needed.

    gather_data <- function(path = "data-raw", pattern = "gdp-percapita\\.csv$") {
      split_files <- list.files(path = path, pattern = pattern, full.names = TRUE)
      split_list <- lapply(split_files, read.csv)
      res <- do.call("rbind", split_list)
      res
    }

Demo 2: Revised function with inputs
The code here is pretty simple because we know that all datasets have exactly the same columns, but in a real-life example, we might want to add additional checks to ensure that we won't be introducing any issues.
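As an illustration of such a check, here is a minimal sketch that refuses to combine files whose column names differ from those of the first file. The name gather_data_checked and the toy files written to a temporary folder are assumptions made for this example; they are not part of the lesson's data.

```r
# Sketch only: gather_data() with a column check before combining.
# gather_data_checked is a hypothetical name used for illustration.
gather_data_checked <- function(path = "data-raw",
                                pattern = "gdp-percapita\\.csv$") {
  split_files <- list.files(path = path, pattern = pattern, full.names = TRUE)
  if (length(split_files) == 0) {
    stop("No files matching '", pattern, "' found in '", path, "'")
  }
  split_list <- lapply(split_files, read.csv)

  # Compare every file's column names against those of the first file
  expected <- names(split_list[[1]])
  same_cols <- vapply(split_list,
                      function(x) identical(names(x), expected),
                      logical(1))
  if (!all(same_cols)) {
    stop("Unexpected columns in: ",
         paste(basename(split_files[!same_cols]), collapse = ", "))
  }

  do.call("rbind", split_list)
}

# Try it on two made-up files in a temporary folder
tmp <- file.path(tempdir(), "gather-demo")
dir.create(tmp, showWarnings = FALSE, recursive = TRUE)
write.csv(data.frame(country = "A", year = 2007, gdpPercap = 1),
          file.path(tmp, "a-gdp-percapita.csv"), row.names = FALSE)
write.csv(data.frame(country = "B", year = 2007, gdpPercap = 2),
          file.path(tmp, "b-gdp-percapita.csv"), row.names = FALSE)
combined <- gather_data_checked(path = tmp)
```

Stopping with an error is only one option. Depending on the project, a warning or dropping the offending file might be more appropriate.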
This also raises the question of how general your functions need to be. We could spend a lot of time optimizing and writing a function that would work in all cases. Sometimes it's worth your time; sometimes it distracts from your primary goal: writing the manuscript.
We can create a make_csv function to automatically generate CSV files from our data sets.
This might come in handy if you want to send your intermediate datasets to your collaborators or if you want to inspect more closely that everything is working as it should.
This function should take a data frame and make a CSV file out of it.
We call write.csv with row.names = FALSE because we don't want the row names in the output.
The verbose argument is useful for keeping track of progress.
dir.create creates the directory given in path; recursive = TRUE also creates any missing parent directories, and showWarnings = FALSE suppresses the warning that would otherwise be issued if the directory already exists.

    make_csv <- function(obj, path, file, verbose = TRUE) {
      if (verbose) {
        message("Creating csv file: ", file)
      }
      dir.create(path, showWarnings = FALSE, recursive = TRUE)
      write.csv(obj, file = paste0(path, "/", file), row.names = FALSE)
    }

Combine the two functions we just wrote (make_csv and gather_data) to generate a CSV file that contains the data from all countries:

    gdp_data <- gather_data()
    make_csv(gdp_data, path = "data-output", file = "gdp.csv", verbose = TRUE)

Demo 3: Run the functions.
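Because make_csv is called for its side effect (a file on disk), one way to confirm that it worked is to read the file back and compare it with the original object. The sketch below repeats the make_csv definition so that it runs on its own. The toy data frame, the temporary output folder and the file name toy.csv are made up for illustration.

```r
# make_csv() as defined above, repeated so this sketch is self-contained
make_csv <- function(obj, path, file, verbose = TRUE) {
  if (verbose) {
    message("Creating csv file: ", file)
  }
  dir.create(path, showWarnings = FALSE, recursive = TRUE)
  write.csv(obj, file = paste0(path, "/", file), row.names = FALSE)
}

# Toy data standing in for gdp_data (values are made up)
toy <- data.frame(country = c("A", "B"),
                  year = c(2007L, 2007L),
                  gdpPercap = c(1.5, 2.5))

# Write to a temporary folder so nothing in the project is touched
out_dir <- file.path(tempdir(), "data-output")
make_csv(toy, path = out_dir, file = "toy.csv", verbose = FALSE)

# Read the file back and compare it with what we wrote
round_trip <- read.csv(file.path(out_dir, "toy.csv"))
same <- isTRUE(all.equal(toy, round_trip))
```

A check like this is most useful while developing the function; it does not need to stay in the manuscript once you trust the output.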
Transform these two pieces of code into functions:

    ## Turn this into a function called get_mean_lifeExp
    mean_lifeExp_by_cont <- gdp %>%
      group_by(continent, year) %>%
      summarize(mean_lifeExp = mean(lifeExp)) %>%
      as.data.frame

    ## Turn this into a function called get_latest_lifeExp
    latest_lifeExp <- gdp %>%
      filter(year == max(gdp$year)) %>%
      group_by(continent) %>%
      summarize(latest_lifeExp = mean(lifeExp)) %>%
      as.data.frame