Basics

Using functions to…

Now that we know how to write functions, we can use this concept for

  • preparing our data sets for analysis,
  • generating the intermediate data files,
  • and exporting them to CSV so that we can, for instance, share them with our collaborators.

Functionalizing a chunk

Let's start with the chunk from the manuscript:

## Gathering all the data files
split_gdp_files <- list.files(path = "data-raw", pattern = "gdp-percapita\\.csv$", 
                              full.names = TRUE)
split_gdp_list <- lapply(split_gdp_files, read.csv)
gdp <- do.call("rbind", split_gdp_list)
  • Enclose these lines of code inside curly brackets
  • Don't forget to return the gdp variable on the last line

Functionalizing a chunk (cont.)

  • Enclose these lines of code inside curly brackets
  • Don't forget to return the gdp variable on the last line
gather_gdp_data <- function() {
    split_gdp_files <- list.files(path = "data-raw", pattern = "gdp-percapita\\.csv$", 
                                  full.names = TRUE)
    split_gdp_list <- lapply(split_gdp_files, read.csv)
    gdp <- do.call("rbind", split_gdp_list)
    gdp
}


Demo 1: Function returns a data.frame with all countries combined
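
A minimal usage sketch (assuming the split per-country files are in data-raw/, as above):

gdp <- gather_gdp_data()
class(gdp)  # should be "data.frame"
head(gdp)   # first rows, all countries combined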

Making functions more versatile

  • Use the folder where the files are stored and the pattern as arguments: path and pattern, respectively

  • Allows re-use in another project where a similar operation (combining many CSV files into a single data.frame) is needed

gather_data <- function(path = "data-raw", pattern = "gdp-percapita\\.csv$") {
  split_files <- list.files(path = path, pattern = pattern, full.names = TRUE)
  split_list <- lapply(split_files, read.csv)
  res <- do.call("rbind", split_list)
  res
}

Demo 2: Revised function with inputs
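
For example (a sketch; the survey files and pattern in the second call are hypothetical):

## Same behaviour as the original function, using the defaults
gdp <- gather_data()

## Hypothetical re-use in another project with a different folder and pattern
surveys <- gather_data(path = "data-raw", pattern = "^survey-.*\\.csv$")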

Caveats

  • The code here is pretty simple because we know that all the data sets have exactly the same columns, but in a real-life example we might want to add additional checks to make sure we don't introduce any issues (see the sketch after this list).

  • This also illustrates the question of how general your functions need to be. We could spend a lot of time writing and optimizing a function that works in all cases. Sometimes that is worth your time, and sometimes it just distracts from your primary goal: writing the manuscript.
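
For instance, a minimal sketch of such a check (gather_data_checked is a hypothetical variant, not part of the lesson's code): verify that every file has the same column names before calling rbind.

gather_data_checked <- function(path = "data-raw", pattern = "gdp-percapita\\.csv$") {
  split_files <- list.files(path = path, pattern = pattern, full.names = TRUE)
  split_list <- lapply(split_files, read.csv)
  ## compare each data set's column names against the first one
  same_cols <- vapply(split_list,
                      function(d) identical(names(d), names(split_list[[1]])),
                      logical(1))
  if (!all(same_cols)) {
    stop("These files have different columns: ",
         paste(basename(split_files[!same_cols]), collapse = ", "))
  }
  do.call("rbind", split_list)
}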

Towards automation

  • We can create a make_csv function to automatically generate CSV files from our data sets.

  • This might come in handy if you want to send your intermediate data sets to your collaborators, or if you want to inspect more closely whether everything is working as it should.

  • This function should take a data frame and make a CSV file out of it.

Writing the function

  • Start with a call to write.csv
  • Use row.names = FALSE because we don't want row names in the output
  • verbose is useful for keeping track of progress
  • dir.create creates the output directory; recursive = TRUE also creates any missing parent directories, and showWarnings = FALSE silences the warning if the directory already exists
make_csv <- function(obj, path, file, verbose = TRUE) {
  if (verbose) {
    message("Creating csv file: ", file)
  }
  dir.create(path, showWarnings = FALSE, recursive = TRUE)  
  write.csv(obj, file = paste0(path, "/", file), row.names = FALSE)
}

Putting it all together

Combine the two functions we just wrote (make_csv and gather_data) to generate a CSV file that contains the data from all countries:

gdp_data <- gather_data()
make_csv(gdp_data, path = "data-output", file = "gdp.csv", verbose = TRUE)


Demo 3: Run the functions.
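
To double-check the output, you could read the file back in (a quick sketch):

head(read.csv(file.path("data-output", "gdp.csv")))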

Your turn

Transform these two pieces of code into functions:

## Turn this into a function called get_mean_lifeExp
## (both chunks use dplyr, so it needs to be loaded)
library(dplyr)
mean_lifeExp_by_cont <- gdp %>%
    group_by(continent, year) %>%
    summarize(mean_lifeExp = mean(lifeExp)) %>%
    as.data.frame
## Turn this into a function called get_latest_lifeExp
latest_lifeExp <- gdp %>%
    filter(year == max(gdp$year)) %>%
    group_by(continent) %>%
    summarize(latest_lifeExp = mean(lifeExp)) %>%
    as.data.frame
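
If you get stuck, one possible shape for the first one is sketched below (passing the data set as an argument is one design choice; you could also keep using gdp from the calling environment):

get_mean_lifeExp <- function(gdp) {
    gdp %>%
        group_by(continent, year) %>%
        summarize(mean_lifeExp = mean(lifeExp)) %>%
        as.data.frame
}

## e.g. mean_lifeExp_by_cont <- get_mean_lifeExp(gdp)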