Data & Project Organization: Glossary

Key Points

Introduction
  • Using disorganized data is time-consuming and error-prone.

  • Collaborators like your past self do not respond to email.

Project Structure
  • Organize and name files so that they make intuitive sense to your future self, and follow the narrative of the data analysis.

  • Populate folders with README files that describe the project and give context for the analyses.

  • Raw data stays exactly as it was received and should never be modified.

  • Keep a clear record of every modification that has been made. Ideally, this is in the form of a script that can automatically generate cleaned data from the raw data.

  • Generated files (processed data, figures, etc.) should not be mixed into the same directory as files that must be backed up.

  • For something to be reproducible as a whole, every step needs to be reproducible.
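
One way to put the points above into practice is to create the folder skeleton with a small script. The sketch below is only an illustration: the folder names in LAYOUT, the file name README.txt, and the function create_project are assumptions made for this example, not names prescribed by the lesson. It keeps raw data, scripts, and generated files in separate folders and gives each folder a short README.

```python
from pathlib import Path
import tempfile

# Hypothetical layout: folder names and descriptions are illustrative only.
LAYOUT = {
    "data/raw": "Original data. Never edit files in this folder.",
    "data/processed": "Generated from data/raw by scripts. Safe to delete and regenerate.",
    "scripts": "Code that turns raw data into processed data and figures.",
    "results/figures": "Generated figures. Safe to delete and regenerate.",
}


def create_project(root):
    """Create the folder skeleton under root and put a README.txt in each folder."""
    root = Path(root)
    created = []
    for relative, description in LAYOUT.items():
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)
        readme = folder / "README.txt"
        if not readme.exists():  # do not clobber a README someone already wrote
            readme.write_text(description + "\n", encoding="utf-8")
        created.append(folder)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_project(Path(tmp) / "my_project"):
            print(folder.relative_to(tmp))
```

Because generated files live only in data/processed and results/figures, a backup can cover data/raw and scripts and skip everything that can be regenerated.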

Metadata
  • All projects should include a README file in the top directory.

  • README files should include the names and contact details of maintainers, the date, a brief description of the intent of the project, and the source of any data files.

  • Use a README to record changes made over time.

  • README files should be made in a plain text format.
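
The fields listed above can be collected into a plain-text template. The following is a minimal sketch, assuming a hypothetical helper named render_readme and a layout invented for this example; the maintainer, email address, and data source in the usage section are placeholders, not real project details.

```python
from datetime import date


def render_readme(title, maintainers, description, data_sources, changes=(), today=None):
    """Return plain-text README content: maintainers, date, intent, data sources, changes."""
    today = today or date.today()
    lines = [title, "=" * len(title), ""]
    lines += [f"Date: {today.isoformat()}", "", "Maintainers:"]
    lines += [f"  - {name} <{email}>" for name, email in maintainers]
    lines += ["", "Description:", f"  {description}", "", "Data sources:"]
    lines += [f"  - {source}" for source in data_sources]
    lines += ["", "Change log:"]
    # Each change is a (date string, description) pair; fall back to a stub if empty.
    lines += [f"  - {when}: {what}" for when, what in changes] or ["  - (no changes recorded yet)"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(render_readme(
        title="Example project",
        maintainers=[("A. Researcher", "a.researcher@example.org")],
        description="Placeholder description of what the project is for.",
        data_sources=["survey.csv, downloaded from a placeholder source"],
    ))
```

Writing the result to a file such as README.txt keeps it in a plain text format that any editor can open.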

Modifying data
  • Cleaned data should have its own README if any manual cleaning was performed.

  • Any modification should have a clear paper trail.

  • Using GUIs for modifying data often has unexpected results.

  • Using GUIs for cleaning up data seems quick, but doing so with even a modicum of reproducibility quickly becomes laborious.
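
A scripted alternative to GUI cleaning might look like the sketch below. It is a minimal illustration under stated assumptions: the function name clean_csv, the default required column "id", and the two cleaning rules (strip surrounding whitespace, drop rows missing a required value) are all invented for this example, and the input is assumed to be a UTF-8 CSV file with a header row. The raw file is only ever read, and every modification is written to a log that serves as the paper trail.

```python
import csv
from pathlib import Path


def clean_csv(raw_path, cleaned_path, required=("id",)):
    """Write a cleaned copy of raw_path to cleaned_path and return a log of changes.

    The raw file is only opened for reading, so it is never modified.
    """
    raw_path, cleaned_path = Path(raw_path), Path(cleaned_path)
    if raw_path.resolve() == cleaned_path.resolve():
        raise ValueError("refusing to overwrite the raw data")

    log = []
    rows = []
    with raw_path.open(newline="", encoding="utf-8") as src:
        reader = csv.DictReader(src)
        fieldnames = reader.fieldnames or []
        for row in reader:
            line_no = reader.line_num  # line of the raw file this record ended on
            original = {key: row.get(key) or "" for key in fieldnames}
            stripped = {key: value.strip() for key, value in original.items()}
            if stripped != original:
                log.append(f"line {line_no}: stripped surrounding whitespace")
            missing = [col for col in required if not stripped.get(col)]
            if missing:
                log.append(f"line {line_no}: dropped row, missing {', '.join(missing)}")
                continue
            rows.append(stripped)

    cleaned_path.parent.mkdir(parents=True, exist_ok=True)
    with cleaned_path.open("w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return log
```

Saving the returned log next to the cleaned file, for example with Path("cleaning_log.txt").write_text("\n".join(log)), means the cleaned data can be regenerated and audited at any time by re-running the script on the untouched raw file.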

Concluding thoughts
  • Organize files so that they make intuitive sense and follow the narrative of the data analysis.

  • Populate folders with metadata that describes the folder contents, explains where those contents came from, and gives context for the analyses that you’re about to perform.

  • Always make copies of data for modification, and never overwrite the raw data.

  • Keep a clear record of every modification that has been made. Ideally, this is in the form of a script that can automatically generate cleaned data from the raw data.

  • If manual cleaning is necessary, create a README file that details every single change that has been made, such that a newcomer could re-create these changes.

Glossary
