~/Week 3

Brandon Rozek

PhD Student @ RPI studying Automated Reasoning in AI and Linux Enthusiast.


People are busy, especially managers and leaders. Results of data analyses are sometimes presented in oral form, but often the first cut is presented via email.

It is therefore often useful to break down the results of an analysis into different levels of granularity/detail.

Hierarchy of Information: Research Paper

Hierarchy of Information: Email Presentation

DO: Start with Good Science

DON’T: Do Things By Hand

Things done by hand need to be precisely documented (this is harder than it sounds!)

DON’T: Point and Click

DO: Teach a Computer

If something needs to be done as part of your analysis / investigation, try to teach your computer to do it (even if you only need to do it once)

In order to give your computer instructions, you need to write down exactly what you mean to do and how it should be done. Teaching a computer almost guarantees reproducibility

For example, by hand you can

  1. Go to the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/
  2. Download the Bike Sharing Dataset

Or you can teach your computer to do it using R

download.file("http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip", "ProjectData/Bike-Sharing-Dataset.zip")

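Notice that this call spells out the full URL of the dataset, as well as the directory ("ProjectData") and the file name under which it is saved on your computer.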
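As a minimal sketch, the same step can be wrapped so that the destination directory is created first and the file is fetched only when it is not already present. The helper name `fetch_dataset` and its `download` switch are hypothetical, introduced here for illustration; `download.file()`, `dir.create()`, and `file.path()` are base R functions. A temporary directory stands in for the project folder.

```r
# Hypothetical helper: download a file into dest_dir, creating the
# directory if needed, and return the local path.
fetch_dataset <- function(url, dest_dir, download = TRUE) {
  dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
  dest_file <- file.path(dest_dir, basename(url))
  # Skip the network call when the file already exists locally
  if (download && !file.exists(dest_file)) {
    download.file(url, dest_file, mode = "wb")
  }
  dest_file
}

url <- "http://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip"

# download = FALSE only resolves the path, which is handy for a dry run
local_path <- fetch_dataset(url, file.path(tempdir(), "ProjectData"), download = FALSE)
basename(local_path)
```

Because the URL, directory, and file name all live in code, rerunning the script repeats the step exactly, as long as the link stays available.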

DO: Use Some Version Control

It helps you slow things down by adding changes in small chunks (don’t just do one massive commit). It lets you track and tag snapshots so that you can revert to older versions of the project. Services like GitHub, Bitbucket, and SourceForge make it easy to publish results.

DO: Keep Track of Your Software Environment

If you work on a complex project involving many tools / datasets, the software and computing environment can be critical for reproducing your analysis.

Computer Architecture: CPU (Intel, AMD, ARM), CPU Architecture, GPUs

Operating System: Windows, Mac OS, Linux / Unix

Software Toolchain: Compilers, interpreters, command shell, programming language (C, Perl, Python, etc.), database backends, data analysis software

Supporting software / infrastructure: Libraries, R packages, dependencies

External dependencies: Websites, data repositories, remote databases, software repositories

Version Numbers: Ideally, for everything (if available)

In R, the sessionInfo() function reports a range of information about the software environment, including the R version, the platform, and the attached packages.
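A minimal sketch of recording the environment alongside an analysis, assuming base R's `sessionInfo()` is the kind of function meant here. The output file name is a placeholder, and a temporary directory is used so the example does not touch a real project.

```r
# Capture the current R session details (R version, platform, packages)
info <- sessionInfo()
print(info)

# Individual fields can be pulled out for a report
r_version <- info$R.version$version.string
platform <- info$platform

# Keep a plain-text copy next to the analysis (temporary location used here)
env_file <- file.path(tempdir(), "session-info.txt")
writeLines(capture.output(print(info)), env_file)
```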


DON’T: Save Output

Avoid saving data analysis output (tables, figures, summaries, processed data, etc.), except perhaps temporarily for efficiency purposes.

If a stray output file cannot be easily connected with the means by which it was created, then it is not reproducible

Save the data + code that generated the output, rather than the output itself.

Intermediate files are okay as long as there is clear documentation of how they were created.
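One way to keep an intermediate file connected to its origin is to store a short provenance note with the data itself. The sketch below is only an illustration: the script name `clean-data.R` and the `provenance` attribute are hypothetical, and the built-in `mtcars` data stands in for a real raw dataset.

```r
# Build a small processed table from the raw data
processed <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Attach a note describing how the table was created (hypothetical fields)
attr(processed, "provenance") <- list(
  script  = "clean-data.R",        # hypothetical script name
  source  = "datasets::mtcars",
  created = Sys.time()
)

# Save and restore the intermediate file; attributes travel with the object
path <- tempfile(fileext = ".rds")
saveRDS(processed, path)
restored <- readRDS(path)

attr(restored, "provenance")$script
```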

DO: Set Your Seed

Random number generators generate pseudo-random numbers based on an initial seed (usually a number or set of numbers)

In R, you can use the set.seed() function to set the seed and to specify the random number generator to use

Setting the seed allows for the stream of random numbers to be exactly reproducible

Whenever you generate random numbers for a non-trivial purpose, always set the seed.
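A short sketch of what setting the seed buys you. The seed value 42 is arbitrary, and the variable names are made up for this example.

```r
# Same seed, same stream of pseudo-random numbers
set.seed(42)
first_draw <- rnorm(5)

set.seed(42)
second_draw <- rnorm(5)

identical(first_draw, second_draw)  # TRUE

# The generator can also be named explicitly via the 'kind' argument
set.seed(42, kind = "Mersenne-Twister")
third_draw <- runif(3)
```

Without the second call to `set.seed()`, `second_draw` would continue the stream and differ from `first_draw`.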

DO: Think About the Entire Pipeline

Summary: Checklist

Replication and Reproducibility



The Result?

What Problem Does Reproducibility Solve?

What we get:

What we do NOT get

An analysis can be reproducible and still be wrong

We want to know ‘can we trust this analysis?’

Does requiring reproducibility deter bad analysis?

Problems with Reproducibility

The premise of reproducible research is that with data/code available, people can check each other and the whole system is self-correcting

Who Reproduces Research?

The Story So Far

Evidence-based Data Analysis

Case Study: Estimating Acute Effects of Ambient Air Pollution Exposure

DSM Modules for Time Series Studies of Air Pollution and Health

  1. Check for outliers, high leverage, overdispersion
  2. Fill in missing data? No!
  3. Model selection: Estimate degrees of freedom to adjust for unmeasured confounders
    • Other aspects of model not as critical
  4. Multiple lag analysis
  5. Sensitivity analysis with respect to
    • Unmeasured confounder adjustment
    • Influential points

Where to Go From Here?

A Curated Library of Data Analysis