Reproducing an analysis requires more than just code. You also need the original raw data, access to the appropriate programming languages, and application-specific packages (often specific versions of those packages). This is a major impediment to reproducibility, even for researchers with a background in bioinformatics. To address this challenge, you’ll learn how to ‘containerize’ your data, scripts and software, making it easy to share and rerun an entire analysis with the push of a button.
- Learn how to make your research analyses reproducible
- Create a reproducible package environment with renv
- Share your project via GitHub and git
- Understand how to streamline code using custom R functions
- Share your work as an R package
- Discuss the basics of Docker and containerized software
- Use Code Ocean to access the entire DIY course as a reproducible computing environment
What you need to do
- Sign up for an account on Code Ocean (be sure to use the same email address as your DataCamp login). You’ll get 15 compute hrs/month for free.
- Sign up for a free GitHub account (it doesn’t matter which email you use)
- Download this gitignore file - useful for updating your own .gitignore file in a project repo
- Download this script that walks through how to turn any analysis project into an R package. You may also want this text file as a simple starting point for data documentation, and this function file as an example of a custom function.
Part 1 - Reproducibility via the renv package
Part 2 - Connecting your project to GitHub
Part 3 - Code Ocean capsules for full reproducibility in publications
Part 4 - Keeping your code clean via custom functions
Part 5 - How to turn your analysis project into a standalone R package
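As a small illustration of the kind of custom function discussed in Part 4, here is a minimal sketch in base R. The function name `counts_to_cpm`, its arguments and the toy data are hypothetical and are not taken from the course scripts. It computes a plain counts-per-million scaling, which is simpler than what dedicated packages such as edgeR do (no library size normalization factors), so treat it as an example of wrapping repeated code in a function rather than as a recommended normalization method.

```r
# Hypothetical helper: convert a raw count matrix (genes in rows, samples in
# columns) to counts per million (CPM). Wrapping this logic in a function means
# it is written once and reused, instead of being copied between scripts.
counts_to_cpm <- function(counts, log = FALSE, prior_count = 1) {
  # Check inputs early so mistakes fail with a clear message
  stopifnot(is.matrix(counts), is.numeric(counts), all(counts >= 0))
  lib_sizes <- colSums(counts)
  if (any(lib_sizes == 0)) {
    stop("Every sample must have at least one count.")
  }
  # Divide each column by its library size, then scale to one million
  cpm <- sweep(counts, 2, lib_sizes, FUN = "/") * 1e6
  # Optionally return log2(CPM + prior_count) to avoid taking log2 of zero
  if (log) {
    cpm <- log2(cpm + prior_count)
  }
  cpm
}

# Toy data (made up for illustration): 3 genes x 2 samples
toy_counts <- matrix(
  c(10, 90, 0, 200, 300, 500),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)

toy_cpm <- counts_to_cpm(toy_counts)
toy_log_cpm <- counts_to_cpm(toy_counts, log = TRUE)
```

A function like this can live in its own file and be loaded with `source()`, which is also a natural first step toward collecting your functions into a package as in Part 5.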
We’ll use Code Ocean to interact with a dockerized container that packs all the code, data and software from the course into one reproducible and web-accessible environment. Simply log in (or set up a free account if you don’t already have one) and you’ll be able to re-run the entire course in a matter of minutes, without any software installation or data download. Your first run may take ~15 min, since the full computing environment must be built, but subsequent runs will be much faster. Note that this capsule includes raw fastq files, kallisto outputs, all of MSigDB for running GSEA, and the entire ARCHS4 database for interrogating ~700,000 publicly available mouse and human RNAseq datasets. Have fun adapting this capsule for your own analyses!
Happy Git and GitHub with RStudio - Jenny Bryan and team walk through every step of how to install git, connect to GitHub and access version control from within RStudio.
There’s a lot of reading material for how to get started making functions and packages. Beyond the extensive and very well written book on building R packages and excellent documentation for the usethis package, you may also want to check out some great blog posts on making R packages (here, here, here, and here).
Code Ocean whitepaper - describes the need for better tools for reproducible research and introduces their cloud-based computational platform for addressing this need.
Intro to Docker - Code Ocean is based on Docker, a free and open-source tool that allows you to ‘build, share and run applications anywhere’.
Code Ocean On-boarding document - Step-by-step details for how to set up your own capsule.
Our recent paper, showing a code capsule embedded directly in the journal webpage (a first for any AAAS journal).
Google Colaboratory - Write, edit and share Python code directly in your browser