Dumpster diving in RNA-seq data – DIY.transcriptomics

Image credit: Rutledge and Ariani, 2017. Nature Reviews Microbiology'.

If you’re new to R

Try applying your new found Tidyverse and ggplot skills to this Learn R! challenge

Description

After spending nearly two years working on a large RNA-seq study, you and your advisor have written up the results and submitted your work to a top journal. Three reviewers provided comments on the paper. Reviewer #1 and #3 asked for only minor changes, but reviewer #2 offered a more, ahem, detailed critique. In particular, they raised concerns about the quality of your RNA-seq data (which forms the entire basis of the paper), and noted that it was particularly concerning that only 50-60% of the reads aligned to the human reference transcriptome. ‘If these reads aren’t human, then what are they?’ said the reviewer. To address this critique, you’ll have to employ multiple strategies for identifying the origin of reads in high-throughput data.

To participate in this lab, you will need to have successfully completed the steps outlined in Lecture 1 - Setting up your software environment

Task 1

To help you get started, I already used KneadData to filter out any reads that mapped to the human reference genome. Using a single sample from the course dataset (CL13) as an example, KneadData reduced the number of total reads from 133M to ~9M! Why do you think KneadData is saying that 93% of the reads are of human origin, but Kallisto only mapped ~ 57% to the human reference?

Curious about the origin of the remaining ~9M reads left after running KneadData, you press forward with your analysis. Download the ‘dehosted’ fastq file for CL13 here. Anytime you have a new dataset in-hand, it’s always good practice to check the quality of the data. Open up your Conda environment and run fastqc (or fastp) on this new fastq file. Does the fastqc/fastp report provide any clues as the origin of your mystery reads?

Task 2

Since you have no idea where these reads are coming from, you decide that a broad and unbiased approach is your best bet to identify potential non-human reads in your sample. One strategy is to use Sourmash to create a minHash ‘sketch’ of your fastq file and compare this sketch against a reference set of sketches. To begin this task, you will need to download a set of about 90,000 microbial genomes here (don’t worry, it’s a small file). Here’s some example code to get you started. Be sure you run this in the appropriate Conda environment.

# time = ~2min to sketch ~9M reads
sourmash sketch dna -p scaled=10000,k=31,abund SRR8668774_dehosted.fastq.gz --name-from-first

# time = ~2min
sourmash gather -k 31 SRR8668774_dehosted.fastq.gz.sig genbank-k31.lca.json.gz
# once this is done, try rerunning with an additional argument to relax the threshold used for classification: '--threshold-bp 100'

To understand the ouput from Sourmash gather, check out the documentation here

Task 3

Note: this task will be more computationally demanding than to the task above, but is still feasible on many laptop configurations

Encouraged by the Sourmash results, you try a complementary approach that uses Centrifuge. To get started, you’ll need to download a reference database that contains sequences from bacteria, human and viral genomes. Click here to download (this file is ~ 6Gb, so it may take a while). Double-click this file once downloaded to unzip. Now you can use this database to search for signatures in your fastq file. How do these results compare with what you saw with Sourmash in Task 2? Why do you think the outputs produced by these two tools differ?

Tip: If you haven’t done so already, unzip the file you downloaded above to get a folder.

# time = ~10min
centrifuge -x p_compressed+h+v/p_compressed+h+v -U SRR8668774_dehosted.fastq.gz --report-file CL13_dehosted_report.txt -S CL13_dehosted_results.txt

Tip: once centrifuge is done running, open the ‘CL13_dehosted_report.txt’ ouput file in Excel and sort based on ‘numUniqueReads’ column

On your own

If you’re working through this lab on your own, you should be able to complete the two parts below. If you’re an in-person learner and were unable to attend this lab, you should turn in your Sourmash results for parts below to the TAs before the start of class next week to get credit.

Part 1

You present the Sourmash and Centrifuge results above to your advisor and, after some discussion and head-scratching, you conclude that without a positive control, it’s hard to know what to think of the results! A collaborator suggests evaluating these software tools using a synthetic or ‘mock’ microbial community. After combing through the literature, you identify a recent paper that shares raw data generated from sequencing the ZymoBiomics Mock Community standard. Perfect! Now all you need to do is download one of their fastq files here and run both Sourmash and Centrifuge on the data. How do your results compare with what is reported in Zymo’s product document? If you’re pressed for time and/or are struggling with the compute resources needed to run Centrifuge, then just use Sourmash for this part.

Part 2

A few months later, after your paper is accepted, a colleague comes to you because he is working on identifying SARS-CoV-2 in wastewater collected during the peak of the COVID19 pandemic from municipal sewer systems in Northern California (see this paper). RNA purified from these samples was intially used for virus-specific QPCR, but he also decided to carry out untargeted RNA-seq on the same samples to get a ‘metatranscriptome’ of wastewater. After hearing about your ability to identify pathogens lurking in RNA-seq data, he asks if you could take a look at one of his RNA-seq samples to see if this approach is sensitive enough to detect SARS-CoV-2. You can download one of his raw files here.

Luckily, you discover that the Centrifgue team incorporated over 100 SARS-CoV-2 sequences into an updated reference database available here. Download this reference and run Centrifuge on this sample. How do these Centrifuge results compare to what Sourmash detects in the sample?

Takeaways

Software tools originally developed for metagenomics can be really useful for identifying the origin of even low abundance reads present in RNA-seq.
RNA-seq of tissues, blood, and environmental samples should really be thought of as metatranscriptomes that can be mined for additional information beyond just read mapping to a single species.
Sourmash and Centrifuge are just two of the many tools out there for this type of analysis, and they happen to run really well on a laptop.
Additional reference databases are available for both pieces of software, some of which are much larger than the ones we used today, and you can also create your own custom databases.
Since both tools use a reference database, you have to consider how the choice of database impacts your results.