Dumpster diving in RNA-seq data – DIY.transcriptomics

Image credit: Rutledge and Ariani, 2017. Nature Reviews Microbiology'.

If you’re new to R

Try applying your new found Tidyverse and ggplot skills to this Learn R! challenge

Description

After spending nearly two years working on a large RNA-seq study, you and your advisor have written up the results and submitted your work to a top journal. Three reviewers provided comments on the paper. Reviewer #1 and #3 asked for only minor changes, but reviewer #2 offered a more, ahem, detailed critique. In particular, they raised concerns about the quality of your RNA-seq data (which forms the entire basis of the paper), and noted that it was particularly concerning that only 50-60% of the reads aligned to the human reference transcriptome. ‘If these reads aren’t human, then what are they?’ said the reviewer. To address this critique, you’ll have to employ multiple strategies for identifying the origin of reads in high-throughput data.

To participate in this lab, you will need to have successfully completed the steps outlined in Lecture 1 - Setting up your software environment

Files you need to download for this lab

To help you get started, I already used KneadData to filter out any reads that mapped to the human reference genome. Using a single sample from the course dataset (CL13) as an example, KneadData reduced the number of total reads from 133M to ~9M! Why do you think KneadData is saying that 93% of the reads are of human origin, but Kallisto only mapped ~ 57% to the human reference?

Curious about the origin of the remaining ~9M reads left after running KneadData, you press forward with your analysis. Download the ‘dehosted’ fastq file for CL13 here.

Task 1 - apply Sourmash to our course data

Since you have no idea where these unmapped reads are coming from, you decide that a broad and unbiased approach is your best bet to identify potential non-human reads in your sample. One strategy is to use Sourmash to create a minHash ‘sketch’ of your fastq file and compare this sketch against a reference set of sketches. To begin this task, you will need to download a set of about 90,000 microbial genomes here (don’t worry, it’s a small file). Here’s some example code to get you started. Be sure you run this in the appropriate Conda environment.

# time = ~2min to sketch ~9M reads
sourmash sketch dna -p scaled=10000,k=31,abund SRR8668774_dehosted.fastq.gz --name-from-first

# time = ~2min
sourmash gather -k 31 SRR8668774_dehosted.fastq.gz.sig genbank-k31.lca.json.gz
# once this is done, try rerunning with an additional argument to relax the threshold used for classification: '--threshold-bp 100'

To understand the ouput from Sourmash gather, check out the documentation here

Task 2 - evaluate Sourmash performance on mock community

You present the Sourmash results above to your advisor and, after some discussion and head-scratching, you conclude that without a positive control, it’s hard to know what to think of the results! A collaborator suggests evaluating these software tools using a synthetic or ‘mock’ microbial community. After combing through the literature, you identify a recent paper that shares raw data generated from sequencing the ZymoBiomics Mock Community standard. Perfect! Now all you need to do is download one of their fastq files here and run both Sourmash on the data. How do your results compare with what is reported in Zymo’s product document?

Task 3 - find SARS-CoV-2 in wastewater RNA-seq data

A few months later, after your paper is accepted, a colleague comes to you because he is working on identifying SARS-CoV-2 in wastewater collected during the peak of the COVID19 pandemic from municipal sewer systems in Northern California (see this paper). RNA purified from these samples was intially used for virus-specific QPCR, but he also decided to carry out untargeted RNA-seq on the same samples to get a ‘metatranscriptome’ of wastewater. After hearing about your ability to identify pathogens lurking in RNA-seq data, he asks if you could take a look at one of his RNA-seq samples to see if this approach is sensitive enough to detect SARS-CoV-2. You can download one of his raw files here.

To complete this task we have to use a different reference database. You can see the list of available premade Sourmash databases here. It’s worth taking a look a close look at this page outside of class. Note that there are huge collections of genomes that can be downloaded and searched within a few minutes on a laptop (e.g. over 1 million bacterial genomes from genbank). For the purposes of this task, download the genbank-2022.03-viral-k31.zip database that contains over 47,000 viral genomes. Do not unzip this file.

Note: if you do not get any hits, try modifying the --threshold-bp parameter from the default 50bp to 25 or 100.

On your own

If you’re working through this lab on your own, you should be able to complete the three parts above. If you’re an in-person learner and were unable to attend this lab, you should turn in your Sourmash results for three tasks above to the TAs before the start of class next week to get credit.

Extended learning

Note: this task will be more computationally demanding than the task above, but is still feasible on many laptop configurations

Encouraged by the Sourmash results, you try a complementary approach that uses Centrifuger. To get started, you’ll need to download a reference database that contains the Refseq human, bacteria, archea, virus sequences, plus SARS-CoV2-variants. Go here and click ‘Download all’. This will download a zipped file that is ~45Gb, so it may take several hours to complete. Double-click this file once downloaded to unzip and produce a folder that contains three files. Now you can use this database to search for signatures in your fastq file. How do these results compare with what you saw with Sourmash in Task 1 above? What’s missing from this report and why?

Tip: If you haven’t done so already, unzip the file you downloaded above to get a folder (containing three files with the same prefix). To use this database, you’ll need to either have all 3 files in your working directory, or specify the path to the folder where they are located. Use the prefix to refer to this database.

# time = less than 1min (on 16 cores)
# being by classifying all sequences in the fastq file, and store the results in a new file called 'classification.tsv'.
centrifuger -x cfr_hpv+gbsarscov2 -u SRR8668774_dehosted.fastq.gz -t 32 > classification.tsv

# time = few seconds
# use the classification.tsv file you produced above to create a taxonomic report
centrifuger-quant -x cfr_hpv+gbsarscov2 -c classification.tsv > report.tsv

Tip: once Centrifuger is done running, open the ‘report.tsv’ ouput file in Excel and sort based on the ‘abundance’ column

Luckily, you notice that the Centrifguer team incorporated over SARS-CoV-2 sequences into the same reference database. Rerun Centrifuger, but this time on the wastewater sample from Task 3 above. How do these Centrifuger results compare to what Sourmash detected in the sample? What proportion (roughly) of the sample appears to be SARS-CoV-2?

Takeaways

Software tools originally developed for metagenomics can be really useful for identifying the origin of even low abundance reads present in RNA-seq.
RNA-seq of tissues, blood, and environmental samples should really be thought of as metatranscriptomes that can be mined for additional information beyond just read mapping to a single species.
Sourmash and Centrifuger are just two of the many tools out there for this type of analysis, and they happen to run really well on a laptop.
Additional reference databases are available for both pieces of software, some of which are much larger than the ones we used today, and you can also create your own custom databases.
Since both tools use a reference database, you have to consider how the choice of database impacts your results.