Metatranscriptome read mapping yielding a highly left skewed ORF TPM distribution

brandonkieft

Junior Member

Join Date: Jan 2020

Posts: 1
#1

01-23-2020, 09:19 AM

I'm going to jump right into the problem and put details below for clarification:

What are the possible causes of the following: we map high-quality paired-end metatranscriptome reads back to high-quality assembled contigs, calculate TPM for predicted ORFs on the contigs, and then find that just the top few ORFs are assigned 50-60% of total TPM and recruit 30-40% of reads? Sometimes these "highly abundant" ORFs are real genes and sometimes pseudogenes, and when they're removed from the mapping files and the procedure is repeated, a new set of ORFs gets 50-60% of TPM and recruits many reads. Any ideas?

We conducted mRNA sequencing of a complex microbial community (metatranscriptomics) using Illumina HiSeq 150 bp reads on a few dozen time-series samples. The sequencing runs seemed to work very well, yielding ~10 million quality-filtered, merged FASTQ reads per sample. We assembled with MetaHit and got a reasonable N50 (1500) and number of contigs (150k). We predicted open reading frames (ORFs) on the contigs using Prodigal, annotated taxonomy with an in-house program, and annotated function (clusters of orthologous genes; COGs) with rpsBLAST against the NR protein database. Based on the number of ORFs annotated to different COGs (i.e., a sample's "functional distribution"), we were very happy with our results: the functional profile was consistent across samples and made sense biologically.

When we used BWA MEM to map reads back to the assemblies and Salmon to calculate TPM from the BAM/gff files, the distribution of ORF TPM was extremely biased. By this I mean that we had 150-200k ORFs predicted per sample (after length filtering to keep only predicted ORFs > 60 amino acids), but in many samples a single ORF or a few ORFs were getting 400-700k TPM, about half or more of the total TPM. I assume this should be more evenly distributed among the ORFs. When we combined the ORF TPM with the functional annotation and plotted function over time, we got nonsensical results. When we looked at the ORFs that recruited tons of reads and were assigned high TPM, they were sometimes bona fide genes with functions and high homology to database genes, and sometimes matched nothing and looked like pseudogenes.

As a test, we removed all ORFs that had >10% of total TPM from the mapping files and reran BWA. The idea was that if this were a real biological signal and the reads truly came from these ORFs, we would then get lower read mapping and a more even TPM distribution. In fact, new ORFs "took the place" of the high-TPM ones from the original analysis, and we got the same skewed number of reads mapped and the same skewed TPM distribution.

To convince ourselves this was not purely methodological, we did concurrent metagenome sequencing, assembly, read mapping, and TPM calculation using exactly the same procedures. Those results make sense and give the expected TPM distribution for both ORFs and aggregate functions (i.e., the top ORFs recruit ~0.5% of total reads and TPM, and aggregate TPM at the functional category level gives "correct" results).
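One way to put numbers on the skew described above is to compute TPM directly from per-ORF read counts and ORF lengths, then compare the share of reads and the share of TPM held by the top ORF. The sketch below is only an illustration: it uses the plain TPM formula (Salmon itself works with effective lengths and probabilistic read assignment), and the function names and toy counts are hypothetical, not taken from the data in this thread. It shows how a short ORF that recruits many reads can hold a larger share of TPM than of reads, because TPM divides by length.

```python
def tpm(counts, lengths_bp):
    """Plain TPM: length-normalise each count, then scale so the values sum to 1e6."""
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    rates = [c / l for c, l in zip(counts, lengths_bp)]  # reads per base
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [r / total * 1e6 for r in rates]


def top_share(values, k=1):
    """Fraction of the total held by the k largest values."""
    total = sum(values)
    if total == 0:
        return 0.0
    return sum(sorted(values, reverse=True)[:k]) / total


# Hypothetical toy sample: one short ORF recruits most of the reads.
counts = [9000, 50, 50, 900]
lengths_bp = [300, 1000, 1000, 1500]

values = tpm(counts, lengths_bp)
print("top-1 share of reads:", round(top_share(counts), 3))
print("top-1 share of TPM:  ", round(top_share(values), 3))
```

Running the same two functions on the metagenome and metatranscriptome count tables would give directly comparable top-k shares for both data sets.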
Tags: assembly, illumina, metatranscriptomics, read mapping, tpm
Cresil

Junior Member

Join Date: Jan 2020

Posts: 1
#2

01-25-2020, 10:52 AM

Did you filter against rRNA genes, and did you adapter-trim the reads? Leftover adapter sequence in particular may cause issues in the assembly.
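A quick sanity check for leftover adapters is to count how many reads still contain an adapter prefix. This is a minimal sketch, not a replacement for a real trimming tool: the default sequence is the commonly used Illumina TruSeq adapter prefix, which is an assumption here since the thread does not say which library kit was used, and the helper names are hypothetical.

```python
def fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def adapter_fraction(reads, adapter="AGATCGGAAGAGC"):
    """Fraction of reads containing the adapter prefix (exact match only).

    The default adapter is an assumption; substitute the sequence for your kit.
    """
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if adapter in read.upper())
    return hits / len(reads)


# Hypothetical two-record FASTQ, given as a list of lines.
toy_fastq = [
    "@read1", "ACGTACGTAGATCGGAAGAGCTTTT", "+", "IIIIIIIIIIIIIIIIIIIIIIIII",
    "@read2", "ACGTACGTACGTACGT", "+", "IIIIIIIIIIIIIIII",
]
print("fraction with adapter:", adapter_fraction(fastq_sequences(toy_fastq)))
```

For a real file, pass an open file handle to `fastq_sequences` instead of the toy list. A clearly non-zero fraction after trimming suggests the trimming step is worth revisiting.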