I'm going to jump right into the problem and put details below for clarification:
What are the possible causes of mapping high quality paired end metatransriptome reads back to high quality assembled contigs, calculating TPM for predicted ORFs on the contigs, then having just the top few ORFs getting assigned 50-60% of total TPM and recruiting 30-40% of reads? Sometimes these "highly abundant" ORFs were real genes and sometimes pseudogenes, and whey they're removed from the mapping files and the procedure is repeated, a new set of ORFs get 50-60% of TPM and recruit many reads. Any ideas??
We conducted mRNA sequencing of a complex microbial community (metatranscriptomics) using Illumina HiSeq 150 bp on a few dozen time-series samples. Sequencing reactions seemed to work very well, yielding ~10 million quality filtered merged fastq reads per sample. We assembled with MetaHit and got reasonable n50 (1500) and number of contigs (150k). We predicted open reading frames (ORFs) on contigs using prodigal and annotated taxonomy with an in-house program and function (clusters of orthologous genes; COGs) with rpsBLAST to the NR protein database. Based on the number of ORFs annotated to different COGs (i.e., a sample's "functional distribution") of the metatranscriptomes, we were very happy with our results and got a consistent functional profile across samples that make sense biologically. When we used BWA MEM to map reads back to the assemblies and Salmon to calculate TPM from the BAM/gff files, there was an extremely biased distribution of ORF TPM. By this I mean that we had 150-200k ORFs predicted per sample (after length filtering for only predicted ORFs > 60 amino acids), but in many samples a single or a few ORFs were getting 400-700k TPM, half or more of the total TPM - this should be more evenly distributed among the ORFs, I assume. When we took the ORF TPM and functional annotation together and plotted function over time, we got nonsensical results. When we looked at the ORFs that recruited tons of reads and got assigned high TPM, they're sometimes bona fide genes with functions and high homology to database genes, and sometimes nothing and look like pseudogenes. As a test, we removed all ORFs that had >10% of total TPM from the mapping files and reran BWA (to see if this was actually a biological signal and the reads did actually come from these ORFs and we'd get lower read mapping/more even TPM) - in fact, new ORFs "took the place" of the high TPM ones from the origingal analysis and we got the same skewed number of reads mapped and TPM distribution.
To convince ourselves this was not purely methodological, we did concurrent metagenome sequencing, assembly, read mapping, and TPM calculation using the same exact procedures and got good results that make sense and give expected TPM distribution of both ORFs and aggregate functions (i.e., the top ORFs recruit ~0.5% of total reads and TPM and aggregate TPM at the functional category level give "correct" results).
What are the possible causes of mapping high quality paired end metatransriptome reads back to high quality assembled contigs, calculating TPM for predicted ORFs on the contigs, then having just the top few ORFs getting assigned 50-60% of total TPM and recruiting 30-40% of reads? Sometimes these "highly abundant" ORFs were real genes and sometimes pseudogenes, and whey they're removed from the mapping files and the procedure is repeated, a new set of ORFs get 50-60% of TPM and recruit many reads. Any ideas??
We conducted mRNA sequencing of a complex microbial community (metatranscriptomics) using Illumina HiSeq 150 bp on a few dozen time-series samples. Sequencing reactions seemed to work very well, yielding ~10 million quality filtered merged fastq reads per sample. We assembled with MetaHit and got reasonable n50 (1500) and number of contigs (150k). We predicted open reading frames (ORFs) on contigs using prodigal and annotated taxonomy with an in-house program and function (clusters of orthologous genes; COGs) with rpsBLAST to the NR protein database. Based on the number of ORFs annotated to different COGs (i.e., a sample's "functional distribution") of the metatranscriptomes, we were very happy with our results and got a consistent functional profile across samples that make sense biologically. When we used BWA MEM to map reads back to the assemblies and Salmon to calculate TPM from the BAM/gff files, there was an extremely biased distribution of ORF TPM. By this I mean that we had 150-200k ORFs predicted per sample (after length filtering for only predicted ORFs > 60 amino acids), but in many samples a single or a few ORFs were getting 400-700k TPM, half or more of the total TPM - this should be more evenly distributed among the ORFs, I assume. When we took the ORF TPM and functional annotation together and plotted function over time, we got nonsensical results. When we looked at the ORFs that recruited tons of reads and got assigned high TPM, they're sometimes bona fide genes with functions and high homology to database genes, and sometimes nothing and look like pseudogenes. As a test, we removed all ORFs that had >10% of total TPM from the mapping files and reran BWA (to see if this was actually a biological signal and the reads did actually come from these ORFs and we'd get lower read mapping/more even TPM) - in fact, new ORFs "took the place" of the high TPM ones from the origingal analysis and we got the same skewed number of reads mapped and TPM distribution.
To convince ourselves this was not purely methodological, we did concurrent metagenome sequencing, assembly, read mapping, and TPM calculation using the same exact procedures and got good results that make sense and give expected TPM distribution of both ORFs and aggregate functions (i.e., the top ORFs recruit ~0.5% of total reads and TPM and aggregate TPM at the functional category level give "correct" results).
Comment