
  • Metatranscriptome read mapping yielding a highly left skewed ORF TPM distribution

    I'm going to jump right into the problem and put details below for clarification:

    What could cause the following? We map high-quality paired-end metatranscriptome reads back to high-quality assembled contigs, calculate TPM for the predicted ORFs on those contigs, and find that just the top few ORFs are assigned 50-60% of total TPM and recruit 30-40% of reads. Sometimes these "highly abundant" ORFs are real genes and sometimes pseudogenes, and when they're removed from the mapping files and the procedure is repeated, a new set of ORFs gets 50-60% of TPM and recruits many reads. Any ideas?

    We conducted mRNA sequencing of a complex microbial community (metatranscriptomics) using Illumina HiSeq 150 bp on a few dozen time-series samples. Sequencing reactions seemed to work very well, yielding ~10 million quality-filtered, merged fastq reads per sample. We assembled with MetaHit and got a reasonable N50 (1500) and number of contigs (150k). We predicted open reading frames (ORFs) on contigs using prodigal, annotated taxonomy with an in-house program, and annotated function (clusters of orthologous genes; COGs) with rpsBLAST to the NR protein database. Based on the number of ORFs annotated to different COGs (i.e., a sample's "functional distribution"), we were very happy with our results and got a consistent functional profile across samples that makes sense biologically.

    When we used BWA MEM to map reads back to the assemblies and Salmon to calculate TPM from the BAM/gff files, the distribution of ORF TPM was extremely biased. By this I mean that we had 150-200k ORFs predicted per sample (after length filtering to keep only predicted ORFs > 60 amino acids), but in many samples a single ORF or a few ORFs were getting 400-700k TPM, half or more of the total TPM. I assume this should be more evenly distributed among the ORFs. When we took the ORF TPM and functional annotation together and plotted function over time, we got nonsensical results. When we looked at the ORFs that recruited tons of reads and were assigned high TPM, some were bona fide genes with functions and high homology to database genes, and some matched nothing and looked like pseudogenes.

    As a test, we removed all ORFs that had >10% of total TPM from the mapping files and reran BWA. The idea was that if this were a real biological signal and the reads really did come from these ORFs, we would then get lower read mapping and a more even TPM distribution. In fact, new ORFs "took the place" of the high-TPM ones from the original analysis, and we got the same skewed number of mapped reads and TPM distribution.

    To convince ourselves this was not purely methodological, we did concurrent metagenome sequencing, assembly, read mapping, and TPM calculation using exactly the same procedures. Those results make sense and give the expected TPM distribution for both ORFs and aggregate functions (i.e., the top ORFs recruit ~0.5% of total reads and TPM, and aggregate TPM at the functional category level gives "correct" results).
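
    For anyone who wants to reproduce the symptom, here is a minimal sketch of the quantity being described: TPM computed from per-ORF read counts and ORF lengths, plus the fraction of total TPM held by the top few ORFs. It is not Salmon's actual model (Salmon uses effective lengths and probabilistic read assignment); the function names `tpm` and `top_share` and the toy numbers are assumptions made up for illustration.

```python
def tpm(counts, lengths):
    """Plain TPM: length-normalise the counts, then scale to one million.

    counts  -- reads assigned to each ORF
    lengths -- ORF lengths in bp (same order as counts)
    """
    rates = [c / l for c, l in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [r / total * 1e6 for r in rates]


def top_share(values, k=5):
    """Fraction of the summed values held by the k largest entries."""
    total = sum(values)
    if total == 0:
        return 0.0
    return sum(sorted(values, reverse=True)[:k]) / total


# Toy data (hypothetical): one short ORF recruiting most of the reads.
counts = [9000, 100, 100, 100]
lengths = [300, 1000, 1000, 1000]
values = tpm(counts, lengths)
print(f"top-1 share of TPM: {top_share(values, k=1):.3f}")
```

    Because TPM divides by length, a short ORF that soaks up many reads dominates even more strongly in TPM than in raw read share, so it may be worth comparing both numbers for the top ORFs.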

  • #2
    Did you filter against rRNA genes, and did you adapter-trim the reads? Adapters in particular can cause issues in the assembly.
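
    A quick way to sanity-check the adapter question is to count how many reads still contain the start of the adapter sequence. This is only a rough sketch: it assumes the common Illumina TruSeq adapter prefix `AGATCGGAAGAGC`, exact matching only, and a standard four-line FASTQ layout. The helper names `adapter_fraction` and `fastq_sequences` are made up for illustration, and this is no substitute for a proper trimming tool.

```python
import gzip

# Assumption: common Illumina TruSeq adapter prefix; adjust for your library kit.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def fastq_sequences(path):
    """Yield the sequence line of each record in a (optionally gzipped) FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the second line of every 4-line record is the sequence
                yield line.strip()


def adapter_fraction(sequences, adapter=ADAPTER_PREFIX):
    """Return the fraction of sequences containing an exact copy of the adapter."""
    total = 0
    hits = 0
    for seq in sequences:
        total += 1
        if adapter in seq.upper():
            hits += 1
    return hits / total if total else 0.0


# Usage on in-memory toy reads (hypothetical); for a real file use
# adapter_fraction(fastq_sequences("reads.fastq.gz")).
reads = ["ACGTAGATCGGAAGAGCTT", "ACGTACGTACGT"]
print(adapter_fraction(reads))
```

    If more than a small fraction of the quality-filtered reads still carry adapter sequence, trimming before assembly and mapping would be the first thing to redo.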
