  • Distribution of Variants in Genes in RNA-seq

    Hello,

    I've been doing variant calling in RNA-seq data and have noticed a somewhat troubling trend in where the called variants fall along genes. For each variant, I compute the "fraction" of the gene at which it lies, where 0 is the transcription start site (TSS) and 1 is the transcription end site (both taken from knownGene.txt from UCSC). When we plot the distribution of these gene fractions (combining data from 36 samples), we get this:



    I was expecting a relatively uniform distribution, so I decided to investigate further. My current thought is that the 5' and 3' UTRs have a higher mutation rate, which would give the ends of the gene more called variants than the middle. In general, 3' UTRs are longer than 5' UTRs (and are somewhat less involved in regulation, possibly making mutations more common), which is how I'm trying to explain the larger number of variants at the end of the gene.
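The gene-fraction calculation described at the start of this post could be sketched roughly as follows. This is a minimal sketch, not the poster's actual code: the function name `gene_fraction` is hypothetical, coordinates are assumed to be 0-based and half-open as in UCSC tables, and the fraction is taken along the genomic span (introns included), which the post does not specify.

```python
def gene_fraction(pos, tx_start, tx_end, strand):
    """Return where a variant falls along a gene: 0.0 = TSS, 1.0 = transcription end.

    Assumes a 0-based variant position and a 0-based, half-open
    [tx_start, tx_end) interval (an assumption about the input format).
    """
    length = tx_end - tx_start
    if length < 2:
        raise ValueError("gene must span at least 2 bases")
    if not tx_start <= pos < tx_end:
        raise ValueError("variant lies outside the gene")
    # Fraction along the genomic interval, measured from the left end.
    frac = (pos - tx_start) / (length - 1)
    # On the minus strand the TSS is at the right end, so flip the fraction.
    return frac if strand == "+" else 1.0 - frac


# Hypothetical gene spanning [100, 201) with a variant at position 150.
print(gene_fraction(150, 100, 201, "+"))
print(gene_fraction(200, 100, 201, "-"))
```

Flipping the fraction for minus-strand genes matters here: without it, 5' and 3' ends would be mixed together in the combined histogram.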

    To test this, I divided each gene into 5' UTR, coding region, and 3' UTR (using the UTR lengths from foldUTR3/5 from UCSC) and then plotted the distribution of variants within the coding region only. The peaks at the edges are smaller, but they are still quite prominent:



    Additionally, I calculated (number of variants)/(total nucleotides) for each of the three regions, getting:

    5' UTR:        0.0003928025
    Coding Region: 0.00008306061
    3' UTR:        0.001019351

    This makes sense, since the coding region is more conserved than the UTRs.
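To put the three per-nucleotide rates above on a common scale, they can be expressed relative to the coding region. The short sketch below only re-does arithmetic on the numbers quoted in the post; the variable names are illustrative.

```python
# Variants per nucleotide, as reported in the post.
rates = {
    "5' UTR": 0.0003928025,
    "Coding Region": 0.00008306061,
    "3' UTR": 0.001019351,
}

# Express each rate as a multiple of the coding-region rate.
baseline = rates["Coding Region"]
fold = {region: rate / baseline for region, rate in rates.items()}

for region, value in fold.items():
    print(f"{region}: {value:.1f}x the coding-region rate")
```

By these numbers the 5' UTR rate is roughly 5 times, and the 3' UTR rate roughly 12 times, the coding-region rate.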

    However, I'm unsure why there is still a strong bias toward variants near the end of coding regions. I suspect the UTR annotations in UCSC are not always completely accurate, so that some of the "coding regions" actually include portions of 3' UTRs, which have higher mutation rates and could therefore explain the trend in the data.

    Does anyone have experience with how trustworthy the UTR annotations in UCSC are (or have a better source for them)? Alternatively, has anyone seen trends like this before?

    Thanks in advance.

  • #2
    I suggest you plot the coverage across the gene length. I have a tool that can do that, if you don't already have one:

    pileup.sh in=mapped.sam normcovo=histogram.txt normc=t normb=50

    ...if mapped.sam contains the reads mapped to the transcriptome (not the genome).

    Typically, coverage across a gene is highly variable and biased toward one end, and coverage greatly affects the accuracy of variant calling.

    • #3
      I had a similar thought, and looked for a correlation between the depth of coverage of a variant and its position in the gene. I didn't find any obvious trend there:

      • #4
        I'm not sure that graph tells you what you need to know. It indicates that called variants have a similar coverage distribution regardless of their position. But suppose you had 1000 genes with coverage only over the last 100bp and 10 genes with coverage across the entire gene, and coverage was variable in all 1010 genes. You could end up with a plot like the one you just showed, where there is no obvious correlation between coverage and variant rates, but there is an obvious correlation between position and variant rates. I still recommend you plot coverage versus gene position.
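The scenario described above can be worked through with a small deterministic example. This is only a sketch under stated assumptions: the 1000 and 10 genes and the 100bp of coverage come from the post, while the 1000bp gene length, the 10 position bins, and the uniform rate of 0.001 variants per covered base are hypothetical values chosen for illustration.

```python
GENE_LEN = 1000   # hypothetical gene length in bp
N_BINS = 10       # hypothetical number of position bins along the gene
RATE = 0.001      # hypothetical, position-independent variants per covered base


def covered_bases_per_bin(cov_start, cov_end, gene_len=GENE_LEN, n_bins=N_BINS):
    """Count covered bases in each position bin for coverage on [cov_start, cov_end)."""
    bins = [0] * n_bins
    for pos in range(cov_start, cov_end):
        bins[pos * n_bins // gene_len] += 1
    return bins


# 1000 genes covered only over the last 100bp, 10 genes covered end to end.
genes = [(GENE_LEN - 100, GENE_LEN)] * 1000 + [(0, GENE_LEN)] * 10

covered = [0] * N_BINS
for cov_start, cov_end in genes:
    for i, n in enumerate(covered_bases_per_bin(cov_start, cov_end)):
        covered[i] += n

# Expected raw variant counts per bin if variants are uniform per covered base.
expected_variants = [RATE * n for n in covered]
# Normalizing by covered bases recovers the underlying (flat) rate.
density = [v / n for v, n in zip(expected_variants, covered)]

print("covered bases per bin:", covered)
print("expected variants per bin:", expected_variants)
print("variants per covered base:", density)
```

Even with a perfectly uniform underlying rate, the raw variant counts pile up in the last bin simply because that is where almost all of the covered bases are. Dividing by covered bases per bin, rather than plotting raw counts, removes that effect.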
