  • Distribution of Variants in Genes in RNA-seq

    Hello,

    I've been doing variant calling in RNA-seq data and have noticed a somewhat troubling trend in where the variants I've called are distributed along genes. For each variant called, I compute the "fraction" of the gene at which the variant lies, where 0 is the Transcription Start Site (TSS) and 1 is the Transcription End Site (both taken from knownGene.txt from UCSC). When we plot the distribution of these gene fractions (combining data from 36 samples), we get this:



    I was expecting a relatively uniform distribution, so I decided to investigate further. My current thought is that there is a higher mutation rate in the 5' and 3' UTRs, which causes the ends to have more called variants than the middle. In general, 3' UTRs are longer than 5' UTRs (and are somewhat less involved in regulation, possibly making mutations more common), which is how I'm trying to explain the larger number of variants at the end of the gene.
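As a point of reference, the gene-fraction calculation described above might look like the sketch below. It is not the poster's actual code. It assumes UCSC-style coordinates (0-based `txStart`, exclusive `txEnd`) and that the fraction is measured in genomic coordinates, introns included. It also assumes the calculation is strand-aware: for a minus-strand gene the TSS sits at `txEnd`, so a fraction computed without flipping would run backwards. The function name and arguments are hypothetical.

```python
def gene_fraction(pos, tx_start, tx_end, strand):
    """Return where `pos` falls in a gene: 0 at the TSS, approaching 1 at the TES.

    Assumes 0-based, half-open coordinates (tx_start inclusive, tx_end exclusive),
    as in UCSC gene tables. `pos` is a 0-based genomic position.
    """
    length = tx_end - tx_start
    if length <= 0:
        raise ValueError("tx_end must be greater than tx_start")
    if not tx_start <= pos < tx_end:
        raise ValueError("variant lies outside the gene")
    offset = pos - tx_start
    if strand == "-":
        # On the minus strand, transcription starts at the high-coordinate end.
        offset = (length - 1) - offset
    return offset / length


# A variant at the first base of a plus-strand gene and one at the last
# base of a minus-strand gene are both at the TSS.
print(gene_fraction(100, 100, 200, "+"))
print(gene_fraction(199, 100, 200, "-"))
```

If strand is ignored, minus-strand genes contribute mirrored fractions, which would smear a true 3' bias across both ends of the histogram.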

    To test this, I divided each gene into 5' UTR, coding region, and 3' UTR (using the lengths of UTRs from foldUTR3/5 from UCSC) and then plotted the distribution of variants again, this time within the coding region only. The peaks at the edges shrink, but they are still quite prominent:



    Additionally, I calculated (number of variants)/(total nucleotides) for each of the three regions, getting:

    5' UTR:        0.0003928025
    Coding Region: 0.00008306061
    3' UTR:        0.001019351

    This makes sense, in that the coding region is more conserved than the UTRs.
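To put the three densities above on a common scale, a short sketch like the following expresses each region's rate as a fold-change over the coding region. The rates are the ones reported in this post; the helper names are hypothetical.

```python
def variant_density(n_variants, n_bases):
    """Variants per nucleotide for one region (number of variants / total nucleotides)."""
    if n_bases <= 0:
        raise ValueError("n_bases must be positive")
    return n_variants / n_bases


def fold_over(rates, baseline):
    """Express every rate as a multiple of the baseline region's rate."""
    base = rates[baseline]
    return {region: rate / base for region, rate in rates.items()}


# Per-region densities as reported in the post.
rates = {
    "5' UTR": 0.0003928025,
    "Coding Region": 0.00008306061,
    "3' UTR": 0.001019351,
}

folds = fold_over(rates, "Coding Region")
for region, fold in folds.items():
    print(f"{region}: {fold:.2f}x the coding-region rate")
```

With these numbers the 5' UTR comes out at roughly 4.7 times and the 3' UTR at roughly 12 times the coding-region density.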

    However, I'm unsure why variants are still strongly biased toward the end of coding regions. I suspect that the UTR annotations in UCSC are not always completely accurate, so some of the "coding regions" actually include portions of 3' UTRs, which have higher mutation rates and could explain the trend in the data.

    Does anyone have experience with how trustworthy the UTR annotations in UCSC are (or have a better source for them)? Alternatively, has anyone seen trends like this before?

    Thanks in advance.

  • #2
    I suggest you plot the coverage across the gene length. Actually, I have a tool which can do that, if you don't already have one:

    pileup.sh in=mapped.sam normcovo=histogram.txt normc=t normb=50

    ...if mapped.sam contains the reads mapped to the transcriptome (not the genome).

    Typically, coverage is highly variable across a gene and biased toward one end, and coverage greatly affects the accuracy of variant calling.
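For anyone who wants to see what a normalized coverage histogram involves, here is a minimal Python sketch of the general idea. It is not how pileup.sh is implemented. It assumes you already have per-base depth for each transcript as a list of numbers, ordered from the 5' end to the 3' end. The function names are hypothetical, and the default of 50 bins mirrors normb=50 on the assumption that the flag sets the number of bins.

```python
def normalized_coverage(depths, n_bins=50):
    """Average one transcript's per-base depth into n_bins equal-width bins."""
    n = len(depths)
    if n < n_bins:
        raise ValueError("transcript is shorter than the number of bins")
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i, depth in enumerate(depths):
        b = i * n_bins // n  # bin index in [0, n_bins)
        sums[b] += depth
        counts[b] += 1
    return [s / c for s, c in zip(sums, counts)]


def mean_profile(transcripts, n_bins=50):
    """Average the binned profiles of all transcripts long enough to bin."""
    profiles = [normalized_coverage(d, n_bins)
                for d in transcripts if len(d) >= n_bins]
    if not profiles:
        raise ValueError("no transcript is long enough to bin")
    return [sum(col) / len(profiles) for col in zip(*profiles)]


# Toy input: two short "transcripts", both covered more deeply at the 3' end.
toy = [[0, 0, 10, 10], [2, 2, 4, 4]]
print(mean_profile(toy, n_bins=2))
```

Plotting the returned list against bin index gives coverage versus gene position on the same 0-to-1 axis as the variant histogram, so the two can be compared directly.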



    • #3
      I had a similar thought, and looked for a correlation between the depth of coverage of a variant and its position in the gene. I didn't find any obvious trend there:



      • #4
        I'm not sure that graph tells you what you need to know. It indicates that called variants have a similar coverage distribution regardless of their position. But suppose you had 1000 genes with coverage only over the last 100 bp and 10 genes with coverage across the entire gene, and coverage was variable for all 1010 genes. You could end up with a plot like the one you just showed, where there is no obvious correlation between coverage and variant rates, but there is an obvious correlation between position and variant rates. I still recommend you plot the coverage versus gene position.
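The hypothetical scenario above can be checked with a small simulation. This is only a sketch under stated assumptions: every gene is 1000 bp long, variants are called at a flat rate of 1% per covered base regardless of position, and uncovered bases never yield a call. The 1000 and 10 gene counts and the 100 bp tail come from the example in this post. Everything else, including the function name, is made up for illustration.

```python
import random


def simulate(n_partial=1000, n_full=10, gene_len=1000, tail=100,
             rate=0.01, n_bins=10, seed=0):
    """Count called variants and covered bases per gene-fraction bin.

    n_partial genes are covered only over their last `tail` bases;
    n_full genes are covered end to end. The variant rate per covered
    base is the same everywhere.
    """
    rng = random.Random(seed)
    variants = [0] * n_bins
    covered = [0] * n_bins
    for gene in range(n_partial + n_full):
        start = gene_len - tail if gene < n_partial else 0
        for pos in range(start, gene_len):
            b = pos * n_bins // gene_len
            covered[b] += 1
            if rng.random() < rate:
                variants[b] += 1
    return variants, covered


variants, covered = simulate()
for b, (v, c) in enumerate(zip(variants, covered)):
    print(f"bin {b}: {v} variants over {c} covered bases")
```

Raw variant counts pile up in the last bin even though the per-base rate is flat, because that bin has about a hundred times more covered bases than any other. Dividing each bin's variant count by its covered bases is the normalization that a coverage-versus-position plot makes possible.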
