  • Distribution of Variants in Genes in RNA-seq

    Hello,

    I've been doing variant calling in RNA-seq data and have noticed somewhat troubling trends when I look at where the variants I've called are distributed along genes. For each variant called, I compute what "fraction" of the gene the variant is in, where 0 is the Transcription Start Site (TSS) and 1 is the Transcription End Site (these are according to knownGene.txt from UCSC). When we plot the distribution of these gene fractions (combining data from 36 samples), we get this:



    I was expecting a relatively uniform distribution, so I decided to investigate further. My current thought is that the mutation rate is higher in the 5' and 3' UTRs, which would give the ends of the gene more called variants than the middle. In general, 3' UTRs are longer than 5' UTRs (and are somewhat less involved in regulation, possibly making mutations more common), which is how I'm trying to explain the larger number of variants at the end of the gene.
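For anyone who wants to reproduce the gene-fraction calculation described above, here is a minimal sketch. It assumes 0-based, half-open genomic coordinates (as in the UCSC database tables) and measures the fraction over the full genomic span, introns included; the post does not say whether introns were counted, so that part is an assumption. The function name `gene_fraction` and its parameters are hypothetical.

```python
def gene_fraction(variant_pos, tx_start, tx_end, strand):
    """Position of a variant as a fraction of its gene.

    0 corresponds to the transcription start site and values approach 1
    at the transcription end site. Coordinates are assumed to be 0-based
    and half-open: tx_start <= variant_pos < tx_end.
    """
    length = tx_end - tx_start
    if length <= 0 or not (tx_start <= variant_pos < tx_end):
        raise ValueError("variant lies outside the gene")
    offset = variant_pos - tx_start
    if strand == "-":
        # On the minus strand the TSS sits at tx_end, so flip the offset.
        offset = (length - 1) - offset
    return offset / length
```

Forgetting the strand flip would mirror every minus-strand gene, which tends to blur a 5' or 3' peak rather than create one, but it is worth checking before interpreting the histogram.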

    To test this, I divided up the gene into 5'UTR, coding region, and 3'UTR (using the lengths of UTRs from foldUTR3/5 from UCSC) and then again plotted the distribution of variants in the coding region. We see a decrease in magnitude from the peaks on the edges, but they are still quite prominent:



    Additionally, I calculated (number of variants)/(total nucleotides) for each of the three regions, getting:

    5' UTR: 0.0003928025
    Coding Region: 0.00008306061
    3' UTR: 0.001019351

    This makes sense, since the coding region is more conserved than the UTRs.
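To put the three densities on a common scale, they can be expressed relative to the coding region. This is only arithmetic on the numbers quoted above; the dictionary keys and the function name are hypothetical.

```python
# Variants per nucleotide for each region, as reported in the post.
rates = {
    "5' UTR": 0.0003928025,
    "coding": 0.00008306061,
    "3' UTR": 0.001019351,
}


def fold_over_coding(region_rates):
    """Return each region's variant density divided by the coding density."""
    coding = region_rates["coding"]
    return {region: rate / coding for region, rate in region_rates.items()}


for region, fold in fold_over_coding(rates).items():
    print(f"{region}: {fold:.1f}x the coding-region density")
```

With these numbers the 5' UTR comes out at roughly 4.7 times the coding density and the 3' UTR at roughly 12.3 times.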

    However, I'm unsure why there is still such a strong bias toward variants near the end of coding regions. I suspect that the UTR annotations in UCSC are not always accurate, so some of the annotated "coding regions" actually include portions of 3' UTRs, which have higher variant rates and would explain the trend in the data.

    Does anyone have experience with how trustworthy the UTR annotations in UCSC are (or have a better source for them)? Alternatively, has anyone seen trends like this before?

    Thanks in advance.

  • #2
    I suggest you plot the coverage across the gene length. I have a tool that can do that, if you don't already have one:

    pileup.sh in=mapped.sam normcovo=histogram.txt normc=t normb=50

    ...if mapped.sam contains the reads mapped to the transcriptome (not the genome).

    Typically, coverage is highly variable across a gene and biased toward one end, and coverage greatly affects the accuracy of variant calling.
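As an illustration of the general idea of a normalized coverage profile (not a description of what pileup.sh does internally), the sketch below splits each transcript's per-base depth into a fixed number of position bins, scales each transcript so its profile averages 1, and then averages across transcripts. It assumes depths are already in 5' to 3' transcript order, which holds when reads are mapped to the transcriptome; the function names and the default of 50 bins are hypothetical.

```python
def binned_profile(depths, n_bins=50):
    """Mean depth per position bin for one transcript, scaled to average 1.

    Returns None when the transcript has no coverage at all.
    """
    n = len(depths)
    if n < n_bins:
        raise ValueError("transcript is shorter than the number of bins")
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i, depth in enumerate(depths):
        b = i * n_bins // n  # bin index in 0 .. n_bins - 1
        sums[b] += depth
        counts[b] += 1
    means = [s / c for s, c in zip(sums, counts)]
    overall = sum(means) / n_bins
    if overall == 0:
        return None
    return [m / overall for m in means]


def average_profile(per_transcript_depths, n_bins=50):
    """Average the scaled profiles, skipping short or uncovered transcripts."""
    profiles = []
    for depths in per_transcript_depths:
        if len(depths) < n_bins:
            continue
        profile = binned_profile(depths, n_bins)
        if profile is not None:
            profiles.append(profile)
    if not profiles:
        raise ValueError("no usable transcripts")
    return [sum(column) / len(profiles) for column in zip(*profiles)]
```

Scaling each transcript before averaging keeps a handful of very highly expressed genes from dominating the shape of the curve.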



    • #3
      I had a similar thought, and looked for a correlation between the depth of coverage of a variant and its position in the gene. I didn't find any obvious trend there:



      • #4
        I'm not sure that graph tells you what you need to know. It indicates that called variants have a similar coverage distribution regardless of their position. But if you had 1000 genes with coverage only over the last 100bp, and 10 genes with coverage across the entire gene, and the coverage was variable for all 1010 genes, you could end up with a plot like the one you just showed - where there is no obvious correlation between coverage and variant rates, but there is an obvious correlation between position and variant rates. I still recommend you plot the coverage versus gene position.
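The scenario above can be made concrete with a small sketch that counts, per position bin, both the called variants and the number of bases with enough depth to call anything, then divides one by the other. The toy data mirror the example: 1000 genes covered only over their last 100bp and 10 genes covered end to end. The function name, the `min_depth` threshold of 8, the gene length of 1000, and the toy variant positions are all assumptions made for illustration.

```python
def variant_rate_by_bin(genes, n_bins=10, min_depth=8):
    """Per-bin variant counts, callable bases, and variants per callable base.

    genes: iterable of (depths, variant_positions), where depths is the
    per-base depth along the transcript (5' to 3') and variant_positions
    are 0-based positions of called variants. min_depth is a hypothetical
    threshold below which a base is treated as uncallable.
    """
    variants = [0] * n_bins
    callable_bases = [0] * n_bins
    for depths, variant_positions in genes:
        n = len(depths)
        for i, depth in enumerate(depths):
            if depth >= min_depth:
                callable_bases[i * n_bins // n] += 1
        for pos in variant_positions:
            variants[pos * n_bins // n] += 1
    rates = [v / c if c else None for v, c in zip(variants, callable_bases)]
    return variants, callable_bases, rates


# Toy data: 1000 genes covered only over the last 100bp, one variant each.
end_only = [([0] * 900 + [20] * 100, [950]) for _ in range(1000)]
# 10 genes covered across the whole length, one variant per 100bp.
full = [([20] * 1000, list(range(50, 1000, 100))) for _ in range(10)]

variants, callable_bases, rates = variant_rate_by_bin(end_only + full)
print(variants)  # raw counts pile up in the last bin
print(rates)     # rate per callable base is the same in every bin
```

In this toy case the raw histogram shows about a hundred times more variants in the last bin, while the rate per callable base is flat at 0.01, so the peak comes entirely from where the coverage is.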
