SEQanswers

Old 12-10-2014, 09:51 AM   #1
hallpell
Junior Member
 
Location: Ann Arbor

Join Date: Dec 2014
Posts: 2
Distribution of Variants in Genes in RNA-seq

Hello,

I've been doing variant calling in RNA-seq data and have noticed somewhat troubling trends when I look at where the variants I've called are distributed along genes. For each variant called, I compute what "fraction" of the gene the variant is in, where 0 is the Transcription Start Site (TSS) and 1 is the Transcription End Site (these are according to knownGene.txt from UCSC). When we plot the distribution of these gene fractions (combining data from 36 samples), we get this:



I was expecting a relatively uniform distribution, so I decided to investigate further. My current thought is that there is a higher mutation rate in the 5' and 3' UTRs, which causes the ends of genes to have more called variants than the middle. In general, 3' UTRs are longer than 5' UTRs (and are somewhat less involved in regulation, possibly making mutations more common), which is how I'm trying to explain the larger number of variants at the end of the gene.
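For reference, the gene-fraction calculation described in the first paragraph can be sketched roughly as below. This is only an illustration, not the exact script used for the plots: the function name `gene_fraction` is made up, and it assumes knownGene-style coordinates (0-based, half-open, with txStart always the lower genomic coordinate, so the TSS of a minus-strand gene sits at txEnd). It also measures the fraction along the genomic span, introns included, which is an assumption.

```python
def gene_fraction(pos, tx_start, tx_end, strand):
    """Relative position of a variant in a gene: near 0 = TSS, near 1 = TES.

    Assumes 0-based, half-open coordinates with tx_start < tx_end on the
    genome regardless of strand (as in UCSC genePred-style tables).
    """
    length = tx_end - tx_start
    if length <= 0:
        raise ValueError("tx_end must be greater than tx_start")
    if not (tx_start <= pos < tx_end):
        raise ValueError("variant lies outside the transcript")

    # Use the midpoint of the base so both strands are treated symmetrically.
    frac = (pos - tx_start + 0.5) / length

    # On the minus strand the gene runs right-to-left, so flip the fraction.
    return frac if strand == "+" else 1.0 - frac
```

If the strand flip is skipped, minus-strand genes put their 3' end near 0 instead of near 1, which mixes the two ends together in the combined histogram.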

To test this, I divided up the gene into 5'UTR, coding region, and 3'UTR (using the lengths of UTRs from foldUTR3/5 from UCSC) and then again plotted the distribution of variants in the coding region. We see a decrease in magnitude from the peaks on the edges, but they are still quite prominent:



Additionally, I calculated (number of variants)/(total nucleotides) for each of the three regions, getting:

5' UTR: 0.0003928025
Coding Region: 0.00008306061
3' UTR: 0.001019351

This makes sense, since the coding region is more conserved than the UTRs.
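To make the comparison easier to read, the three rates above can be expressed relative to the coding region. A small sketch; the helper names `variant_rate` and `fold_vs_coding` are just illustrative, and only the three rate values come from the numbers posted above (the underlying variant and nucleotide counts are not shown here).

```python
def variant_rate(n_variants, n_bases):
    """Variants per nucleotide for one region."""
    if n_bases <= 0:
        raise ValueError("region must contain at least one base")
    return n_variants / n_bases


def fold_vs_coding(rates, reference="Coding Region"):
    """Express each region's rate as a multiple of the reference region's rate."""
    ref = rates[reference]
    return {region: rate / ref for region, rate in rates.items()}


# Rates as reported in this post.
rates = {
    "5' UTR": 0.0003928025,
    "Coding Region": 0.00008306061,
    "3' UTR": 0.001019351,
}

fold = fold_vs_coding(rates)
for region, value in fold.items():
    print(f"{region}: {value:.2f}x the coding-region rate")
```

With these numbers the 5' UTR comes out at roughly 4.7 times the coding-region rate and the 3' UTR at roughly 12.3 times.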

However, I'm unsure why variants are still strongly biased toward the end of coding regions. I'm thinking that the UTR annotations in UCSC are likely not always completely accurate, meaning that some of the "coding regions" actually contain portions of 3' UTRs, which have higher mutation rates and would thus explain the trend in the data.

Does anyone have experience with how trustworthy the UTR annotations in UCSC are (or have a better source for them)? Alternatively, has anyone seen trends like this before?

Thanks in advance.
Old 12-10-2014, 10:11 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

I suggest you plot the coverage across the gene length. Actually, I have a tool which can do that, if you don't already -

pileup.sh in=mapped.sam normcovo=histogram.txt normc=t normb=50

...if mapped.sam contains the reads mapped to the transcriptome (not the genome).

Typically, there is highly variable coverage across a gene, biased toward one end; and coverage greatly affects the accuracy of variant calling.
Old 12-11-2014, 11:59 AM   #3
hallpell

I had a similar thought, and looked for a correlation between the depth of coverage of a variant and its position in the gene. I didn't find any obvious trend there:

Old 12-11-2014, 12:09 PM   #4
Brian Bushnell

I'm not sure that graph tells you what you need to know. It indicates that called variants have a similar coverage distribution regardless of their position. But if you had 1000 genes with coverage only over the last 100bp, and 10 genes with coverage across the entire gene, and for all of the 1010 genes the coverage was variable, you could end up with a plot like the one you just showed, where there is no obvious correlation between coverage and variant rates, but there is an obvious correlation between position and variant rates. I still recommend you plot the coverage versus gene position.
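Here is a toy version of that scenario, as a rough sketch rather than a real analysis. The gene length of 1000bp, the 10 position bins, the flat per-base variant rate of 0.001, and all names are assumptions made up for the illustration; it also treats a base as either "covered well enough to call" or not, which is cruder than real coverage.

```python
N_BINS = 10
TRUE_RATE = 0.001  # assumed: identical variant rate at every covered base


def covered_bases_per_bin(genes, n_bins=N_BINS):
    """Sum, over all genes, the covered bases falling in each gene-fraction bin.

    Each gene is (length_bp, covered_from_bp): bases from covered_from_bp to
    the 3' end are assumed callable, everything upstream has no coverage.
    """
    totals = [0.0] * n_bins
    for length, covered_from in genes:
        bin_width = length / n_bins
        for b in range(n_bins):
            bin_start, bin_end = b * bin_width, (b + 1) * bin_width
            overlap = bin_end - max(bin_start, covered_from)
            totals[b] += max(0.0, overlap)
    return totals


# 1000 genes covered only over the last 100bp, 10 genes covered end to end.
genes = [(1000, 900)] * 1000 + [(1000, 0)] * 10

covered = covered_bases_per_bin(genes)

# Expected raw variant counts per bin: strongly skewed toward the 3' end.
expected_variants = [TRUE_RATE * c for c in covered]

# Dividing by covered bases per bin recovers the flat underlying rate.
normalized = [v / c if c else float("nan")
              for v, c in zip(expected_variants, covered)]

for b in range(N_BINS):
    print(b, covered[b], expected_variants[b], normalized[b])
```

The raw counts pile up in the last bin even though the per-base rate is flat, which is why the variant counts need to be compared against coverage as a function of gene position, not against the depth at each called variant.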

Tags
rna-seq, variant calling
