SEQanswers

Old 05-23-2011, 09:58 AM   #1
ForeignMan
Member
 
Location: Germany

Join Date: Jun 2010
Posts: 20
Read distribution at high sequence depth

Hello everyone,

Recently, I've been looking at two coverage plots from the same human material, sequenced twice at an average depth of 8x (run1) and 30x (run2). I noticed (only by eye) a marked difference in the read distribution: run2 shows high peaks and a somewhat messy picture, while the read distribution in run1 looks very "nice" and pretty flat. Is there a problem in the data, or is this the usual picture when you deal with data of very high sequencing depth (> 30x)? Is there maybe some kind of "exponential" gain in special genomic regions (GC-rich or GC-poor, repetitive regions etc.) that becomes more and more pronounced the higher the sequencing depth gets?

I'd be very interested in your opinions and experiences and would be very thankful for some ideas.

Cheers,
Christoph
Old 05-23-2011, 10:16 AM   #2
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by ForeignMan View Post
Recently, I've been looking at two coverage plots from the same human material, sequenced twice at an average depth of 8x (run1) and 30x (run2). I noticed (only by eye) a marked difference in the read distribution: run2 shows high peaks and a somewhat messy picture, while the read distribution in run1 looks very "nice" and pretty flat. Is there a problem in the data, or is this the usual picture when you deal with data of very high sequencing depth (> 30x)? Is there maybe some kind of "exponential" gain in special genomic regions (GC-rich or GC-poor, repetitive regions etc.) that becomes more and more pronounced the higher the sequencing depth gets?
Was it the same library used in each run or were two different libraries prepared?

Assuming the latter, at a guess, I'd say the PCR step during the second library prep has biased the result.
Old 05-23-2011, 10:23 AM   #3
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 699
Default

How about posting a picture of it? ("worth a thousand words")

What kind of *-seq is it? Whole genome? RNA-seq? ChIP-seq?

If your output is BAM, try the rmdup command on the BAM and take a look at the rmdup'ed output BAM file.
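
To make clear what duplicate removal does to a pile of reads, here is a minimal sketch in plain Python. It does not call any real tool; it only mimics the idea on toy tuples, collapsing reads that share chromosome, start position and strand. The tuple layout and the function name `collapse_duplicates` are assumptions made for this illustration; real duplicate removers also look at mate positions and base qualities.

```python
from collections import OrderedDict


def collapse_duplicates(reads):
    """Keep the first read seen for each (chrom, start, strand) key.

    `reads` is a list of (name, chrom, start, strand) tuples. This layout is
    hypothetical and only serves the illustration.
    """
    kept = OrderedDict()
    for name, chrom, start, strand in reads:
        key = (chrom, start, strand)
        if key not in kept:
            kept[key] = (name, chrom, start, strand)
    return list(kept.values())


toy_reads = [
    ("r1", "chr1", 1000, "+"),
    ("r2", "chr1", 1000, "+"),  # same position and strand as r1: treated as a duplicate
    ("r3", "chr1", 1000, "-"),  # opposite strand, so it is kept
    ("r4", "chr1", 1050, "+"),
]

deduped = collapse_duplicates(toy_reads)
print(len(toy_reads), "reads before,", len(deduped), "after")
```

If sharp peaks shrink a lot after this kind of collapsing, PCR duplicates are a plausible contributor; if they stay, the cause lies elsewhere.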
Old 05-24-2011, 01:42 AM   #4
ForeignMan
Member
 
Location: Germany

Join Date: Jun 2010
Posts: 20
Default

Thanks for your answers, and sorry that I left out so much information. I was hoping it might be a fairly general or even normal effect.
So, the whole genome has been sequenced twice (paired-end, 100 bp per read). For each run a new library was prepared. Additionally, the first run comes from Illumina's GA II, the second one from the new HiSeq 2000.
I have the alignment (done with BWA) in BAM format and removed duplicates with Picard.
I attached an example image of chromosome 1 (run1 is grey, run2 yellow; the y-axis runs from 0 to 40; coverage has been computed over 100,000 bp windows). It does not look that bad, but I was wondering whether these deviations, the ups and downs, can be explained by the different sequencing conditions alone (library, technology) or whether you have to expect this in data with high sequence depth. I'm also interested in doing a copy number analysis with this sequencing data, and was asking myself whether this is a common effect that can be reduced by normalization (by GC content, mappability etc.) or whether the data is really biased.
Thank you for your help and interest.
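
For reference, the windowed coverage described above can be sketched roughly as follows. This is only an illustration in plain Python, assuming 0-based read start positions are already extracted from the BAM; the function name `windowed_coverage` and the toy positions are made up, and each read is simply assigned to the window containing its start.

```python
def windowed_coverage(read_starts, read_length, chrom_length, window=100_000):
    """Approximate mean per-base coverage in fixed-size windows.

    Each read is counted in the window that contains its start position
    (0-based). The last window may be shorter than `window`.
    """
    n_windows = (chrom_length + window - 1) // window
    counts = [0] * n_windows
    for start in read_starts:
        counts[start // window] += 1

    coverage = []
    for i, count in enumerate(counts):
        width = min(window, chrom_length - i * window)
        coverage.append(count * read_length / width)
    return coverage


# Toy chromosome of 250 kb with a different read density in each window.
starts = (
    list(range(0, 100_000, 50))          # 2000 reads in window 0
    + list(range(100_000, 200_000, 25))  # 4000 reads in window 1
    + list(range(200_000, 250_000, 100)) # 500 reads in the short last window
)
cov = windowed_coverage(starts, read_length=100, chrom_length=250_000)
print(cov)
```

Plotting one such list per run on the same axis is essentially what the attached image shows.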


Last edited by ForeignMan; 05-25-2011 at 01:01 AM.
Old 05-24-2011, 08:51 AM   #5
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 699
Default

I can't see http://imiblinux05.uni-muenster.de/~...s_coverage.jpg

Error message is : 404

Not Found

The requested URL /~c_bart07/sc_circos_coverage.jpg was not found on this server.
Old 05-25-2011, 12:59 AM   #6
ForeignMan
Member
 
Location: Germany

Join Date: Jun 2010
Posts: 20
Default

Strange ... I can see the image here and it has a completely different URL.
Maybe this direct link works:
http://s2.postimage.org/tt5qlhll7/sc...s_coverage.jpg
Old 05-25-2011, 05:27 AM   #7
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 699
Default

Yes, it should be flatter (or "rounder", in your image). Good example: http://postimage.org/image/1ohvgyx6s/ You can see tumor copy number changes. My image is log scaled; on a linear scale it would look even flatter.

Note the anomaly of high coverage next to centromere on short arm (a frequent occurrence near repeated regions near the centromeres). The "low coverage" has it, the high doesn't. It should be flatter.

I do not know what is wrong and can only recommend some desperate measures: 1) don't remove duplicates and see what happens; 2) take a random sample of reads and check that they line up as reported (for instance, that you're not displaying hg18 alignments on an hg19 display).

Otherwise, it's not flat (or flatter) but should be.
Attached Images
File Type: jpg xx.jpg (42.8 KB, 28 views)
Old 05-25-2011, 08:39 AM   #8
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by ForeignMan View Post
Strange ... I can see the image here and it has a completely different URL.
Maybe this direct link works:
http://s2.postimage.org/tt5qlhll7/sc...s_coverage.jpg
Agreed - it's a bit odd. Then again, even the first one isn't exactly flat; it has a broadly similar profile, just lower.

Incidentally, how did you do the alignment? All reads against all chromosomes? I assume the material wasn't pre-separated per chromosome. And what did you do with ambiguous reads?
Old 05-25-2011, 08:57 AM   #9
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 699
Default

These are TCGA reads from various sources. You can view the bigwig (zoomable wiggle files) coverage tracks for various diseases at cgwb.nci.nih.gov. Check the various NG tracks. I'm sure the various TCGA research institutions align whole genome reads against all chromosomes (or at least chr1-22, X, Y, M; not sure about the "random" or "unattached" genomic chunks), with no chromosome separation. BWA is the weapon of mass alignment used in most TCGA samples (all TCGA BAMs? ... I'm not sure). BWA assigns ambiguous reads randomly, i.e. it just picks one of the alignments. SNP calling in ambiguous regions is hard.

I'm wondering if there's some sort of "accordion effect" going on in your circular view. Imagine taking an accordion and wrapping it around into an O shape: the inner edge is the same length as the starting flat length, but the outer edge is wavy and longer. There may be an exaggeration effect.

There is some vague resemblance between the high "mountain ranges" and GC content, I must admit.

Another desperate check: did you align all whole reads against one chromosome only? Probably not, I hope.
Old 05-25-2011, 09:28 AM   #10
ForeignMan
Member
 
Location: Germany

Join Date: Jun 2010
Posts: 20
Default

Thanks a lot for your comments! And for the link to cgwb.nci.nih.gov.

Your guess was right, Richard. I used BWA for the alignment and ambiguous reads were assigned randomly. And, of course, I aligned to the complete human genome, not only to chromosome 1. I chose this one for the example only to save some space, and since no copy number change is expected for this chromosome. The profile looks very similar over the complete genome.

I don't think that this "accordion effect" is very significant here, although I really like the image. Besides, apart from the radius, this effect applies equally to all coverage profiles. Of course, one has to be careful when analysing such dense plots, but I think it works for a quick comparison since all datasets were plotted under the same conditions. If the coverage profile had been good, you would definitely see it here.

But I agree with tony's remark about the similarity to the lower profile. That's also why I got the idea of deviations getting stronger with higher sequence depth (like, getting very naive now, having four times the deviation when having four times the depth), and this in correlation with specific regions of the genome. Although Richard's plots and the browser on cgwb.nci.nih.gov look very nice, and somehow the way I'd expected mine to look.

I did a copy number analysis on this "wavy" data (using FREEC) and the copy number profile looked quite OK, similar to the other, "good" one, although with a few more (but not that many) artificial gains and losses. The normalization seems to take effect. I was asking myself whether there is a common tool that performs only this kind of normalization on alignment data.
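
To illustrate what such a normalization could look like in its simplest form, here is a rough sketch in plain Python. It is not FREEC's method, just one simple GC correction: windows are grouped into GC bins, and each window's coverage is rescaled by the ratio of the overall median to the median of its bin. The function name `gc_normalize`, the bin count and the toy numbers are all assumptions for illustration.

```python
from statistics import median


def gc_normalize(gc_fractions, coverages, n_bins=10):
    """Rescale window coverage so that every GC bin has the same median.

    gc_fractions: GC content per window, as a fraction between 0 and 1.
    coverages:    mean coverage per window, same order and length.
    """
    if len(gc_fractions) != len(coverages):
        raise ValueError("gc_fractions and coverages must have equal length")

    # Assign each window to a GC bin; a GC of exactly 1.0 goes into the last bin.
    bins = [min(int(gc * n_bins), n_bins - 1) for gc in gc_fractions]

    per_bin = {}
    for b, cov in zip(bins, coverages):
        per_bin.setdefault(b, []).append(cov)
    bin_median = {b: median(values) for b, values in per_bin.items()}

    overall = median(coverages)
    return [
        cov * overall / bin_median[b] if bin_median[b] > 0 else 0.0
        for b, cov in zip(bins, coverages)
    ]


# Toy data: the two GC-rich windows have about twice the coverage.
gc = [0.35, 0.36, 0.55, 0.56]
cov = [20.0, 22.0, 40.0, 44.0]
normalized = gc_normalize(gc, cov)
print([round(x, 2) for x in normalized])
```

In the toy data the GC-driven doubling disappears after normalization, while the relative difference between windows within the same GC bin is preserved.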

Thanks again for all your help and ideas!

Last edited by ForeignMan; 05-25-2011 at 09:39 AM.
Old 05-26-2011, 04:23 AM   #11
tonybolger
Senior Member
 
Location: berlin

Join Date: Feb 2010
Posts: 156
Default

Quote:
Originally Posted by ForeignMan View Post
That's also why I got the idea of deviations getting stronger with higher sequence depth (like, getting very naive now, having four times the deviation when having four times the depth).
Not sure i understand you.

I would expect 4x the coverage to have very close to 4x the deviation from the mean coverage (so about the same coefficient of variation): over a 100K window, Poisson noise should be negligible, and every other source of bias should just scale up.
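
A quick back-of-the-envelope check of this in Python, as a sketch only: the window coverages are made-up toy numbers, and the helper names are hypothetical. The first function computes the coefficient of variation (standard deviation divided by mean); the second estimates the relative Poisson noise of the read count in one window, assuming 100 bp reads and 100 kb windows as in this thread.

```python
from math import sqrt
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation relative to the mean."""
    return pstdev(values) / mean(values)


def poisson_cv(depth, read_length=100, window=100_000):
    """Relative Poisson noise of the read count in one window.

    The expected number of reads per window is depth * window / read_length,
    and a Poisson count with mean n has relative noise 1 / sqrt(n).
    """
    expected_reads = depth * window / read_length
    return 1 / sqrt(expected_reads)


# Toy window coverages: the second run is the first one scaled by 4,
# i.e. every bias scales up with depth.
run1 = [7.0, 8.0, 9.0, 8.0]
run2 = [4 * x for x in run1]

cv1 = coefficient_of_variation(run1)
cv2 = coefficient_of_variation(run2)
print(f"CV run1 = {cv1:.4f}, CV run2 = {cv2:.4f}")

# Poisson noise at the two depths from this thread.
print(f"Poisson CV at 8x:  {poisson_cv(8):.4f}")
print(f"Poisson CV at 30x: {poisson_cv(30):.4f}")
```

With roughly 8,000 reads per window at 8x, the Poisson part is on the order of 1%, so visible waviness at this window size has to come from systematic bias rather than sampling noise.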
Old 05-26-2011, 04:50 AM   #12
ForeignMan
Member
 
Location: Germany

Join Date: Jun 2010
Posts: 20
Default

"Coefficient of variation" is exactly what I meant. Thanks tony! I was not aware of this measure, and it confirms (a bit) that the two runs are not so very different and that the deviations and biases scale up. Although it's still quite extreme and rather unusual in this case, it helps me understand the results. I know that the whole experiment is a bit biased, so I guess I should have expected this kind of image.
Tags
coverage, sequence alignment
