SEQanswers

Old 08-18-2014, 10:33 AM   #1
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Sawtooth base frequency, wavy insert size histograms.

I am analyzing some NextSeq data and see odd patterns in the insert size and base composition histograms that I can't explain. The library is of a bacterium (M. ruber) and was fragmented by sonication to a target insert size of 270bp. The run was 2x151bp.

The base composition graph concatenates read 1 and read 2, so positions 0-150 are read 1 and positions 151-301 are read 2. Each read has a sawtooth pattern for all bases, with a period of exactly 3bp.



There's obviously a major problem with base-calling as the A/T ratio is quite skewed, but putting that aside for now, has anyone seen the sawtooth pattern before? I saw it once on some MiSeq Nextera data also, and could not explain it then, either. A second run on the NextSeq (on a fungus) does NOT have the sawtooth pattern, but still has the distorted A/T ratio. Bacteria are mostly coding and the fungus is mostly noncoding, so I'm speculating that it could be a real artifact related to codon frequencies and nonrandom fragmentation sites rather than a software bug, but I'm not sure.
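For anyone who wants to check their own data for the same 3bp sawtooth, here is a minimal sketch in Python. It tallies per-position base fractions and then looks for the shortest lag with a strong autocorrelation. The function names, the lag range, and the 0.95 tolerance are illustrative assumptions, not what produced the plot above.

```python
from collections import Counter

def base_frequencies(reads):
    """Return one dict per read position with the fraction of A, C, G and T."""
    length = max(len(r) for r in reads)
    counts = [Counter() for _ in range(length)]
    for read in reads:
        for i, base in enumerate(read):
            counts[i][base] += 1
    freqs = []
    for c in counts:
        total = sum(c[b] for b in "ACGT")  # N calls are ignored
        freqs.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return freqs

def dominant_period(values, min_lag=2, max_lag=20, tolerance=0.95):
    """Return the smallest lag whose autocorrelation is close to the best one.

    Multiples of the true period score as well as the period itself, so the
    smallest near-best lag is reported. Returns None if no lag correlates
    positively. Assumes len(values) > max_lag.
    """
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]

    def autocorr(lag):
        pairs = n - lag
        return sum(centered[i] * centered[i + lag] for i in range(pairs)) / pairs

    scores = {lag: autocorr(lag) for lag in range(min_lag, max_lag + 1)}
    best = max(scores.values())
    if best <= 0:
        return None
    for lag in sorted(scores):
        if scores[lag] >= tolerance * best:
            return lag
    return None
```

To use it, one would pass something like `[f["A"] for f in base_frequencies(read1_sequences)]` to `dominant_period`, handling read 1 and read 2 separately so the boundary between them does not disturb the signal.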

Next, the insert size distribution also has a regular pattern, this one with a 10bp period.



This pattern exists when the insert size is calculated using two independent methods, by mapping and by overlap (overlap is of course restricted to under 300bp). So I am confident that it's actually in the data and not a software problem; and furthermore, it's present in genomic reads, or else it would not show up on the mapping histogram. Has anyone seen that before?
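To make the "by overlap" method concrete, here is a simplified sketch of how an insert size can be estimated from the overlap between read 1 and the reverse complement of read 2. It is not the algorithm used for the histogram above. It assumes the insert is at least as long as each read (no adapter read-through), and the names `insert_size_by_overlap`, `min_overlap`, and `max_mismatch_rate` are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def insert_size_by_overlap(read1, read2, min_overlap=12, max_mismatch_rate=0.05):
    """Estimate insert size from the overlap of read1's tail with rc(read2)'s head.

    Tries the longest overlap first and accepts the first one whose mismatch
    count is within max_mismatch_rate * overlap. Returns None if no overlap
    of at least min_overlap bases is acceptable.
    """
    mate = reverse_complement(read2)
    longest = min(len(read1), len(mate))
    for overlap in range(longest, min_overlap - 1, -1):
        tail = read1[len(read1) - overlap:]
        head = mate[:overlap]
        mismatches = sum(1 for a, b in zip(tail, head) if a != b)
        if mismatches <= max_mismatch_rate * overlap:
            # Overlapping bases are counted once in the merged fragment.
            return len(read1) + len(mate) - overlap
    return None
```

With 2x151bp reads and a minimum overlap of 12, this approach cannot report inserts longer than 290bp, which is why the overlap-based histogram is restricted to under 300bp while the mapping-based one is not.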
Attached Images
File Type: png NextSeq_Base_Frequency.png (21.4 KB, 50 views)
File Type: png Wavy_Insert.png (21.3 KB, 48 views)
Old 08-18-2014, 11:17 AM   #2
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,174

I wonder what the read duplication rate and the number of reads are.
Old 08-18-2014, 03:50 PM   #3
Brian Bushnell

The duplication rate appears very low (considering it's only a ~3Mbp organism). Here's a plot of read uniqueness for the first 10M read pairs (out of 124M total pairs):



To interpret this: each read is examined for its first 31-mer and a random 31-mer. These are added to a hashtable. If a 31-mer was already present, the read is considered non-unique; otherwise, it is considered unique. Errors will inflate the apparent uniqueness. The cumulative ratio of unique vs non-unique reads is reported every 25k reads. The more nonuniform the library, the faster the value drops. There are multiple lines because I track "first" and "random" separately, and I also track read 1 and read 2 both separately and combined.
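The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual tool: it uses plain Python sets in place of the hashtable, does not canonicalize k-mers across strands, and reports the cumulative fraction of unique reads. The names `uniqueness_curve`, `interval`, and `seed` are assumptions.

```python
import random

def uniqueness_curve(reads, k=31, interval=25000, seed=0):
    """Track cumulative read uniqueness by first k-mer and by a random k-mer.

    Returns a list of (reads_processed, unique_first_fraction,
    unique_random_fraction) tuples, one per `interval` reads.
    Reads shorter than k are skipped.
    """
    rng = random.Random(seed)
    seen_first, seen_random = set(), set()
    unique_first = unique_random = 0
    processed = 0
    curve = []
    for read in reads:
        if len(read) < k:
            continue
        processed += 1
        first = read[:k]
        start = rng.randrange(len(read) - k + 1)
        rand = read[start:start + k]
        # A read counts as unique only if its k-mer has not been seen before.
        if first not in seen_first:
            seen_first.add(first)
            unique_first += 1
        if rand not in seen_random:
            seen_random.add(rand)
            unique_random += 1
        if processed % interval == 0:
            curve.append((processed,
                          unique_first / processed,
                          unique_random / processed))
    return curve
```

Running this separately on read 1, on read 2, and on both together would give the multiple lines mentioned above.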

The waviness here is probably due to some problem with the optics, correlating with individual image frames.
Attached Images
File Type: png NextSeq_Uniqueness.png (37.6 KB, 26 views)
Old 08-18-2014, 04:25 PM   #4
nucacidhunter

I would suggest first checking for sequencer faults, which the person running the machine should be able to do. If that is ruled out as a possible cause, I would look next at the library prep and its diversity. The waviness in base frequency looks similar to what I have seen with low-diversity mate-pair libraries, where a library with fewer than 10M unique fragments was sequenced to hundreds of millions of reads (though the frequency was larger than 3), and also with low-diversity amplicon libraries. Out of curiosity, how could the duplication rate be low? In a 3 Mb genome there is only the possibility of obtaining 3M unique fragments (at least in this case in the initial 100 bp). If this library is sequenced to a depth of 124M reads, there would be a high level of duplication.
Old 08-18-2014, 08:48 PM   #5
Brian Bushnell

Quote:
Originally Posted by nucacidhunter View Post
Out of curiosity, how could the duplication rate be low? In a 3 Mb genome there is only the possibility of obtaining 3M unique fragments (at least in this case in the initial 100 bp). If this library is sequenced to a depth of 124M reads, there would be a high level of duplication.
So, this is a 2x151bp library; as expected, after 10M read pairs, the fraction of read 1 with a unique first 31-mer drops to around 35%. This is consistent with high uniqueness: if every starting location on the genome were used, you could only get up to around 31% uniqueness (the genome is actually about 3.09 Mbp). The fact that some reads have errors pushes it higher, to 35%, but it's still good.
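The 31% ceiling is just the genome size divided by the number of reads. A small Python sketch makes the arithmetic explicit and also gives the value expected if start sites were sampled uniformly at random. It assumes one distinct 31-mer per genome start position (strands collapsed, as in the estimate above), error-free reads, and uniform sampling; the function names are illustrative.

```python
import math

def max_unique_fraction(distinct_starts, reads):
    """Upper bound: every distinct start site is seen exactly once first."""
    return min(1.0, distinct_starts / reads)

def expected_unique_fraction(distinct_starts, reads):
    """Expected fraction of first-seen reads under uniform random sampling.

    The expected number of distinct start sites hit by `reads` draws is
    approximately G * (1 - exp(-N / G)) for G sites and N reads.
    """
    hit = distinct_starts * -math.expm1(-reads / distinct_starts)
    return hit / reads

genome = 3_090_000   # ~3.09 Mbp, from the post
reads = 10_000_000   # first 10M read pairs

ceiling = max_unique_fraction(genome, reads)        # 0.309, i.e. ~31%
uniform = expected_unique_fraction(genome, reads)   # slightly below the ceiling
```

Under these assumptions an ideal library sits just below the 31% ceiling, so an observed value of about 35% is explained by sequencing errors creating novel 31-mers rather than by extra diversity.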

But there's also pair uniqueness, for which I use a hash of the middle 31-mer in read 1 and read 2. This represents the fraction of read pairs with a unique start+stop combination, and thus is a much better measure of library duplication rate. By that metric, of the first 10 million read pairs, 99% of them are unique, which indicates the library has a very low duplication rate. Though certainly if I extended the graph all the way to 124 million pairs I would expect that to drop a bit.
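A rough Python sketch of the pair-uniqueness idea, assuming the key is simply the tuple of the middle k-mers of read 1 and read 2. It stores the k-mers themselves rather than a hash of them, and the name `pair_unique_fraction` is hypothetical.

```python
def middle_kmer(read, k):
    """Return the k-mer centered in the read (left-biased if the split is uneven)."""
    start = (len(read) - k) // 2
    return read[start:start + k]

def pair_unique_fraction(pairs, k=31):
    """Fraction of read pairs whose (read1, read2) middle k-mer combination is new.

    `pairs` is an iterable of (read1, read2) strings. Pairs with either read
    shorter than k are skipped. Returns None if no pair is usable.
    """
    seen = set()
    unique = total = 0
    for read1, read2 in pairs:
        if len(read1) < k or len(read2) < k:
            continue
        total += 1
        key = (middle_kmer(read1, k), middle_kmer(read2, k))
        if key not in seen:
            seen.add(key)
            unique += 1
    return unique / total if total else None
```

Because the key combines both reads, two pairs collide only when both ends of the fragment agree, which is why this measure stays near 99% even when single-read uniqueness has already dropped to around 35%.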

Tags
base composition, insert size, nextseq, sawtooth, wavy
