SEQanswers

06-04-2014, 06:04 AM   #1
salamay
Member
 
Location: canada

Join Date: May 2014
Posts: 20
FASTQC trends

I have some metagenomic data from whole-genome shotgun sequencing on an Illumina HiSeq. The reads are 100 bp paired-end, and when I examine them in FastQC I see a couple of things. First, the per base sequence content and per base GC content are very skewed at the beginning of the reads (~bp 1-16), and the per base N content has a spike at bp 4. I also have overrepresented k-mers at the beginning of the reads that do not belong to any adapters (as far as I can tell). I know these trends are sometimes seen in RNA-seq data because of the (not so) random hexamer priming, but I am confused as to why I see them in whole-genome data. I am also not sure about the N spike at bp 4. I have attached images of what I mentioned and would appreciate any insight.

Thanks.
Attached Images
kmer.png
perbasegc.png
perbasen.png
perbasesequence.png
06-04-2014, 06:24 AM   #2
TonyBrooks
Senior Member
 
Location: London

Join Date: Jun 2009
Posts: 298

I'm assuming these were sequenced on a HiSeq? The spike at cycle 4 is most likely a phenomenon known as Bottom Middle Swath (or BMS in Illumispeak). Before scanning, the HiSeq attempts to find focus at a fixed point near the inlet port. If a bubble is present at this point, the instrument mis-focuses and that particular swath is scanned out of focus. You should be able to see this if you look at the thumbnail images for cycle 4. Basecalling can't be done on these images, so each cluster in that swath is given an N at this position.
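
One way to check the spike outside FastQC is to count Ns per read position directly from the FASTQ. The following is only a minimal sketch: the function names and the toy reads are made up for illustration, and it assumes plain 4-line FASTQ records (no wrapped sequence lines).

```python
from collections import Counter
from typing import Iterable, Iterator, List


def read_sequences(fastq_lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (2nd of every 4 lines) of each FASTQ record."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip()


def per_position_n_fraction(sequences: Iterable[str]) -> List[float]:
    """Fraction of reads with an N at each position (index 0 is bp 1)."""
    n_counts: Counter = Counter()
    totals: Counter = Counter()
    for seq in sequences:
        for pos, base in enumerate(seq):
            totals[pos] += 1
            if base in "Nn":
                n_counts[pos] += 1
    return [n_counts[pos] / totals[pos] for pos in range(len(totals))]


# Toy input (hypothetical reads): two of three reads have an N at bp 4.
toy_fastq = [
    "@r1", "ACGNACGT", "+", "IIIIIIII",
    "@r2", "TTGNACGA", "+", "IIIIIIII",
    "@r3", "GGCAACGT", "+", "IIIIIIII",
]
fractions = per_position_n_fraction(read_sequences(toy_fastq))
for bp, frac in enumerate(fractions, start=1):
    print(bp, round(frac, 3))
```

On real data you would pass an open file handle instead of the toy list. An N fraction that jumps at a single cycle and is near zero elsewhere would be consistent with an imaging problem at that cycle rather than with the library itself.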
06-04-2014, 06:49 AM   #3
salamay
Member
 
Location: canada

Join Date: May 2014
Posts: 20

Thanks TonyBrooks, yes, it was on a HiSeq. I had not heard about this issue before, so thanks for bringing it to my attention.
06-04-2014, 06:56 AM   #4
TonyBrooks
Senior Member
 
Location: London

Join Date: Jun 2009
Posts: 298

See here for more info

http://seqanswers.com/forums/showthread.php?t=15356
06-04-2014, 08:31 AM   #5
lac302
Member
 
Location: DE

Join Date: Dec 2012
Posts: 65

I've seen the same fluctuation in GC content over the first 20 or so bases on samples run on both the HiSeq and the MiSeq. I typically have enough coverage to just trim them off, even though the Q scores are always above 30.
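
To put numbers on that fluctuation, the per-position GC fraction can be computed from the read sequences. This is a rough sketch, not how FastQC itself does it: `per_position_gc` is a hypothetical helper, and the choice to skip N calls when counting is an assumption made here.

```python
from typing import Iterable, List


def per_position_gc(sequences: Iterable[str]) -> List[float]:
    """GC fraction at each read position, counting only called bases (A/C/G/T)."""
    gc: List[int] = []
    called: List[int] = []
    for seq in sequences:
        for pos, base in enumerate(seq.upper()):
            if pos == len(gc):
                # First time we see this position: extend both counters.
                gc.append(0)
                called.append(0)
            if base in "ACGT":
                called[pos] += 1
                if base in "GC":
                    gc[pos] += 1
    # Positions with no called bases are reported as 0.0.
    return [g / c if c else 0.0 for g, c in zip(gc, called)]
```

Comparing the values for the first 20 or so positions against the rest of the read shows how far the start deviates from the plateau, which helps when deciding how many bases to trim.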
06-04-2014, 10:16 AM   #6
salamay
Member
 
Location: canada

Join Date: May 2014
Posts: 20

Thanks lac302. So far I have trimmed the sequences up to bp 16 and worked from there, as you seem to have done, but I can't figure out the cause, or whether it is a bit wasteful to trim off 15 bp of useful sequence.
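
For reference, fixed-length trimming from the start of each read can be sketched as below. This is an illustration only: `parse_fastq` and `trim_head` are hypothetical helper names, the input is assumed to be plain 4-line FASTQ, and the number of bases to remove is left as a parameter. For paired-end data both mate files would need the same treatment so the pairs stay in sync.

```python
from typing import Iterable, Iterator, Tuple

# header, sequence, separator, quality
FastqRecord = Tuple[str, str, str, str]


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Group lines into 4-line FASTQ records; a trailing partial record is dropped."""
    it = (line.rstrip("\n") for line in lines)
    for header, seq, sep, qual in zip(it, it, it, it):
        yield header, seq, sep, qual


def trim_head(records: Iterable[FastqRecord], n_bases: int) -> Iterator[FastqRecord]:
    """Remove the first n_bases from sequence and quality; skip reads left empty."""
    for header, seq, sep, qual in records:
        trimmed_seq = seq[n_bases:]
        if trimmed_seq:
            yield header, trimmed_seq, sep, qual[n_bases:]
```

In practice an existing trimmer does the same job, for example the HEADCROP step in Trimmomatic.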
06-04-2014, 10:47 AM   #7
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667

Was the library prep done using a Nextera kit?
06-04-2014, 12:26 PM   #8
salamay
Member
 
Location: canada

Join Date: May 2014
Posts: 20

I believe so, but I am not sure and have asked those responsible for generating the data. Would using a Nextera kit explain what is seen?
06-04-2014, 03:24 PM   #9
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667

Yes. There was a recent thread discussing this. I will post a link if I can find it.