SEQanswers

Old 07-09-2013, 03:15 AM   #1
kirstyn
Junior Member
 
Location: UK

Join Date: May 2012
Posts: 8
Default Strange fastqc per base sequence content 3'end

Hi

I have been processing some 150 bp paired-end Nextera XT reads from viral cDNA. The first 20 or so bases at the 5' end are due to Nextera, which I just trim off, but I also have an unusual base content at the 3' end of all sequences, consistently in all samples. I have attached the FastQC images for the forward reads of one sample, which is representative of all my samples. Does anyone know what this might be?
Attached Images
File Type: png per_sequence_quality.png (25.4 KB, 68 views)
File Type: png per_base_sequence_content.png (38.3 KB, 137 views)
File Type: png per_base_gc_content.png (22.6 KB, 70 views)
Old 07-09-2013, 03:23 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

Are the inserts smaller than the read length (150 cycles in this case)? If this is a paired-end experiment, you can easily check whether the two reads overlap to a large degree by using a tool such as FLASH.
Old 07-10-2013, 02:20 AM   #3
kirstyn
Junior Member
 
Location: UK

Join Date: May 2012
Posts: 8
Default

Thanks for your reply. The average size of my libraries is usually >350 bp, which I assumed was OK for 150 bp PE reads? Looking in Tablet, there is a large degree of overlap between paired reads. Is this indicative of the library being too short? I am mapping the reads to a reference sequence and looking for SNPs, so I am not worried about overlap in the read pairs. How would this affect the 3' end of my reads?
Old 07-10-2013, 03:23 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,076
Default

Quote:
Originally Posted by kirstyn View Post
Thanks for your reply. The average size of my libraries is usually >350 bp, which I assumed was OK for 150 bp PE reads? Looking in Tablet, there is a large degree of overlap between paired reads. Is this indicative of the library being too short? I am mapping the reads to a reference sequence and looking for SNPs, so I am not worried about overlap in the read pairs. How would this affect the 3' end of my reads?
You can easily determine how big the inserts are by looking at the extent of overlap (it sounds like a fraction of your library is nowhere near the expected 350 bp size). If some of the inserts are shorter than 150 bp, then you will start reading into the adapter at the other end and beyond. If these reads are not aligning well at the 3' end, then you may need to trim them.
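To make the relationship between insert size, read length, and adapter read-through concrete, here is a minimal Python sketch. The function names are made up for illustration, and the arithmetic assumes ideal reads with no indels or soft clipping.

```python
def adapter_bases_in_read(insert_len, read_len):
    """Bases at the 3' end of a read that come from the adapter, not the insert."""
    return max(0, read_len - insert_len)


def pair_overlap(insert_len, read_len):
    """Insert bases covered by both reads of a pair."""
    return max(0, min(insert_len, 2 * read_len - insert_len))


# 150 bp reads with three example insert sizes
for insert_len in (100, 200, 350):
    print(insert_len,
          adapter_bases_in_read(insert_len, 150),
          pair_overlap(insert_len, 150))
```

Under these assumptions a 350 bp insert gives no overlap at all with 2 x 150 bp reads, so a large overlap seen in a viewer points to a sizeable fraction of inserts shorter than 300 bp, and any insert shorter than 150 bp leaves adapter sequence at the 3' end.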
Old 07-10-2013, 08:27 AM   #5
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177
Default

Quote:
Originally Posted by kirstyn View Post
The first 20 or so bases at the 5'end are due to Nextera which I just trim off
I know this wasn't the question you were asking but it's not really necessary to trim off those 5' bases. The sequence is not incorrect. It simply represents the slight bias that the Nextera tagmentase has for certain sequence composition.
Old 07-12-2013, 01:24 AM   #6
kirstyn
Junior Member
 
Location: UK

Join Date: May 2012
Posts: 8
Default

Quote:
Originally Posted by kmcarr View Post
I know this wasn't the question you were asking but it's not really necessary to trim off those 5' bases. The sequence is not incorrect. It simply represents the slight bias that the Nextera tagmentase has for certain sequence composition.
Yes, thanks for that comment. I wasn't quite sure if I should trim off the 5' bases, especially since I am mapping my reads, but I had read about both random hexamer and Nextera transposome bias, so I decided to trim! I think I will try it without trimming too!
Old 12-19-2016, 06:15 AM   #7
MU Core
Member
 
Location: Columbia, Missouri

Join Date: Apr 2008
Posts: 57
Default

I observe the same 3' characteristic as previously reported in this thread. A representative plot is provided. The plot shown is a DNA library with insert size >350bp. However, we see this in all our FastQC plots. It is independent of library type, instrument (NextSeq or HiSeq), or read length (50, 75, or 100 bases). Therefore, I'm not inclined to see this as a library prep/chemistry issue. Has anyone also encountered this characteristic and identified a reason? Thank you in advance for comments.
Attached Images
File Type: jpg plot.jpg (23.5 KB, 45 views)
Old 12-19-2016, 07:39 AM   #8
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,234
Default

It is either a library issue or possibly a demultiplexing issue. Could you post plots from other runs with similar patterns, along with the library electropherograms?
Old 12-19-2016, 07:45 AM   #9
Michael.Ante
Senior Member
 
Location: Vienna

Join Date: Oct 2011
Posts: 123
Default

Did you see this pattern also with the "--nogroup" option?
Without that option the bases are binned, which can make the distribution look smoother than it is. The last base shown in your figure is a single bin on its own.
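As a rough illustration of why binning matters, the sketch below averages per-position base fractions in fixed-size bins. This is a generic averaging scheme with made-up helper names, not FastQC's actual grouping rules.

```python
def base_fraction_by_position(reads, base):
    """Fraction of reads carrying `base` at each position (reads may differ in length)."""
    longest = max(len(r) for r in reads)
    fractions = []
    for i in range(longest):
        column = [r[i] for r in reads if len(r) > i]
        fractions.append(column.count(base) / len(column))
    return fractions


def bin_means(values, bin_size):
    """Average consecutive positions in bins of `bin_size`; the last bin may be shorter."""
    bins = [values[i:i + bin_size] for i in range(0, len(values), bin_size)]
    return [sum(b) / len(b) for b in bins]


# Toy profile: 8 unbiased positions followed by one depleted final position
profile = [0.25] * 8 + [0.05]
print(bin_means(profile, 4))   # the final position ends up as its own bin
print(bin_means(profile, 3))   # here it is averaged with two unbiased neighbours
```

With a bin size of 4 the final position sits alone in its bin and the dip is fully visible, while a bin size of 3 dilutes it with its neighbours. Plotting every position individually avoids that ambiguity.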
Old 12-19-2016, 07:49 AM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by MU Core View Post
I observe the same 3' characteristic as previously reported in this thread. A representative plot is provided. The plot shown is a DNA library with insert size >350bp. However, we see this in all our FastQC plots. It is independent of library type, instrument (NextSeq or HiSeq), or read length (50, 75, or 100 bases). Therefore, I'm not inclined to see this as a library prep/chemistry issue. Has anyone also encountered this characteristic and identified a reason? Thank you in advance for comments.
Are you using Nextera for fragmentation? That's been identified as the cause of severe bias on the 5' (left) end. However, yours looks very sedate, so if I were to guess, I would say this is NOT a Nextera library. Have you looked into the empirical error rates (from mapping) of the left end to see if there is a corresponding increase? That will indicate whether this is bias, or an actual base-calling/non-genomic sequence issue.

The 3' end is just showing the normal Illumina biased/low-quality last base due to a lack of a subsequent base call needed for calibration; I always trim the last base in 76/101/151/etc. runs.
Old 12-19-2016, 09:11 AM   #11
MU Core
Member
 
Location: Columbia, Missouri

Join Date: Apr 2008
Posts: 57
Default

Here are a couple more examples. Sample M is that of a DNA PCR-free library sequenced on a HiSeq. Sample W is a TruSeq mRNA library sequenced on a NextSeq.

Brian's and Michael's suggestions both offer an explanation that I think fits these observations. They would also suggest that if the reads are trimmed before the FastQC report is generated, the bias at the 3' end will be removed. I'll give this a try and share the results.

Thank you again for your comments.
Attached Images
File Type: png SampleM.PNG (63.5 KB, 29 views)
File Type: png SampleM_NGS.PNG (27.2 KB, 22 views)
File Type: png SampleW.PNG (72.5 KB, 35 views)
File Type: png SampleW_NGS.PNG (21.2 KB, 16 views)
Old 12-19-2016, 09:17 AM   #12
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Sample W looks very typical of Nextera. Trimming the 5' end is not recommended in these cases because the bases are correct. It will not change the bias, just hide it so that your FastQC report looks better.
Old 12-19-2016, 09:42 AM   #13
MU Core
Member
 
Location: Columbia, Missouri

Join Date: Apr 2008
Posts: 57
Default

Brian, you were correct though that these libraries were not Nextera.
Old 12-19-2016, 09:47 AM   #14
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Oh, that's odd, then. There are some other things like random-hexamer-primed libraries that also have similar issues. I think it would be worthwhile generating an error-rate histogram to verify whether the mismatch rate is increased in that region. You can do so with BBMap like this:

bbmap.sh in=reads.fq ref=ref.fa mhist=mhist.txt bhist=bhist.txt qhist=qhist.txt

If the error rate is not increased, I recommend against trimming.
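For readers who want to see what a per-position mismatch rate measures, here is a small Python sketch. It assumes the reads have already been aligned without gaps and paired with the matching reference segment; it does not parse any BBMap output, and the function name is hypothetical.

```python
def mismatch_rate_by_position(pairs):
    """Per-position mismatch rate.

    `pairs` is an iterable of (read, reference_segment) strings of equal
    length, i.e. gapless alignments (an assumption of this sketch).
    """
    totals, mismatches = [], []
    for read, ref in pairs:
        for i, (r, g) in enumerate(zip(read, ref)):
            if i == len(totals):
                totals.append(0)
                mismatches.append(0)
            totals[i] += 1
            mismatches[i] += r != g
    return [m / t for m, t in zip(mismatches, totals)]


toy = [("ACGT", "ACGT"), ("ACGA", "ACGT"), ("TCGT", "ACGT"), ("ACGC", "ACGT")]
print(mismatch_rate_by_position(toy))
```

If the positions with skewed base composition show no rise in this rate, the skew reflects bias in which fragments were sequenced rather than wrong base calls, which is the case where trimming is not advised.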
Old 12-19-2016, 09:00 PM   #15
nucacidhunter
Jafar Jabbari
 
Location: Melbourne

Join Date: Jan 2013
Posts: 1,234
Default

I have seen this pattern only in low-diversity amplicons, and their FastQC pattern matches the Data By Cycle (%Base) plot in SAV for the run.
Old 01-05-2017, 06:50 AM   #16
MU Core
Member
 
Location: Columbia, Missouri

Join Date: Apr 2008
Posts: 57
Default

I think we've figured out the issue with the peculiar 3' end plot. Our informatics group noticed that there was no minimum length set for trimming adapter sequence. The adapter sequence starts with “AGAT”. Thus, any sequence ending in “A” gets trimmed, any sequence ending in “AG” also gets trimmed, and so forth with “AGA”, “AGAT” and for the rest of the adapter sequence.

Assuming sequences are random:
one fourth of sequences would end in “A”
one sixteenth of sequences would end in “AG”
one sixty-fourth of sequences would end in “AGA”
one 256th of sequences would end in “AGAT”

All of these “matching” endings will be trimmed resulting in the plot characteristics previously posted.
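The effect described above can be checked exhaustively on short toy reads. The Python sketch below uses a simplified trimmer (remove the longest read suffix that equals an adapter prefix) rather than any particular tool's algorithm, and the adapter bases after "AGAT" are an assumption, since only the first four are given here.

```python
from collections import Counter
from itertools import product

# Assumed adapter; only the leading "AGAT" is stated in the thread.
ADAPTER = "AGATCGGAAGAGC"


def trim_3prime(read, adapter=ADAPTER, min_overlap=1):
    """Remove the longest read suffix equal to an adapter prefix of >= min_overlap bases."""
    for k in range(min(len(read), len(adapter)), min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:len(read) - k]
    return read


def trimmed_fraction(read_len, adapter=ADAPTER, min_overlap=1):
    """Fraction of all possible reads of `read_len` that lose at least one base."""
    reads = ("".join(p) for p in product("ACGT", repeat=read_len))
    trimmed = sum(len(trim_3prime(r, adapter, min_overlap)) < read_len for r in reads)
    return trimmed / 4 ** read_len


def last_position_counts(read_len, adapter=ADAPTER, min_overlap=1):
    """Base counts at the final position among reads that keep their full length."""
    counts = Counter()
    for p in product("ACGT", repeat=read_len):
        read = trim_3prime("".join(p), adapter, min_overlap)
        if len(read) == read_len:
            counts[read[-1]] += 1
    return counts


print(trimmed_fraction(4))                      # 81/256, about 0.316
print(last_position_counts(4))                  # no full-length read ends in A
print(last_position_counts(4, min_overlap=11))  # uniform again
```

In this toy model about 32% of random 4-mers get trimmed (slightly less than the simple sum of the fractions, because every read ending in "AGA" also ends in "A"), and the last position of the surviving full-length reads contains no A and fewer G than expected. Requiring a longer minimum match restores a uniform last position.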
Old 01-05-2017, 09:58 AM   #17
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Ah, yes, that's a problem. Normally I use a minimum length of 11 to prevent that kind of issue. If you have paired reads, you can use BBDuk's tbo flag in addition to the normal sequence matching. Tbo (trim by overlap) will remove even 1 bp of adapter sequence based on read overlap rather than sequence matching, yielding very complete adapter removal without incurring bias.
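The idea behind overlap-based trimming can be sketched in a few lines of Python. This is an exact-match toy version with hypothetical function names, not BBDuk's implementation: when the insert is shorter than the read, the first n bases of read 1 are the reverse complement of the first n bases of read 2, and everything after position n is adapter regardless of its sequence.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def insert_length_from_overlap(r1, r2, min_insert=8):
    """Largest n < read length with r1[:n] == revcomp(r2[:n]), or None.

    Exact matching only; a real tool would tolerate sequencing errors.
    """
    shortest = min(len(r1), len(r2))
    for n in range(shortest - 1, min_insert - 1, -1):
        if r1[:n] == revcomp(r2[:n]):
            return n
    return None


def trim_pair_by_overlap(r1, r2, min_insert=8):
    """Cut both reads back to the inferred insert length, if one is found."""
    n = insert_length_from_overlap(r1, r2, min_insert)
    return (r1, r2) if n is None else (r1[:n], r2[:n])


# Toy pair: a 14 bp insert read with 20 cycles; "AGATCG" stands in for the
# adapter tail on both reads (real adapters differ between the two ends).
insert = "ACGTTGCAAGGCTA"
r1 = insert + "AGATCG"
r2 = revcomp(insert) + "AGATCG"
print(trim_pair_by_overlap(r1, r2))
```

Because the cut point comes from the pair agreeing with itself rather than from matching a short adapter prefix, a read that merely happens to end in "A" or "AG" is left alone, so the composition at the 3' end is not skewed.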
All times are GMT -8.