SEQanswers

Old 07-16-2014, 11:24 AM   #1
dave1
Junior Member
 
Location: California

Join Date: Jul 2014
Posts: 2
Illumina Nextera Paired-End Sequence Content Bias: Is Trimming Required for De Novo Assembly?

I'm working on a bacterial data set that I was having difficulty assembling.

Illumina, 300 bp reads, paired-end data, Nextera library prep.

The FastQC per-base sequence content chart (attached) shows high sequence content bias in the first 15-20 positions. Initially, I thought it was adapter contamination and tried a variety of trimming tools (Trimmomatic and others) to remove what I thought were adapters. I then found a blog post (https://www.instapaper.com/read/496731324) that suggests this is a library problem due to Nextera kits.

After running the data through Trimmomatic, I used only the paired output (ignoring the unpaired reads for the time being) and then manually trimmed the first 20 positions from the subset of data that showed the sequence bias. I was finally able to get a reasonable assembly.

Questions:
1) Does the sequence bias in the first 20 bases point to a problem with the library prep? Or is this typical of Nextera and nothing to worry about?

2) For de novo assembly, is it necessary to trim off the first ~20 bases? Is there a recommended tool or process, rather than just arbitrarily clipping the first 20 bases?

3) I noticed Trimmomatic separates the output into reads that are still paired and reads that are not. For de novo assembly, is there any reason NOT to include the unpaired data?

Thanks in advance
Attached Images
File Type: png per_base_sequence_content.png (32.9 KB, 61 views)
Old 07-16-2014, 12:21 PM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Nextera has highly nonuniform first ~20bp, but it's neither adapter sequence nor errors; just a fragmentation site bias. You don't need to trim it. If you did trim it, though, the only way would be to trim the first X bases.
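
For anyone who does want to try it, "trim the first X bases" can be sketched in a few lines of Python. This is only an illustration, not a recommendation or any particular tool's implementation: the function names `head_trim` and `head_trim_fastq` are made up here, and the code assumes plain four-line FASTQ records (no wrapped sequences).

```python
def head_trim(seq, qual, n=20):
    """Remove the first n bases and their quality characters from one read."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")
    return seq[n:], qual[n:]


def head_trim_fastq(lines, n=20):
    """Yield FASTQ lines with the first n bases removed from every record.

    Assumption: each record is exactly four lines (header, sequence, '+', quality).
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            seq, qual = head_trim(seq, qual, n)
            yield from (header, seq, plus, qual)
            record = []
```

Because every read is shortened and none is dropped, running the same trim on R1 and R2 keeps the two files in the same order. Reads shorter than n come out empty, so they would need to be handled separately.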

For assembly, if you use a pair-aware assembler and have sufficient data, it's best to assemble from paired reads. Some assemblers allow you to specify both paired and unpaired reads in the same assembly, in which case you could use both. But if the assembler only accepts paired OR unpaired reads, it's probably best to give it the paired reads only, rather than mixing all the reads together, which would require you to run the data as unpaired. There is no strict answer that will be correct for all assemblers, as they make use of pairing data differently, or possibly not at all.
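
To make the paired/unpaired distinction concrete, here is a rough Python sketch of how reads that survive trimming could be split into intact pairs and orphans. The function `split_pairs` is hypothetical, and it assumes that both mates share the same read ID once any /1 or /2 suffix has been stripped; it is not how any specific trimmer does it.

```python
def split_pairs(r1_ids, r2_ids):
    """Partition read IDs into intact pairs and orphaned mates.

    Assumption: a pair's two mates carry identical IDs in r1_ids and r2_ids.
    """
    r1_set = set(r1_ids)
    r2_set = set(r2_ids)
    # Pairs are reported in R1 order so both output files can stay in sync.
    paired = [rid for rid in r1_ids if rid in r2_set]
    orphans_r1 = [rid for rid in r1_ids if rid not in r2_set]
    orphans_r2 = [rid for rid in r2_ids if rid not in r1_set]
    return paired, orphans_r1, orphans_r2
```

The `paired` list is what would go to a pair-aware assembler; the two orphan lists are the "unpaired" reads that only some assemblers can take alongside the pairs.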
Old 07-17-2014, 07:55 AM   #3
dave1
Junior Member
 
Location: California

Join Date: Jul 2014
Posts: 2

Thanks for your help Brian.

Your feedback that it isn't necessary to trim the first 15-20 bases due to fragmentation site bias led me to revisit my QC results.

Another question: Would you be willing to comment on the quality of the reverse read? Would you consider this a good run, or just an OK run? Do you typically see such a large quality range in the first few bases of the reverse read? The lab is tuning its protocols. Does this point to anything that might need to be changed?

Adding this in case it helps others in the future.

Working with Illumina Nextera-prepped, paired-end 300 bp reads.

I have typically been taking a quick glance at the FastQC results. If the results looked good, I didn't bother with trimming or filtering the data before de novo assembly. (I was relying on the assembler to make use of the quality score information.)

However, when I tried to assemble the data, the assemblies (from a variety of assemblers) were all terrible (thousands of small contigs). Mapping results looked fine.

I was able to get a good assembly after running the data through Trimmomatic first. As Brian suggested, it is not necessary to trim off the first 15-20 bases due to fragmentation site bias...
Attached Images
File Type: png R1.png (9.5 KB, 43 views)
File Type: png R2.png (10.5 KB, 43 views)
Old 07-17-2014, 09:35 AM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

I have never worked with 2x300bp data; so far, we only go up to 2x250. So I'm not sure how typical the quality of the last bases on read 2 is, but it certainly looks like it should be trimmed. And overall the quality variability for read 2 seems higher than it should be, but I don't work on the wet-lab side, so I'm not sure what it might indicate.

If you have plenty of data, you might experiment with throwing away reads with average quality below some threshold (or specifically, pairs in which either read is below the threshold), and see if that improves your assembly.
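
A minimal sketch of that filter in Python, under stated assumptions: qualities are Phred+33 encoded, "average quality" is taken as the plain arithmetic mean of the Phred scores (which is not the same as averaging error probabilities), and the threshold of 20 is an arbitrary placeholder to experiment with. The names `mean_quality` and `keep_pair` are made up for illustration.

```python
def mean_quality(qual, offset=33):
    """Arithmetic mean Phred score of one quality string (Phred+33 assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def keep_pair(qual1, qual2, threshold=20.0, offset=33):
    """Keep a pair only if BOTH mates reach the threshold.

    The pair is discarded as soon as either read falls below it,
    so the surviving reads are always still paired.
    """
    return (mean_quality(qual1, offset) >= threshold
            and mean_quality(qual2, offset) >= threshold)
```

Dropping the whole pair when either mate fails keeps the output usable by a pair-aware assembler without creating orphan reads.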
Old 07-17-2014, 03:02 PM   #5
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,032

Since FastQC groups positions into larger intervals when plotting long reads, it is difficult to see what may be going on with R2. You could turn off the interval grouping on the command line and see whether the tail end of R2 truly requires major trimming or throwing away the reads.

If this is a bacterial genome I would suggest trying SPAdes, if you have not already done so.
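
To illustrate what an ungrouped view looks like, here is a small Python sketch that computes the mean quality at every single position with no binning. It is not FastQC's code; the function name `per_position_mean_quality` is hypothetical and Phred+33 encoding is assumed.

```python
def per_position_mean_quality(quals, offset=33):
    """Mean Phred score at each read position, one value per base (no grouping).

    Reads may differ in length; each position is averaged only over
    the reads that are long enough to cover it.
    """
    totals = []
    counts = []
    for qual in quals:
        for i, c in enumerate(qual):
            if i == len(totals):
                # First read to reach this position: extend the accumulators.
                totals.append(0)
                counts.append(0)
            totals[i] += ord(c) - offset
            counts[i] += 1
    return [t / n for t, n in zip(totals, counts)]
```

Feeding it the quality lines of R2 and looking at the last 50 or so values would show exactly where the tail starts to drop, rather than an average over a multi-base interval.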
Old 07-21-2014, 12:20 AM   #6
avo
Member
 
Location: Germany

Join Date: Sep 2013
Posts: 14

In my experience the FastQC quality plots look similar to what we see with TruSeq libraries.
However, I always trim for adapters and quality.
Especially with Nextera, bead-based size selection, and 2x300bp reads, you might end up with some adapter sequences in your read data.

Do you do the trimming on the MiSeq directly or separately afterwards? To get a feel for the adapter contamination, I would recommend turning off the adapter trimming function on the MiSeq.

Concerning the first 20 bp, I agree with Brian; it looks the same for the Nextera libraries we have sequenced so far.