SEQanswers

Old 02-23-2010, 10:26 AM   #1
hinsby
Junior Member
 
Location: Champaign, Illinois

Join Date: Apr 2009
Posts: 4
help: how to clean pair-end Illumina reads before assembly

Hello everyone,

I have some challenges for the group and any help and suggestion is welcome:

I ran several genomes using the 8 kb paired-end protocol, one genome per lane. The bioinformatics group at my facility has little experience with this and is struggling to help with my project, so here are the problems.

A) The runs seem contaminated by chimeric fragments from the sequencing adapters used in making the paired-end library. Is there any software or script out there that can remove sequences matching the adapters while (and this is the key part) allowing a certain percentage of mismatches to the adapter sequence, to account for chimeric multi-primer sequences?

B) The next problem is that the paired-end library also uses a central adapter, and the true paired-end data are the pairs where the reads start at both ends of the genomic fragment, far from the central adapter (see the PDF protocol for more detail: http://www.illumina.com/applications...equencing.ilmn). However, because of the random shearing steps required, the technology cannot guarantee that the central adapter sits exactly in the center, so the 42 bp adapter and the genomic sequence can come in every possible combination, as follows:
1- sequence read+adapter read (this is the easy one, where a 3' trimming tool can do the job)
2- adapter read+sequence read (this would need a 5' trimming tool). This can be tricky when the adapter part is as short as 2 or 3 bases, because those bases will show up later in the assembly. More importantly, a read that starts with adapter followed by the actual sequence is not a true 8 kb paired end; it is actually a paired end of the 500-base fragment placed in the sequencing reaction. This data should then be trimmed and moved into a 500 b fragment file (along with its pair) and used that way, or just used as a single read.

3- adapter in the center: sequence read+adapter read+sequence read. This case should be handled by a 3'-end trimmer, but the trimmer must be able to recognize the adapter in the middle of the read and not only at the end of the read, which is what such tools are usually written for.

4- the removal tool should be able to act on pairs: e.g. if it discards one read of a pair as a chimeric primer, it should also throw away the second one (no chimeras allowed). The trimming tool should trim both reads of a pair together and file the trimmed pair as true 8 kb or 500 b; if one of the reads is eliminated because what is left is too short, the other read should go to a single-reads file.

After all the sorting, filtering, etc., the reads should be organized into different files: true clipped 8 kb paired ends, 500 b paired ends, and single reads, all after removing the chimeric/artifact reads that come from primer dimerization. This is a complex case, and my question is which tool or set of tools you would recommend for doing all these steps, so I can use my ginormous amount of reads that have been kidnapped by all these issues.
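To make the sorting rules above concrete, here is a minimal Python sketch of one possible reading of them. It is not an existing tool. It assumes the adapter position in each read has already been found by some other step and is passed in as a (start, end) tuple, or None when no adapter was found. The function names, the bin labels and the min_len cutoff of 20 bases are all hypothetical choices made for illustration.

```python
def trim_read(seq, hit):
    """Trim one read around an adapter hit.

    hit is (start, end) of the adapter within seq, or None.
    Returns (kept_sequence, starts_with_adapter).
    """
    if hit is None:
        return seq, False
    start, end = hit
    if start == 0:
        # Case 2: adapter first, then sequence. Keep what follows the adapter.
        return seq[end:], True
    # Cases 1 and 3: keep only the sequence in front of the adapter.
    return seq[:start], False


def route_pair(seq1, seq2, hit1=None, hit2=None,
               chimeric1=False, chimeric2=False, min_len=20):
    """Decide which output bin a read pair belongs to.

    Returns (label, kept1, kept2), where label is one of the
    hypothetical bins "mate_8kb", "paired_500bp", "single", "discard".
    """
    # Case 4: one chimeric read condemns the whole pair.
    if chimeric1 or chimeric2:
        return "discard", None, None
    kept1, lead1 = trim_read(seq1, hit1)
    kept2, lead2 = trim_read(seq2, hit2)
    ok1 = len(kept1) >= min_len
    ok2 = len(kept2) >= min_len
    if ok1 and ok2:
        # A read that began with adapter marks the pair as short-insert.
        label = "paired_500bp" if (lead1 or lead2) else "mate_8kb"
        return label, kept1, kept2
    if ok1:
        return "single", kept1, None
    if ok2:
        return "single", None, kept2
    return "discard", None, None
```

Each label would map to its own output file. For case 3 the sketch keeps only the part in front of the adapter, which is what a 3' trimmer would do; whether the part after the adapter is worth keeping is a separate decision that this sketch does not make.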

any advice is welcome

Hinsby
Old 02-23-2010, 02:46 PM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,156

Hinsby,

You appear to be confusing sequencing platforms and paired-end (or mate-paired) protocols here.

The Illumina paired-end protocol is meant to generate two reads, one from each end of a contiguous fragment of dsDNA. The reads point towards each other (in their 5'->3' directions) and are separated by 200-600 bp, depending on the size of the DNA fragment.

The Illumina mate-pair protocol is meant to generate two reads which are separated by 2-5 kbp. This protocol includes a circularization step and subsequent fragmentation of the circle. The standard protocol does not use any linker DNA in the circularization. The two reads will be separated by 2-5 kbp and will point away from each other.

The Roche/454 paired end protocol is meant to produce two reads which are separated by 3, 8 or 20 kbp depending on the size of your original shearing of the genomic DNA. This protocol also uses a circularization step but includes a 42 bp linker at the point of circularization. The two reads will be separated by 3, 8 or 20 kbp and will point in the same direction.

You state that the sequence data was generated using the Illumina platform but that it has a 42 bp linker. The presence of the 42 bp linker would indicate the data was generated using 454. You need to clarify with your sequencing center what platform was used to generate the sequence before we can advise you on how to process/interpret your data.
Old 02-23-2010, 03:51 PM   #3
hinsby
Junior Member
 
Location: Champaign, Illinois

Join Date: Apr 2009
Posts: 4

I was reading some previous posts and yes, I have mislabeled my data. They are Illumina mate pairs of 8 kb distance. Indeed, the standard Illumina protocol does not use a 42 bp central linker, which would avoid the problem of having to remove this sequence. But, in the belief of our sequencing facility manager, not having a central linker also means you cannot distinguish true mate pairs (sequenced from the extremes of your 8 kb fragments) from reads that hit the central part of the fragment (the joint where the DNA fragment was circularized), and reads hitting that joint will create intragenomic chimeric reads. Thus he changed the protocol and added the extra linker. The linker idea sounds pretty much like 454 because it was adapted from that technique. So the data was generated using a modified protocol with an extra central linker, which in the long run should help to differentiate true mate pairs from paired ends in Illumina; however, it also created a challenge for actually processing the data (trimming and separation) before assembly.

I am inexperienced with this technology so any help is highly appreciated.

Hinsby
Old 02-23-2010, 11:03 PM   #4
Pepe
Member
 
Location: Germany

Join Date: Mar 2009
Posts: 28

In any case, the ShortRead package in R should solve your trimming problems.
You'll need to know/learn R though.
Here are some very useful examples of how to do the trimming and much more:
http://manuals.bioinformatics.ucr.edu/home/ht-seq

Also, Google the "vcountPattern" function; it seems well suited to your problem.
Old 02-24-2010, 05:50 AM   #5
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,156

Quote:
Originally Posted by hinsby View Post
...in the belief of our sequencing facility manager, not having a central linker also means you cannot distinguish true mate pairs (sequenced from the extremes of your 8 kb fragments) from reads that hit the central part of the fragment (the joint where the DNA fragment was circularized), and reads hitting that joint will create intragenomic chimeric reads. Thus he changed the protocol and added the extra linker. The linker idea sounds pretty much like 454 because it was adapted from that technique.

Well then I'm afraid your sequencing facility manager left you with a hot mess. The Illumina protocol recognizes the possibility that a read could cross the circular junction point but if you follow it as recommended the frequency should be very low. Here is what the Illumina mate-pair guide says:

Quote:
When sequencing a mate pair library, Illumina recommends a read length no longer than 36 bases. A longer read length elevates error rates, because longer reads are more likely to cross over the junction of the two joined ends of a size-selected fragment. The Illumina analysis pipeline discards these junction reads, since they do not align to the reference sequence.

To minimize junction reads, the mate pair library uses a template size range of 350–650 bp. This is larger than a typical paired-end library template of 300–400 bp. Increasing the size range of the library in the mate pair protocol minimizes the number of sequence reads that pass through a junction.

Did you perform long reads with this library? The mate-pair protocol (as opposed to the paired-end protocol) is meant to provide scaffolding information, not sequence coverage.

You could try the fuzznuc program (http://embossgui.sourceforge.net/dem...l/fuzznuc.html) in the EMBOSS Suite (you would need to install all of EMBOSS). This won't trim the reads; it will only identify the location of the linker in your reads. You would then need to parse the output and trim or split the reads yourself.
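For readers who want to see the "locate the linker, then trim or split" idea in code, here is a rough pure-Python sketch. It does not call fuzznuc or parse its output. It is a simple substitution-only (Hamming) scan that tolerates a fraction of mismatches and also accepts partial linker hits hanging off either end of a read. The LINKER sequence is a made-up placeholder (the thread never gives the real 42 bp sequence), and the function names, the 10% mismatch rate and the 5-base minimum overlap are assumptions.

```python
# Placeholder linker, NOT the real 42 bp sequence used by the facility.
LINKER = "ACGTTGCAAGGCTTAACCGG"


def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def find_linker(read, linker, max_mismatch_rate=0.1, min_overlap=5):
    """Return (start, end) of the best linker hit in read, or None.

    The linker may hang off the 5' or 3' end of the read as long as at
    least min_overlap bases overlap. "Best" means longest overlap, then
    fewest mismatches. Indels are not handled.
    """
    n, m = len(read), len(linker)
    best = None  # (overlap, -mismatches, start, end)
    # offset is where linker[0] would sit relative to read[0];
    # negative values mean the linker starts before the read does.
    for offset in range(min_overlap - m, n - min_overlap + 1):
        start = max(offset, 0)
        end = min(offset + m, n)
        overlap = end - start
        if overlap < min_overlap:
            continue
        mism = count_mismatches(read[start:end],
                                linker[start - offset:end - offset])
        if mism > max_mismatch_rate * overlap:
            continue
        cand = (overlap, -mism, start, end)
        if best is None or cand[:2] > best[:2]:
            best = cand
    return None if best is None else (best[2], best[3])


def split_at_linker(read, hit):
    """Return the sequence on each side of a linker hit."""
    start, end = hit
    return read[:start], read[end:]
```

Short partial hits at the read ends will produce some false positives, because a 5-base exact match happens by chance fairly often, so min_overlap is a trade-off to tune on real data. The scan is also quadratic in read and linker length, which is fine for a sketch but slow for millions of reads.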
Old 03-02-2010, 12:57 PM   #6
hinsby
Junior Member
 
Location: champaing Illinois

Join Date: Apr 2009
Posts: 4

Yep, it sounds like it is a hot mess indeed.

The reads are on average 80 bp, so they are long reads.

OK, before placing the sequencing order I was not aware of the protocol or of the use of shorter reads to reduce the chance of reading into the joint (center). However, I trusted the judgment of our sequencing manager, and his intention was to maximize the information by using the long reads, to use the adapter to somehow flag true mate pairs, and possibly to obtain a de novo assembly using a full lane for each 2.6 Mb genome. The idea seems a good one to me, except that the center was not bioinformatically ready to deal with the sorting and cleaning of the sequences before assembly, and now that task has fallen to me and I am new to bioinformatics.

I will try fuzznuc; it sounds like it could help, but I have 3 samples with something on the order of 15 million reads each, which makes this task computationally long and memory demanding, and my Mac can barely handle the big files. Thanks for the help of course, and any other idea or suggestion is welcome any time.
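On the memory concern: one common way to keep memory use flat regardless of file size is to stream the reads one pair at a time instead of loading a whole file. The sketch below shows that idea in Python. It assumes plain 4-line FASTQ records (no wrapped sequence lines) and two separate files with the reads of each pair in the same order. The function names are hypothetical, and the in-memory demo data stands in for real files.

```python
import io
from itertools import islice, zip_longest


def read_fastq(handle):
    """Yield (name, sequence, quality) one record at a time."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not any(record):
            return  # end of file (or only a trailing blank line)
        if len(record) < 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record: %r" % (record,))
        yield record[0][1:], record[1], record[3]


def read_pairs(handle1, handle2):
    """Yield (record1, record2) in lockstep from two FASTQ streams."""
    for rec1, rec2 in zip_longest(read_fastq(handle1), read_fastq(handle2)):
        if rec1 is None or rec2 is None:
            raise ValueError("the two files hold different numbers of reads")
        yield rec1, rec2


# Tiny in-memory demo; real use would pass open file handles instead.
demo1 = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nTTGA\n+\nIIII\n")
demo2 = io.StringIO("@r1/2\nGGCA\n+\nIIII\n@r2/2\nCCAT\n+\nIIII\n")
pairs = list(read_pairs(demo1, demo2))
```

With real data the loop would be written as `for rec1, rec2 in read_pairs(f1, f2):` over two opened files, processing and writing out each pair before moving on, so only one pair is held in memory at a time. Collecting everything with `list(...)` as in the demo defeats that purpose and is only done here to keep the example short.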