**kmcarr** · 02-23-2010, 02:46 PM

Hinsby,

You appear to be confusing sequencing platforms and paired-end (or mate-paired) protocols here.

The Illumina paired-end protocol is meant to generate two reads, one from each end of a contiguous fragment of dsDNA. The reads point towards each other (in their 5'->3' directions) and are separated by 200-600 bp, depending on the size of the DNA fragment.

The Illumina mate-pair protocol is meant to generate two reads which are separated by 2-5 kbp. This protocol includes a circularization step and subsequent fragmentation of the circle. The standard protocol does not use any linker DNA in the circularization. The two reads will be separated by 2-5 kbp and will point away from each other.

The Roche/454 paired end protocol is meant to produce two reads which are separated by 3, 8 or 20 kbp depending on the size of your original shearing of the genomic DNA. This protocol also uses a circularization step but includes a 42 bp linker at the point of circularization. The two reads will be separated by 3, 8 or 20 kbp and will point in the same direction.
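The three orientations described above can be told apart once both reads of a pair are aligned to a reference. Here is a minimal Python sketch of that check. The `Hit` class and the labels "inward", "outward" and "same" are my own illustrative names, not output from any aligner, and the sketch assumes you already have a start coordinate and strand for each read.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical minimal alignment record for one read."""
    start: int     # leftmost reference coordinate of the aligned read
    forward: bool  # True if the read aligns to the forward strand


def pair_orientation(a: Hit, b: Hit) -> str:
    """Classify how two aligned reads of a pair point relative to each other."""
    if a.forward == b.forward:
        # Both reads on the same strand: they point in the same direction
        # (what the 454 paired-end protocol is described as producing).
        return "same"
    left, right = (a, b) if a.start <= b.start else (b, a)
    # Leftmost read on the forward strand means the reads face each other
    # (Illumina paired-end); otherwise they face away (Illumina mate-pair).
    return "inward" if left.forward else "outward"


# Expected orientation per protocol, as described in the post above.
EXPECTED = {
    "Illumina paired-end": "inward",
    "Illumina mate-pair": "outward",
    "Roche/454 paired-end": "same",
}
```

For example, `pair_orientation(Hit(100, True), Hit(400, False))` gives "inward", which is what a standard paired-end library should show.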

You state that the sequence data was generated using the Illumina platform but that it has a 42 bp linker. The presence of the 42 bp linker would indicate the data was generated using 454. You need to clarify with your sequencing center what platform was used to generate the sequence before we can advise you on how to process/interpret your data.

**hinsby** · 02-23-2010, 03:51 PM

I was reading some previous posts and yes, I have mislabeled my data: they are Illumina mate pairs with an 8 kb distance. Indeed, the standard Illumina protocol does not use a 42 bp central linker, which would avoid the problem of having to remove this sequence. But, in the belief of our sequencing facility manager, not having a central linker also does not let you distinguish true mate pairs (sequenced from the extremes of your 8 kb fragments) from reads hitting the central part of the fragment (the joint where the DNA fragment was circularized), and hitting it will create intragenomic chimeric reads. Thus he changed the protocol and added the extra linker. The linker idea sounds pretty much like 454 because it was adapted from that technique. So the data was generated using a modified protocol with an extra central linker, which in the long run should help to differentiate true mate pairs from paired ends in Illumina; however, it also created a challenge for actually processing the data (trimming and separation) before assembly.

I am inexperienced with this technology so any help is highly appreciated.

Hinsby

**Pepe** · 02-23-2010, 11:03 PM

In any case, the ShortRead package in R will solve your trimming problems.
You'll need to know/learn R though.
Here are some very useful examples of how to do the trimming and much more:

http://manuals.bioinformatics.ucr.edu/home/ht-seq

Also, look up the "vcountPattern" function; it seems well suited to your problem.
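For anyone who does not know R, the idea behind counting a pattern in every read can be sketched in plain Python. This is only an illustration of the concept, not the ShortRead/Biostrings implementation. The function names and the 8 bp `LINKER` below are made-up placeholders (the real 42 bp linker sequence has to come from your sequencing facility), and only substitutions are tolerated, not indels.

```python
from typing import Iterable, List

# Placeholder only: substitute the real linker sequence from your facility.
LINKER = "ACGTTGCA"


def count_pattern(read: str, pattern: str, max_mismatch: int = 0) -> int:
    """Count start positions where `pattern` matches `read` with at most
    `max_mismatch` substitutions (no insertions or deletions)."""
    m = len(pattern)
    count = 0
    for i in range(len(read) - m + 1):
        mismatches = sum(1 for x, y in zip(read[i:i + m], pattern) if x != y)
        if mismatches <= max_mismatch:
            count += 1
    return count


def count_in_reads(reads: Iterable[str], pattern: str,
                   max_mismatch: int = 0) -> List[int]:
    """Apply count_pattern to every read and return one count per read."""
    return [count_pattern(r, pattern, max_mismatch) for r in reads]
```

Reads with a count of zero contain no full-length copy of the linker; reads with a count of one or more are the ones that need trimming or splitting.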

**kmcarr** · 02-24-2010, 05:50 AM

Originally posted by hinsby:

...in the belief of our sequencing facility manager, not having a central linker also does not let you distinguish true mate pairs (sequenced from the extremes of your 8 kb fragments) from reads hitting the central part of the fragment (the joint where the DNA fragment was circularized), and hitting it will create intragenomic chimeric reads. Thus he changed the protocol and added the extra linker. The linker idea sounds pretty much like 454 because it was adapted from that technique.

Well then I'm afraid your sequencing facility manager left you with a hot mess. The Illumina protocol recognizes the possibility that a read could cross the circular junction point but if you follow it as recommended the frequency should be very low. Here is what the Illumina mate-pair guide says:

When sequencing a mate pair library, Illumina recommends a read length no longer than 36 bases. A longer read length elevates error rates, because longer reads are more likely to cross over the junction of the two joined ends of a size-selected fragment. The Illumina analysis pipeline discards these junction reads, since they do not align to the reference sequence.

To minimize junction reads, the mate pair library uses a template size range of 350–650 bp. This is larger than a typical paired-end library template of 300–400 bp. Increasing the size range of the library in the mate pair protocol minimizes the number of sequence reads that pass through a junction.

Did you perform long reads with this library? The mate-pair protocol (as opposed to the paired-end protocol) is meant to provide scaffolding information, not sequence coverage.
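To put rough numbers on why read length matters here, a back-of-the-envelope sketch in Python follows. It rests on a simplifying assumption of mine that is not in the Illumina guide: the junction is taken to be equally likely to sit anywhere along the template, and a read is assumed to start at the very end of the template. Real libraries will deviate from this, so treat the output as an order-of-magnitude guide only.

```python
def p_read_crosses_junction(read_len: int, template_len: int) -> float:
    """Chance that one read reaches the junction, assuming the junction
    position is uniform along the template (a toy model)."""
    if template_len <= 0:
        raise ValueError("template_len must be positive")
    return min(1.0, read_len / template_len)


def p_pair_has_junction_read(read_len: int, template_len: int) -> float:
    """Chance that at least one of the two reads (one from each end of the
    template) reaches the junction, under the same toy model."""
    if template_len <= 0:
        raise ValueError("template_len must be positive")
    return min(1.0, 2 * read_len / template_len)


if __name__ == "__main__":
    for read_len in (36, 80):
        p = p_read_crosses_junction(read_len, 500)
        print(f"{read_len} bp read on a 500 bp template: {p:.1%}")
```

Under this toy model a 36 bp read on a 500 bp template reaches the junction about 7% of the time, while an 80 bp read does so about 16% of the time, so longer reads more than double the fraction of junction reads.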

You could try the fuzznuc program (http://embossgui.sourceforge.net/dem...l/fuzznuc.html) in the EMBOSS Suite (you would need to install all of EMBOSS). This won't trim the reads; it will only identify the location of the linker in your reads. You would then need to parse the output and trim or split the reads yourself.
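The "find the linker, then split the read yourself" step can also be done directly in a script. Below is a minimal Python sketch of it, independent of fuzznuc. The function names and the 8 bp linker used in the usage example are made-up placeholders, and the sketch makes two assumptions you should check against your data: the linker appears in full within the read (a linker truncated at the end of a read is not detected), and sequencing errors in it are substitutions only.

```python
from typing import Optional, Tuple


def find_linker(read: str, linker: str, max_mismatch: int = 2) -> Optional[int]:
    """Return the start index of the best full-length linker match
    (fewest substitutions, at most `max_mismatch`), or None if absent."""
    best_pos, best_mm = None, max_mismatch + 1
    m = len(linker)
    for i in range(len(read) - m + 1):
        mm = sum(1 for x, y in zip(read[i:i + m], linker) if x != y)
        if mm < best_mm:
            best_pos, best_mm = i, mm
    return best_pos


def split_at_linker(read: str, linker: str,
                    max_mismatch: int = 2) -> Optional[Tuple[str, str]]:
    """Split a read into the sequence before and after the linker.
    Returns None when no linker is found, i.e. the read is left untouched."""
    pos = find_linker(read, linker, max_mismatch)
    if pos is None:
        return None
    return read[:pos], read[pos + len(linker):]
```

For example, with the placeholder linker "ACGTTGCA", `split_at_linker("GGGGGACGTTGCATTTTT", "ACGTTGCA")` returns `("GGGGG", "TTTTT")`. For FASTQ input the quality string would need to be cut at the same two positions, and very short halves are probably best discarded before assembly.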

**hinsby** · 03-02-2010, 12:57 PM

Yep, it sounds like it is a hot mess indeed.

The reads are on average 80 bp, so they are long reads.

OK, before placing the sequencing order I was not aware of the protocol's recommendation to use shorter reads to reduce the chance of reading into the joint (center). However, I trusted the judgment of our sequencing manager. His intention was to maximize the information by using long reads, to use the adapter to flag true mate pairs, and possibly to obtain a de novo assembly using a full lane for each 2.6 Mb genome. The idea seems a good one to me, except that the center was not bioinformatically ready to deal with the sorting and cleaning of the sequences before assembly, and now I have that task and I am new to bioinformatics.

I will try fuzznuc; it sounds like it could help, but I have 3 samples with on the order of 15 million reads each, which makes this task computationally slow and memory demanding; my Mac can barely handle the big files. Thanks for the help, of course, and any other idea or suggestion is welcome any time.
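On the memory problem: files of this size do not have to be loaded whole. A minimal Python sketch of reading a FASTQ file one record at a time is shown below, so memory use stays flat regardless of the number of reads. It assumes plain four-line FASTQ records (no wrapped sequence lines), and the function names and the linker passed in are illustrative placeholders.

```python
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) one record at a time.
    Assumes four lines per record; only one record is held in memory."""
    while True:
        header = handle.readline()
        if not header:
            return              # end of file
        if not header.strip():
            continue            # skip stray blank lines
        seq = handle.readline().rstrip("\n")
        handle.readline()       # the '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def count_reads_with_linker(handle: TextIO, linker: str) -> Tuple[int, int]:
    """Stream through a FASTQ file and return (total reads, reads that
    contain an exact copy of `linker`)."""
    total = hits = 0
    for _, seq, _ in read_fastq(handle):
        total += 1
        if linker in seq:
            hits += 1
    return total, hits
```

It would be used as `with open("sample1.fastq") as fh: print(count_reads_with_linker(fh, linker))`, where the file name is hypothetical. The same loop is where trimming or splitting would go, writing each processed record straight to an output file instead of keeping it in a list.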

help: how to clean paired-end Illumina reads before assembly