SEQanswers

View Poll Results: Does this result surprise you?
Yes 3 23.08%
No 10 76.92%
Voters: 13.

Old 11-02-2010, 09:42 AM   #1
bacdirector
Member
 
Location: Pittsburgh, PA

Join Date: Oct 2009
Posts: 12
Trimming SoLiD 50bp Reads = Doubling Our Mapping

We've all seen how sequencing errors tend to be found toward the 3' end of the read. We took up the task of trimming the reads from a highly over-covered study to 25, 30, 35... 50 b, and mapping them using NextGENe while allowing 1, 2, 3, etc. mismatches.

The results will floor you.

When we map to (human) reference using all 50bp, we map 25 million reads.

When we map using 35b trimmed reads, we map DOUBLE that... 50 million reads. 100% more reads mapped, and errors and NA's (i.e., -1's) are avoided.

The implications are obvious: this halves the average expected (i.e., target) coverage needed for the same, or better, level of accuracy, thus also cutting the laboratory budget in half.

We are writing it up for a peer-reviewed publication using a variety of mapping tools.

I just wanted to share this tidbit, especially for those who support cancer research. Please let me know if you'd like to see the data; I don't want to post images here that will also appear in a publication.

If anyone else tries this, note that trimming all reads to 25 results in the highest number of reads mapped; however, it also increases the number of valid adjacent errors. The lowest number of valid adjacent errors occurs between 30 and 35 b.
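For anyone who wants to try the trimming step, here is a minimal sketch. It is not the poster's actual pipeline: the csfasta-style format (each record is a primer base followed by color calls) and the 35-color cutoff are assumptions for illustration.

```python
def trim_record(seq, n_colors=35):
    """Trim one colorspace read (primer base + color calls) to n_colors colors.

    Reads already shorter than the cutoff are returned unchanged; the
    matching .qual record should be cut to n_colors values as well.
    """
    return seq[: 1 + n_colors]  # keep the primer base plus up to n_colors calls

def trim_csfasta(in_path, out_path, n_colors=35):
    """Trim every record in a csfasta-style file (headers pass through)."""
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            line = line.rstrip("\n")
            if line.startswith((">", "#")):
                fout.write(line + "\n")
            else:
                fout.write(trim_record(line, n_colors) + "\n")
```

Trimming before mapping like this is what lets the aligner keep a read whose 3' colors are noisy, instead of discarding the whole read for excess mismatches.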

Your feedback is welcome.

best,

jlw
pitt bac
bacdirector is offline   Reply With Quote
Old 11-02-2010, 10:26 AM   #2
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

I'm not sure I follow your logic. Surely you mean that you should run shorter read lengths, not that you should change your coverage depth. So if you were running Illumina and getting low-quality 3' ends, then I agree you might want to run with fewer cycles. I am not particularly familiar with SOLiD, so I'm not sure whether it has configurable read lengths. Anyway, I agree with your finding that one should always trim reads according to quality, and that quality tends to drop off further along each read!
nickloman is offline   Reply With Quote
Old 11-02-2010, 10:47 AM   #3
bacdirector
Member
 
Location: Pittsburgh, PA

Join Date: Oct 2009
Posts: 12
Default

Nickloman,

In SOLiD, the reads were generated using the 50 b read kit. The 'default' analysis is to use all of the reads in mapping, and to let mapping remove those with >1, 2, or 3 mismatches.

By trimming all of the reads >35 b down to 35 b, we avoid the errors that tend to occur near the 3' end of the 50 b reads, so those reads are retained during mapping. If the errors occurred evenly along the reads, we would not see a doubling of mapping.

We start with reads of all lengths, from 25 to 50 bp:

---
----
------
--
----------
-------------
--------

and trim every read longer than 35 b down to 35 b (shorter reads are left alone):

---
----
-----
--
-----
-----
-----

We are trimming in colorspace.

We have not yet explored read trimming systematically for Illumina.

I hope this clarifies.
bacdirector is offline   Reply With Quote
Old 11-02-2010, 10:50 AM   #4
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

Yeah, I understand that; I just don't understand the bit where you halve the laboratory budget! I guess that would be true if you didn't realise that trimming reads to eliminate error-prone sequence before mapping was a good idea, but I thought everyone did know that.
nickloman is offline   Reply With Quote
Old 11-02-2010, 11:00 AM   #5
mrawlins
Member
 
Location: Retirement - Not working with bioinformatics anymore.

Join Date: Apr 2010
Posts: 63
Default

It's actually fairly intuitive that mapping the shortened reads yields a greater number of mappings: the number of reads with adequate quality out to base N goes up as N shrinks, and so does the probability of randomly matching "junk" reads (though the latter is not very large at N >= 25). What would be interesting (but not terribly counter-intuitive) is how this trimming affects SNP calling, genome assembly, and/or transcriptome splice-variant calling. The shorter reads are expected to be a major liability in unambiguously calling splice variants, but would probably increase accuracy for SNP calling and possibly for genome assembly (provided adequate coverage). I'm not sure whether the shorter reads would help or hinder ChIP-Seq, as I'm not as familiar with that application.
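The random-match side of that intuition is easy to put numbers on. A back-of-the-envelope sketch (it ignores repeats, mismatch allowances, and colorspace details, and assumes a ~3.1 Gb genome):

```python
def expected_random_hits(n, genome_size=3.1e9):
    """Expected number of chance exact matches for one length-n read,
    scanning both strands of a random genome of the given size."""
    return 2 * genome_size / 4 ** n

# Spurious exact hits are already negligible at n = 25,
# but become likely for much shorter reads.
for n in (15, 20, 25, 35):
    print(n, expected_random_hits(n))
```

At n = 15 the expectation is around 6 chance hits per read, while at n = 25 it is on the order of 1e-5, which matches the observation that random matching is "not very large at N >= 25".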

This is interesting, and I wouldn't have thought to do it. I didn't quite follow what application you're targeting, though, and I doubt that this will provide benefits for all sequencing applications.
mrawlins is offline   Reply With Quote
Old 11-02-2010, 11:04 AM   #6
bacdirector
Member
 
Location: Pittsburgh, PA

Join Date: Oct 2009
Posts: 12
Default

Actually, I'm surprised to hear that you think everyone knows about this... I've not seen anyone reference this approach in published NGS studies. Then again, the 50 bp kits were added late.

I presented this result at a conference in Rhode Island last month. No one out of the hundreds in attendance using SOLiD routinely trims their 50 b reads. It's an open question whether trimming has the same effect on all mapping algorithms, such as Bowtie.

It cuts your budget in half because, instead of sequencing to the recommended (say) 15X, you no longer lose half of your reads during mapping and can achieve the same realized coverage at 7.5X.
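To make that arithmetic concrete, a small sketch; the 7.5X goal and the before/after mapping fractions (roughly 25/52 untrimmed and 50/52 trimmed, per the counts quoted in this thread) are the only inputs, and everything else is illustrative:

```python
def sequencing_target(realized_goal, mapping_fraction):
    """Coverage you must pay to sequence in order to hit a realized-coverage goal."""
    return realized_goal / mapping_fraction

# Untrimmed: ~half the reads fail to map, so 7.5X realized costs ~15X sequenced.
untrimmed = sequencing_target(7.5, 25 / 52)
# Trimmed to 35: ~96% of reads map (50 of 52M), so the same goal costs ~7.8X.
trimmed = sequencing_target(7.5, 50 / 52)

print(round(untrimmed, 1), round(trimmed, 1))
```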

It leaves me wondering what other tricks of the trade are out there that can increase coverage. By comparison, we see only a modest increase in mapping when we run the SAET tool.

So let me ask (of ABI SOLiD users only, please): do you routinely trim your reads, and what other observations about the fundamental characteristics of SOLiD data would anyone care to share here?
bacdirector is offline   Reply With Quote
Old 11-02-2010, 11:08 AM   #7
nickloman
Senior Member
 
Location: Birmingham, UK

Join Date: Jul 2009
Posts: 356
Default

Well, I guess I had better caveat that and say I was talking about Illumina sequencing, not SOLiD. But in the world of Illumina sequencing, I would say what you describe is a fairly well understood phenomenon.

I pointed this out in a blog post a long time back, although in the context of de novo assembly - see http://pathogenomics.bham.ac.uk/blog...nome-assembly/

If you say that there are loads of SOLiD users who don't quality filter their reads, then I guess you have a duty to bring it to their attention by any means necessary!
nickloman is offline   Reply With Quote
Old 11-02-2010, 11:14 AM   #8
brentp
Member
 
Location: denver, co

Join Date: Apr 2010
Posts: 72
Default

Just to clarify, you are counting uniquely mapped reads, correct?
Clearly, shorter reads will be more likely to map ambiguously; that is, they can map to multiple places in the genome.
[FWIW, I work with Illumina and *always* check reads in FastQC, then trim with the fastx-toolkit.]
brentp is offline   Reply With Quote
Old 11-02-2010, 11:16 AM   #9
bacdirector
Member
 
Location: Pittsburgh, PA

Join Date: Oct 2009
Posts: 12
Default

Thanks for the info on the blog.

Actually, we have found that the quality scores don't quite reflect the information one would hope for with SOLiD: we found empirically that mapping is highest, with the lowest reported errors, between 30 and 35, and we also found that qvals only weakly trend with base position. I think a lot of people filter first on qvals... but a 50 b read with a low qval can be 'saved' for mapping by trimming to 35; otherwise it would be lost, hence our doubling of the mapping and halving of the cost.

We have a measure we think is better than the qval; we call it ambiguity. You might be interested in it: it's the difference between the ideal entropy given a genotype call and the observed entropy. The score has a distinct distribution for homozygotes and for heterozygotes. It's calculated after mapping, and we use ambiguity, along with average realized coverage, qvals, and other measures, to prioritize variants for validation.

We are also working up a publication on this...
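The post doesn't give the exact formula, but the idea can be sketched. Assuming "ideal entropy" means 0 bits for a homozygous call and 1 bit for a balanced heterozygote, and "observed entropy" is the Shannon entropy of the pileup base counts, one plausible reading (my guess, not the authors' definition) is:

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of a vector of allele counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def ambiguity(pileup_bases, genotype):
    """|ideal - observed| entropy; a hypothetical reconstruction of the score.

    Ideal entropy: 0 bits for a homozygote, 1 bit for a 50/50 heterozygote.
    """
    ideal = 0.0 if len(set(genotype)) == 1 else 1.0
    observed = entropy(Counter(pileup_bases).values())
    return abs(ideal - observed)

print(ambiguity("AAAAAAAAAA", "AA"))  # clean homozygote: 0.0
print(ambiguity("AAAAATTTTT", "AT"))  # balanced het: 0.0
print(ambiguity("AAAAAAATTT", "AA"))  # noisy 'homozygote': ~0.88
```

Under this reading, a call whose pileup looks exactly like its genotype scores 0, and the score grows as the observed allele balance drifts away from the call.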
bacdirector is offline   Reply With Quote
Old 11-02-2010, 11:17 AM   #10
mrawlins
Member
 
Location: Retirement - Not working with bioinformatics anymore.

Join Date: Apr 2010
Posts: 63
Default

I don't remember seeing any reads shorter than the full 50 bp from our SOLiD machine, so I'm a bit surprised that you report seeing them.

For our analysis we figure that the first roughly 27 bp (with 3 mismatches) are sufficient to uniquely identify a transcript in our microbes (smaller genomes than human, and no alternative-splicing issues). We've set Bowtie to ignore all bases after that during mapping (although I think it may still use those additional bases to break ties). That works pretty well but makes it more difficult to handle alternative splicing. I'm not familiar with how NextGENe does its mapping, which may affect the results.

Also, what percentage of your reads are mapping? You mentioned you see between 50 and 100 million, but how many reads do you have overall?
mrawlins is offline   Reply With Quote
Old 11-02-2010, 11:22 AM   #11
bacdirector
Member
 
Location: Pittsburgh, PA

Join Date: Oct 2009
Posts: 12
Default @brent

No, this is total reads mapped; good point. As I said, we see an uptick in valid adjacent errors when we go lower than 30, but we see the lowest number of valid adjacent errors between 30 and 35. This means there is the least arbitrary mapping at 30-35 b. It increases, as one expects, as we allow more of each read in, at 40, 45, or 50.

We are not filtering out reads >35 b; they are just trimmed to 35.

I'll get back to you on the %reads mapped.
bacdirector is offline   Reply With Quote
Old 11-02-2010, 11:23 AM   #12
bacdirector
Member
 
Location: Pittsburgh, PA

Join Date: Oct 2009
Posts: 12
Default

50/52 million reads mapped....
bacdirector is offline   Reply With Quote
Old 11-02-2010, 03:17 PM   #13
poisson200
Member
 
Location: united kingdom

Join Date: Feb 2010
Posts: 63
Default

Quote:
Originally Posted by bacdirector View Post
50/52 million reads mapped....
96% of reads mapping; that sounds impressive to me. I wonder if that is best unique reads mapping. For some applications, I think it is better to ignore reads that map to multiple positions in the genome with the same number of mismatches and alignment length. I am guessing that you don't exclude those reads?

To be honest, it is not so novel to trim reads, even colourspace reads; we have been experimenting with that ourselves. However, although we get an improvement, we don't get 96% of total reads mapping, hence my slight doubts. If you can get a publication out of it, why not? Also, let me know; I can contribute some data :-). Did Watson and Crick really first elucidate the structure of DNA on their own? Maybe, maybe not, but they are famous. So go for a publication if you can.

I also wonder if there is a bit of a difference between colourspace and Illumina data. I think an error in colourspace causes the rest of the read to be incorrect, whereas it might look like a SNP in Illumina/base space. I am more than happy to be corrected here if I am wrong, but based on that scenario, trimming the ends off colourspace reads is possibly more beneficial than doing so for Illumina reads.

What exactly is your command line for NextGENe and/or Bowtie?
poisson200 is offline   Reply With Quote
Old 11-02-2010, 06:50 PM   #14
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Have you tried using something other than NextGENe for alignment? BFAST? Novoalign? BWA? These are aligners you should definitely assess for SOLiD data. This phenomenon seems like an alignment issue, not a technology issue.

If you have tried these algorithms and you're seeing such a phenomenon still, it may be time to call up LifeTech and ask them to come see if something is wrong with your machine. It does not sound right to me.

Quote:
Originally Posted by poisson200 View Post
I think an error in colourspace causes the rest of the read to be incorrect, whereas it might look like a SNP in Illumina/base space. I am more than happy to be corrected here if I am wrong but based on that scenario, it is possibly more beneficial for cutting off the ends of colourspace reads than Illumina's.
This is misinformation. An error in colorspace does mess with the rest of the read momentarily, but colorspace errors are easily corrected thanks to the same phenomenon because there will be an inconsistency across the different ligations, and colorspace error trends are known and modeled. In actuality, colorspace reads are a lot more accurate in theory than base space (but in chemistry, they may get messier at the ends).
__________________
Mendelian Disorder: A blogshare of random useful information for general public consumption. [Blog]
Breakway: A Program to Identify Structural Variations in Genomic Data [Website] [Forum Post]
Projects: U87MG whole genome sequence [Website] [Paper]
Michael.James.Clark is offline   Reply With Quote
Old 11-02-2010, 11:26 PM   #15
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

bacdirector,

you need to allow at least 10% errors to get decent 50 bp alignments. Trimming can help to rescue reads, we have used that strategy in SOLiD publications and a similar strategy was also used in the Gibbs/Lupski/ABI genome paper in NEJM (http://www.nejm.org/doi/full/10.1056...99307083290205).

Better yet is to use a decent mapping strategy in the first place. If you re-do your alignment with BFAST, it will find an anchor in the good part of the read and still use the lower-qv part, which should give you more coverage. Likewise, Bowtie will use the good part as a seed and still use the full-length read. Bowtie (at least the early colorspace versions) sometimes gives different base calls than BFAST for the same reads, so it may give you more reference bias, or fewer false SNPs.

Given that the read is read in 5 bp steps, your errors don't necessarily accumulate towards the end; if you have a bad primer E, you will still not be able to map those reads at 35 bp or 25 bp. With BFAST you can easily design an index that handles this.

Still, a thorough evaluation of mapping strategies for colorspace would be useful, especially to look into the effects on SNP calls.

Last edited by Chipper; 11-02-2010 at 11:31 PM.
Chipper is offline   Reply With Quote
Old 11-03-2010, 12:51 AM   #16
poisson200
Member
 
Location: united kingdom

Join Date: Feb 2010
Posts: 63
Default

Quote:
Originally Posted by Michael.James.Clark View Post
This is misinformation. An error in colorspace does mess with the rest of the read momentarily, but colorspace errors are easily corrected thanks to the same phenomenon because there will be an inconsistency across the different ligations, and colorspace error trends are known and modeled. In actuality, colorspace reads are a lot more accurate in theory than base space (but in chemistry, they may get messier at the ends).
Thanks for the correction; I am interested to know where the error trends are modelled. Is that in a publication somewhere, and/or do some mapping packages know about these errors and correct for them when mapping reads?

To clarify: you say that colourspace reads are more accurate than base space "theoretically", but because of chemistry issues, in reality they are not. Is that what you mean?
poisson200 is offline   Reply With Quote
Old 11-03-2010, 05:14 AM   #17
dsidote
Member
 
Location: New Jersey

Join Date: Aug 2009
Posts: 23
Default

Bacdirector,

What did you use to trim your SOLiD reads?

Dave
dsidote is offline   Reply With Quote
Old 11-03-2010, 07:05 AM   #18
drio
Senior Member
 
Location: 4117'49"N / 24'42"E

Join Date: Oct 2008
Posts: 323
Default

I am looking forward to seeing the number of uniquely mapped reads with trimming compared to the untrimmed version. I am with MJC: if you use BWA, NovoalignCS, or BFAST (particularly the last two), you'll see an increase in mapped reads without trimming. It would be nice if you could recompute your alignments with those and show the numbers.
__________________
-drd
drio is offline   Reply With Quote
Old 11-03-2010, 07:55 AM   #19
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

The ABI/LifeTech Bioscope software does 'progressive' mapping: it starts by considering reads that map at 50 bases (with various mismatches), then at 49 bases, and so on. This approach is probably superior to simply chopping the reads to 35 bases.

As an example, a partial statistics file from one of my recent SOLiD runs shows:

Read Length 50 0 mismatches 19,901,946 (64.86%)
Read Length 50 1 mismatches 1,336,272 ( 4.35%)
...
Read Length 35 0 mismatches 137,601 ( 0.45%)
Read Length 35 1 mismatches 40,989 ( 0.13%)
...
Down to a read length of 25.
westerman is offline   Reply With Quote
Old 11-03-2010, 09:32 AM   #20
Michael.James.Clark
Senior Member
 
Location: Palo Alto

Join Date: Apr 2009
Posts: 213
Default

Quote:
Originally Posted by poisson200 View Post
Thanks for the correction; I am interested to know where the error trends are modelled. Is that in a publication somewhere, and/or do some mapping packages know about these errors and correct for them when mapping reads?

To clarify: you say that colourspace reads are more accurate than base space "theoretically", but because of chemistry issues, in reality they are not. Is that what you mean?
Not exactly; sorry, that may have been a little unclear. You can see any of the tech specs for SOLiD to help understand how the base correction works. Check the ABI site; many publications are listed here: http://www.appliedbiosystems.com/abs...equencing.html

Basically, colorspace reads are more accurate than base space reads because of the ability to correct colorspace errors. There are (IIRC) five repeat resets during SOLiD sequencing, so a single read is observed five separate times at different ligation starting positions. Due to the increased number of observations, one can correct away errors thanks to knowing how the colorspace-to-basespace translation would be affected by a specific mismatching colorspace read.

The chemistry gets messy at the 3' end, so the accuracy starts falling off, but it doesn't get as bad as single-base sequencing after 35 bases; it's more like the last five bases or so. So 2-base-encoded colorspace reads are generally more accurate than base space reads, but along the length of the read they may be less accurate at the very 3' end. However, that inaccuracy at the end has zero impact on gapped alignment anchored by masking at the 5' end.

Nils Homer has a nice article about two-base encoding and how it works to improve accuracy here: http://www.biomedcentral.com/1471-2105/10/175
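The propagation-versus-correction distinction is visible in the encoding itself: with A, C, G, T coded as 0-3, each SOLiD color is the XOR of the two adjacent base codes, and decoding walks from the known primer base. A small sketch (the function and variable names are mine):

```python
B = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def encode(seq):
    """Dibase encoding: color i is the XOR of the codes of bases i and i+1."""
    return [B[a] ^ B[b] for a, b in zip(seq, seq[1:])]

def decode(primer, colors):
    """Decode a color read by walking base-by-base from the primer."""
    codes = [B[primer]]
    for c in colors:
        codes.append(codes[-1] ^ c)
    return "".join(BASES[c] for c in codes)

seq = "TACGGTCA"
colors = encode(seq)
assert decode(seq[0], colors) == seq  # round-trip

# A true SNP changes exactly two adjacent colors and still decodes cleanly...
snp = encode("TACAGTCA")  # G -> A at position 3
assert sum(a != b for a, b in zip(colors, snp)) == 2

# ...but a single *measurement* error garbles every base downstream when
# decoded naively, which is what makes it detectable given repeated ligations.
bad = colors[:]
bad[2] ^= 1
assert decode(seq[0], bad)[3:] != seq[3:]
```

This is the asymmetry the post describes: one discordant color is inconsistent with the other observations and can be treated as noise, while two adjacent changed colors look like a real variant.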

Last edited by Michael.James.Clark; 11-03-2010 at 09:36 AM.
Michael.James.Clark is offline   Reply With Quote

Tags
coverage, mapping, solid, trimmed reads
