SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie: Ultrafast and memory-efficient alignment of short reads to the human genome Ben Langmead Literature Watch 2 03-04-2013 03:06 AM
The best short read aligner Deutsche Bioinformatics 4 04-14-2011 08:12 PM
Short Read Micro re-Aligner Paper nilshomer Literature Watch 0 10-29-2010 10:59 AM
New Short Read Aligner sparks Bioinformatics 48 08-26-2009 09:01 AM
Very Short Read aligner Rupinder Bioinformatics 1 06-02-2009 08:10 PM

Reply
 
Thread Tools
Old 12-12-2008, 02:13 AM   #41
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Ben, you are welcome. I was looking at 200x36x36-071113_EAS56_0053-s_1_?.fastq.gz from SRA000271. Soap2 is also available and so you can try by yourself. Let me know if I was doing something wrong. As for SV detection, a naive but most widely used way is to cluster, based on coordinates, read pairs that are mapped with excessively large/small insert size or across chromosomes. I think the experience is we need to look at reads mapped confidently. bwa's behaviour is mostly preferred. Soap2's is also ok, but we need to be more careful to filter out wrong alignments. I a bit worry about the default bowtie behaviour on such applications, but of course we can only be sure when it gets evaluated.

[PS: Just read your reply to ewingad. I can see that bowtie will stop searching the reverse strand if it finds a hit on the forward strand. Suppose there is a segmental duplication with one copy on the forward strand and the other on the reverse. Bowtie will almost always report reads from either copy as unique and map all of them to the forward strand only (if the two copies are almost identical). This will incur higher SNP error rate and confuse SV detection. Note that your proposed method in your previous reply would not solve this case which may actually happen frequently. Always searching both strands will largely alleviate, though not completely solve, this problem.

In addition, from your reply I just notice this --nostrata thing. I wonder whether bowtie keeps multiple best strata. Say the best hit contains one mismatch and two strata may yield two different 1-mismatch alignments (e.g. the mismatch occurs at 10bp and 20bp respectively). When I use --best -a, does bowtie report the alignments in both strata? I suppose it does. It is not very clear from the manual.]

Last edited by lh3; 12-12-2008 at 03:11 AM.
lh3 is offline   Reply With Quote
Old 12-12-2008, 06:53 PM   #42
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Hi lh3,

I first wanted to say thanks for your compliments and constructive criticisms on Bowtie. To follow up on the discussion about reporting hits for multireads: would it not be better to penalize or even ignore alignments to predefined, annotated repeats in the reference as a post-processing step?

In your experiment, I wonder how many of the mapped reads fall entirely within RepeatMasker regions, for example.
Cole Trapnell is offline   Reply With Quote
Old 12-13-2008, 02:09 AM   #43
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Cole, pre-masking the reference genome is definitely worth trying, although I have not got much time to follow this line. Most algorithms spend a lot of time on repeats. Suffix array/tree based methods are fast largely because they collapse exact repeats and save time. However, using repeatmasker to mask genome is not the way to go. Repeatmasker does not guarantees to mask sequences having multiple copies. The sequences it masks may also be unique. Biological repeats are not necessarily sequences with multiple copies. What we more like to do is to simulate reads from each position on the reference, map them back, and then calculate some statistics indicating whether the region is a repeat. Illumina's "sequence-ability" is such a statistics, although it is not frequently use. I think a better statistics would also consider whether a read from the reference can be mapped elsewhere when there is an additional mismatch (bwa behaviour).

A potential problem with premasking is this strategy (at least the one I envision) does not work well with paired end mapping. Reads from some masked regions can be mapped correctly when pairing is in use. I can vaguely see possible solutions, but maybe it is too vague to say here. I am sure people can come up with better solutions.

[PS: I am thinking a better replacement with "sequence-ability". Let S be the reference. For all position x, we extract a read S[x,x+31] and map it back to S. Define a statistics R[x] such that it equals 0 if S[x,x+31] can be mapped to multiple places; equals 1 if S[x,x+31] has an 1-difference (mismatch+indel) hit elsewhere and 2 otherwise. Suppose a 32bp read is mapped to y. We can discard the hits if R[y]=0; or downweight it in SNP calling if R[y]=1. Approximation has to be made for read length longer/shorter than 32bp. A simplified the version would just set R[x] as 0 or 2.]

Last edited by lh3; 12-13-2008 at 02:32 AM.
lh3 is offline   Reply With Quote
Old 12-16-2008, 07:57 AM   #44
joa_ds
Member
 
Location: belgium

Join Date: Dec 2008
Posts: 52
Default

Hi,

first of all, we here at Ghent were very impressed with the bowtie results... We have ordered, but are still waiting for delivery, both a Solexa and a 454 sequencer and our group will be responsible for the data analysis and server installation etc.

Some people here were testing the classis programs such as GMAP and BLAT and a Solexa human genome test-set would take 2-3-4 days on our 8core 32G RAM server.

I went on a search and found novocraft. Taking only 6h to do the same job.

Now i found bowtie and it only takes 1h i guess? I started it before the lunch break and it was finished when i came back. Very impressive (without the -p flag!). (we have a 16core 64G coming... and i will use the -p flag... mapping the genome in 10 mins amazing.


I have some questions though:
  • I don't quite understand what the 'nostrattum' flag does
  • I am only interested in rather unique maps, the rest can go to another file and i can have a look at it later. the --unfa flag moves unmapped seqs to a file, but when i use -m 3 i will discard seqs that map more than 3 times, right? Those go that same file? or they are lost forever? The idea is that i want to do a preliminary analyses fast and i can remap those multimaps overnight or during a weekend when the server is not used.
  • If i use the -k 3 flag, i want to report 3 maps, will it take the first 3 it encounters? And if i use the --best flag, will it go find all the possible maps and only report the best 3?

./bowtie -k 3 -m 10 --best --unfa MSC_bowtie_unal_fasta human_genome ../files/file.111.fastq MSC_bowtie

is the commando i want to use. I hope it will find max 10 maps per sequence and report the best 3 (combining -k 3 and --best) Will this work? Just experimenting... In a later stage i will map everything(even x100 repeats) and output it to a db, so it doesnt really matter if it doesnt work, just trying to understand the program completely.

greetz,
Joachim
joa_ds is offline   Reply With Quote
Old 01-07-2009, 09:36 AM   #45
xwu
Junior Member
 
Location: Los Angeles

Join Date: Dec 2007
Posts: 9
Default

I am a new user of Bowtie, and I got one question. Is there a way to output the reads that CAN NOT be aligned to the reference genome?
xwu is offline   Reply With Quote
Old 01-07-2009, 09:42 AM   #46
florian
Junior Member
 
Location: Edinburgh, UK

Join Date: Nov 2008
Posts: 7
Default

hi there

as of version 0.9.8 bowtie comes with the --unfa <filename> / --unfq <filename> options, which should be doing exactly what you are looking for

http://bowtie-bio.sourceforge.net/manual.html#algn_unfa

cheers,

florian
florian is offline   Reply With Quote
Old 01-07-2009, 09:48 AM   #47
xwu
Junior Member
 
Location: Los Angeles

Join Date: Dec 2007
Posts: 9
Default

Florian, thanks a lot for your reply. I was using v0.9.7, and did not know there is a new version available. It is a very helpful function. Thanks.
xwu is offline   Reply With Quote
Old 01-07-2009, 09:53 AM   #48
florian
Junior Member
 
Location: Edinburgh, UK

Join Date: Nov 2008
Posts: 7
Default

Yeh, I agree. However, there seems to be a problem with multi-threading, so be advised NOT to use the -p option in conjunction with it. Ben (the main developer) has told me he was already working on a solution, though, so this will only be a intermittent drawback.
florian is offline   Reply With Quote
Old 01-07-2009, 10:35 AM   #49
Ben Langmead
Senior Member
 
Location: Baltimore, MD

Join Date: Sep 2008
Posts: 200
Default

Thanks Florian! - Indeed, version 0.9.8.1 of Bowtie was just released minutes ago and it fixes all known issues with the --unfa and --unfq options. It's highly recommended that you use that version.

Thanks,
Ben
Ben Langmead is offline   Reply With Quote
Old 01-07-2009, 10:50 AM   #50
Ben Langmead
Senior Member
 
Location: Baltimore, MD

Join Date: Sep 2008
Posts: 200
Default

Hello Joachim,

Sorry for the delay in replying! Please see responses below.

Quote:
Originally Posted by joa_ds View Post
[*]I don't quite understand what the 'nostrattum' flag does
Without the --nostratum flag, Bowtie will only report alignments for the best "stratum" where alignments were found. By "best stratum", I mean the best category of alignment, categorized by number of mismatches in the seed region. Say you use -k 3 and a given read aligns once with 1 mismatch in the seed and twice with 2 mismatches in the seed. If you do *not* specify --nostratum (the default), then Bowtie will only report the single 1-mismatch hit. If you do specify --nostratum, Bowtie will report all 3 hits.

I'll make a note to add a clear example in the documentation for future releases.

Quote:
[*]I am only interested in rather unique maps, the rest can go to another file and i can have a look at it later. the --unfa flag moves unmapped seqs to a file, but when i use -m 3 i will discard seqs that map more than 3 times, right? Those go that same file? or they are lost forever? The idea is that i want to do a preliminary analyses fast and i can remap those multimaps overnight or during a weekend when the server is not used.
(Let me answer your question with respect to the just released 0.9.8.1 version of Bowtie, since version 0.9.8 had issues with --unfa and --unfq.)

As of 0.9.8.1 Bowtie supports --maxfa/--maxfq options that dump reads that exceed the -m limit to a separate file. If --maxfa/--maxfq is not specified but --unfa/--unfq is, then these reads are dumped to the same file as the reads that don't align at all.

Quote:
[*]If i use the -k 3 flag, i want to report 3 maps, will it take the first 3 it encounters? And if i use the --best flag, will it go find all the possible maps and only report the best 3?

./bowtie -k 3 -m 10 --best --unfa MSC_bowtie_unal_fasta human_genome ../files/file.111.fastq MSC_bowtie

is the commando i want to use. I hope it will find max 10 maps per sequence and report the best 3 (combining -k 3 and --best) Will this work? Just experimenting... In a later stage i will map everything(even x100 repeats) and output it to a db, so it doesnt really matter if it doesnt work, just trying to understand the program completely.
Based on what you say, yes, that command will do what you intend. -k 3 --best should guarantee that you get up to three alignments of the "best" kind (best in terms of # of mismatches in the seed) and -m 10 will ensure that no alignments are reported for a read that aligns to more than 10 places. If you don't care whether the alignments come from the same strata, then you should also use "--nostrata".

Hope that helps.

Thanks,
Ben
Ben Langmead is offline   Reply With Quote
Old 01-12-2009, 09:33 AM   #51
xwu
Junior Member
 
Location: Los Angeles

Join Date: Dec 2007
Posts: 9
Default

I should have read Ben's post earlier. I spent a lot of time to find out that the sequences are all reversed, which caused weird results. I will try out the 0.9.8.1 now.
xwu is offline   Reply With Quote
Old 01-12-2009, 09:57 AM   #52
doxologist
Member
 
Location: USA

Join Date: Jan 2009
Posts: 96
Default

Does Bowtie work with colorspace data?
doxologist is offline   Reply With Quote
Old 01-12-2009, 01:39 PM   #53
Ben Langmead
Senior Member
 
Location: Baltimore, MD

Join Date: Sep 2008
Posts: 200
Default

Quote:
Originally Posted by doxologist View Post
Does Bowtie work with colorspace data?
Not yet - that's on the TODO list but we're going to tackle paired-end alignment and gapped alignments first.

Thanks,
Ben
Ben Langmead is offline   Reply With Quote
Old 01-12-2009, 01:52 PM   #54
doxologist
Member
 
Location: USA

Join Date: Jan 2009
Posts: 96
Default

thanks. looking forward to it.
doxologist is offline   Reply With Quote
Old 01-13-2009, 04:34 AM   #55
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

HI Ben/Florian

I was using bowtie and I wanted to compare the mapping qualities when converted to .map format. Would these mapping quality scores be comparable to that for MAQ results?
zee is offline   Reply With Quote
Old 01-13-2009, 10:09 AM   #56
florian
Junior Member
 
Location: Edinburgh, UK

Join Date: Nov 2008
Posts: 7
Default

hi zee

sorry, if i gave you the wrong impression, but i'm actually not a developer of bowtie. i cannot claim any of this fame -- unfortunately :-D.
as to the question, i'll better leave the answer to ben, as i'm not sure about the answer either.

cheers,

f
florian is offline   Reply With Quote
Old 01-13-2009, 01:25 PM   #57
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by zee View Post
I was using bowtie and I wanted to compare the mapping qualities when converted to .map format. Would these mapping quality scores be comparable to that for MAQ results?
Bowtie's mapping qualities are only a rough approximation of Maq's. Maq computes mapping quality like this:

Q = min {q_2 - q_1 - 4.343log(n_2), 4 + (3-k')(q_bar - 14) - 4.343log(p_1(3-k,28)) }

Where:

q_2 is the quality-weighted Hamming distance of the best hit, q_2 is the quality-weighted Hamming distance of the second best hit, q_bar is the average quality value on the 5' end of the read, and "p1(k,28) is the probability that a perfect hit and a k-mismatch hit coexists given a 28bp sequence which can be estimated during alignment" (from the Maq paper).

Bowtie, as discussed previously in this thread, doesn't guarantee that it will find the best hit, and by default, won't even continue searching for the second best one. So Bowtie can't really compute Mapping quality this way. Instead, our Maq converter (which was derived from Heng Li's ELAND converter) calculates mapping qualities as follows:

Q = (3 - k) * 25 - log(# of other equally good occurances found by Bowtie).

Where k is the number of mismatches in the seed region of the alignment. This is not as nice as Maq's method. However, it works without forcing Bowtie to be used in one of it's slower modes (--all, for example). In our tests so far, Maq's assembler handles qualities computed this way pretty well and produces good SNP calls.
Cole Trapnell is offline   Reply With Quote
Old 01-13-2009, 09:30 PM   #58
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

Cole,

Thanks for that clarification. I found that some work I was doing with bowtie seemed to produce a lot more results than what I expected, even when I did use the '--best' option for reporting hits. When I looked at how it stacked up against some other aligners, I felt that there was some kind of overestimation when using the mapping quality score to evaluate good quality SNPs, correct alignments.
I would just need to be cautious how I interpret bowtie's results with respect to metric's derived from other aligners.
zee is offline   Reply With Quote
Old 01-20-2009, 04:50 PM   #59
foram
Junior Member
 
Location: California

Join Date: Mar 2008
Posts: 3
Default

Does anyone have any recommendations for scoring params when mapping long (76bp Illumina) reads?

Also, my reads are PE -- any chance this will be supported soon?
foram is offline   Reply With Quote
Old 01-21-2009, 06:47 AM   #60
Ben Langmead
Senior Member
 
Location: Baltimore, MD

Join Date: Sep 2008
Posts: 200
Default

Hi Foram,

I would try upping both the seed length (-l) and the error tolerance (-e). Others may have better suggestions, though. If you find parameters you're happy with, please do post them back here since that will help others.

I'm working on paired-end support currently. Expect it in a few weeks or so.

Thanks,
Ben

Quote:
Originally Posted by foram View Post
Does anyone have any recommendations for scoring params when mapping long (76bp Illumina) reads?

Also, my reads are PE -- any chance this will be supported soon?
Ben Langmead is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:40 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO