![]() |
|
|
#41 |
|
Senior Member
Join Date: Feb 2008
Location: Boston
Posts: 331
|
Ben, you are welcome. I was looking at 200x36x36-071113_EAS56_0053-s_1_?.fastq.gz from SRA000271. Soap2 is also available and so you can try by yourself. Let me know if I was doing something wrong. As for SV detection, a naive but most widely used way is to cluster, based on coordinates, read pairs that are mapped with excessively large/small insert size or across chromosomes. I think the experience is we need to look at reads mapped confidently. bwa's behaviour is mostly preferred. Soap2's is also ok, but we need to be more careful to filter out wrong alignments. I a bit worry about the default bowtie behaviour on such applications, but of course we can only be sure when it gets evaluated.
[PS: Just read your reply to ewingad. I can see that bowtie will stop searching the reverse strand if it finds a hit on the forward strand. Suppose there is a segmental duplication with one copy on the forward strand and the other on the reverse. Bowtie will almost always report reads from either copy as unique and map all of them to the forward strand only (if the two copies are almost identical). This will incur higher SNP error rate and confuse SV detection. Note that your proposed method in your previous reply would not solve this case which may actually happen frequently. Always searching both strands will largely alleviate, though not completely solve, this problem. In addition, from your reply I just notice this --nostrata thing. I wonder whether bowtie keeps multiple best strata. Say the best hit contains one mismatch and two strata may yield two different 1-mismatch alignments (e.g. the mismatch occurs at 10bp and 20bp respectively). When I use --best -a, does bowtie report the alignments in both strata? I suppose it does. It is not very clear from the manual.] Last edited by lh3; 12-12-2008 at 02:11 AM. |
|
|
|
|
|
#42 |
|
Senior Member
Join Date: Nov 2008
Location: Berkeley, CA
Posts: 154
|
Hi lh3,
I first wanted to say thanks for your compliments and constructive criticisms on Bowtie. To follow up on the discussion about reporting hits for multireads: would it not be better to penalize or even ignore alignments to predefined, annotated repeats in the reference as a post-processing step? In your experiment, I wonder how many of the mapped reads fall entirely within RepeatMasker regions, for example. |
|
|
|
|
|
#43 |
|
Senior Member
Join Date: Feb 2008
Location: Boston
Posts: 331
|
Cole, pre-masking the reference genome is definitely worth trying, although I have not got much time to follow this line. Most algorithms spend a lot of time on repeats. Suffix array/tree based methods are fast largely because they collapse exact repeats and save time. However, using repeatmasker to mask genome is not the way to go. Repeatmasker does not guarantees to mask sequences having multiple copies. The sequences it masks may also be unique. Biological repeats are not necessarily sequences with multiple copies. What we more like to do is to simulate reads from each position on the reference, map them back, and then calculate some statistics indicating whether the region is a repeat. Illumina's "sequence-ability" is such a statistics, although it is not frequently use. I think a better statistics would also consider whether a read from the reference can be mapped elsewhere when there is an additional mismatch (bwa behaviour).
A potential problem with premasking is this strategy (at least the one I envision) does not work well with paired end mapping. Reads from some masked regions can be mapped correctly when pairing is in use. I can vaguely see possible solutions, but maybe it is too vague to say here. I am sure people can come up with better solutions. [PS: I am thinking a better replacement with "sequence-ability". Let S be the reference. For all position x, we extract a read S[x,x+31] and map it back to S. Define a statistics R[x] such that it equals 0 if S[x,x+31] can be mapped to multiple places; equals 1 if S[x,x+31] has an 1-difference (mismatch+indel) hit elsewhere and 2 otherwise. Suppose a 32bp read is mapped to y. We can discard the hits if R[y]=0; or downweight it in SNP calling if R[y]=1. Approximation has to be made for read length longer/shorter than 32bp. A simplified the version would just set R[x] as 0 or 2.] Last edited by lh3; 12-13-2008 at 01:32 AM. |
|
|
|
|
|
#44 |
|
Member
Join Date: Dec 2008
Location: belgium
Posts: 48
|
Hi,
first of all, we here at Ghent were very impressed with the bowtie results... We have ordered, but are still waiting for delivery, both a Solexa and a 454 sequencer and our group will be responsible for the data analysis and server installation etc. Some people here were testing the classis programs such as GMAP and BLAT and a Solexa human genome test-set would take 2-3-4 days on our 8core 32G RAM server. I went on a search and found novocraft. Taking only 6h to do the same job. Now i found bowtie and it only takes 1h i guess? I started it before the lunch break and it was finished when i came back. Very impressive (without the -p flag!). (we have a 16core 64G coming... and i will use the -p flag... mapping the genome in 10 mins amazing. I have some questions though:
./bowtie -k 3 -m 10 --best --unfa MSC_bowtie_unal_fasta human_genome ../files/file.111.fastq MSC_bowtie is the commando i want to use. I hope it will find max 10 maps per sequence and report the best 3 (combining -k 3 and --best) Will this work? Just experimenting... In a later stage i will map everything(even x100 repeats) and output it to a db, so it doesnt really matter if it doesnt work, just trying to understand the program completely. greetz, Joachim |
|
|
|
|
|
#45 |
|
Junior Member
Join Date: Dec 2007
Location: Los Angeles
Posts: 7
|
I am a new user of Bowtie, and I got one question. Is there a way to output the reads that CAN NOT be aligned to the reference genome?
|
|
|
|
|
|
#46 |
|
Junior Member
Join Date: Nov 2008
Location: Edinburgh, UK
Posts: 5
|
hi there
as of version 0.9.8 bowtie comes with the --unfa <filename> / --unfq <filename> options, which should be doing exactly what you are looking for http://bowtie-bio.sourceforge.net/manual.html#algn_unfa cheers, florian |
|
|
|
|
|
#47 |
|
Junior Member
Join Date: Dec 2007
Location: Los Angeles
Posts: 7
|
Florian, thanks a lot for your reply. I was using v0.9.7, and did not know there is a new version available. It is a very helpful function. Thanks.
|
|
|
|
|
|
#48 |
|
Junior Member
Join Date: Nov 2008
Location: Edinburgh, UK
Posts: 5
|
Yeh, I agree. However, there seems to be a problem with multi-threading, so be advised NOT to use the -p option in conjunction with it. Ben (the main developer) has told me he was already working on a solution, though, so this will only be a intermittent drawback.
|
|
|
|
|
|
#49 |
|
Senior Member
Join Date: Sep 2008
Location: Baltimore, MD
Posts: 184
|
Thanks Florian! - Indeed, version 0.9.8.1 of Bowtie was just released minutes ago and it fixes all known issues with the --unfa and --unfq options. It's highly recommended that you use that version.
Thanks, Ben |
|
|
|
|
|
#50 | ||
|
Senior Member
Join Date: Sep 2008
Location: Baltimore, MD
Posts: 184
|
Hello Joachim,
Sorry for the delay in replying! Please see responses below. Without the --nostratum flag, Bowtie will only report alignments for the best "stratum" where alignments were found. By "best stratum", I mean the best category of alignment, categorized by number of mismatches in the seed region. Say you use -k 3 and a given read aligns once with 1 mismatch in the seed and twice with 2 mismatches in the seed. If you do *not* specify --nostratum (the default), then Bowtie will only report the single 1-mismatch hit. If you do specify --nostratum, Bowtie will report all 3 hits. I'll make a note to add a clear example in the documentation for future releases. Quote:
As of 0.9.8.1 Bowtie supports --maxfa/--maxfq options that dump reads that exceed the -m limit to a separate file. If --maxfa/--maxfq is not specified but --unfa/--unfq is, then these reads are dumped to the same file as the reads that don't align at all. Quote:
Hope that helps. Thanks, Ben |
||
|
|
|
|
|
#51 |
|
Junior Member
Join Date: Dec 2007
Location: Los Angeles
Posts: 7
|
I should have read Ben's post earlier. I spent a lot of time to find out that the sequences are all reversed, which caused weird results. I will try out the 0.9.8.1 now.
|
|
|
|
|
|
#52 |
|
Member
Join Date: Jan 2009
Location: USA
Posts: 96
|
Does Bowtie work with colorspace data?
|
|
|
|
|
|
#53 |
|
Senior Member
Join Date: Sep 2008
Location: Baltimore, MD
Posts: 184
|
|
|
|
|
|
|
#54 |
|
Member
Join Date: Jan 2009
Location: USA
Posts: 96
|
thanks. looking forward to it.
|
|
|
|
|
|
#55 |
|
Senior Member
Join Date: Apr 2008
Location: Malaysia
Posts: 130
|
HI Ben/Florian
I was using bowtie and I wanted to compare the mapping qualities when converted to .map format. Would these mapping quality scores be comparable to that for MAQ results? |
|
|
|
|
|
#56 |
|
Junior Member
Join Date: Nov 2008
Location: Edinburgh, UK
Posts: 5
|
hi zee
sorry, if i gave you the wrong impression, but i'm actually not a developer of bowtie. i cannot claim any of this fame -- unfortunately :-D. as to the question, i'll better leave the answer to ben, as i'm not sure about the answer either. cheers, f |
|
|
|
|
|
#57 | |
|
Senior Member
Join Date: Nov 2008
Location: Berkeley, CA
Posts: 154
|
Quote:
Q = min {q_2 - q_1 - 4.343log(n_2), 4 + (3-k')(q_bar - 14) - 4.343log(p_1(3-k,28)) } Where: q_2 is the quality-weighted Hamming distance of the best hit, q_2 is the quality-weighted Hamming distance of the second best hit, q_bar is the average quality value on the 5' end of the read, and "p1(k,28) is the probability that a perfect hit and a k-mismatch hit coexists given a 28bp sequence which can be estimated during alignment" (from the Maq paper). Bowtie, as discussed previously in this thread, doesn't guarantee that it will find the best hit, and by default, won't even continue searching for the second best one. So Bowtie can't really compute Mapping quality this way. Instead, our Maq converter (which was derived from Heng Li's ELAND converter) calculates mapping qualities as follows: Q = (3 - k) * 25 - log(# of other equally good occurances found by Bowtie). Where k is the number of mismatches in the seed region of the alignment. This is not as nice as Maq's method. However, it works without forcing Bowtie to be used in one of it's slower modes (--all, for example). In our tests so far, Maq's assembler handles qualities computed this way pretty well and produces good SNP calls. |
|
|
|
|
|
|
#58 |
|
Senior Member
Join Date: Apr 2008
Location: Malaysia
Posts: 130
|
Cole,
Thanks for that clarification. I found that some work I was doing with bowtie seemed to produce a lot more results than what I expected, even when I did use the '--best' option for reporting hits. When I looked at how it stacked up against some other aligners, I felt that there was some kind of overestimation when using the mapping quality score to evaluate good quality SNPs, correct alignments. I would just need to be cautious how I interpret bowtie's results with respect to metric's derived from other aligners. |
|
|
|
|
|
#59 |
|
Junior Member
Join Date: Mar 2008
Location: California
Posts: 3
|
Does anyone have any recommendations for scoring params when mapping long (76bp Illumina) reads?
Also, my reads are PE -- any chance this will be supported soon? |
|
|
|
|
|
#60 |
|
Senior Member
Join Date: Sep 2008
Location: Baltimore, MD
Posts: 184
|
Hi Foram,
I would try upping both the seed length (-l) and the error tolerance (-e). Others may have better suggestions, though. If you find parameters you're happy with, please do post them back here since that will help others. I'm working on paired-end support currently. Expect it in a few weeks or so. Thanks, Ben |
|
|
|
![]() |
| Thread Tools | |
|
|