  • #31
    Hello Ryan,

    FYI, I made multiple improvements in Bowtie 0.9.8 that will help prevent such problems for people in the future. First of all, the Mac binary available from sourceforge is now "Universal" for i386 and x86_64, whereas it was previously i386 only, which forced 64-bit Mac users to build from source. Also, by default bowtie-build will now automatically look for values for the --bmax, --dcv and --packed parameters that fit into the memory of the computer it's running on. This obviates the tedious trial-and-error of trying larger and larger --bmaxdivn values. Also, bowtie-build and bowtie-build-packed are now both "in" bowtie-build - they are no longer separate binaries. Packed mode is activated by passing -p/--packed to bowtie-build.

    I hope these improvements help you and others.

    Thanks,
    Ben



    • #32
      Hi, is there a way to find out how many equally good alignments exist for a given read using bowtie (other than using the -a option and counting them up)? Is this what field #7 ('Reserved') in the bowtie aligner output is?



      • #33
        Hi ewingad,

        That's *almost* what field #7 is. What this field really tells you is how many other rows there were in the range of suffix-array rows (equivalently: Burrows-Wheeler matrix rows) from which the alignment was randomly selected and reported. To put it another way, that field tells you how many other alignments there are with the same reference string. To put it yet another way: that field gives an estimate of how many other places the read aligns, but it can be an *underestimate* (in rare cases, it can be a large underestimate). If you want to know exactly how many places it aligns then, yes, giving bowtie the -a option and counting up the alignments is the best thing to do for now.
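The "run with -a and count them up" approach can be sketched as follows. This is a minimal sketch, assuming bowtie's default tab-separated output in which each reported alignment is one line and the first field is the read name. The function name `count_alignments` and the sample rows are hypothetical and only stand in for real output.

```python
from collections import Counter
import io


def count_alignments(lines):
    """Count reported alignments per read name (first tab-separated field)."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        read_name = line.split("\t", 1)[0]
        counts[read_name] += 1
    return counts


# Hypothetical rows standing in for the output of a `bowtie -a` run.
sample = io.StringIO(
    "read1\t+\tchr1\t100\tACGT\tIIII\t0\t\n"
    "read1\t-\tchr2\t250\tACGT\tIIII\t0\t\n"
    "read2\t+\tchr1\t900\tGGCA\tIIII\t0\t\n"
)
counts = count_alignments(sample)
for name, n in sorted(counts.items()):
    print(name, n)
```

In practice the same function could be fed an open file handle on the alignment output instead of the in-memory sample.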

        Everything's a tradeoff! The current Bowtie default maximizes performance but doesn't expend any of the additional effort needed to get a more meaningful result for field #7. Bowtie with the -a option minimizes performance but tells you exactly how many alignments there are and what they are. In the future, it's possible for me to add an option that doesn't necessarily *report* all the alignments, but which expends the effort needed to count them up and reports the right number in field #7. This will be faster than bowtie -a, but probably only a little faster. Is that a useful feature for you, or are the current options sufficient?

        Thanks,
        Ben



        • #34
          Hi Ben,

          is column #7 the (minimum) number of equally good hits? Would it be possible to report also the number of hits with an additional mismatch?



          • #35
            To Ben:

            First thank you for Bowtie. It is really amazing. I have not read Bowtie codes, but from README, I think all the BWT based aligners (together with SOAP2 and mine) share a lot in common. Some comments:

            1. I forget how much time is spent on building the BWT index, but my impression is that BWT-SW (also used in soap2 and bwa) is much faster and more lightweight. Maybe it is worth having a look at the publications by Tam's group. Of course, building the index is a one-time process. This is just a reminder of a possibly better algorithm.

            2. My main concern about bowtie is actually related to column 7. I think by default (no --best), bowtie just outputs the first group of hits it meets. Users would not know whether it is the best or whether it is a repeat or not. I think (maybe wrong) this behaviour is only useful for screening for human contamination. With "--best", the user would know the output is the best hit, but whether it is a repeat is still unknown in some cases. I know the "unknown" cases should be rare, but it would be necessary to convince users that the rare cases would not affect accuracy. Only with "--best -k 2" may a user know whether it is a repeat or not, although he/she would not know the number of occurrences. I think "--best -k 2" is the most desired behaviour and should become the default. Bowtie is fast enough. Slowing it down by a factor of 3 will still make most users quite happy (see also below). Also, quoting the speed under the default option would be unfair to others.

            3. I think it is worth mentioning that the speed of soap/bowtie/novoalign is sensitive to error rate, while that of eland/zoom/rmap is not. On bowtie's home page, it is claimed that bowtie is 35X faster than maq and 350X faster than soap. However, in my experience, on high quality data, maq and soap are about the same speed. We can only see this big difference when base quality is low. Actually in my small experiment, bowtie is 170X faster than maq (cmd option: -f -v 2; I know using quality may be a bit slower), but I guess the difference would be smaller if the error rate is higher.

            4. Comparison of BWT based aligners. So far as I know, there are three BWT based short read aligners: bowtie, soap2 and bwa. Soap2 gives the number of occurrences of the best hits and bwa also reports the number of hits with additional mismatches if the best hit is unique (like what chipper is asking for). bwa is the only one that finds short indels at present, although I am sure this is not hard for soap2/bowtie. In addition, soap2 only finds hits with up to 2 mismatches. I think they have to do so according to a brief description of its algorithm. On speed, bowtie -f -v 2 is 3X faster than soap2 (bowtie --best -v 2 -f -k 2 is similar to soap2 in speed) and soap2 is 2.5X faster than bwa. On memory, both bwa and bowtie use 2.3GB while soap2 uses 5.3GB.

            5. Back to how the alignments are reported. I think the bwa behaviour is useful if people do not care too much about speed. Knowing the number of suboptimal hits would help us to decide which alignments are reliable. I know this is important to some (not all) SV detection algorithms. If you think the bwa behaviour is costly (possibly it is), I would recommend soap2's. Frequently, we may want to know the exact number of occurrences (no need to output the detailed alignments). I am sure having the soap2 behaviour would make bowtie more popular.

            [PS: soap2 has binary available; bwa is released under GPL and so source codes are available.]

            To swbarnes2:

            I do think the referees found by Nature thought the maq indel caller was rubbish. As for the maq output, I wonder whether mapview (not maqview) suits your goal?

            To Dmamartin:

            I think your benchmark may not be very fair. On one hand, you use chr1 only, but BWT-based methods only shine when the reference genome is longer. In this respect, the comparison is unfair to bowtie and vmatch. On the other hand, ZOOM is only efficient when you feed it several million reads in a batch. You will find that the speed on aligning 20k or 40k reads is about the same for ZOOM.
            Last edited by lh3; 12-11-2008, 03:20 AM.



            • #36
              Thanks for the reply Ben, and thanks for bowtie! I was introduced to it via a talk by Lior Pachter, he speaks highly of you and Cole. In my short experience with it so far it has been extremely fast and accurate when re-mapping reads from the 1000 genomes project.

              I agree with lh3 that it may be vital to know whether a given alignment is to a repetitive sequence i.e. if other equally good alignments exist - I suppose that if there are multiple _identical_ alignments they should all be coming from the same BW rows? If this is provable then it is possible to reassure users about lh3's point #2.

              I also agree that something like lh3's point #5 is useful - when aligning short reads it is often a useful quality measure to know how many alignments would result from a 1-bp change in the query sequence. So if, for a given read, I know that the reported alignment would not change even if I changed one base, I feel more confident about that alignment.

              Thanks again!

              -Adam



              • #37
                Hey lh3,

                This discussion is very helpful - thanks, and thanks for checking out Bowtie.

                Originally posted by lh3
                To Ben:

                First thank you for Bowtie. It is really amazing. I have not read Bowtie codes, but from README, I think all the BWT based aligners (together with SOAP2 and mine) share a lot in common. Some comments:

                1. I forget how much time is spent on building the BWT index, but my impression is that BWT-SW (also used in soap2 and bwa) is much faster and more lightweight. Maybe it is worth having a look at the publications by Tam's group. Of course, building the index is a one-time process. This is just a reminder of a possibly better algorithm.
                I agree; it's entirely possible that there's something faster, e.g., what BWT-SW does. The biggest advantage of the algorithm Bowtie uses as I've implemented it is that it's quite flexible with how much memory it uses. However, I too suspect that something faster could be done. That's on my TODO list (though it's pretty far down, since, as you say, that cost is typically amortized away).

                2. My main concern about bowtie is actually related to column 7. I think by default (no --best), bowtie just outputs the first group of hits it meets. Users would not know whether it is the best or whether it is a repeat or not. I think (maybe wrong) this behaviour is only useful for screening for human contamination. With "--best", the user would know the output is the best hit, but whether it is a repeat is still unknown in some cases. I know the "unknown" cases should be rare, but it would be necessary to convince users that the rare cases would not affect accuracy. Only with "--best -k 2" may a user know whether it is a repeat or not, although he/she would not know the number of occurrences. I think "--best -k 2" is the most desired behaviour and should become the default. Bowtie is fast enough. Slowing it down by a factor of 3 will still make most users quite happy (see also below). Also, quoting the speed under the default option would be unfair to others.
                Your characterization that Bowtie "outputs the first group of hits it meets" is right - specifically, it will randomly select k distinct alignments from that range of BWT rows. (Of course, if k is greater than the number of alignments in that range, it has to go and search for another range.)

                First, let me reemphasize that I think of Bowtie's target application as mammalian resequencing - that's how I characterize it in the manual and that's what we spend our time trying to optimize it for. I also mention in the manual that Bowtie should not be considered a general-purpose alignment tool, because there are definitely alignment scenarios where Bowtie won't give the exact information needed, or may not give it all that much faster than other tools. Given that, I want to argue that the "unknown" cases are not a significant concern, and so I disagree that quoting the speed under the default option is unfair.

                First, the "unknown" cases. In resequencing, the result of the alignment step is essentially a multiple alignment across the whole genome with relatively deep coverage. Looking for, e.g., a SNP variant involves walking along the columns of the multiple alignment, looking at the alignments that span that column and combining the relevant evidence from each of the alignments to see if there's a call to be made. As described in the Maq paper (are you familiar with it?), there's a need to discount evidence from alignments whose placement was ambiguous. That's what field #7 does (or will do, when we've settled on what should actually go there; we've marked it as "reserved" for now). The good news is that we can reasonably assume deep coverage, which means that we don't have to rely on the value in field #7 for just one read to know how reliable its evidence is. We can, for example, take the max of the values for field #7 for all the reads that span the column. That's just an example; probably something smarter than max is needed. In short, I argue that A) the case where field #7 is very wrong is rare, and B) in cases where it is very wrong, we can probably correct it by looking at nearby aligned reads. So I think the problem is manageable.
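The "take the max of field #7 over the reads that span the column" idea can be sketched like this. It is only an illustration under stated assumptions: alignments are represented as hypothetical `(start, length, field7)` tuples with 0-based start offsets on a single reference sequence, and `max_field7_at` is an invented helper name, not part of Bowtie.

```python
def max_field7_at(position, alignments):
    """Largest field-#7 value among alignments covering `position`, or None."""
    values = [
        field7
        for start, length, field7 in alignments
        if start <= position < start + length
    ]
    return max(values) if values else None


# Hypothetical alignments: (0-based start offset, read length, field #7 value).
alignments = [(100, 36, 0), (110, 36, 3), (120, 36, 0), (200, 36, 7)]

# Three reads span column 125; one of them hints that the region is repetitive.
ambiguity_at_125 = max_field7_at(125, alignments)
print(ambiguity_at_125)
```

A read whose own field #7 is 0 would then inherit the larger value seen among its neighbours, which is the correction described above. As the post says, something smarter than a plain max is probably needed in practice.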

                3. I think it is worth mentioning that the speed of soap/bowtie/novoalign is sensitive to error rate, while that of eland/zoom/rmap is not. On bowtie's home page, it is claimed that bowtie is 35X faster than maq and 350X faster than soap. However, in my experience, on high quality data, maq and soap are about the same speed. We can only see this big difference when base quality is low. Actually in my small experiment, bowtie is 170X faster than maq (cmd option: -f -v 2; I know using quality may be a bit slower), but I guess the difference would be smaller if the error rate is higher.
                Those stats are helpful - thank you. I have noticed that the maq/bowtie speed ratio varies substantially based on CPU architecture and whether the input is filtered for poly-As. 35x (rounded down from about 38x) is the most Maq-favorable ratio I obtained in a set of experiments using filtered human read data from 1000 Genomes on a few different machines.

                I do mention in the manual that Bowtie is faster for higher-quality input, though I could make that point more prominent.

                4. Comparison of BWT based aligners. So far as I know, there are three BWT based short read aligners: bowtie, soap2 and bwa. Soap2 gives the number of occurrences of the best hits and bwa also reports the number of hits with additional mismatches if the best hit is unique (like what chipper is asking for). bwa is the only one that finds short indels at present, although I am sure this is not hard for soap2/bowtie. In addition, soap2 only finds hits with up to 2 mismatches. I think they have to do so according to a brief description of its algorithm. On speed, bowtie -f -v 2 is 3X faster than soap2 (bowtie --best -v 2 -f -k 2 is similar to soap2 in speed) and soap2 is 2.5X faster than bwa. On memory, both bwa and bowtie use 2.3GB while soap2 uses 5.3GB.
                Those stats are very good to know. I admit that I haven't had much time to look at soap2 and bwa.

                Note that Bowtie uses a bit more than half that amount of memory when -z is specified. -z precludes the use of some options, but the default alignment mode works fine.

                5. Back to how the alignments are reported. I think the bwa behaviour is useful if people do not care too much about speed. Knowing the number of suboptimal hits would help us to decide which alignments are reliable. I know this is important to some (not all) SV detection algorithms. If you think the bwa behaviour is costly (possibly it is), I would recommend soap2's. Frequently, we may want to know the exact number of occurrences (no need to output the detailed alignments). I am sure having the soap2 behaviour would make bowtie more popular.
                These are helpful suggestions. The TODO list for Bowtie grows at an alarming rate! In the short run, we're working on supporting paired-end alignment and indels, but we'll spend some time thinking about the best way to accomplish some of these "field #7" improvements too.

                Many thanks,
                Ben



                • #38
                  I have just done a comparison between bowtie and soap2. I am looking at 3.9 million alignments reported by both soap2 and bowtie. Soap2 says that there are 649k reads that can be placed more than once, while 171k of them have the 7th column equal to 0 in bowtie. This means bowtie will claim 26% (171/649) of repetitive alignments as unique. This ratio seems a bit high to me. (cmd: soap2, default; bowtie -f -v 2) I agree that not all of these wrong alignments may yield wrong SNPs, but SV detection will be affected a lot. Of course we need further evaluation on real data.
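For reference, the quoted ratio works out as below. The two counts are the rounded figures from the post (649k and 171k); the variable names are only illustrative.

```python
# Rounded counts quoted in the post above.
repeats_per_soap2 = 649_000          # reads soap2 places more than once
reported_unique_by_bowtie = 171_000  # of those, reads with column 7 == 0 in bowtie

fraction = reported_unique_by_bowtie / repeats_per_soap2
print(f"{fraction:.1%}")
```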



                  • #39
                    That's a good experiment - can I recreate that? Are these public reads that I can access? This will help us think about the issue of what the contents of field #7 should actually be.

                    Can you explain a little more (or point us somewhere that explains) the SV detection issue? Is there a paper that describes the algorithm you have in mind?

                    Thanks again,
                    Ben



                    • #40
                      Originally posted by ewingad
                      I agree with lh3 that it may be vital to know whether a given alignment is to a repetitive sequence i.e. if other equally good alignments exist - I suppose that if there are multiple _identical_ alignments they should all be coming from the same BW rows? If this is provable then it is possible to reassure users about lh3's point #2.
                      Yes, that's true. The alignments included in a single range of rows returned by Bowtie's search routine are "identical" in the sense that the reference characters aligned to are the same. This is provable owing to the properties of the matrix.
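The property that all occurrences of one string occupy a single contiguous range of sorted rows can be illustrated with a toy suffix array. This is not Bowtie's implementation (Bowtie searches a compressed Burrows-Wheeler index); it is a naive sketch on a short made-up reference, and `suffix_array_range` is a hypothetical helper name.

```python
from bisect import bisect_left, bisect_right


def suffix_array_range(reference, pattern):
    """Return (lo, hi, positions): the half-open range of sorted-suffix rows
    that begin with `pattern`, and the reference offsets of those rows."""
    # Naive suffix array: fine for a toy string, far too slow for a genome.
    order = sorted(range(len(reference)), key=lambda i: reference[i:])
    # Truncating sorted suffixes to the pattern length keeps them sorted,
    # so every row sharing the prefix ends up adjacent and comparable.
    prefixes = [reference[i:i + len(pattern)] for i in order]
    lo = bisect_left(prefixes, pattern)
    hi = bisect_right(prefixes, pattern)
    return lo, hi, sorted(order[lo:hi])


reference = "GATTACAGATTACA"  # made-up reference with a repeated segment
lo, hi, positions = suffix_array_range(reference, "ATTA")
print(hi - lo, positions)
```

The width of the range, `hi - lo`, is the number of places in the reference with exactly that string, which is why rows in one range correspond to alignments against identical reference characters.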

                      I also agree that something like lh3's point #5 is useful - when aligning short reads it is often a useful quality measure to know how many alignments would result from a 1-bp change in the query sequence. So if, for a given read, I know that the reported alignment would not change even if I changed one base, I feel more confident about that alignment.
                      That's good to know - thanks; for now, using -a or -k 2 (or whatever other -k you prefer) together with --nostrata should tell you what you need to know.

                      Thanks,
                      Ben



                      • #41
                        Ben, you are welcome. I was looking at 200x36x36-071113_EAS56_0053-s_1_?.fastq.gz from SRA000271. Soap2 is also available, so you can try it yourself. Let me know if I was doing something wrong. As for SV detection, a naive but most widely used way is to cluster, based on coordinates, read pairs that are mapped with an excessively large/small insert size or across chromosomes. I think the experience is that we need to look at reads mapped confidently. bwa's behaviour is mostly preferred. Soap2's is also ok, but we need to be more careful to filter out wrong alignments. I am a bit worried about the default bowtie behaviour in such applications, but of course we can only be sure when it gets evaluated.
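The naive clustering approach described above can be sketched roughly as follows. All names, thresholds and sample pairs are hypothetical. As a simplification, pairs are represented as `(chrom1, pos1, chrom2, pos2)` tuples and clusters are formed only from the first mate's coordinate, whereas a real SV caller would also consider the second mate, strand and mapping confidence.

```python
def is_discordant(pair, min_insert, max_insert):
    """A pair is discordant if it spans chromosomes or has an unusual insert size."""
    chrom1, pos1, chrom2, pos2 = pair
    if chrom1 != chrom2:
        return True
    insert = abs(pos2 - pos1)
    return insert < min_insert or insert > max_insert


def cluster_discordant(pairs, min_insert, max_insert, window):
    """Group discordant pairs whose first mates lie within `window` bp of the
    previous pair on the same chromosome (simplified single-coordinate clustering)."""
    discordant = sorted(p for p in pairs if is_discordant(p, min_insert, max_insert))
    clusters = []
    for pair in discordant:
        last = clusters[-1][-1] if clusters else None
        if last is not None and last[0] == pair[0] and pair[1] - last[1] <= window:
            clusters[-1].append(pair)
        else:
            clusters.append([pair])
    return clusters


# Hypothetical mapped read pairs.
pairs = [
    ("chr1", 1000, "chr1", 1200),   # concordant: insert size 200
    ("chr1", 5000, "chr1", 9000),   # insert size too large
    ("chr1", 5050, "chr1", 9100),   # insert size too large, close to the previous pair
    ("chr1", 20000, "chr7", 300),   # mates on different chromosomes
]
clusters = cluster_discordant(pairs, min_insert=100, max_insert=500, window=200)
print(len(clusters), [len(c) for c in clusters])
```

A repeat that is wrongly reported as unique would feed spurious pairs into this step, which is the concern raised in the post.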

                        [PS: Just read your reply to ewingad. I can see that bowtie will stop searching the reverse strand if it finds a hit on the forward strand. Suppose there is a segmental duplication with one copy on the forward strand and the other on the reverse. Bowtie will almost always report reads from either copy as unique and map all of them to the forward strand only (if the two copies are almost identical). This will incur higher SNP error rate and confuse SV detection. Note that your proposed method in your previous reply would not solve this case which may actually happen frequently. Always searching both strands will largely alleviate, though not completely solve, this problem.

                        In addition, from your reply I just noticed this --nostrata thing. I wonder whether bowtie keeps multiple best strata. Say the best hit contains one mismatch and two strata may yield two different 1-mismatch alignments (e.g. the mismatch occurs at 10bp and 20bp respectively). When I use --best -a, does bowtie report the alignments in both strata? I suppose it does. It is not very clear from the manual.]
                        Last edited by lh3; 12-12-2008, 03:11 AM.



                        • #42
                          Hi lh3,

                          I first wanted to say thanks for your compliments and constructive criticisms on Bowtie. To follow up on the discussion about reporting hits for multireads: would it not be better to penalize or even ignore alignments to predefined, annotated repeats in the reference as a post-processing step?

                          In your experiment, I wonder how many of the mapped reads fall entirely within RepeatMasker regions, for example.



                          • #43
                            Cole, pre-masking the reference genome is definitely worth trying, although I have not had much time to follow this line. Most algorithms spend a lot of time on repeats. Suffix array/tree based methods are fast largely because they collapse exact repeats and save time. However, using repeatmasker to mask the genome is not the way to go. Repeatmasker does not guarantee to mask sequences having multiple copies. The sequences it masks may also be unique. Biological repeats are not necessarily sequences with multiple copies. What we would rather do is simulate reads from each position on the reference, map them back, and then calculate some statistic indicating whether the region is a repeat. Illumina's "sequence-ability" is such a statistic, although it is not frequently used. I think a better statistic would also consider whether a read from the reference can be mapped elsewhere when there is an additional mismatch (bwa behaviour).

                            A potential problem with premasking is this strategy (at least the one I envision) does not work well with paired end mapping. Reads from some masked regions can be mapped correctly when pairing is in use. I can vaguely see possible solutions, but maybe it is too vague to say here. I am sure people can come up with better solutions.

                            [PS: I am thinking of a better replacement for "sequence-ability". Let S be the reference. For each position x, we extract a read S[x,x+31] and map it back to S. Define a statistic R[x] such that it equals 0 if S[x,x+31] can be mapped to multiple places, 1 if S[x,x+31] has a 1-difference (mismatch or indel) hit elsewhere, and 2 otherwise. Suppose a 32bp read is mapped to y. We can discard the hit if R[y]=0, or downweight it in SNP calling if R[y]=1. An approximation has to be made for read lengths longer/shorter than 32bp. A simplified version would just set R[x] to 0 or 2.]
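The R[x] statistic proposed above can be sketched with a brute-force toy. Several simplifications are assumptions of this sketch and not part of the proposal: it counts only mismatches (Hamming distance) and ignores indels, it considers the forward strand only, it compares every window against every other window (quadratic time, so suitable only for tiny inputs), and it uses a short window length `k` in the example instead of 32. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def repeat_statistic(reference, k):
    """R[x] for every window reference[x:x+k]:
    0 if an identical window exists elsewhere,
    1 if the closest other window differs by exactly one mismatch,
    2 otherwise."""
    n = len(reference) - k + 1
    kmers = [reference[x:x + k] for x in range(n)]
    scores = []
    for x, kmer in enumerate(kmers):
        best = 2
        for y, other in enumerate(kmers):
            if y == x:
                continue
            d = hamming(kmer, other)
            if d == 0:
                best = 0
                break  # cannot get more repetitive than an exact copy
            if d == 1:
                best = 1
        scores.append(best)
    return scores


# Made-up reference in which AAAA occurs at offsets 0 and 8.
scores = repeat_statistic("AAAACCCCAAAAGGGGAAAT", k=4)
print(scores)
```

A read mapped to offset y would then be discarded when `scores[y] == 0` and downweighted when `scores[y] == 1`, following the rule in the post.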
                            Last edited by lh3; 12-13-2008, 02:32 AM.



                            • #44
                              Hi,

                              first of all, we here at Ghent were very impressed with the bowtie results... We have ordered, but are still waiting for delivery, both a Solexa and a 454 sequencer and our group will be responsible for the data analysis and server installation etc.

                              Some people here were testing classic programs such as GMAP and BLAT, and a Solexa human genome test-set would take 2 to 4 days on our 8core 32G RAM server.

                              I went on a search and found novocraft. Taking only 6h to do the same job.

                              Now i found bowtie and it only takes 1h i guess? I started it before the lunch break and it was finished when i came back. Very impressive (without the -p flag!). (We have a 16core 64G server coming... and i will use the -p flag... mapping the genome in 10 mins, amazing.)


                              I have some questions though:
                              • I don't quite understand what the --nostrata flag does
                              • I am only interested in rather unique maps; the rest can go to another file and i can have a look at it later. The --unfa flag moves unmapped seqs to a file, but when i use -m 3 i will discard seqs that map more than 3 times, right? Do those go to that same file, or are they lost forever? The idea is that i want to do a preliminary analysis fast and i can remap those multimaps overnight or during a weekend when the server is not used.
                              • If i use the -k 3 flag, i want to report 3 maps, will it take the first 3 it encounters? And if i use the --best flag, will it go find all the possible maps and only report the best 3?


                              ./bowtie -k 3 -m 10 --best --unfa MSC_bowtie_unal_fasta human_genome ../files/file.111.fastq MSC_bowtie

                              is the command i want to use. I hope it will find max 10 maps per sequence and report the best 3 (combining -k 3 and --best). Will this work? Just experimenting... In a later stage i will map everything (even x100 repeats) and output it to a db, so it doesn't really matter if it doesn't work, just trying to understand the program completely.

                              greetz,
                              Joachim



                              • #45
                                I am a new user of Bowtie, and I have one question. Is there a way to output the reads that CAN NOT be aligned to the reference genome?
