  • Oh, that's kind of irritating. BBMap as currently structured has a maximum reference sequence length of 500Mbp. I designed it that way because I was unaware of any chromosomes longer than that, and I believed the reason to be that 500Mbp was above the maximum stable length of an individual chromosome... looks like I may have been wrong!

    I'll have to think about how to resolve this; there's no simple setting for it. Thanks for bringing it to my attention.

    • Thank you Brian for the quick response.
      We would really appreciate your thoughts/inputs on how to work around our issue.

      • Purely speculating. I don't know where the centromere is in this chromosome, but you could split it in a region where there are long stretches of N's (so that the pieces remain smaller than 500 Mbp); that way, the chances of reads needing to map across this break would be small.
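
The splitting idea above can be sketched in a few lines of Python. This is only an illustration, not part of BBMap: the function name `split_at_n_runs`, the `min_run` threshold, and the choice to cut at the midpoint of an N run are all assumptions made for the example.

```python
import re

MAX_PIECE = 500_000_000  # the 500 Mbp reference-length limit mentioned in this thread


def split_at_n_runs(seq, max_len=MAX_PIECE, min_run=100):
    """Return (start, end) intervals covering seq, each at most max_len long.

    Cuts are only placed at the midpoint of runs of N that are at least
    min_run bases long (min_run is an arbitrary, hypothetical threshold).
    """
    # Candidate cut points: the midpoint of every sufficiently long N run.
    cuts = [(m.start() + m.end()) // 2
            for m in re.finditer(r"[Nn]{%d,}" % min_run, seq)]
    pieces, start = [], 0
    while len(seq) - start > max_len:
        # Take the furthest cut point that still keeps this piece within max_len.
        usable = [c for c in cuts if start < c <= start + max_len]
        if not usable:
            raise ValueError(f"no N run to cut at within {max_len} bp of position {start}")
        cut = max(usable)
        pieces.append((start, cut))
        start = cut
    pieces.append((start, len(seq)))
    return pieces


# Toy example: a 140 bp "chromosome" with two N runs and a 60 bp limit.
toy = "A" * 40 + "N" * 10 + "C" * 40 + "N" * 10 + "G" * 40
print(split_at_n_runs(toy, max_len=60, min_run=5))
# -> [(0, 45), (45, 95), (95, 140)]
```

Each piece would be written out as its own reference sequence, and the piece's start offset would have to be added back to mapped coordinates afterwards.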

        • Just because I sometimes stumble over that issue in tutorials (which don't seem to bother) and also saw it again in the recent question....

          I was once taught (and lost points in a test for not knowing it) that using even k-mer sizes is frowned upon. The rationale, which makes sense to me, is that only odd k-mer sizes ensure that a k-mer can never be its own reverse complement in the de Bruijn graph. The ambiguity created by palindromic k-mers in the de Bruijn graph supposedly makes its resolution difficult.

          So to settle that question once and for all: does it really have an impact on mapping efficiency whether I choose an even k-mer or its neighboring odd k-mer?

          • No. The longer the kmer, the greater the speed (and memory consumption); even versus odd is not important.

            Additionally, I don't see that even-length kmers cause problems in assembly, either. Genomic palindromes of kmer length or longer cause problems whether you are using an even or odd kmer length. These palindromes always have an even length, but - say you have a genomic palindrome of length 22. Using K=22, you will not (trivially) be able to resolve it. Nor will you with K=21. You will with K=23, and you will with K=24. It's not clear to me in this situation why K=23 would be preferable to K=24 with regard to palindromes, but K=24 can resolve longer repeats than K=23.

            • Actually, an odd k-mer ensures that the strand orientation can be determined, since the central nucleotide can never be its own complement (an even-length k-mer can be identical to its own reverse complement).

              But the point about longer k-mers is spot-on.
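
The odd-versus-even claim discussed above is easy to check by brute force for small k. The following is a minimal sketch; the helper names `reverse_complement` and `count_self_revcomp` are made up for this example.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Complement every base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def count_self_revcomp(k):
    """Count the k-mers over ACGT that equal their own reverse complement."""
    count = 0
    for bases in product("ACGT", repeat=k):
        kmer = "".join(bases)
        if kmer == reverse_complement(kmer):
            count += 1
    return count


for k in (3, 4, 5, 6):
    print(k, count_self_revcomp(k))
# 3 0
# 4 16
# 5 0
# 6 64
```

For odd k the count is always zero, because the middle base would have to be its own complement. For even k there are 4^(k/2) such k-mers, since the first half fixes the second half.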

              • Thanks a lot for your answers! Your replies and examples gave me some more insight.

                • Hi, I have a couple of questions on the terminology used for retaining ambiguous sites using bbmap.

                  If "ambiguous=best", does this mean that if there are a bunch of reads all with the same score, only the first match will be retained? Or does it mean that, of all the reads mapping above a score cutoff, the first one will be picked?

                  Along the same lines, for "ambiguous=all", does this mean that if, say, 5 locations all share the same highest score, they will all be reported, or does it mean that all locations above the score cutoff will be retained?

                  • "ambiguous=best" is a bit misleading, but it means the genomically first location with a maximum score will be used. "ambiguous=all" will report all locations within the ambiguity threshold of the first. This does not mean they need exactly the same score; it means that they are very close, so much so that none can be confidently determined to be the correct mapping location. Normally they're identical, but if, for example, one mapping had a single 1bp deletion and another mapping had two 1bp substitutions, the scores would be different, but would be close enough for both to be reported. But if there was a third potential mapping with, say, 5 substitutions, that would be excluded. This can be controlled with the "secondarysitescoreratio" flag; if you set it to 1.0, only mappings with identical scores to the best score will be reported.
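
The selection rule described above can be modeled roughly as follows. This is a toy model and not BBMap's code: it assumes that the ambiguity threshold is a simple ratio of each candidate's score to the best score, and the function name `select_sites`, the example scores, and the 0.95 default are invented for illustration.

```python
def select_sites(sites, mode="best", score_ratio=0.95):
    """Pick which candidate mapping sites to report for one read.

    sites       -- list of (position, score) tuples for a single read
    mode        -- "best" or "all", loosely mirroring ambiguous=best / ambiguous=all
    score_ratio -- hypothetical stand-in for the ambiguity threshold; 1.0 keeps
                   only sites whose score is identical to the best score
    """
    if not sites:
        return []
    top = max(score for _, score in sites)
    if mode == "best":
        # Genomically first location among those with the maximum score.
        return [min((pos, score) for pos, score in sites if score == top)]
    if mode == "all":
        # Every location whose score is close enough to the best one.
        return sorted((pos, score) for pos, score in sites if score >= score_ratio * top)
    raise ValueError(f"unknown mode: {mode}")


candidates = [(5000, 98), (1200, 100), (300, 100), (9000, 80)]
print(select_sites(candidates, "best"))
# -> [(300, 100)]
print(select_sites(candidates, "all"))
# -> [(300, 100), (1200, 100), (5000, 98)]
print(select_sites(candidates, "all", score_ratio=1.0))
# -> [(300, 100), (1200, 100)]
```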

                    • Hi, Brian

                      We recently increased our PacBio amplicon size from ~1100 bases to 3kb. With the smaller amplicon size we were able to map reads to our allele reference sequence library of non-full-length allele sequences using "semiperfectmode" to allow for soft-clipping. I'm now looking to map ~3kb read sequences obtained from gDNA sequencing to exon reference sequences of ~270 bases apiece, and I am not able to tune the settings to get any mapping results. Is there a way to tune mapPacBio.sh to get hits where regions within long reads perfectly match short exon sequences?

                      • Hi,

                        Couldn't you just do it the other way around: have the PacBio reads as the reference and map your short refs to them?

                        Although I don't understand why your refs are so short.


                        S.

                        • I agree with Susan. BBMap is a global aligner, and not really designed to map reads to substantially shorter reference sequences. But you could try with the flags "minid=0 local", which might work. Note that "semiperfectmode" will not allow a single mismatch or indel, so it's really only useful in special situations; "local" is more appropriate in this situation.
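
As a sketch of how the suggested flags might be passed, the snippet below assembles a mapPacBio.sh command line without running it. The file names are placeholders, the helper `mappacbio_command` is hypothetical, and whether "minid=0 local" actually recovers hits for this data is untested here.

```python
import shlex


def mappacbio_command(ref, reads, out, extra=("minid=0", "local")):
    """Build an argument list for mapPacBio.sh (nothing is executed here)."""
    return ["mapPacBio.sh", f"ref={ref}", f"in={reads}", f"out={out}", *extra]


# Placeholder file names, for illustration only.
cmd = mappacbio_command("exons.fa", "reads.fq", "mapped.sam")
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where BBMap is installed and on the PATH.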

                          • @lankage: You don't have to align to the short amplicon regions. You could align to the genome (and find out if you have any non-specific amplification along the way).

                            • @moistplus: If you were to use bbmap.sh to do the alignments then you would get that information in the alignment report along with the bam file (as long as you have samtools available in $PATH).

                              • Hi Brian,

                                Since I saw increased activity lately again, I was wondering if you might have thought about the issue we discussed back in January (~post #300). It was about dedupe not writing out exact matched and contained sequence identifiers.

                                As mentioned before, solving this would make this tool very competitive with existing ones, due to the immense speed-up.

                                Thanks for your consideration!

                                Best wishes,
                                Shini
