Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • I think the answer to "which is best" is nontrivial and maybe there isn't a real answer at all. In STAR I remember reading the author created kind of a hard-coded logic where that aligner will prefer a zero-mismatch spliced alignment over a continuous alignment with mismatches. What STAR is trying to overcome there is the pseudogene problem where you have that exact situation of a read that can align spliced to the gene but continuous with a mismatch or two to the related pseudogene. I think it's probably best for you to keep your scoring system and then for people using your tools to just get a good understanding of it and use the tools wisely. Power users will always want to understand this little explanation of bbmap's decisions making process for spliced alignments since that is a particularly nasty part of RNA-seq (but obviously one of the greatest powers of RNA-seq).

    Again using STAR as a reference (which is a RNA specific aligner), we are allowed to manipulate the penalties for different kinds of events. There are so-called 'gap open' and extension penalties but then there are separate penalties for introns. So in STAR I'm able to tell it NOT to penalize for introns at all leaving the alignment score alone aside from any mismatches. So by STAR logic the 50M20000N50M alignment with zero mismatches would have 100% identity. I think this is a very important aspect of RNA-seq alignment because we generally don't want to punish for finding introns.

    Since BBMap provides that little switch (via the setting on maxindel and intronlen) would it be at all possible to incorporate a switch in the scoring mechanism that when a gapped alignment contains a space long enough to be switched to 'N' from 'D' the %identity calculation ignores that region in the alignment length or applies some standard penalty (like making it equal to a mismatch or two)? Otherwise I'm afraid the identity filtering logic as applied to RNA-seq is going to severely bias away from reporting spliced alignments since I don't want to allow continuous alignments with < 10% identity however a spliced alignment with perfect base for base matches could easily have < 10% identity and I'd want to keep those.

    Am I understanding this correctly?
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment


    • To clarify, I was not planning on changing BBMap's default scoring system, just adding an additional score matrix (with, essentially, different penalties for substitutions and gap opening/extension). You can already switch between selecting the Illumina and PacBio score matrices at runtime with the "msa" flag; e.g. "msa=MultiStateAligner9PacBio"

      The current choices are:
      MultiStateAligner11ts (Designed for Illumina and generic use)
      MultiStateAligner9PacBio (Designed for PacBio and Nanopore)
      MultiStateAligner9Flat (Designed for producing scores close to % identity)

      In my testing on RNA-seq data, BBMap does an excellent good job of deciding whether to place a read with a long splice and no mismatches, or with a few mismatches and no splice, even in the presence of intronless pseudogenes. And it would probably do an even better job with a new RNA-seq-specific score matrix. But your suggestion of changing the way mismatches are penalized based on whether an alignment is spliced seems interesting; I will have to investigate whether it is practical.

      Comment


      • Java error - threads error

        Hi,

        so far loving bbmap, but got this error today.

        Here's my command line (I have 30 cores but 20 available atm, so set it to 18 because at first it tried to use 30, but still get the error below)

        ~/bin/bbmap/bbmap.sh threads=18 outputunmapped=f local=t maxindel=50 nodisk=f ref=plasmids.fna in=~/data/019/clip_fastq/1clip.fastq outu=1.fasta

        errors:
        BBMap version 33.90
        Set threads to 18
        Set DONT_OUTPUT_UNMAPPED_READS to true
        Retaining first best site only for ambiguous mappings.
        NOTE: Ignoring reference file because it already appears to have been processed.
        NOTE: If you wish to regenerate the index, please manually delete ref/genome/1/summary.txt
        Set genome to 1

        Loaded Reference: 0.084 seconds.
        Loading index for chunk 1-1, build 1
        Generated Index: 1.347 seconds.
        Analyzed Index: 4.701 seconds.
        Started output stream: 0.154 seconds.
        Cleared Memory: 0.316 seconds.
        Processing reads in paired-ended mode.
        Started read stream.
        Started 18 mapping threads.
        Exception in thread "Thread-8" java.lang.AssertionError: Read 8202, length 621, exceeds the limit of 600
        You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
        at align2.AbstractMapThread.run(AbstractMapThread.java:284)
        Exception in thread "Thread-6" java.lang.AssertionError: Read 17996, length 604, exceeds the limit of 600
        You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
        at align2.AbstractMapThread.run(AbstractMapThread.java:284)
        Exception in thread "Thread-11" java.lang.AssertionError: Read 28626, length 629, exceeds the limit of 600
        You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
        at align2.AbstractMapThread.run(AbstractMapThread.java:313)
        Exception in thread "Thread-17" java.lang.AssertionError: Read 32533, length 601, exceeds the limit of 600
        You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
        at align2.AbstractMapThread.run(AbstractMapThread.java:284)
        Exception in thread "Thread-7" java.lang.AssertionError: Read 33313, length 622, exceeds the limit of 600
        You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
        at align2.AbstractMapThread.run(AbstractMapThread.java:284)
        Exception in thread "Thread-19" java.lang.AssertionError: Read 39168, length 620, exceeds the limit of 600
        You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
        at align2.AbstractMapThread.run(AbstractMapThread.java:284)
        Exception in thread "Thread-20" java.lang.AssertionError: Read 40488, length 923, exceeds the limit of 600
        You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
        at align2.AbstractMapThread.run(AbstractMapThread.java:284)
        Exception in thread "Thread-9" java.lang.AssertionError: Read 40793, length 695, exceeds the limit of 600
        You can map the reads in chunks by reformatting to fasta, then mapping with the setting 'fastareadlen=600'
        at align2.AbstractMapThread.run(AbstractMapThread.java:313)
        Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17

        **************************************************************************

        Warning! 8 mapping threads did not terminate normally.
        Please check the error log; the output may be corrupt or incomplete.

        **************************************************************************




        ------------------ Results ------------------

        Genome: 1
        Key Length: 13
        Max Indel: 50
        Minimum Score Ratio: 0.56
        Mapping Mode: normal
        Reads Used: 81634 (18891086 bases)

        Mapping: 2.708 seconds.
        Reads/sec: 15073.10
        kBases/sec: 6976.19


        Pairing data: pct reads num reads pct bases num bases

        mated pairs: 0.0000% 0 0.0000% 0
        bad pairs: 0.0000% 0 0.0000% 0
        insert size avg: NaN


        Read 1 data: pct reads num reads pct bases num bases

        mapped: 0.1127% 46 0.1021% 9627
        unambiguous: 0.1127% 46 0.1021% 9627
        ambiguous: 0.0000% 0 0.0000% 0
        low-Q discards: 0.0000% 0 0.0000% 0

        perfect best site: 0.0000% 0 0.0000% 0
        semiperfect site: 0.0000% 0 0.0000% 0
        rescued: 0.0000% 0

        Match Rate: NA NA 85.6567% 8283
        Error Rate: 100.0000% 46 12.9162% 1249
        Sub Rate: 100.0000% 46 12.1613% 1176
        Del Rate: 34.7826% 16 0.4447% 43
        Ins Rate: 28.2609% 13 0.3102% 30
        N Rate: 34.7826% 16 1.4271% 138


        Read 2 data: pct reads num reads pct bases num bases

        mapped: 0.0759% 31 0.0610% 5768
        unambiguous: 0.0759% 31 0.0610% 5768
        ambiguous: 0.0000% 0 0.0000% 0
        low-Q discards: 0.0000% 0 0.0000% 0

        perfect best site: 0.0000% 0 0.0000% 0
        semiperfect site: 0.0000% 0 0.0000% 0
        rescued: 0.0000% 0

        Match Rate: NA NA 86.4074% 5022
        Error Rate: 100.0000% 31 12.2161% 710
        Sub Rate: 100.0000% 31 11.0117% 640
        Del Rate: 29.0323% 9 0.7571% 44
        Ins Rate: 29.0323% 9 0.4474% 26
        N Rate: 48.3871% 15 1.3765% 80
        Exception in thread "main" java.lang.RuntimeException: BBMap terminated in an error state; the output may be corrupt.
        at align2.BBMap.testSpeed(BBMap.java:469)
        at align2.BBMap.main(BBMap.java:34)
        all29c@aahl-02-mel:~/data/019/19.8$ ~/bin/bbmap/bbmap.sh threads=18 outputunmapped=f local=t maxindel=50 nodisk=f ref=plasmids.fna in=~/data/019/clip_fastq/1clip.fastq outu=1.fasta^C

        Comment


        • Susan,

          BBMap only supports reads up to 600bp in normal mode. The PacBio version (mapPacBio.sh) supports reads up to 6kbp, and all of the syntax is identical. So if 6kbp is long enough, just use that version.

          You can optionally shred long reads into shorter pieces with the flag 'maxlen', e.g. 'maxlen=600' will break all reads longer than 600bp into 600bp pieces and map them individually. So a 900bp read named "X" would be mapped as 600bp read "X_1" and 300bp read "X_2".

          I plan to overcome the 6kbp limitation, but it's not trivial so it will take a while.

          Comment


          • Hi,

            yes, I just realized that, thanks. I'm using 454 data so yes, I'll try the PacBio version.

            Thanks very much,

            S.


            Originally posted by Brian Bushnell View Post
            Susan,

            BBMap only supports reads up to 600bp in normal mode. The PacBio version (mapPacBio.sh) supports reads up to 6kbp, and all of the syntax is identical. So if 6kbp is long enough, just use that version.

            You can optionally shred long reads into shorter pieces with the flag 'maxlen', e.g. 'maxlen=600' will break all reads longer than 600bp into 600bp pieces and map them individually. So a 900bp read named "X" would be mapped as 600bp read "X_1" and 300bp read "X_2".

            I plan to overcome the 6kbp limitation, but it's not trivial so it will take a while.

            Comment


            • Another very quick question.. How can I ouput the output stats, as printed to screen, to a file?

              Thanks.

              Comment


              • You can do that with the redirect operator. BBTools print status info to standard err, not standard out, so you redirect it like this:

                bbmap.sh arguments 2>log.txt

                For reference, 1>x.txt 2>y.txt will capture both stdout and stderr to different files in case you use the 'out=stdout.fa' flag.

                Comment


                • ahhh. Thanks.. I was trying to just use '>'.

                  S.

                  Comment


                  • basecov problem

                    Hi,

                    I am trying to use the basecov option to output per base coverage so I can graph the results. However, the input file I am using is a pseudomolecule with genes separated by 150 Ns. When the output bam files is viewed, the coverage looks fine, but it does not match up with the basecov file and seems 'shifted' 3' by roughly 150bp.

                    Any ideas on what I am dong wrong?

                    Thanks,

                    S.

                    Comment


                    • Susan,

                      How big are these files? Would it be possible for you to send them to me? I am unaware of this ever happening and it sounds more like a bug than user error, but it's hard to know for sure without seeing the data...

                      Comment


                      • Brian, nevermind, my mistake, I have realised its just the coverage file includes the local alignment overlap into the 'Ns' whereas my bam viewer (Geneious) trims the overlap and doesn't show it as coverage. Problem solved. Thanks anyway.

                        S.

                        Comment


                        • OK, good to hear!

                          Comment


                          • Originally posted by Brian Bushnell View Post
                            Now, as for scoring - the affine score matrix is deeply embedded in BBMap. A 100bp read that maps perfectly with only matches has a 100% identity and BBMap score of 9970 (100 points per match, but only 70 points for the first match). If a 100bp read can map somewhere with 1 mismatch (99% identity, BBMap score of 9713: 9970+1*(-127-100-100+70)), that site will always be chosen over a site with 50bp match, 20k intron, 50bp match (4.7% identity, BBMap score of roughly 8841). But if the 100bp read can map somewhere with 5 mismatches (95% identity, BBMap score of 8685: 9970 +5*(-127-100-100+70)), the spliced alignment will be chosen, because 8841>8685. However they are very close so it would get a low mapping score.
                            Brian, do you mind explaining how the 50bp match, 20k intron, 50bp match ends up with a rough score of 8841? And in terms of the above examples that means it would have a final score ratio of about 0.887, right? thanks.
                            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                            Salk Institute for Biological Studies, La Jolla, CA, USA */

                            Comment


                            • The scoring uses affine transforms, so consecutive cigar string symbols are scored differently according to how many there are. These are the scores:

                              Code:
                              POINTS_NOREF=0;
                              POINTS_NOCALL=0;
                              POINTS_MATCH=70;
                              POINTS_MATCH2=100;
                              POINTS_SUB=-127;
                              POINTS_SUB2=-51;
                              POINTS_SUB3=-25;
                              POINTS_INS=-395;
                              POINTS_INS2=-39;
                              POINTS_INS3=-23;
                              POINTS_INS4=-8;
                              POINTS_DEL=-472;
                              POINTS_DEL2=-33;
                              POINTS_DEL3=-9;
                              POINTS_DEL4=-1;
                              POINTS_DEL5=-1;
                              
                              LIMIT_FOR_COST_2=1;
                              LIMIT_FOR_COST_3=5;
                              LIMIT_FOR_COST_4=20;
                              LIMIT_FOR_COST_5=80;
                              So, the first consecutive match is 70 points, and subsequent ones are 100 points. N is 0 points (whether in the reference or in the read). The first consecutive substitution is -127 points; subsequent ones are -51 points; and after 5 in a row, it drops to -25 points each.

                              The "LIMIT_FOR_COST_X" terms indicate how many consecutive symbols are needed before the cost drops. So, after 80 consecutive symbols (LIMIT_FOR_COST_5), the cost will drop to the lowest possible level. For deletions, this is POINTS_DEL5=-1; other types don't have a level-5 cost so just use the lowest available (e.g. POINTS_INS4 for insertions).

                              However, these costs are still not low enough to allow jumbo deletions. So I compress the deleted portions of the reference; every 128 bases are replaced by a single symbol, allowing effectively -1/128 point cost per deleted base, once the deletion is sufficiently long (though with present settings it costs -2/128).

                              Since it's pretty confusing to calculate the cost of a gap, you can currently do that with the main() function of one of the classes:

                              java -cp /path/to/current/ align2.MultiStateAligner11ts 20000

                              ...will print out "-1129", which is the cost of a 20kbp gap. For a 100bp read, that would be added to the score from matching reads (2*70 + 98*100 = 9940) for a total of 8811.

                              Before I had said 8841, because I forgot about the additional 30-point penalty due to 2 bases being in the MATCH1 state, whereas in a perfect read, only the first base is in that state.

                              Comment


                              • Holy smokes. And thanks!
                                Last edited by sdriscoll; 12-02-2014, 05:00 PM. Reason: forgot to say thanks.
                                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                                Salk Institute for Biological Studies, La Jolla, CA, USA */

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                22 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                24 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                19 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X