Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #91
    I hope to have it submitted by the end of this year.

    -Brian

    Comment


    • #92
      Hi Brian,
      Thank you for previous answer through email which I am posting:
      "Daniel,

      Unfortunately, there is not currently a master document, though I certainly want to make one. BBMap is documented in /docs/readme.txt while everything else is just documented in its shellscript. But I do have individual threads on SeqAnswers about the most commonly-used tools:

      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


      (first post contains links to the other threads)

      For metagenomics, I consider bbmap.sh, bbsplit.sh, bbduk.sh, bbnorm.sh, pileup.sh, dedupe.sh, and stats.sh to be the most directly relevant.

      As for your specific questions:

      Is there any documentation on the mapping output summary? Ie: How do numbers in mapped pairs vs read 1 mapped differ?

      To be considered proper pairs, by default, the reads must map to opposite strands, on the same scaffold, with an insert size of under 32kbp. Read 1 will typically have a higher mapping rate than the pairing rate, because sometimes the reads map discordantly, and sometimes only one read maps. This is especially true in a metagenome which may have very short scaffolds.

      Can we output the mapping results to .fast.gz directly?

      All tools in the BBMap package can input and output gzipped files, fasta or fastq. It detects this based on filename, so a gzipped fastq file should be named "*.fq.gz" or "*.fastq.gz".

      Is it possible to bin out mappings where both reads map vs either read maps vs unmapped through command line without bitwise comparison of SAM flags?

      Yes. BBMap supports a "po" or "pairedonly" flag; you can for example do this:

      bbmap.sh in1=read1.fq in2=read2.fq po=t outm=mapped.fq outu=unmapped.fq

      ...which will split the reads into proper pairs with both mapped (outm stream) and pairs where either one or both didn't map, or they did not map in the proper orientation (outu stream). With "po=f" (pairedonly=false), outm will accept anything where at least one read mapped, and outu will only get pairs in which neither mapped."

      I am trying to find a value in mapping summary to would correspond to bowtie2's '% overall alignment rate'

      Also, which BBMap parameters should I tweak if I want to accurately want to reproduce bowtie2 results obtained using "-k 1 --fast-local" ?

      Comment


      • #93
        Daniel,

        Originally posted by dbrami View Post
        I am trying to find a value in mapping summary to would correspond to bowtie2's '% overall alignment rate'

        Also, which BBMap parameters should I tweak if I want to accurately want to reproduce bowtie2 results obtained using "-k 1 --fast-local" ?
        For paired reads, the overall alignment rate is ((percent mapped for read 1)+(percent mapped for read 2))/2. Or in other words half of the sum of the two numbers in red below.

        Code:
        Pairing data:           pct reads       num reads       pct bases          num bases
        
        mated pairs:             99.5206%            3114        99.5206%             934200
        bad pairs:                0.3196%              10         0.3196%               3000
        insert size avg:          321.31
        
        
        Read 1 data:            pct reads       num reads       pct bases          num bases
        
        mapped:                  [B][COLOR="Red"]99.9361%[/COLOR][/B]            3127        99.9361%             469050
        unambiguous:             98.6897%            3088        98.6897%             463200
        ambiguous:                1.2464%              39         1.2464%               5850
        low-Q discards:           0.0000%               0         0.0000%                  0
        
        perfect best site:        1.9175%              60         1.9175%               9000
        semiperfect site:         1.9175%              60         1.9175%               9000
        rescued:                  0.0000%               0
        
        Match Rate:                   NA               NA        62.0582%             441661
        Error Rate:              96.5462%            3019        37.6163%             267711
        Sub Rate:                87.4320%            2734         2.3412%              16662
        Del Rate:                42.8846%            1341        34.0933%             242638
        Ins Rate:                49.8881%            1560         1.1818%               8411
        N Rate:                  49.4084%            1545         0.3254%               2316
        
        
        Read 2 data:            pct reads       num reads       pct bases          num bases
        
        mapped:                  [B][COLOR="Red"]99.9041%[/COLOR][/B]            3126        99.9041%             468900
        unambiguous:             98.6897%            3088        98.6897%             463200
        ambiguous:                1.2144%              38         1.2144%               5700
        low-Q discards:           0.0000%               0         0.0000%                  0
        
        perfect best site:        1.6619%              52         1.6619%               7800
        semiperfect site:         1.6619%              52         1.6619%               7800
        rescued:                  0.0000%               0
        
        Match Rate:                   NA               NA        61.2099%             441930
        Error Rate:              96.7370%            3024        38.4685%             277739
        Sub Rate:                88.2278%            2758         2.2755%              16429
        Del Rate:                43.9859%            1375        35.0546%             253091
        Ins Rate:                49.1683%            1537         1.1384%               8219
        N Rate:                  50.7358%            1586         0.3216%               2322
        Flags equivalent to bowtie2's "-k 1 --fast-local" would be something like:

        "fast local maxindel=10"

        Comment


        • #94
          de-interleave fastq outtput

          Can BBMap de-interleave the fastq output for reads 1 and reads 2?
          Ie: if I Use "pairedonly=f outu=${B}.noplant.fastq.gz", can BBMap put the unmapped forward reads ${B}_R1.noplant.fastq.gz and the unmapped reverse reads in ${B}_R2.noplant.fastq.gz.
          Or should I just de-interleave the reads as a post process?

          Comment


          • #95
            Daniel,

            You can set "outu1=" and "outu2=" or "outm1=" and "outm2=" for output in paired files.

            Comment


            • #96
              Issues with BBMap

              Hi
              I am a newbie to NGS data analysis- basically a bench scientist trying to do analysis.
              I have sequenced human tissue samples on the PacBio RSII and am trying to align to human genome using the bbmap tool.

              As mentioned in the forum, I am trying to use the MapPacBio.sh script for the alignment. Previously I was getting an error where it could not find the class align2.MapPacBio . I hard coded the path in the file and then

              Here is the new error I am getting:

              java -Djava.library.path=/opt/compsci/bbmap/33.73/jni/ -ea -Xmx51798m -cp BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=/data/*****/PacBio/****/consensus_isoforms.fasta out=/data/***/pac88 ref=/data/shared/research_pipelines_reference_data/human/RNA/genome.fa
              Error: Could not find or load main class build=1

              Java version is "1.7.0_51"
              OpenJDK Runtime Environment (rhel-2.4.4.1.el6_5-x86_64 u51-b02)
              OpenJDK 64-Bit Server VM (build 24.45-b08, mixed mode)


              Could you please help?

              Thanks
              Prashant

              Comment


              • #97
                @prashant_singh. I am hardly an expert on BBTools. Brian will probably log in soon -- he is on USA west coast time as far as I know -- and so will give a better answer. But I happened to start up a mapPacBio run just as your post came in. Comparing your Java to my Java reveals:

                Yours ...
                Code:
                java 
                  -Djava.library.path=/opt/compsci/bbmap/33.73/jni/ 
                  -ea -Xmx51798m 
                  -cp BBMapPacBio 
                  build=1 
                  overwrite=true 
                  minratio=0.40 
                  fastareadlen=6000 
                  ambiguous=best 
                  minscaf=100 
                  startpad=10000 
                  stoppad=10000 
                  midpad=6000 
                  in=/data/*****/PacBio/****/consensus_isoforms.fasta 
                  out=/data/***/pac88  
                  ref=/data/shared/research_pipelines_reference_data/human/RNA/genome.fa
                Mine ...
                Code:
                java 
                  -Djava.library.path=/group/bioinfo/apps/apps/BBMap-33.57/jni/ 
                  -ea -Xmx126058m 
                  -cp /group/bioinfo/apps/apps/BBMap-33.57/current/ 
                  align2.BBMapPacBio 
                  build=1 
                  overwrite=true 
                  minratio=0.40 
                  fastareadlen=6000 
                  ambiguous=best 
                  minscaf=100 
                  startpad=10000 
                  stoppad=10000 
                  midpad=6000 
                  ref=kmer60-10.fa 
                  in=all_spades_2000plus.fa 
                  out=LR_vs_scaffolds.sam
                The major different is the 'cp' where I have the complete path plus the line after the 'cp' which shows the program to run.

                You mentioned:

                Previously I was getting an error where it could not find the class align2.MapPacBio . I hard coded the path in the file and then
                ....
                Error: Could not find or load main class build=1
                I suspect that you mis-coded the path and probably erased the part which states the program to run. I suggest starting anew and if you really need to hard-code the path then be careful in doing so.

                Comment


                • #98
                  That's correct, and yes, I'm on West Coast time The command line should probably be:

                  java -ea -Xmx51798m -cp /opt/compsci/bbmap/33.73/current/ align2.BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 in=/data/*****/PacBio/****/consensus_isoforms.fasta out=/data/***/pac88.sam ref=/data/shared/research_pipelines_reference_data/human/RNA/genome.fa

                  Note both of the things in bold: "/opt/compsci/bbmap/33.73/current/" is probably correct, but maybe not, depending on how it was installed. I'm guessing that somewhere under "/opt/compsci/bbmap/33.73/" is a subdirectory named "current" - that is what you want to point it to.

                  For "out=/data/***/pac88.sam", I just added the ".sam" suffix. All of the programs in BBTools are extension-sensitive, so if you don't include an extension, it will just output the default format (which in this case is sam, but it's always best to specify it).

                  Comment


                  • #99
                    I apologize if this has been covered already but I've just downloaded and run bbmap a few times (seems great so far and I like the control provided via the available command line options) and I have a quick question. When building the index is it possible for bbmap to ignore parts of the reference names past the first blank space? For example I'm using the Ensembl human release and the FASTA reference names look something like this:

                    >1 dna_sm:chromosome chromosome:GRCh37:1:1:249250621:1 REF

                    Currently when I run alignments on this FASTA file that entire name shows up in the SAM file while instead I should only see "1". Other aligners (STAR, bowtie, bwa, etc) throw out everything after the first space when indexing leaving just the main reference name. This is particularly important for using aligners as part of a gene expression pipeline using something like eXpress since some transcriptome extraction tools (programs that parse a GTF and extract transcripts from an accompanying genome FASTA file) usually name the transcripts as >TRANSCRIPT_ID GENE_SYMBOL. eXpress will throw out everything after the first space however if the alignments contain the full reference names then no alignments are matched up to transcripts in the output.
                    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                    Salk Institute for Biological Studies, La Jolla, CA, USA */

                    Comment


                    • Yes, I put in an option for that because some programs require it. The flag is trd=t (or trimreaddescriptions=true), which will truncate all names (both read names and reference names) at the first whitespace symbol.

                      Comment


                      • rad. thanks for that.
                        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                        Salk Institute for Biological Studies, La Jolla, CA, USA */

                        Comment


                        • OK one more question. What would the correct alignment option setting be to get alignments with 0 mismatches and still allow splices? If it's not possible I can just do downstream filtering.

                          Thanks.
                          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                          Salk Institute for Biological Studies, La Jolla, CA, USA */

                          Comment


                          • There is no option for that, unfortunately.

                            Comment


                            • OK, thanks. Also thanks for the aligner.
                              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                              Salk Institute for Biological Studies, La Jolla, CA, USA */

                              Comment


                              • You're welcome!

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                8 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                8 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                49 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                67 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X