Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • I am using version v34.08. Another pair of reads that failed are:

    R0266248:292:C5C0TACXX:4:1101:3508:3710 97 3 193355806 42 19=1X77= = 96255219 -97100684 AATATTCCCAAGAGGATCTTGGGAGATGATGACACGTTCTCCAGTTAAGGTAAGAACATAGGCCGTCTCAGTGAGGTTCCTTAGGAGAGTAACTGCT CCCFFFFFHHHHFHIGHIJIIJIIJJJJGIIJJJJJJJJJJJIHHIIGGJFHIIHIJJJGGIJJJJJHHFHHHDF@?AEECCCEDDC??>CCDECDD NM:i:1 AM:i:42 RG:Z:AS_N#1#4#AS_N_1
    R0266248:292:C5C0TACXX:4:1101:3508:3710 145 11 96255219 44 98= = 193355806 97100684 GTAGCCATGCAGAGTTCCCAGATTTCTTCCCCTTCAGCCCAGCTTCTGTGTCTTCCCTCTGTCCACTTGGCATCTTCCCTGTGAAGATCTGTTAAAAG DDCDDDDDDDDDDDDBDDEDDDDECDABFHHECAJJIJIJHHHDFIHFJJIIGHIJIIIIJJJIGHIJJHHE8JJIJIJIJJJJIHFGHHFFFFFCCC NM:i:0 AM:i:44 RG:Z:AS_N#1#4#AS_N_1
    I used the following command to run bbmap:

    bbmap_cmd="${bbmap} \
    in1=${duk1} \
    in2=${duk2} \
    ambiguous=toss \
    rgid=${ID} \
    rgsm=${SM} \
    rgpl=${PL} \
    rglb=${LB} \
    out=${bb_prefix}.sam \
    pairlen=1000 \
    threads=10 \
    k=11 \
    trimreaddescriptions=t \
    reads=1000 \
    slow=t"
    Thanks!


    Originally posted by Brian Bushnell View Post
    Sounds like a bug, and I agree, it seems as though the mate is mapped to a different (longer) contig. What version of BBMap are you using? Also, can you give me the complete command line and, if possible, the text of the second read? Thanks!

    Comment


    • Hey Guys,

      If I have Java 8, does that mean I can't use BBMap?

      Thank you. Merry Christmas

      Comment


      • Originally posted by michaellim View Post
        Hey Guys,

        If I have Java 8, does that mean I can't use BBMap?

        Thank you. Merry Christmas
        It works fine with Java 7 or 8 Java is backwards-compatible.

        Comment


        • Thanks for fixing the chimeric alignment thing [v34.30] ! I have another issue now while running GATK. It is now quitting using the following error:

          ##### ERROR MESSAGE: SAM/BAM file SAMFileReader{test.bam} is malformed: Read R0266248:292:C5C0TACXX:4:2101:19666:82211 is missing the read group (RG) tag, which is required by the GATK.
          The SAM records for the corresponding read are:
          R0266248:292:C5C0TACXX:4:2101:19666:82211 73 1 6500339 44 98= = 6500339 0 GGTGCCACGCCCCTGTACCTGGCGTGCCAGGAGGGCCACCTGGAGGTGACCCAGTACCTGGTGCAGGAATGCGGCGCAGACCCGCACGCGCGCGCCCA @@@FFFFFHGHHHIBFHIJIGIJJIIIJJJGIIIIIIJJJJJIGHI@GHHHHHFBDFFE>B.;>CDDDDDDDDDDBDDBDDDDDBDDDDDDDDBDDDD RG:Z:AS_N#1#4#AS_N_1 AM:i:0 NM:i:0
          R0266248:292:C5C0TACXX:4:2101:19666:82211 133 1 6500339 0 * = 6500339 0 GAATTGAGGACATACAGCACAAAGAAGTTTCTGAGAATGCTTCTGTCTAGATTTAATATGAAGATAACCCGTTTCCAATGAAATCCTCAAAGCTATCC @CCFFFFFHHHHHIJJIJIJJIJJJJIHHIIJJIJHIGGJJIJJJIJJIJIJJJJIJJJGIIJJIJJJIJIHIJJJHHGEHHFFDEDDCE;ACEDCDD
          I am using paired end reads and it looks like for this particular read pair the read group info is missing for one of the ends. Can you please fix this small issue too ? Thanks a lot for your help !

          bye,
          Rahul

          Comment


          • Rahul,

            Thanks again! This is now fixed in 34.33. I was skipping the generation of optional tags for unmapped reads, which caused this problem.

            -Brian

            Comment


            • dedupe.sh question

              I recently have been experimenting with read clustering tools using pacbio data. When running dedupe.sh without any of the clustering parameters, I noticed it outputs significantly fewer reads than are input despite saying it found 0 duplicates, 0 contained sequences and 0 invalid entries. I went from a fastq file containing ~3000 reads to a file containing 130. Im wondering, what filtering is occurring behind the scenes here?

              dhowifi-19acbio4 mgraham$ wc -l 0002_Forward--0002_Reverse.fastq
              12460 0002_Forward--0002_Reverse.fastq

              dhowifi-19:shortread mgraham$ bbmap34_26/dedupe.sh in=pacbio4/0002_Forward--0002_Reverse.fastq -Xmx10g
              java -Djava.library.path=/Users/mgraham/Desktop/dev/shortread/bbmap34_26/jni/ -ea -Xmx10g -Xms10g -cp /Users/mgraham/Desktop/dev/shortread/bbmap34_26/current/ jgi.Dedupe in=pacbio4/0002_Forward--0002_Reverse.fastq -Xmx10g
              Executing jgi.Dedupe [in=pacbio4/0002_Forward--0002_Reverse.fastq, -Xmx10g]

              Initial:
              Memory: free=10129m, used=161m

              Found 0 duplicates.
              Finished exact matches. Time: 0.070 seconds.
              Memory: free=9807m, used=483m

              Found 0 contained sequences.
              Finished containment. Time: 0.036 seconds.
              Memory: free=9538m, used=752m

              Removed 0 invalid entries.
              Finished invalid removal. Time: 0.017 seconds.
              Memory: free=9538m, used=752m

              Input: 130 reads 125193 bases.
              Duplicates: 0 reads (0.00%) 0 bases (0.00%) 0 collisions.
              Containments: 0 reads (0.00%) 0 bases (0.00%) 1539 collisions.
              Result: 130 reads (100.00%) 125193 bases (100.00%)

              Time: 0.131 seconds.
              Reads Processed: 130 0.99k reads/sec
              Bases Processed: 125k 0.96m bases/sec

              Comment


              • There should not be any filtering, so I don't know what the problem is. I imagine that all of my tools would perceive that file as containing 130 reads. Would it be possible for you to email me the file, or at least the first 2000 lines of it? It's possible that it is misformatted, or contains a control character in an unexpected place, so that the program quits reading it early.

                Comment


                • Question about bbmapskimmer.sh. so it says in the readme this version is designed to find all mapping locations with alignment score above a certain threshold. So does that mean when you run it and say set minid=0.9 then you get all mappings at that score and above or do you need to set something else? I'm asking because the options appear to be identical to standard bbmap (i.e. with the ambig, maxsites2, secondary, maxsites options).

                  With all of these options identical to bbmap I don't understand how this version can output all mappings above a threshold.
                  /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                  Salk Institute for Biological Studies, La Jolla, CA, USA */

                  Comment


                  • The options are identical as much as possible; the code itself is different. bbmapskimmer.sh does not run BBMap, but a different program based on BBMap, but substantially different - BBMap gains a lot of speed by only looking for mapping sites similar to or better than the current best discovered site, and by limiting the minimum score in the dynamic programming alignment to a little below the best known score. These heuristics and early exits were removed from skimmer, so it is generally slower than BBMap. I tried to keep the interface the same for convenience.

                    Some else recently asked me roughly the same question via email, and I'll post my response here:

                    Maxsites controls the number of alignments printed for a read that maps to multiple locations, and has no function unless ambig=all. Maxsites2 controls the maximum number of potential mapping locations that are considered for alignment, which is computationally expensive; it's really a heuristic that affects speed versus accuracy. If there are more than "maxsites2" seed hits in the reference, the lowest-scoring ones will be discarded prior to alignment. The default is 800 - normally, there is no reason to change it unless you have a situation where you actually expect reads to have hundreds of correct mapping locations, and need all of them reported (like when aligning a primer sequence to a 16S database, for example). maxsites2 has an effect whether or not ambig=all.

                    With ambig=all, BBMap will only report alignments that are equal-scoring, or quite close, to the best alignment - so if the best alignment was perfect, and the second best had 4 substitutions, the second-best would NOT be reported. If you actually want every single alignment above some identity threshold, then you should use BBMapSkimmer (bbmapskimmer.sh) instead. The syntax is the same, but rather than looking for the best mapping location and then reporting it and anything else with a similar score, it is designed to report ALL alignments above a minimum quality.

                    For example:

                    bbmapskimmer.sh in=reads.fq out=mapped.sam ambig=all idtag minid=0.80 idfilter=0.9 maxsites=40

                    This will look for alignments with an approximate identity of at least 80%. Once it finds them, anything with an actual identity of under 90% will be removed (and the actual identity will be printed in the sam line). So the net result is that you will end up with everything that has at least 90% identity, up to 40 per read. The reason minid<idfilter is because minid is a heuristic based on approximate score, which acts before the true identity is known (which requires an expensive alignment), while idfilter is exact and acts AFTER alignment.

                    Comment


                    • Nice. See I use tools like RSEM and eXpress and both of those tools like to get all possible alignments up to some allowed alignment threshold. That's sort of the default output of something like bowtie or bowtie2 with the -a flag (all alignments). So it sounds like bbmapskimmer.sh outputs what I'm looking for when it comes to running those expression quantification tools. Thanks.
                      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                      Salk Institute for Biological Studies, La Jolla, CA, USA */

                      Comment


                      • BBTools programs stop reading in fastq reads when encountering 0 length sequences

                        Brian, found the issue. Our PacBio data contained 0 length sequences that were causing dedupe/bbmap to quit reading in sequences. I removed these and everything runs fine.
                        Example:

                        @m141217_232404_42136_c100683242550000001823136602221530_s1_X0/45415/ccs 0.9 18

                        +

                        @m141217_232404_42136_c100683242550000001823136602221530_s1_X0/12786/ccs 0.99 27

                        Comment


                        • OK, thanks for the report. I will modify it to process those correctly.

                          Comment


                          • Hey Brian,

                            Considering the very nice multi-threaded set of tools you have thus far, would it be out of the question to add in a variant caller for SNP/INDEL calling? It seems that those things tend to NOT be multi-threaded.
                            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                            Salk Institute for Biological Studies, La Jolla, CA, USA */

                            Comment


                            • I wrote a variant caller, which is indeed multithreaded, but I don't think it works right now because of massive changes I've made to related code over the last couple years. But it would be useful in a lot of scenarios (including error-correcting assemblies) so I have been meaning to resurrect it.

                              I will do so if I have time; that's really the limiting reagent.

                              Comment


                              • Cool. That would be a useful tool. There is a multithreaded caller in the Subread package however it is a little too barebones.
                                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                                Salk Institute for Biological Studies, La Jolla, CA, USA */

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                25 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                28 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                24 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                52 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X