Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #61
    Hello Brian

    I am trying to compare some mapping tools, and I am also using BBMap
    I execute 4 tests using PE readed preprocessed with bbduck, cutadapt, fastx and trimmomatic respectively
    With the reads preprocessed using bbduck and trimmomatic the program is able to finish without any error

    When I run the reads preprocessed with cutadapt or fastx bbmap crashes...

    do you have any idea why is this happening?

    Here is the parameters I used and the stde output:
    overwrite=true fastareadlen=500 -Xmx200g threads=96 outputunmapped=f sam=1.4

    BBMap version 32.15
    Set overwrite to true
    Set threads to 96
    Set DONT_OUTPUT_UNMAPPED_READS to true
    Retaining first best site only for ambiguous mappings.
    Set genome to 1

    Loaded Reference: 9.674 seconds.
    Loading index for chunk 1-7, build 1
    Generated Index: 15.609 seconds.
    Analyzed Index: 22.798 seconds.
    Started output stream: 0.345 seconds.
    Started output stream: 0.014 seconds.
    Cleared Memory: 0.146 seconds.
    Processing reads in paired-ended mode.
    Started read stream.
    Started 96 mapping threads.
    Exception in thread "Thread-110" java.lang.IndexOutOfBoundsException: Index: 172, Size: 172
    at java.util.ArrayList.rangeCheck(ArrayList.java:571)
    at java.util.ArrayList.get(ArrayList.java:349)
    at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:484)
    at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:361)
    at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:205)
    at java.lang.Thread.run(Thread.java:679)

    Comment


    • #62
      Salvatore,

      That error is because there are unequal numbers of reads in the read1 and read2 files (I'll add a better error message that explains it). This happens when a read-trimming tools throws away one read in a pair because it's too short, but not the other. This also ruins the pairing order, so mapping paired reads will no longer work correctly. BBDuk and Trimmomatic both keep paired reads together correctly.

      This is a known problem with fastx; you basically should not use it with paired reads. I think cutadapt is able to handle pairs correctly when you use the "-p" flag, but I'm not sure.

      You can re-pair the broken files with my repair.sh script, in the latest release of BBTools (32.27), like this:
      repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq

      Comment


      • #63
        Originally posted by WhatsOEver View Post
        Sorry, but I don't get it completely.

        What is your definition of an "unambiguously mapped read" if it has secondary alignments? If secondary alignments are printed as a result of a more liberal threshold, the read is no longer unambiguous under the used threshold, is it? This should be independent of how bad your sec alignments are, the chosen threshold shouldn't change the terminology, should it?
        So in my case I have ~16K reads which are "unambiguous" according to your definition but have seconday alignments?
        I wrote a function that determines whether or not I classify a read as ambiguous. It takes into account various factors:
        1) how high the best score is
        2) the difference there is between the best score and the second best score
        3) how many additional sites are within a certain range of the best score
        It's not a closed-form equation, and it's completely subjective (that's what I meant by subjective, not the SAM spec); I optimized it empirically by iteratively tweaking it to minimize false-positives and maximize true-positives. And that alone decides whether or not I classify a read as "ambiguous".

        When "ambig=all" is chosen, all sites with a score of at least 95% of the best score are printed, whether or not the best site is considered ambiguous. This is a bit different. I think I'll change it to make this number a parameter, and perhaps only print secondary sites for reads that are considered ambiguous, which would make more sense.

        But being aware of that also helps, thanks again Brian for a fast and helpful response. I will now stop my work on mapper-evaluation for my projects and focus on the next steps - so I won't bother you with additional things in the next time
        Feel free to post whatever questions you have; they may be shared by other people, and may indicate that I could be doing something better or more clearly.

        Comment


        • #64
          thank you

          Originally posted by Brian Bushnell View Post
          Salvatore,

          That error is because there are unequal numbers of reads in the read1 and read2 files (I'll add a better error message that explains it). This happens when a read-trimming tools throws away one read in a pair because it's too short, but not the other. This also ruins the pairing order, so mapping paired reads will no longer work correctly. BBDuk and Trimmomatic both keep paired reads together correctly.

          This is a known problem with fastx; you basically should not use it with paired reads. I think cutadapt is able to handle pairs correctly when you use the "-p" flag, but I'm not sure.

          You can re-pair the broken files with my repair.sh script, in the latest release of BBTools (32.27), like this:
          repair.sh in1=r1.fq in2=r2.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq
          Thank you a lot Brian
          Your explanation was as always very simple and clear
          I could trimm in PE mode using Cutadapt and then was able to map the reads
          Nothing to do for Fastx

          thanks again

          Salvatore

          Comment


          • #65
            Brian,

            I'm wondering if you can point or provide me with some recommended parameters for running shRNAseq datasets on BBMap.

            My initial efforts using BBMap using the default parameters did yield a sam file, but when I converted it to a sorted bam files and attempted to use it on IGV, it said "does not contain any sequence names which match the current genome."

            I'm pretty sure I'm overlooking something. I did create the indices with the miRBase mature.fa file. That seemed to work fine, but noticed in the output that said:

            No index available; generating from reference genome:

            Any help would be appreciated.

            Comment


            • #66
              coryfunk,

              IGV is very sensitive to the names of scaffolds; they must be identical in the fasta used for mapping and the fasta used by IGV. When I encounter that error, "does not contain any sequence names which match the current genome", it's usually because I'm using the wrong reference. so if, for example, you are mapping to a fasta of RNAs and trying to display the results in IGV against the full genome, that won't work.

              The methodology would be something like this:

              index:
              bbmap.sh ref=mature.fa

              map:
              bbmap.sh in=reads.fq out=mapped.sam bs=script.sh

              translate to bam:
              sh script.sh
              (that creates and runs a shellscript that creates a sorted indexed bam file from the sam output, if samtools is installed, though of course you can do it manually too)

              Now, run IGV and choose "load genome" and point it to the same fasta file that you used for indexing, then load the bam file. If it still does not work, look at the header of the sam file, and verify that the names of the scaffolds in IGV are the same as the names in the sam header. Or, actually, that's what you should do first.

              As for specifics about good settings to use for shRNAseq, I have no experience with it. Normally for RNAseq I suggest setting the "maxindel" flag to an appropriate value for the organism's expected intron length (say, 200000 for vertebrates or 8000 for fungi and plants with short introns). But if shRNAseq involves only short, unspliced RNAs, then you can just go with the defaults.

              -Brian

              Comment


              • #67
                rel. 33.04 fails

                Hi Brian
                I constantly update bbmap and now ran into problems when using the newest release 33.04 - it performs the mapping, but then fails during the Results statistics reporting, when 32.23 reports perfectly fine for exactly the same command:

                ------------------ Results ------------------

                Genome: 1
                Key Length: 13
                Max Indel: 10
                Minimum Score Ratio: 0.65
                Mapping Mode: normal
                Reads Used: 7525278 (1888844778 bases)

                Mapping: 96,483 seconds.
                Reads/sec: 38997,80
                kBases/sec: 19576,90


                Pairing data: pct reads num reads pct bases num bases

                mated pairs: 0,0231% 869 0,0231% 436238
                bad pairs: 0,0037% 139 0,0037% 69778
                insert size avg: 675,02
                Exception in thread "main" java.lang.NullPointerException
                at align2.ReadStats.mergeAll(ReadStats.java:140)
                at align2.AbstractMapper.printOutput(AbstractMapper.java:1473)
                at align2.BBMap.testSpeed(BBMap.java:432)
                at align2.BBMap.main(BBMap.java:31)
                What is going on here?
                Cheers
                Harald
                Last edited by HGV; 07-09-2014, 03:47 AM. Reason: forgot to mention last working release

                Comment


                • #68
                  Harald,

                  Thanks for finding that! I added a new histogram, "qahist" (quality accuracy histogram), which correlates the claimed quality of bases with their actual quality as measured by error rate. But if you use the "qhist" flag and not the "qahist" flag, that crash occurred because an array was not getting initialized. It's fixed now; I'll upload the fixed version later today.

                  -Brian

                  Comment


                  • #69
                    I've uploaded the fixed version (33.08) to sourceforge.

                    Comment


                    • #70
                      Repair.sh

                      Hello,

                      I got the same error saying my files had a different number of reads and tried the suggestion of running repair.sh per this thread's reply and the documentation within the program. However, when I run repair.sh, it tells me to there are different number of reads and to run repair.sh.

                      I don't understand why it is telling me to run the program I am trying to run.

                      Any ideas?
                      Attached Files

                      Comment


                      • #71
                        Originally posted by easoncm View Post
                        Hello,

                        I got the same error saying my files had a different number of reads and tried the suggestion of running repair.sh per this thread's reply and the documentation within the program. However, when I run repair.sh, it tells me to there are different number of reads and to run repair.sh.

                        I don't understand why it is telling me to run the program I am trying to run.

                        Any ideas?
                        Heh, that's kind of a funny error message. I wrote repair.sh with the assumption people would use it on interleaved files... I will fix that bug. But for now, you should be able to run it like this:

                        cat file1.fq file2.fq | repair.sh in=stdin.fq out1=fixed1.fq out2=fixed2.fq outsingle=singletons.fq overwrite=t

                        Comment


                        • #72
                          Hi Brian,

                          I am trying to create a tool for identification of viral genetic material in a sample. It's supposed to work for RNA viruses, so RNA is reversed transcribed and the resulting DNA is sequenced. I end up with fastq files, where majority of reads comes from the host, but a significant minority comes from virus. I want to map the reads to a database of known RNA viral sequences, and I'm considering using BBMap for this purpose. I need a mapper which would be able to do local as well as end-to-end alignment... I tried bowtie2 at first, but the local version didn't work properly and I didn't manage to find anyone who'd know what's wrong, so I want to try another mapper now. The files with reads can be quite large. Do you think BBMap is a good tool for this type of purpose? Also, I looked through mapping parameters in the README file, but couldn't find seed length. Is there any reason for that?

                          Thank you very much.

                          Comment


                          • #73
                            Andrej,

                            BBMap should work well for this purpose, as it can handle very low identity alignments which are typical of cross-species mapping. You can adjust the seed length with the "k" parameter. The default is 13, and I don't recommend going below about 7 as it becomes exponentially slower with a shorter seed length (speed is inversely proportional to 4^k). For high sensitivity, you might try something like this:

                            bbmap.sh in=reads.fq out=mapped.sam ref=virus.fa k=11 minratio=0.1 maxindel=200 slow

                            ...or drop k even lower. But you still may not get any hits to your viral database unless it contains a relative of the virus in question.

                            If you have an assembly of the host genome, you can also remove reads that map to it like this, to reduce the volume of data you're dealing with:

                            bbmap.sh in=reads.fq outu=unmapped.fq ref=host.fa

                            -Brian

                            Comment


                            • #74
                              Hi Brian,

                              I have some questions about the optional histogram tables:

                              1) mhist: It displays per base matches/sequencing errors, doesn't it?
                              What is included in the "Other" column?
                              How are multiple insertions counted? Assume the following:
                              Code:
                              12345--6789
                              ACTAG--CATC
                              ACTAGGGCATC
                              Will it count for this read 2 insertions at position 5?
                              I have different read length from 30bp - 1.2kbp. Is the error rate somehow adjusted to read length?

                              2) qhist: Assuming that Read_linear is the average base quality, what is Read_log?

                              3) ihist: This one is always empty?!

                              4) scafStats: Ambiguous reads are counted multiple times, aren't they? This means the overall ambiguousMB is in fact much smaller, isn't it?

                              Thanks!

                              Comment


                              • #75
                                Originally posted by WhatsOEver View Post
                                Hi Brian,

                                I have some questions about the optional histogram tables:

                                1) mhist: It displays per base matches/sequencing errors, doesn't it?
                                Yes.
                                What is included in the "Other" column?
                                That's for any other cigar string symbols, which in practice means soft-clipping. If you don't use the 'local' flag, only reads that go off the end of scaffolds will be clipped.
                                How are multiple insertions counted? Assume the following:
                                Code:
                                12345--6789
                                ACTAG--CATC
                                ACTAGGGCATC
                                Will it count for this read 2 insertions at position 5?
                                All counts are relative to the read position. But what I call an insertion is extra bases in the query that are not present in the reference, so I would classify that as a deletion, assuming that the top line (with '--') is the read and the bottom line is the reference. That deletion would be counted exactly once, regardless of the length, at position 5. An insertion of length L will be counted L times, once at each read base it covers. So, for example:

                                Code:
                                pos   123456789
                                read  ACTAGGGCATC
                                ref   ACTAG--CATC
                                That would be classified as an insertion and counted once at position 6 and once at position 7.
                                I have different read length from 30bp - 1.2kbp. Is the error rate somehow adjusted to read length?
                                Yes, the sum of all columns should always be 1 for any line; it represents the fraction of cigar symbols at that position, so if there is 1 read of length 100 and 2 reads of length 200, then an error at position 50 would be represented as a 0.33 substitution rate, while an error at position 150 would be represented as a 0.5 substitution rate. As a result the graph gets more noisy going toward the right, with variable-length reads. Be sure you are using PacBio mode (mapPacBio8k.sh) for long reads; bbmap.sh has a read-length limit of 600bp, and reads longer than that will be broken into pieces.
                                2) qhist: Assuming that Read_linear is the average base quality, what is Read_log?
                                The linear average sums the quality values, and divides by the number of quality values. The log average converts the phred quality score into a probability of error, sums the probabilities of error, divides that by the number of quality values, then transforms the average probability of error back into a phred score. As a result, the linear average (though commonly used) is kind of meaningless, while the logarithmic average can actually be used to calculate the expected error rate at that position. For example, if you had 200 reads and the log average at position 5 was 20, then you would expect there to be 2 total errors at position 5. However, if the linear average is 20, that doesn't actually tell you anything useful, but I include it anyway since that is what most programs plot.
                                3) ihist: This one is always empty?!
                                "ihist" will contain the insert size distribution of properly-paired reads, if the mapping is done paired. It will be empty if no reads were proper pairs, or if reads were mapped single-ended. By default, BBMap expects reads in the Illumina fragment library orientation where reads are on opposite strands, pointing inward. You can override this with the "rcs=f" flag (requirecorrectstrand=false) which will allow any orientation, or with the "rcompmate" flag if both reads are on the same strand. Also, only reads that map to the same scaffold can be considered proper pairs.

                                You can, alternately, use the 'lhist' flag to just plot the read lengths, regardless of mapping.
                                4) scafStats: Ambiguous reads are counted multiple times, aren't they? This means the overall ambiguousMB is in fact much smaller, isn't it?
                                Yes, unambiguously-mapped reads (and megabases) are counted once and ambiguously-mapped reads are counted at least twice.
                                Thanks!
                                You're welcome!

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                8 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                8 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                49 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                66 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X