Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #31
    Originally posted by luc View Post
    Hi Brian,

    I can't seem to find any documentation on how to implement any specific error models for the read simulation of Illumina or PacBio? Perhaps I am misunderstanding something though.
    Luc,

    I released a new version (32.14) which now has RandomReads documented (run randomreads.sh with no arguments).

    By default it will use an Illumina error model, which assigns qualities in a pattern that looks roughly like the average quality histogram of Illumina reads (peaking at about position 20, and sloping down toward the head and tail). Then some bases will by randomly changed based on the quality score. To generate Illumina-like reads from e.coli:
    randomreads.sh ref=ecoli.fa reads=100000 length=150 maxq=40 midq=30 minq=10 out=synth.fq
    It can also generate paired reads if desired.

    To generate PacBio-like reads:
    randomreads.sh ref=ecoli.fa reads=100000 minlength=200 maxlength=4000 pacbio pbminrate=0.13 pbmaxrate=0.17 out=synth.fq

    That will add substitutions and single-base-pair indels with per-base probability ranging from 0.13 to 0.17.

    There are also a lot of other options for adding specific numbers or lengths of insertions, deletions, substitutions, and Ns, in the help menu.

    Comment


    • #32
      Thanks a lot for the randomreads instructions Brian!

      Comment


      • #33
        Originally posted by Brian Bushnell View Post
        bbmap.sh in1=read1.fq.gz in2=read2.fq.gz ref=zv9.fa out=mapped.sam maxindel=200000
        @ Brian: Is maxindel the only bbmap parameter you recommend to change for mapping reads for subsequent Cufflinks analysis?

        Because I interpetreted the Cufflinks Doc in a way, that it will use the xmtag to lower the read weight by the amount of sites it maps to and therefore decided to set ambig=all.

        I therefore chose to set for my RNAseq:

        Code:
        maxindel=200000 intronlen=10 ambig=all xstag=unstranded xmtag=T
        Was I misled?

        Comment


        • #34
          Thias,

          For analysis of vertebrate RNA-seq data with Cufflinks, I do recommend all of those parameters:
          maxindel=200000 intronlen=10 ambig=all xstag=unstranded xmtag=t

          Although "ambig=all" or "ambig=random" would both be OK. The "intronlen", "xstag" and "xmtag" are things I added specifically for Cufflinks compatibility (or the tuxedo package in general) so they are not needed if you do downstream analysis with other tools. The only ones that affect the actual mapping are "ambig" and "maxindel". xstag and xmtag just add optional tags, and intronlen changes the "D" symbol in cigar strings to "N" for deletions longer than the specified length. There's no reason why any of them should be necessary, but for whatever reason, Cufflinks needs them.

          Note that if you your data is strand-specific, and you know which protocol it was, then it's better to set "xstag=firststrand" or "xstag=secondstrand" than unstranded.
          Last edited by Brian Bushnell; 04-26-2014, 09:13 AM.

          Comment


          • #35
            Thanks Brian for your response almost at warp speed.

            Unfortunately am still a bit lost - probably also due to my ignorance regarding the detail SAM Format Specification. This is, how I assumed the setting takes effect:
            • ambig=all:
              All best alignments are reported. All but one are flagged with the secondary alignment bit (0x100) and the xmtag (an integer ?) additionally reports the number of best alignments for the tuxedo package.
            • ambig=random:
              From the best alignments, one is randomly chosen. This alignment is reported as if there would not have been any secondary alignments. Only the xmtag > 1 indicates, that there would have been other best alignments as well, but the coordinates of those alignments are not present in the output.


            I admit, I could easily have checked, whether ambig=random really does not report secondary alignments, but I was pretty confident to have understood the logic - until I read your post this afternoon.

            Thanks also for emphasizing the xstag's importance. However the data is indeed unstranded unpaired phred64-age low-coverage dusty archaeological heritage from a previous PhD student .

            Comment


            • #36
              Originally posted by Thias View Post
              Thanks Brian for your response almost at warp speed.

              Unfortunately am still a bit lost - probably also due to my ignorance regarding the detail SAM Format Specification. This is, how I assumed the setting takes effect:
              • ambig=all:
                All best alignments are reported. All but one are flagged with the secondary alignment bit (0x100) and the xmtag (an integer ?) additionally reports the number of best alignments for the tuxedo package.
              • ambig=random:
                From the best alignments, one is randomly chosen. This alignment is reported as if there would not have been any secondary alignments. Only the xmtag > 1 indicates, that there would have been other best alignments as well, but the coordinates of those alignments are not present in the output.


              I admit, I could easily have checked, whether ambig=random really does not report secondary alignments, but I was pretty confident to have understood the logic - until I read your post this afternoon.

              Thanks also for emphasizing the xstag's importance. However the data is indeed unstranded unpaired phred64-age low-coverage dusty archaeological heritage from a previous PhD student .
              That sounds correct. ambig=all reports a randomly selected primary site and all other equal-scoring sites as secondary. ambig=random only reports a primary site, selected at random from top sites. If you use "random" then possibly (probably?) you shouldn't include the xmtag, as that might confuse cufflinks. Also, ambiguously-mapped reads get the "XT:A:R" tag. Note that any tag that starts with "X" is not part of the official sam specification.

              Comment


              • #37
                Originally posted by Brian Bushnell View Post
                If you use "random" then possibly (probably?) you shouldn't include the xmtag, as that might confuse cufflinks.
                This in in accordance with my gut feeling, since I interpet the Cufflinks docs in the way, that the algorithm adjusts the read weight based on the xmtag:

                By default, Cufflinks will uniformly divide each multi-mapped read to all of the positions it maps to. In other words, a read mapping to 10 positions will count as 10% of a read at each position.
                However Cufflinks in a later stage also needs the secondary alignment positions for the fine grained abundance estimation:

                Cufflinks will then re-estimate the abundances dividing each multi-mapped read probabalistically based on the initial abundance estimation of the genes it maps to
                But thanks Brian for working this out!
                Last edited by Thias; 04-28-2014, 12:31 AM. Reason: Typo

                Comment


                • #38
                  Dear Brian,

                  I tested bbmap successfully on a small dataset and so far I'm very happy with the results
                  However, I'm running into some memory problems using my real dataset:


                  My RAM looks currently as follows:
                  total used free shared buffers cached
                  Mem: 66066316 42861240 23205076 0 3700 42478768
                  When I now run bbmap with different -Xmx setting or also without the parameter, it always quits with:

                  Max Memory = 28633 MB (or other, depending on the value I used)
                  Available Memory = 10830 MB (ranging from 8Gb - 12Gb)
                  Reducing threads from 16 to 15 due to low system memory.
                  java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
                  To me it looks like a bug within the memory detection or am I missing something?

                  Comment


                  • #39
                    Originally posted by WhatsOEver View Post
                    Dear Brian,

                    I tested bbmap successfully on a small dataset and so far I'm very happy with the results
                    However, I'm running into some memory problems using my real dataset:

                    ...

                    To me it looks like a bug within the memory detection or am I missing something?
                    WhatsOEver,

                    Normally, "Reducing threads from 16 to 15 due to low system memory." only occurs if you are right at the edge of the memory limits. It would be helpful to know:
                    1) The amount of physical memory available (it looks like 64GB, is that correct?)
                    2) The size of the reference you are using.
                    3) Whether you are mapping with bbmap.sh or mapPacBio8k.sh
                    4) The complete stack trace that comes after the error message "java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space". Actually, the entire standard error output would be helpful.

                    You can reduce the amount of memory BBMap needs by reducing the number of threads (e.g. "t=8" to drop it to 8 threads). But that normally will only help with pacbio modes which need much more memory per thread (100s of megs), not with standard bbmap which only uses around 25MB per thread.

                    Assuming you do have 64GB, I would try with the flag -Xmx58g and see if that works. But regardless, it would help if you could post the above details.

                    Thanks,
                    Brian

                    Comment


                    • #40
                      Originally posted by Brian Bushnell View Post
                      WhatsOEver,

                      Normally, "Reducing threads from 16 to 15 due to low system memory." only occurs if you are right at the edge of the memory limits. It would be helpful to know:
                      1) The amount of physical memory available (it looks like 64GB, is that correct?)
                      2) The size of the reference you are using.
                      3) Whether you are mapping with bbmap.sh or mapPacBio8k.sh
                      4) The complete stack trace that comes after the error message "java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space". Actually, the entire standard error output would be helpful.

                      You can reduce the amount of memory BBMap needs by reducing the number of threads (e.g. "t=8" to drop it to 8 threads). But that normally will only help with pacbio modes which need much more memory per thread (100s of megs), not with standard bbmap which only uses around 25MB per thread.

                      Assuming you do have 64GB, I would try with the flag -Xmx58g and see if that works. But regardless, it would help if you could post the above details.

                      Thanks,
                      Brian
                      Hi Brian,

                      it was indeed increasing Xmx to 58g which did the trick. The threads (t) parameter, however, didn't help at all. So, when I run with -Xmx58g everything works fine (independent of t=8 or t=16 or leaving t=auto). When I run with -Xmx30g and t=auto the prog quits with the following error:

                      concerning your questions:
                      1) The amount of physical memory available (it looks like 64GB, is that correct?)
                      yes
                      2) The size of the reference you are using.
                      its the human genome assembly 19 from UCSC (~3.1Gb)
                      3) Whether you are mapping with bbmap.sh or mapPacBio8k.sh
                      mapPacBio8k.sh
                      4) The complete stack trace that comes after the error message "java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap
                      see quotes below

                      java -ea -Xmx30g -cp /projects/bbmap/current/ align2.BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 k=14 in=./121128_1.fastq out=/projects/testData/bbmap_121128_1_mapped.sam xstag=unstranded xmtag=t ambig=all maxindel=200000 qin=33 -Xmx30g
                      Executing align2.BBMapPacBio [build=1, overwrite=true, minratio=0.40, fastareadlen=6000, ambiguous=best, minscaf=100, startpad=10000, stoppad=10000, midpad=6000, k=14, in=./121128_1.fastq, out=/projects/testData/bbmap_121128_1_mapped.sam, xstag=unstranded, xmtag=t, ambig=all, maxindel=200000, qin=33, maxindel=150000, -Xmx30g]

                      BBMap version 32.14
                      Set overwrite to true
                      Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.400
                      Set fastq input ASCII offset to 33
                      Retaining all best sites for ambiguous mappings.
                      Set genome to 1

                      Loaded Reference: 6.143 seconds.
                      Loading index for chunk 1-7, build 1
                      Generated Index: 9.693 seconds.
                      Analyzed Index: 24.885 seconds.
                      Started output stream: 0.038 seconds.
                      Cleared Memory: 0.530 seconds.

                      Max Memory = 28633 MB
                      Available Memory = 10826 MB
                      Reducing threads from 16 to 15 due to low system memory.
                      java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
                      at align2.MultiStateAligner9PacBio.<init>(MultiStateAligner9PacBio.java:73)
                      at align2.MSA.makeMSA(MSA.java:43)
                      at align2.AbstractMapThread.<init>(AbstractMapThread.java:130)
                      at align2.BBMapThreadPacBio.<init>(BBMapThreadPacBio.java:100)
                      at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:378)
                      at align2.BBMapPacBio.main(BBMapPacBio.java:33)
                      Detecting finished threads: 0, 1, 2, 3, 4, 5, 6

                      **************************************************************************

                      Warning! 15 mapping threads did not terminate normally.
                      Please check the error log; the output may be corrupt or incomplete.

                      **************************************************************************


                      Exception in thread "main" java.lang.RuntimeException: Aborting due to prior error.
                      at align2.AbstractMapper.abort(AbstractMapper.java:69)
                      at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:381)
                      at align2.BBMapPacBio.main(BBMapPacBio.java:33)
                      When I run it with -Xmx30g and t=8 I get the same heap space error, although its not complaining that there is not enough memory available

                      java -ea -Xmx30g -cp /projects/bbmap/current/ align2.BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 k=14 t=8 in=./121128_1.fastq out=/projects/testData/bbmap_121128_1_mapped.sam xstag=unstranded xmtag=t ambig=all maxindel=200000 qin=33 -Xmx30g
                      Executing align2.BBMapPacBio [build=1, overwrite=true, minratio=0.40, fastareadlen=6000, ambiguous=best, minscaf=100, startpad=10000, stoppad=10000, midpad=6000, k=14, t=8, in=./121128_1.fastq, out=/projects/testData/bbmap_121128_1_mapped.sam, xstag=unstranded, xmtag=t, ambig=all, maxindel=200000, qin=33, maxindel=150000, -Xmx30g]

                      BBMap version 32.14
                      Set overwrite to true
                      Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.400
                      Set threads to 8
                      Set fastq input ASCII offset to 33
                      Retaining all best sites for ambiguous mappings.
                      Set genome to 1

                      Loaded Reference: 5.773 seconds.
                      Loading index for chunk 1-7, build 1
                      Generated Index: 9.148 seconds.
                      Analyzed Index: 27.428 seconds.
                      Started output stream: 0.015 seconds.
                      Cleared Memory: 0.511 seconds.
                      java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
                      at align2.MultiStateAligner9PacBio.<init>(MultiStateAligner9PacBio.java:73)
                      at align2.MSA.makeMSA(MSA.java:43)
                      at align2.AbstractMapThread.<init>(AbstractMapThread.java:130)
                      at align2.BBMapThreadPacBio.<init>(BBMapThreadPacBio.java:100)
                      at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:378)
                      at align2.BBMapPacBio.main(BBMapPacBio.java:33)
                      Detecting finished threads: 0, 1, 2, 3, 4, 5

                      **************************************************************************

                      Warning! 8 mapping threads did not terminate normally.
                      Please check the error log; the output may be corrupt or incomplete.

                      **************************************************************************


                      Exception in thread "main" java.lang.RuntimeException: Aborting due to prior error.
                      at align2.AbstractMapper.abort(AbstractMapper.java:69)
                      at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:381)
                      at align2.BBMapPacBio.main(BBMapPacBio.java:33)

                      Comment


                      • #41
                        With bbmap.sh, HG19 needs about 23 GB. I had never tested HG19 with mapPacBio8k.sh but it looks like I need to refine the memory prediction. Thanks for the report!

                        Comment


                        • #42
                          Dear Brian,

                          I have some questions about the optional fields (TAG:TYPE:VALUE stuff) that can be added in the SAM file by bbmap. Take the following multiple-mapped read as example

                          readNameHere 16 chr11 106018977 55 6M3I17M * 0 0 GCTTTGAAAAATTTTCTGCATGCCCA /4122:...00....=<<<=<<4449 XT:A:R NM:i:5 AM:i:55 XM:i:2 NH:i:20
                          readNameHere 272 chrX 119093161 50 18M5I3M * 0 0 * * NM:i:6 AM:i:50 XM:i:2 NH:i:20
                          readNameHere 272 chrX 45789727 49 1M3I22M * 0 0 * * NM:i:6 AM:i:49 XM:i:2 NH:i:20
                          readNameHere 272 chr20 23542459 49 18M5I3M * 0 0 * * NM:i:6 AM:i:49 XM:i:2 NH:i:20
                          readNameHere 256 chr3 151169846 47 1M3I3M3I16M * 0 0 * * NM:i:6 AM:i:47 XM:i:2 NH:i:20
                          readNameHere 256 chr4 27414930 47 1M1I5M12N17M1I1M * 0 0 * * NM:i:2 AM:i:47 XM:i:2 XS:A:+ NH:i:20
                          readNameHere 272 chrX 89494814 46 26M * 0 0 * * NM:i:8 AM:i:46 XM:i:2 NH:i:20
                          readNameHere 272 chrY 3677157 46 26M * 0 0 * * NM:i:8 AM:i:46 XM:i:2 NH:i:20
                          readNameHere 272 chr12 126969193 46 4M2D22M * 0 0 * * NM:i:6 AM:i:46 XM:i:2 NH:i:20
                          readNameHere 256 chr21 31585993 45 5M1I20M * 0 0 * * NM:i:5 AM:i:45 XM:i:2 NH:i:20
                          readNameHere 256 chr5 80894028 45 7M1I18M * 0 0 * * NM:i:6 AM:i:45 XM:i:2 NH:i:20
                          readNameHere 272 chr6 85600169 45 26M * 0 0 * * NM:i:7 AM:i:45 XM:i:2 NH:i:20
                          readNameHere 272 chr9 81596079 44 18M1I7M * 0 0 * * NM:i:6 AM:i:44 XM:i:2 NH:i:20
                          readNameHere 256 chr3 170762912 43 1M3I22M * 0 0 * * NM:i:7 AM:i:43 XM:i:2 NH:i:20
                          readNameHere 256 chr13 20507692 42 26M * 0 0 * * NM:i:6 AM:i:42 XM:i:2 NH:i:20
                          readNameHere 256 chr12 28169470 42 5M2I19M * 0 0 * * NM:i:6 AM:i:42 XM:i:2 NH:i:20
                          readNameHere 256 chr7 42988607 41 26M * 0 0 * * NM:i:7 AM:i:41 XM:i:2 NH:i:20
                          readNameHere 256 chr3 103868859 41 26M * 0 0 * * NM:i:8 AM:i:41 XM:i:2 NH:i:20
                          readNameHere 272 chrX 45152661 40 26M * 0 0 * * NM:i:7 AM:i:40 XM:i:2 NH:i:20
                          readNameHere 272 chr18 60932877 39 26M * 0 0 * * NM:i:8 AM:i:39 XM:i:2 NH:i:20
                          Concerning XM: From the readme it says "Indicates number of best alignments". But what are these exactly? I expected all 26M from my example to be best alignments?! Or does it mean that there are mismatches in some of these alignments which I don't see as I didn't change the samversion parameter to 1.4 (btw: I have samtools 0.1.19 installed. This is able to handle '=' and 'X', isn't it?)

                          Concerning XS: I run bbmap using -xstag=unstranded as I read it somewhere in this thread and I want to use cufflinks afterwards. I thought that the '+' and '-' as values refer to the strand in this field and I also have '+' and '-' values set in different reads. If this is true, where is the difference to -xstag=true?

                          Concerning YS: This stores the end position of a read? So its pos + the alignment length of the read?

                          Concerning YI: The identity is simple the number of (mismatches + indels) / read-length?


                          Thanks a lot in advance for your help and this amazing amount of possibilities !

                          Comment


                          • #43
                            Dear Brian,

                            I have yet another issue

                            Looking through the readme I saw that the standard state for stoptag and idtag is true. However, looking at my sam file I couldn't find any YS or YI fields. I therefore tried to set the parameter explicitly in the start command, but this resulted in an error:

                            /projects/bbmap$ ./bbmap ../bbAlign/DNAseq_140306.fastq
                            java -ea -Xmx56g -cp /projects/bbmap/current/ align2.BBMapPacBio build=1 overwrite=true minratio=0.40 fastareadlen=6000 ambiguous=best minscaf=100 startpad=10000 stoppad=10000 midpad=6000 k=14 t=16 in=../bbAlign/DNAseq_140306.fastq out=/projects/bbAlign/bbmap_DNAseq_140306_mapped.sam xstag=unstranded xmtag=t stoptag=t idtag=t ambig=all maxindel=10 qin=33 samversion=1.4 qhist=/projects/bbAlign/bb_qhist.txt mhist=/projects/bbAlign/bb_mhist.txt qahist=/projects/bbAlign/qahist.txt ihist=/projects/bbAlign/ihist.txt bhist=/projects/bbAlign/bhist.txt scafstats=/projects/bbAlign/scafStats refStats=/projects/bbAlign/refStats.txt -Xmx56g
                            Executing align2.BBMapPacBio [build=1, overwrite=true, minratio=0.40, fastareadlen=6000, ambiguous=best, minscaf=100, startpad=10000, stoppad=10000, midpad=6000, k=14, t=16, in=../bbAlign/DNAseq_140306.fastq, out=/projects/bbAlign/bbmap_DNAseq_140306_mapped.sam, xstag=unstranded, xmtag=t, stoptag=t, idtag=t, ambig=all, maxindel=10, qin=33, samversion=1.4, qhist=/projects/bbAlign/bb_qhist.txt, mhist=/projects/bbAlign/bb_mhist.txt, qahist=/projects/bbAlign/qahist.txt, ihist=/projects/bbAlign/ihist.txt, bhist=/projects/bbAlign/bhist.txt, scafstats=/projects/bbAlign/scafStats, refStats=/projects/bbAlign/refStats.txt, -Xmx56g]

                            BBMap version 32.14
                            Set overwrite to true
                            Set MINIMUM_ALIGNMENT_SCORE_RATIO to 0.400
                            Set threads to 16
                            Set fastq input ASCII offset to 33
                            Set quality histogram output to /projects/bbAlign/bb_qhist.txt
                            Set match histogram output to /projects/bbAlign/bb_mhist.txt
                            Set quality accuracy histogram output to /projects/bbAlign/qahist.txt
                            Set insert size histogram output to /projects/bbAlign/ihist.txt
                            Set base content histogram output to /projects/bbAlign/bhist.txt
                            Scaffold statistics will be written to /projects/bbAlign/scafStats
                            Reference set statistics will be written to /projects/bbAlign/refStats.txt
                            Retaining all best sites for ambiguous mappings.
                            Set genome to 1

                            Loaded Reference: 6.389 seconds.
                            Loading index for chunk 1-7, build 1
                            Generated Index: 8.479 seconds.
                            Analyzed Index: 23.584 seconds.
                            Started output stream: 0.014 seconds.
                            Creating scaffold statistics table: 0.001 seconds.
                            Cleared Memory: 0.506 seconds.
                            Processing reads in single-ended mode.
                            Started read stream.
                            Started 16 mapping threads.
                            Exception in thread "Thread-11" java.lang.NullPointerException
                            at stream.SamLine.<init>(SamLine.java:1130)
                            at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:154)
                            at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:19)
                            Exception in thread "Thread-24" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            Exception in thread "Thread-21" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            Exception in thread "Thread-19" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            Exception in thread "Thread-12" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            Detecting finished threads: 0Exception in thread "Thread-25" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            Exception in thread "Thread-18" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            Exception in thread "Thread-23" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            Exception in thread "Thread-27" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            Exception in thread "Thread-15" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            Exception in thread "Thread-17" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            Exception in thread "Thread-20" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            Exception in thread "Thread-13" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            , 1Exception in thread "Thread-26" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            Exception in thread "Thread-22" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            Exception in thread "Thread-14" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            , 2, 3Exception in thread "Thread-16" java.lang.RuntimeException: Writing to a terminated thread.
                            at stream.RTextOutputStream3.write(RTextOutputStream3.java:121)
                            at stream.RTextOutputStream3.addDisordered(RTextOutputStream3.java:116)
                            at stream.RTextOutputStream3.add(RTextOutputStream3.java:90)
                            at align2.AbstractMapThread.writeList(AbstractMapThread.java:394)
                            at align2.AbstractMapThread.run(AbstractMapThread.java:320)
                            , 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15

                            **************************************************************************

                            Warning! 16 mapping threads did not terminate normally.
                            Please check the error log; the output may be corrupt or incomplete.

                            **************************************************************************




                            ------------------ Results ------------------

                            Genome: 1
                            Key Length: 14
                            Max Indel: 10
                            Minimum Score Ratio: 0.4
                            Mapping Mode: normal
                            Reads Used: 850 (150375 bases)

                            Mapping: 16.779 seconds.
                            Reads/sec: 50.66
                            kBases/sec: 8.96


                            Read 1 data: pct reads num reads pct bases num bases

                            mapped: 71.8824% 611 84.5406% 127128
                            unambiguous: 63.6471% 541 75.8251% 114022
                            ambiguous: 8.2353% 70 8.7155% 13106
                            low-Q discards: 0.0000% 0 0.0000% 0

                            perfect best site: 0.0000% 0 0.0000% 0
                            semiperfect site: 0.0000% 0 0.0000% 0

                            Match Rate: NA NA 24.8088% 109654
                            Error Rate: 73.7923% 611 75.1826% 332304
                            Sub Rate: 72.9469% 604 1.8880% 8345
                            Del Rate: 59.6618% 494 71.2368% 314864
                            Ins Rate: 71.9807% 596 2.0577% 9095
                            N Rate: 4.1063% 34 0.0086% 38
                            Splice Rate: 14.8551% 123 (splices at least 10 bp)
                            Exception in thread "main" java.lang.RuntimeException: BBMap terminated in an error state; the output may be corrupt.
                            at align2.BBMapPacBio.testSpeed(BBMapPacBio.java:407)
                            at align2.BBMapPacBio.main(BBMapPacBio.java:33)
                            When I run it without both parameters, everything works fine. My dataset includes 1.6M single end 454 reads if this is important.

                            Comment


                            • #44
                              WhatSoEver,

                              For the first problem, try setting the flag "sam=1.4". By default, cigar strings are presented in Sam 1.3 format, which uses the 'M' symbol. The does not mean match - it means match OR mismatch! Sam format 1.4 fixed this bad design decision by using '=' for match and 'X' for mismatch, but by default I output sam format 1.3 because many downstream tools (such as old versions of Samtools) cannot process format 1.4. The lastest Samtools can, though. If you set "sam=1.4" then it will be more obvious how many substitutions the reads have. The NM tag tells you the edit distance, but BBMap does not determine score directly from edit distance; it uses affine transforms. So 4 consecutive substitutions will be penalized less than 4 non-adjacent substitutions, for example.

                              Please note that neither XM nor XS are defined in the sam spec; both are bowtie/tophat flags, and XM particularly is poorly defined. For that, I output the number of alignments with scores EXACTLY equal to the top score, even though when you set "ambig=all" I will output sites with scores that are within a certain threshold of the top score. So there can be more alignments printed than the value of XM.

                              For the XS tag, you can set it to "firststrand", "secondstrand", or "unstranded". But internally, "true" is equivalent to "firststand" which is equivalent to "unstranded"; all produce identical output. "secondstrand" produces the opposite sign of "firststrand". For firststrand/unstranded, if read 1 maps to the plus strand, it gets "XS:A:+"; for secondstrand, it gets "XT:A:-". Read 2 always gets the opposite of read 1.

                              The YS tag is the stop position - specifically, the reference location to which the rightmost base maps. So yes, it is pos+alignment length.

                              The YI tag is a bit more complex. Identity can be calculated various ways; unfortunately, many ways overly penalize long deletions. So actually it is:
                              (matches)/(matches+substitutions+Ns+insertions+f(deletions)). If f(deletions) was just the number of deletions, then it would be matches/(total number of cigar operations). But then a 100bp read with 50bp match, 400bp deletion, 50bp match would get an identity of 100/(100+400) = 20%. 20% identity implies worse than even the alignment of two completely random sequences (which is expected to yield over 25%), so it's misleading, because a read with two nearby 50bp consecutive matches to another string is actually a very good alignment that could not have arisen by chance. So, I actually take the square root of the length of each individual deletion event. In this case, you would get:
                              (100)/(100+sqrt(400)) = 100/120= 83.3% identity.
                              If there were 50 individual single-base-pair deletions, however, the identity would be:
                              (100)/(100+50*sqrt(1)) = 100/150 = 66.7% identity, which is lower, because it's more likely to have arisen by chance.
                              This method of identity calculation is not perfect, of course, and incompatible with most other methods, but there is no single standard of which I am aware, and I think the results of this method are more informative than others, which are biased toward edit distance rather than modelling biological events.

                              For the crash bug, that's my fault for not testing the interaction between "ambig=all" and "stoptag"; they don't work together. I'll fix that. In the mean time, you can set "saa=f" (secondaryalignmentasterisks=false) which will circumvent the problem.

                              Thanks for finding that bug, and I'm glad you're finding BBMap useful! One suggestion, though - if you are doing RNA-seq on organisms with introns, I suggest you set the "intronlength" and "maxindel" flags. I normally use "intronlength=10". This does not change mapping at all, it just changes cigar strings so that deletions longer than 10bp will be reported using the symbol 'N' instead of 'D'. That may make a difference to some downstream processing tools, if they only interpret 'N' as intron. As for maxindel, introns shorter than maxindel will not be searched for. In plants or fungus, I suggest setting this to at least 16000 (that's the default for bbmap, but the pacbio modes have a much shorter default of 100); for vertebrates, I suggest 200000. I'm not really sure about invertebrates. But try to set it at around the 99th percentile of intron sizes in that organism, if you have some idea of what that might be. The lower the better (faster, more accurate) as long as it encompasses the vast majority of introns.

                              Comment


                              • #45
                                Thanks Brian, that helps a lot! Many thanks!!

                                There are just two little things I'd like to clearify:

                                ---------------------------------------------------------------------------------------

                                The YI tag is a bit more complex. Identity can be calculated various ways; unfortunately, many ways overly penalize long deletions. So actually it is:
                                (matches)/(matches+substitutions+Ns+insertions+f(deletions)). If f(deletions) was just the number of deletions, then it would be matches/(total number of cigar operations). But then a 100bp read with 50bp match, 400bp deletion, 50bp match would get an identity of 100/(100+400) = 20%. 20% identity implies worse than even the alignment of two completely random sequences (which is expected to yield over 25%), so it's misleading, because a read with two nearby 50bp consecutive matches to another string is actually a very good alignment that could not have arisen by chance. So, I actually take the square root of the length of each individual deletion event. In this case, you would get:
                                (100)/(100+sqrt(400)) = 100/120= 83.3% identity.
                                If there were 50 individual single-base-pair deletions, however, the identity would be:
                                (100)/(100+50*sqrt(1)) = 100/150 = 66.7% identity, which is lower, because it's more likely to have arisen by chance.
                                This method of identity calculation is not perfect, of course, and incompatible with most other methods, but there is no single standard of which I am aware, and I think the results of this method are more informative than others, which are biased toward edit distance rather than modelling biological events.
                                I'm not sure if I understand the meaning of f(deletions). Does it just mean, that all consecutive deletions are square rooted and the results summed up?
                                Like in the following example from my data:

                                read1
                                length: 202 (+11 Deletions)
                                cigar: 5=6X1=1X2=1X1=1I4=10D2=1D2=1X1=1X6=1I9=1I19=1I79=1I56=

                                Matches: 5+1+2+1+4+2+2+1+6+9+19+79+56=187
                                Mismatches: 6+1+1+1+1=10
                                Insertions: 1+1+1+1+1=5
                                Deletions: 10+1=11

                                Identity: 187/(187+10+5+11) = 87.8% (this is the "standard" calculation)
                                Identity: 187/(187+10+5+sqrt(10)+sqrt(1))= 90.7% (this is your calculation?)
                                If the former is true, why don't you treat insertions in the same way? Like in:

                                read2
                                length: 264 (+2 Deletions)
                                cigar: 1X2=2X1=2X4=3X3=20I8=1D19=1D157=1X28=1X3=1X1=3I3=1X

                                Matches: 2+1+4+3+8+19+157+28+3+1+3=229
                                Mismatches: 1+2+2+3+1+1+1+1=12
                                Insertions: 20+3=23
                                Deletions: 1+1=2

                                Identity: 229/(229+12+23+2*sqrt(1)) = 86.1% (This is your calculation?)
                                Identity: 229/(229+12+sqrt(20)+sqrt(3)+2*sqrt(1)) = 91.9% (Why not this way?)
                                ---------------------------------------------------------------------------------------

                                Thanks for finding that bug, and I'm glad you're finding BBMap useful! One suggestion, though - if you are doing RNA-seq on organisms with introns, I suggest you set the "intronlength" and "maxindel" flags. I normally use "intronlength=10". This does not change mapping at all, it just changes cigar strings so that deletions longer than 10bp will be reported using the symbol 'N' instead of 'D'. That may make a difference to some downstream processing tools, if they only interpret 'N' as intron. As for maxindel, introns shorter than maxindel will not be searched for. In plants or fungus, I suggest setting this to at least 16000 (that's the default for bbmap, but the pacbio modes have a much shorter default of 100); for vertebrates, I suggest 200000. I'm not really sure about invertebrates. But try to set it at around the 99th percentile of intron sizes in that organism, if you have some idea of what that might be. The lower the better (faster, more accurate) as long as it encompasses the vast majority of introns.
                                intronlength=10 is the standard value, isn't it? I didn't set it, but can neither find any cigar strings with deletions greater than 10D nor one with N's smaller than 11N.
                                And one last question out of interest as I worked in fungal genomics before I changed to humans: Why would you suggest a maxindel of 16000 for fungi? From what I analyzed (Comparative genomics of 200+ asco- and basidiomycetes) they are rarely longer than 100bp and I never saw one longer than 3kb (though I'm atm not sure if this was even mitochondrial).

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM
                                • seqadmin
                                  The Impact of AI in Genomic Medicine
                                  by seqadmin



                                  Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                  02-26-2024, 02:07 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 03-14-2024, 06:13 AM
                                0 responses
                                34 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-08-2024, 08:03 AM
                                0 responses
                                72 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-07-2024, 08:13 AM
                                0 responses
                                82 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-06-2024, 09:51 AM
                                0 responses
                                68 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X