Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #91
    Yes, "only matching that sequence" is what I mean. So in seal if you use ambig=all then that read (hit) is added to all targets it matched (if there were multiple best matches)?

    Or say...what happens when a read can match one sequence with 20 kmers and a second with 18 kmers?
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment


    • #92
      Originally posted by sdriscoll View Post
      So in seal if you use ambig=all then that read (hit) is added to all targets it matched (if there were multiple best matches)?
      That's correct.
      Or say...what happens when a read can match one sequence with 20 kmers and a second with 18 kmers?
      Seal will only consider something ambiguous if it matches multiple sequences with the same number of kmers, which is also to the highest number. However, I could add a flag that changes it to allow any number of kmers.

      Comment


      • #93
        Hey Brian,

        bbduk and seal are certainly great programs and since I have downloaded these tools I've been using them pretty much daily. While most kmer tools ignore the idea of allowing a mismatch (since obviously this complicates hash lookups) I like that you have included this however using it imposes such a severe spike in memory usage that it becomes basically unusable. Case in point - using seal to produce hits to genes in a full transcriptome while allowing a mismatch per kmer. Is there a reason that you put the burden of additional kmers on the reference and not on the read? It seems to me that there would be a performance hit but I would much rather have that than the massive amount of memory needed to produce a kmer index of a full transcriptome allowing even one mismatch. So during matching the read could be kmer'd into all possible single mismatch variants and matched to the normal 0-mismatch kmer index. Or does that bring in additional issues that are not easily resolved?
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment


        • #94
          I really kind of like the "mm=t" option because it's similar to allowing a mismatch, but with no speed or memory penalty. It's also possible, though, to generally reduce memory consumption when allowing mismatches using the "speed" or "rskip" flag. "speed=8", for example, would hash only 50% of the kmer space, so it would halve memory consumption; "speed=12" would quarter memory consumption. Of course these reduce sensitivity, but "speed=12 hdist=1" would probably be more sensitive than "speed=0 hdist=0" if there are a lot of mismatches.

          There is no reason for me to not generate read kmers with mismatches in them, other than the speed penalty. But, that speed penalty is hefty - with 31mers, allowing 1 mismatch might make the program 93x slower (particularly if most reads do not match the reference; not so much if they mostly do).

          I will plan to add that capability in, though I'm not convinced that it will be viable on large datasets (raw reads). However, it WOULD be viable for matching a handful of sequences to a very large reference.

          Comment


          • #95
            Yeah that last point is totally fair. I'm also not totally sure that for most applications allowing a mismatch really tells me what I want to know anyways.

            On another subject, are you familiar with the program Sailfish which uses kmer matching and the EM algorithm for gene expression? It is pretty much seal with an EM step that probabilistically estimates abundance.
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */

            Comment


            • #96
              I have heard of expectation maximization but not actually studied what it does. I should probably read about it to see if it seems useful.

              I was not really aware of Sailfish until today or yesterday, when the author posted on SeqAnswers that it now has a successor named Salmon. I should probably do a comparative benchmark sometime.
              Last edited by Brian Bushnell; 12-19-2014, 05:33 PM.

              Comment


              • #97
                Yeah they came up with some kind of so-called perfect hash so that every kmer lookup is O(1). The slowest part of their program is probably the EM part (it slows down with increasing size of reference). I think these days if a program intends to put out expression levels for a set of references with ambiguity of mapping between features the EM is the standard. Until something better is figured out...
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */

                Comment


                • #98
                  Originally posted by Brian Bushnell View Post
                  I'd like to introduce another member of the BBTools package, BBDuk. "Duk" stands for Decontamination Using Kmers. BBDuk is extremely fast, scalable, and memory-efficient, while maintaining greater sensitivity and specificity than other tools. It can do lots of different things, for example -

                  Adapter trimming:
                  bbduk.sh -Xmx1g in=reads.fq out=clean.fq ref=adapters.fa ktrim=r k=28 mink=12 hdist=1

                  ...where "ktrim=r" is for right-trimming (3' adapters), "ktrim=l" is for left-trimming (5' adapters), and "ktrim=N" just masks the kmers with "N". "k" specifies the kmer size to use (must be at most the length of the adapters) while "mink" allows it to use shorter kmers at the ends of the read (for example, k=12 for the last 12 bases). "hdist" means "hamming distance"; this allows one mismatch. Instead of "ref=file" you can alternately (or additionally) say "literal=ACTGGT,TTTGGTG" for those two literal strings. The BBTools package currently includes Truseq and Nextera adapters sequences in /bbmap/resources/truseq.fa.gz and /bbmap/resources/nextera.fa.gz.
                  Hi Brian,

                  Got a bit confused with these commands, so I did RNA sequencing using Truseq Stranded mRNA Kit, 200-bp paired-end.

                  I understand that I'm supposed to trim the adapters off

                  *Read 1: I will have to trim (AGATCGGAAGAGCACACGTCTGAACTCCAGTCA)

                  *Read 2: I will have to trim (AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT)

                  How should the command be? Could you please advise?

                  Thank you. Merry Christmas!

                  Comment


                  • #99
                    Basically something like this:

                    Code:
                    $ replace_windows_OS_specific_bbduk_command_here -Xmx2g in1=/path/to/sample_R1.fastq in2=/path/to/sample_R2.fastq out1=/path/to/sample_R1_no_adapters.fastq out2=/path/to/sample_R2_no_adapters.fastq ref=/path/to/bbmap/resources/truseq.fa.gz ktrim=r [and other options for bbduk to follow]

                    Comment


                    • Originally posted by GenoMax View Post
                      Basically something like this:

                      Code:
                      $ replace_windows_OS_specific_bbduk_command_here -Xmx2g in1=/path/to/sample_R1.fastq in2=/path/to/sample_R2.fastq out1=/path/to/sample_R1_no_adapters.fastq out2=/path/to/sample_R2_no_adapters.fastq ref=/path/to/bbmap/resources/truseq.fa.gz ktrim=r [and other options for bbduk to follow]
                      Hi Genomax,

                      Sorry, but I don't understand what is "/path/to/", if my files are all in C: drive Windows, will it be

                      jave -ea -Xmx2g in1=C:\sequencing\ST131_1_R1.fastq in2=C:\sequencing\ST131_1_R2 out1=??? out2=??? ref=C:\bbmap\resources\truseq.fa.gz ktrim=r k=???? mink=???? hdist=????

                      ??? For out1 and out2 do I need to create some kind of new file?
                      ??? I don't know what to put for k, mink and hdist, should k be 33 because the adapter sequences are 33 bases? not sure about mink and hdist.

                      Thank you. Merry Christmas

                      Comment


                      • Originally posted by michaellim View Post
                        Hi Genomax,

                        Sorry, but I don't understand what is "/path/to/", if my files are all in C: drive Windows, will it be

                        Code:
                        $ java -ea -Xmx2g in1=C:\sequencing\ST131_1_R1.fastq in2=C:\sequencing\ST131_1_R2 out1=??? out2=??? ref=C:\bbmap\resources\truseq.fa.gz ktrim=r
                        ??? For out1 and out2 do I need to create some kind of new file?
                        ??? I don't know what to put for k, mink and hdist, should k be 33 because the adapter sequences are 33 bases? not sure about mink and hdist.

                        Thank you. Merry Christmas
                        Assuming your BBMap directory is at c:\ level the following command gives you a general idea of how to structure your own.

                        out1 and out2 represent two new files names that you provide that will contain the adapter trimmed (cleaned) reads from R1 and R2 read files.

                        Code:
                        $ java -ea -Xmx1g -cp C:\bbmap\current\ jgi.BBDukF in1=C:\sequencing\ST131_1_R1.fastq in2=C:\sequencing\ST131_1_R2.fastq out1=c:\sequencing\ST131_1_R1_cleaned.fastq out2=c:\sequencing\ST131_1_R2_cleaned.fastq ref=C:\bbmap\resources\truseq.fa.gz ktrim=r
                        If is possible that your reads are already adapter trimmed so no reads may get thrown out/trimmed after running BBDuk tool. Look at the stats produced at the end to see what happens.

                        Look at the contents of bbduk.sh file (open it in wordpad) to understand the meaning of optional parameters. You may or may not want/need to provide additional parameters.
                        Last edited by GenoMax; 12-25-2014, 07:15 AM.

                        Comment


                        • Originally posted by sdriscoll View Post
                          Hey Brian,

                          bbduk and seal are certainly great programs and since I have downloaded these tools I've been using them pretty much daily. While most kmer tools ignore the idea of allowing a mismatch (since obviously this complicates hash lookups) I like that you have included this however using it imposes such a severe spike in memory usage that it becomes basically unusable. Case in point - using seal to produce hits to genes in a full transcriptome while allowing a mismatch per kmer. Is there a reason that you put the burden of additional kmers on the reference and not on the read? It seems to me that there would be a performance hit but I would much rather have that than the massive amount of memory needed to produce a kmer index of a full transcriptome allowing even one mismatch. So during matching the read could be kmer'd into all possible single mismatch variants and matched to the normal 0-mismatch kmer index. Or does that bring in additional issues that are not easily resolved?
                          I just released a new version of BBTools (34.48) which adds this capability. You can use it with the "qhdist" flag (queryhammingdistance). It doesn't really slow down very much as long the vast majority of the reads are expected to have kmer matches; but if the majority of reads don't have kmer matches, there will be a ~3*K-fold slowdown. But even then the speed is still acceptable for most purposes.

                          Here's a sensitivity analysis. I generated 100k 100bp reads from E.coli, and gave them each a random number of SNPs from 0 to 20 (probably 15 on average). Then I tested the percentage that were correctly identified as E.coli by BBDuk:

                          bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=f
                          Time: 0.8 seconds
                          Filtered: 33.51%

                          bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=t
                          Time: 0.8 seconds
                          Filtered: 39.83%

                          bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=f hdist=1
                          Time: 30 seconds (29 of which were load time)
                          Filtered: 48.62%

                          bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=f qhdist=1
                          Time: 10 seconds
                          Filtered: 48.62%

                          bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=t qhdist=1
                          Time: 10 seconds
                          Filtered: 53.51%

                          bbduk.sh in=20snps.fq.gz ref=ecoli_K12.fa k=31 maxskip=0 mm=t qhdist=1 hdist=1
                          Time: 42 seconds (29 of which were load time)
                          Filtered: 63.06%

                          So "mm=t" is not really as good as hdist=1 if there is a really high error rate, but it's free.

                          P.S. Incidentally, BBMap mapped around 55% on defaults and 66% with high sensitivity (k=12 minratio=0.1).
                          Last edited by Brian Bushnell; 02-23-2015, 08:13 PM.

                          Comment


                          • @Brian: With paired end reads out1m= and out2m= don't seem to work (only outm=). Is that by design?
                            Last edited by GenoMax; 02-27-2015, 11:31 AM. Reason: Corrected

                            Comment


                            • Hi GenoMax,

                              I just tried it, and it seems to work fine for me. Are you getting an error message, or is one of the files simply not being produced? Also, what version are you running?

                              Comment


                              • Originally posted by Brian Bushnell View Post
                                Hi GenoMax,

                                I just tried it, and it seems to work fine for me. Are you getting an error message, or is one of the files simply not being produced? Also, what version are you running?
                                I am using v. 34.33.

                                Code:
                                Exception in thread "main" java.lang.RuntimeException: Unknown parameter out1m=xxxx_AACCAGCT_L001_R1_001_match.fastq.gz
                                	at jgi.BBDukF.<init>(BBDukF.java:400)
                                	at jgi.BBDukF.main(BBDukF.java:62)

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                12 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                51 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                68 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X