  • Hi Jake,

    BBDuk takes exponential time to process the reference when you increase the maximum number of substitutions. Specifically, allowing 3 substitutions with K=27 takes 82^3, or 551,368, times as long as normal to load the reference, so that may be the issue (it depends on how big the reference is). During that time it will use roughly 7 threads. Processing 90,000 reads after the reference is loaded would take under 0.1 seconds.
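
    A rough way to see where the 551,368 figure comes from: each of the K=27 positions in a kmer can mutate to 3 alternate bases, so every allowed substitution multiplies the number of stored reference kmers by roughly 3*27+1 = 82. A minimal shell sketch of that arithmetic (the growth model is an approximation, not BBDuk's exact bookkeeping):

    ```shell
    # Approximate stored-kmer multiplier vs. allowed substitutions (hdist)
    # for K=27: each extra substitution multiplies the count by ~(3*K + 1).
    K=27
    m=1
    for hdist in 1 2 3; do
      m=$(( m * (3 * K + 1) ))
      echo "hdist=$hdist -> ~${m}x reference kmers"
    done
    # prints ~82x, ~6724x, and ~551368x
    ```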

    I don't know all of the flags exposed by Geneious, but from what you've posted, I recommend these settings:

    Adapters
    Trim: Right End Only
    Kmer Length: 23
    Max Substitutions: 1 (I only go to 2 if the reads are very low quality, but never 3 for adapter trimming of paired reads)
    Max Substitutions + INDELs: 0
    Trim partial adapters with kmer length: Yes, 11

    Trim Low Quality - Yes
    Right End Only
    Minimum Quality: 12 (This varies; with very high coverage, this can be higher)

    Discard Short Reads - Yes
    Minimum Length: 75 bp (Again, this varies)

    Keep Original Order - Yes


    Additionally, you should add the flags "tbo" and "tpe" to the custom flags area if they are not exposed in the GUI. So, please try that and let me know if the speed is still odd; with those settings it should take ~1 second and result in less overtrimming.
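
    Outside of Geneious, the settings above map roughly onto these bbduk.sh command-line flags (a sketch; the in/out/ref file names are placeholders, and "Max Substitutions + INDELs: 0" is already the default edit-distance behavior):

    ```shell
    # Approximate command-line equivalent of the GUI settings above
    # (file paths are placeholders).
    bbduk.sh in=r1.fq in2=r2.fq out1=trimmed1.fq out2=trimmed2.fq \
        ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 \
        qtrim=r trimq=12 minlen=75 ordered tbo tpe
    ```

    The "ordered" flag corresponds to "Keep Original Order".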


    • Hi Brian,

      Thanks for the explanation. In general, I have pretty high coverage, which is why I set the trimming parameters so strictly - a minimum of ~1000x coverage in 150 x 2 PE reads (I'm sequencing PCR amplicons). The only worry would be trimming non-adapter sequence from my reads by allowing too many substitutions.

      What's a good number for minimum overlap when using the "tbo" flag?

      I'll give this a try later today and see how it performs.

      Thanks again!
      Jake


      • The default minoverlap for "tbo" is 14 bp, and the default mininsert is 40 bp, which are fine except for uRNA (which needs mininsert=17) or a handful of other rare uses.
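
        If those defaults need overriding, they can be passed explicitly; a hypothetical example for a short-insert library (the file names are placeholders, and every value except mininsert is the stated default):

        ```shell
        # Hypothetical override of the overlap-detection defaults for short inserts.
        bbduk.sh in=r1.fq in2=r2.fq out1=t1.fq out2=t2.fq ref=adapters.fa \
            ktrim=r k=23 mink=11 tbo tpe minoverlap=14 mininsert=17
        ```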


        • Thanks Brian.

          If I intend to merge reads using BBMerge, should I only Adapter Trim, Merge, and then Trim with the other parameters?

          I'm hoping to use "mix=t" with BBMerge, if downstream assemblers (Tadpole, Spades, CLC) are capable of handling the mix of merged and unmerged reads as input. I assume reads that don't merge aren't poor quality - is that true? If unmerged reads are poor quality then I'll leave them unmixed and throw the unmerged reads out...

          Jake


          • Hi Jake,

            "mix" is primarily for error-correct via overlap (ecco) mode when you want both overlapping and non-overlapping pairs output in the same file. I don't recommend it for merging because then you get singletons and pairs in the same file which messes up pairing. This won't affect Tadpole's assembly, but it will affect other tools like Spades.

            If you assemble using a pipeline like this:

            Code:
            #Trim adapters
            bbduk.sh in=r1.fq in2=r2.fq out=trimmed.fq ktrim=r k=23 mink=11 hdist=1 ref=/bbmap/resources/adapters.fa tbo tpe maxns=0 trimq=10 qtrim=r maq=12
            
            #Remove small contaminants
            bbduk.sh in=trimmed.fq out=filtered.fq k=31 ref=/bbmap/resources/sequencing_artifacts.fa.gz,/bbmap/resources/phix174_ill.ref.fa.gz
            
            #Error-correct 1
            bbmerge.sh in=filtered.fq out=ecco.fq ecco mix vstrict adapters=default
            
            #Error-correct 2
            clumpify.sh in=ecco.fq out=eccc.fq ecc passes=6
            
            #Error-correct 3
            tadpole.sh in=eccc.fq out=ecct.fq ecc
            
            #Merge
            bbmerge.sh in=ecct.fq out=merged.fq outu=unmerged.fq rem k=62 extend2=50 adapters=default
            
            #Assemble with Spades
            spades.py -k 21,41,71,101,127 -o out -12 unmerged.fq -s merged.fq
            #Or Tadpole
            tadpole.sh in=merged.fq,unmerged.fq k=93 out=tadpole.fa
            
            #Evaluate
            stats.sh in=/out/scaffolds.fasta
            
            #Quantify coverage
            bbmap.sh ref=/out/scaffolds.fasta in=trimmed.fq out=mapped.sam covstats=covstats.txt
            "mix" was used for the first error-correction phase with BBMerge. When BBMerge was used the second time, for merging, "mix" was not used, and instead the output was split into merged.fq (singletons) and unmerged.fq (pairs). These can be fed in to Spades or Tadpole as shown. In this case, if you want to quality-trim to Q20, that should be done only unmerged.fq because quality-trimming the raw reads to a high level prior to merging will reduce the merge rate.

            Reads that don't merge might be low quality. But they might also have a large insert size so that they don't overlap, or come from a region that is low-complexity so it's not clear whether they overlap. Therefore, I recommend NOT throwing away the non-overlapping reads.
            Last edited by Brian Bushnell; 02-28-2017, 10:21 AM.


            • Thanks Brian. It doesn't look like Clumpify has been built into Geneious yet, so I'll have to skip that step.

              Also, I was wondering why there were 71 sequences in the Nextera Adapters set? The only adapter sequence that I know of for Nextera is "CTGTCTCTTATACACATCT". Will BBDuk search for the adapter and trim everything upstream (5') of it? That would take care of the indexes as well.

              Jake


              • Nextera adapters are much longer than what you've posted, which I have listed as the "Nextera Transposon". Longer sequences are more specific. The reason I have so many Nextera sequences is that all of Illumina's different indexes are included as well. And adapter trimming goes toward the 3' end, not the 5' end. But yes, once BBDuk identifies an adapter kmer, it trims that and everything downstream.


                • Hi Brian,

                  I always mix up the placement of the adapters. I'm not sure what the location I'm referring to is called, but it is the sequence added by the transposase, and the location where the primers bind during library amplification. If we use that sequence, shouldn't we be able to trim that and everything 5' of it (that should remove the indexes and the adapter sequence)?

                  Illumina expanded their indexes. You can now run 384 samples per flow cell using Nextera XT Indexes with the original Nextera enzyme. I can share the sequences if you're interested.

                  Jake


                  • Just use the adapters.fa file included in the "bbmap/resources" directory. That should cover all common adapters.


                    • Originally posted by JVGen View Post
                      If we use that sequence, shouldn't we be able to trim that and everything 5' of it (that should remove the indexes and the adapter sequence)?
                      To be clear, I like to use "left" and "right" rather than 3' and 5'. In a fastq file, reads have a left and right end, so...

                      A read's adapter is to the left of its left end. It does not get sequenced and thus does not need to be trimmed. When the insert size is shorter than read length, a read will run out of genomic sequence and go into the adapter on the opposite end of the molecule. This shows up as adapter sequence on its right side. So, adapter-trimming finds adapter sequence on the right end of a read and trims to the right.
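
                      As a toy illustration (this assumes bbduk.sh is on the PATH; the read and adapter sequence here are invented for the example): once the adapter kmer is found, it and everything to its right are removed.

                      ```shell
                      # Toy read: 12 bp of "insert" followed by adapter read-through.
                      printf "@r1\nACGTACGTACGTCTGTCTCTTATACACATCT\n+\nIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n" > toy.fq
                      # ktrim=r trims the matched kmer and everything to its right.
                      bbduk.sh in=toy.fq out=toy_trimmed.fq literal=CTGTCTCTTATACACATCT ktrim=r k=19
                      ```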

                      Illumina expanded their indexes. You can now run 384 samples per flow cell using Nextera XT Indexes with the original Nextera enzyme. I can share the sequences if you're interested.
                      They're constantly expanding... sigh. Thanks for your offer, but I'll download their "customer service letter" and manually add the new sequences. If Illumina were truly interested in customer service, they would provide all of their adapters in a fasta file, but they really are not - they're much more interested in protecting their IP (note that Illumina makes most of its money from selling reagents), so they make it as difficult as possible.


                      • Dear Brian,

                        I would like to try your bioinformatics tools, but I am having some trouble with the files that are generated.

                        I just want to quality-trim my MiSeq reads to Q>30 with BBDuk,
                        so I run this command:

                        /data/umb/cichocki/bbmap/bbduk.sh in=/data/umb/cichocki/project2/bbduckdu11avril/project2_R2.fastq in=/data/umb/cichocki/project2/bbduckdu11avril/project2_R2.fastq out1=clean11avril.fastq out2=clean211avril.fastq qtrim=rl trimq=30

                        As a result, 6% of the sequences were removed. Fine!

                        The next steps have to be done in MOTHUR... how should I proceed?

                        My first command in MOTHUR would be

                        make.contigs(ffastq=clean11avril30.fastq, rfastq=clean211avril30.fastq, oligos=myko.txt)

                        The problem arrives a few seconds later in the form of a not-so-nice message:

                        M04654_34_000000000-AVVJL_1_1102_21707_1905 is in your forward fastq file and not in your reverse file, please remove it using the remove.seqs command before proceeding.
                        Making contigs...
                        ....(core dumped)
                        and it stops right there...

                        I thought that your software did the cleaning process on both files, R1 and R2?

                        Am I right or not? Is there a bug, or did I do something wrong?

                        Thank you in advance for your answer.

                        Best regards,

                        Nicolas


                        • That is strange. Since you trimmed R1/R2 files together they should never go out of sync.

                          It is possible that the files you originally got from your sequence provider are out of sync (since they were likely trimmed on the sequencer or in BaseSpace). I suggest that you first run "repair.sh" on your original files and then do the quality trimming.

                          By the way, trimming at trimq=30 is very severe. You should not need to do that.


                          • Dear GenoMax,

                            Thank for your answer.

                            I do it and come back here asap.

                            Best regards,

                            Nicolas

                            P.S. Yes, actually I tried 25, 27 and 30.


                            • Results:
                              java -ea -Xmx13352m -cp /data/umb/cichocki/bbmap/current/ jgi.SplitPairsAndSingles rp in1=/data/umb/cichocki/project2/bbduckdu11avril/project2_R1.fastq in2=/data/umb/cichocki/project2/bbduckdu11avril/project2_R2.fastq out1=fixed1.fastq out2=fixed2.fastq outsingle=singletons.fastq
                              Executing jgi.SplitPairsAndSingles [rp, in1=/data/umb/cichocki/project2/bbduckdu11avril/project2_R1.fastq, in2=/data/umb/cichocki/project2/bbduckdu11avril/project2_R2.fastq, out1=fixed1.fastq, out2=fixed2.fastq, outsingle=singletons.fastq]

                              Set INTERLEAVED to false
                              Started output stream.

                              Input: 43045566 reads 12956715366 bases.
                              Result: 43045566 reads (100.00%) 12956715366 bases (100.00%)
                              Pairs: 43045566 reads (100.00%) 12956715366 bases (100.00%)
                              Singletons: 0 reads (0.00%) 0 bases (0.00%)

                              No errors, if I understand it correctly!

                              I re-trimmed at trimq=27... and now it's apparently working in MOTHUR. Let's see how well it works!

                              Thank you anyway for this command, which may be useful to me in the future, or perhaps to someone else!

                              Enjoy your working afternoon on the East Coast!

                              Nico


                              • Interesting. Well, as long as it is working. You did not lose any reads in the process either :-)
