  • No idea; those commands should be equivalent, unless the adapters.fa file is different. Can you post the full output of the command so I can see where the bases were lost? Also, you may want to add "stats=stats.txt" which will indicate which adapter sequences are being hit, and should be identical in both cases.
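
    For example, adding it to a typical adapter-trimming command (a sketch; the file names are placeholders):

    Code:
    bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 stats=stats.txt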

    • Perhaps Geneious is using an adapters.fa file that is different?

      • Re: different results from command line and Geneious plugin

        Yes, the adapters files are different. Geneious says it is using "All TruSeq, Nextera, and PhiX adapters" (152 sequences). The count in the resources folder of bbmap is 154 sequences. I don't know which 2 are missing from Geneious, presuming the shared 152 are the same. I could not find the adapters file Geneious is calling... or the stats.txt file I asked it to produce. May need to call in Geneious support to help with this. Among the top stats.txt hits in the Linux output are PCR dimer (16% of reads) and PCR primers (15% of reads). I wonder if these are lacking in Geneious... Geneious is trimming more sequences by overlap (206,591 vs 157,344) and fewer by ktrim (2,630,578 vs 2,657,654). Now... which one is best? The one trimming more?

        • Probably the one trimming more is better. Geneious is probably trimming more by overlap because overlap trimming happens after adapter-sequence trimming, so if some adapter sequences are missing, those will (usually) get overlap-trimmed rather than sequence-trimmed. But it's hard to say. You could align the first 1m reads to the reference (if you have one) and look at the mapping rates and error rates to see which dataset is better.

          • Hi Brian,

            I'm trying to do some trimming of primer sequences from amplicons that I sequenced.

            I have a viral genome that was amplified in 5 large-ish overlapping pieces (~2-3kb) and then fragmented and sequenced on a NextSeq (1x150 SE).

            I would like to trim only the primer sequences on the end of the read. The problem is that I have primers of all different lengths; how would you go about encompassing all of them in one round of BBDuk trimming? Or would you have to do them individually for each k value?

            I was thinking something like this:
            Code:
            bbduk.sh in=$f1 out=${file_base}_ff.fq.gz literal=AGTTGTTAGTCTACGTGGA,AGCTGGAAAAACAAAGAACT,GCCAAGAAACAGGATGTCGTTG,GCCGTAATTGGTATCGATACTGGA,GGACATGGGCAGATTGAC,ACTGTGAGGATGGCTATGA,GCAGACAGAAAGTGGTGTTTT,ACAAAACGTGGGCTTACCATG,AGGCCTAACAAAAGGAGGAC,TTTCCCAGCGTCAATATGCT ktrim=l k=24 mink=15 hdist=2 qtrim=rl trimq=30 restrictleft
            Where mink=15 encompasses all of the primers shorter than the longest (k=24). Does this seem OK?

            Thanks in advance.
            Last edited by GenoMax; 06-06-2017, 07:46 AM. Reason: Added [CODE] tags

            • Since you have 5 different overlapping sections, if you trim the primers all at once with no precautions, you will break your assembly into 5 segments, since any read overlapping a primer region will have the primer region trimmed. You can mostly eliminate this effect, though, using the "restrictleft" flag as in your command, but you need to specify a position - in this case, "restrictleft=24", or maybe slightly more if there is some uncertainty about where, exactly, sequencing starts on the amplification primer (restrictleft=26 should be fine).
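              For example, your command with the position specified (a sketch; <primer list> stands for the literal sequences you listed):

              Code:
              bbduk.sh in=$f1 out=${file_base}_ff.fq.gz literal=<primer list> ktrim=l k=24 mink=15 hdist=2 qtrim=rl trimq=30 restrictleft=26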

              With that change, I think your command will be fine. And certainly the most convenient. So, try it, and see if the virus assembles in one contig.

              If it doesn't, you can refine it more. If you have the data split into 5 libraries with different barcodes, I'd recommend running each one independently with just its own sequences and kmer lengths. But if not, you can still get a slight increase in precision by running each primer individually with its own specific kmer length (in which case you should not do quality-trimming until after all primer-trimming is complete). You can also increase precision by turning off "rcomp" when running primers individually (since left-end primer sequences will always be the original sequence rather than reverse-complemented). You might need to play around with this a bit depending on how the sequences are reported (e.g. maybe half of the primers are already reported as reverse-complements). But those steps are usually not necessary with super-short genomes as in this case, particularly with primers that are designed to be unique in the genome; I'm just listing them as examples of what could be done if needed.
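              As a sketch of the per-primer approach (the primer shown is the first one from the list above, which is 19bp, so k matches its length exactly, and restrictleft allows 2bp of slack; file names are placeholders):

              Code:
              bbduk.sh in=reads.fq out=trim1.fq literal=AGTTGTTAGTCTACGTGGA ktrim=l k=19 rcomp=f restrictleft=21

              Each primer would get its own pass like this, chaining the output of one pass into the input of the next.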

              • Brian,

                Thanks for the quick response.

                It seems like splitting it up does give me a little better precision (as judged by a higher % of bases lost), so I'll stick with that.

                On a similar note, I'm wondering what your thoughts are on quality trimming. If I trim to q30 (qtrim=rl trimq=30) I lose about half my bases. Do you think this level of stringency is required for variant calling? What would you normally do?

                • I do not recommend extremely stringent quality-trimming prior to variant-calling (or most anything, for that matter). In general, it reduces mapping accuracy and increases various forms of bias, though the magnitude is different for a tiny nonrepetitive virus than for human data. Also, shorter reads reduce variant quality-score estimates and the ability to detect indels.

                  BBTools' callvariants.sh can do quality-trimming on mapped reads just prior to variant-calling, so the quality-trimming does not reduce mapping accuracy. The default value is 10. Additionally, the default direction is "qtrim=r" because Illumina quality scores are inaccurate (artificially low) on the left side due to the base-caller; with correctly-calibrated quality scores, though, I would use "qtrim=rl". You can, in fact, recalibrate the quality scores if you want, using calctruequality.sh.
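                  For example, deferring quality-trimming to the variant-calling stage might look like this (a sketch; file names are placeholders, and I'm assuming the value flag is "trimq", mirroring BBDuk, with values matching the defaults described above):

                  Code:
                  callvariants.sh in=mapped.sam ref=virus.fa out=vars.vcf qtrim=rl trimq=10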

                  Starting with the raw reads, for variant-calling on single-ended amplicon data from a NextSeq, I'd generally do this:

                  Code:
                  # 1) Adapter-trim; minlen=150 keeps only full-length reads for calibration
                  bbduk.sh in=raw.fq out=trimmed1.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 minlen=150

                  # 2) Map to the reference with high accuracy, discarding ambiguously-mapped reads
                  bbmap.sh in=trimmed1.fq out=mapped_raw.sam ref=virus.fa ambig=toss slow

                  # 3) Generate recalibration matrices; "callvariants" prevents true variants from being counted as errors
                  calctruequality.sh in=mapped_raw.sam ref=virus.fa callvariants

                  # 4) Discard reads from low-quality areas of the flowcell
                  filterbytile.sh in=raw.fq out=filtered.fq

                  # 5) Remove optical and tile-edge duplicates
                  clumpify.sh in=filtered.fq out=clumped.fq dedupe optical spany adjacent

                  # 6) Apply the recalibration matrices to the quality scores
                  bbduk.sh in=clumped.fq out=recal.fq recalibrate

                  # 7) Adapter-trim the recalibrated reads
                  bbduk.sh in=recal.fq out=recal_trimmed.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 minlen=30

                  # 8) (primer-trim) -> primer_trimmed.fq

                  # 9) Gentle quality-trimming, now that the scores are accurate
                  bbduk.sh in=primer_trimmed.fq out=qtrimmed.fq qtrim=rl trimq=8 minlen=30

                  # 10) Final mapping and variant-calling
                  bbmap.sh in=qtrimmed.fq out=mapped_final.sam ref=virus.fa

                  callvariants.sh in=mapped_final.sam ref=virus.fa out=vars.vcf ploidy=1
                  (possible additional variant-calling flags like "rarity=0.05" depending on whether you are interested in minority alleles of a multi-strain viral sample)
                  Note that quality-score recalibration and filter-by-tile are only effective if you have sufficient reads (millions are recommended) to prevent over-fitting. If you only have 100x data for a 15kbp virus I don't recommend them (the first 4 steps), but without recalibration, I don't recommend quality-trimming on the left end.

                  If the virus is sufficiently divergent from the reference, you can alternatively assemble the raw reads and use that assembly as "ref" for recalibration (steps 2 and 3).
                  Last edited by Brian Bushnell; 06-06-2017, 02:46 PM. Reason: Made more complete

                  • Very interesting. I like the idea of quality-score recalibration. If I don't have millions of reads (only between 1,000x and 20,000x avg. coverage) can I still do all of the first steps without the filterbytile step? Will qScore recal still help me here?

                    Also,

                    "If the virus is sufficiently divergent from the reference you can alternatively assemble the raw reads and use that assembly as "ref" for recalibration (step 2 and 3)."

                    Do you mean use something like Tadpole to de novo assemble the divergent virus and then generate a fasta file to use as a new reference?

                    I haven't tried clumpify's consensus feature but it would be cool if it would generate a new fasta file based on a reference as the consensus. Or is this possible already?

                    One other question related to variant calling. I'm using callvariants.sh and I'm noticing some variants that are clearly not real due to problems at the ends of the reads.

                    If I want to call variants from only the middle 60% of the read (1x150bp) is this the correct command?

                    callvariants.sh in=in.bam out=out.vcf ref=ref.fa minedist=30


                    Thanks again for your help.

                    • Originally posted by jweger1988:
                      Very interesting. I like the idea of quality score recalibration. If I don't have millions of reads (only between 1000-20000x avg. coverage) can I still do all of the first steps without the filterbytile step? Will qScore recal still help me here?
                      20000x * 15000bp / 150bp = 2 million reads, which is plenty. 1000x * 15000bp / 150bp = 100 thousand reads, which is insufficient. So, it depends. However, you can use the entire lane for filterbytile or recalibration (which will give optimal results). Filterbytile, specifically, does not need any mapping so it does not matter what the reference is; you can just dump a whole lane or flowcell into it and you will get optimal filtering. To use it effectively on a small library, you can run it in 2 stages: one where you generate the quality map using all reads, then a second that filters just your library of interest:

                      Code:
                      filterbytile.sh in=whole_flowcell.fq dump=map.flowcell
                      filterbytile.sh in=library.fq indump=map.flowcell out=filtered.fq
                      You can do something similar with recalibration, but that is trickier, unless everything in the lane has the same reference. With >1 million reads recalibration should be useful; with less than that, it shouldn't hurt anything but will be less effective. The maximum amount a quality score can be changed is governed by the total number of observations of the context: if you only have 3 examples of a context, a base's quality won't be changed; if you have thousands, it will. Contexts are fairly specific (e.g. quality score 26 for position 17 in a read where the base call is 'A'), so you need many mapped reads to get just one observation of a given context, and many observations of a context to get a useful phred score. However, since the NextSeq has binned quality scores, there are fewer possible contexts, so maybe even 100k reads would be enough.
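                      A hypothetical sketch of lane-wide recalibration, assuming everything in the lane maps to the same reference (file names are placeholders; the calctruequality/recalibrate pairing follows the pipeline above):

                      Code:
                      bbmap.sh in=whole_lane.fq out=whole_lane_mapped.sam ref=ref.fa
                      calctruequality.sh in=whole_lane_mapped.sam ref=ref.fa
                      bbduk.sh in=library.fq out=library_recal.fq recalibrate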

                      "If the virus is sufficiently divergent from the reference you can alternatively assemble the raw reads and use that assembly as "ref" for recalibration (step 2 and 3)."

                      Do you mean use something like Tadpole to de novo assemble the divergent virus and then generate a fasta file to use as a new reference?
                      Yes, that's what I mean. But that fasta would only be used for recalibration. To recalibrate you need to align to a fasta and count matches/mismatches. The closer that fasta is to the sample, the more accurate the process is, because differences between the sample and its reference will no longer appear as mismatches. So assembling the reads is often the best approach rather than using an existing reference, particularly for a small organism that's easy to assemble, or when there is a large difference between the sample and ref. For the final variant-calling, you still need to use the reference rather than an assembly.
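                      For example, a sketch of recalibrating against a Tadpole assembly (file names are placeholders):

                      Code:
                      tadpole.sh in=raw.fq out=assembly.fa
                      bbmap.sh in=raw.fq out=mapped.sam ref=assembly.fa
                      calctruequality.sh in=mapped.sam ref=assembly.fa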

                      I haven't tried Clumpify's consensus feature but it would be cool if it would generate a new fasta file based on a reference as the consensus. Or is this possible already?
                      Unfortunately... it only produces consensuses per clump, which is not what you want. Tadpole would be the best option.

                      One other question related to variant calling. I'm using callvariants.sh and I'm noticing some variants that are clearly not real due to problems at the ends of the reads.

                      If I want to call variants from only the middle 60% of the read (1x150bp) is this the correct command?

                      callvariants.sh in=in.bam out=out.vcf ref=ref.fa minedist=30

                      Thanks again for your help.
                      Yes, "minedist=30" should do what you want (ignore variants that are mainly near the ends of reads), but might be too aggressive. For 150bp reads the average distance of a variant from the read end should be 150/4 = 37.5, given random distribution, so maybe minedist=20 would be safer. You can alternatively yes "border=30" which will ignore the outermost 30bp of the reads and thus never consider anything in that area - that allows the whole read to be used for alignment, but only the best part in the middle to be used for variant-calling.

                      If you use the flags "out=vars.txt vcf=vars.vcf", CallVariants will also produce a tab-delimited text file that you can paste into Excel and manually look at the "edist" column to see whether it looks useful to exclude false positives. You can theoretically do that with VCF also but the tab-delimited one is much more convenient!
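                      For example (a sketch using the flags above):

                      Code:
                      callvariants.sh in=in.bam ref=ref.fa out=vars.txt vcf=vars.vcf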

                      • Originally posted by Brian Bushnell:
                        Yes, "minedist=30" should do what you want (ignore variants that occur mainly near the ends of reads), but it might be too aggressive. For 150bp reads the average distance of a variant from the nearer read end should be 150/4 = 37.5, given a random distribution, so maybe minedist=20 would be safer. You can alternatively use "border=30", which will ignore the outermost 30bp of the reads and thus never consider anything in that area; that allows the whole read to be used for alignment, but only the best part in the middle to be used for variant-calling.

                        If you use the flags "out=vars.txt vcf=vars.vcf", CallVariants will also produce a tab-delimited text file
                        Love these tidbits that you mention in the answers. I was going to look for a vcf2tab program for VCFs generated by callvariants.sh. Will try this out tomorrow.

                        • Hi Brian,

                          I am trying to QC a sequencing library downloaded from SRA, so I do not know which sequencing machine was used. It seems that BBDuk struggles to recognize the paired reads in it and stops working without exiting. Details are below:

                          This is a FASTQ example:
                          @ERR011089.1.1 I58_1_FC30DYKAAXX:8:1:3:1833 length=75
                          CTTCGGAAGACATTTCTTCAGAAAGTTTTTGGAGAAGATTTTTTAACGAAACCTATGGAGCGGGCTCAGCAGGAT
                          +
                          IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII(IIIIIIIIIIIIIIIIIDIIIII.I
                          @ERR011089.1.2 I58_1_FC30DYKAAXX:8:1:3:1833 length=75
                          CTGTCAAATCCNNTTTATCCATATTCTCCATAATGTAAAGCTTCACTTCATTCAAAAATGCCAGATTGGCATCCT
                          +
                          IIIIIIIIIII!!IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
                          @ERR011089.2.1 I58_1_FC30DYKAAXX:8:1:3:377 length=75
                          CGGCCAACCTCGTCGTCACGGACTACGCCTTCTCGCCCGACGAACGGCAGATACTCATCGCCTCGGGCGCCAGGC
                          +
                          IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII'III
                          @ERR011089.2.2 I58_1_FC30DYKAAXX:8:1:3:377 length=75
                          GGCGTCGCGCGNNGCTTCGGCCTCGCGGAGTATGGGCATCAGACTGTTGCCCGAGGCCAGGAAATAATCGGTCGC
                          +
                          IIIIIIIIIII!!IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII)IIIIIIIIIIII&II%


                          And this is the BBDuk message:

                          BBDuk version 36.64
                          Set INTERLEAVED to true
                          Reset INTERLEAVED to false because paired input files were specified.
                          Initial:
                          Memory: max=102900m, free=99142m, used=3758m

                          Added 134628693 kmers; time: 37.045 seconds.
                          Memory: max=102900m, free=83248m, used=19652m

                          Input is being processed as paired
                          Started output streams: 0.239 seconds.
                          Exception in thread "Thread-16" java.lang.AssertionError:
                          There appear to be different numbers of reads in the paired input files.
                          The pairing may have been corrupted by an upstream process. It may be fixable by running repair.sh.
                          at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:480)
                          at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:345)
                          at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:189)
                          at java.lang.Thread.run(Thread.java:748)

                          I also tried repair.sh as suggested; it could not fix the problem with the reads either.

                          So I would be grateful for help with this situation. At the very least, perhaps it is possible to change the code somehow to force the program to exit in case of such a crash.

                          Ksenia

                          • Hi Ksenia,

                            This is always a tricky problem when people rename the reads. However, it looks like in your case the file is interleaved (that's my guess from the read headers), and thus you should be using only a single input to bbduk and adding the "interleaved" flag; it seems you are giving it two input files. Can you post the bbduk command, and explain exactly what came out of the SRA extraction process? Also, posting the SRA extraction command you used might help (it won't help me, but it might mean something to somebody).
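                            If the file really is interleaved, the invocation would look something like this (a sketch; file names and trimming flags are placeholders):

                            Code:
                            bbduk.sh in=reads.fq.gz out=clean.fq.gz interleaved=t ref=adapters.fa ktrim=r k=23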

                            • Thank you, Brian, for the swift reply!

                              Yes, you are right; I used separate read files, not interleaved. Here is my SRA download command:

                              /sratoolkit.2.8.2-1-ubuntu64/bin/fastq-dump --gzip --skip-technical --readids --dumpbase --split-files --clip "$NAME"

                              And my BBDuk command:

                              bbduk.sh -Xmx100g in="$NAME"_1.fastq.gz in2="$NAME"_2.fastq.gz ref=/home/ksenia/ksenia2/Programms/bbmap/resources/adapters.fa,/home/ksenia/ksenia2/Programms/bbmap/resources/phix174_ill.ref.fa.gz out="$ID"_filtered.fastq interleaved=t fastawrap=10000 ktrim=r qtrim=rl minlength=30 maq=10 maxns=0 overwrite=t entropy=0.5 outs="$ID"_singletons.fastq tpe=t

                              I am quite sure about both of these commands, as they have worked perfectly for several hundred libraries; but for several similar libraries, like the FASTQ example I showed, BBDuk failed. It doesn't crash and exit; BBDuk just freezes and hangs the whole pipeline.

                              These are excerpts from the actual input files:

                              reads1:

                              @ERR011090.1.1 I328_1_FC30MD2AAXX:3:1:3:296 length=75
                              TGCCTATCTCATGTGGATTTATGCCATTTTTCTTCATAAGTCTATCAACCAACAATCTGCCCCATNCATCNGNCA
                              +ERR011090.1.1 I328_1_FC30MD2AAXX:3:1:3:296 length=75
                              IIIIIIIIIIIIIIIIIIIIIIIIIHIIIIHIIII@II<I7I>GDIC=@7<A0/60C<,A2;93%"E/+2"+"65
                              @ERR011090.2.1 I328_1_FC30MD2AAXX:3:1:3:894 length=75
                              TAGGCTATAAATTGTTTTTAGCGGTTGAAACAGGAATTTTCATTACGGCAGGGACACTCCCCTTTNTTTTNTNTT
                              +ERR011090.2.1 I328_1_FC30MD2AAXX:3:1:3:894 length=75
                              IIIIIIIIIIIIIIIIIIIIIIII8IH0IG/>A3F@HFIG70?A/.0,=111$,#))$+0+#$#""()$-"+"-$


                              reads2:

                              @ERR011090.1.2 I328_1_FC30MD2AAXX:3:1:3:296 length=75
                              ACCAAAGATAGATCCGTTTGAGGGANTTTTTGGNGTTTTTGCAGATAGTCTGCCTGATGGATGGGGCAGATTGTT
                              +ERR011090.1.2 I328_1_FC30MD2AAXX:3:1:3:296 length=75
                              IIIIIIIIIIIIIIIIIII@IIIF4"HIIIIII"I96IAII18F0B65+/5C5298-:,2,.0<57$)-%2--/.
                              @ERR011090.2.2 I328_1_FC30MD2AAXX:3:1:3:894 length=75
                              TTCAGAACGCTGTAAAAGACACACANCAAAATGNAAAAAACATAGGTCTTACTGAAAAAGACAACAAGAACAATC
                              +ERR011090.2.2 I328_1_FC30MD2AAXX:3:1:3:894 length=75
                              IIIIIIIIIIIIDIIIIIIIIBIII"IIIIIFI"IIH?II7E>?=D0;E;IBII9A;8<69:G25I5889-3812

                              When I tried to run repair.sh, it could not recognize the paired reads and output everything as singletons; maybe this info can also help. BBDuk actually starts to produce some output, sorting some sequences into the interleaved file and singletons, but at some point it stops working. I have several libraries with this problem. Any idea how this can be solved?

                              • Can you dump these reads with the "-F" option for fastq-dump to see if you are able to restore the original Illumina fastq headers? repair.sh should (if needed) work after that.
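                                Something like this (a sketch; the accession is taken from the header example above, and output names are placeholders):

                                Code:
                                fastq-dump -F --gzip --split-files ERR011090
                                repair.sh in=ERR011090_1.fastq.gz in2=ERR011090_2.fastq.gz out=fixed_1.fq.gz out2=fixed_2.fq.gz outs=singletons.fq.gz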
