Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Originally posted by willd View Post
    Dear Brian,

    I have many reads from a very specific scRNA contaminant of length 299 bp. Would BBDuk's contaminant filtering be a suitable tool for removing these, and what parameters would you recommend? (my reads are 100 bp PE)

    Would it be possible to do it in one run, e.g. by adding this 299bp sequence to the current adapters.fa file that you have provided in the resources folder?

    Many thanks for any help
    While you can probably use BBDuk, you may want to look at BBSplit which should be able to bin reads that map to your sequence away from the rest of the data.

    Comment


    • Originally posted by willd View Post
      Dear Brian,

      I have many reads from a very specific scRNA contaminant of length 299 bp. Would BBDuk's contaminant filtering be a suitable tool for removing these, and what parameters would you recommend? (my reads are 100 bp PE)

      Would it be possible to do it in one run along with adapter and quality trimming, e.g. by adding this 299bp sequence to the current adapters.fa file that you have provided in the resources folder?

      Many thanks for any help
      You can do this with either mapping (BBMap/BBSplit) or kmer-filtering (BBDuk/Seal); the result should be the same. BBDuk will be faster though.

      Assuming the reads are named "read1.fq" and "read2.fq", and your 299bp sequence is in sequence.fa:

      bbduk.sh in=read#.fq out=filtered#.fq ref=sequence.fa k=31

      You can adjust the sensitivity/specificity with other flags (like adding hdist=1, to catch the very-low-quality reads) but the command above should be adequate. It's technically possible to do this at the same time as adapter-trimming by adding the sequence to the adapters file; reads of that sequence should end up fully trimmed and thus will be discarded because they are too short (shorter than minlen which is 10 by default, I think). But you would achieve higher sensitivity and specificity by filtering in a different pass, so I'd recommend doing it in 2 steps.

      Comment


      • Dear Brian - thanks very much!

        In case it's of interest: I tried both 1 pass (contaminant filtering at same time as adapter-trimming) and 2 pass (adapter trimming then contaminant filtering) on two of my samples, and the output was close to identical: the number of remaining reads was ~0.4% lower for 1 pass v 2 pass (for both samples); differences in plots of per-base sequence quality, GC curves, etc. were almost unnoticeable.
        Last edited by willd; 01-13-2016, 05:13 AM. Reason: adding Pass 1 lower than pass 2

        Comment


        • Dear Brian, for one fastq file I got the error output below. FastQC did not show any particular problems with this file. The error refers to a specific read - could there be a problem for BBDuk to parse this read?

          Code:
              Executing jgi.BBDukF [in1=../../C467_RE_B00GGOO_8_1_C68T3ACXX.IND1.fastq.gz, in2=../../C467_RE_B00GGOO_8_2_C68T3ACXX.IND1.fastq.gz, out1=B00GGOO_1_TrimPass1.fastq, out2=B00GGOO_2_TrimPass1.fastq, minlen=25, qtrim=r, trimq=10, ktrim=r, k=25, mink=11, ref=my_adapters.fa, hdist=1, tpe, tbo]
              
              BBDuk version 35.82
              maskMiddle was disabled because useShortKmers=true
              Initial:
              Memory: max=777m, free=754m, used=23m
              
              Added 64944 kmers; time: 	0.212 seconds.
              Memory: max=777m, free=717m, used=60m
              
              Input is being processed as paired
              Started output streams:	0.176 seconds.
              java.lang.AssertionError: 
              Error in ../../C467_RE_B00GGOO_8_1_C68T3ACXX.IND1.fastq.gz, line 1210999, with these 4 lines:
              CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA
              +
              CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################
              @HISEQ10:C68T3ACXX:8:1101:19859:76022/1
              
              	at stream.FASTQ.quadToRead(FASTQ.java:722)
              	at stream.FASTQ.toReadList(FASTQ.java:653)
              	at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:111)
              	at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:96)
              	at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
              	at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
              Exception in thread "Thread-19" java.lang.NullPointerException
              	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
              Exception in thread "Thread-20" java.lang.NullPointerException
              	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
              Exception in thread "Thread-18" java.lang.NullPointerException
              	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
              Exception in thread "Thread-23" java.lang.NullPointerException
              	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
              Exception in thread "Thread-21" Exception in thread "Thread-24" java.lang.NullPointerException
              	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
              java.lang.NullPointerException
              	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
              Exception in thread "Thread-22" java.lang.NullPointerException
              	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
              Processing time:   		3.675 seconds.
              
              Input:                  	603200 reads 		60923200 bases.
              QTrimmed:               	67817 reads (11.24%) 	3329362 bases (5.46%)
              KTrimmed:               	52402 reads (8.69%) 	1521198 bases (2.50%)
              Trimmed by overlap:     	36830 reads (6.11%) 	190222 bases (0.31%)
              Result:                 	575826 reads (95.46%) 	55882418 bases (91.73%)
              
              Time:   			4.073 seconds.
              Reads Processed:        603k 	148.08k reads/sec
              Bases Processed:      60923k 	14.96m bases/sec
              Exception in thread "main" java.lang.RuntimeException: jgi.BBDukF terminated in an error state; the output may be corrupt.
              	at jgi.BBDukF.process(BBDukF.java:822)
              	at jgi.BBDukF.main(BBDukF.java:70)

          Comment


          • Originally posted by willd View Post
            Dear Brian, for one fastq file I got the error output below. FastQC did not show any particular problems with this file. The error refers to a specific read - could there be a problem for BBDuk to parse this read?
            It looks like the fastq file is corrupt and is missing a line, or something similar.

            Code:
            CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA
            +
            CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################
            @HISEQ10:C68T3ACXX:8:1101:19859:76022/1#ID Median_fold
            should be more like

            Code:
            @HISEQ10:C68T3ACXX:8:1101:19859:76022/1#ID Median_fold
            CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA
            +
            CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################
            Try looking at the file in that region, by piping zcat to grep around the term "@HISEQ10:C68T3ACXX:8:1101:19859:76022/1" with plus and minus 10 lines, then posting the output.

            Actually, "#ID Median_fold" should not be in there at all. It looks like you somehow got unrelated text mixed into your fastq file, which is what is corrupting it.

            Comment


            • Dear Brian, indeed there seems to be a "@HISEQ....." line missing in this region of the file, and I found that this has happened in at least one other part of the file as well. I guess I will check with the platform who generated this if they have another version of the file, though the md5sum checks out so I guess this sample might be lost if we can't trust it.

              Anyway... NO PROBLEM WITH YOUR EXCELLENT BBDuk! (sorry to have suggested that, and thanks again for your help).

              Code:
              @HISEQ10:C68T3ACXX:8:1101:20187:76019/1
              ATTTATTTATTTTATTATTTTTTTGAAACATTTTTACATGTTTATTTATCAAAAGTGAAAAAGCATACAATTCAACACAATTTCAATAGCATTTGACACCA
              +
              CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJFHIJJJJJIGGIJJHHHHGHEDFFDEEEEEEEEDCDEDDEEDDDBB
              @HISEQ10:C68T3ACXX:8:1101:20097:76043/1
              GGTTAGTATAGCTTAGTTAAACTTTCGTTTATTGCTAAAGGTTAATCACTGCTGTTTCCCGTGGGGGTGTGGCTAGGCTAAGCGTTTTGAGCTGCATTGCT
              +
              @@?DDFADFHHHHJIJHGIJJIJJIJIFHIJGJIAC,=?B01:19795:76016/1
              CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA
              +
              CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################
              @HISEQ10:C68T3ACXX:8:1101:19859:76022/1
              GTACATTTAACCCAGTTTAGTGGCAAGTTCTTTAGCCTTTGCCTTTTCGAGCTTGGCGATACGAGCCACAGACTTAGGACCCAGGACGTTGCCACCCCAGT
              +
              C@BFFFFFHHHHHJJHIJJIJIJIJJJHIIIIJIJIJJJJJJIIIJJJJIJIIJJJJJJIJIJHHHFFFEEEEEDCCCCDDDDDDDDDDDDDDDDDDDDD>
              @HISEQ10:C68T3ACXX:8:1101:19768:76027/1
              ACACGCTGGACTCCCCTGGCCATTTCTGGGTTATATAAATGTTTGATGAAAAATCTCTCAGGGGAAATCAAATTTTTGTAAGAAGTTAATGACAGCCCATC
              +
              @@CFFFFFHHHHHJJJJJJDIJJJJJJIIJDHIJIIFIIJJGHIJIJJGIFIBHJJJJJHIIJJGHFHHFFFFFF;CDEEEDDDDCCDEEDDDDDDDDDDC
              @HISEQ10:C68T3ACXX:8:1101:19877:76028/1
              AGCCATCCAACCACTCAGTCTTGGCAGTGCAGATGAAAAACTGGGAACCATTTGTGTTGGATCGGAAGAGCACACGACTGAACTCAAGTCACATCACGATC
              +
              ;;?DDDDDFFFAFD3C@:<+C<CBEEG@9CFFCAFGI9DD)?F<2)9?1D3??4).8)==C@)'5C1=263?;@@<',;8:@B(5(5>>:,>AABBB####
              (PS: the "#ID Median_fold" that you mentioned was not in my output)

              Comment


              • Filtering different contaminants with different k-mer sizes

                Dear Brian,

                I use BBDuk for filtering reads based on quality, adapter content, and chloroplast content. It is a great tool!

                For some applications, I also want to filter out reads that contain a restriction enzyme motif (a sequence of length 4-8).

                Currently, this leads to up to four BBduk jobs that are connected by a pipe:

                Code:
                bbduk.sh  in=in_R1.fastq in2=in_R2.fastq removeifeitherbad=t out=stdout.fq ref=restriction_motifs.fa k=4 rcomp=t maskmiddle=f qout=33  | bbduk.sh  in=stdin.fq int=t removeifeitherbad=t out=stdout.fq minlength=30 qout=33 ref=adapters.fa hammingdistance=2 k=20 mink=10 hammingdistance2=1 ktrim=n maxns=0 maxbadkmers=0 rcomp=t | bbduk.sh  in=stdin.fq int=t removeifeitherbad=t out=stdout.fq ref=chloroplast.fa k=31 rcomp=t qout=33  | bbduk.sh in=stdin.fq int=t out1=out_R1.fastq removeifeitherbad=t out2=out_R2.fastq minlength=30 qtrim=r trimq=15 minavgquality=15 qout=33
                My question is:
                Can I condense these four jobs into less (ideally one) jobs?

                I assume that if a FASTA entry in the file given with "ref=" is smaller than the k-mer size, it is effectively not used for the filtering.
                Therefore, the main problem that I see is to define different k-mer sizes for different references (e.g. 4 for the restriction motif, and 31 for the chloroplast).

                Thanks!

                Chris

                Comment


                • Hi Chris,

                  You can merge the last two stages:

                  Code:
                  bbduk.sh in=stdin.fq int=t out=out_R#.fastq removeifeitherbad=t minlength=30 qtrim=r trimq=15 minavgquality=15 qout=33 ref=chloroplast.fa k=31 rcomp=t
                  And, you don't actually need to set "qout=33 rcomp=t removeifeitherbad=t" since those are all defaults, so the last two stages shorten to:

                  Code:
                  bbduk.sh in=stdin.fq int=t out=out_R#.fastq minlength=30 qtrim=r trimq=15 minavgquality=15 ref=chloroplast.fa k=31
                  Unfortunately, you cannot combine stages 1, 2, and 3, as they each need different kmer lengths or hamming distances. So the best way to do it is in a pipeline. Your assumption about ignoring sequences shorter than kmer length is correct, but unfortunately it does not help in this situation.

                  Note that (depending on the environment) BBDuk will try to grab around half of available memory for each process, which may lead to unpredictable results. I suggest using the -Xmx flag in each stage to tell it how much memory to use (-Xmx1g should be fine in each case) whenever piping BBDuk.

                  The way you accomplished variable-kmer-length adapter-filtering via the "ktrim=n" and "maxns=0" combo was clever, by the way!

                  Comment


                  • Dear Brian,

                    thanks for your help!

                    Best,
                    Chris

                    Comment


                    • Some pairs breaking after multiple steps

                      I've been doing some multi-step trimming of paired libraries that has been causing a low frequency of pairs breaking, and I'm trying to figure out if there's a way to avoid having this happen.
                      These libraries have a structure where read1 contains exclusively sample barcoding information, and read2 contains the sequences that will eventually be aligned. Ideally, I would like to preserve every base of read1, but on read2, trim several bases from the 5' of the read, as well as do some kmer and quality trimming on read2.

                      my solution was to do this in two steps, first doing the 5' trim on just read2, then doing the kmer trim with the skipr1 flag active. (I found out the hard way that the skipr1 flag doesn't skip ftl operations)
                      ie
                      Code:
                      bbduk.sh in=read2.fq.gz ftl=6 out=read2_ftl6.fq.gz
                      bbduk.sh in1=read1.fq in2=read2_ftl6.fq.gz out=read#_trimmed.fq.gz literal=<adapter sequence> ktrim=r k=23 mink=11 hdist=1 skipr1=TRUE
                      However, even though there's the same number of reads before and after 5' trimming, and the same number of reads in the final trimmed reads, some of the pairing breaks at a low frequency (I haven't picked out all the broken pairs yet, but the head and tail of the file have the same read names, and the first broken pair is ~2000 reads into the files). Is there a way to run this sort of trimming on only read2 while keeping the pairs properly mated? My other alternative would be to do the trimming on just read2, and then write a custom python script to pull out the correct ordered reads in the read1 fastq.

                      Thanks

                      Comment


                      • BBDuk is multithreaded, and will reorder reads unless you explicitly tell it not to via the "ordered" flag. So in this case, you could do this:

                        Code:
                        bbduk.sh in=read2.fq.gz ftl=6 out=read2_ftl6.fq.gz [U][B]ordered minlength=1[/B][/U]
                        bbduk.sh in1=read1.fq in2=read2_ftl6.fq.gz out=read#_trimmed.fq.gz literal=<adapter sequence> ktrim=r k=23 mink=11 hdist=1 skipr1=TRUE
                        It does not matter whether the reads are re-ordered in the second pass, because pairs will be kept together; but when processing just one of the files alone, it's important to retain the original order, and not discard any reads (which is the reason for the minlength=1 flag; the default is 10).

                        It's possible to repair the pairing with "repair.sh" but of course it's always most efficient to just avoid the problem in the first place.

                        Comment


                        • Thanks!
                          I thought it was probably something simple I was leaving out. Adding in the "ordered minlength=1" to the first command preserved the pairs as checked with reformat.sh vpair=TRUE

                          Comment


                          • Hi, I'm trying to trim fasta reads w/ quals but get this error.

                            Thanks,

                            Pat

                            ~pminx/bin/bbmap/bbduk.sh in=Cat_BES_325_500.fasta qfin=Cat_BES_325_500.qual out=Cat_BES_325_500_trim.fa qfout=Cat_BES_325_500_trim.qual ftl=125 ftr=125

                            error:
                            java -Djava.library.path=/gscuser/pminx/bin/bbmap/jni/ -ea -Xmx991m -Xms991m -cp /gscuser/pminx/bin/bbmap/current/ jgi.BBDukF in=Cat_BES_325_500.fasta qfin=Cat_BES_325_500.qual out=Cat_BES_325_500_trim.fa qfout=Cat_BES_325_500_trim.qual ftl=125 ftr=125
                            Executing jgi.BBDukF [in=Cat_BES_325_500.fasta, qfin=Cat_BES_325_500.qual, out=Cat_BES_325_500_trim.fa, qfout=Cat_BES_325_500_trim.qual, ftl=125, ftr=125]
                            BBDuk version 36.30
                            Exception in thread "main" java.lang.RuntimeException: Unknown parameter qfout=Cat_BES_325_500_trim.qual
                            at jgi.BBDukF.<init>(BBDukF.java:421)
                            at jgi.BBDukF.main(BBDukF.java:66)

                            Comment


                            • Hi Pat,

                              My apologies, I somehow forgot to include support for qual file output with BBDuk! And, reformat does not support ftl/ftr. I will fix BBDuk's support for qual files in the next release. What you can do now is this (and apologies for it being tedious):

                              reformat.sh in=Cat_BES_325_500.fasta qfin=Cat_BES_325_500.qual out=x.fq
                              bbduk.sh in=x.fq out=y.fq ftl=125 ftr=125
                              reformat.sh in=y.fq out=Cat_BES_325_500_trim.fa qfout=Cat_BES_325_500_trim.qual


                              However, please note that "ftl=125 ftr=125" will trim everything to the left of position 125, and everything to the right of position 125, leaving you with only 1bp reads, which are not very useful. What are you trying to accomplish with this command?

                              Comment


                              • Thanks for the quick response. I'll give it a try.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 03-27-2024, 06:37 PM
                                0 responses
                                13 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-27-2024, 06:07 PM
                                0 responses
                                12 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                53 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                69 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X