Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Thanks for the fast response. I will try reformat.

    Comment


    • Hi,

      I got the idea to use BBDuk as a tool to filter out kmers that are shared between two samples. Is this a good idea? It's metagenomics, and the concept is that one sample is representing the background composition of bacteria, and I want to remove that background composition from another sample. One challenge to the problem is that the samples are rather big, about 2x35 GB compressed fastq (paired end), each containing about 1 billion reads total (both read pairs combined).

      Comment


      • Originally posted by boulund View Post
        Hi,

        I got the idea to use BBDuk as a tool to filter out kmers that are shared between two samples. Is this a good idea? It's metagenomics, and the concept is that one sample is representing the background composition of bacteria, and I want to remove that background composition from another sample. One challenge to the problem is that the samples are rather big, about 2x35 GB compressed fastq (paired end), each containing about 1 billion reads total (both read pairs combined).
        You can find kmers that are shared between two samples like this (bearing in mind that it may take a lot of memory):

        kcompress.sh in=a.fq.gz out=a_kmers.fa.gz
        kcompress.sh in=b.fq.gz out=b_kmers.fa.gz
        kcompress.sh in=a_kmers.fa.gz,b_kmers.fa.gz out=shared_kmers.fa.gz mincount=2

        However, I think I'd probably tend to just assemble what you consider to be the background, and then map reads to the assembly requiring fairly high identity, keeping the reads that don't map. Either approach works (and also has disadvantages) but whole reads tend to be more specific than kmers.

        Originally posted by EssigSchurke View Post
        Thanks for the fast response. I will try reformat.
        BBDuk is fixed now, by the way

        Comment


        • "just assemble what you consider to be the background"

          Assemble as?

          Comment


          • To clarify, I meant running Megahit (for example) on the reads of the sample considered to be the background, then mapping the reads of the other sample to that assembly.

            Comment


            • Ok, that's interesting. Our current approach was just that; assembly of the background sample with Megahit, and then mapping the sample to be filtered against the background assembly to remove anything that matches. I was hoping it'd be possible to do it without the massive overhead of assembling the background sample, as that's fairly time consuming and memory hungry for these large samples. I will have to evaluate the different approaches against each other to see which one fits our setup the best. Thanks for your input, and thanks for pointing me to kcompress.sh!

              Comment


              • Different output with interleaved input

                Hello again-

                I think there is an inconsistent behavior in how bbduk handles interleaved input depending on whether the interleaved option is set to "true" or "auto".

                Consider the case where only one read in a pair is discarded because too short.
                With "interleaved=auto" we get in (interleaved) output only the read passing the filter, thus appearing as a single-end read. With "interleaved=true" both reads are discarded.

                Is this difference intentional? In my opinion, interleaved=auto does the right thing in discarding only the bad read and keeping the other. However, this creates an interleaved output with single-end reads (which are actually paired-end but with the mate gone) intercalated in pair-end ones. I'm not sure if the interleaved format has ever been defined to allow such a case (for the record, bwa seems to handle it correctly).

                I just thought useful to point this out...

                This should reproduce the issue wih BBDuk version 37.54:

                Code:
                bbduk.sh in=int2.fq out=stdout.fq qtrim=rl minlength=35 trimq=15 interleaved=auto
                bbduk.sh in=int2.fq out=stdout.fq qtrim=rl minlength=35 trimq=15 interleaved=true
                With int2.fq being:

                Code:
                @r1
                ATGGCATGCACCTGTAATCCCGCTACTTGTGAGGCTGAAGCAGGAGAAT
                +
                JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
                @r1
                TAATTATATGTTTAAGTAAATGAGTAAAATTCAAGATTGCTATCGGATT
                +
                JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
                @s1
                ATGGCATGCACCTGTAATCCCGCTACTTGTGAGGCTGAAGCAGGAGAAT
                +
                AAFFFJJJJJFJFJJJJJFAFJJJFJ7<FAJJJJJJJJAAFJJJJJJJJ
                @s1
                TAATTATATGTTTAAGTAAATGAGTAAAATTCAAGATTGCTATCGGATT
                +
                #################################################

                Comment


                • @Dario: Have you tried this with a file that has the (/1 of old Illumina style of 1:N:0 of new read headers)? Interleaved files may need those identifiers to be there. They can be added by reformat.sh.

                  Comment


                  • Originally posted by GenoMax View Post
                    @Dario: Have you tried this with a file that has the (/1 of old Illumina style of 1:N:0 of new read headers)? Interleaved files may need those identifiers to be there. They can be added by reformat.sh.
                    Hi Genomax- These are R1 and R2 real identifiers:

                    Code:
                    @E00295:75:H7LLTALXX:8:1101:4553:1643 1:N:0:8
                    @E00295:75:H7LLTALXX:8:1101:4553:1643 2:N:0:8
                    I think one issue is that the interleaved format has never been formally defined so it's not clear which of option of interleaved, auto or true, is doing the right thing.

                    By the way, with interleaved=auto bbduk gives to stderr the message:

                    Code:
                    Input is being processed as unpaired
                    If I append /1 and /2 to the read names then I get the same output as with interleaved=true (both reads discarded) and the message becomes:

                    Code:
                    Input is being processed as paired
                    so you are right /1 and /2 must be there for interleaved=auto to work as expected. Maybe it would be useful to make this behavior explicit in the documentation (unless I missed it)?

                    Having said that, I think discarding both reads in a pair when only one read fails is unnecessary.

                    Comment


                    • Having said that, I think discarding both reads in a pair when only one read fails is unnecessary.
                      Not having reads in sync causes issues with mapping with most aligners (you note that bwa is an exception). I assume only case where R2 would be eliminated (while R1 remains) would be if you were trimming based on quality. If R2 contains adapter contamination then likely the entire insert is small and may result in elimination of R1 as well. I generally never have to trim data based on quality so have not run into this issue so far.

                      Comment


                      • Hi!

                        Is it possible to get the various quality histograms for both before and after e.g. trimming with BBDuk in a single run, or do I need to run BBDuk it twice to produce metrics for before and after trimming?
                        That is; run once without any trimming, just outputting histograms, and then again to trim and output histograms? Or am I missing something? The histograms output by BBDuk normally show metrics after trimming/contaminant removal, right?

                        By the way, I might mention that I finally tried to assemble my very large background sample using Tadpole, and then align my primary sample to that to remove 'background/contamination' reads. It produced a fairly poor assembly overall, but at least it ran to completion on the 500GB background sample on my memory constrained machine (64GB). The kmer-based approach was just too memory consuming.
                        Last edited by boulund; 10-25-2017, 10:54 PM.

                        Comment


                        • Seal not printing outu file

                          Hi - I'm running seal to map reads to ref genomes. this is the command I ran:

                          Code:
                          seal in="${samplename}_nonribo.fq.gz" ref="${all4genomes}" pattern="${samplename}_out_%.fq.gz" outu="${samplename}_unmapped.fq.gz" ambig=all stats="${samplename}_mapstats.txt"
                          It's outputting the matched reads per reference sequence using the 'pattern', but it's not making the 'outu' file of unmapped reads (and there should be some unmapped reads, according to my stderr output).

                          Is there another trick to making this file?

                          Thanks,
                          MC

                          Comment


                          • Hi I would like to use bbduk to filter the reads that map to a genome. I have 48 pair samples and want to make sure I understand how to input the file.
                            My sample are are Pair end and are labeled as F and R (plus _1 and _2).
                            Should I put them all after in= ? or use in2= for the reverse? Does interleave means that I put each pair together?
                            eg. in= S1_F_paired_1.fq,S1_R_paired_2.fq,S10_F_paired_1.fq,S10_R_paired_2.fq...

                            I have them as two separate lines at the moment:
                            S1_F_paired_1.fq,S10_F_paired_1.fq,S11_F_paired_1.fq,S12_F_paired_1.fq,S13_F_paired_1.fq..
                            S1_R_paired_2.fq,S10_R_paired_2.fq,S11_R_paired_2.fq,S12_R_paired_2.fq,S13_R_paired_2.fq..

                            Thanks,

                            Catalina

                            Comment


                            • java error when trying to filter on entropy

                              Hi,

                              I am getting an error when trying to use BBDuk to filter based on entropy. I have previously filtered the dataset for phix and for adapters with no issues before so I'm a bit confused as to why it won't work now.

                              I'm working on a node with 12 Gb RAM so the 8 Gb called shouldn't be an issue. I got a similar error without the -Xmx flag.

                              Running CentOS Linux release 7.3.1611
                              java version "1.7.0_131"

                              My fq file contains ~135 million 100 bp unpaired sequences.

                              command and error logs as follows:

                              Code:
                              $ bbduk.sh -Xmx8g in=seq.fq out=seq_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
                              java -Djava.library.path=/apps/chpc/bio/bbmap/jni/ -ea -Xmx8g -Xms8g -cp /apps/chpc/bio/bbmap/current/ jgi.BBDukF -Xmx8g in=fish-coral_1_filtered_clean.fq out=fish-coral_1_filtered_clean_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
                              Executing jgi.BBDukF [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]
                              Version 37.90 [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]
                              
                              Initial:
                              Memory: max=8232m, free=8061m, used=171m
                              
                              Input is being processed as unpaired
                              Started output streams: 0.038 seconds.
                              Exception in thread "Thread-6" java.lang.ArrayIndexOutOfBoundsException: 39
                                      at structures.EntropyTracker.averageEntropy(EntropyTracker.java:302)
                                      at structures.EntropyTracker.passes(EntropyTracker.java:348)
                                      at jgi.BBDukF$ProcessThread.run(BBDukF.java:2583)
                              Exception in thread "Thread-28" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-25" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-8" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-17" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-9" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-24" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-13" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-15" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-23" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-11" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-14" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-7" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-10" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-29" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-20" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-22" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-16" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-26" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-19" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-12" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-18" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-27" java.lang.ArrayIndexOutOfBoundsException
                              Exception in thread "Thread-21" java.lang.ArrayIndexOutOfBoundsException
                              Processing time:                0.192 seconds.
                              
                              Input:                          34841 reads             3436691 bases.
                              Low entropy discards:           2157 reads (6.19%)      215168 bases (6.26%)
                              Total Removed:                  2181 reads (6.26%)      216121 bases (6.29%)
                              Result:                         32660 reads (93.74%)    3220570 bases (93.71%)
                              
                              Time:                           0.255 seconds.
                              Reads Processed:       34841    136.55k reads/sec
                              Bases Processed:       3436k    13.47m bases/sec
                              Any suggestions?

                              Thank you.

                              Comment


                              • Entropy filtering: Java ArrayIndexOutOfBoundsException

                                Hi,

                                I'm having an issue while trying to filter a fastq file using an entropy filter. The library protocol used ribozero so there are a lot of poly T sequences that I would like to remove.

                                I have successfully removed adapter and phiX contamination from the file but when I try the entropy filter (with various -Xmx settings or none) I get a java array error.

                                There are ~137 million 100 bp unpaired reads in the fastq file and they have been filtered for adapters, low quality and phiX (using BBDuk).

                                I'm working on a node with 24 cores and 128 GiB of RAM running CentOS Linux release 7.3.1611 and java version "1.7.0_131".

                                Command and error messages follow:

                                $ bbduk.sh -Xmx8g in=seq.fq out=seq_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
                                java -Djava.library.path=/apps/chpc/bio/bbmap/jni/ -ea -Xmx8g -Xms8g -cp /apps/chpc/bio/bbmap/current/ jgi.BBDukF -Xmx8g in=fish-coral_1_filtered_clean.fq out=fish-coral_1_filtered_clean_0-1-entrop-filtered.fq outm=low_complexity-0-1.fq entropy=0.1
                                Executing jgi.BBDukF [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]
                                Version 37.90 [-Xmx8g, in=seq.fq, out=seq_0-1-entrop-filtered.fq, outm=low_complexity-0-1.fq, entropy=0.1]

                                Initial:
                                Memory: max=8232m, free=8061m, used=171m

                                Input is being processed as unpaired
                                Started output streams: 0.038 seconds.
                                Exception in thread "Thread-6" java.lang.ArrayIndexOutOfBoundsException: 39
                                at structures.EntropyTracker.averageEntropy(EntropyTracker.java:302)
                                at structures.EntropyTracker.passes(EntropyTracker.java:348)
                                at jgi.BBDukF$ProcessThread.run(BBDukF.java:2583)
                                Exception in thread "Thread-28" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-25" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-8" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-17" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-9" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-24" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-13" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-15" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-23" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-11" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-14" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-7" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-10" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-29" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-20" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-22" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-16" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-26" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-19" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-12" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-18" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-27" java.lang.ArrayIndexOutOfBoundsException
                                Exception in thread "Thread-21" java.lang.ArrayIndexOutOfBoundsException
                                Processing time: 0.192 seconds.

                                Input: 34841 reads 3436691 bases.
                                Low entropy discards: 2157 reads (6.19%) 215168 bases (6.26%)
                                Total Removed: 2181 reads (6.26%) 216121 bases (6.29%)
                                Result: 32660 reads (93.74%) 3220570 bases (93.71%)

                                Time: 0.255 seconds.
                                Reads Processed: 34841 136.55k reads/sec
                                Bases Processed: 3436k 13.47m bases/sec


                                Any suggestions?

                                Thanks.
                                Dave

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM
                                • seqadmin
                                  The Impact of AI in Genomic Medicine
                                  by seqadmin



                                  Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                  02-26-2024, 02:07 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 03-14-2024, 06:13 AM
                                0 responses
                                34 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-08-2024, 08:03 AM
                                0 responses
                                72 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-07-2024, 08:13 AM
                                0 responses
                                81 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-06-2024, 09:51 AM
                                0 responses
                                68 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X