SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
duplicate reads removal vasvale Bioinformatics 19 01-08-2015 12:59 AM
some basic questions about duplicate removal ? a_mt Bioinformatics 1 06-27-2014 09:14 AM
Trinity-Duplicate removal reema Bioinformatics 2 02-27-2014 01:49 AM
duplicate reads removal for amplicon libraries? ngseq Illumina/Solexa 0 12-10-2012 12:50 PM
threshold for duplicate removal? mard Bioinformatics 2 03-21-2010 03:45 PM

Reply
 
Thread Tools
Old 08-17-2014, 05:58 AM   #1
gab0
Member
 
Location: Talca, Chile

Join Date: Apr 2014
Posts: 11
Default duplicate read removal

Hello again:

I have a question regarding duplicate removal. Previously I had detected in my files (PE library, 2x100bp, hiseq 2000) some reads that match perfectly with the "Process Controls for TruSeq® Sample Preparation Kits"

It seems that these reads biased the Kmer content in a strange way (I posted before in http://seqanswers.com/forums/showthread.php?t=45539 in case you want to see the fastqc report).

But I don't know whether I should concern about removing these reads before the de novo assembly. Mapping against a genome reference is not an option as my reads almost don't map with the closest reference genome (which isn't so close).

I do think that these controls shouldn't affect the assembly, but I'm still not sure about what should be done.

So, is it worth to remove these reads? Do you do this step in your assembly pipelines?

Thank you very much

Gabriel
gab0 is offline   Reply With Quote
Old 08-17-2014, 08:08 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

You can quickly remove them with BBDuk, like this:

bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=control.fa stats=stats.txt

They may not affect the assembly much, but no reason to risk it. You'll have to remove them either before assembly or after assembly so you may as well remove them before. Also, that will let you run FastQC again on a cleaner dataset so you can look for other potential problems in the real data. The "stats.txt" output is optional but will let you see what fraction of reads hit which artifact sequence.

P.S. The BBTools package also includes a duplicate read removal tool, which works on paired reads, though it requires a lot of memory (~1kb per read). It deduplicates based on sequence, not mapping (so it does not require a reference), and requires both read 1 and read 2 to be equal to the read 1 and read 2 of the other pair. It's insanely fast.

dedupe.sh in1=read1.fq in2=read2.fq out1=deduped1.fq out2=deduped2.fq ac=f

Last edited by Brian Bushnell; 08-17-2014 at 08:16 AM.
Brian Bushnell is offline   Reply With Quote
Old 08-17-2014, 12:17 PM   #3
gab0
Member
 
Location: Talca, Chile

Join Date: Apr 2014
Posts: 11
Default

Quote:
Originally Posted by Brian Bushnell View Post
You can quickly remove them with BBDuk, like this:

bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq ref=control.fa stats=stats.txt

They may not affect the assembly much, but no reason to risk it. You'll have to remove them either before assembly or after assembly so you may as well remove them before. Also, that will let you run FastQC again on a cleaner dataset so you can look for other potential problems in the real data. The "stats.txt" output is optional but will let you see what fraction of reads hit which artifact sequence.

P.S. The BBTools package also includes a duplicate read removal tool, which works on paired reads, though it requires a lot of memory (~1kb per read). It deduplicates based on sequence, not mapping (so it does not require a reference), and requires both read 1 and read 2 to be equal to the read 1 and read 2 of the other pair. It's insanely fast.

dedupe.sh in1=read1.fq in2=read2.fq out1=deduped1.fq out2=deduped2.fq ac=f
Hi Brian:

Thanks a lot for your post. Turns out that I had installed in my computer your package, and indeed I noticed the duplicate read removal tool, but didn't want to go that way (or use it as last resort) as it will eliminate true read duplicates, and I'm interested in removing only the control sequences.

As my reads are of 101bp, and the control sequences are longer that 101 bp (150bp, 250bp, and so on), is it okay to give the full lenght control sequences in the fasta file (control.fa)? or should I reduce the lenght to 101bp?

Before posting for help, I tried BBDuk but as I didn't know exactly how to use it, I used trimming parameters (wrong move, d'ouh :P).

This was the line that I executed previously.
bbduk.sh in1=(file) in2=(file) out1=(file) out2=(file) ref=adaptadores.fa ktrim=r minlength=20
bbduk.sh in1=(file) in2=(file) out1=(file) out2=(file) ref=adaptadores.fa ktrim=r minlength=20 k=80

I'll try again with the exact command line that you suggested.

Thanks again Brian!!

Last edited by gab0; 08-17-2014 at 12:36 PM.
gab0 is offline   Reply With Quote
Old 08-17-2014, 12:49 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by gab0 View Post
Before posting for help, I tried BBDuk but as I didn't know that the control sequences were included in the package database, I gave a file with substrings of the control sequence. Also, when I tested it, I used trimming parameters (wrong move, d'ouh :P).
The BBTools distribution only includes the trueseq adapter sequences, which should generally be right-trimmed (since they occur at the 5' tail of reads with insert size shorter than read length, but sequence prior to that should be valid); other non-genomic sequences should be filtered out entirely, but I don't include a file containing those.

As for "k=80" in your command line... BBDuk has a max kmer length of 31. If you tell it to use a longer kmer length it will emulate this by requiring consecutive kmers; for example, k=80 would require 49 consecutive matching 31-mers, which in practice gives almost identical results to using true 80-mers but is not quite the same.

Most importantly, "k=N" will not filter sequences shorter than N bp. Since adapters and primers are generally shorter than 80 bp, it's not useful for removing those. The information content of a random 31bp sequence is 62 bits, and thus has a 1/4,611,686,018,427,387,904 chance of matching another random 31-mer. The reason I included the ability in BBDuk to handle k>31 is purely for homologous genes, low-complexity sequences, and highly-conserved sequences like ribosomes; the chances of a 31-mer from an artificial sequence accidentally matching a real organism is negligible. So, I recommend using k=31 or less when filtering/trimming synthetic compounds (typically, I use k=25 for adapter-trimming and k=27 for filtering, because some synthetic compounds are shorter than 31bp); I only use k>31 when trying to distinguish between different organisms.

As for read length - since BBDuk operates on 31-mers (or less), it's best to include the entire sequence in the reference, no matter how long. As an aside, since you are using 101-bp reads, I advise you to trim the last 1bp before using them, as it will have a very high error rate (up to 7% in my tests). You can do this with BBDuk using the "ftr=99" flag (which means force-trim-right after 99bp, using 0-based coordinates, so only the last 1bp gets trimmed).

Last edited by Brian Bushnell; 08-17-2014 at 12:51 PM.
Brian Bushnell is offline   Reply With Quote
Old 08-19-2014, 04:46 AM   #5
gab0
Member
 
Location: Talca, Chile

Join Date: Apr 2014
Posts: 11
Default

Hi Brian

Your suggestion worked as a charm, now I have my files without these control sequences and FastQC does not show any Kmer warning at all. I ran it exactly as you suggested, by using a fasta file with the whole length of the control sequences.

Thanks for your help!!
gab0 is offline   Reply With Quote
Old 08-19-2014, 06:51 AM   #6
ronton
Member
 
Location: US

Join Date: Jun 2014
Posts: 34
Default

Quote:
Originally Posted by Brian Bushnell View Post
other non-genomic sequences should be filtered out entirely, but I don't include a file containing those.
I've seen this suggestion before; where to find or how to generate such a file?
ronton is offline   Reply With Quote
Old 09-29-2014, 04:49 PM   #7
ssully
Member
 
Location: NYC

Join Date: Aug 2010
Posts: 48
Default

I installed BBmap on my Windows 7 machine (Java RE 1.7 is installed). But I can't seem to get dedup to read any of my input files. They're standard paired fastq files I named test1.f1 and test2.fq. I put them in the bbmaps directory.

here's the command
dedup.sh in1=testq1.fq in2-test2.fq out1=dedup1.fq out2=dedup2.fq ac=f

here is the output:
Code:
java -Djava.library.path=/c/Users/Steven/bbmap/jni/ -ea -Xmx1310m -Xms1310m -cp /c/Users/Steven/bbmap/current/ jgi.Dedupe in1=test1.fq in2=test2.fq  out1=dd1.fq out2=dd2.fq ac=f

Executing jgi.Dedupe [in1=test1.fq, in2=test2.fq, out1=dedup1.fq, out2=dedup2.fq, ac=f]

Exception in thread "main" java.lang.RuntimeException: Unknown parameter in1=test1.fq

at jgi.Dedupe.<init>(Dedupe.java:342)
at jgi.Dedupe.main(Dedupe.java:56)
using -in1=test1.fq etc, or in=test#.fq etc was no better.

Last edited by ssully; 09-29-2014 at 04:59 PM.
ssully is offline   Reply With Quote
Old 09-29-2014, 06:12 PM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

ssully,

I forgot to add the "in1" and "in2" flags to Dedupe's parser, since we use interleaved files for everything. I'll fix that tomorrow. For now, please do this:

java -ea -Xmx200m -cp C:\Users\Steven\bbmap\current\ jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq

...then use the resulting interleaved file for Dedupe:

java -da -Xmx1310m -Xms1310m -cp C:\Users\Steven\bbmap\current\ jgi.Dedupe in=interleaved.fq out=dd.fq ac=f interleaved=t

Then you can deinterleave it with Reformat, if you wish:

java -ea -Xmx200m -cp C:\Users\Steven\bbmap\current\ jgi.ReformatReads in=dd.fq out1=dd1.fq out2=dd2.fq

Sorry for the inconvenience!

-Brian

Last edited by Brian Bushnell; 09-30-2014 at 09:58 AM.
Brian Bushnell is offline   Reply With Quote
Old 09-30-2014, 09:08 AM   #9
ssully
Member
 
Location: NYC

Join Date: Aug 2010
Posts: 48
Default

tried it but 'java' was not recognized as a command. I added the path to 'my' java.exe
and reran, but then it threw this error
Code:
Error: Could not find or load main class jgi.ReformatReads
There is a 'ReformatReads.class' under
java -ea -Xmx200m -cp /c/Users/Steven/bbmap/current/jgi/
but that path doesn't work either.
Code:
java -ea -Xmx200m -cp /c/Users/Steven/bbmap/current/jgi/ jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq
Error: Could not find or load main class jgi.ReformatReads

as a last resort this command (removing the space)
java -ea -Xmx200m -cp /c/Users/Steven/bbmap/current/jgi/jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq

returns

Code:
Error: Could not find or load main class in1=test1.fq
Obviously I'm not familiar with java syntax so any help is appreciated.

Last edited by ssully; 09-30-2014 at 09:11 AM.
ssully is offline   Reply With Quote
Old 09-30-2014, 09:18 AM   #10
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I updated my prior post. The command should be:

java -ea -Xmx200m -cp C:\Users\Steven\bbmap\current\ jgi.ReformatReads in1=test1.fq in2=test2.fq out=interleaved.fq

Note that Dedupe uses a lot of memory, ~1kb per read. How big are your input files?

Last edited by Brian Bushnell; 09-30-2014 at 09:59 AM.
Brian Bushnell is offline   Reply With Quote
Old 09-30-2014, 01:35 PM   #11
ssully
Member
 
Location: NYC

Join Date: Aug 2010
Posts: 48
Default

That works! Thanks.
ssully is offline   Reply With Quote
Old 10-01-2014, 11:17 AM   #12
ssully
Member
 
Location: NYC

Join Date: Aug 2010
Posts: 48
Default

Hmm..the interleaved file is 140 million reads. I don't have 140 Gb of RAM. Guess I have to run this on a unix cluster. Is the corrected package (for running the dedup command) ready for download?

Last edited by ssully; 10-01-2014 at 11:22 AM.
ssully is offline   Reply With Quote
Old 10-01-2014, 11:24 AM   #13
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

In that case... you will need to deduplicate using a sorting-based approach, such as mapping to a reference, sorting by position, then marking duplicates with a tool like Picard. Without a reference, you can still sort alphabetically, but I don't know of any tools that do that (for paired reads) except one I wrote a while ago that is deprecated. Probably, someone else will have a suggestion.
Brian Bushnell is offline   Reply With Quote
Old 10-02-2014, 01:18 PM   #14
ssully
Member
 
Location: NYC

Join Date: Aug 2010
Posts: 48
Default

When I wrote 'I don't have 140 Gb of RAM' I meant on my workstation (it has 32Gb). But I do have access to a cluster where I can allocate >140 gb to a job . Should I try that or are you suggesting I shouldn't bother?

Last edited by ssully; 10-02-2014 at 02:04 PM.
ssully is offline   Reply With Quote
Old 10-02-2014, 03:20 PM   #15
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

The latest version (33.54, uploaded yesterday) allows dual input files for Dedupe, though it will still only output a single interleaved file. So on your high-memory cluster, you can do this:

java -ea -Xmx140g -Xms140g -cp /path/ jgi.Dedupe in1=read1.fq in2=read2.fq out=deduped.fq ac=f

You will still need to de-interleave the output with Reformat, if you need 2 output files:

java -ea -Xmx200m -cp /path/ jgi.ReformatReads in=deduped.fq out1=dd1.fq out2=dd2.fq

Whether or not you SHOULD deduplicate the data depends on your experiment. What are you doing? It's often a good idea if you expect lots of PCR duplicates, but not necessarily otherwise.

Last edited by Brian Bushnell; 01-12-2015 at 08:49 AM.
Brian Bushnell is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:05 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO