SEQanswers

Old 10-17-2016, 02:22 PM   #1
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65
Removing duplicate fastq entries from concatenated files

I have concatenated two fastq files and I'm pretty certain I have quite a few duplicates. Is there a script, a program (something in BBMap?), or a common way to remove duplicates based on the sequence identifier (as opposed to a k-mer- or sequence-based method, since I want to retain all unique fragments at this point)? Any assistance would be most appreciated.
Old 10-17-2016, 02:56 PM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

Why would there be duplicates if the files came from two different lanes/flowcells?
Old 10-18-2016, 04:50 AM   #3
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65

They were not. They were two different selections from the same sets of flow cells: one was a selection of fragments from low-copy regions of the genome based on k-mer counts, and the other a selection of genomic fragments that mapped to transcribed sequences. Thus I am expecting a fair number of fragments common to both selections.
Old 10-18-2016, 05:15 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

Quote:
Originally Posted by horvathdp
Is there a script, a program (something in BBMap?), or a common way to remove duplicates based on the sequence identifier (as opposed to a k-mer- or sequence-based method, since I want to retain all unique fragments at this point)?
You had originally asked about removal "based on sequence identifiers", but it sounds like you are just looking to de-duplicate the actual fastq reads.

dedupe.sh from BBMap is what you need. Depending on the size of your sequence file, be ready to allocate an adequate amount of RAM to the process.
Old 10-18-2016, 05:23 AM   #5
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65

If I ran dedupe, wouldn't that eliminate all duplicated k-mers, not just duplicated fragments? I want to assemble the resulting file, and I worry that normalizing the k-mer counts to no greater than 1 would not give the best file for assembly. Am I wrong in this thinking? I toyed with the idea of just normalizing to 20 (which I intend to do at the end anyway), but figured that might leave cases where I still have more duplicate sequences than necessary.
Old 10-18-2016, 05:35 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

Dedupe can do the following:

Quote:
Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity.
You can specify that only exact matches (over the full length) be eliminated. I assume that is what you want?
Old 10-18-2016, 06:32 AM   #7
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65

Possibly? If that works, then why do people not just use these essentially 1X files for assembly? I normally see 20X or 30X coverage for assemblies. All that said, do you know of a way to just eliminate duplicate entries in a fastq file based on identifiers rather than sequence?
Old 10-18-2016, 06:48 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

Quote:
Originally Posted by horvathdp
Possibly? If that works, then why do people not just use these essentially 1X files for assembly? I normally see 20X or 30X coverage for assemblies. All that said, do you know of a way to just eliminate duplicate entries in a fastq file based on identifiers rather than sequence?
I am not sure what you are referring to here.

If one were certain to have every part of the starting material covered (e.g. if we had a theoretical sequencer that started at one end of the chromosome and read through its entire length), then 1x sequencing would be enough. By using 30x you are ensuring that all sequenceable areas are sampled and represented in your data.

In theory there can be no duplicate entries as far as sequence identifiers go (if you are referring to fastq headers). You would need to cat the same file twice to create them.
Old 10-18-2016, 07:02 AM   #9
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65

Ahhh!!! I might have just found the answer to my own question:
./dedupe.sh in=concat1.merged out=depuded_concat.merged rmn=t
Setting rmn=t (requirematchingnames; the default is rmn=f) requires both sequence and identifier to be identical.

Source: https://github.com/BioInfoTools/BBMa...r/sh/dedupe.sh
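
The matching rule described above (a read counts as a duplicate only when both its name and its sequence match an earlier read) can be illustrated with a small awk filter. This is only a sketch of the rule, not how dedupe.sh works internally. The function name keep_first_name_and_seq and the demo reads are hypothetical, comparing the full header line is an assumption, and the filter keeps every name+sequence key in memory, so it may not suit very large files.

```shell
#!/bin/sh
# Keep the first record for each (header, sequence) pair.
# Assumes strict 4-line FASTQ records; compares the full header line.
keep_first_name_and_seq() {
    awk '
        NR % 4 == 1 { header = $0; next }
        NR % 4 == 2 {
            key = header "\t" $0
            keep = !(key in seen)      # first time we see this pair?
            seen[key] = 1
            if (keep) { print header; print $0 }
            next
        }
        keep { print }                 # "+" line and quality line
    ' "$1"
}

# Hypothetical demo data: record 2 repeats record 1 exactly,
# record 3 shares the name only, record 4 shares the sequence only.
demo_dir=$(mktemp -d)
cat > "$demo_dir/demo.fastq" <<'EOF'
@r1 1:N:0:ATCACG
ACGT
+
IIII
@r1 1:N:0:ATCACG
ACGT
+
IIII
@r1 1:N:0:ATCACG
TTTT
+
IIII
@r2 1:N:0:ATCACG
ACGT
+
IIII
EOF

keep_first_name_and_seq "$demo_dir/demo.fastq"
```

Only the exact repeat (record 2) is dropped; reads that share just a name or just a sequence are kept.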
Old 10-18-2016, 07:10 AM   #10
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

I think I understand this finally ...

The original fastq dataset was sampled in two different ways, and you want to eliminate duplicates that may have been selected in both datasets, leaving only one copy in the final combined file. And yes, the dedupe solution you discovered will work for that.

Last edited by GenoMax; 10-18-2016 at 07:27 AM.
Old 10-18-2016, 07:17 AM   #11
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65

In answer to your question above: in the cases where the two selection protocols identified the same fragment, I did essentially concatenate the same file twice. Hence my desire to remove the duplicates. Do you follow?
Old 10-18-2016, 07:36 AM   #12
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65

Yes! Thanks
Old 10-21-2016, 09:18 AM   #13
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65

No! Sadly, dedupe.sh uses too much memory on my 250 GB machine for me to run it on my file of more than 800 million fragments.

Any other ideas that might just sort and remove the duplicates by sequence name?
Old 10-21-2016, 09:23 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,978

How many sequences do you expect are duplicated? You could identify them (sort | uniq -d) after just pulling the headers out (grep "^@YOUR_SEQ_ID") and then remove one of the copies.
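
A minimal sketch of that header check, assuming strict 4-line FASTQ records. It takes every fourth line as the header instead of grepping for "^@", because a quality line can also begin with "@". The function name dup_ids, the demo reads, and the choice of the first two space-separated header fields as the comparison key are assumptions made for illustration.

```shell
#!/bin/sh
# Print each header key that occurs more than once in a FASTQ file.
# Key = first two whitespace-separated header fields, e.g.
# "@HWI-...:1631:2117 1:N:0:ATCACG" (an assumption about the header layout).
dup_ids() {
    awk 'NR % 4 == 1 { print $1 " " $2 }' "$1" | LC_ALL=C sort | uniq -d
}

# Hypothetical demo data: the first read appears twice, and the second
# read has a quality line that starts with "@".
dup_dir=$(mktemp -d)
cat > "$dup_dir/demo.fastq" <<'EOF'
@r1 1:N:0:ATCACG
ACGT
+
IIII
@r2 1:N:0:ATCACG
GGCC
+
@III
@r1 1:N:0:ATCACG
ACGT
+
IIII
EOF

dup_ids "$dup_dir/demo.fastq"
```

Piping the result through wc -l gives a quick count of duplicated names before deciding how to remove them.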
Old 10-21-2016, 09:29 AM   #15
horvathdp
Member
 
Location: Fargo

Join Date: Dec 2011
Posts: 65

I am playing with a Unix-based option right now. I made a short test list of names in a file, test.txt:

@HWI-D00653:49:H2FF5BCXX:2:1101:1631:2117 1:N:0:ATCACG
@HWI-D00653:49:H2FF5BCXX:2:1101:1631:2117 2:N:0:ATCACG trim=1
@HWI-D00653:49:H2FF5BCXX:2:1101:1804:2196 1:N:0:ATCACG
@HWI-D00653:49:H2FF5BCXX:2:1101:2187:2119 1:N:0:ATCACG

and am trying to see if I can regenerate a fastq file from it using the command
grep --no-group-separator -A 3 -F -f test.txt concated.fastq > out.fastq

If it works, I can generate a unique list using grep | sort | uniq.

My guess is this could take a while, though, as the run with test.txt has been going for the last 5 minutes.
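
One sort-based way to drop repeated names without holding the whole file in memory could look like the sketch below. It assumes strict 4-line records with no tab characters in any line, and it uses the first two space-separated header fields as the key. The function name dedupe_by_header and the demo reads are hypothetical. The output comes back ordered by key, not in the original file order.

```shell
#!/bin/sh
# Remove FASTQ records whose header key has already been seen, using sort.
# usage: dedupe_by_header in.fastq > out.fastq
dedupe_by_header() {
    tab=$(printf '\t')
    paste - - - - < "$1" |                       # one record per line
      awk -F '\t' 'BEGIN { OFS = "\t" }
                   { split($1, h, " "); print h[1] " " h[2], $0 }' |
      LC_ALL=C sort -t "$tab" -k1,1 -u |         # one line per key
      cut -f 2- |                                # drop the key column
      tr '\t' '\n'                               # back to 4-line records
}

# Hypothetical demo data: the first read is present twice.
sort_dir=$(mktemp -d)
cat > "$sort_dir/demo.fastq" <<'EOF'
@r1 1:N:0:ATCACG
ACGT
+
IIII
@r2 1:N:0:ATCACG
GGCC
+
@III
@r1 1:N:0:ATCACG
ACGT
+
IIII
EOF

dedupe_by_header "$sort_dir/demo.fastq"
```

Because sort spills to temporary files on disk, memory use stays bounded; pointing sort at a large scratch directory with -T may help on a file of this size.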
Old 07-05-2019, 12:20 PM   #16
davstern
Junior Member
 
Location: Leesburg, VA

Join Date: Feb 2012
Posts: 1

seqkit rmdup does this in a flash

https://bioinf.shenwei.me/seqkit/usage/
All times are GMT -8.