I have concatenated two fastq files and I'm pretty certain I have quite a few duplicates. Is there a script, program (something in BBMap?), or common way to remove duplicates based on the sequence identifier (as opposed to a k-mer- or sequence-based method, since I want to retain all unique fragments at this point)? Any assistance would be most appreciated.
-
They were not. They were two different selections from the same sets of flow cells: one was a selection of fragments from low-copy regions of the genome based on k-mer counts, and the other was a selection of genomic fragments that mapped to transcribed sequences. Thus I am expecting a fair number of common fragments in both selections.
-
Originally posted by horvathdp: Is there a script, program (something in BBMap?), or common way to remove duplicates based on the sequence identifier (as opposed to a k-mer- or sequence-based method, since I want to retain all unique fragments at this point)?
dedupe.sh from BBMap is what you need. Depending on the size of your sequence file, be ready to allocate an adequate amount of RAM to the process.
-
If I ran dedupe, wouldn't that eliminate all duplicated k-mers, not just duplicated fragments? I want to assemble the resulting file, and I worry that a file with k-mer counts normalized to no greater than 1 would not be the best input for assembly. Am I wrong in this thinking? I toyed with the idea of just normalizing to 20 (which I intend to do at the end anyway), but figured that might leave cases where I still have more duplicate sequences than necessary.
-
Since dedupe can do the following:
"Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity."
-
Possibly? If that works, then why do people not just use these essentially 1X files for assembly? I normally see 20X or 30X coverage for assemblies. That said, do you know of a way to just eliminate duplicate entries in a fastq file based on identifiers rather than sequence?
-
Originally posted by horvathdp: Possibly? If that works, then why do people not just use these essentially 1X files for assembly? I normally see 20X or 30X coverage for assemblies. That said, do you know of a way to just eliminate duplicate entries in a fastq file based on identifiers rather than sequence?
If one were certain that every part of the starting material was covered (e.g., with a theoretical sequencer that started at one end of the chromosome and read through its entire length), then 1x sequencing would be enough. By sequencing to 30x, you make it very likely that all sequenceable regions are sampled and represented in your data.
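To put a rough number on that, here is a back-of-the-envelope sketch. It assumes the idealized Lander-Waterman model (reads land uniformly at random along the genome), which nobody in this thread stated and which real libraries only approximate. Under that assumption, the number of reads covering a given base is roughly Poisson with mean equal to the coverage c, so the chance that a base is missed entirely is about e^{-c}:

```latex
P(\text{base uncovered}) \approx e^{-c}, \qquad
c = 1:\; e^{-1} \approx 0.37, \qquad
c = 30:\; e^{-30} \approx 9.4 \times 10^{-14}
```

So at 1x roughly a third of the bases would be expected to have no read at all, while at 30x the expected uncovered fraction is negligible under this model.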
In theory, a fastq file should contain no duplicate entries as far as sequence identifiers go (if you are referring to fastq headers). You would need to cat the same file twice to create one.
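If you want to check whether a concatenated file really contains repeated headers, something along these lines might do it. This is only a sketch: the function name duplicated_headers is made up for illustration, and it assumes strict 4-line fastq records (no wrapped sequence or quality lines) and compares the full header line, comments included.

```shell
#!/usr/bin/env bash
# Sketch: print every fastq header line that occurs more than once.
# Assumption: strict 4-line records, so headers are lines 1, 5, 9, ...
# The function name is hypothetical, not part of BBMap or any other tool.
duplicated_headers() {
    # Take only the header line of each record, then report repeats once each.
    awk 'NR % 4 == 1' "$1" | sort | uniq -d
}
```

Running `duplicated_headers concated.fastq | wc -l` would then give the number of distinct headers that appear at least twice; zero means there is nothing to remove by identifier.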
-
Ahhh!!! I might have just found the answer to my own question:
./dedupe.sh in=concat1.merged out=depuded_concat.merged rmn=t
The rmn (requirematchingnames) option, which defaults to f, requires both sequence and identifier to be identical when set to t.
Source: https://github.com/BioInfoTools/BBMa...r/sh/dedupe.sh
-
I think I understand this finally ...
The original fastq dataset was sampled in two different ways, and you want to eliminate duplicates that may have been selected in both subsets, leaving only one copy in the final combined file. And yes, the dedupe solution you discovered will work for that.
Last edited by GenoMax; 10-18-2016, 07:27 AM.
-
I am playing with a Unix-based option right now. I made a short test list of names in a file, test.txt:
@HWI-D00653:49:H2FF5BCXX:2:1101:1631:2117 1:N:0:ATCACG
@HWI-D00653:49:H2FF5BCXX:2:1101:1631:2117 2:N:0:ATCACG trim=1
@HWI-D00653:49:H2FF5BCXX:2:1101:1804:2196 1:N:0:ATCACG
@HWI-D00653:49:H2FF5BCXX:2:1101:2187:2119 1:N:0:ATCACG
and am trying to see if I can regenerate a fastq file from it using the command
grep -A 3 -F -f test.txt concated.fastq > out.fastq
(-f reads the patterns from test.txt, and -F treats them as fixed strings rather than regular expressions.)
If it works, I can generate a unique list using grep|sort|uniq.
My guess is this could take a while, though, as the run with test.txt has been going for the last 5 minutes.
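One caveat with the grep route: matching a header against the concatenated file returns every copy of that record, so the duplicates come back too, and GNU grep writes "--" separator lines between non-adjacent groups when -A is used. A single awk pass that remembers which headers it has already seen may be simpler. The following is only a sketch under stated assumptions: the function name dedupe_by_header is hypothetical, records are strictly 4 lines each, the whole header line is used as the key, and all distinct headers have to fit in memory.

```shell
#!/usr/bin/env bash
# Sketch: keep the first fastq record for each header line, drop later repeats.
# Assumptions: strict 4-line records; the complete header line is the key;
# the function name is hypothetical and not part of any existing tool.
dedupe_by_header() {
    awk '
        # On each header line (1, 5, 9, ...), keep the record only if this
        # exact header has not been seen before.
        NR % 4 == 1 { keep = !seen[$0]++ }
        # Print every line that belongs to a record marked as kept.
        keep
    ' "$1"
}
```

For example, `dedupe_by_header concated.fastq > unique.fastq` would write one copy of each record, in the original order (unique.fastq is just an illustrative output name). Because the key is the full header, reads 1 and 2 of a pair, or a header carrying an extra tag such as trim=1, count as different entries.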