Hi, I would like to know if there is any tool for removing unpaired FASTQ reads from a huge set of paired-end FASTQ data?
-
You said huge, but how many reads are there (roughly)? One million? Ten million? More?
Would a Python script using Biopython be useful? It would help if you could show a small sample of the reads (say the first ten) so I can be sure I understand your read naming convention. You can wrap the example with [ code ] and [ /code ] tags to format it nicely on the forum.
Comment
-
About 50 million reads... that would make any approach using a list of IDs in memory rather tricky.
Can we assume the read pairs are next to each other, e.g. you started out with this:
Code:
read1.f
read1.r
read2.f
read2.r
read3.f
read3.r
read4.f
read4.r
read5.f
read5.r
then after filtering some of the reads were removed, leaving something like this:
Code:
read1.f
read1.r
read2.r
read3.f
read4.f
read4.r
read5.r
and what you want to keep is just the complete pairs:
Code:
read1.f
read1.r
read4.f
read4.r
If the reads are in a random order, things will be much harder...
Comment
-
This is a general slow version (you can switch "fastq" to any other supported file format like "fasta" or "qual"):
Code:
from Bio import SeqIO

mixed_file = "mixed.fastq"
paired_file = "paired.fastq"

def get_paired(iterator):
    prev = None
    for curr in iterator:
        if curr.id.endswith("/1"):
            prev = curr
        elif not curr.id.endswith("/2"):
            raise ValueError("Expect IDs to end /1 and /2")
        elif prev and prev.id == curr.id[:-2] + "/1":
            yield prev
            yield curr
            prev = None

records = get_paired(SeqIO.parse(mixed_file, "fastq"))
count = SeqIO.write(records, paired_file, "fastq")
print "Saved %i records (%i pairs)" % (count, count/2)
This next script should be about four times faster, but is only suitable for FASTQ files:
Code:
from Bio.SeqIO.QualityIO import FastqGeneralIterator

mixed_file = "mixed.fastq"
paired_file = "paired.fastq"

out_handle = open(paired_file, "w")
prev = None
for curr in FastqGeneralIterator(open(mixed_file, "rU")):
    if curr[0].split()[0].endswith("/1"):
        prev = curr
    elif not curr[0].split()[0].endswith("/2"):
        raise ValueError("Expect IDs to end /1 and /2,\n%s" % curr[0])
    elif prev and prev[0].split()[0] == curr[0].split()[0][:-2] + "/1":
        out_handle.write("@%s\n%s\n+\n%s\n" % prev)
        out_handle.write("@%s\n%s\n+\n%s\n" % curr)
        prev = None
out_handle.close()
This was written and tested using Biopython 1.54beta, and should be fine on Biopython 1.51 or later. I've only tried it on a 500MB test file - it took about 30s.
If you are willing to make more assumptions about the file layout, or skip the minimal error checking, I could make this faster still - but I would need to see a sample of the data as suggested before (e.g. the first ten reads).
Comment
-
Another way to do this: first run TopHat, then check accepted_hits.sam and delete the lines with "*" in the mate reference column (the seventh field, RNEXT), i.e. reads whose mate did not map - see the sketch after the example lines below. For example:
HWI-EAS266_0005:1:32:2465:21083#0 177 chr1 120 255 42M = 199385343 0 AAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACC ^^[YVRK[bb^bbbb_babbb^bbbbbbaabbcbbb`bbbba NM:i:1
HWI-EAS266_0005:1:59:17297:6482#0 177 chr1 120 255 42M = 199385343 0 AAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACC ORXYaa\__bbbTbabbb`b[bbb``abbbbbb_bbb`b^bb NM:i:1
HWI-EAS266_0005:1:3:4093:8164#0 145 chr1 550 255 42M = 241327181 0 GTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG Rbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb NM:i:1
HWI-EAS266_0005:1:60:8918:19240#0 145 chr1 550 255 42M = 241327181 0 GTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG BBBBa_\`a_`a[a[_aaaXTa```a_a[a_]]_`T]`__aa NM:i:1
HWI-EAS266_0005:1:79:5108:21091#0 73 chr1 1146 1 42M * 0 0 GCGCCCCCTGCTGGCGCCGGGGCGCTGCAGGGCCCTCTTGCT aQ_a`^ca_`a__c]`\cbbbbbb^bbb]ba^abbbbbbbbb NM:i:1
HWI-EAS266_0005:1:8:7442:18432#0 73 chr1 1168 1 42M * 0 0 CACTGCAGGGCCCTCTTGCTTACTGTATAGTGGTGGCACGCC _Sb_ab]bbb^_ba^b_b\`V]\b``b_ababbbababbbb` NM:i:0
HWI-EAS266_0005:1:35:18364:16230#0 147 chr1 1303 255 42M = 1458 0 TTCTTGCTCCAACAGTAGTGGCGGATTATAGGGAAACACCCG B_bb`bbcbbbbbcbbbbbcbbbbb`bbbbbbbabababbbb NM:i:2
HWI-EAS266_0005:1:26:17783:8912#0 137 chr1 1585 0 42M * 0 0 GGTATTTTTTTAAATTTCCACTGATGATTTTGCTGCATGGCC BBBBBababaaaa_bb]bb`bbbacbbbcc_bc`bbcbbbbb NM:i:1
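For example, a minimal sketch of that filtering step in Python (the file names are placeholders; it keeps SAM header lines and any alignment whose mate reference column, the seventh field, is not "*"):
Code:
# Sketch only: drop alignments whose mate reference (RNEXT, 7th column) is "*",
# i.e. reads without a mapped mate; keep SAM header lines untouched.
with open("accepted_hits.sam") as sam, open("paired_hits.sam", "w") as out:
    for line in sam:
        if line.startswith("@"):
            out.write(line)  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 6 and fields[6] != "*":  # RNEXT is the 7th column
            out.write(line)
Checking the FLAG field instead (bit 0x8 marks an unmapped mate) would be a more formal test, but the RNEXT column is what differs in the example lines above.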
Is this method reasonable?
Comment
-
I have tried to run your script, but the problem is that my file does not have the /1 or /2 ending. In my case the reads are labelled with a 1 or 2 just before the N near the end of the header line (e.g. "1:N:0:20" vs "2:N:0:20"), and I have two separate FASTQ files which look like:
File 1:
@M00289:3:000000000-A752F:1:1102:13093:3364 1:N:0:20
GGCTGTGATCACGGGGCAAAACAGCCACTTCATTAACTTTCAGTAAGGGCTGCAAGGTAACTCCAGCAGGTGCCTTTTGTGTTTCACTCAACCCTAGGTCAAAACGTTGTTCACTCATGGCTTGTTCCAACCAAGGCGACTCTAAGG
+
BCCCBCFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHGHHHHHGHGGHHHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHGHHHGHHHHHHHHHHHHHGGHHHHHHGGHHGHGGGGGHHHHH
@M00289:3:000000000-A752F:1:1110:7454:18361 1:N:0:20
ATCTCAACGCGTCCAAATGAGGCATCGCTGTATTCAGGTTACTTTACATAAGAGTTTTTATGTTAAATAGGACTAAAAATATACTCTAATTTTAGAGTTTTCTTTTAGGTATGATGTAAAAACATACAAGCCTAAGAGTTTAATTTAAAGG
+
CCCCCFFFDDDDGGGGGGGGGGGGHHGGHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHGHHHHHHHHHHGHHHHHHGG
File 2:
@M00289:3:000000000-A752F:1:1112:9370:8062 2:N:0:20
CCACACCCCCTGCAGAGCGTTCTCGCAGACACAGTCCGCAAAGCCAGTGCCGACTTGAGCCACCTTGACCAGTGTTTTTATTAGAACTAGAAACTAGAGGATTTGTTGCAC
+
BBCDDCCEDEEEGGGGGGGGGGGHGGGGGHHHHHHHHGGGGGHHHFHGHHHGCGGGHHHHHHHHHHHGHHHHHGGHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@M00289:3:000000000-A752F:1:1114:14694:7357 2:N:0:20
TAATTTAACTGTAGTTCATCCGCATTTTCCCCTCTAATGCCCCAGTTTTCTTTAGGGGTTTCAAATATAGTTATCTCAACATCAT
+
BBBBCFFFFFFFGGGGGGGGGGGGGGHHHHHHHGHHHHHHHHGGGHHHHHHHHHHHGGEGGHHHHHHHHHHHHHHHHHHHHHHFH
@M00289:3:000000000-A752F:1:1102:13093:3364 2:N:0:20
CCTTAGAGTCGCCTTGGTTGGAACAAGCCATGAGTGAACAACGTTTTGACCTAGGGTTGAGTGAAACACAAAAGGCACCTGCTGGAGTTACCTTGCAGCCCTTACTGAAAGTTAATGAAGTGGCTGTTTTGCCCCGTGATCACAGCC
+
BBBBCFFFFFCDGGGGGGGGGGHHHHHHGHHHHHHHHHHHHHGHHHGHGHHGHHHHHHGHHGHHHHHHHHGHHHGHHHGHHHHHHHHHHHHHHHHHFHHHGGHHHHHHHHHHHHHHHHHGHHHGHG0FGH03GHGF<DC2FFGHHHH
As you can see, the reads are not in the same order in each file, so I suppose this makes things more complicated. My files are not very big (5.2 MB and 5.4 MB respectively), so I wonder if you know a way of selecting only the reads which have data for both ends.
Originally posted by maubp View Post: This next script should be about four times faster, but is only suitable for FASTQ files: ...
Last edited by chariko; 08-05-2014, 07:15 AM.
Comment
-
Originally posted by chariko View Post: I have tried to run your script, but the problem is that my file does not have the /1 or /2 ending. In my case the reads are labelled with a 1 or 2 just before the N near the end of the header line, and I have two separate FASTQ files which look like: ... My files are not very big (5.2 MB and 5.4 MB respectively), so I wonder if you know a way of selecting only the reads which have data for both ends.
Illumina changed their naming schema. There are scripts out there to convert this back to the /1 and /2 style (even as Unix one-line commands with sed or awk). I have a related Python script (which has a Galaxy wrapper) which knows more naming schemes, but it expects the reads to be sorted/interleaved - yours are more mixed up:
https://github.com/peterjc/pico_galaxy
Do you have the original Illumina FASTQ files before whatever processing was done to them?
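Since your two files are only a few MB, a rough sketch of pairing them directly by read ID with Biopython might also do the job (the input and output file names below are just placeholders):
Code:
from Bio import SeqIO

# Placeholder file names - replace with your own.
file_1, file_2 = "reads_1.fastq", "reads_2.fastq"
out_1, out_2 = "reads_1.paired.fastq", "reads_2.paired.fastq"

# Index the second file by record ID; Biopython takes the ID from the part
# of the title line before the first space, so "... 1:N:0:20" and
# "... 2:N:0:20" headers share the same key.
index_2 = SeqIO.index(file_2, "fastq")

handle_1 = open(out_1, "w")
handle_2 = open(out_2, "w")
count = 0
for record in SeqIO.parse(file_1, "fastq"):
    if record.id in index_2:
        # Write the pair to both output files in matching order.
        SeqIO.write(record, handle_1, "fastq")
        SeqIO.write(index_2[record.id], handle_2, "fastq")
        count += 1
handle_1.close()
handle_2.close()
print("Kept %i pairs" % count)
Because the second file is only indexed (not loaded into memory), this should also cope with larger files, and both output files come out in the first file's order.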
Peter
Comment
-
Originally posted by maubp View Post: Illumina changed their naming schema. There are scripts out there to convert this back to the /1 and /2 style (even as Unix one-line commands with sed or awk).
I have a related Python script (which has a Galaxy wrapper) which knows more naming schemes, but it expects the reads to be sorted/interleaved - yours are more mixed up:
https://github.com/peterjc/pico_galaxy
Do you have the original Illumina FASTQ files before whatever processing was done to them?
Peter
Comment
-
Either use a pair-aware filtering pipeline, or do the following (a small interleaving sketch follows the list):
1. Interleave the pairs (one FASTQ file with r1, r2, r1, r2, etc)
2. Filter the interleaved file (preserving the original order)
3. Run https://github.com/peterjc/pico_gala...aired_unpaired
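For step 1, a minimal interleaving sketch in Python with Biopython (file names are placeholders; it assumes the two input files are already in matching order, as raw Illumina R1/R2 files normally are):
Code:
from Bio.SeqIO.QualityIO import FastqGeneralIterator

# Placeholder file names - replace with your own. R1 and R2 must be in
# matching order (read N of R1 pairs with read N of R2).
iter_r1 = FastqGeneralIterator(open("sample_R1.fastq"))
iter_r2 = FastqGeneralIterator(open("sample_R2.fastq"))

out = open("interleaved.fastq", "w")
# On Python 2 with very large files, itertools.izip avoids building a list.
for (t1, s1, q1), (t2, s2, q2) in zip(iter_r1, iter_r2):
    out.write("@%s\n%s\n+\n%s\n" % (t1, s1, q1))
    out.write("@%s\n%s\n+\n%s\n" % (t2, s2, q2))
out.close()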
Comment
-
Originally posted by chariko View Post: I have tried to run your script, but the problem is that my file does not have the /1 or /2 ending. In my case the reads are labelled with a 1 or 2 just before the N near the end of the header line, and I have two separate FASTQ files ...
Comment
-
Thanks SES for your suggestion. I will try it also.
Anyway, I finally made it work by doing the following:
First I changed the read name endings from 1:N... to /1 and from 2:N... to /2 with:
gawk '{print((NR % 4 == 1) ? $1"/1" : $0)}' Sample1_R1.fq > Sample1_newtags_R1.fq
gawk '{print((NR % 4 == 1) ? $1"/2" : $0)}' Sample1_R2.fq > Sample1_newtags_R2.fq
(https://wikis.utexas.edu/display/bio...nux+one-liners).
I interleaved the pairs using the script interleave_fastq.py (https://gist.github.com/ngcrawford/2232505).
I filtered my FASTQ file using the following script:
(https://github.com/brentp/bio-playgr...r/reads-utils/)
The problem is that the filtered file does not maintain the order of the paired reads, so you have to restore it yourself (in my case with Excel; a sorting sketch near the end of this post shows a scripted alternative).
And finally, after installing Galaxy locally, I ran the Galaxy tool made by maubp:
https://github.com/peterjc/pico_galaxy
When running FastQC on my files again, the duplication levels were OK.
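Instead of re-ordering the reads by hand, a small sketch like this (file names are placeholders) could sort the filtered, interleaved file by read name so that the /1 and /2 records of each pair sit next to each other again before the splitting step:
Code:
from Bio import SeqIO

# Placeholder file names. Loading everything into memory is fine for
# small files like these (a few MB); a disk-based sort would be needed
# for really large data.
records = list(SeqIO.parse("filtered_interleaved.fastq", "fastq"))
records.sort(key=lambda rec: rec.id)  # ".../1" and ".../2" end up adjacent
SeqIO.write(records, "filtered_sorted.fastq", "fastq")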
Thanks a lot for your suggestions.
Last edited by chariko; 08-08-2014, 06:44 AM.
Comment