Hi, I would like to know if there is any tool for removing unpaired FASTQ reads from a huge set of paired-end FASTQ data?
-
You said huge, but how many reads are there (roughly)? One million? Ten million? More?
Would a Python script using Biopython be useful? It would help if you could show a small sample of the reads (say the first ten) so I can be sure I understand your read naming convention. You can wrap the example with [ code ] and [ /code ] tags to format it nicely on the forum.
Comment
-
About 50 million reads... that would make any approach using a list of IDs in memory rather tricky.
Can we assume the read pairs are next to each other, e.g. you started out with this:
Code:
read1.f
read1.r
read2.f
read2.r
read3.f
read3.r
read4.f
read4.r
read5.f
read5.r
then after filtering some of the reads were removed, leaving something like this:
Code:
read1.f
read1.r
read2.r
read3.f
read4.f
read4.r
read5.r
and what you want to keep is just the complete pairs:
Code:
read1.f
read1.r
read4.f
read4.r
If the reads are in a random order, things will be much harder...
Comment
-
This is a general slow version (you can switch "fastq" to any other supported file format like "fasta" or "qual"):
Code:
from Bio import SeqIO

mixed_file = "mixed.fastq"
paired_file = "paired.fastq"

def get_paired(iterator):
    prev = None
    for curr in iterator:
        if curr.id.endswith("/1"):
            prev = curr
        elif not curr.id.endswith("/2"):
            raise ValueError("Expect IDs to end /1 and /2")
        elif prev and prev.id == curr.id[:-2] + "/1":
            yield prev
            yield curr
            prev = None

records = get_paired(SeqIO.parse(mixed_file, "fastq"))
count = SeqIO.write(records, paired_file, "fastq")
print "Saved %i records (%i pairs)" % (count, count/2)
This next script should be about four times faster, but is only suitable for FASTQ files:
Code:
from Bio.SeqIO.QualityIO import FastqGeneralIterator

mixed_file = "mixed.fastq"
paired_file = "paired.fastq"

out_handle = open(paired_file, "w")
prev = None
for curr in FastqGeneralIterator(open(mixed_file, "rU")):
    if curr[0].split()[0].endswith("/1"):
        prev = curr
    elif not curr[0].split()[0].endswith("/2"):
        raise ValueError("Expect IDs to end /1 and /2,\n%s" % curr[0])
    elif prev and prev[0].split()[0] == curr[0].split()[0][:-2] + "/1":
        out_handle.write("@%s\n%s\n+\n%s\n" % prev)
        out_handle.write("@%s\n%s\n+\n%s\n" % curr)
        prev = None
out_handle.close()
This was written and tested using Biopython 1.54beta, and should be fine on Biopython 1.51 or later. I've only tried it on a 500MB test file - it took about 30s.
If you are willing to make more assumptions about the file layout, or skip the minimal error checking, I could make this faster still - but I would need to see a sample of the data as suggested before (e.g. the first ten reads).
Comment
-
Another way to do this: first run TopHat, then check accepted_hits.sam and delete the lines with "*" in the mate reference column (the seventh field, RNEXT), i.e. reads whose mate did not map - see the sketch after the example lines below. For example:
HWI-EAS266_0005:1:32:2465:21083#0 177 chr1 120 255 42M = 199385343 0 AAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACC ^^[YVRK[bb^bbbb_babbb^bbbbbbaabbcbbb`bbbba NM:i:1
HWI-EAS266_0005:1:59:17297:6482#0 177 chr1 120 255 42M = 199385343 0 AAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACC ORXYaa\__bbbTbabbb`b[bbb``abbbbbb_bbb`b^bb NM:i:1
HWI-EAS266_0005:1:3:4093:8164#0 145 chr1 550 255 42M = 241327181 0 GTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG Rbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb NM:i:1
HWI-EAS266_0005:1:60:8918:19240#0 145 chr1 550 255 42M = 241327181 0 GTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG BBBBa_\`a_`a[a[_aaaXTa```a_a[a_]]_`T]`__aa NM:i:1
HWI-EAS266_0005:1:79:5108:21091#0 73 chr1 1146 1 42M * 0 0 GCGCCCCCTGCTGGCGCCGGGGCGCTGCAGGGCCCTCTTGCT aQ_a`^ca_`a__c]`\cbbbbbb^bbb]ba^abbbbbbbbb NM:i:1
HWI-EAS266_0005:1:8:7442:18432#0 73 chr1 1168 1 42M * 0 0 CACTGCAGGGCCCTCTTGCTTACTGTATAGTGGTGGCACGCC _Sb_ab]bbb^_ba^b_b\`V]\b``b_ababbbababbbb` NM:i:0
HWI-EAS266_0005:1:35:18364:16230#0 147 chr1 1303 255 42M = 1458 0 TTCTTGCTCCAACAGTAGTGGCGGATTATAGGGAAACACCCG B_bb`bbcbbbbbcbbbbbcbbbbb`bbbbbbbabababbbb NM:i:2
HWI-EAS266_0005:1:26:17783:8912#0 137 chr1 1585 0 42M * 0 0 GGTATTTTTTTAAATTTCCACTGATGATTTTGCTGCATGGCC BBBBBababaaaa_bb]bb`bbbacbbbcc_bc`bbcbbbbb NM:i:1
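For example, a minimal sketch of that filtering step in Python (the file names are placeholders; it keeps SAM header lines and any alignment whose mate reference column, the seventh field, is not "*"):
Code:
# Sketch only: drop alignments whose mate reference (RNEXT, 7th column) is "*",
# i.e. reads without a mapped mate; keep SAM header lines untouched.
with open("accepted_hits.sam") as sam, open("paired_hits.sam", "w") as out:
    for line in sam:
        if line.startswith("@"):
            out.write(line)  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 6 and fields[6] != "*":  # RNEXT is the 7th column
            out.write(line)
Checking the FLAG field instead (bit 0x8 marks an unmapped mate) would be a more formal test, but the RNEXT column is what differs in the example lines above.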
Is this method reasonable?
Comment
-
I have tried to run your script, but the problem is that my file does not have the /1 or /2 ending. In my case the reads are labelled with a 1 or 2 just before the N near the end of the header line (e.g. "1:N:0:20" vs "2:N:0:20"), and I have two separate FASTQ files which look like:
File 1:
@M00289:3:000000000-A752F:1:1102:13093:3364 1:N:0:20
GGCTGTGATCACGGGGCAAAACAGCCACTTCATTAACTTTCAGTAAGGGCTGCAAGGTAACTCCAGCAGGTGCCTTTTGTGTTTCACTCAACCCTAGGTCAAAACGTTGTTCACTCATGGCTTGTTCCAACCAAGGCGACTCTAAGG
+
BCCCBCFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHGHHHHHGHGGHHHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHGHHHGHHHHHHHHHHHHHGGHHHHHHGGHHGHGGGGGHHHHH
@M00289:3:000000000-A752F:1:1110:7454:18361 1:N:0:20
ATCTCAACGCGTCCAAATGAGGCATCGCTGTATTCAGGTTACTTTACATAAGAGTTTTTATGTTAAATAGGACTAAAAATATACTCTAATTTTAGAGTTTTCTTTTAGGTATGATGTAAAAACATACAAGCCTAAGAGTTTAATTTAAAGG
+
CCCCCFFFDDDDGGGGGGGGGGGGHHGGHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHGHHHHHHHHHHGHHHHHHGG
File 2:
@M00289:3:000000000-A752F:1:1112:9370:8062 2:N:0:20
CCACACCCCCTGCAGAGCGTTCTCGCAGACACAGTCCGCAAAGCCAGTGCCGACTTGAGCCACCTTGACCAGTGTTTTTATTAGAACTAGAAACTAGAGGATTTGTTGCAC
+
BBCDDCCEDEEEGGGGGGGGGGGHGGGGGHHHHHHHHGGGGGHHHFHGHHHGCGGGHHHHHHHHHHHGHHHHHGGHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@M00289:3:000000000-A752F:1:1114:14694:7357 2:N:0:20
TAATTTAACTGTAGTTCATCCGCATTTTCCCCTCTAATGCCCCAGTTTTCTTTAGGGGTTTCAAATATAGTTATCTCAACATCAT
+
BBBBCFFFFFFFGGGGGGGGGGGGGGHHHHHHHGHHHHHHHHGGGHHHHHHHHHHHGGEGGHHHHHHHHHHHHHHHHHHHHHHFH
@M00289:3:000000000-A752F:1:1102:13093:3364 2:N:0:20
CCTTAGAGTCGCCTTGGTTGGAACAAGCCATGAGTGAACAACGTTTTGACCTAGGGTTGAGTGAAACACAAAAGGCACCTGCTGGAGTTACCTTGCAGCCCTTACTGAAAGTTAATGAAGTGGCTGTTTTGCCCCGTGATCACAGCC
+
BBBBCFFFFFCDGGGGGGGGGGHHHHHHGHHHHHHHHHHHHHGHHHGHGHHGHHHHHHGHHGHHHHHHHHGHHHGHHHGHHHHHHHHHHHHHHHHHFHHHGGHHHHHHHHHHHHHHHHHGHHHGHG0FGH03GHGF<DC2FFGHHHH
As you can see, the reads are not in the same order in each file, so I suppose this makes things more complicated. My files are not very big (5.2 MB and 5.4 MB respectively), so I wonder if you know a way of selecting only the reads which have data for both ends.
Originally posted by maubp View Post: This next script should be about four times faster, but is only suitable for FASTQ files: ...
Last edited by chariko; 08-05-2014, 07:15 AM.
Comment
-
Originally posted by chariko View Post: I have tried to run your script, but the problem is that my file does not have the /1 or /2 ending. In my case the reads are labelled with a 1 or 2 just before the N near the end of the header line, and I have two separate FASTQ files which look like: ... My files are not very big (5.2 MB and 5.4 MB respectively), so I wonder if you know a way of selecting only the reads which have data for both ends.
Illumina changed their naming schema. There are scripts out there to convert this back to the /1 and /2 style (even as Unix one-line commands with sed or awk). I have a related Python script (which has a Galaxy wrapper) which knows more naming schemes, but it expects the reads to be sorted/interleaved - yours are more mixed up:
https://github.com/peterjc/pico_galaxy
Do you have the original Illumina FASTQ files before whatever processing was done to them?
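Since your two files are only a few MB, a rough sketch of pairing them directly by read ID with Biopython might also do the job (the input and output file names below are just placeholders):
Code:
from Bio import SeqIO

# Placeholder file names - replace with your own.
file_1, file_2 = "reads_1.fastq", "reads_2.fastq"
out_1, out_2 = "reads_1.paired.fastq", "reads_2.paired.fastq"

# Index the second file by record ID; Biopython takes the ID from the part
# of the title line before the first space, so "... 1:N:0:20" and
# "... 2:N:0:20" headers share the same key.
index_2 = SeqIO.index(file_2, "fastq")

handle_1 = open(out_1, "w")
handle_2 = open(out_2, "w")
count = 0
for record in SeqIO.parse(file_1, "fastq"):
    if record.id in index_2:
        # Write the pair to both output files in matching order.
        SeqIO.write(record, handle_1, "fastq")
        SeqIO.write(index_2[record.id], handle_2, "fastq")
        count += 1
handle_1.close()
handle_2.close()
print("Kept %i pairs" % count)
Because the second file is only indexed (not loaded into memory), this should also cope with larger files, and both output files come out in the first file's order.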
Peter
Comment
-
Originally posted by maubp View Post: Illumina changed their naming schema. There are scripts out there to convert this back to the /1 and /2 style (even as Unix one-line commands with sed or awk).
I have a related Python script (which has a Galaxy wrapper) which knows more naming schemes, but it expects the reads to be sorted/interleaved - yours are more mixed up:
https://github.com/peterjc/pico_galaxy
Do you have the original Illumina FASTQ files before whatever processing was done to them?
Peter
Comment
-
Either use a pair-aware filtering pipeline, or do the following (a small interleaving sketch follows the list):
1. Interleave the pairs (one FASTQ file with r1, r2, r1, r2, etc)
2. Filter the interleaved file (preserving the original order)
3. Run https://github.com/peterjc/pico_gala...aired_unpaired
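For step 1, a minimal interleaving sketch in Python with Biopython (file names are placeholders; it assumes the two input files are already in matching order, as raw Illumina R1/R2 files normally are):
Code:
from Bio.SeqIO.QualityIO import FastqGeneralIterator

# Placeholder file names - replace with your own. R1 and R2 must be in
# matching order (read N of R1 pairs with read N of R2).
iter_r1 = FastqGeneralIterator(open("sample_R1.fastq"))
iter_r2 = FastqGeneralIterator(open("sample_R2.fastq"))

out = open("interleaved.fastq", "w")
# On Python 2 with very large files, itertools.izip avoids building a list.
for (t1, s1, q1), (t2, s2, q2) in zip(iter_r1, iter_r2):
    out.write("@%s\n%s\n+\n%s\n" % (t1, s1, q1))
    out.write("@%s\n%s\n+\n%s\n" % (t2, s2, q2))
out.close()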
Comment
-
Originally posted by chariko View Post: I have tried to run your script, but the problem is that my file does not have the /1 or /2 ending. In my case the reads are labelled with a 1 or 2 just before the N near the end of the header line, and I have two separate FASTQ files ...
Comment
-
Thanks SES for your suggestion. I will try it also.
Anyway, I finally made it work by doing the following:
First I changed the read name endings from 1:N... to /1 and from 2:N... to /2 with:
gawk '{print((NR % 4 == 1) ? $1"/1" : $0)}' Sample1_R1.fq > Sample1_newtags_R1.fq
gawk '{print((NR % 4 == 1) ? $1"/2" : $0)}' Sample1_R2.fq > Sample1_newtags_R2.fq
(https://wikis.utexas.edu/display/bio...nux+one-liners).
I interleaved the pairs using the script interleave_fastq.py (https://gist.github.com/ngcrawford/2232505).
I filtered my FASTQ file using the following script:
(https://github.com/brentp/bio-playgr...r/reads-utils/)
The problem is that the filtered file does not maintain the order of the paired reads, so you have to restore it yourself (in my case with Excel; a sorting sketch near the end of this post shows a scripted alternative).
And finally, after installing Galaxy locally, I ran the Galaxy tool made by maubp:
https://github.com/peterjc/pico_galaxy
When running FastQC on my files again, the duplication levels were OK.
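Instead of re-ordering the reads by hand, a small sketch like this (file names are placeholders) could sort the filtered, interleaved file by read name so that the /1 and /2 records of each pair sit next to each other again before the splitting step:
Code:
from Bio import SeqIO

# Placeholder file names. Loading everything into memory is fine for
# small files like these (a few MB); a disk-based sort would be needed
# for really large data.
records = list(SeqIO.parse("filtered_interleaved.fastq", "fastq"))
records.sort(key=lambda rec: rec.id)  # ".../1" and ".../2" end up adjacent
SeqIO.write(records, "filtered_sorted.fastq", "fastq")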
Thanks a lot for your suggestions.
Last edited by chariko; 08-08-2014, 06:44 AM.
Comment