Seqanswers Leaderboard Ad

**maubp** · 10-12-2011, 11:34 PM

Here's an SFF example copied from the Biopython Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.htm

Code:

from Bio import SeqIO
input_file = "big_file.sff"
id_file = "short_list.txt"
output_file = "short_list.sff"
wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)
records = (r for r in SeqIO.parse(input_file, "sff") if r.id in wanted)
count = SeqIO.write(records, output_file, "sff")
print "Saved %i records from %s to %s" % (count, input_file, output_file)
if count < len(wanted):
    print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)

Switch a few format names, and we get a FASTQ version:

Code:

from Bio import SeqIO
input_file = "big_file.fastq"
id_file = "short_list.txt"
output_file = "short_list.fastq"
wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)
records = (r for r in SeqIO.parse(input_file, "fastq") if r.id in wanted)
count = SeqIO.write(records, output_file, "fastq")
print "Saved %i records from %s to %s" % (count, input_file, output_file)
if count < len(wanted):
    print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)

The IDs are taken from a text file, one per line. If your file is tab separated, the first entry of each line is used.

Probably fast enough for a one off task, if this is something you'll be doing regularly there are likely quicker solutions.

P.S. I've got Python scripts to do this and related tasks within Galaxy, checkout http://toolshed.g2.bx.psu.edu/ and in particular http://toolshed.g2.bx.psu.edu/view/p...q_select_by_id

**maasha** · 10-12-2011, 11:49 PM

You can do this with Biopieces (www.biopieces.org) like this:

First you need a file with the FASTQ sequence names you are interested in - or IDs if you like - one per line. And then:

Code:

read_fastq -i in.fastq | grab -E ids.txt | write_fastq -xo out.fastq

Check out grab for details.

Cheers,

Martin

**pepperoni** · 10-13-2011, 09:52 AM

Thank you for your quick reply Maubp
I tried saving what you wrote into a file called ExtractFastq.py and ran it but I guess it was not a complete script since I got the following error. (sorry I am a begginer)

$ python ExtractFastq.py
Traceback (most recent call last):
File "ExtractFastq.py", line 2, in <module>
input_file = TrepoSinFilt.fastq
NameError: name 'TrepoSinFilt' is not defined

For the suggestion of Galaxy I am uploading the file, however it has 15Gb so I am not sure that with the speed of my internet I will be able to achieve it.

**pepperoni** · 10-13-2011, 09:56 AM

Originally posted by maasha View Post

You can do this with Biopieces (www.biopieces.org) like this:

First you need a file with the FASTQ sequence names you are interested in - or IDs if you like - one per line. And then:

Code:

read_fastq -i in.fastq | grab -E ids.txt | write_fastq -xo out.fastq

Check out grab for details.

Cheers,

Martin

Thank you for your quick reply Maasha, I am trying to install Biopieces, however I cannot run the test commands, it says
zsh: command not found: bp_test

I already checked that I have all the perl modules

any suggestions?

**maasha** · 10-13-2011, 10:06 AM

Follow the Installation instructions carefully. Also, it looks like you are running zsh. You have to run bash for Biopieces to work.

**maubp** · 10-13-2011, 10:32 AM

Originally posted by pepperoni View Post

Thank you for your quick reply Maubp
I tried saving what you wrote into a file called ExtractFastq.py and ran it but I guess it was not a complete script since I got the following error. (sorry I am a begginer)

$ python ExtractFastq.py
Traceback (most recent call last):
File "ExtractFastq.py", line 2, in <module>
input_file = TrepoSinFilt.fastq
NameError: name 'TrepoSinFilt' is not defined

Probably you just need to put quotes round the filename as in the example, that's how you define a string in Python. i.e.

Code:

...
input_file = "TrepoSinFilt.fastq"
...

You can also use single quotes rather than the double quote characters.

Originally posted by pepperoni View Post

For the suggestion of Galaxy I am uploading the file, however it has 15Gb so I am not sure that with the speed of my internet I will be able to achieve it.

Uploading something that large to the Penn State Galaxy might be hard, also they might have started imposing user disk quotas now...

[update] ... and the tool I'm talking about isn't on the Penn State Galaxy anyway, its an optional extra from the Tool Shed which can be added to a local Galaxy.

**maubp** · 10-13-2011, 10:55 AM

Actually, for 15Gb of data, you probably will care about the speed even for a one off - this should be about 5x faster as explained here (if you care about the Python side of things):

OBF » Working with FASTQ files in Biopython when speed matters » Working with FASTQ files in Biopython when speed matters

http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

Open Bioinformatics Foundation Homepage

Something like this:

Code:

from Bio.SeqIO.QualityIO import FastqGeneralIterator

input_file = "big_file.fastq"
id_file = "short_list.txt"
output_file = "short_list.fastq"

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)

count = 0
handle = open(output_file, "w")
for title, seq, qual in FastqGeneralIterator(open(input_file)) :
    if title.split(None,1)[0] in wanted:
        handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
        count += 1
handle.close()

print "Saved %i records from %s to %s" % (count, input_file, output_file)
if count < len(wanted):
    print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)

A time comparison with BioPieces would be interesting if you can get both methods to work

Please include installation and trouble shooting time as well

**pmiguel** · 10-13-2011, 11:05 AM

cdbfasta -Q and cdbyank are the best solutions I have found. Fast!

But also, would not the GNU version of grep work since it allows you to return a number of lines after the line that matches with the -A parameter?

--
Phillip

**maubp** · 10-13-2011, 11:15 AM

How would you supply a potentially very long list of IDs to grep?

**pepperoni** · 10-13-2011, 01:22 PM

Originally posted by maubp View Post

Actually, for 15Gb of data, you probably will care about the speed even for a one off - this should be about 5x faster as explained here (if you care about the Python side of things):

OBF » Working with FASTQ files in Biopython when speed matters » Working with FASTQ files in Biopython when speed matters

http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

Open Bioinformatics Foundation Homepage

Something like this:

Code:

from Bio.SeqIO.QualityIO import FastqGeneralIterator

input_file = "big_file.fastq"
id_file = "short_list.txt"
output_file = "short_list.fastq"

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)

count = 0
handle = open(output_file, "w")
for title, seq, qual in FastqGeneralIterator(open(input_file)) :
    if title.split(None,1)[0] in wanted:
        handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
        count += 1
handle.close()

print "Saved %i records from %s to %s" % (count, input_file, output_file)
if count < len(wanted):
    print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)

A time comparison with BioPieces would be interesting if you can get both methods to work

Please include installation and trouble shooting time as well

Now it has worked but I get this error:
alejandra@alejandra-desktop[DocumentosKNoCupieron] python ExtractFastq2.py
Found 3125101 unique identifiers in alignedTrepofa.lst
Traceback (most recent call last):
File "ExtractFastq2.py", line 12, in <module>
for title, seq, qual in FastqGeneralIterator(open(input_file)) :
File "/usr/lib/pymodules/python2.6/Bio/SeqIO/QualityIO.py", line 920, in FastqGeneralIterator
% (title_line, seq_len, len(quality_string)))
ValueError: Lengths of sequence and quality values differs for Corrida1:1117_10_107/1 (49 and 73).

I guess that it is because of the conversion from csfasta & qual to fastq. I used the solid2fastq from maq.
do you think that it may work to change my file to fastq sanger in galaxy before running the script?
On the other hand I already installed galaxy but in which directory shall I clone your tool from the galaxy tool shed?

thanks

**maubp** · 10-13-2011, 02:13 PM

Biopython expects FASTQ records where seq and qual are the same length. Galaxy accepts colour space FASTQ where the seq is one letter longer than the qual string (the nucleotide before the colour changes doesn't get a quality score for some reason). My Galaxy tool should cope with that situation by using the Galaxy library - but the examples above won't.

However, the error message says the sequence is length 49 and the quality is 73, which does not make sense at all. Can you show use some of the file - the bit including record Corrida1:1117_10_107/1 in particular?

My guess is something has gone wrong in the solid to fastq conversion.

**pepperoni** · 10-13-2011, 04:50 PM

Here is a bit of the file:

@Corrida1:1117_10_107/1
GNNNANNATNCGANGNNNTNTAANNAANTNGNNTCNGATNTNNNCNNAT
+
@-"-"-">-"-"5A-"3

-"'-"-"-"?-"5.(-"-"6?-"9-"7-"-"7/-"-;6-"&-"-"-"5-"-"-=
@Corrida1:1117_10_146/1
ANNNGNNCANTATNGNNNGNCCANNCCNANCNNTGNATTNCNNNGNNTT
+
B-"-"-">-"-"==-"(1&-"?-"-"-"/-"%.%-"-",.-"&-"(-"-"+1-"%-0-")-"-"-"1-"-"+%
@Corrida1:1117_10_1017/1
GNNNCNNTANGCANTNNNGNACTNNACNGNANNGTNGTTNGNNNANNTT
+
B-"-"-"<-"-"21-"=9,-"'-"-"-">-")>9-"-"))-"5-".-"-").-"=+9-"+-"-"-"%-"-"('
@Corrida1:1117_11_136/1
ANNNTNNTANGATNGNNNANGTGNNTCNGNTGNGGNGGGNTNNNANNAT
+
7-"-"-"B-"-"?<-"<=A-">-"-"-"<-"3*'-"-"81-";-"&(-"7(-"3/)-")-"-"-",-"-"%9

**maubp** · 10-14-2011, 02:49 AM

Originally posted by pepperoni View Post

Here is a bit of the file:

Great, with the [ code ] tags (via the # button in the forum's advanced editor) we get this, without the

face:

Code:

@Corrida1:1117_10_107/1
GNNNANNATNCGANGNNNTNTAANNAANTNGNNTCNGATNTNNNCNNAT
+
@-"-"-">-"-"5A-"3:(-"'-"-"-"?-"5.(-"-"6?-"9-"7-"-"7/-"-;6-"&-"-"-"5-"-"-=
@Corrida1:1117_10_146/1
ANNNGNNCANTATNGNNNGNCCANNCCNANCNNTGNATTNCNNNGNNTT
+
B-"-"-">-"-"==-"(1&-"?-"-"-"/-"%.%-"-",.-"&-"(-"-"+1-"%-0-")-"-"-"1-"-"+%
@Corrida1:1117_10_1017/1
GNNNCNNTANGCANTNNNGNACTNNACNGNANNGTNGTTNGNNNANNTT
+
B-"-"-"<-"-"21-"=9,-"'-"-"-">-")>9-"-"))-"5-".-"-").-"=+9-"+-"-"-"%-"-"('
@Corrida1:1117_11_136/1
ANNNTNNTANGATNGNNNANGTGNNTCNGNTGNGGNGGGNTNNNANNAT
+
7-"-"-"B-"-"?<-"<=A-">-"-"-"<-"3*'-"-"81-";-"&(-"7(-"3/)-")-"-"-",-"-"%9

Does that match what you see in the file, or has it got changed during posting to the forum?

It is very clear that the sequence and quality strings are not the same length. There is also a surprising number of minus signs and quotes in the quality strings. I thought at first there was an extra minus sign between every real quality character, but that doesn't quite explain it. Something is very wrong though.

Also your sequences look like nucleotides, could you confirm this is meant to be color-space data?

Could you show us the equivalent bits of the original color-space FASTA and QUAL files?

**pepperoni** · 10-14-2011, 04:09 AM

yes the file is exactly like that (without the happy face)
These are the original files

# Title: Corrida_16_01RMDSPFR004_1
>1117_10_107_F3
T02...0..03.120.2...3.300..00.3.2..31.203.3...1..03
>1117_10_146_F3
T30...2..10.303.2...2.110..11.0.1..32.033.1...2..33
>1117_10_1017_F3
T32...1..30.210.3...2.013..01.2.0..23.233.2...0..33
>1117_11_136_F3
T20...3..30.203.2...0.232..31.2.32.22.222.3...0..03

# Title: Corrida_16_01RMDSPFR004_1
>1117_10_107_F3
23 31 -1 -1 -1 29 -1 -1 20 32 -1 18 25 7 -1 6 -1 -1 -1 30 -1 20 13 7 -1 -1 21 30 -1 24 -1 22 -1 -1 22 14 -1 12 26 21 -1 5 -1 -1 -1 20 -1 -1 12 28
>1117_10_146_F3
20 33 -1 -1 -1 29 -1 -1 28 28 -1 7 16 5 -1 30 -1 -1 -1 14 -1 4 13 4 -1 -1 11 13 -1 5 -1 7 -1 -1 10 16 -1 4 12 15 -1 8 -1 -1 -1 16 -1 -1 10 4
>1117_10_1017_F3
33 33 -1 -1 -1 27 -1 -1 17 16 -1 28 24 11 -1 6 -1 -1 -1 29 -1 8 29 24 -1 -1 8 8 -1 20 -1 13 -1 -1 8 13 -1 28 10 24 -1 10 -1 -1 -1 4 -1 -1 7 6
>1117_11_136_F3
16 22 -1 -1 -1 33 -1 -1 30 27 -1 27 28 32 -1 29 -1 -1 -1 27 -1 18 9 6 -1 -1 23 16 -1 26 -1 5 7 -1 22 7 -1 18 14 8 -1 8 -1 -1 -1 11 -1 -1 4 24

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 18 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 47 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Extract subset of Fastq sequences based on a list of IDs

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News