SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
random subset paired-end fastq dnusol Bioinformatics 15 04-17-2016 02:36 AM
Extract snp ids seq_GA Bioinformatics 0 11-22-2011 05:09 PM
How to extract sequences between adaptors ? Giorgio C Bioinformatics 8 07-13-2011 12:10 AM
extract subset (mapped reads) from csfasta and .qual files KevinLam SOLiD 1 01-18-2010 12:38 AM
EXTRACT MAQ sequences? johnsequence Bioinformatics 4 01-12-2010 04:55 AM

Reply
 
Thread Tools
Old 10-12-2011, 07:16 PM   #1
pepperoni
Member
 
Location: US

Join Date: Oct 2011
Posts: 59
Default Extract subset of Fastq sequences based on a list of IDs

Does anyone has a script to extract a subset of fastq sequences based on a list of IDs?
similar to the "choose_FASTAs_from_list" script from scriptome but for fastq files

thank you very much
pepperoni is offline   Reply With Quote
Old 10-12-2011, 11:34 PM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Here's an SFF example copied from the Biopython Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.htm
Code:
from Bio import SeqIO
input_file = "big_file.sff"
id_file = "short_list.txt"
output_file = "short_list.sff"
wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)
records = (r for r in SeqIO.parse(input_file, "sff") if r.id in wanted)
count = SeqIO.write(records, output_file, "sff")
print "Saved %i records from %s to %s" % (count, input_file, output_file)
if count < len(wanted):
    print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)
Switch a few format names, and we get a FASTQ version:
Code:
from Bio import SeqIO
input_file = "big_file.fastq"
id_file = "short_list.txt"
output_file = "short_list.fastq"
wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)
records = (r for r in SeqIO.parse(input_file, "fastq") if r.id in wanted)
count = SeqIO.write(records, output_file, "fastq")
print "Saved %i records from %s to %s" % (count, input_file, output_file)
if count < len(wanted):
    print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)
The IDs are taken from a text file, one per line. If your file is tab separated, the first entry of each line is used.

Probably fast enough for a one off task, if this is something you'll be doing regularly there are likely quicker solutions.

P.S. I've got Python scripts to do this and related tasks within Galaxy, checkout http://toolshed.g2.bx.psu.edu/ and in particular http://toolshed.g2.bx.psu.edu/view/p...q_select_by_id

Last edited by maubp; 05-06-2013 at 02:18 AM. Reason: Adding Tool Shed URL
maubp is offline   Reply With Quote
Old 10-12-2011, 11:49 PM   #3
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

You can do this with Biopieces (www.biopieces.org) like this:

First you need a file with the FASTQ sequence names you are interested in - or IDs if you like - one per line. And then:

Code:
read_fastq -i in.fastq | grab -E ids.txt | write_fastq -xo out.fastq
Check out grab for details.


Cheers,


Martin
maasha is offline   Reply With Quote
Old 10-13-2011, 09:52 AM   #4
pepperoni
Member
 
Location: US

Join Date: Oct 2011
Posts: 59
Default

Thank you for your quick reply Maubp
I tried saving what you wrote into a file called ExtractFastq.py and ran it but I guess it was not a complete script since I got the following error. (sorry I am a begginer)

$ python ExtractFastq.py
Traceback (most recent call last):
File "ExtractFastq.py", line 2, in <module>
input_file = TrepoSinFilt.fastq
NameError: name 'TrepoSinFilt' is not defined

For the suggestion of Galaxy I am uploading the file, however it has 15Gb so I am not sure that with the speed of my internet I will be able to achieve it.
pepperoni is offline   Reply With Quote
Old 10-13-2011, 09:56 AM   #5
pepperoni
Member
 
Location: US

Join Date: Oct 2011
Posts: 59
Default

Quote:
Originally Posted by maasha View Post
You can do this with Biopieces (www.biopieces.org) like this:

First you need a file with the FASTQ sequence names you are interested in - or IDs if you like - one per line. And then:

Code:
read_fastq -i in.fastq | grab -E ids.txt | write_fastq -xo out.fastq
Check out grab for details.


Cheers,


Martin
Thank you for your quick reply Maasha, I am trying to install Biopieces, however I cannot run the test commands, it says
zsh: command not found: bp_test

I already checked that I have all the perl modules

any suggestions?
pepperoni is offline   Reply With Quote
Old 10-13-2011, 10:06 AM   #6
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

Follow the Installation instructions carefully. Also, it looks like you are running zsh. You have to run bash for Biopieces to work.
maasha is offline   Reply With Quote
Old 10-13-2011, 10:32 AM   #7
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by pepperoni View Post
Thank you for your quick reply Maubp
I tried saving what you wrote into a file called ExtractFastq.py and ran it but I guess it was not a complete script since I got the following error. (sorry I am a begginer)

$ python ExtractFastq.py
Traceback (most recent call last):
File "ExtractFastq.py", line 2, in <module>
input_file = TrepoSinFilt.fastq
NameError: name 'TrepoSinFilt' is not defined
Probably you just need to put quotes round the filename as in the example, that's how you define a string in Python. i.e.
Code:
...
input_file = "TrepoSinFilt.fastq"
...
You can also use single quotes rather than the double quote characters.

Quote:
Originally Posted by pepperoni View Post
For the suggestion of Galaxy I am uploading the file, however it has 15Gb so I am not sure that with the speed of my internet I will be able to achieve it.
Uploading something that large to the Penn State Galaxy might be hard, also they might have started imposing user disk quotas now...

[update] ... and the tool I'm talking about isn't on the Penn State Galaxy anyway, its an optional extra from the Tool Shed which can be added to a local Galaxy.

Last edited by maubp; 10-13-2011 at 10:56 AM.
maubp is offline   Reply With Quote
Old 10-13-2011, 10:55 AM   #8
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Actually, for 15Gb of data, you probably will care about the speed even for a one off - this should be about 5x faster as explained here (if you care about the Python side of things):
http://news.open-bio.org/news/2009/0...on-fast-fastq/

Something like this:
Code:
from Bio.SeqIO.QualityIO import FastqGeneralIterator

input_file = "big_file.fastq"
id_file = "short_list.txt"
output_file = "short_list.fastq"

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)

count = 0
handle = open(output_file, "w")
for title, seq, qual in FastqGeneralIterator(open(input_file)) :
    if title.split(None,1)[0] in wanted:
        handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
        count += 1
handle.close()

print "Saved %i records from %s to %s" % (count, input_file, output_file)
if count < len(wanted):
    print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)
A time comparison with BioPieces would be interesting if you can get both methods to work

Please include installation and trouble shooting time as well
maubp is offline   Reply With Quote
Old 10-13-2011, 11:05 AM   #9
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315
Default

cdbfasta -Q and cdbyank are the best solutions I have found. Fast!

But also, would not the GNU version of grep work since it allows you to return a number of lines after the line that matches with the -A parameter?

--
Phillip
pmiguel is offline   Reply With Quote
Old 10-13-2011, 11:15 AM   #10
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

How would you supply a potentially very long list of IDs to grep?
maubp is offline   Reply With Quote
Old 10-13-2011, 01:22 PM   #11
pepperoni
Member
 
Location: US

Join Date: Oct 2011
Posts: 59
Default

Quote:
Originally Posted by maubp View Post
Actually, for 15Gb of data, you probably will care about the speed even for a one off - this should be about 5x faster as explained here (if you care about the Python side of things):
http://news.open-bio.org/news/2009/0...on-fast-fastq/

Something like this:
Code:
from Bio.SeqIO.QualityIO import FastqGeneralIterator

input_file = "big_file.fastq"
id_file = "short_list.txt"
output_file = "short_list.fastq"

wanted = set(line.rstrip("\n").split(None,1)[0] for line in open(id_file))
print "Found %i unique identifiers in %s" % (len(wanted), id_file)

count = 0
handle = open(output_file, "w")
for title, seq, qual in FastqGeneralIterator(open(input_file)) :
    if title.split(None,1)[0] in wanted:
        handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
        count += 1
handle.close()

print "Saved %i records from %s to %s" % (count, input_file, output_file)
if count < len(wanted):
    print "Warning %i IDs not found in %s" % (len(wanted)-count, input_file)
A time comparison with BioPieces would be interesting if you can get both methods to work

Please include installation and trouble shooting time as well


Now it has worked but I get this error:
alejandra@alejandra-desktop[DocumentosKNoCupieron] python ExtractFastq2.py
Found 3125101 unique identifiers in alignedTrepofa.lst
Traceback (most recent call last):
File "ExtractFastq2.py", line 12, in <module>
for title, seq, qual in FastqGeneralIterator(open(input_file)) :
File "/usr/lib/pymodules/python2.6/Bio/SeqIO/QualityIO.py", line 920, in FastqGeneralIterator
% (title_line, seq_len, len(quality_string)))
ValueError: Lengths of sequence and quality values differs for Corrida1:1117_10_107/1 (49 and 73).

I guess that it is because of the conversion from csfasta & qual to fastq. I used the solid2fastq from maq.
do you think that it may work to change my file to fastq sanger in galaxy before running the script?
On the other hand I already installed galaxy but in which directory shall I clone your tool from the galaxy tool shed?

thanks
pepperoni is offline   Reply With Quote
Old 10-13-2011, 02:13 PM   #12
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Biopython expects FASTQ records where seq and qual are the same length. Galaxy accepts colour space FASTQ where the seq is one letter longer than the qual string (the nucleotide before the colour changes doesn't get a quality score for some reason). My Galaxy tool should cope with that situation by using the Galaxy library - but the examples above won't.

However, the error message says the sequence is length 49 and the quality is 73, which does not make sense at all. Can you show use some of the file - the bit including record Corrida1:1117_10_107/1 in particular?

My guess is something has gone wrong in the solid to fastq conversion.
maubp is offline   Reply With Quote
Old 10-13-2011, 04:50 PM   #13
pepperoni
Member
 
Location: US

Join Date: Oct 2011
Posts: 59
Default

Here is a bit of the file:

@Corrida1:1117_10_107/1
GNNNANNATNCGANGNNNTNTAANNAANTNGNNTCNGATNTNNNCNNAT
+
@-"-"-">-"-"5A-"3-"'-"-"-"?-"5.(-"-"6?-"9-"7-"-"7/-"-;6-"&-"-"-"5-"-"-=
@Corrida1:1117_10_146/1
ANNNGNNCANTATNGNNNGNCCANNCCNANCNNTGNATTNCNNNGNNTT
+
B-"-"-">-"-"==-"(1&-"?-"-"-"/-"%.%-"-",.-"&-"(-"-"+1-"%-0-")-"-"-"1-"-"+%
@Corrida1:1117_10_1017/1
GNNNCNNTANGCANTNNNGNACTNNACNGNANNGTNGTTNGNNNANNTT
+
B-"-"-"<-"-"21-"=9,-"'-"-"-">-")>9-"-"))-"5-".-"-").-"=+9-"+-"-"-"%-"-"('
@Corrida1:1117_11_136/1
ANNNTNNTANGATNGNNNANGTGNNTCNGNTGNGGNGGGNTNNNANNAT
+
7-"-"-"B-"-"?<-"<=A-">-"-"-"<-"3*'-"-"81-";-"&(-"7(-"3/)-")-"-"-",-"-"%9
pepperoni is offline   Reply With Quote
Old 10-14-2011, 02:49 AM   #14
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by pepperoni View Post
Here is a bit of the file:
Great, with the [ code ] tags (via the # button in the forum's advanced editor) we get this, without the face:
Code:
@Corrida1:1117_10_107/1
GNNNANNATNCGANGNNNTNTAANNAANTNGNNTCNGATNTNNNCNNAT
+
@-"-"-">-"-"5A-"3:(-"'-"-"-"?-"5.(-"-"6?-"9-"7-"-"7/-"-;6-"&-"-"-"5-"-"-=
@Corrida1:1117_10_146/1
ANNNGNNCANTATNGNNNGNCCANNCCNANCNNTGNATTNCNNNGNNTT
+
B-"-"-">-"-"==-"(1&-"?-"-"-"/-"%.%-"-",.-"&-"(-"-"+1-"%-0-")-"-"-"1-"-"+%
@Corrida1:1117_10_1017/1
GNNNCNNTANGCANTNNNGNACTNNACNGNANNGTNGTTNGNNNANNTT
+
B-"-"-"<-"-"21-"=9,-"'-"-"-">-")>9-"-"))-"5-".-"-").-"=+9-"+-"-"-"%-"-"('
@Corrida1:1117_11_136/1
ANNNTNNTANGATNGNNNANGTGNNTCNGNTGNGGNGGGNTNNNANNAT
+
7-"-"-"B-"-"?<-"<=A-">-"-"-"<-"3*'-"-"81-";-"&(-"7(-"3/)-")-"-"-",-"-"%9
Does that match what you see in the file, or has it got changed during posting to the forum?

It is very clear that the sequence and quality strings are not the same length. There is also a surprising number of minus signs and quotes in the quality strings. I thought at first there was an extra minus sign between every real quality character, but that doesn't quite explain it. Something is very wrong though.

Also your sequences look like nucleotides, could you confirm this is meant to be color-space data?

Could you show us the equivalent bits of the original color-space FASTA and QUAL files?

Last edited by maubp; 10-14-2011 at 02:50 AM. Reason: typo
maubp is offline   Reply With Quote
Old 10-14-2011, 04:09 AM   #15
pepperoni
Member
 
Location: US

Join Date: Oct 2011
Posts: 59
Default

yes the file is exactly like that (without the happy face)
These are the original files

# Title: Corrida_16_01RMDSPFR004_1
>1117_10_107_F3
T02...0..03.120.2...3.300..00.3.2..31.203.3...1..03
>1117_10_146_F3
T30...2..10.303.2...2.110..11.0.1..32.033.1...2..33
>1117_10_1017_F3
T32...1..30.210.3...2.013..01.2.0..23.233.2...0..33
>1117_11_136_F3
T20...3..30.203.2...0.232..31.2.32.22.222.3...0..03

# Title: Corrida_16_01RMDSPFR004_1
>1117_10_107_F3
23 31 -1 -1 -1 29 -1 -1 20 32 -1 18 25 7 -1 6 -1 -1 -1 30 -1 20 13 7 -1 -1 21 30 -1 24 -1 22 -1 -1 22 14 -1 12 26 21 -1 5 -1 -1 -1 20 -1 -1 12 28
>1117_10_146_F3
20 33 -1 -1 -1 29 -1 -1 28 28 -1 7 16 5 -1 30 -1 -1 -1 14 -1 4 13 4 -1 -1 11 13 -1 5 -1 7 -1 -1 10 16 -1 4 12 15 -1 8 -1 -1 -1 16 -1 -1 10 4
>1117_10_1017_F3
33 33 -1 -1 -1 27 -1 -1 17 16 -1 28 24 11 -1 6 -1 -1 -1 29 -1 8 29 24 -1 -1 8 8 -1 20 -1 13 -1 -1 8 13 -1 28 10 24 -1 10 -1 -1 -1 4 -1 -1 7 6
>1117_11_136_F3
16 22 -1 -1 -1 33 -1 -1 30 27 -1 27 28 32 -1 29 -1 -1 -1 27 -1 18 9 6 -1 -1 23 16 -1 26 -1 5 7 -1 22 7 -1 18 14 8 -1 8 -1 -1 -1 11 -1 -1 4 24
pepperoni is offline   Reply With Quote
Old 10-14-2011, 04:24 AM   #16
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315
Default

BTW, SOLiD does no read filtering, so all the worst reads taken from beads sitting at the very edge of the flowcell cause the reads at the beginning and end of the files to be very low quality. You might want to try a test on reads pulled from the middle of the file. If those are okay, just filter you input data by throwing out reads that have missing data (".") bases. Not really worth the effort to get a conversion program to stop choking on garbage reads.

--
Phillip
pmiguel is offline   Reply With Quote
Old 10-14-2011, 05:15 AM   #17
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Your color space FASTA file:
Code:
# Title: Corrida_16_01RMDSPFR004_1
>1117_10_107_F3
T02...0..03.120.2...3.300..00.3.2..31.203.3...1..03
>1117_10_146_F3
T30...2..10.303.2...2.110..11.0.1..32.033.1...2..33
>1117_10_1017_F3
T32...1..30.210.3...2.013..01.2.0..23.233.2...0..33
>1117_11_136_F3
T20...3..30.203.2...0.232..31.2.32.22.222.3...0..03
These are leading base and fifty colour scores, total length 51.

Your color space QUAL file:
Code:
# Title: Corrida_16_01RMDSPFR004_1
>1117_10_107_F3
23 31 -1 -1 -1 29 -1 -1 20 32 -1 18 25 7 -1 6 -1 -1 -1 30 -1 20 13 7 -1 -1 21 30 -1 24 -1 22 -1 -1 22 14 -1 12 26 21 -1 5 -1 -1 -1 20 -1 -1 12 28 
>1117_10_146_F3
20 33 -1 -1 -1 29 -1 -1 28 28 -1 7 16 5 -1 30 -1 -1 -1 14 -1 4 13 4 -1 -1 11 13 -1 5 -1 7 -1 -1 10 16 -1 4 12 15 -1 8 -1 -1 -1 16 -1 -1 10 4 
>1117_10_1017_F3
33 33 -1 -1 -1 27 -1 -1 17 16 -1 28 24 11 -1 6 -1 -1 -1 29 -1 8 29 24 -1 -1 8 8 -1 20 -1 13 -1 -1 8 13 -1 28 10 24 -1 10 -1 -1 -1 4 -1 -1 7 6 
>1117_11_136_F3
16 22 -1 -1 -1 33 -1 -1 30 27 -1 27 28 32 -1 29 -1 -1 -1 27 -1 18 9 6 -1 -1 23 16 -1 26 -1 5 7 -1 22 7 -1 18 14 8 -1 8 -1 -1 -1 11 -1 -1 4 24
These have 50 quality scores, as expected. I'm not sure why there are some -1 scores, PHRED only goes down to zero, but I would expect your FASTQ to look like this (treating those as PHRED 0 which becomes ! in FASTQ):
Code:
@1117_10_107_F3
T02...0..03.120.2...3.300..00.3.2..31.203.3...1..03
+
8@!!!>!!5A!3:(!'!!!?!5.(!!6?!9!7!!7/!-;6!&!!!5!!-=
@1117_10_146_F3
T30...2..10.303.2...2.110..11.0.1..32.033.1...2..33
+
5B!!!>!!==!(1&!?!!!/!%.%!!,.!&!(!!+1!%-0!)!!!1!!+%
@1117_10_1017_F3
T32...1..30.210.3...2.013..01.2.0..23.233.2...0..33
+
BB!!!<!!21!=9,!'!!!>!)>9!!))!5!.!!).!=+9!+!!!%!!('
@1117_11_136_F3
T20...3..30.203.2...0.232..31.2.32.22.222.3...0..03
+
17!!!B!!?<!<=A!>!!!<!3*'!!81!;!&(!7(!3/)!)!!!,!!%9
maubp is offline   Reply With Quote
Old 10-14-2011, 05:17 AM   #18
pepperoni
Member
 
Location: US

Join Date: Oct 2011
Posts: 59
Default

Ok thanks pmiguel, I'll try that
pepperoni is offline   Reply With Quote
Old 10-14-2011, 05:21 AM   #19
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by pmiguel View Post
BTW, SOLiD does no read filtering, so all the worst reads taken from beads sitting at the very edge of the flowcell cause the reads at the beginning and end of the files to be very low quality. You might want to try a test on reads pulled from the middle of the file. If those are okay, just filter you input data by throwing out reads that have missing data (".") bases. Not really worth the effort to get a conversion program to stop choking on garbage reads.
If that is the problem, it does seem worth reporting it and getting it fixed to stop someone else wasting their time with this kind of issue.

My guess is solid2fastq from maq doesn't like these -1 quality scores.
maubp is offline   Reply With Quote
Old 10-14-2011, 07:33 AM   #20
pepperoni
Member
 
Location: US

Join Date: Oct 2011
Posts: 59
Default

Hello again, I have already removed the sequences with dots in the .csfasta file and created a file with the list of IDs.
>1117_22_215_F3
T32332201112312003133333333333333333033333333333103
>1117_22_218_F3
T13321013031133113333112332130011113223331203321333
>1117_22_388_F3
T32022222220031010131122221332210302310301030210322

Now I need to choose the corresponding lines in the .qual file.

I tried to convert the .qual file into .tab first but it removed the spaces:

original .qual
>1117_10_107_F3
23 31 -1 -1 -1 29 -1 -1 20 32 -1 18 25 7 -1 6 -1 -1 -1 30 -1 20 13 7 -1 -1 21 30 -1 24 -1 22 -1 -1 22 14 -1 12 26 21 -1 5 -1 -1 -1 20 -1 -1 12 28
>1117_10_146_F3
20 33 -1 -1 -1 29 -1 -1 28 28 -1 7 16 5 -1 30 -1 -1 -1 14 -1 4 13 4 -1 -1 11 13 -1 5 -1 7 -1 -1 10 16 -1 4 12 15 -1 8 -1 -1 -1 16 -1 -1 10 4

.tab

1117_10_107_F3 2331-1-1-129-1-12032-118257-16-1-1-130-120137-1-12130-124-122-1-12214-1122621-15-1-1-120-1-11228
1117_10_146_F3 2033-1-1-129-1-12828-17165-130-1-1-114-14134-1-11113-15-17-1-11016-141215-18-1-1-116-1-1104

Does any one know how can I choose the corresponding .qual data?
thanks

Alejandra
pepperoni is offline   Reply With Quote
Reply

Tags
choose, extract, fastq, list, script

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:14 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO