SEQanswers

Old 05-24-2013, 03:41 AM   #1
dapizarro
Junior Member
 
Location: Madrid

Join Date: May 2013
Posts: 1
Default Extract a subset of FASTQ reads based on sequence length?

Hi,

Does anyone have a Perl script to extract a subset of FASTQ sequences based on sequence length?

Thanks very much!
Old 05-24-2013, 11:35 AM   #2
muthu545
Member
 
Location: san antonio

Join Date: Jul 2011
Posts: 32
Default

Hi,

I've written a Python script that should do this job for you.
Unzip the attached .gz files:
Input.fastq.gz
Filter_fastq_by_Sequence_length.py.gz

The Input.fastq file contains 50 reads of varying length: 22 bp, 33 bp, 36 bp, 41 bp... It is just an example file.

Execute the following code in command line:
for help
python Filter_fastq_by_Sequence_length.py -h

Code:
python Filter_fastq_by_Sequence_length.py -i Input.fastq -l 22 -o Output.fastq

Once the code has run successfully, the Output.fastq file it creates will contain 2 reads of 22 bp each.

Try running it with lengths 33, 36, 41 and 0 to understand how the program works.

Then you can try the script on your own input file and change the length.
It should hopefully work.
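
For anyone who cannot download the attachment, here is a minimal sketch of length-based filtering in Python. It is not the attached Filter_fastq_by_Sequence_length.py (its source is not shown in this thread). The names `read_fastq` and `filter_by_length` are made up for illustration, and the sketch assumes a plain four-lines-per-record FASTQ file and an exact-length match, as the 22 bp example above suggests.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, separator, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record or not record[0]:
            return  # end of file (or a trailing blank line)
        if len(record) < 4:
            raise ValueError("truncated FASTQ record: " + record[0])
        yield tuple(record)


def filter_by_length(in_handle, out_handle, length):
    """Write only the reads whose sequence is exactly `length` bases; return how many were kept."""
    kept = 0
    for header, seq, sep, qual in read_fastq(in_handle):
        if len(seq) == length:
            out_handle.write(f"{header}\n{seq}\n{sep}\n{qual}\n")
            kept += 1
    return kept


# Tiny in-memory demo; with real files, pass open("Input.fastq") and open("Output.fastq", "w").
demo_in = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
demo_out = io.StringIO()
print(filter_by_length(demo_in, demo_out, 4))  # 1
```

Changing `len(seq) == length` to `len(seq) >= length` would turn this into a minimum-length filter instead.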

Let me know how it goes and in case you need any help.
--
Thanks
Attached Files
File Type: gz Input.fastq.gz (1.7 KB, 78 views)
File Type: gz Filter_fastq_by_Sequence_length.py.gz (671 Bytes, 458 views)
Old 12-13-2013, 11:49 AM   #3
wingtec
Member
 
Location: Charlottesville, VA

Join Date: Apr 2010
Posts: 34
Default subset fastq according to sequence lengths

Hi,

Thanks for your Python script.

However, when I tried to run it on my Mac (OS X), I got the following error message:

d-128-54-196:PythonApps yb8d$ python Filter_fastq_by_Sequence_length.py -i Input.fastq -l 22 Output.fastq
Using Following inputs
Input file is Input.fastq
Seq_length is 22
Output file is
Filtering in Progress......
Traceback (most recent call last):
  File "Filter_fastq_by_Sequence_length.py", line 58, in <module>
    filter_by_len(param[0],param[1],param[2])
  File "Filter_fastq_by_Sequence_length.py", line 6, in filter_by_len
    f=open(ofile,'w')
IOError: [Errno 2] No such file or directory: ''

Can you shed some light on what caused this error?

Best

Wing
Old 12-13-2013, 12:01 PM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

You need an "-o" in front of "Output.fastq":

Code:
python Filter_fastq_by_Sequence_length.py -i Input.fastq -l 22 -o Output.fastq
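
To illustrate why the missing "-o" leads to an empty output name, here is a small sketch. It assumes the script parses its options with Python's standard `getopt` module, which is a guess: the script's source is not shown in this thread, and `parse_args` is a hypothetical helper. With `getopt`, an argument that has no switch in front of it is returned as a leftover instead of being assigned to any option.

```python
import getopt


def parse_args(argv):
    """Parse -i/-l/-o the way a simple getopt-based script might (hypothetical)."""
    ifile, length, ofile = "", None, ""
    opts, leftover = getopt.getopt(argv, "hi:l:o:")
    for flag, value in opts:
        if flag == "-i":
            ifile = value
        elif flag == "-l":
            length = int(value)
        elif flag == "-o":
            ofile = value
    return ifile, length, ofile, leftover


# Without -o, "Output.fastq" is a leftover and the output name stays empty.
print(parse_args(["-i", "Input.fastq", "-l", "22", "Output.fastq"]))
# ('Input.fastq', 22, '', ['Output.fastq'])

# With -o, the output name is picked up as intended.
print(parse_args(["-i", "Input.fastq", "-l", "22", "-o", "Output.fastq"]))
# ('Input.fastq', 22, 'Output.fastq', [])
```

Opening an empty file name for writing then fails with "[Errno 2] No such file or directory: ''", which matches the traceback above.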
Old 12-14-2013, 06:12 AM   #5
muthu545
Member
 
Location: san antonio

Join Date: Jul 2011
Posts: 32
Default

Hi Wing,

Devon's solution to the problem is right. Thanks.
The script errored out because, without "-o", it could not
recognize the output file name.

Quote:
Originally Posted by dpryan View Post
You need an "-o" in front of "Output.fastq":

Code:
python Filter_fastq_by_Sequence_length.py -i Input.fastq -l 22 -o Output.fastq
Thanks
--
Muthu
Old 12-15-2013, 06:32 AM   #6
wingtec
Member
 
Location: Charlottesville, VA

Join Date: Apr 2010
Posts: 34
Default

Hi Muthu et al.,

Thanks very much for the quick reply and for picking up my stupid omission of a main switch. After the fix, I am happy to report that everything works beautifully.

Wing
Old 01-14-2014, 01:12 AM   #7
chayan
Member
 
Location: USA

Join Date: Nov 2012
Posts: 51
Default

Hi everyone,

I have two FASTQ files of raw reads from an Ion PGM. Is it possible to get a count of how many Q20 reads they contain, and to extract those reads in FASTQ format? Also, can I extract the 100-base reads using the script above?

Thanks in advance for any help.

Regards

Chayan
Old 08-25-2014, 03:00 PM   #8
muthu545
Member
 
Location: san antonio

Join Date: Jul 2011
Posts: 32
Default

Chayan,
The script only lets you extract FASTQ sequences by length, not by quality.
Hopefully you have figured that out by now. Sorry for the late reply.

Thanks
--
Muthu
Old 08-25-2014, 03:57 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

BBTools has a script called reformat.sh that can extract reads with an average quality of at least X (maq=X) or a read length of at least Y (minlength=Y). It can also write a histogram of the average read qualities (aqhist=), using linear and logarithmic averages. It requires Java.

reformat.sh in=reads.fq out=filtered.fq maq=20 minlength=100 aqhist=hist.txt
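
As a rough illustration of what "average quality" can mean, here is a sketch in Python. It is not how reformat.sh is implemented. It assumes Phred+33 encoded qualities, all function names are made up for this example, and the choice of the probability-based average in `keep_read` is just one option. It contrasts a linear average of the Phred scores with an average taken over the error probabilities, which can differ a lot when a read has a few very bad bases.

```python
import math


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def linear_mean(scores):
    """Arithmetic mean of the Phred scores themselves."""
    return sum(scores) / len(scores)


def probability_mean(scores):
    """Average the per-base error probabilities, then convert back to Phred."""
    mean_error = sum(10 ** (-q / 10) for q in scores) / len(scores)
    return -10 * math.log10(mean_error)


def keep_read(seq, qual, min_avg_quality=20, min_length=100):
    """True if the read is long enough and its probability-based mean quality is high enough."""
    if len(seq) < min_length or not qual:
        return False
    return probability_mean(phred_scores(qual)) >= min_avg_quality


scores = phred_scores("III#")  # three Q40 bases and one Q2 base
print(linear_mean(scores))                  # 30.5
print(round(probability_mean(scores), 1))   # 8.0
```

Counting the reads for which `keep_read(...)` returns True, with `min_length=0`, would give one possible definition of the number of "Q20 reads" in a file.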
Old 08-26-2014, 01:12 AM   #10
chayan
Member
 
Location: USA

Join Date: Nov 2012
Posts: 51
Default

OK, thanks to both of you. Additionally, is there a tool or utility that allows k-mer-based read extraction?
Old 08-26-2014, 08:43 AM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Depends on exactly what you have in mind, but I wrote a tool (BBDuk) that will filter reads based on the presence of specific kmers. For example:

bbduk.sh -Xmx1g in=reads.fq out=unmatched.fq outm=matched.fq ref=kmers.fa k=31

That will split the file reads.fq into two output files, one containing reads with kmers matching the reference, and one with the rest of the reads, using a kmer length of 31.
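
The general idea of k-mer based splitting can be sketched in a few lines of Python. This is only an illustration of the concept, not BBDuk's implementation, and it will be far slower and more memory-hungry on real data. The function names are invented for the example, and indexing the reverse complement of the reference is an assumption of this sketch.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A/C/G/T only; other letters are left as-is)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def kmers(seq, k):
    """Set of all k-mers in a sequence; empty if the sequence is shorter than k."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_reference_kmers(reference_seqs, k):
    """Index every k-mer of the reference sequences, on both strands."""
    ref = set()
    for s in reference_seqs:
        ref |= kmers(s, k)
        ref |= kmers(reverse_complement(s), k)
    return ref


def split_reads(reads, ref_kmers, k):
    """Split (name, sequence) pairs into reads sharing at least one k-mer with the reference and the rest."""
    matched, unmatched = [], []
    for name, seq in reads:
        if kmers(seq, k) & ref_kmers:
            matched.append((name, seq))
        else:
            unmatched.append((name, seq))
    return matched, unmatched


ref = build_reference_kmers(["GATTACAGATTACA"], k=5)
reads = [("fwd", "TTGATTACATT"), ("rev", "AATGTAATCAA"), ("other", "CCCCCCCCCC")]
matched, unmatched = split_reads(reads, ref, k=5)
print([name for name, _ in matched])    # ['fwd', 'rev']
print([name for name, _ in unmatched])  # ['other']
```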
Old 08-26-2014, 12:42 PM   #12
chayan
Member
 
Location: USA

Join Date: Nov 2012
Posts: 51
Default

OK, I understand, but I want a different utility. I have metagenomic read files. Reads coming from a particular organism are likely to have similar k-mer frequencies (tetramers, say), and I want to extract read subsets on that basis and then perform the assembly. Unfortunately, I can't use any direct reference here, as I am looking for novel lineages. Is that clearer now?
Old 08-26-2014, 01:01 PM   #13
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Ahh, you want a binning tool. If you make a reference containing organisms that are somewhat closely related - say, at least 70% identity - you can use BBSplit. If not, well... there are various binning tools that use kmer frequency, or coverage, or both. But they don't tend to work well on short reads. I don't know of a single tool that will do a good job of solving this problem; I think it's generally addressed through a complicated pipeline involving a lot of labor.
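
For reference, the tetramer frequency profile that such binning tools start from can be computed as in the sketch below. This is only the feature-extraction step under simple assumptions (A/C/G/T alphabet, windows containing other letters such as N are ignored), not a binning tool, and `kmer_profile` is a name made up for this example.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=4):
    """Relative frequency of every possible k-mer (4**k values, in ACGT lexicographic order)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in all_kmers)  # windows with N etc. are not counted
    if total == 0:
        return [0.0] * len(all_kmers)
    return [counts[km] / total for km in all_kmers]


profile = kmer_profile("ACGTACGT")
print(len(profile))             # 256
print(round(max(profile), 2))   # 0.4
```

A 100-base read contains only 97 tetramers spread over 256 possible bins, so its profile is very sparse. That is one plausible reason why frequency-based binning tends to be applied to longer sequences such as contigs.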

Tags
iontorrent, length, perl, script, sequence

All times are GMT -8.