SEQanswers

09-10-2015, 07:33 AM   #1
loba17
Member
 
Location: Switzerland

Join Date: Sep 2011
Posts: 19
Illumina Fastq Header Search

Dear All,

I would like to retrieve sequences (fastq format) from an Illumina fastq data file using the first part of the sequence header.

Example of an Illumina fastq header:
@X01032:109:000000000-AGKF7:1:1101:11950:1779 1:N:0:1

My query:
@X01032:109:000000000-AGKF7:1:1101:11950:1779

I tried usearch (fastx_getseqs), seqtk, and seqret, but none of them worked because of the special characters (e.g. ":", "-") in the header. A simple grep like

Code:
grep -A 3 "@X01032:109:000000000-AGKF7:1:1101:11950:1779" in.fastq
would work, but it would take a long time to finish. I could reformat the headers, but I would prefer not to if possible.
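
When only a single record is needed, one variation on the grep above is to compare the whole read name and stop at the first hit instead of scanning the rest of the file. The following is a minimal sketch, assuming an unwrapped FASTQ file with exactly four lines per record; the helper name fetch_read and the demo file are made up for illustration.

```shell
# fetch_read QUERY FASTQ  (hypothetical helper, not part of any package)
# Print the first record whose read name equals QUERY, then stop reading.
# The read name is the header text before the first whitespace, "@" included.
fetch_read() {
    awk -v q="$1" '
        NR % 4 == 1 && $1 == q { n = 4 }       # header line of the wanted record
        n > 0 { print; if (--n == 0) exit }    # emit its four lines, then quit
    ' "$2"
}

# Demo on a made-up two-record file.
demo_dir=$(mktemp -d)
printf '%s\n' \
    '@X01032:109:000000000-AGKF7:1:1101:11950:1779 1:N:0:1' 'ACGT' '+' 'IIII' \
    '@X01032:109:000000000-AGKF7:1:1101:11950:17790 1:N:0:1' 'TTGA' '+' 'HHHH' \
    > "$demo_dir/demo.fastq"

fetch_read '@X01032:109:000000000-AGKF7:1:1101:11950:1779' "$demo_dir/demo.fastq"
```

Because only every fourth line is compared, a quality string that happens to start with "@" is never mistaken for a header. The exact comparison also keeps a query ending in ":1779" from pulling in a read ending in ":17790", which a plain substring grep would do.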

Is there a tool out there that would work with Illumina fastq files?

Thanks for the help!

09-10-2015, 09:27 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

You can do that with "filterbyname.sh" in the BBMap package.

filterbyname.sh in=reads.fq out=filtered.fq include=t names=names.txt

...where names.txt has one name per line. Alternatively, you can pass the name directly with "names=X01032:109:000000000-AGKF7:1:1101:11950:1779". The program will still include reads whose headers have additional, non-matching text after the first whitespace. Do not include the leading "@" in the query, as it is not part of the name. If you do include the leading "@" for whatever reason, add the flag "truncateheadersymbol".
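
The matching rule described here (compare the name without the leading "@" and ignore anything after the first whitespace) can be illustrated with a short awk sketch. This is not how filterbyname.sh is implemented; it is only a stand-in for small files, assuming unwrapped four-line FASTQ records. The helper name filter_by_name and the demo files are made up.

```shell
# filter_by_name NAMES FASTQ  (hypothetical helper, not BBMap code)
# Print the records whose name appears in NAMES (one name per line, no "@").
filter_by_name() {
    awk '
        FILENAME == ARGV[1] { if ($1 != "") want[$1] = 1; next }  # load the names
        FNR % 4 == 1 { keep = (substr($1, 2) in want) }   # header: drop "@", look up
        keep                                              # print lines of kept records
    ' "$1" "$2"
}

# Demo on made-up data: one wanted name, two reads.
demo_dir=$(mktemp -d)
printf '%s\n' 'X01032:109:000000000-AGKF7:1:1101:11950:1779' > "$demo_dir/names.txt"
printf '%s\n' \
    '@X01032:109:000000000-AGKF7:1:1101:11950:1779 1:N:0:1' 'ACGT' '+' 'IIII' \
    '@X01032:109:000000000-AGKF7:1:1101:11950:1780 1:N:0:1' 'TTGA' '+' 'HHHH' \
    > "$demo_dir/reads.fq"

filter_by_name "$demo_dir/names.txt" "$demo_dir/reads.fq"
```

The names are held in an in-memory lookup table, so each header costs one lookup regardless of how long the list is.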

09-11-2015, 12:48 AM   #3
loba17
Member
 
Location: Switzerland

Join Date: Sep 2011
Posts: 19
Works - problem solved!

Dear Brian,

thanks for your suggestion!

I downloaded BBMap and tried filterbyname.sh:

Code:
filterbyname.sh in=in.fq out=out.fq names=select.list include=t truncateheadersymbol

Input is being processed as unpaired
Time:               53.202 seconds.
Reads Processed:    5747570 	108.03k reads/sec
Bases Processed:    2296943848 	43.17m bases/sec
Reads Out:          65246
Bases Out:          25944173
Number of reads for in.fq: 5,747,570
Number of headers selected: 66,182
Number of reads for out.fq: 65,246

Works great and I really like the output summary!

Question 1: Is there a setting to get a list of the records that did not match?

Question 2: BBMap seems to be a nice and very useful collection of tools (thanks a lot!), but is there an overview or summary that briefly describes the tools?

Thanks for the help!

09-11-2015, 08:17 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,795

Quote:
Originally Posted by loba17 View Post
Question 2: bbmap seems to be a nice and very useful collection of tools - thanks a lot! - but is there an overview or a summary that would describe the tools briefly.

Thanks for the help !

See this thread for a recap of many things BBMap can do: http://seqanswers.com/forums/showthread.php?t=58221

I would suggest trying outu=filename with your command to see if that captures reads that did not match.

09-11-2015, 09:27 AM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by GenoMax View Post
I would suggest trying outu=filename with your command to see if that captures reads that did not match.

You know, to be consistent, I should really add that (I'll make a note to do so)! Unfortunately, filterbyname does not currently support outu. Instead, you need to run it twice: once with "include=t" to capture the matching reads, and once with "include=f" to capture the non-matching reads.
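
For small files, the same split can be sketched in a single pass with awk, which also makes it easy to list the names that never matched any read (in the run earlier in this thread, 66,182 names were supplied but 65,246 reads came out, a gap of 936). This is a stand-in rather than BBMap code, and it assumes unwrapped four-line FASTQ records; the helper name split_by_name and all file names are made up.

```shell
# split_by_name NAMES FASTQ MATCHED UNMATCHED  (hypothetical helper)
# Write listed reads to MATCHED and all other reads to UNMATCHED, then print
# to stdout every name in NAMES that was not found in FASTQ (in no set order).
split_by_name() {
    : > "$3"; : > "$4"                        # make sure both outputs exist
    awk -v hit="$3" -v miss="$4" '
        FILENAME == ARGV[1] { if ($1 != "") want[$1] = 0; next }  # 0 = not seen
        FNR % 4 == 1 {                        # header line of a record
            name = substr($1, 2)              # drop the leading "@"
            keep = (name in want)
            if (keep) want[name] = 1          # mark the name as seen
        }
        { print > (keep ? hit : miss) }       # route all four lines of the record
        END { for (n in want) if (want[n] == 0) print n }
    ' "$1" "$2"
}

# Demo on made-up data: two reads, one listed name present and one absent.
demo_dir=$(mktemp -d)
printf '%s\n' 'read1' 'read9' > "$demo_dir/names.txt"
printf '%s\n' '@read1 1:N:0:1' 'ACGT' '+' 'IIII' \
              '@read2 1:N:0:1' 'TTGA' '+' 'HHHH' > "$demo_dir/reads.fq"

split_by_name "$demo_dir/names.txt" "$demo_dir/reads.fq" \
    "$demo_dir/matched.fq" "$demo_dir/unmatched.fq"
```

A name can be reported as unseen either because no read carries it or because the list spells it differently from the header, so the printed list is a starting point for checking rather than a diagnosis.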

09-14-2015, 04:55 AM   #6
loba17
Member
 
Location: Switzerland

Join Date: Sep 2011
Posts: 19
Thanks

Dear Brian, thanks for the clarification and the help.

09-21-2015, 03:38 AM   #7
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541

My Python script with a Galaxy interface:
https://github.com/peterjc/pico_gala...q_filter_by_id
http://toolshed.g2.bx.psu.edu/view/p...q_filter_by_id

All times are GMT -8.