SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Converting FASTA/qual file pair from 454 to FASTQ oiiio Bioinformatics 9 01-01-2016 04:55 PM
Consensus FASTA from BAM files mixter Bioinformatics 14 05-19-2014 10:07 AM
building a consensus FASTA from BAM(s) adaptivegenome Bioinformatics 3 06-18-2013 08:10 PM
Any scripts converting fastq 2 scarf mingkunli Bioinformatics 1 06-09-2011 06:08 AM
Converting Solexa new format to FASTQ asafle Bioinformatics 3 08-01-2009 11:07 AM

Reply
 
Thread Tools
Old 12-17-2009, 11:06 AM   #1
zlu
Member
 
Location: UK

Join Date: Nov 2008
Posts: 32
Default converting consensus fastq to fasta

I was trying to convert fastq consensus sequneces (generated by the pileup2fq utility of samtools) to a multi-fasta file with the MAQ's fq_all2std.pl fq2fa script but it didn't work. The result is a mix of 60bp sequence and quality lines with the > headers.

Is there quick way to extract only the sequences from the fastq file?

Thank you.
zlu is offline   Reply With Quote
Old 12-18-2009, 03:45 AM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by zlu View Post
I was trying to convert fastq consensus sequneces (generated by the pileup2fq utility of samtools) to a multi-fasta file with the MAQ's fq_all2std.pl fq2fa script but it didn't work. The result is a mix of 60bp sequence and quality lines with the > headers.

Is there quick way to extract only the sequences from the fastq file?

Thank you.
I personally use Biopython, e.g.
http://www.biopython.org/wiki/Reading_from_unix_pipes
http://news.open-bio.org/news/2009/0...vert-function/

Code:
from Bio import SeqIO
count = SeqIO.convert("example.fastq", "fastq", "example.fasta", "fasta")
print "Converted %i reads" % count
You might also consider BioPerl or EMBOSS seqret, etc.

However, the problem could be a broken FASTQ file. Could you show use the start of the file (say two reads)?
maubp is offline   Reply With Quote
Old 12-18-2009, 04:05 AM   #3
zlu
Member
 
Location: UK

Join Date: Nov 2008
Posts: 32
Default

Perhaps I didn't explain it clearly. I wasn't trying to convert the Illumina short reads but mapped consensus sequence of a few chromosomes. Each is a few Mbp long, e.g:

@chr1
nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
GGTCCGAAGCATGAGTGTTTACAATGTTTGAAtaCCTTATACAGTTCTTATACATACTTT
ATAAATTATTTCCCaagctgttttgatacactcactaacagaTATTCTATAGAAGGAAAA
GTTATCCACTTATGCACATTTATAGTTTTCAGAATTGTGGATAATTAGAAATTACACACA
AAGTTATACTATTTTTAGCAACATATTCACAGGTATTTGACATATAGAGAACTGAAAAAG
TATAATTGTGTGGATAAGTCGTCCAACTCATGATTTTATAAGGATTTATTTATTGATATT
TACATAAAAATACTGTGCATAACTAATAAGCAGGATAAAGTTATCCACCGATTGTTATTA
+
!!!!!!!????????BBBEEEHKKKKKKKKKKKKNNNNNNQQNNNNNNQWWWW7WZWWWZ
ZZZ]]]`````````>]]]]]]]ZTQQQTQNNKKKKKKHKKHHHEEEEEEEHHHHHHHHH
HKKHHHHHHEEEEEBBBBBBBBBBBB?=B@BBBBBB??BBBBEEEEEEEEEEHKNNNNNQ
QQTTTTWWWWWWWTTWZWWW]`>```c``]`ccc``cfcfliiloouSxxuuuuTollLo
olliifilloolfif````]]WWWWWTTTTTTTTQQQQQQQNKKHHHHEEEEHEEEHHEE
EEEEEEEHHHHHHHHHEEEEEEHHHHHEEHKHHHHHHHHHHHEEHHHHHHHHKNNNNKNN
NQQ@NKNNNQQQTWWWZZZBZZZZ]<`]ZZZZZZZZLZZ``]]]ZZZWTTTQQQNNNNNK

And using the method described above, this is what I got:

>chr1nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
>`]]``fGff`````]iifc`````cciiLoxuuuxZ{{xxruuxx{{{~~rrrrrrrrr
Ooiiff`ZZZWTTTTTTQQNTQTWWWWWW]]]`]]cfifffciiiiiorrxx{{~x~{{x
>QNNQQQTTQQQQNNNNQZ``cfffolllloollruuuurrroruxxxxx{~~xuroorr
rrorruxxxroooux{{{xxuuuuuruurrrrollccfcc]ZZB]?]]W7QQTTWWWWWW
>QTT7TQQTTTWZZZZ]]]]]]ZZZZ]]]]]]`````]]]]]`]`c`]```Z]``fflll
louxxxxuS{{{{{~~~~~~{{~{{{{xuuroiilroiiollorlloorZ{~~{xxxuuu
>KKKKKKKHHEB????BBBEEKHHHKKKKNNNNQQQQTZZZZZZ]]````````]]WWW<
WWWW7TTTTTTTQKKKKKKHHEEEE<BBEHHH=HEEEEEHHHHEEEEHHHHHHKKKKKKK
>QQQBQQQNNNNKKKKKKQQQQQQTTTTWWZWTTQQQQQQQQTTWZZZZZ]ZZTWWTTTQ
QQTQQNNNTWWWZZ]]]ZZWWWWWWTTTTQQQEQQQQNNNNNNHHHHEEBEEEEEBBBBB
>QQQATTTTTTTWWTTTQQQQQQQNNKKKQQQTTTT7TTQQ5QNNNQKNNQQQQQQQQQQ
QQN4HKKHHHHHHHHHHHHHKHHEEBBBBBBBBBBBBBBB7??????????????!!!!!
>LAcffffooor{{Z~~~~~~~~~~|~~~q~~~~~v~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~c~~xuuric```]]ZZWTNKHEEB????!!!!!!!!!!!!!!!!!!
>QQQ5>QQ@QNNNNNNQQQQNNNNNNKHEBBBB???BBBBBBBBBEEEEEEBBBBBBBBB
BBBBBBEHKNNKKKKKKNNNKKKKKKKKKKKW]]]cccfffc`]Z`>cfiOollllllLl
>NQQTTTWQQQQQQTTNNNQQQT7GTTAWWWTTTZZW7TQTTQQQQQQTQQTTTQTTTTT
WWZWZ]]]]WZ]``D]]]D```D]]]Z]]]ZZWWWT7@QNKKKKKHEBBBBHKHHHNQTT
zlu is offline   Reply With Quote
Old 12-18-2009, 04:35 AM   #4
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Are the sequences and quality lines in the original FASTQ line wrapped? This is valid but not recommended as some tools (perhaps including MAQ) don't understand it.

The tools I suggested can all cope with line wrapped FASTQ. See also: http://dx.doi.org/10.1093/nar/gkp1137

(Or it may be the forum making it look line wrapped and adding spaces when it isn't really - trying using the [ code ] text [ /code ] tags to stop this)

Last edited by maubp; 12-18-2009 at 04:37 AM.
maubp is offline   Reply With Quote
Old 12-18-2009, 06:56 AM   #5
zlu
Member
 
Location: UK

Join Date: Nov 2008
Posts: 32
Default

yes, the lines are indedd wrapped. Thanks for the reminder.

However, trying seqret of EMBOSS 6.1, I got the following error:

[SBSUser@pipeline Assembly2]$ seqret -sformat fastq-sanger
Reads and writes (returns) sequences
Input (gapped) sequence(s): CNS.fq
Error: Unable to read sequence CNS.fq'
zlu is offline   Reply With Quote
Old 12-18-2009, 07:11 AM   #6
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by zlu View Post
yes, the lines are indedd wrapped. Thanks for the reminder.
That probably does explain MAQ's failure.
Quote:
Originally Posted by zlu View Post
However, trying seqret of EMBOSS 6.1,
Is that plain un-patched EMBOSS 6.1.0? They are currently on patch 3, and this did include some FASTQ fixes. See:
ftp://emboss.open-bio.org/pub/EMBOSS/fixes/README.fixes

Quote:
Originally Posted by zlu View Post
I got the following error:

[SBSUser@pipeline Assembly2]$ seqret -sformat fastq-sanger
Reads and writes (returns) sequences
Input (gapped) sequence(s): CNS.fq
Error: Unable to read sequence CNS.fq'
That is odd. Note you can also try giving the filename as part of the command line, e.g.

$ seqret -sformat fastq-sanger -sequence CNS.fq -osformat fasta -outseq CNS.fasta

How big is the file? If you zipped it up and emailed it to me I could try it here for you if you liked.
maubp is offline   Reply With Quote
Old 12-18-2009, 07:39 AM   #7
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Most of fastq parsers do not work with multi-line fastq files. You'd better write your own script. After all, processing multi-line fastq is not that hard.
lh3 is offline   Reply With Quote
Old 12-18-2009, 08:08 AM   #8
zlu
Member
 
Location: UK

Join Date: Nov 2008
Posts: 32
Default

Quote:
Originally Posted by maubp View Post
Is that plain un-patched EMBOSS 6.1.0? They are currently on patch 3, and this did include some FASTQ fixes.
I'll try to patch it and see how it goes.

Quote:
Originally Posted by maubp View Post
How big is the file? If you zipped it up and emailed it to me I could try it here for you if you liked
The compressed file is more than 1.3GB.

Thanks for your help. I'll try to process the fastq with a script.
zlu is offline   Reply With Quote
Old 12-22-2009, 03:42 AM   #9
zlu
Member
 
Location: UK

Join Date: Nov 2008
Posts: 32
Default

Using the patched emboss, I can now convert the line wrapped fastq files. Thank you.
zlu is offline   Reply With Quote
Old 12-22-2009, 03:49 AM   #10
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by lh3 View Post
Most of fastq parsers do not work with multi-line fastq files. You'd better write your own script. After all, processing multi-line fastq is not that hard.
Why doesn't MAQ understand multi-line fastq? Would you accept a patch to implement this?
maubp is offline   Reply With Quote
Old 12-22-2009, 06:04 AM   #11
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,170
Default

Quote:
Originally Posted by maubp View Post
Quote:
Originally Posted by lh3 View Post
Most of fastq parsers do not work with multi-line fastq files.
Why doesn't MAQ understand multi-line fastq? Would you accept a patch to implement this?
From the Wikipedia article on FASTQ (http://en.wikipedia.org/wiki/FASTQ_format):

Quote:
The original Sanger FASTQ files also allowed the sequence and quality strings to be wrapped (split over multiple lines), but this is generally discouraged as it can make parsing complicated due to the unfortunate choice of "@" and "+" as markers (these characters can also occur in the quality string).
kmcarr is offline   Reply With Quote
Old 12-22-2009, 06:22 AM   #12
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Just because the consensus is that line wrapping in FASTQ is/was a bad idea, doesn't mean it can't be reliably parsed. It does make the code a little more complicated I admit - but as this thread shows, it would be useful to some if MAQ could understand line-wrapped FASTQ.
maubp is offline   Reply With Quote
Old 12-22-2009, 07:34 AM   #13
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

The version in MAQ svn, which is never released, parses multi-line fastq and accepts gzipped input, as it uses the same parser as bwa. Nonetheless, as maq only works with reads no longer than 127bp, not supporting multi-line fastq is not a major problem. It is more important for bwa to parse multi-line fastq as it works with long reads.
lh3 is offline   Reply With Quote
Old 09-11-2010, 12:55 AM   #14
gpcr
Member
 
Location: usa

Join Date: May 2010
Posts: 18
Smile

Quote:
Originally Posted by zlu View Post
Perhaps I didn't explain it clearly. I wasn't trying to convert the Illumina short reads but mapped consensus sequence of a few chromosomes. Each is a few Mbp long, e.g:

@chr1
nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
GGTCCGAAGCATGAGTGTTTACAATGTTTGAAtaCCTTATACAGTTCTTATACATACTTT
ATAAATTATTTCCCaagctgttttgatacactcactaacagaTATTCTATAGAAGGAAAA
GTTATCCACTTATGCACATTTATAGTTTTCAGAATTGTGGATAATTAGAAATTACACACA
AAGTTATACTATTTTTAGCAACATATTCACAGGTATTTGACATATAGAGAACTGAAAAAG
TATAATTGTGTGGATAAGTCGTCCAACTCATGATTTTATAAGGATTTATTTATTGATATT
TACATAAAAATACTGTGCATAACTAATAAGCAGGATAAAGTTATCCACCGATTGTTATTA
+
!!!!!!!????????BBBEEEHKKKKKKKKKKKKNNNNNNQQNNNNNNQWWWW7WZWWWZ
ZZZ]]]`````````>]]]]]]]ZTQQQTQNNKKKKKKHKKHHHEEEEEEEHHHHHHHHH
HKKHHHHHHEEEEEBBBBBBBBBBBB?=B@BBBBBB??BBBBEEEEEEEEEEHKNNNNNQ
QQTTTTWWWWWWWTTWZWWW]`>```c``]`ccc``cfcfliiloouSxxuuuuTollLo
olliifilloolfif````]]WWWWWTTTTTTTTQQQQQQQNKKHHHHEEEEHEEEHHEE
EEEEEEEHHHHHHHHHEEEEEEHHHHHEEHKHHHHHHHHHHHEEHHHHHHHHKNNNNKNN
NQQ@NKNNNQQQTWWWZZZBZZZZ]<`]ZZZZZZZZLZZ``]]]ZZZWTTTQQQNNNNNK

And using the method described above, this is what I got:

>chr1nnnnnnnagatagaaataCACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTT
>`]]``fGff`````]iifc`````cciiLoxuuuxZ{{xxruuxx{{{~~rrrrrrrrr
Ooiiff`ZZZWTTTTTTQQNTQTWWWWWW]]]`]]cfifffciiiiiorrxx{{~x~{{x
>QNNQQQTTQQQQNNNNQZ``cfffolllloollruuuurrroruxxxxx{~~xuroorr
rrorruxxxroooux{{{xxuuuuuruurrrrollccfcc]ZZB]?]]W7QQTTWWWWWW
>QTT7TQQTTTWZZZZ]]]]]]ZZZZ]]]]]]`````]]]]]`]`c`]```Z]``fflll
louxxxxuS{{{{{~~~~~~{{~{{{{xuuroiilroiiollorlloorZ{~~{xxxuuu
>KKKKKKKHHEB????BBBEEKHHHKKKKNNNNQQQQTZZZZZZ]]````````]]WWW<
WWWW7TTTTTTTQKKKKKKHHEEEE<BBEHHH=HEEEEEHHHHEEEEHHHHHHKKKKKKK
>QQQBQQQNNNNKKKKKKQQQQQQTTTTWWZWTTQQQQQQQQTTWZZZZZ]ZZTWWTTTQ
QQTQQNNNTWWWZZ]]]ZZWWWWWWTTTTQQQEQQQQNNNNNNHHHHEEBEEEEEBBBBB
>QQQATTTTTTTWWTTTQQQQQQQNNKKKQQQTTTT7TTQQ5QNNNQKNNQQQQQQQQQQ
QQN4HKKHHHHHHHHHHHHHKHHEEBBBBBBBBBBBBBBB7??????????????!!!!!
>LAcffffooor{{Z~~~~~~~~~~|~~~q~~~~~v~~~~~~~~~~~~~~~~~~~~~~~~
~~~~~~~~~~~~~~c~~xuuric```]]ZZWTNKHEEB????!!!!!!!!!!!!!!!!!!
>QQQ5>QQ@QNNNNNNQQQQNNNNNNKHEBBBB???BBBBBBBBBEEEEEEBBBBBBBBB
BBBBBBEHKNNKKKKKKNNNKKKKKKKKKKKW]]]cccfffc`]Z`>cfiOollllllLl
>NQQTTTWQQQQQQTTNNNQQQT7GTTAWWWTTTZZW7TQTTQQQQQQTQQTTTQTTTTT
WWZWZ]]]]WZ]``D]]]D```D]]]Z]]]ZZWWWT7@QNKKKKKHEBBBBHKHHHNQTT
awk '/@chr1$/,/^+$/' consensus.fastq | perl -pe "s/@/>/ ; s/\+//" > chr1.fasta

here "chr1" is chromosome
gpcr is offline   Reply With Quote
Old 08-13-2011, 05:41 AM   #15
yifangt
Member
 
Location: Canada

Join Date: Feb 2011
Posts: 61
Default

My question is related, but not quite the same, which is: I need to get a sub-string of the sequence and the corresponding quality score for each of the entries, the file format untouched. My Illumina reads consists of 101bp long. I want remove the first 10bp and last 25bp then only the middle 66bp left.

The reason I need do this is my DNA sequence is methylated and the first 10 and last 25bp seem not having good quality in general so that I want get rid of them. My challenge is to remove both quality score and sequence correspondingly.
Code:
use Bio::SeqIO; use Bio::Seq::Quality;

$seqio = Bio::SeqIO->new('-format'=>'fastq' , '-file'=>'some.fasq');

my $out_fastq = Bio::SeqIO->new( -format => 'fastq', '-file'=> 'subset.fastq');

while((my $line = $seqio->next_seq() ) { 
# keep the id 
# substring the sequence (from 11 to 66) 
# substring the quality score (from 11 to 66) $out_fastq->write_seq($line); }
Or any tools there to do my job?

Thanks a lot!

Yifang

Last edited by yifangt; 08-13-2011 at 05:44 AM.
yifangt is offline   Reply With Quote
Old 08-13-2011, 05:50 AM   #16
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,170
Default

Quote:
Originally Posted by yifangt View Post
My question is related, but not quite the same, which is: I need to get a sub-string of the sequence and the corresponding quality score for each of the entries, the file format untouched. My Illumina reads consists of 101bp long. I want remove the first 10bp and last 25bp then only the middle 66bp left.

The reason I need do this is my DNA sequence is methylated and the first 10 and last 25bp seem not having good quality in general so that I want get rid of them. My challenge is to remove both quality score and sequence correspondingly.
Code:
use Bio::SeqIO; use Bio::Seq::Quality;

$seqio = Bio::SeqIO->new('-format'=>'fastq' , '-file'=>'some.fasq');

my $out_fastq = Bio::SeqIO->new( -format => 'fastq', '-file'=> 'subset.fastq');

while((my $line = $seqio->next_seq() ) { 
# keep the id 
# substring the sequence (from 11 to 66) 
# substring the quality score (from 11 to 66) $out_fastq->write_seq($line); }
Or any tools there to do my job?

Thanks a lot!

Yifang
Yifang,

First rule of coding, never write your own when a tool already exist to do what you want. The FASTX-Toolkit has a utility to do exactly what you want, namely the fastx_trimmer. You give an input file (fasta or fastq), the postion of the first and last base you wish to keep (in your case 11 and 76) and it will produce a trimmed (bases and quality) file.
kmcarr is offline   Reply With Quote
Old 08-13-2011, 06:03 AM   #17
yifangt
Member
 
Location: Canada

Join Date: Feb 2011
Posts: 61
Default

Thanks a lot kmcarr! I was searching for it, just missed the inside. I will give it a try.

Best!
Yifang
yifangt is offline   Reply With Quote
Old 08-17-2011, 09:17 AM   #18
gpcr
Member
 
Location: usa

Join Date: May 2010
Posts: 18
Default

Code:
awk '{print ; getline } {print substr($0, 11, 76) ; getline; print ; getline ; print substr($0, 11, 76) }' input.fastq
gpcr is offline   Reply With Quote
Old 08-17-2011, 10:11 AM   #19
yifangt
Member
 
Location: Canada

Join Date: Feb 2011
Posts: 61
Default

This is a cool script. Thanks gpcr! Yifang
yifangt is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:35 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO