SEQanswers

Old 12-21-2009, 08:26 AM   #1
byb121
Member
 
Location: Newcastle upon Tyne

Join Date: Aug 2009
Posts: 18
For MAQ: Is there a Tool to convert sanger-format fastq file to illumina-format fastq

Hello everyone,

I am new to next-gen sequencing and this forum. Hope someone can help me out here.

To practice and test software tools for alignment, I downloaded a short-read dataset of a yeast genome and tried to convert the sanger-format fastq data to Maq’s BFQ (I didn't know that SRA provides sanger-format fastq and that MAQ prefers the other fastq format).

The command line I used was:
Code:
maq fastq2bfq SRR002051.fastq SRR002051.bfq
Part of the warnings shown on the screen:
Code:
[seq_read_fastq] Inconsistent sequence name: ;E)$$$%%%%$%$""&"""""". Continue anyway.
[seq_read_fastq] Inconsistent sequence name: 32-)"""""". Continue anyway.
[seq_read_fastq] Inconsistent sequence name: *IDI*II%A;1+3&"""""". Continue anyway.
[seq_read_fastq] Inconsistent sequence name: $$,$"#&&%4&+$("""""". Continue anyway.
[seq_read_fastq] Inconsistent sequence name: 6&%*I)''%11#"+-"""""". Continue anyway.
[seq_read_fastq] Inconsistent sequence name: 43&"""""". Continue anyway.
[seq_read_fastq] Inconsistent sequence name: (I#$,)B:E/(&"""""". Continue anyway.
[seq_read_fastq] Inconsistent sequence name: I5.=;&#!"-"""""". Continue anyway.
[seq_read_fastq] Inconsistent sequence name: """""". Continue anyway.
[seq_read_fastq] Inconsistent sequence name: (%%+%$/"""""". Continue anyway.
[seq_read_fastq] Inconsistent sequence name: $&/#2#&%!%"!"""""". Continue anyway.
[seq_read_fastq] Inconsistent sequence name: /%%!$#%*#"&"""""". Continue anyway.
[seq_read_fastq] Inconsistent sequence name: +6+/&%+&%$"""""". Continue anyway.
[seq_read_fastq] Inconsistent sequence name: +F)'$5*&+9%""+%"""""". Continue anyway.
[seq_read_fastq] Inconsistent sequence name: %%'"!"""""". Continue anyway.
I checked the BFQ file by converting it back to sanger-format fastq. It appears that fastq2bfq cannot handle the '@' symbol in the quality score lines of sanger-format fastq files.

I am wondering if anyone has already written a sanger-format fastq to illumina-format fastq converter; it would be really helpful to me.

Thanks.
Old 12-21-2009, 09:33 AM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,539

You have already got Sanger style FASTQ files from the NCBI SRA, and MAQ likes standard Sanger FASTQ files. You would only need to convert if you started with Solexa or Illumina encoded FASTQ files.

Maybe the problem is something else - could you post the first 20 lines or so of the FASTQ file in the forum - use the [ code ] data [ /code ] tags to make it display nicely.
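
For reference, here is a rough sketch of what the conversion mentioned above amounts to in the case that would actually need it: a file whose qualities are Illumina 1.3+ encoded (Phred + 64) being re-encoded as Sanger (Phred + 33). The function name is made up for illustration. It does not apply to older Solexa-encoded files, which use a different score scale, and it is not needed for SRA downloads that are already Sanger encoded.

```python
def illumina64_to_sanger(quality):
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger)."""
    converted = []
    for char in quality:
        phred = ord(char) - 64  # Illumina 1.3+ stores Phred + 64
        if phred < 0:
            # A character this low cannot come from a Phred+64 file.
            raise ValueError("%r is below the Phred+64 range" % char)
        converted.append(chr(phred + 33))  # Sanger stores Phred + 33
    return "".join(converted)


print(illumina64_to_sanger("hhhh^B"))  # prints IIII?#
```

Only the quality line of each record changes; the name and sequence lines are left alone.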
Old 12-21-2009, 09:37 AM   #3
aaronh
Member
 
Location: California

Join Date: Sep 2008
Posts: 45

It is not the '@' symbol that is not allowed; it is an '@' which follows an illegal space in the description. Unfortunately, many fastq files are not properly formatted and contain spaces in the sequence name, which causes maq to mess up. Clean up the sequence names and the tool will work.
Old 12-22-2009, 02:47 AM   #4
byb121
Member
 
Location: Newcastle upon Tyne

Join Date: Aug 2009
Posts: 18

Quote:
Originally Posted by maubp
You have already got Sanger style FASTQ files from the NCBI SRA, and MAQ likes standard Sanger FASTQ files. You would only need to convert if you started with Solexa or Illumina encoded FASTQ files.

Maybe the problem is something else - could you post the first 20 lines or so of the FASTQ file in the forum - use the [ code ] data [ /code ] tags to make it display nicely.
Thanks for the replies.
Here are the first 20 lines of the FASTQ file I downloaded from SRA:
Code:
@SRR002051.1 :8:1:325:773 length=33
AAAGAACATTAAAGCTATATTATAAGCAAAGAT
+SRR002051.1 :8:1:325:773 length=33
IIIIIIIIIIIIIIIIIIIIIIIII'[email protected]$)-
@SRR002051.2 :8:1:409:432 length=33
AAGTTATGAAATTGTAATTCCAATATCGTAAGC
+SRR002051.2 :8:1:409:432 length=33
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII07
@SRR002051.3 :8:1:488:490 length=33
AATTTCTTACCATATTAGACAAGGCACTATCTT
+SRR002051.3 :8:1:488:490 length=33
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII&I
@SRR002051.4 :8:1:899:554 length=33
AGATTTCTAATATGGTTAAGAAGCGAACTTTTT
+SRR002051.4 :8:1:899:554 length=33
IIIIIIIIIIIIIIIIIII?IIIIII<IIIIII
@SRR002051.5 :8:1:464:463 length=33
AAAGCAGCAGCACGTAGTTCTTCATCCTTCTTC
+SRR002051.5 :8:1:464:463 length=33
IIIIIIIIIIIIIIIIIIIIIIIFIIIIII%.I
These are the first 20 lines of the FASTQ file that I converted back from the BFQ file:
Code:
@SRR002051.1
AAAGAACATTAAAGCTATATTATAAGCAAAGAT
+
:8:1:325:773``````=33IIIIIIIIIIII
@I$)-
NNNNNNNGTNAAGTTATGAAATTGTAATTCCAATATCGTAAGC
+
!!!!!!!5:!73``````=33IIIIIIIIIIII""""""""""
@SRR002051.3
AATTTCTTACCATATTAGACAAGGCACTATCTT
+
:8:1:488:490``````=33IIIIIIIIIIII
@SRR002051.4
AGATTTCTAATATGGTTAAGAAGCGAACTTTTT
+
:8:1:899:554``````=33IIIIIIIIIIII
@SRR002051.5
AAAGCAGCAGCACGTAGTTCTTCATCCTTCTTC
+
:8:1:464:463``````=33IIIIIIIIIIII
Clearly, short read SRR002051.2 has both the wrong sequence and incorrect quality scores. I checked several more reads that have '@' in their quality scores, and they have the same problem.
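
The by-eye comparison above can be automated. Below is a minimal sketch, assuming both files use strictly four lines per record (no wrapped sequences); `read_fastq` and `find_damaged` are hypothetical helper names, and the toy records stand in for the real data. It lists the reads from the original file whose sequence is missing or altered in the round-tripped file.

```python
def read_fastq(lines):
    """Yield (name, sequence, quality) from strictly four-line FASTQ records."""
    rows = [line.rstrip("\n") for line in lines]
    for i in range(0, len(rows) - 3, 4):
        header, seq, _plus, qual = rows[i:i + 4]
        # Keep only the identifier, i.e. the text before the first space.
        yield header[1:].split(" ", 1)[0], seq, qual


def find_damaged(original_lines, roundtrip_lines):
    """Return names from the original whose sequence did not survive."""
    survived = {name: seq for name, seq, _ in read_fastq(roundtrip_lines)}
    return [name for name, seq, _ in read_fastq(original_lines)
            if survived.get(name) != seq]


# Toy data: r2 has an '@' in its quality line and comes back mangled.
original = ["@r1 a b\n", "ACGT\n", "+r1 a b\n", "IIII\n",
            "@r2 a b\n", "GGCC\n", "+r2 a b\n", "II@I\n"]
roundtrip = ["@r1\n", "ACGT\n", "+\n", "IIII\n",
             "@I\n", "NNGGCC\n", "+\n", "!!IIII\n"]
print(find_damaged(original, roundtrip))  # prints ['r2']
```

Comparing by name and sequence rather than by length matters here: in the mangled record above the sequence and quality lengths still agree with each other, so a length check alone would miss it.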

Last edited by byb121; 12-22-2009 at 03:38 AM.
Old 12-22-2009, 03:47 AM   #5
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,539

Looking at that, I think aaronh is right - MAQ doesn't like the descriptions after the identifiers. I would file a bug on MAQ.

In the short term, you could convert this and remove the descriptions using another tool.

e.g. In Biopython 1.51 or later using the SeqIO interface:
Code:
from Bio import SeqIO

def remove_descr(records):
    """Iterate over SeqRecord objects clearing their description."""
    for rec in records:
        rec.description = ""
        yield rec

# Open both files as handles so they are closed even if parsing fails.
with open("byb121_sra.fastq") as in_handle, open("byb121_maq.fastq", "w") as out_handle:
    records = remove_descr(SeqIO.parse(in_handle, "fastq"))
    # With an empty description only the identifier is written on the "@" line.
    count = SeqIO.write(records, out_handle, "fastq")

print("Converted %i records" % count)
Old 12-22-2009, 07:46 AM   #6
byb121
Member
 
Location: Newcastle upon Tyne

Join Date: Aug 2009
Posts: 18

Thanks a lot. Since it's a short-term practice anyway, I will just get rid of those spaces, or perhaps everything after the space. It is always good to know that I didn't do anything wrong.

If MAQ can fix the problem it'll be really really great.
Old 12-20-2013, 02:26 AM   #7
bakerwm
Member
 
Location: China

Join Date: Sep 2010
Posts: 12

If you leave the "+" line (the third line of each record) empty, maq will parse the sequence.

Before:
Code:
$cat  test.fastq
@SRR228083.sra.1HWI-EAS158_0001:5:1:1089:19990length=36
CACTTTGCGTAACGTACACTGGGNTCGCTGAANTAG
+SRR228083.sra.1 HWI-EAS158_0001:5:1:1089:19990 length=36
[email protected]@<4:7:>:>2;3>;>?#@###########
@SRR228083.sra.2HWI-EAS158_0001:5:1:1089:13103length=36
GCGCGGTGGTCCCACCTGACCCCNTGCCGAACNCAG
+SRR228083.sra.2 HWI-EAS158_0001:5:1:1089:13103 length=36
[email protected]@[email protected]=BCAB>[email protected]#@>[email protected]#######

$maq  fastq2bfq  test.fastq  test.bfq                             
[seq_read_fastq] Inconsistent sequence name: [email protected]<4:7:>:>2;3>;>?#@###########. Continue anyway.
[seq_read_fastq] Inconsistent sequence name: [email protected]@CCC=BCAB>[email protected]#@>[email protected]#######. Continue anyway.
-- finish writing file 'test.bfq'
-- 2 sequences were loaded.
After:
Code:
$cat test.new.fastq
@SRR228083.sra.1HWI-EAS158_0001:5:1:1089:19990length=36
CACTTTGCGTAACGTACACTGGGNTCGCTGAANTAG
+
[email protected]@<4:7:>:>2;3>;>?#@###########
@SRR228083.sra.2HWI-EAS158_0001:5:1:1089:13103length=36
GCGCGGTGGTCCCACCTGACCCCNTGCCGAACNCAG
+
[email protected]@[email protected]=BCAB>[email protected]#@>[email protected]#######

$maq  fastq2bfq test.new.fastq  test.new.bfq
-- finish writing file 'test.bfq'
-- 2 sequences were loaded.
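
For larger files the same edit can be scripted. This is only a sketch, assuming strictly four lines per record; `blank_plus_lines` is a made-up name and the demo record is toy data, not taken from the files above.

```python
def blank_plus_lines(lines):
    """Yield FASTQ lines with the third line of every record reduced to '+'."""
    for index, line in enumerate(lines):
        if index % 4 == 2:
            # Guard against files that are not four lines per record.
            if not line.startswith("+"):
                raise ValueError("line %d is not a '+' separator" % (index + 1))
            yield "+\n"
        else:
            yield line


demo = ["@read1 lane:5 length=4\n", "ACGT\n",
        "+read1 lane:5 length=4\n", "+@I#\n"]
print("".join(blank_plus_lines(demo)), end="")
```

The separator is picked by position rather than by its leading "+", because a quality line may itself start with "+" (as in the demo). To run it on a real file, open the input and output files and pass the input handle through `blank_plus_lines` into `writelines` on the output handle.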
Tags
converter, fastq, maq
