Hello all,
I've been downloading some Illumina/Solexa short read files from SRA, such as this one, to test and get used to MAQ and BWA.
It seems the format of the provided short reads is Solexa fastq, i.e.,
Code:
@SRR002322.60 080317_CM-KID-LIV-2-REPEAT_0003:1:1:88:275 length=36
TCTGTCTCAAAAACAAAACAAAACAAAACAAAAAAA
+SRR002322.60 080317_CM-KID-LIV-2-REPEAT_0003:1:1:88:275 length=36
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAII1

However, whenever I try to convert this format to Sanger format using
$ maq sol2sanger SRR002322.fastq SRR002322.sang.fastq
or try to convert this .fastq file to the binary .bfq format, an extremely long list of warnings is printed to the terminal, running to several thousand messages like these:
Code:
[seq_read_fastq] Inconsistent sequence name: II9II<%IIIII6I. Continue anyway.
[seq_read_fastq] Inconsistent sequence name: +II'IIII(). Continue anyway.
[seq_read_fastq] Inconsistent sequence name: .IIIIII'IIIIIIIIIIIIIII3IIE. Continue anyway.
(...)

Well, I've investigated a little, and I think I've found the origin of all these errors. All the problems concern short reads whose quality string contains an '@' symbol. For example, the three short reads matching the three errors I've just shown are these (in each quality string, the text after the '@' is the 'name' reported in the warning):
Code:
@SRR002322.11 080317_CM-KID-LIV-2-REPEAT_0003:1:1:121:511 length=36
GTTTGGCTAAGGTTGTCTGGTAGTTAGGTGGAGTTG
+SRR002322.11 080317_CM-KID-LIV-2-REPEAT_0003:1:1:121:511 length=36
IIIIIIIIIDIIHIIIIIIII@II9II<%IIIII6I
@SRR002322.33 080317_CM-KID-LIV-2-REPEAT_0003:1:1:110:444 length=36
TGTATTTTTAGTAGAGACGTGGTTTCACCATCTTGT
+SRR002322.33 080317_CM-KID-LIV-2-REPEAT_0003:1:1:110:444 length=36
IIIIIIIII%III+IIIIIIIIIII@+II'IIII()
@SRR002322.63 080317_CM-KID-LIV-2-REPEAT_0003:1:1:108:770 length=36
TAAAAATGCCCTAGCCTACTTCTTACCACAAGGCAC
+SRR002322.63 080317_CM-KID-LIV-2-REPEAT_0003:1:1:108:770 length=36
IIIIIIII@.IIIIII'IIIIIIIIIIIIIII3IIE

All the other sequences are converted just fine.
My bet is that the MAQ scripts interpret everything after an '@' as a sequence name and thus misinterpret the following lines as well. If I let the script run to the end of the file, the resulting .sang.fastq file contains some funny short reads. Apart from normal reads like this one,
Code:
@SRR002322.11
GTTTGGCTAAGGTTGTCTGGTAGTTAGGTGGAGTTG
+
!"!!!"@&.!,+&!-+7!!!3'1'%5@!!!!"!"!"

I also get a ton of reads like this one, for example:
Code:
@II9II<%IIIII6I
SRR.CM-KID-LIV--REPEATlengthTTTTTGCATCAAAAAGCTTTATTTCCATTTGGTCCA
+
%&%%%&B)0%.-)%/-9%%%5*3*(7B%%%%&%&%&!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Note how the 'name' of this nonsensical short read is the end of the first problematic quality score I've shown before, II9II<%IIIII6I.
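To make the '@' guess concrete, here is a minimal Python sketch. It is not MAQ's actual code, only an assumption about how a reader might behave: naive_names and four_line_names are hypothetical helpers, the first treating every '@' as the start of a record and the second assuming exactly four lines per record.

```python
# Sketch only: these helpers are hypothetical and do NOT reproduce MAQ's parser.
# They contrast two ways of finding record names in FASTQ text.

SAMPLE = (
    "@SRR002322.11 080317_CM-KID-LIV-2-REPEAT_0003:1:1:121:511 length=36\n"
    "GTTTGGCTAAGGTTGTCTGGTAGTTAGGTGGAGTTG\n"
    "+SRR002322.11 080317_CM-KID-LIV-2-REPEAT_0003:1:1:121:511 length=36\n"
    "IIIIIIIIIDIIHIIIIIIII@II9II<%IIIII6I\n"
)


def naive_names(text):
    """Assumed naive behaviour: every '@' starts a new record."""
    names = []
    for chunk in text.split("@")[1:]:
        tokens = chunk.split()
        if tokens:
            # The first whitespace-delimited token after '@' is taken as the name.
            names.append(tokens[0])
    return names


def four_line_names(text):
    """Assumes strict four-line records; only line 1 of each record holds a name."""
    lines = text.splitlines()
    return [lines[i][1:].split()[0] for i in range(0, len(lines) - 3, 4)]


if __name__ == "__main__":
    print(naive_names(SAMPLE))
    print(four_line_names(SAMPLE))
```

With the first problematic read as input, the naive reader reports a second "record" named II9II<%IIIII6I, which is the name in the first warning, while the four-line reader sees only SRR002322.11.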
Since I've searched this forum and haven't found anyone else with the same problem, I think I must be doing something wrong. Are the SRA files not in Solexa/Illumina fastq format? What am I missing?
Lots of thanks!