I extracted some fastq files from SRA files. Some lines from one of them are in the first code block below.
When trying to figure out which quality-score encoding a fastq file uses, I usually look for runs of Bs at the ends of reads (which tells me it is the old Illumina encoding, Phred+64) or runs of #s at the ends of reads (which tells me it is the normal Sanger encoding, Phred+33). But in these NCBI-created fastq files, I don't see any runs of #s or Bs. I have three questions:
1. Why are there no #s or Bs?
2. How can I figure out what scoring system was used here?
3. With what I consider a more normal fastq file (with lines like those pasted below), what is the best way to figure out what scoring system is being used?
For what it's worth, the Wikipedia page on the fastq file format implies that quality scores in NCBI-converted fastq files are automatically converted to the Sanger format.
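A rough way to check this without hunting for runs of Bs or #s is to look at the range of ASCII codes in the quality lines. The sketch below is only a heuristic under stated assumptions. `guess_phred_offset` is a hypothetical helper name, not part of any toolkit. The cutoffs are assumptions about typical data: 59 is the lowest code that can occur in the +64 encodings, and 74 is the usual top of Illumina Phred+33 output. A handful of reads may not be enough to decide.

```python
def guess_phred_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from the range of quality characters.

    Returns (offset, lowest_code, highest_code); offset is None when the
    observed range fits both encodings. Heuristic only.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest, highest = min(codes), max(codes)
    if lowest < 59:
        # Assumption: codes below ';' (59) cannot occur in Phred+64 or
        # Solexa+64 data, so the file must be Phred+33 (Sanger).
        return 33, lowest, highest
    if highest > 74:
        # Assumption: codes above 'J' (74) are unusual for Phred+33
        # Illumina output, so +64 is the more likely encoding.
        return 64, lowest, highest
    return None, lowest, highest  # range is compatible with both


# Quality lines copied from the SRA-derived reads in this post.
sra_quals = [
    "CC<7C<<CCC<<C<7C<?CC<<,;,7CC6:?0??:(??.",
    "C<?<CCCC77CCC?4?C<4C<<$C,C47?C<??7*<(44",
    "CCCCCCCCACCCA??CAACAA-CCCA?ACCAA6<A7+??",
]
# Quality line from the "more normal" example in this post.
normal_quals = ["!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"]

print(guess_phred_offset(sra_quals))     # (33, 36, 67)
print(guess_phred_offset(normal_quals))  # (33, 33, 70)
```

Both examples contain characters below ASCII 59 ('$' in the SRA reads, '!' in the other example). Under the assumptions above, that points to Phred+33 even though neither has a run of #s.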
Thanks.
Eric
Code:
@SRR034473.2 X8097_104:6:1:881:909 length=39
TCAAAAAATGAAGAAGAAGAAAAAAATGAAAAGGGTGCA
+SRR034473.2 X8097_104:6:1:881:909 length=39
CC<7C<<CCC<<C<7C<?CC<<,;,7CC6:?0??:(??.
@SRR034473.3 X8097_104:6:1:900:876 length=39
TGAAGTTCTTGTGGTTCAACCAAGTGTATTGCCAGTACT
+SRR034473.3 X8097_104:6:1:900:876 length=39
C<?<CCCC77CCC?4?C<4C<<$C,C47?C<??7*<(44
@SRR034473.4 X8097_104:6:1:905:908 length=39
TTGATGTGACTTGAAGGCTTCATCTCCTTTTTAGTGATT
+SRR034473.4 X8097_104:6:1:905:908 length=39
CCCCCCCCACCCA??CAACAA-CCCA?ACCAA6<A7+??
Code:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65