According to the FASTQ format description on Wikipedia (https://en.wikipedia.org/wiki/FASTQ_format), sequence identifiers come in the following formats:
Case A. Standard Illumina Format
Read Identifier : @HWUSI-EAS100R:6:73:941:1973#0/1
/1 indicates it is R1 i.e. Forward Read and
/2 indicates it is R2 i.e. Reverse Read
Case B. Illumina with Casava 1.8
Read Identifier : @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
1:Y:18:ATCACG, i.e. the leading substring 1: indicates it is R1
2:Y:18:ATCACG, i.e. the leading substring 2: indicates it is R2
Case C. NCBI Sequence Read Archive (SRA) FASTQ format
Read Identifier : @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
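To make the three cases concrete, here is a minimal Python sketch of how I would try to read the mate number from a header line. The function name read_number and both regular expressions are my own assumptions based only on the example identifiers above, not an official parser. For a Case C header it returns None, which is exactly the problem I am asking about.

```python
import re

# Case A (assumed pattern): older Illumina headers end in /1 or /2.
LEGACY_RE = re.compile(r"/([12])$")
# Case B (assumed pattern): Casava 1.8 headers carry
# "<read>:<filtered>:<control>:<index>" after a single space.
CASAVA_RE = re.compile(r"^\S+ ([12]):[YN]:\d+:\S*$")


def read_number(header):
    """Return 1 or 2 if the header encodes the mate, else None."""
    header = header.strip()
    match = CASAVA_RE.match(header)
    if match:
        return int(match.group(1))
    # Only inspect the first whitespace-separated field for the /1 or /2 suffix.
    match = LEGACY_RE.search(header.split()[0])
    if match:
        return int(match.group(1))
    # SRA-style headers (Case C) fall through: no mate marker found.
    return None


if __name__ == "__main__":
    for h in (
        "@HWUSI-EAS100R:6:73:941:1973#0/1",
        "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG",
        "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36",
    ):
        print(h, "->", read_number(h))
```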
Here are the first 4 lines from each file of a paired-end SRA run:
==> SRR1583191_1.fastq <==
@SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
NATCCAGTAGCCTCCTCCCCATCATCTCCCATTTCTTCTACAGGGGGACTCCCCCAGGTCTGGTAGCCCAAAGCTGCTGCTACAGCCGCCATGGGGGGGTG
+SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
#1=DDFFFHHGHGIIIIIIIBFHCHIIIIIEHIIGIIGIIIIHIIIIGIIIIIIIIGHCHFEFFFCEEECBBCCCCCCCCCCCCCCCCBB9@ACABBCB09
==> SRR1583191_2.fastq <==
@SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
TCCTGTTCTCCCTGCTTGGAGTCTTGGTTGCCTGTGGAAATATCAGGCATGTGAATGGGAAGGCAGGAGTAGACAGTGAATGTGGCCTACTTGATTTGAGG
+SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
CCCFFFFFGHHGHJJJJIICGFGHHGGHIIIIIGFCG9CGHEHIIJJJHIGHJIIIJJIHIIIJIJJIHCEEHCEFEF3@C@CCCDBDCDDDDCCCDDDDD
From the Case C identifier, it is not clear which substring of the read identifier can be used to distinguish R1 from R2.
I looked through paired-end files from SRA but could not find any R1 or R2 marker; in the example above, the header lines in SRR1583191_1.fastq and SRR1583191_2.fastq are identical.
I would like to know how to get the R1/R2 information from the FASTQ file contents. Apart from these three cases, I would also like to know whether other FASTQ read identifier formats contain substrings that provide R1/R2 information.