SEQanswers


vaibhavvsk 12-23-2015 02:23 AM

How to know whether it's Read1 (Forward) or Read2 (Reverse) from fastq contents.
 
As per the fastq file description on Wikipedia (https://en.wikipedia.org/wiki/FASTQ_format), the Illumina sequence identifier formats are:

Case A. Standard Illumina format
Read Identifier: @HWUSI-EAS100R:6:73:941:1973#0/1
/1 indicates it is R1, i.e. the forward read, and
/2 indicates it is R2, i.e. the reverse read.

Case B. Illumina with Casava 1.8
Read Identifier: @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
1:Y:18:ATCACG, i.e. the leading substring 1:, indicates it is R1
2:Y:18:ATCACG, i.e. the leading substring 2:, indicates it is R2

Case C. NCBI Sequence Read Archive (SRA) fastq format
Read Identifier: @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
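
The three cases above can be checked mechanically. Here is a minimal Python sketch, assuming the headers follow exactly the layouts listed; the function name `read_number` and both regular expressions are illustrative, not part of any library. It returns `None` when the header carries no mate information, which is what happens for the SRA-style Case C.

```python
import re

# Case A: older Illumina headers end in "/1" or "/2".
LEGACY_SUFFIX = re.compile(r"/([12])$")
# Case B: Casava 1.8 headers carry "<read>:<filtered>:<control>:<index>"
# as a separate space-delimited field.
CASAVA_COMMENT = re.compile(r"^([12]):[YN]:\d+:\S*$")


def read_number(header):
    """Return 1 or 2 if the header states which mate it is, else None."""
    fields = header.lstrip("@").split()
    if not fields:
        return None
    # Case A: look for the /1 or /2 suffix on the read name itself.
    match = LEGACY_SUFFIX.search(fields[0])
    if match:
        return int(match.group(1))
    # Case B: look for a Casava-style comment among the remaining fields.
    for field in fields[1:]:
        match = CASAVA_COMMENT.match(field)
        if match:
            return int(match.group(1))
    # Case C (SRA defline) and anything else: no mate information.
    return None


for example in (
    "@HWUSI-EAS100R:6:73:941:1973#0/1",
    "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG",
    "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36",
):
    print(example, "->", read_number(example))
```

Other header layouts may exist that this sketch does not recognise; it only covers the cases listed in this post.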

I'm pasting the first 4 lines of each file from the paired-end data:

==> SRR1583191_1.fastq <==
@SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
NATCCAGTAGCCTCCTCCCCATCATCTCCCATTTCTTCTACAGGGGGACTCCCCCAGGTCTGGTAGCCCAAAGCTGCTGCTACAGCCGCCATGGGGGGGTG
+SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
#1=DDFFFHHGHGIIIIIIIBFHCHIIIIIEHIIGIIGIIIIHIIIIGIIIIIIIIGHCHFEFFFCEEECBBCCCCCCCCCCCCCCCCBB9@ACABBCB09

==> SRR1583191_2.fastq <==
@SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
TCCTGTTCTCCCTGCTTGGAGTCTTGGTTGCCTGTGGAAATATCAGGCATGTGAATGGGAAGGCAGGAGTAGACAGTGAATGTGGCCTACTTGATTTGAGG
+SRR1583191.1 SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101
CCCFFFFFGHHGHJJJJIICGFGHHGGHIIIIIGFCG9CGHEHIIJJJHIGHJIIIJJIHIIIJIJJIHCEEHCEFEF3@C@CCCDBDCDDDDCCCDDDDD

Here, from the Case C identifier, it's not clear which substring of the Read Identifier can be used to distinguish R1 from R2.
I tried looking into paired-end files from SRA, but I could not find an R1 or R2 identifier.

I would like to know how to get R1/R2 information from fastq file contents. Apart from the three cases above, I would also like to know whether other fastq read identifier formats contain substrings that provide R1/R2 information.
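
Since the two SRA files above have identical headers, one fallback is the file name rather than the contents. The sketch below assumes the `_1` / `_2` suffix seen in `SRR1583191_1.fastq` and `SRR1583191_2.fastq` marks the first and second read of each pair; that is a naming convention, not something stored in the records, and the helper name `mate_from_filename` is made up for illustration.

```python
import re
from pathlib import Path

# Assumption: paired files are named <accession>_1.fastq and <accession>_2.fastq
# (optionally .fq and/or gzip-compressed), as in the example above.
MATE_SUFFIX = re.compile(r"_([12])\.(?:fastq|fq)(?:\.gz)?$")


def mate_from_filename(path):
    """Return 1 or 2 based on the file-name suffix, or None if there is none."""
    match = MATE_SUFFIX.search(Path(path).name)
    return int(match.group(1)) if match else None


for name in ("SRR1583191_1.fastq", "SRR1583191_2.fastq", "SRR1583191.fastq"):
    print(name, "->", mate_from_filename(name))
```

If the files have been renamed, this tells you nothing, so treat the result as a hint only.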

GenoMax 12-23-2015 06:50 AM

If you use the

Quote:

-F | --origfmt Defline contains only original sequence name.
option when extracting the fastq files from SRA, you would potentially recover the original Illumina fastq header.
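
For anyone scripting this, here is a rough Python sketch that only assembles the command line. It assumes the extraction tool is `fastq-dump` from the SRA Toolkit (the quoted help line is from its usage text) and that you also want `--split-files` and `--outdir`; those two extras, and the helper name `build_fastq_dump_command`, are my additions for illustration. Whether the original header comes back depends on what was submitted to SRA, hence "potentially".

```python
import shlex


def build_fastq_dump_command(accession, outdir="."):
    """Assemble a fastq-dump call that keeps the original defline."""
    return [
        "fastq-dump",
        "--origfmt",      # -F: defline contains only original sequence name
        "--split-files",  # assumption: write each read of a pair to its own file
        "--outdir", outdir,
        accession,
    ]


command = build_fastq_dump_command("SRR1583191")
# Print the command so it can be inspected or pasted into a shell.
print(shlex.join(command))
# To actually run it (requires the SRA Toolkit on PATH):
#   import subprocess
#   subprocess.run(command, check=True)
```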

Jessica_L 12-23-2015 08:17 AM

None of the information in this string:

SN7001163:87:C1ME6ACXX:1:1101:1176:2038 length=101

can be used as an identifier for R1 vs R2. The fields are things like the instrument serial number, flow cell ID, lane number, tile number and X/Y coordinates of the cluster.
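
To make the field layout concrete, here is a small Python sketch that splits the colon-separated name into labelled parts. It assumes the seven fields follow the Casava 1.8 order described on the Wikipedia FASTQ page (instrument, run number, flow cell ID, lane, tile, X, Y); the `ClusterName` type and `parse_cluster_name` function are hypothetical names. None of the fields says R1 or R2.

```python
from typing import NamedTuple, Optional


class ClusterName(NamedTuple):
    instrument: str
    run_number: int   # assumption: second field is the run number
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int


def parse_cluster_name(name: str) -> Optional[ClusterName]:
    """Split 'instrument:run:flowcell:lane:tile:x:y'; return None if it does not fit."""
    parts = name.split(":")
    if len(parts) != 7:
        return None
    instrument, run, flowcell, lane, tile, x, y = parts
    try:
        return ClusterName(instrument, int(run), flowcell,
                           int(lane), int(tile), int(x), int(y))
    except ValueError:
        # A numeric field held something else, so this is not the expected layout.
        return None


print(parse_cluster_name("SN7001163:87:C1ME6ACXX:1:1101:1176:2038"))
```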

GenoMax's suggestion to recover the original header would be the best option to get the data you're looking for.

vaibhavvsk 12-24-2015 03:23 AM

Quote:

Originally Posted by GenoMax (Post 186810)
If you use the -F | --origfmt option when extracting the fastq files from SRA, you would potentially recover the original Illumina fastq header.

Hey GenoMax, it worked for me. Thanks Jessica_L too!

