I am confused by an Illumina FASTQ file I have been asked to assemble. I had expected two files, one per mate of each pair, each containing 36 bp reads. Instead I got a single file whose sequence and quality lines are 77 characters long.
I asked the originator of the file what was going on, and they said that the file simply hadn't been split and that each line is in fact the two paired-end reads concatenated. They suggested that I split each sequence and write the halves out to two files.
My problem is with the math: 77 is not 36*2 = 72. That leaves 5 bases unaccounted for. So I would like to see if someone can clear up my confusion by answering a few questions.
Is this file a "standard" Illumina/Solexa sequence file?
What is the deal with the concatenated reads?
Why wouldn't I want the last 5 bases? Are they adaptors? Low-quality?
For now I am going to do as suggested and just split the 77 bases into two 36-base sequences and toss the last 5.
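Here is a minimal sketch of the split I have in mind, in Python. It assumes things I have not confirmed: that read 1 occupies the first 36 bases, read 2 the next 36, and the 5 extra bases sit at the very end; that each record is exactly four lines; and that appending `/1` and `/2` to the header is an acceptable way to label the mates. The function names `split_record` and `split_fastq` are my own, not part of any Illumina tool.

```python
def split_record(header, seq, qual, read_len=36):
    """Split one concatenated record into two mates.

    Assumes the layout is read1 + read2 + extra bases; anything past
    2 * read_len is dropped and its length is returned as `leftover`.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ: " + header)
    if len(seq) < 2 * read_len:
        raise ValueError("record shorter than two reads: " + header)
    # The /1 and /2 suffixes are an assumed naming convention for the mates.
    mate1 = (header + "/1", seq[:read_len], qual[:read_len])
    mate2 = (header + "/2", seq[read_len:2 * read_len], qual[read_len:2 * read_len])
    leftover = len(seq) - 2 * read_len
    return mate1, mate2, leftover


def split_fastq(in_path, out1_path, out2_path, read_len=36):
    """Write the two mates of every record to two separate FASTQ files."""
    with open(in_path) as fin, open(out1_path, "w") as f1, open(out2_path, "w") as f2:
        while True:
            header = fin.readline().rstrip("\r\n")
            if not header:  # end of file (or a trailing blank line)
                break
            seq = fin.readline().rstrip("\r\n")
            fin.readline()  # skip the '+' separator line
            qual = fin.readline().rstrip("\r\n")
            mate1, mate2, _ = split_record(header, seq, qual, read_len)
            for out, (h, s, q) in ((f1, mate1), (f2, mate2)):
                out.write(f"{h}\n{s}\n+\n{q}\n")
```

Calling `split_fastq("reads.fastq", "reads_1.fastq", "reads_2.fastq")` (file names made up) would give two files of 36 bp reads. If the extra 5 bases turn out to sit somewhere other than the end, the slice boundaries in `split_record` would need to change.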
Thanks for any help you can provide in clearing up my confusion.
-steve