SEQanswers

sandhya 09-09-2010 11:50 PM

Fastq format and Paired-end reads
 
Dear all,

I have recently started working with sequencing data. We have paired-end reads from Illumina in FASTQ format, and I have some questions about them.

1. In the fastq format, what do the numbers in the 1st line mean?
@0:1:1:34:429
GAAGNAAAAATAAAAGCATTAGNAGAAATTTGTACA
+
IIII$IIIII&IIIIIIIIIII$IIIIIIIIIIIII

2. I see that these numbers (or 1st lines) always have a one-to-one mapping between the 2 paired datasets (i.e. for left and right reads). Is it therefore right to say that the 1st entry in dataset1 (of left reads) is paired with the 1st entry in dataset2 (of right reads), and so on?

Please advise.

maubp 09-10-2010 04:15 AM

Quote:

Originally Posted by sandhya (Post 25029)
Dear all,

I have recently started working with sequencing data. We have paired-end reads from Illumina in FASTQ format, and I have some questions about them.

1. In the fastq format, what do the numbers in the 1st line mean?
@0:1:1:34:429
GAAGNAAAAATAAAAGCATTAGNAGAAATTTGTACA
+
IIII$IIIII&IIIIIIIIIII$IIIIIIIIIIIII

In general, FASTQ identifiers, like FASTA identifiers, can mean anything. In this case, they tell you where on the slide (the flow cell) this read came from; see:
http://en.wikipedia.org/wiki/FASTQ_f...ce_identifiers
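
As a rough illustration, the identifier can be split on colons. This is only a sketch: it assumes the older Illumina layout of instrument, lane, tile, x and y, and the names ReadLocation and parse_identifier are made up for this example. Whether the first field in this particular file really is an instrument or run label is a guess.

```python
from typing import NamedTuple


class ReadLocation(NamedTuple):
    instrument: str  # assumed meaning: instrument or run label
    lane: int        # assumed meaning: flow cell lane
    tile: int        # assumed meaning: tile within the lane
    x: int           # assumed meaning: x coordinate of the cluster
    y: int           # assumed meaning: y coordinate of the cluster


def parse_identifier(header: str) -> ReadLocation:
    """Split a colon-separated FASTQ header into five assumed fields."""
    # Drop the leading '@' and anything after the first whitespace.
    name = header.lstrip("@").split()[0]
    # Drop a /1 or /2 pair suffix and a #index tag, if present.
    name = name.split("/")[0].split("#")[0]
    instrument, lane, tile, x, y = name.split(":")
    return ReadLocation(instrument, int(lane), int(tile), int(x), int(y))


loc = parse_identifier("@0:1:1:34:429")
print(loc)
```

Headers with a different number of fields will raise a ValueError here, which is a reasonable signal that the file uses some other naming scheme.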

Quote:

Originally Posted by sandhya (Post 25029)
2. I see that these numbers (or 1st lines) always have a one-to-one mapping between the 2 paired datasets (i.e. for left and right reads). Is it therefore right to say that the 1st entry in dataset1 (of left reads) is paired with the 1st entry in dataset2 (of right reads), and so on?

Please advise.

Yes, there should be a one-to-one mapping between the forward reads file and the reverse reads file, i.e. the same fragments in the same order.

P.S. It is also common for the Illumina forward reads to have a /1 suffix, and the reverse reads to have a /2 suffix. Yours don't for some reason.
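
If you want to confirm the pairing rather than trust it, one option is to walk both files in step and compare identifiers. The sketch below assumes plain 4-line FASTQ records (no wrapped sequence lines); read_ids and first_mismatch are hypothetical helper names, and the second read in the example data is made up.

```python
import io
from itertools import zip_longest
from typing import Iterator, Optional, TextIO, Tuple


def read_ids(handle: TextIO) -> Iterator[str]:
    """Yield each record's identifier, without '@' or a /1 or /2 suffix."""
    for i, line in enumerate(handle):
        # Only the first line of every 4-line record is a header.
        if i % 4 != 0 or not line.strip():
            continue
        name = line.strip()[1:].split()[0]
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name


def first_mismatch(forward: TextIO, reverse: TextIO) -> Optional[Tuple[int, Optional[str], Optional[str]]]:
    """Return (record number, forward id, reverse id) for the first bad pair, else None."""
    pairs = zip_longest(read_ids(forward), read_ids(reverse))
    for n, (fwd_id, rev_id) in enumerate(pairs, start=1):
        # zip_longest pads the shorter file with None, so unequal
        # read counts also show up as a mismatch.
        if fwd_id != rev_id:
            return n, fwd_id, rev_id
    return None


fwd = io.StringIO("@0:1:1:34:429\nGAAGN\n+\nIIII$\n@0:1:1:35:100\nACGTA\n+\nIIIII\n")
rev = io.StringIO("@0:1:1:34:429\nTTCAG\n+\nIIIII\n")
print(first_mismatch(fwd, rev))  # (2, '0:1:1:35:100', None)
```

With real data you would pass two open file handles instead of the StringIO objects. A result of None means every identifier matched in order.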

sandhya 09-10-2010 05:18 AM

Thank you, maubp, for the reply. The wiki mentions that Illumina uses a /1 or /2 suffix. Are there cases where the suffix is not present, or does this mean these are not Illumina-generated reads?

In fact, when I used Novoalign to read the fastq file, it summarised the file as 'Interpreting input files as Sanger FASTQ'. So it could be that the files are Sanger-generated. Please let me know if this is right.

simonandrews 09-11-2010 12:12 AM

Quote:

Originally Posted by sandhya (Post 25052)
In fact, when I used Novoalign to read the fastq file, it summarised the file as 'Interpreting input files as Sanger FASTQ'. So it could be that the files are Sanger-generated. Please let me know if this is right.

'Sanger FastQ' refers to the encoding scheme for the quality scores in the file. There are a few different ways this can be done and Illumina have their own encoding scheme(s), so you're probably correct in thinking these don't come directly from an Illumina run. Having said that, I think the main sequence repositories convert all qualities to Sanger encoding so it could be an Illumina file which has passed through a repository.
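
A common rule of thumb for telling the encodings apart is to look at the range of ASCII codes in the quality lines. The sketch below is only a heuristic under the usual assumptions (Sanger uses an offset of 33, the older Illumina and Solexa schemes use 64); guess_offset is a hypothetical helper, and the thresholds of 59 and 74 are conventional cut-offs, not something a tool guarantees.

```python
from typing import Iterable, Optional


def guess_offset(quality_lines: Iterable[str]) -> Optional[int]:
    """Guess the ASCII offset (33 or 64) of FASTQ quality lines, or None if unclear."""
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    # Characters below ';' (59) cannot occur in the offset-64 schemes.
    if min(codes) < 59:
        return 33
    # Characters above 'J' (74) are not expected in Sanger-encoded reads.
    if max(codes) > 74:
        return 64
    # Everything fell in the overlap between the schemes.
    return None


print(guess_offset(["IIII$IIIII&IIIIIIIIIII$IIIIIIIIIIIII"]))  # 33
```

The '$' and '&' characters in the read posted above fall below 59, so the heuristic agrees with Novoalign's reading of the file as Sanger-encoded.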

sandhya 09-13-2010 12:40 AM

I understand the sentences separately, but when I read them together I find them contradictory. Please point me to any reading material that would help me familiarise myself with these concepts. Also, what does 'main sequence repositories' mean?
Nevertheless, I was able to read in the datasets using R with the 'fastq' format. So I guess I can continue with the programming :)

simonandrews 09-13-2010 12:47 AM

Quote:

Originally Posted by sandhya (Post 25124)
I understand the sentences separately, but when I read them together I find them contradictory. Please point me to any reading material that would help me familiarise myself with these concepts.

The wikipedia article on FastQ format summarises the different versions pretty well.

Quote:

Originally Posted by sandhya (Post 25124)
Also, what does 'main sequence repositories' mean?

Places like the NCBI Short Read Archive or the European Nucleotide Archive. They keep their data in a single encoding format (Sanger) to avoid this kind of confusion, so Illumina data submitted to them will have its quality encoding changed.

Quote:

Originally Posted by sandhya (Post 25124)
Nevertheless, I was able to read in the datasets using R with the 'fastq' format. So I guess I can continue with the programming :)

It's worth checking that you used the correct options. It's possible to read quality values using the wrong encoding and get no errors, but find that you've recorded the qualities incorrectly (though probably not by much in most cases).
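
To see why no error appears, here is a small sketch that decodes the start of the quality string from the first post under two different offsets. The decode function is a hypothetical helper, and the 33 versus 64 mix-up shown is the most extreme case; the differences between the Solexa and later Illumina scales are smaller and are not modelled here.

```python
from typing import List


def decode(quality: str, offset: int) -> List[int]:
    """Convert FASTQ quality characters to integer scores using the given ASCII offset."""
    return [ord(ch) - offset for ch in quality]


qual = "IIII$IIIII&I"
as_sanger = decode(qual, 33)    # correct if the file is Sanger-encoded
as_illumina = decode(qual, 64)  # what you get if offset 64 is wrongly assumed

print(as_sanger[:6])    # [40, 40, 40, 40, 3, 40]
print(as_illumina[:6])  # [9, 9, 9, 9, -28, 9]
```

Both calls run without complaint, but only one set of numbers is right. Negative scores, as in the second line, are a strong hint that the wrong offset was used.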

sandhya 09-13-2010 01:50 AM

Oh I see. Thank you for forewarning me about that. I shall keep this in mind and see if there is a workaround for it in R.

stoker 08-16-2011 01:13 AM

Quote:

Originally Posted by maubp (Post 25048)
Yes, there should be a one-to-one mapping between the forward reads file and the reverse reads file. i.e. Same fragments in same order.

What if my paired-end FASTQ files have different numbers of reads (Illumina GAIIx)? Could you suggest any software to find the reads common to both files?

luanalirac 07-03-2013 03:54 AM

FASTQ format paired-end (R1 and R2)
 
Hello, everyone!
I have a question. I am starting to work with Illumina data, and my first challenge is to combine the R1 and R2 reads from the raw FASTQ files. How can I tell whether they have been combined? Each file is 14 MB. Should the combined file be their sum (14 MB + 14 MB = about 28 MB), or am I mistaken?

mastal 07-03-2013 03:58 AM

Quote:

Originally Posted by luanalirac (Post 109408)
Hello, everyone!
I have a question. I am starting to work with Illumina data, and my first challenge is to combine the R1 and R2 reads from the raw FASTQ files. How can I tell whether they have been combined? Each file is 14 MB. Should the combined file be their sum (14 MB + 14 MB = about 28 MB), or am I mistaken?

Yes, if you combine the files, the result should be about twice the size.
But why do you want to combine R1 and R2? What do you want to do with your data?

luanalirac 07-03-2013 04:24 AM

Quote:

Originally Posted by mastal (Post 109409)
Yes, if you combine the files, the result should be about twice the size.
But why do you want to combine R1 and R2? What do you want to do with your data?

Thank you very much, you have helped me a lot.
I want to submit the data to MG-RAST for automatic annotation.
These sequences are mRNA. I want to look at gene expression in an environmental sample, and the first step will be annotation with MG-RAST.
Do you have any suggestions?


All times are GMT -8.