SEQanswers

Old 09-09-2010, 11:50 PM   #1
sandhya
Member
 
Location: India

Join Date: Sep 2010
Posts: 11
Default Fastq format and Paired-end reads

Dear all,

I have recently started work on sequenced data. We have paired-end reads from Illumina in Fastq format and I had some questions about these.

1. In the fastq format, what do the numbers in the 1st line mean?
@0:1:1:34:429
GAAGNAAAAATAAAAGCATTAGNAGAAATTTGTACA
+
IIII$IIIII&IIIIIIIIIII$IIIIIIIIIIIII

2. I see that these numbers (the 1st lines) always have a one-to-one mapping between the two paired datasets (i.e. the left and right reads). Is it therefore right to say that the 1st entry in dataset1 (left reads) is paired with the 1st entry in dataset2 (right reads), and so on?

Please advise.
Old 09-10-2010, 04:15 AM   #2
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,542
Default

Quote:
Originally Posted by sandhya View Post
Dear all,

I have recently started work on sequenced data. We have paired-end reads from Illumina in Fastq format and I had some questions about these.

1. In the fastq format, what do the numbers in the 1st line mean?
@0:1:1:34:429
GAAGNAAAAATAAAAGCATTAGNAGAAATTTGTACA
+
IIII$IIIII&IIIIIIIIIII$IIIIIIIIIIIII
In general, FASTQ identifiers, like FASTA identifiers, can mean anything. In this case they tell you where on the slide this read came from; see:
http://en.wikipedia.org/wiki/FASTQ_f...ce_identifiers
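
To make the identifier layout concrete, here is a minimal Python sketch that reads four-line FASTQ records and splits an identifier such as 0:1:1:34:429 on colons. The field names (instrument, lane, tile, x, y) are an assumption based on the older Illumina naming scheme described in the Wikipedia article, not something the file itself states, and the function names are made up for illustration.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality) from an iterable of FASTQ lines.

    Assumes the simple layout of exactly four lines per record
    (no line-wrapped sequences).
    """
    it = (line.rstrip("\n") for line in lines)
    for header in it:
        if not header:
            continue  # skip stray blank lines between records
        seq, plus, qual = next(it, None), next(it, None), next(it, None)
        if qual is None or not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed or truncated record near {header!r}")
        yield header[1:], seq, qual


def split_identifier(identifier):
    """Split a colon-separated identifier into named fields.

    The names below are an ASSUMPTION (older Illumina convention);
    other tools may put anything they like in the identifier.
    """
    names = ("instrument", "lane", "tile", "x", "y")
    fields = identifier.split(":")
    if len(fields) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(fields)}")
    return dict(zip(names, fields))


example = [
    "@0:1:1:34:429\n",
    "GAAGNAAAAATAAAAGCATTAGNAGAAATTTGTACA\n",
    "+\n",
    "IIII$IIIII&IIIIIIIIIII$IIIIIIIIIIIII\n",
]
for identifier, seq, qual in parse_fastq(example):
    print(split_identifier(identifier))
```

If the identifier carries a /1 or /2 suffix or an index after a '#', it would need to be stripped before splitting; the sketch does not handle that.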

Quote:
Originally Posted by sandhya View Post
2. I see that these numbers (the 1st lines) always have a one-to-one mapping between the two paired datasets (i.e. the left and right reads). Is it therefore right to say that the 1st entry in dataset1 (left reads) is paired with the 1st entry in dataset2 (right reads), and so on?

Please advise.
Yes, there should be a one-to-one mapping between the forward reads file and the reverse reads file. i.e. Same fragments in same order.

P.S. It is also common for the Illumina forward reads to have a /1 suffix, and the reverse reads to have a /2 suffix. Yours don't for some reason.
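
A quick way to confirm the "same fragments in same order" assumption is to walk both files in lockstep and compare identifiers. The following is only a sketch: it assumes four lines per record with no blank lines, treats an optional /1 or /2 suffix as the only allowed difference between mates, and uses helper names invented for this example.

```python
from itertools import zip_longest


def read_ids(lines):
    """Yield the identifier (without the leading '@') of every 4-line record."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield line.strip()[1:]


def base_id(identifier):
    """Drop a trailing /1 or /2 so that mates compare equal."""
    if identifier.endswith(("/1", "/2")):
        return identifier[:-2]
    return identifier


def first_mismatch(forward_lines, reverse_lines):
    """Return the 0-based index of the first record where the two files
    disagree (different name, or one file ended early), or None if they
    pair up one-to-one."""
    pairs = zip_longest(read_ids(forward_lines), read_ids(reverse_lines))
    for index, (fwd, rev) in enumerate(pairs):
        if fwd is None or rev is None or base_id(fwd) != base_id(rev):
            return index
    return None
```

With real data you would pass two open file handles, e.g. first_mismatch(open(path_to_forward), open(path_to_reverse)), where both paths are placeholders for your own files.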
Old 09-10-2010, 05:18 AM   #3
sandhya
Member
 
Location: India

Join Date: Sep 2010
Posts: 11
Default

Thank you, maubp, for the reply. The wiki mentions that Illumina uses a /1 or /2 suffix. Are there cases where the suffix is not present, or does this mean these are not Illumina-generated reads?

In fact, when I used Novoalign to read the fastq file, it summarised the file as 'Interpreting input files as Sanger FASTQ'. So it could be that the files are Sanger-generated. Please let me know if this is right.
Old 09-11-2010, 12:12 AM   #4
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by sandhya View Post
In fact, when I used Novoalign to read the fastq file, it summarised the file as 'Interpreting input files as Sanger FASTQ'. So it could be that the files are Sanger-generated. Please let me know if this is right.
'Sanger FastQ' refers to the encoding scheme for the quality scores in the file. There are a few different ways this can be done and Illumina have their own encoding scheme(s), so you're probably correct in thinking these don't come directly from an Illumina run. Having said that, I think the main sequence repositories convert all qualities to Sanger encoding so it could be an Illumina file which has passed through a repository.
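
As a rough illustration of what "encoding scheme" means here: each quality character is an ASCII code with a fixed offset added, 33 for Sanger and 64 for the older Illumina/Solexa variants. A common heuristic, sketched below under that assumption, is to look at the lowest character seen. The cutoff of 59 (the ';' character, the lowest value the Wikipedia table lists for the +64 schemes) and the function name are choices made for this example, not part of any tool.

```python
def guess_offset(quality_strings):
    """Guess the ASCII offset of the quality encoding.

    Returns 33 if any character is too low to be a +64 encoding.
    Otherwise returns 64, which is only a guess: a file of uniformly
    high-quality Sanger-encoded reads would look the same.
    Raises ValueError if no quality characters are supplied.
    """
    lowest = min(ord(ch) for q in quality_strings for ch in q)
    if lowest < 59:
        return 33
    return 64


# Quality line from the first post in this thread.
quality = "IIII$IIIII&IIIIIIIIIII$IIIIIIIIIIIII"
print(guess_offset([quality]))  # '$' is ASCII 36, so this cannot be +64
```

On the record quoted at the top of the thread the '$' and '&' characters fall below the cutoff, which is consistent with Novoalign reporting Sanger FASTQ.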
Old 09-13-2010, 12:40 AM   #5
sandhya
Member
 
Location: India

Join Date: Sep 2010
Posts: 11
Default

I understand the sentences separately, but when I read them together I find them contradictory. Please point me to any reading material that would help me become familiar with these concepts. Also, what does 'main sequence repositories' mean?
Nevertheless, I was able to read in the datasets using R with the 'fastq' format, so I guess I can continue with the programming.
Old 09-13-2010, 12:47 AM   #6
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by sandhya View Post
I understand the sentences separately, but when I read them together I find them contradictory. Please point me to any reading material that would help me become familiar with these concepts.
The Wikipedia article on the FASTQ format summarises the different versions pretty well.

Quote:
Originally Posted by sandhya View Post
Also, what does 'main sequence repositories' mean?
Places like the NCBI Short Read Archive or the European Nucleotide Archive. They keep their data in a single encoding format (Sanger) to avoid this kind of confusion, so Illumina data submitted to them will have its quality encoding changed.

Quote:
Originally Posted by sandhya View Post
Nevertheless, I was able to read in the datasets using R with the 'fastq' format, so I guess I can continue with the programming.
It's worth checking that you used the correct options. It's possible to read quality values using the wrong encoding and get no errors, but find that you've recorded the qualities incorrectly (though probably not by much in most cases).
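
To show how a wrong encoding can slip through without an error, here is a small sketch that decodes the quality line from the first post with both offsets. It is plain arithmetic rather than any particular package's reader, and decode_quality is a name invented for the example.

```python
def decode_quality(quality, offset=33):
    """Convert a FASTQ quality string to a list of integer scores by
    subtracting the encoding offset from each character code."""
    return [ord(ch) - offset for ch in quality]


quality = "IIII$IIIII&IIIIIIIIIII$IIIIIIIIIIIII"

as_sanger = decode_quality(quality, offset=33)
as_plus64 = decode_quality(quality, offset=64)

# Both calls succeed; only the numbers differ.
print(as_sanger[:5])
print(as_plus64[:5])

# Negative scores are a hint that the offset was too large.
print(min(as_plus64) < 0)
```

Nothing fails in either case, so the only warning sign is implausible values such as negative scores. How far off the numbers are depends on which two encodings get confused.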
Old 09-13-2010, 01:50 AM   #7
sandhya
Member
 
Location: India

Join Date: Sep 2010
Posts: 11
Default

Oh I see. Thank you for forewarning me about that. I shall keep this in mind and see if there is a workaround for it in R.
Old 08-16-2011, 01:13 AM   #8
stoker
Member
 
Location: Poland

Join Date: Oct 2010
Posts: 17
Default

Quote:
Originally Posted by maubp View Post
Yes, there should be a one-to-one mapping between the forward reads file and the reverse reads file. i.e. Same fragments in same order.
What if my paired-end fastq files have different numbers of reads (Illumina GAIIx)? Could you suggest any software to find the reads common to both files?
__________________
Tomasz Stokowy
www.sequencing.io.gliwice.pl
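
For the question above, one simple approach is to keep only the reads whose identifier appears in both files. The sketch below assumes four lines per record, identifiers that match once an optional /1 or /2 suffix is removed, and files small enough to hold in memory; the function names are hypothetical and this is not a substitute for a dedicated tool.

```python
def read_records(lines):
    """Group an iterable of lines into 4-line FASTQ records (lists of str)."""
    stripped = [line.rstrip("\n") for line in lines]
    return [stripped[i:i + 4] for i in range(0, len(stripped) - 3, 4)]


def base_id(header):
    """Identifier without the leading '@', any description after
    whitespace, or a trailing /1 or /2."""
    name = header[1:].split()[0]
    return name[:-2] if name.endswith(("/1", "/2")) else name


def keep_common(forward_lines, reverse_lines):
    """Return (forward_records, reverse_records) restricted to reads present
    in both inputs, both in the order they appear in the forward input."""
    fwd = read_records(forward_lines)
    rev_by_id = {base_id(rec[0]): rec for rec in read_records(reverse_lines)}
    paired_fwd = [rec for rec in fwd if base_id(rec[0]) in rev_by_id]
    paired_rev = [rev_by_id[base_id(rec[0])] for rec in paired_fwd]
    return paired_fwd, paired_rev
```

Writing each returned record back out as four lines gives two files that again have the same fragments in the same order.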
Old 07-03-2013, 03:54 AM   #9
luanalirac
Junior Member
 
Location: São Paulo

Join Date: Jul 2013
Posts: 2
Post FASTQ format paired-end (R1 and R2)

Hello, everyone!
I have a question. I am starting to work with Illumina data, and my first challenge is to combine the R1 and R2 reads from the raw FASTQ files. How can I tell whether they have been combined? Each file is 14 Mb. Should the combined file be the sum of the two (14 Mb + 14 Mb = about 28 Mb), or am I mistaken?
Old 07-03-2013, 03:58 AM   #10
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default

Quote:
Originally Posted by luanalirac View Post
Hello, everyone!
I have a question. I am starting to work with Illumina data, and my first challenge is to combine the R1 and R2 reads from the raw FASTQ files. How can I tell whether they have been combined? Each file is 14 Mb. Should the combined file be the sum of the two (14 Mb + 14 Mb = about 28 Mb), or am I mistaken?

Yes, if you combine the files, they should be about twice the size,
but why do you want to combine R1 and R2? What do you want to do with your data?
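
If "combine" is taken to mean interleaving, i.e. writing each R1 record followed immediately by its R2 mate, a minimal sketch looks like this. Whether a given downstream tool wants interleaved input is not something this thread establishes, so treat the approach, and the function name, as assumptions.

```python
def interleave(forward_lines, reverse_lines):
    """Return one list of lines with records alternating R1, R2, R1, R2...

    Assumes four lines per record and that both inputs hold the same
    fragments in the same order.
    """
    fwd = [line.rstrip("\n") for line in forward_lines]
    rev = [line.rstrip("\n") for line in reverse_lines]
    if len(fwd) != len(rev) or len(fwd) % 4 != 0:
        raise ValueError("inputs must have equal length, a multiple of 4 lines")
    combined = []
    for i in range(0, len(fwd), 4):
        combined.extend(fwd[i:i + 4])  # forward record
        combined.extend(rev[i:i + 4])  # its mate from the reverse file
    return combined
```

The output has exactly as many lines as the two inputs together, which is why the combined file comes out at roughly twice the size of either one.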
Old 07-03-2013, 04:24 AM   #11
luanalirac
Junior Member
 
Location: São Paulo

Join Date: Jul 2013
Posts: 2
Smile

Quote:
Originally Posted by mastal View Post
Yes, if you combine the files, they should be about twice the size,
but why do you want to combine R1 and R2? What do you want to do with your data?
Thank you very much, you have helped me a lot.
I want to submit the data to MG-RAST for automatic annotation!
These sequences are mRNA. I want to look at gene expression in an environmental sample, and the first step will be annotation with MG-RAST.
Do you have any suggestions?

Last edited by luanalirac; 07-03-2013 at 04:57 AM.