  • Fastq format and Paired-end reads

    Dear all,

    I have recently started work on sequenced data. We have paired-end reads from Illumina in Fastq format and I had some questions about these.

    1. In the fastq format, what do the numbers in the 1st line mean?
    @0:1:1:34:429
    GAAGNAAAAATAAAAGCATTAGNAGAAATTTGTACA
    +
    IIII$IIIII&IIIIIIIIIII$IIIIIIIIIIIII

    2. I see that these numbers (or 1st lines) always have a one-to-one mapping between the 2 paired datasets (i.e. for left and right reads). Is it therefore right to say that the 1st entry in dataset1 (left reads) is paired with the 1st entry in dataset2 (right reads), and so on?

    Please advise.

  • #2
    Originally posted by sandhya
    Dear all,

    I have recently started work on sequenced data. We have paired-end reads from Illumina in Fastq format and I had some questions about these.

    1. In the fastq format, what do the numbers in the 1st line mean?
    @0:1:1:34:429
    GAAGNAAAAATAAAAGCATTAGNAGAAATTTGTACA
    +
    IIII$IIIII&IIIIIIIIIII$IIIIIIIIIIIII
    In general, FASTQ identifiers, like FASTA identifiers, can mean anything. In this case, they tell you where on the slide this read came from, see:


    Originally posted by sandhya
    2. I see that these numbers (or 1st lines) always have a one-to-one mapping between the 2 paired datasets (i.e. for left and right reads). Is it therefore right to say that the 1st entry in dataset1 (left reads) is paired with the 1st entry in dataset2 (right reads), and so on?

    Please advise.
    Yes, there should be a one-to-one mapping between the forward reads file and the reverse reads file, i.e. the same fragments in the same order.

    P.S. It is also common for the Illumina forward reads to have a /1 suffix, and the reverse reads to have a /2 suffix. Yours don't for some reason.
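A quick way to confirm the "same fragments in same order" property is to walk both files in step and compare the identifiers. The following is only a minimal sketch: it assumes plain 4-line FASTQ records (no wrapped sequence lines, no blank lines inside records), and the helper names `read_ids` and `first_mismatch` are made up for illustration.

```python
from itertools import zip_longest


def read_ids(lines):
    """Yield each record's identifier without the '@' and without a /1 or /2 suffix.

    Assumes every record occupies exactly four lines.
    """
    records = (line.strip() for line in lines)
    for i, line in enumerate(records):
        if i % 4 == 0:  # the first line of each 4-line record is the header
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def first_mismatch(forward_lines, reverse_lines):
    """Return the index of the first record whose names differ, or None if all pair up."""
    pairs = zip_longest(read_ids(forward_lines), read_ids(reverse_lines))
    for n, (fwd, rev) in enumerate(pairs):
        if fwd != rev:  # also catches one file ending before the other
            return n
    return None
```

With real files you could pass two open file handles, e.g. `first_mismatch(open("reads_1.fastq"), open("reads_2.fastq"))` (file names hypothetical). A result of `None` means every record paired up.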


    • #3
      Thank you, maubp, for the reply. In the wiki it is mentioned that Illumina uses a /1 or /2 suffix. Are there cases when the suffix is not present, or does this mean these are not Illumina-generated reads?

      In fact, when I used Novoalign to read the fastq file, it summarised the file as 'Interpreting input files as Sanger FASTQ'. So it could be that the files are Sanger-generated. Please let me know if this is right.


      • #4
        Originally posted by sandhya
        In fact, when I used Novoalign to read the fastq file, it summarised the file as 'Interpreting input files as Sanger FASTQ'. So it could be that the files are Sanger-generated. Please let me know if this is right.
        'Sanger FastQ' refers to the encoding scheme for the quality scores in the file. There are a few different ways this can be done and Illumina have their own encoding scheme(s), so you're probably correct in thinking these don't come directly from an Illumina run. Having said that, I think the main sequence repositories convert all qualities to Sanger encoding so it could be an Illumina file which has passed through a repository.


        • #5
          I understand the sentences separately, but when I read them together I find them contradictory. Please let me know about any reading material to familiarise myself with these concepts. Also, what does 'main sequence repositories' mean?
          Nevertheless, I was able to read in the datasets using R with the 'fastq' format. So I guess I can continue with the programming.


          • #6
            Originally posted by sandhya
            I understand the sentences separately, but when I read them together I find them contradictory. Please let me know about any reading material to familiarise myself with these concepts.
            The Wikipedia article on the FASTQ format summarises the different versions pretty well.

            Originally posted by sandhya
            Also, what does 'main sequence repositories' mean?
            Places like the NCBI Short Read Archive or the European Nucleotide Archive. They keep their data in a single encoding format (Sanger) to avoid this kind of confusion, so Illumina data submitted to them will have its quality encoding changed.

            Originally posted by sandhya
            Nevertheless, I was able to read in the datasets using R with the 'fastq' format. So I guess I can continue with the programming.
            It's worth checking that you used the correct options. It's possible to read quality values using the wrong encoding and get no errors, but find that you've recorded the qualities incorrectly (though probably not by much in most cases).
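One way to check is to look at the lowest quality character in the file. As a rough sketch (the function names `guess_offset` and `decode` are invented here, and the heuristic is an assumption rather than a guarantee): Sanger encoding adds 33 to each score, while the older Illumina/Solexa encodings add 64, so any character below ';' (ASCII 59) should only be possible with the +33 offset. The quality line from the first post contains '$' (ASCII 36), which points to Sanger encoding.

```python
def guess_offset(quality_strings):
    """Guess the ASCII offset of FASTQ quality strings from the lowest character seen.

    Characters below ';' (ASCII 59) cannot occur in the +64 encodings,
    so they indicate +33. Otherwise +64 is plausible but not certain,
    because uniformly high-quality +33 data would look the same.
    """
    lowest = min(ord(c) for q in quality_strings for c in q)
    return 33 if lowest < 59 else 64


def decode(quality, offset):
    """Convert a quality string to a list of integer scores."""
    return [ord(c) - offset for c in quality]


# Quality line from the first post in this thread.
example = "IIII$IIIII&IIIIIIIIIII$IIIIIIIIIIIII"
offset = guess_offset([example])
scores = decode(example, offset)
print(offset, min(scores), max(scores))
```

Decoding the same string with the wrong offset (64) would give negative scores for '$' and '&', which is an easy sanity check to run on whatever your reader returns.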


            • #7
              Oh I see. Thank you for forewarning me about that. I shall keep this in mind and see if there is a workaround for it in R.


              • #8
                Originally posted by maubp
                Yes, there should be a one-to-one mapping between the forward reads file and the reverse reads file, i.e. the same fragments in the same order.
                What if my paired-end FASTQ files contain different numbers of reads (Illumina GAIIx)? Could you suggest any software to find the reads common to both files?
                Tomasz Stokowy
                www.sequencing.io.gliwice.pl
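For what it's worth, the common part can be found by matching read names between the two files. Below is a minimal sketch only, not a recommendation of any particular tool: it assumes 4-line FASTQ records, assumes mates share the same name apart from an optional /1 or /2 suffix, and holds the whole reverse file in memory. The names `parse_fastq`, `pair_key` and `common_pairs` are hypothetical.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) tuples from 4-line FASTQ records."""
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            # record = [header, sequence, '+' line, quality]
            yield record[0][1:], record[1], record[3]
            record = []


def pair_key(name):
    """Reduce a read name to the part shared by both mates."""
    fields = name.split()
    key = fields[0] if fields else ""
    return key[:-2] if key.endswith(("/1", "/2")) else key


def common_pairs(forward_lines, reverse_lines):
    """Yield (forward_record, reverse_record) for reads present in both files.

    Pairs come out in forward-file order; reads without a mate are dropped.
    """
    reverse = {pair_key(n): (n, s, q) for n, s, q in parse_fastq(reverse_lines)}
    for name, seq, qual in parse_fastq(forward_lines):
        mate = reverse.get(pair_key(name))
        if mate is not None:
            yield (name, seq, qual), mate
```

For large runs the dictionary may not fit in memory, in which case sorting both files by name and merging them would be the more scalable variant of the same idea.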


                • #9
                  FASTQ format paired-end (R1 and R2)

                  Hello, everyone!
                  I have a question. I am starting to work with Illumina data, and my first challenge is to combine the R1 and R2 reads from the raw FASTQ data. How can I tell whether they have been combined? Each file is 14 MB. Should the combined file be the sum (14 MB + 14 MB = about 28 MB), or am I mistaken?


                  • #10
                    Originally posted by luanalirac
                    Hello, everyone!
                    I have a question. I am starting to work with Illumina data, and my first challenge is to combine the R1 and R2 reads from the raw FASTQ data. How can I tell whether they have been combined? Each file is 14 MB. Should the combined file be the sum (14 MB + 14 MB = about 28 MB), or am I mistaken?

                    Yes, if you combine the files, they should be about twice the size,
                    but why do you want to combine R1 and R2? What do you want to do with your data?
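If "combine" means putting both mates into one file, one common convention is an interleaved FASTQ, where each R1 record is followed directly by its R2 mate. Here is a minimal sketch under the assumption that both files hold the same reads in the same order as 4-line records; the function name `interleave` is made up, and whether a given downstream tool accepts interleaved input is something to check in its own documentation.

```python
def interleave(forward_lines, reverse_lines):
    """Return the lines of an interleaved FASTQ: R1 record, then its R2 mate, and so on.

    Both inputs are read fully into memory, so this suits small files only.
    """
    fwd = list(forward_lines)
    rev = list(reverse_lines)
    if len(fwd) != len(rev) or len(fwd) % 4 != 0:
        raise ValueError("inputs must hold the same number of complete 4-line records")
    combined = []
    for i in range(0, len(fwd), 4):
        combined.extend(fwd[i:i + 4])  # forward read
        combined.extend(rev[i:i + 4])  # its reverse mate
    return combined
```

The output contains every line of both inputs exactly once, which is why the combined file comes out at roughly the sum of the two sizes.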


                    • #11
                      Originally posted by mastal
                      Yes, if you combine the files, they should be about twice the size,
                      but why do you want to combine R1 and R2? What do you want to do with your data?
                      Thank you very much, you helped me a lot.
                      I want to submit the data to MG-RAST for automatic annotation.
                      These sequences are mRNA. I want to look at gene expression in an environmental sample, and the first step will be annotation with MG-RAST.
                      Do you have any suggestions?
                      Last edited by luanalirac; 07-03-2013, 04:57 AM.
