
  • Strange headers of NextSeq 500 fastq reads

    Hello everyone,
    I have some reads from a NextSeq 500 in fastq format with headers structured like this:
    @ERR1136327.6 NS500217:127:H72WTBGXX:2:11203:22066:4060/1
    This doesn't match the common fastq header structure (Casava 1.8): @<instrument-name>:<run ID>:<flowcell ID>:<lane-number>:<tile-number>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<barcode sequence>. Nor does it fit the older standard, which looked like “@HWUSI-EAS100R:6:73:941:1973#0/1”. Do you know what the items in this header mean? I'm especially intrigued by the last number after the slash.

    Thanks in advance.
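For anyone who wants to pull the header from the question apart programmatically, here is a minimal Python sketch. It assumes the part after the space follows the first half of the Casava 1.8 layout (instrument, run, flowcell, lane, tile, x, y) with an optional /1 or /2 suffix. The names `ILLUMINA_NAME` and `parse_header` are made up for this illustration.

```python
import re

# Seven colon-separated fields, optionally followed by a /1 or /2 mate suffix.
# The field order is an assumption based on the Casava 1.8 layout quoted above.
ILLUMINA_NAME = re.compile(
    r"^(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell>[^:\s]+):"
    r"(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)(?:/(?P<mate>[12]))?$"
)


def parse_header(line):
    """Split a FASTQ header into named fields (hypothetical helper)."""
    tokens = line.strip().lstrip("@").split()
    for position, token in enumerate(tokens):
        match = ILLUMINA_NAME.match(token)
        if match:
            fields = match.groupdict()
            # A token in front of the Illumina name is taken to be the
            # archive-assigned read ID, e.g. ERR1136327.6.
            fields["archive_id"] = tokens[0] if position > 0 else None
            return fields
    return None  # header does not follow the assumed layout


fields = parse_header("@ERR1136327.6 NS500217:127:H72WTBGXX:2:11203:22066:4060/1")
print(fields)
```

The older "@HWUSI-EAS100R:6:73:941:1973#0/1" style has fewer fields, so this sketch deliberately returns None for it rather than guessing.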

  • #2
    Did you download this data from SRA (fastq-dump)?

    If you use the option

    -F | --origfmt Defline contains only original sequence name.
    You should be able to retrieve fastq headers in the original Illumina format.

    BTW: NextSeq data requires processing by bcl2fastq v.2.x, the successor to older versions of CASAVA/bcl2fastq (v.1.x).
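As a rough sketch of how the option above could be used from a script: the snippet below only assembles the fastq-dump call with `--origfmt` for the accession discussed in this thread. It assumes sra-tools is installed and on PATH if you actually run the command. `build_fastq_dump_command` is a hypothetical helper, not part of sra-tools.

```python
import shlex


def build_fastq_dump_command(accession, original_names=True):
    """Assemble a fastq-dump call as an argument list (hypothetical helper)."""
    command = ["fastq-dump"]
    if original_names:
        # --origfmt is the long form of the -F option quoted above.
        command.append("--origfmt")
    command.append(accession)
    return command


command = build_fastq_dump_command("ERR1136327")
print(shlex.join(command))

# To run it for real (requires sra-tools and network access):
# import subprocess
# subprocess.run(command, check=True)
```

Building the argument list separately keeps the example runnable without the tool installed; the actual download is left commented out on purpose.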



    • #3
      ERR1136327.6 is an identifier assigned by the nucleotide archives (SRA or ENA). I think .6 is the read number.

      I'm guessing NS500 means it's the NextSeq 500, so H72WTBGXX is probably the flow cell ID.

      Have a look at pages 62-64 of the NextSeq system guide for a description of the flow cell and the camera, swath, tile and lane numbers.



      • #4
        Correction: I've been looking at the file, and H72WTBGXX is probably not the flow cell, as each read has a different set of numbers/letters for that part of the header.



        • #5
          Here is a direct link to the fastq version of the file at EBI SRA: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/E...36327.fastq.gz

          On taking a deeper look, something strange appears to be going on with this file. It looks like the data may come from more than one machine/flowcell.

          I see these three (what appear to be) machine IDs:
          Code:
          HSQ700642
          M00282
          NS500217
          and multiple possible flowcell IDs:

          Code:
          H3LYMBGXX
          H3MKGBGXX
          H72GCBGXX
          H72W7BGXX
          H72WTBGXX
          H7BRNADXX
          H88PCADXX
          H8FU7ADXX
          H8JGMADXX
          On top of this there may also be something wrong with the fastq format of the file.

          You should check with SRA and/or with the data submitters to confirm.
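A tally like the one above can be reproduced with a few lines of Python. This is only a sketch: it assumes a well-formed four-line FASTQ (which may not hold for this particular file) and that the last whitespace-separated token of each header is the colon-delimited Illumina name, with the instrument in field 1 and the flowcell in field 3. `tally_sources` is a made-up name, and the two demo records are toy data; only the first header comes from this thread.

```python
import io
from collections import Counter


def tally_sources(handle):
    """Count instrument and flowcell IDs found in FASTQ header lines."""
    instruments, flowcells = Counter(), Counter()
    for line_number, line in enumerate(handle):
        if line_number % 4 != 0:  # assumes strict 4-line records
            continue
        tokens = line.strip().lstrip("@").split()
        if not tokens:
            continue
        parts = tokens[-1].split(":")
        if len(parts) >= 7:  # instrument:run:flowcell:lane:tile:x:y
            instruments[parts[0]] += 1
            flowcells[parts[2]] += 1
    return instruments, flowcells


# Toy input: the second record is invented purely for illustration.
demo = io.StringIO(
    "@ERR1136327.6 NS500217:127:H72WTBGXX:2:11203:22066:4060/1\nACGT\n+\nIIII\n"
    "@ERR1136327.7 M00282:1:FLOWCELLX:1:1101:1000:2000/1\nACGT\n+\nIIII\n"
)
instruments, flowcells = tally_sources(demo)
print(sorted(instruments), sorted(flowcells))
```

For a real download you would pass a handle such as `gzip.open(path, "rt")` instead of the `io.StringIO` demo. More than one key in either counter suggests the file mixes runs.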



          • #6
            Thank you for your answers! In case someone ends up in the same situation (which is rather unlikely): I wrote to the first author of the study. The study looked at ancient human DNA, which is preserved as very short fragments, generally even shorter than a typical NextSeq 500 read. When such short fragments are sequenced from both ends, the two reads are largely identical, so the researchers merged them. This explains why the fastq headers end in /1, like the headers of the first mate of paired-end reads, even though there is only a single file; as the author wrote, it should be treated as single-end reads. Other details about the EBI fastq header format can be found here: http://www.ebi.ac.uk/ena/submit/read-data-format.

            Another strange thing in this story: the author wrote that they never uploaded fastq files to the database, only bam files. So EBI probably generated the fastq files automatically from the bam files 0_o. This is weird, but it could also partly explain the structure of the fastq headers.
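A quick way to check whether a file like this really contains only /1 names is sketched below. It assumes you have already collected the header lines into a list. `mate_suffix_counts` and `looks_single_end` are hypothetical helper names.

```python
from collections import Counter


def mate_suffix_counts(headers):
    """Count how many read names end in /1, /2, or neither."""
    counts = Counter()
    for header in headers:
        name = header.strip()
        if name.endswith("/1"):
            counts["/1"] += 1
        elif name.endswith("/2"):
            counts["/2"] += 1
        else:
            counts["none"] += 1
    return counts


def looks_single_end(counts):
    """True when no /2 names are present, i.e. there are no second mates."""
    return counts["/2"] == 0


counts = mate_suffix_counts(
    ["@ERR1136327.6 NS500217:127:H72WTBGXX:2:11203:22066:4060/1"]
)
print(counts, looks_single_end(counts))
```

A file with /1 names and no /2 names at all fits the description above: merged reads that should be handled as single-end.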



            • #7
              Thanks for the explanation.

              Did the authors say whether they actually "merged" data from three different Illumina sequencers (HiSeq, MiSeq and NextSeq) and multiple flowcells into one file (in addition to merging R1/R2 reads)? Based on the flowcell IDs, that appears to be so. I have not seen data merged like this before.

              EBI makes the fastq files available for samples in most cases. People tend to have issues with SRA archives at times, and this is a nice fallback to get the reads directly.



              • #8
                You were right, these fastq files resulted from merging data from various runs that were made on different sequencers. So these files are totally artificial, automatically generated from downstream-processed files; they are not raw reads.

