**GenoMax** · 01-09-2016, 04:04 PM

Did you download this data from SRA (fastq-dump)?

If you use the option

-F | --origfmt Defline contains only original sequence name.

You should be able to retrieve the fastq headers in the original Illumina format.
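As a minimal sketch of how that option could be used from a script: the helper name `build_fastq_dump_cmd` is made up for illustration, the accession is the one discussed in this thread, and the SRA Toolkit is assumed to be installed and on the PATH.

```python
import shlex
from typing import List


def build_fastq_dump_cmd(accession: str) -> List[str]:
    """Build a fastq-dump command that keeps the original read names."""
    # --origfmt (short form: -F) makes the defline contain only the
    # original sequence name instead of the archive-assigned one.
    return ["fastq-dump", "--origfmt", accession]


cmd = build_fastq_dump_cmd("ERR1136327")
print(shlex.join(cmd))  # prints: fastq-dump --origfmt ERR1136327

# To actually run the download (needs the SRA Toolkit installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems if the accession ever comes from user input.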

BTW: NextSeq data requires processing by bcl2fastq v.2.x, the successor to older versions of CASAVA/bcl2fastq (v.1.x).

**mastal** · 01-09-2016, 04:10 PM

ERR1136327.6 is a number given by the nucleotide archives (SRA or ENA). I think .6 is the read number.

I'm guessing NS500 means it's the NextSeq 500, so H72WTBGXX is probably the flow cell ID.

Have a look at pages 62-64 of the NextSeq system guide for a description of the flow cell and the camera, swath, tile and lane numbers.

https://support.illumina.com/content/dam/illumina-support/documents/documentation/system_documentation/nextseq/nextseq-500-system-guide-15046563-01.pdf
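To make the field guessing above a bit more concrete, here is a rough sketch that splits such a header into its parts. It assumes the usual Illumina layout `instrument:run:flowcell:lane:tile:x:y` following the archive-assigned ID; the example header and the names `ReadName` and `parse_header` are invented for illustration, and the real headers in this file may well differ.

```python
import re
from typing import NamedTuple, Optional


class ReadName(NamedTuple):
    archive_id: str   # e.g. ERR1136327.6, assigned by SRA/ENA
    instrument: str
    run: str
    flowcell: str
    lane: str
    tile: str
    x: str
    y: str


def parse_header(line: str) -> Optional[ReadName]:
    """Split '@<archive id> <illumina name>[/1|/2]' into its fields.

    Returns None if the line does not match the assumed layout.
    """
    parts = line.strip().lstrip("@").split()
    if len(parts) < 2:
        return None
    archive_id, original = parts[0], parts[1]
    # Drop a trailing /1 or /2 mate suffix if present.
    original = re.sub(r"/[12]$", "", original)
    fields = original.split(":")
    if len(fields) < 7:
        return None
    return ReadName(archive_id, *fields[:7])


# Hypothetical header, built only to show the assumed layout.
example = "@ERR1136327.6 NS500217:33:H72WTBGXX:1:11101:10000:1000/1"
print(parse_header(example))
```

If the instrument or flowcell field changes from read to read, that suggests the file mixes several runs.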

**mastal** · 01-09-2016, 04:34 PM

Correction: I've been looking at the file, and H72WTBGXX is probably not the flow cell, as each read has a different set of numbers/letters in that part of the header.

**GenoMax** · 01-09-2016, 05:51 PM

Here is a direct link for fastq version of the file at EBI SRA: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/E...36327.fastq.gz

On taking a deeper look, something strange appears to be going on with this file. It looks like the data may come from more than one machine/flowcell.

I see these three (what appear to be) machine IDs

Code:

HSQ700642
M00282
NS500217

and multiple possible flowcell IDs

Code:

H3LYMBGXX
H3MKGBGXX
H72GCBGXX
H72W7BGXX
H72WTBGXX
H7BRNADXX
H88PCADXX
H8FU7ADXX
H8JGMADXX

On top of this there may also be something wrong with the fastq format of the file.

You should check with SRA and/or with the data submitters to confirm.
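A tally like the one above could be produced with something along these lines. This is only a sketch: it assumes well-formed 4-line fastq records (which may not hold for this file) and headers of the form `@<archive id> <instrument>:<run>:<flowcell>:...`. The function names and the demo records are made up, and the instrument/flowcell pairings in the demo are not taken from the real data.

```python
import gzip
from collections import Counter
from itertools import islice
from typing import Iterable, Tuple


def tally_ids(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count instrument and flowcell IDs seen in fastq header lines."""
    instruments, flowcells = Counter(), Counter()
    # Assumes strict 4-line records, so every 4th line is a header.
    for header in islice(lines, 0, None, 4):
        tokens = header.strip().split()
        # Pick the token that looks like a colon-separated Illumina name.
        name = next((t for t in tokens if t.count(":") >= 6), None)
        if name is None:
            continue  # header does not match the assumed layout
        fields = name.lstrip("@").split(":")
        instruments[fields[0]] += 1
        flowcells[fields[2]] += 1
    return instruments, flowcells


def tally_ids_in_file(path: str) -> Tuple[Counter, Counter]:
    """Same tally, reading a gzip-compressed fastq file."""
    with gzip.open(path, "rt") as handle:
        return tally_ids(handle)


# Made-up records for demonstration only.
demo = [
    "@ERR1136327.1 NS500217:33:H72WTBGXX:1:11101:1000:2000/1", "ACGT", "+", "IIII",
    "@ERR1136327.2 HSQ700642:12:H7BRNADXX:2:1101:1500:2500/1", "ACGT", "+", "IIII",
    "@ERR1136327.3 NS500217:33:H72WTBGXX:1:11101:1001:2001/1", "ACGT", "+", "IIII",
]
instruments, flowcells = tally_ids(demo)
print(instruments.most_common())
print(flowcells.most_common())
```

If the file is not in strict 4-line format, filtering on lines that start with `@` and contain a colon-separated name would be a more forgiving, if sloppier, alternative.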

**azzzkita** · 01-11-2016, 01:45 AM

Thank you for your answers! In case someone ends up in the same situation (which is rather unlikely): I wrote to the first author of this study. The study looked at ancient human DNA, which is preserved as very short fragments, generally even shorter than a typical NextSeq 500 read. When such short fragments are sequenced from both ends, the two reads are largely the same, so the researchers merged them. This explains why the fastq headers end in /1, like the headers of the first mate of paired-end reads, even though there is only a single file; as the author wrote, the reads should be treated as single-end.

Other details about the EBI fastq header format can be found here: http://www.ebi.ac.uk/ena/submit/read-data-format.

Another strange thing in this story is that the author wrote that they never uploaded fastq files to the database, only bam. So EBI probably generated the fastq files automatically from the bam files 0_o. This is weird, but it could also partly explain the structure of the fastq headers.

**GenoMax** · 01-11-2016, 02:13 AM

Thanks for the explanation.

Did the authors say if they actually "merged" data from three different Illumina sequencers (HiSeq, MiSeq and NextSeq) and multiple flowcells into one file (in addition to merging R1/R2 reads)? Based on the flowcell IDs, that appears to be so. I have not seen data merged like this before.

EBI makes fastq files available for most samples. People tend to have issues with SRA archives at times, and this is a nice fallback to get the reads directly.

**azzzkita** · 01-13-2016, 03:08 AM

You were right, these fastq files resulted from merging data from various runs, which were made on different sequencers. So these files are totally artificial, automatically generated from downstream-processed files; they are not raw reads.

Strange headers of NextSeq 500 fastq reads
