SEQanswers

01-09-2016, 01:34 PM   #1
azzzkita
Junior Member
 
Location: St.Petersburg

Join Date: Jan 2016
Posts: 3
Strange headers of NextSeq 500 fastq reads

Hello everyone,
I have some reads from a NextSeq 500 in fastq format with headers structured like this:
@ERR1136327.6 NS500217:127:H72WTBGXX:2:11203:22066:4060/1
This doesn't match the common fastq header structure (CASAVA 1.8): @<instrument-name>:<run ID>:<flowcell ID>:<lane-number>:<tile-number>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<barcode sequence>. Nor does it fit the older standard, which looked like “@HWUSI-EAS100R:6:73:941:1973#0/1”. Do you know what the items in this header mean? I'm especially intrigued by the last number after the slash.

Thanks in advance.
01-09-2016, 03:04 PM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,030

Did you download this data from SRA (fastq-dump)?

If you use the option

Quote:
-F | --origfmt Defline contains only original sequence name.
you should be able to retrieve the fastq headers in the original Illumina format.
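
A minimal sketch of how that option could be used from a script, assuming the SRA Toolkit is installed and fastq-dump is on the PATH. The function names are hypothetical, and only the --origfmt flag quoted above is used:

```python
import shutil
import subprocess
from typing import List


def origfmt_command(accession: str) -> List[str]:
    """Build a fastq-dump command line that keeps the original read names."""
    # --origfmt is the long form of -F quoted above.
    return ["fastq-dump", "--origfmt", accession]


def dump_with_original_headers(accession: str) -> None:
    """Run fastq-dump for one accession; raises if the tool is missing or fails."""
    if shutil.which("fastq-dump") is None:
        raise RuntimeError("fastq-dump (SRA Toolkit) was not found on PATH")
    subprocess.run(origfmt_command(accession), check=True)
```

Calling dump_with_original_headers("ERR1136327") would then write the fastq output for the run discussed in this thread, with the archive-added prefix dropped from the read names.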

BTW: NextSeq data requires processing by bcl2fastq v.2.x, the successor to older versions of CASAVA/bcl2fastq (v.1.x).
01-09-2016, 03:10 PM   #3
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667

ERR1136327 is the run accession assigned by the sequence archives (SRA or ENA). I think .6 is the read number within that run.

I'm guessing NS500 means it's the NextSeq 500, so H72WTBGXX is probably the flow cell ID.

Have a look at pages 62-64 of the NextSeq system guide for a description of the flow cell and the camera, swath, tile and lane numbers.

https://support.illumina.com/content...5046563-01.pdf
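
To make the guesses above concrete, here is a rough Python sketch that splits a header of this shape into named fields. It assumes the layout is an optional archive ID, then instrument:run:flowcell:lane:tile:x:y, then an optional /1 or /2 suffix. The field names and the parse_header function are illustrative choices, and other header styles are simply not matched:

```python
import re
from typing import NamedTuple, Optional


class ReadHeader(NamedTuple):
    archive_id: Optional[str]  # e.g. "ERR1136327.6", added by the archive
    instrument: str
    run_number: int
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    mate: Optional[int]        # trailing "/1" or "/2", if present


# Assumed layout: "@[archive_id ]instrument:run:flowcell:lane:tile:x:y[/mate]"
HEADER_RE = re.compile(
    r"^@(?:(?P<archive_id>\S+)\s+)?"
    r"(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell>[^:\s]+):"
    r"(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"(?:/(?P<mate>[12]))?$"
)


def parse_header(line: str) -> Optional[ReadHeader]:
    """Return the parsed fields, or None if the line has a different layout."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        return None
    mate = m.group("mate")
    return ReadHeader(
        archive_id=m.group("archive_id"),
        instrument=m.group("instrument"),
        run_number=int(m.group("run")),
        flowcell=m.group("flowcell"),
        lane=int(m.group("lane")),
        tile=int(m.group("tile")),
        x=int(m.group("x")),
        y=int(m.group("y")),
        mate=int(mate) if mate is not None else None,
    )
```

For the header from the first post, this yields instrument NS500217, run 127, flowcell H72WTBGXX, lane 2, tile 11203, x 22066, y 4060 and mate 1. The older “@HWUSI-EAS100R:6:73:941:1973#0/1” style returns None.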
01-09-2016, 03:34 PM   #4
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667

Correction: I've been looking at the file, and H72WTBGXX is probably not the flow cell ID, as different reads have different sets of numbers/letters in that part of the header.
01-09-2016, 04:51 PM   #5
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,030

Here is a direct link for fastq version of the file at EBI SRA: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/E...36327.fastq.gz

On taking a deeper look, something strange appears to be going on with this file. It looks like the data may come from more than one machine/flowcell.

I see these three (what appear to be) machine IDs:
Code:
HSQ700642
M00282
NS500217
and multiple possible flowcell IDs:

Code:
H3LYMBGXX
H3MKGBGXX
H72GCBGXX
H72W7BGXX
H72WTBGXX
H7BRNADXX
H88PCADXX
H8FU7ADXX
H8JGMADXX
On top of this there may also be something wrong with the fastq format of the file.

You should check with SRA and/or with the data submitters to confirm.
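
For anyone who wants to repeat this kind of check, here is a rough Python sketch of one way to tally the IDs. It assumes the Illumina part of each header is the last whitespace-separated token and has at least seven colon-separated fields, with the instrument first and the flowcell third. It also assumes a well-formed four-line fastq, which may not hold for this particular file. The function names and the file path in the usage note are hypothetical:

```python
import gzip
from collections import Counter
from typing import Iterable, Iterator, Tuple


def fastq_headers(path: str) -> Iterator[str]:
    """Yield every 4th line of a gzipped fastq (assumes strict 4-line records)."""
    with gzip.open(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0:
                yield line


def tally_sources(header_lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count instrument and flowcell IDs seen in the given header lines."""
    instruments: Counter = Counter()
    flowcells: Counter = Counter()
    for line in header_lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        fields = tokens[-1].split(":")
        if len(fields) < 7:
            continue  # not the assumed instrument:run:flowcell:... layout
        instruments[fields[0].lstrip("@")] += 1
        flowcells[fields[2]] += 1
    return instruments, flowcells
```

Something like tally_sources(fastq_headers("ERR1136327.fastq.gz")) would then return two counters whose keys can be compared with the lists above.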
01-11-2016, 12:45 AM   #6
azzzkita
Junior Member
 
Location: St.Petersburg

Join Date: Jan 2016
Posts: 3

Thank you for your answers! In case someone ends up in the same situation (which is rather unlikely): I wrote to the first author of the study. It examined ancient human DNA, which is preserved only as very short fragments, generally even shorter than the average NextSeq 500 read length. When such short fragments are sequenced from both ends, the two reads cover largely the same sequence, so the researchers merged them. This explains why the fastq headers end in /1, like the headers of the first mate of paired-end reads, even though there is only a single file; as the author wrote, it should be treated as single-end reads. Other details about the EBI fastq header format can be found here: http://www.ebi.ac.uk/ena/submit/read-data-format.

Another strange thing in this story is that the author wrote that they never uploaded fastq files to the database, only BAM. So EBI probably generated the fastq files automatically from the BAM files 0_o. This is weird, but it could also partly explain the structure of the fastq headers.
01-11-2016, 01:13 AM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,030

Thanks for the explanation.

Did the authors say whether they actually "merged" data from three different Illumina sequencers (HiSeq, MiSeq and NextSeq) and multiple flowcells into one file (in addition to merging R1/R2 reads)? Based on the flowcell IDs, that appears to be the case. I have not seen data merged like this before.

EBI makes fastq files available for most samples. People tend to have issues with SRA archives at times, and this is a nice fallback for getting the reads directly.
01-13-2016, 02:08 AM   #8
azzzkita
Junior Member
 
Location: St.Petersburg

Join Date: Jan 2016
Posts: 3

You were right: these fastq files resulted from merging data from various runs made on different sequencers. So the files are entirely artificial, generated automatically from downstream-processed files; they are not raw reads.