SEQanswers

SEQanswers > Sequencing Technologies/Companies > Illumina/Solexa
Old 01-20-2013, 04:56 PM   #1
rnaeye
Member
 
Location: East Coast

Join Date: May 2011
Posts: 79
Default Miseq FASTQ sequence identifier missing index read?

Hi,
Recently I have been using the MiSeq, and I have noticed that the FASTQ header is missing the index read that HiSeq/GAII output had. For example:

@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=

It seems that the index sequence is replaced by the sample number. When I ran 4 barcoded samples (indexes 4, 6, 8, and 12), I got something like the following:
1:N:0:1
1:N:0:2
1:N:0:3
1:N:0:4

Is there a way to make the software print index sequences instead of the sample number? What is the reason for this change? To make files smaller? Thank you.
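For anyone scripting against these files: the sample number is simply the last colon-separated field of the header comment. A minimal Python sketch of pulling the header apart, using the example above (the field labels are my own descriptive names, not official Illumina terminology):

```python
# Parse a CASAVA 1.8-style FASTQ header into its fields.
# Field names are descriptive labels, not official Illumina terms.
def parse_header(header):
    name, comment = header.lstrip("@").split(" ")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    read, is_filtered, control, sample = comment.split(":")
    return {"instrument": instrument, "lane": lane, "tile": tile,
            "read": read, "filtered": is_filtered, "sample": sample}

fields = parse_header("@SIM:1:FCX:1:15:6329:1045 1:N:0:2")
print(fields["sample"])  # prints: 2
```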
Old 01-20-2013, 05:48 PM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Quote:
Originally Posted by rnaeye View Post
It seems that the index sequence is replaced by the sample number...

Is there a way to make the software print index sequences instead of the sample number? What is the reason for this change? To make files smaller? Thank you.
Yes, they do report the sample ID in place of the barcode. There is no way to alter this behavior in the MiSeq analysis software. If you want the files to match the output of the HiSeq, you can run the MiSeq BCL files through CASAVA to create FASTQ files.

Why did they do this? Simply to be difficult I think.
Old 01-21-2013, 04:03 AM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

Would it not be simpler to replace the index numbers with the actual sequences?
Old 01-21-2013, 04:05 AM   #4
mcnelson.phd
Senior Member
 
Location: Connecticut

Join Date: Jul 2011
Posts: 162
Default

I'm curious as to why you would need the index sequence in the sequence header as opposed to a number? If it's because you have custom scripts that separate the sequences based on the index, then it's just as easy to modify them to handle a number. Or you can do a simple awk/perl script to replace the number with the sequence if you absolutely must have that info in the header, much simpler than running CASAVA.
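To make the awk/perl idea concrete, here is a rough Python equivalent that swaps the trailing sample number in each header for its index sequence. The sample-to-index map is invented for illustration (take it from your own sample sheet), and note this writes the nominal sequence, not an observed one:

```python
# Replace the trailing sample number in each FASTQ header with the
# nominal index sequence. The map below is hypothetical; fill it in
# from your sample sheet.
INDEX_MAP = {"1": "TGACCA", "2": "GCCAAT", "3": "ACTTGA", "4": "CTTGTA"}

def relabel(fastq_lines, index_map=INDEX_MAP):
    out = []
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:  # header line of each 4-line FASTQ record
            head, sep, sample = line.rpartition(":")
            out.append(head + sep + index_map.get(sample, sample))
        else:
            out.append(line)
    return out
```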

BTW, if you want the full index read file, there's a flag that you can add into the MiSeqReporter.config xml file.
Old 01-21-2013, 04:56 AM   #5
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Quote:
Originally Posted by GenoMax View Post
Would it not be simpler to replace the index numbers with the actual sequences?
Quote:
Originally Posted by mcnelson.phd View Post
I'm curious as to why you would need the index sequence in the sequence header as opposed to a number? If it's because you have custom scripts that separate the sequences based on the index, then it's just as easy to modify them to handle a number. Or you can do a simple awk/perl script to replace the number with the sequence if you absolutely must have that info in the header, much simpler than running CASAVA.
But this would not be equivalent to what you get from CASAVA. Your (GenoMax & mcnelson) suggestion is to replace the index ID with the nominal index sequence. What CASAVA records in this field is the OBSERVED index sequence, thus if you are permitting mismatches in the index, the mismatched sequence is what is reported. And running CASAVA BclToFastq on a MiSeq run takes very little time.

Quote:
BTW, if you want the full index read file, there's a flag that you can add into the MiSeqReporter.config xml file.
Interesting mcn. I wasn't aware of this flag. It seems as though this output may be similar to what CASAVA is reporting.
Old 01-21-2013, 05:13 AM   #6
mcnelson.phd
Senior Member
 
Location: Connecticut

Join Date: Jul 2011
Posts: 162
Default

Quote:
Originally Posted by kmcarr View Post
Interesting mcn. I wasn't aware of this flag. It seems as though this output may be similar to what CASAVA is reporting.
If you choose to use that flag for Reporter, it gives you the actual index read file(s). During sample demultiplexing, Reporter does use an error-correction scheme to assign samples that have identifiable and correctable errors, which means that for the most part sequences are assigned to their correct sample.

Now, I have looked at the quality metrics for a number of index read files, and it's quite disturbing how poor index read quality is in many cases. Not only do we constantly see low-level phiX contamination, but I've also seen obvious sample cross-contamination in some genomes we did once. We're working with a group looking at speciation in very closely related archaeal strains, and I've recommended to them that we do "manual" demultiplexing using my own scripts to reduce the level of cross-contamination.
Old 01-21-2013, 05:21 AM   #7
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,077
Default

Quote:
Originally Posted by kmcarr View Post
But this would not be equivalent to what your get from CASAVA. Your (GenoMax & mcnelson) suggestion is to replace the index ID with the nominal index sequence. What CASAVA records in this field is the OBSERVED index sequence, thus if you are permitting mismatches in the index, the mismatched sequence is what is reported. And running CASAVA BclToFastq on a MiSeq run takes very little time.
Good catch. I missed that finer technical point when I thought about the simplest solution for numerical indexes.

If the OP does not have easy access to MiSeq/CASAVA, then simple sequence replacement would still be a practical workaround.
Old 01-22-2013, 07:24 AM   #8
bbeitzel
Member
 
Location: Ft. Detrick, MD

Join Date: Aug 2008
Posts: 50
Default

Quote:
Originally Posted by mcnelson.phd View Post
Now, I have looked at the quality metrics for a number of index read files, and it's quite disturbing how poor index read quality is in many cases. Not only do we constantly see low-level phiX contamination, but I've also seen obvious sample cross-contamination in some genomes we did once. We're working with a group looking at speciation in very closely related archaeal strains, and I've recommended to them that we do "manual" demultiplexing using my own scripts to reduce the level of cross-contamination.
We are seeing the same thing on our MiSeq runs. We were doing some pathogen identification runs and were seeing cross-contamination in demultiplexed reads (i.e. reads from "known" samples run on the same flow cell were showing up in reads from "unknowns"). At first we thought that we were cross-contaminating during library prep, but we also see a lot of PhiX showing up in unknowns. If we were somehow cross-contaminating during library prep, the PhiX should still show up in the unindexed reads file. The fact that it shows up with indexed reads makes me think that it is a problem with demultiplexing. Forcing the index reads to have average quality > Q30 before demultiplexing cleans this up somewhat, but not completely.
Old 01-22-2013, 07:53 AM   #9
mcnelson.phd
Senior Member
 
Location: Connecticut

Join Date: Jul 2011
Posts: 162
Default

Quote:
Originally Posted by bbeitzel View Post
We are seeing the same thing on our MiSeq runs. We were doing some pathogen identification runs and were seeing cross-contamination in demultiplexed reads (i.e. reads from "known" samples run on the same flow cell were showing up in reads from "unknowns"). At first we thought that we were cross-contaminating during library prep, but we also see a lot of PhiX showing up in unknowns. If we were somehow cross-contaminating during library prep, the PhiX should still show up in the unindexed reads file. The fact that it shows up with indexed reads makes me think that it is a problem with demultiplexing. Forcing the index reads to have average quality > Q30 before demultiplexing cleans this up somewhat, but not completely.
With the phiX a lot of that is due to the lack of an index on the phiX v3 control DNA. For amplicons where we use a lot of phiX in the run, we were using the v2 phiX that comes with the Multiplexing kit for the HiSeq as that has a TruSeq index on it and thus the overall quality of all index reads was much better. But now with the MiSeq hardware upgrades we've had issues the two times we tried to use the indexed v2 phiX because of fragment size so we're stuck using the v3 phiX without the index.

What appears to happen, and this is for both the index and the reads themselves, is that RTA seems to assign A's by default to clusters where it can't determine what the sequence is because there's either no signal (e.g. phiX during the index read, or if you sequenced fully through a small fragment). So for phiX, during the index read, most of those clusters get AAAAAA..., but in some cases the cluster is close enough to another one that has an index and that signal gets picked up for the phiX, hence the faulty assignment.

The only way I see around this is to make sure that all fragments on the flow cell have an index, and that they're all error-correcting (which I believe the TruSeq and Nextera indices are), and then do a quality pass on the index before demultiplexing. Our examination showed that Q30 was overly strict, getting rid of too many reads, while Q20 kept >90%. There's still very low-level contamination, but I guess that's just something we'll have to live with unless Illumina drastically changes their sequencing methodology.
Old 01-28-2013, 01:57 PM   #10
aboyfromnowhere
Junior Member
 
Location: Mississippi

Join Date: Jan 2013
Posts: 3
Default

Quote:
Originally Posted by bbeitzel View Post
We are seeing the same thing on our MiSeq runs. We were doing some pathogen identification runs and were seeing cross-contamination in demultiplexed reads (i.e. reads from "known" samples run on the same flow cell were showing up in reads from "unknowns"). At first we thought that we were cross-contaminating during library prep, but we also see a lot of PhiX showing up in unknowns. If we were somehow cross-contaminating during library prep, the PhiX should still show up in the unindexed reads file. The fact that it shows up with indexed reads makes me think that it is a problem with demultiplexing. Forcing the index reads to have average quality > Q30 before demultiplexing cleans this up somewhat, but not completely.
Been dealing with this exact same problem today, finding PhiX reads in our de novo assemblies. Probably a dumb question, but how do you force the index reads to have an average quality > Q30?

What we're more concerned about though is sequence from one strain somehow being indexed/assembled with another. Does anyone know of a way to check for/prevent this? Thanks.
Old 01-28-2013, 03:46 PM   #11
mcnelson.phd
Senior Member
 
Location: Connecticut

Join Date: Jul 2011
Posts: 162
Default

Unfortunately, there's no way to force Reporter to do any filtering based on the quality of the index read, so you're forced to do it with your own scripts. It's not really that hard: average quality over all index bases must be > X, no bases can have a quality < Y, index must have 0 ambiguous bases.
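Those three checks are only a few lines in any language. A Python sketch, where the thresholds stand in for the X and Y above and are placeholders, not recommendations:

```python
# Keep a read only if its index read passes all three checks:
# average quality > min_avg, no base below min_base, no ambiguous bases.
def index_passes(index_seq, index_qual, min_avg=20, min_base=3):
    quals = [ord(c) - 33 for c in index_qual]  # Phred+33 encoding
    return ("N" not in index_seq.upper()
            and sum(quals) / len(quals) > min_avg
            and min(quals) >= min_base)
```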

As far as cross-contamination, it depends on how much of a problem you have as to whether or not it will affect your assemblies. Given that any contamination should be pretty low, I just ignore it for most of our de novo assembly and resequencing/mapping runs. If I were doing variant calling though, I'd have to implement some sort of index quality cut-off based on how closely related the strains are expected to be.
Old 01-28-2013, 04:02 PM   #12
aboyfromnowhere
Junior Member
 
Location: Mississippi

Join Date: Jan 2013
Posts: 3
Default

Quote:
Originally Posted by mcnelson.phd View Post
Unfortunately, there's no way to force Reporter to do any filtering based on the quality of the index read, so you're forced to do it with your own scripts. It's not really that hard: average quality over all index bases must be > X, no bases can have a quality < Y, index must have 0 ambiguous bases.

As far as cross-contamination, it depends on how much of a problem you have as to whether or not it will affect your assemblies. Given that any contamination should be pretty low, I just ignore it for most of our de novo assembly and resequencing/mapping runs. If I were doing variant calling though, I'd have to implement some sort of index quality cut-off based on how closely related the strains are expected to be.
Hey, thanks for the reply. I'm going to have to break out some coding books then - trying to learn some perl at the moment.

For PhiX we got a single 5386 bp contig at between 70X and 300X coverage, depending on the run (that was from sequencing 5 and 2 S. pneumoniae genomes, respectively). So given that we've got a perfect-size contig at pretty high coverage, we're pretty nervous about cross-contamination from the strains themselves. Will give the index quality filtering a try though, to see if that has an effect. Thanks.

EDIT: Is this something you do with paired-end reads? If so, once you've deleted a read due to low index quality, how do you deal with its paired read in the corresponding file, given that you need to keep the order the same?

Last edited by aboyfromnowhere; 01-28-2013 at 04:23 PM.
Old 01-29-2013, 04:04 AM   #13
mcnelson.phd
Senior Member
 
Location: Connecticut

Join Date: Jul 2011
Posts: 162
Default

Quote:
Originally Posted by aboyfromnowhere View Post
Hey, thanks for the reply. I'm going to have to break out some coding books then - trying to learn some perl at the moment.

For PhiX we got a single 5386 bp contig at between 70X and 300X coverage, depending on the run (that was from sequencing 5 and 2 S. pneumoniae genomes, respectively). So given that we've got a perfect-size contig at pretty high coverage, we're pretty nervous about cross-contamination from the strains themselves. Will give the index quality filtering a try though, to see if that has an effect. Thanks.

EDIT: Is this something you do with paired-end reads? If so, once you've deleted a read due to low index quality, how do you deal with its paired read in the corresponding file, given that you need to keep the order the same?
Paired-end should have no real effect on the index quality scores. I guess index 2 could have lower average quality because of the re-synthesis, but I've never looked at that.

I've been using a custom script that uses bowtie2 as its back-end to map and remove any phiX reads from 16S runs that we do. It doesn't catch every read, but by my estimate it's effective at removing >90% of all phiX reads from a sample.

If you want to go the quick route to get a script put together for index quality filtering, I'd suggest using a shell script that takes advantage of one of the many quality trimming tools already available. You would essentially do a quality trim on the index read(s), then you can filter the reads with a bad index out of your read 1/2 files, then proceed to demultiplexing. FastX toolkit can handle the quality trimming of the index and the demultiplexing, and then all you need is a simple filtering script that shouldn't be too hard to whip up in perl. Wrap it all up in a shell script to make it all work in one go and there you have it.
Old 01-29-2013, 05:31 AM   #14
aboyfromnowhere
Junior Member
 
Location: Mississippi

Join Date: Jan 2013
Posts: 3
Default

Quote:
Originally Posted by mcnelson.phd View Post
Paired-end should have no real effect on the index quality scores. I guess index 2 could have lower average quality because of the re-synthesis, but I've never looked at that.

I've been using a custom script that uses bowtie2 as its back-end to map and remove any phiX reads from 16S runs that we do. It doesn't catch every read, but by my estimate it's effective at removing >90% of all phiX reads from a sample.

If you want to go the quick route to get a script put together for index quality filtering, I'd suggest using a shell script that takes advantage of one of the many quality trimming tools already available. You would essentially do a quality trim on the index read(s), then you can filter the reads with a bad index out of your read 1/2 files, then proceed to demultiplexing. FastX toolkit can handle the quality trimming of the index and the demultiplexing, and then all you need is a simple filtering script that shouldn't be too hard to whip up in perl. Wrap it all up in a shell script to make it all work in one go and there you have it.
No, I didn't mean it would affect the index score of the paired read. My understanding was that for assembly, paired reads need to be in the same position in the forward and reverse files to be recognised as pairs. If you filter one of them (say the forward read) out due to a poor index score, do you just delete the reverse to maintain the order for the other reads in the file?

I'll give the FastX/shell script method a go. Thanks for the suggestions.
Old 01-29-2013, 05:35 AM   #15
mcnelson.phd
Senior Member
 
Location: Connecticut

Join Date: Jul 2011
Posts: 162
Default

Quote:
Originally Posted by aboyfromnowhere View Post
No, I didn't mean it would affect the index score of the paired read. My understanding was that for assembly, paired reads need to be in the same position in the forward and reverse files to be recognised as pairs. If you filter one of them (say the forward read) out due to a poor index score, do you just delete the reverse to maintain the order for the other reads in the file?
If you're filtering based on the index quality, you would have to filter out both read 1 and read 2. That's actually pretty simple because you're removing the same reads from both files as opposed to having to keep read 1 and read 2 unified while doing sequence trimming on those files.
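Because the R1, R2, and index files come off the machine in the same record order, the filter can be purely positional. A rough Python sketch of the idea (the average-quality threshold is an arbitrary placeholder):

```python
# Find which record positions fail the index-quality check, then drop
# those same positions from both the R1 and R2 record lists, so the
# two files stay synchronized.
def bad_positions(index_quals, min_avg=20):
    bad = set()
    for i, qual in enumerate(index_quals):
        phred = [ord(c) - 33 for c in qual]  # Phred+33
        if sum(phred) / len(phred) < min_avg:
            bad.add(i)
    return bad

def drop_positions(records, bad):
    return [rec for i, rec in enumerate(records) if i not in bad]
```

Applying drop_positions with the same bad set to the R1 and R2 record lists leaves the surviving pairs in matching order.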
Old 02-06-2013, 09:50 AM   #16
obenauer
Junior Member
 
Location: Memphis, TN

Join Date: May 2009
Posts: 4
Default

We have been having multiplexing problems too, with reads matching one influenza strain showing up barcoded as a different strain. After months of troubleshooting, I found the explanation of indexes being mis-assigned due to mixed clusters on the flow cell helpful, as found in Kircher et al., Nucleic Acids Research 2011, 40:e3, "Double indexing overcomes inaccuracies in multiplex sequencing on the Illumina platform".

You can filter all the reads, whether single reads or paired ends, by their index base qualities to reduce sample-to-sample contamination. I have programs for this, filter_multiplexed_reads.pl and filter_multiplexed_pairs.pl. I don't know if it's possible to post them here, but send me an email if any of you are interested (john dot obenauer at stjude dot org).

A major disadvantage of filtering this way is that you throw a lot of good data out with the bad. In a recent MiSeq run, filtering removed 25% of the reads. Older data from the Genome Analyzer 2 was even worse, with 90% or more of the reads being removed.

An alternative we have been trying is sparse-matrix double barcoding, described in that same Kircher 2011 paper. Using the 96-plex Nextera XT kit (catalog FC-131-1002), we use only 24 (or fewer) of the barcode combinations at a time. This way, if mixed clusters cause mis-assignment of one of the two barcodes, it will create an invalid barcode pair that the Illumina software removes by default. We're still evaluating this, but the hope is that it will clean up our multiplexed data.
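The sparse-matrix idea boils down to a whitelist of allowed (index 1, index 2) pairs: any observed pair outside the whitelist, such as one produced when a mixed cluster corrupts a single index, is discarded rather than mis-assigned. A toy Python sketch, with invented sequences and sample names:

```python
# Sparse dual-index whitelist: only pre-chosen (i7, i5) combinations
# map to a sample; any other combination is rejected as a likely
# mis-read of one of the two indexes. Sequences/names are hypothetical.
PAIR_TO_SAMPLE = {
    ("ATCACG", "TAGATC"): "sample_A",
    ("CGATGT", "CTCTCT"): "sample_B",
}

def assign_sample(i7_seq, i5_seq, pairs=PAIR_TO_SAMPLE):
    return pairs.get((i7_seq, i5_seq))  # None means discard the read
```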
Old 05-25-2015, 09:40 AM   #17
leleon
Junior Member
 
Location: Chile

Join Date: Jan 2012
Posts: 1
Default

Quote:
Originally Posted by obenauer View Post
You can filter all the reads, whether single reads or paired ends, by their index base qualities to reduce sample-to-sample contamination. I have programs for this, filter_multiplexed_reads.pl and filter_multiplexed_pairs.pl. I don't know if it's possible to post them here, but send me an email if any of you are interested (john dot obenauer at stjude dot org).
Dear obenauer,

Are you still able to share these scripts? If so, do you have another email address?
Thanks in advance

Luis
Old 05-25-2015, 12:21 PM   #18
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I wrote a program for filtering reads by barcode quality, available with BBMap. It requires you to have the barcodes as a fastq file, and is run in two steps:

Code:
mergebarcodes.sh in=reads.fq bar=barcodes.fq out=merged.fq

filterbarcodes.sh in=merged.fq out=clean.fq maq=15
...where "maq" is the minimum average Phred quality of barcodes to retain. In my testing, removal of about 25% of the reads reduced cross-contamination by about 50%, which is not great but not terrible either. Discarding chastity-failed reads and reads with barcode mismatches seems more important.