#1
Member
Location: East Cost
Join Date: May 2011
Posts: 79
Hi,
Recently, I have been using MiSeq. I have noticed that the fastq header is missing the index read that HiSeq/GAII included. For example:
@SIM:1:FCX:1:15:6329:1045 1:N:0:2
TCGCACTCAACGCCCTGCATATGACAAGACAGAATC
+
<>;##=><9=AAAAAAAAAA9#:<#<;<<<????#=
It seems that the index sequence is replaced by the sample number. When I ran 4 barcoded samples (indexes 4, 6, 8, and 12), I got something like the following:
1:N:0:1
1:N:0:2
1:N:0:3
1:N:0:4
Is there a way to make the software print index sequences instead of sample numbers? What is the reason for this change? To make files smaller?
Thank you.
#2
Senior Member
Location: USA, Midwest
Join Date: May 2008
Posts: 1,178
Quote:
Why did they do this? Simply to be difficult, I think.
#3
Senior Member
Location: East Coast USA
Join Date: Feb 2008
Posts: 7,087
Would it not be simpler to replace the index numbers with the actual sequences?
#4
Senior Member
Location: Connecticut
Join Date: Jul 2011
Posts: 162
I'm curious as to why you would need the index sequence in the sequence header as opposed to a number. If it's because you have custom scripts that separate the sequences based on the index, then it's just as easy to modify them to handle a number. Or you can use a simple awk/perl script to replace the number with the sequence if you absolutely must have that info in the header, which is much simpler than running CASAVA.
BTW, if you want the full index read file, there's a flag that you can add to the MiSeqReporter.config XML file.
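For anyone who wants the number-to-sequence replacement described above, here is a minimal Python sketch rather than awk/perl. It assumes headers like the one in the first post, where the last colon-separated field is the sample number, and exactly four lines per record. The `SAMPLE_TO_INDEX` table and both function names are made up for illustration, and the sequences in the table are placeholders; take the real ones from your own sample sheet.

```python
# Placeholder mapping from sample number to index sequence (hypothetical values;
# replace them with the entries from your own sample sheet).
SAMPLE_TO_INDEX = {"1": "TGACCA", "2": "GCCAAT", "3": "ACTTGA", "4": "CTTGTA"}


def relabel_header(header, sample_to_index):
    """Swap the trailing sample number in a FASTQ header for its index sequence."""
    header = header.rstrip("\n")
    prefix, sep, sample = header.rpartition(":")
    if not sep or sample not in sample_to_index:
        return header  # unknown sample number: leave the header untouched
    return prefix + sep + sample_to_index[sample]


def relabel_fastq(lines, sample_to_index):
    """Yield FASTQ lines, relabelling the header (first line) of each 4-line record."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield relabel_header(line, sample_to_index) if i % 4 == 0 else line
```

With a real file you would loop over `relabel_fastq(open("reads.fastq"), SAMPLE_TO_INDEX)` and write each line out (the file name is only an example). Only the header line of each record is touched, so quality strings containing colons are safe.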
#5
Senior Member
Location: USA, Midwest
Join Date: May 2008
Posts: 1,178
Quote:
Quote:
Quote:
#6
Senior Member
Location: Connecticut
Join Date: Jul 2011
Posts: 162
Quote:
Now, I have looked at the quality metrics for a number of index read files, and it's quite disturbing how poor index read quality is in many cases. Not only do we constantly see low-level phiX contamination, but I've also seen obvious sample cross-contamination in some genomes we did once. We're working with a group looking at speciation in very closely related archaeal strains, and I've recommended to them that we do "manual" demultiplexing using my own scripts to reduce the level of cross-contamination.
#7
Senior Member
Location: East Coast USA
Join Date: Feb 2008
Posts: 7,087
Quote:
If OP does not have easy access to MiSeq/CASAVA, then simple sequence replacements would still be a practical workaround.
#8
Member
Location: Ft. Detrick, MD
Join Date: Aug 2008
Posts: 50
Quote:
#9
Senior Member
Location: Connecticut
Join Date: Jul 2011
Posts: 162
Quote:
What appears to happen, for both the index and the reads themselves, is that RTA seems to assign A's by default to clusters where it can't determine the sequence because there's no signal (e.g. phiX during the index read, or if you sequenced fully through a small fragment). So for phiX, during the index read, most of those clusters get AAAAAA..., but in some cases the cluster is close enough to another one that has an index, and that signal gets picked up for the phiX, hence the faulty assignment.
The only way I see around this is to make sure that all fragments on the flow cell have an index, and that the indices are all error-correcting (which I believe the TruSeq and Nextera indices are), and then do a quality pass on the index before demultiplexing. Our examination showed that a Q30 cut-off was overly strict, getting rid of too many reads, while a Q20 cut-off kept >90%. There's still very low-level contamination, but I guess that's just something we'll have to live with unless Illumina drastically changes their sequencing methodology.
#10
Junior Member
Location: Mississippi
Join Date: Jan 2013
Posts: 3
Quote:
What we're more concerned about, though, is sequence from one strain somehow being indexed/assembled with another. Does anyone know of a way to check for/prevent this? Thanks.
#11
Senior Member
Location: Connecticut
Join Date: Jul 2011
Posts: 162
Unfortunately, there's no way to force Reporter to do any filtering based on the quality of the index read, so you're forced to do it with your own scripts. It's not really that hard: average quality over all index bases must be > X, no bases can have a quality < Y, and the index must have 0 ambiguous bases.
As far as cross-contamination goes, whether or not it will affect your assemblies depends on how much of a problem you have. Given that any contamination should be pretty low, I just ignore it for most of our de novo assembly and resequencing/mapping runs. If I were doing variant calling though, I'd have to implement some sort of index quality cut-off based on how closely related the strains are expected to be.
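The three criteria above (mean quality > X, no base below Y, no ambiguous bases) can be sketched roughly like this in Python. The function names are hypothetical, the default thresholds of 20 and 10 are placeholders rather than recommended values, and the Phred+33 quality encoding is an assumption that you should check against your own files.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


def index_passes(seq, quality, min_mean=20, min_base=10, offset=33):
    """Return True if an index read meets all three criteria.

    min_mean plays the role of X and min_base the role of Y; both defaults
    are placeholders to be tuned per run.
    """
    # Criterion 3: no ambiguous bases (anything other than A, C, G, T).
    if any(base not in "ACGT" for base in seq.upper()):
        return False
    scores = phred_scores(quality, offset)
    if not scores:
        return False  # an empty index read cannot pass
    # Criterion 1: mean quality strictly greater than X.
    # Criterion 2: no single base with quality below Y.
    return sum(scores) / len(scores) > min_mean and min(scores) >= min_base
```

For variant calling on closely related strains you would presumably raise both thresholds; for de novo assembly they could stay loose or be skipped.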
#12
Junior Member
Location: Mississippi
Join Date: Jan 2013
Posts: 3
Quote:
For PhiX we got a single 5386 bp contig, at between 70X and 300X coverage, depending on the run (so that was sequencing 5 and 2 S. pneumoniae genomes, respectively). So given that we've got a perfect-size contig at pretty high coverage, we're pretty nervous about cross-contamination from the strains themselves. Will give the index quality filtering a try though, to see if that has an effect. Thanks.
EDIT: Is this something you do with paired-end reads? If so, once you've deleted a read due to low index quality, how do you deal with its paired read in the corresponding file, given that you need to keep the order the same?
Last edited by aboyfromnowhere; 01-28-2013 at 05:23 PM.
#13
Senior Member
Location: Connecticut
Join Date: Jul 2011
Posts: 162
Quote:
I've been using a custom script that uses bowtie2 as its back-end to map and remove any phiX reads from 16S runs that we do. It doesn't catch every read, but by my estimate it's effective at removing >90% of all phiX reads from a sample.
If you want to go the quick route to get a script put together for index quality filtering, I'd suggest using a shell script that takes advantage of one of the many quality trimming tools already available. You would essentially do a quality trim on the index read(s), then filter the reads with a bad index out of your read 1/2 files, then proceed to demultiplexing. FastX toolkit can handle the quality trimming of the index and the demultiplexing, and then all you need is a simple filtering script that shouldn't be too hard to whip up in perl. Wrap it all up in a shell script to make it all work in one go and there you have it.
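On keeping paired files in step while filtering: one way is to read the index file and both read files in lockstep and keep or drop each pair as a unit. Below is a minimal Python sketch of that idea, not the perl/FastX pipeline described above. It assumes the three FASTQ files list the same clusters in the same order with four lines per record, and that `keep_index` is whatever pass/fail test you choose for the index read; all names here are hypothetical.

```python
from itertools import islice


def read_records(lines):
    """Yield (header, sequence, plus, quality) tuples from 4-line FASTQ records."""
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def read_id(header):
    """Return the part of the header before the first space."""
    return header.split()[0]


def filter_pairs(index_lines, r1_lines, r2_lines, keep_index):
    """Yield (read1, read2) record pairs whose index read satisfies keep_index."""
    # zip stops at the shortest input, so a length mismatch is not detected here.
    triples = zip(read_records(index_lines), read_records(r1_lines), read_records(r2_lines))
    for idx, r1, r2 in triples:
        if not (read_id(idx[0]) == read_id(r1[0]) == read_id(r2[0])):
            raise ValueError("input files are out of sync at " + idx[0])
        # Both mates are kept or dropped together, so the output order stays matched.
        if keep_index(idx[1], idx[3]):
            yield r1, r2
```

Because each pair is decided by a single test on its index read, read 1 and read 2 can never drift apart, which is the concern raised in the EDIT above.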
#14
Junior Member
Location: Mississippi
Join Date: Jan 2013
Posts: 3
Quote:
I'll give the FastX/shell script method a go. Thanks for the suggestions.
#15
Senior Member
Location: Connecticut
Join Date: Jul 2011
Posts: 162
Quote:
#16
Junior Member
Location: Memphis, TN
Join Date: May 2009
Posts: 4
We have been having multiplexing problems too, with reads matching one influenza strain showing up barcoded as a different strain. After months of troubleshooting, I found the explanation of indexes being mis-assigned due to mixed clusters on the flow cell helpful, as found in Kircher et al., Nucleic Acids Research 2011, 40:e3, "Double indexing overcomes inaccuracies in multiplex sequencing on the Illumina platform".
You can filter all the reads, whether single reads or paired ends, by their index base qualities to reduce sample-to-sample contamination. I have programs for this, filter_multiplexed_reads.pl and filter_multiplexed_pairs.pl. I don't know if it's possible to post them here, but send me an email if any of you are interested (john dot obenauer at stjude dot org).
A major disadvantage of filtering this way is that you throw a lot of good data out with the bad. In a recent MiSeq run, filtering removed 25% of the reads. Older data from the Genome Analyzer 2 was even worse, with 90% or more of the reads being removed.
An alternative we have been trying is sparse-matrix double barcoding, described in that same Kircher 2011 paper. Using the 96-plex Nextera XT kit (catalog FC-131-1002), we use only 24 (or fewer) of the barcode combinations at a time. This way, if mixed clusters cause mis-assignment of one of the two barcodes, it will create an invalid barcode pair that the Illumina software removes by default. We're still evaluating this, but the hope is that it will clean up our multiplexed data.
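To illustrate why a sparse layout helps, here is a small Python sketch that counts how many single-barcode mix-ups would turn one used pair into another used pair (and so go unnoticed). It is only a toy model of the idea, not the Illumina software's logic; the function names and the `i7_1`/`i5_1` labels are made up.

```python
def assign_sample(i7, i5, valid_pairs):
    """Return the sample for an (i7, i5) pair, or None if that pair was not used."""
    return valid_pairs.get((i7, i5))


def undetected_swaps(valid_pairs):
    """Count single-barcode swaps that turn one used pair into another used pair."""
    i7_set = {i7 for i7, _ in valid_pairs}
    i5_set = {i5 for _, i5 in valid_pairs}
    count = 0
    for i7, i5 in valid_pairs:
        # Swap only the i7 barcode, then only the i5 barcode.
        count += sum((other, i5) in valid_pairs for other in i7_set if other != i7)
        count += sum((i7, other) in valid_pairs for other in i5_set if other != i5)
    return count


# Sparse layout: every barcode is used in exactly one pair.
sparse = {("i7_1", "i5_1"): "strain_A", ("i7_2", "i5_2"): "strain_B"}
# Full layout: every combination of the same barcodes is used.
full = {(a, b): a + "+" + b for a in ("i7_1", "i7_2") for b in ("i5_1", "i5_2")}
```

In the sparse layout any single swap gives an unused pair, so `assign_sample` returns `None` and the read can be discarded; in the full layout every swap lands on another sample. A 24-of-96 design sits somewhere in between, depending on which combinations are chosen.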
#17
Junior Member
Location: Chile
Join Date: Jan 2012
Posts: 1
Quote:
Can you still share these scripts? If so, do you have another email address?
Thanks in advance,
Luis
#18
Super Moderator
Location: Walnut Creek, CA
Join Date: Jan 2014
Posts: 2,707
I wrote a program for filtering reads by barcode quality, available with BBMap. It requires you to have the barcodes as a fastq file, and is run in two steps:
Code:
mergebarcodes.sh in=reads.fq bar=barcodes.fq out=merged.fq
filterbarcodes.sh in=merged.fq out=clean.fq maq=15