SEQanswers

11-22-2011, 06:33 AM   #1
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,283
Using RTA to re-process images.

I made a cursory attempt to use RTA to re-process a run from the image files (.tif). I'll go into the whys and wherefores of doing such a thing below. I just wanted to note that in my hands nothing productive came out of it.

Just in case my description might be of use to future explorers of this brave new realm:
It was easy to find the directory where RTA resides. I double-clicked on it and the GUI launched. I pointed it at the image directory I wanted to process and hit start. RTA told me that the run had already been processed and declined to run again. I tried isolating the images away from the previous processing directory, but I was kind of at sea here, with no idea what files are needed. So I just tinkered with some of the config files until RTA agreed that there was a run to process. One major problem: it mysteriously displayed 32 panels/lane instead of the v3 48. Since the run would be repeated once the reagents arrived anyway, I decided to start the processing regardless. I let it run for ~16 hours. It did seem to be doing something, but nothing useful, so I terminated the processing.
I have some ideas as to how one might proceed from here. Illumina tech support is not eager to guide me. They offered an older version of OLB that still did image analysis. Rick gave that a shot and it did not look like it was going to be productive either. But I still think RTA might be cajoled into actually processing image data again. Maybe one could create a faux run to build the right file/config file structure, then link the old image data there and point RTA at it.

Okay, why would anyone want to do this? Up to some point it was possible to do image analysis offline. This provided one avenue to avoid Illumina's Achilles' heel: lack of what I'll refer to as "variety" in the first 4 cycles causes cluster calling to work poorly or fail altogether.

Might as well upgrade the above statement to a full denunciation: Illumina, you get many things right, but you get a big "FAIL" on this issue. The work-around is to make sure there is sufficient "variety" in cycles 1-4 of your library. But, c'mon, that is a dim-witted way to build a system!

So, a secondary work-around was developed for the Genome Analyzer where you re-do image analysis on a re-ordered set of images. That is, you have the software do cluster calling starting at a later cycle of your run -- past where the low "variety" sequence is. Then restore the actual order of the cycles later on.
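The re-ordering idea can be sketched in a few lines. This is only an illustration of the bookkeeping, not the actual GOAT-fooling procedure: moving the skipped cycles to the end of the run is an assumption, and the function names `reordered_cycles` and `restore_read` are hypothetical.

```python
def reordered_cycles(total_cycles, skip):
    """Cycle order shown to the image analysis: skip+1..total first, then 1..skip.

    Assumption: the low-variety cycles are moved to the end of the run.
    """
    if not 0 <= skip < total_cycles:
        raise ValueError("skip must be in [0, total_cycles)")
    later = list(range(skip + 1, total_cycles + 1))
    early = list(range(1, skip + 1))
    return later + early


def restore_read(called_bases, skip):
    """Put a base-called read back into true cycle order."""
    if skip == 0:
        return called_bases
    return called_bases[-skip:] + called_bases[:-skip]


order = reordered_cycles(total_cycles=10, skip=4)
true_read = "AAAACGTACG"  # low-variety first four cycles
as_called = "".join(true_read[c - 1] for c in order)
assert restore_read(as_called, skip=4) == true_read
```

The point of the sketch is that the restore step is just the inverse rotation, so nothing is lost as long as the number of skipped cycles is recorded.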

I came to the Illumina party only recently, so I never partook of this practice (called "GOAT-fooling", etc.). Plus, initially it was stated that it would not work on HiSeqs (or HiScanSQs, by extension) because saving the images was not possible.

But it is possible to save the image files. Not necessarily wise (especially if you are storing them locally), but possible. So, I thought it might be easy just to fire up RTA and re-process a run, should the need occur. And, if this is possible, then fooling RTA would likely be possible as well.

Anyway, as it turns out, I did not have a cluster-calling issue that drove me to give it a try. I had, well, what would you call it? Basically some issue caused high pre-phasing for 2 lanes of a run. One of these lanes was our phiX control lane. I am not sure what the benefit of specifying a control lane for a run is, but the downside is that if anything goes wrong with that one lane, your run is hosed. So, our run was hosed, even though only a single data lane had the pre-phasing issue. Quality values take a dive around cycle 30 despite all other metrics looking good for lanes 1-6.

So, this is not a big deal, if you have your "intensity files" saved. You just fire up your OLB (Off-Line Basecaller) and specify no control lane. Not sure why that "save intensity file" box did not get checked for this run. I mean we were saving the image files, why not the intensity files as well? Anyway, I thought it should be possible to re-generate the intensity files through RTA. And, as I say above, I still think it is possible. But not trivial enough for me to pursue it further at this time...

(BTW, I don't want to appear to be hammering Illumina too hard here. Their tech support actually works well and they are replacing all the run reagents. Given the alternative, that counts for a lot.)

--
Phillip
11-23-2011, 12:20 AM   #2
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869

So we've done lots of work on this type of library and I guess our experience has been that as long as you've saved images these types of run can always be rescued.

I actually have a lot of sympathy with Illumina, as the steps you'd need to take to ensure you can rescue these runs would probably break more runs than they saved if everyone started doing them. Saving images for a GAII run is just about OK, but will undoubtedly fill up some people's disk space and cause the run to stall. As you said, saving HiSeq images is actually possible (and you can use OLB to reanalyse them), but Illumina say that it isn't, due to the enormous amounts of data you need to move around. Even deferring cluster calling to later cycles in RTA is tricky because you need to store all images up to that point, and you can't start processing and discarding them until you've done the cluster calling, at which point you're way behind. You also need to know in advance whether biased composition is likely to be a problem to set these parameters before a run starts.

One thing we've seen from a few different groups now is that people often think they have a cluster detection problem, when actually it's a base call calibration problem. Biased composition will mess up cluster calling, but this is only disastrously bad if the bias is very strong and the cluster density is very high. The other thing biased composition does is to mess up the channel and phasing calibrations if you don't use a control lane. If you're getting plenty of clusters, but they're all failing the QC filters, then you can normally rescue the run by rerunning Bustard from the intensity files but specifying a fixed matrix and phasing parameters. If the qualities over the biased bases are rubbish then the EAMSS filter can also knock down the qualities over the rest of the read, so we'd also normally run with --no-eamss (which used to be an undocumented option, but I think is official in the latest version).
11-23-2011, 12:29 AM   #3
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869

If you want an illustration of the problems caused by biased sequences messing up base calling, have a look at slides 5 and 6 in the following PDF:

http://bioinfo-core.org/images/b/ba/...r_pipeline.pdf

Slide 5 shows the composition of a library, which was both biased and contained 5' barcodes. This came to us from an external sequencing service who had just run the normal pipeline calibrating every lane individually.

Slide 6 shows two histograms showing the distribution of sequences in the positions which should have contained the barcodes. The bottom panel was from the initial analysis, where no sequences passed QC, and all Phred scores were terrible. The top panel shows the same intensity files reanalysed with a fixed calibration matrix and phasing. You can see that the real barcodes now hugely dominate the library. From getting no sequences at all past GERALD in the initial analysis we got over 20 million from the reanalysis.

The moral is that Illumina's pipeline makes certain assumptions about the composition of your library, and if it doesn't meet them then the base calling will be very wrong. If you know better than the pipeline, then don't be afraid to tell it!
01-26-2012, 10:24 PM   #4
amitra
Junior Member
 
Location: galveston

Join Date: May 2009
Posts: 8

Sorry to crosspost! I would have the opportunity to work with my Sequencing facility to sequence a sample with reads starting with a specific 11nt barcode 'TCGAGGTAGTA', and would like to save all the image files to my NAS during the sequencing run.
Can the experts kindly point out what parameters to set exactly on the HiSeq controller software so that it saves images? The sequencing facility personnel say that only thumbnails can be saved, but after checking out this thread, I believe that the images can also be saved.
Thanks in advance, Abi
01-26-2012, 10:42 PM   #5
MRSeq
Member
 
Location: USA

Join Date: May 2009
Posts: 11

Hello Phillip and Simon,

I am very interested in this subject, as we have the same problem: almost no diversity for the first 11 (!) bases. So far we are getting horrible results with the standard Illumina analysis, and in our limited experience the problem is actually worse on HiSeq than it was on GAII. Our facilities tried lowering the cluster density, mixing the sample half-and-half with phiX, and re-basecalling starting from .cif files, all to no avail.

Now we are starting to work with a new facility, and they would allow us to save images and change run parameters (such as delaying template generation), if only we can work out how to do it. They have enough storage to comfortably store images, or at least we believe so. We expect 4TB of data from a single 100G flowcell SE 50bp run; they have 6TB of storage now, with the possibility of extending it to at least 9TB, likely more.

I would be very interested in all the tips you can give me, and in the details of Phillip's travails. How did you tinker with the RTA config files? Why did you think old OLB was not doing well?

Simon, are you saying that for low-diversity samples a phiX control lane is crucial? Our facility is planning to run RNA-seq (from another organism) in some of the lanes. Can that be used as a control? What about a sample from the same organism from another run, provided we saved images?

I am very interested in all technical details related to re-processing the HiSeq data from the .tif files. If you think it is too technical for this forum, please send me a private message.

Last edited by MRSeq; 01-26-2012 at 11:48 PM.
01-28-2012, 09:56 AM   #6
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,283

I am unaware of anyone having re-processed HiSeq data from the .tifs. So, while it is possible to save the images, they are not really useful for anything.

I still think most of the problem arises from mis-focusing of the instrument when it encounters a blank G/T or A/C tile. But this may be HiScanSQ specific.

--
Phillip
03-08-2012, 12:10 PM   #7
amitra
Junior Member
 
Location: galveston

Join Date: May 2009
Posts: 8

Thanks so much for the detailed information on RTA behavior. Hopefully somebody would be able to devise a pipeline for a low-diversity sequencing run with HiSeq.
03-08-2012, 11:17 PM   #8
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869

Quote:
Originally Posted by amitra
Sorry to crosspost! I would have the opportunity to work with my Sequencing facility to sequence a sample with reads starting with a specific 11nt barcode 'TCGAGGTAGTA', and would like to save all the image files to my NAS during the sequencing run.
Can the experts kindly point out what parameters to set exactly on the HiSeq controller software so that it saves images? The sequencing facility personnel say that only thumbnails can be saved, but after checking out this thread, I believe that the images can also be saved.
Thanks in advance, Abi
I think I missed this update when it first came out, but if you have a library where *all* of the sequences start with the same sequence then the easiest way to sequence this is to make up a custom sequencing primer which primes over the biased sequence so that the sequencing run starts after the bias has finished. Obviously this won't work if the library is being mixed with other libraries in the same lane, but in that case the other libraries will increase the diversity so you won't have such a problem in the first place.
08-27-2012, 10:57 AM   #9
victorsor
Member
 
Location: Madrid

Join Date: Dec 2011
Posts: 13

Quote:
Originally Posted by simonandrews
So we've done lots of work on this type of library and I guess our experience has been that as long as you've saved images these types of run can always be rescued.

I actually have a lot of sympathy with Illumina, as the steps you'd need to take to ensure you can rescue these runs would probably break more runs than they saved if everyone started doing them. Saving images for a GAII run is just about OK, but will undoubtedly fill up some people's disk space and cause the run to stall. As you said, saving HiSeq images is actually possible (and you can use OLB to reanalyse them), but Illumina say that it isn't, due to the enormous amounts of data you need to move around. Even deferring cluster calling to later cycles in RTA is tricky because you need to store all images up to that point, and you can't start processing and discarding them until you've done the cluster calling, at which point you're way behind. You also need to know in advance whether biased composition is likely to be a problem to set these parameters before a run starts.

One thing we've seen from a few different groups now is that people often think they have a cluster detection problem, when actually it's a base call calibration problem. Biased composition will mess up cluster calling, but this is only disastrously bad if the bias is very strong and the cluster density is very high. The other thing biased composition does is to mess up the channel and phasing calibrations if you don't use a control lane. If you're getting plenty of clusters, but they're all failing the QC filters, then you can normally rescue the run by rerunning Bustard from the intensity files but specifying a fixed matrix and phasing parameters. If the qualities over the biased bases are rubbish then the EAMSS filter can also knock down the qualities over the rest of the read, so we'd also normally run with --no-eamss (which used to be an undocumented option, but I think is official in the latest version).
In our current run we have a problem related to the one you mentioned. Clusters are identified (about 500,000 per tile) but 0% pass filter. We see that in 4 of the first 5 cycles the P90 for A, C, G, T is 0, but after cycle 6 it looks fine. Can we reprocess the data with OLB from CIF files? How can we specify a fixed matrix? With which values? In the OLB manual we have read that the matrix can be made with more than 5 cycles. If we used ten cycles (for example), could the problem be corrected?

At the moment read 1 is running. If we stop the run before read 2, change the option to save images, and resume the recipe, would that help us in any way?


Thanks a lot for your help.
08-27-2012, 11:55 PM   #10
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869

Quote:
Originally Posted by victorsor
In our current run we have a problem related to the one you mentioned. Clusters are identified (about 500,000 per tile) but 0% pass filter.
Whether you can fix this will depend on how it's gone wrong. Simply knowing that RTA found clusters doesn't mean it got them right. With the biased initial sequence the problem is that the analysed region ends up spanning two or more clusters so that later on you get a mixed signal and the cluster is correctly rejected by the purity filter. Since none of your clusters pass filter this isn't likely to be the entire cause here, but it may be that a significant number of clusters can't be fixed in software. Messed up cluster detection can only be fixed from images, not cif files.
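For readers wondering what the purity filter is checking, here is a toy sketch. It follows the chastity definition from Illumina's general documentation as commonly described (brightest intensity divided by the sum of the two brightest, with a cluster passing if at most one of the first 25 cycles falls below 0.6). The thresholds, the function names and the example intensities are assumptions that do not come from this thread, and this is not RTA's actual code.

```python
def chastity(intensities):
    """Brightest channel divided by the sum of the two brightest channels."""
    top, second = sorted(intensities, reverse=True)[:2]
    total = top + second
    return top / total if total > 0 else 0.0


def passes_filter(cycles, threshold=0.6, window=25, max_failures=1):
    """True if at most `max_failures` of the first `window` cycles are impure."""
    failures = sum(1 for c in cycles[:window] if chastity(c) < threshold)
    return failures <= max_failures


# One dominant channel per cycle: a clean, single-template cluster.
clean = [(1000.0, 50.0, 30.0, 20.0)] * 25
# Two channels of similar brightness: a region spanning two clusters.
mixed = [(500.0, 480.0, 20.0, 10.0)] * 25

print(passes_filter(clean), passes_filter(mixed))
```

A region that merges two real clusters looks fine while both templates carry the same base, and only shows the mixed signal once the sequences diverge, which is why it gets rejected later rather than at detection time.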

Quote:
Originally Posted by victorsor
We see that in 4 of the first 5 cycles the P90 for A, C, G, T is 0, but after cycle 6 it looks fine. Can we reprocess the data with OLB from CIF files?
Yes. I guess you have two choices: you could either specify a fixed matrix and turn off the killer Bs, or you could just ignore the first 5 bases altogether and reprocess from base 6 onwards using the normal parameters.


Quote:
Originally Posted by victorsor
How can we specify a fixed matrix? With which values? In the OLB manual we have read that the matrix can be made with more than 5 cycles. If we used ten cycles (for example), could the problem be corrected?
Again, you have a few options. The easiest fix, if this is feasible in your data, is to simply specify a control lane from the same flowcell which is unaffected by the sequence bias (--control-lane=X when running bustard). This then generates all of the run parameters from that lane and applies them to all other lanes.

If you don't have this then you can supply a fixed matrix using the --matrix option in bustard. We've just taken the example one out of the OLB manual (it's on P16 in the v1.9 manual) when we've done this.

You can also expand the number of cycles used to define the matrix using the --matrix-cycles option. Given that you have 5 biased bases I don't think this is worth pursuing as you're unlikely to get a nice matrix if those are included.

Whichever of these options you choose you should probably also try adding the --no-eamss option when you run configureBclToFastq. This will disable the automatic lowering of quality values after they have substantially dipped in a read, so that if you have a read whose quality initially dips and then recovers the qualities will also recover rather than being pinned to a low value.
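The choices above can be summarised as a small decision helper. This is a sketch only: the option names are the ones mentioned in this thread, but the `=value` spelling for --matrix and --matrix-cycles is an assumption (check the OLB manual for your version), and `rescue_options` is a hypothetical name. It builds option lists and does not run anything.

```python
def rescue_options(clean_control_lane=None, matrix_file=None, matrix_cycles=None):
    """Return (bustard_opts, bcl_to_fastq_opts) for rerunning a biased run.

    Preference order follows the advice above: an unaffected control lane
    first, then a fixed matrix file, then a longer matrix estimation window.
    """
    bustard = []
    if clean_control_lane is not None:
        # Easiest fix: take all run parameters from an unaffected lane.
        bustard.append(f"--control-lane={clean_control_lane}")
    elif matrix_file is not None:
        # No clean lane available: supply a fixed matrix instead.
        bustard.append(f"--matrix={matrix_file}")
    elif matrix_cycles is not None:
        # Least promising: estimate the matrix over more cycles.
        bustard.append(f"--matrix-cycles={matrix_cycles}")
    else:
        raise ValueError("need a control lane, a matrix file, or a matrix cycle count")
    # Whichever route is taken, also try disabling EAMSS quality masking.
    return bustard, ["--no-eamss"]


print(rescue_options(clean_control_lane=4))
print(rescue_options(matrix_file="matrix.txt"))
```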

Quote:
Originally Posted by victorsor
At the moment read 1 is running. If we stop the run before read 2, change the option to save images, and resume the recipe, would that help us in any way?
This won't help. If your cluster calling is messed up then it's too late to fix this, and anything else can be fixed from cif files.
08-28-2012, 02:49 AM   #11
victorsor
Member
 
Location: Madrid

Join Date: Dec 2011
Posts: 13
thanks to Simon

Thanks a lot for your rapid and extensive answer.
We have a control lane unaffected by the bias, so I think that "redefining" the matrix is not an option. Is that correct?
Tech support suggested an option that I think is like one of those you proposed:
Delete the data for the initial 5 cycles (C1.1, C2.1, C3.1, C4.1, C5.1 in all 8 lane directories) and then rename the cycle directories so that cycle 6 becomes cycle 1 (C6.1 -> C1.1, C7.1 -> C2.1, etc.). The final dataset is a few cycles short. Therefore, if you previously had 100+7+100 you can change this to 95+7+100 using the following option with the bustard.py command

--cycles=1-95,96-102,103-202
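The renaming and the new cycle ranges can be checked on paper before touching any data. The sketch below only plans the renames from a list of directory names and computes the --cycles string; it deletes and renames nothing. The function names are hypothetical, and the `C<n>.1` naming pattern is taken from the description above.

```python
import re

CYCLE_DIR = re.compile(r"^C(\d+)\.1$")


def plan_cycle_renames(dir_names, n_drop):
    """Return (old_name, new_name) pairs; new_name is None for dropped cycles."""
    numbered = []
    for name in dir_names:
        match = CYCLE_DIR.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    plan = []
    # Ascending order matters: C6.1 can only become C1.1 once C1.1 is gone.
    for number, name in sorted(numbered):
        new_name = None if number <= n_drop else f"C{number - n_drop}.1"
        plan.append((name, new_name))
    return plan


def cycles_option(read_lengths):
    """Build the --cycles value from the length of each read in cycles."""
    ranges = []
    start = 1
    for length in read_lengths:
        end = start + length - 1
        ranges.append(f"{start}-{end}")
        start = end + 1
    return "--cycles=" + ",".join(ranges)


names = [f"C{i}.1" for i in range(1, 208)]  # 100+7+100 cycles
print(plan_cycle_renames(names, n_drop=5)[:7])
print(cycles_option([95, 7, 100]))
```

Working on a copy of the lane directories, or moving the dropped cycles aside rather than deleting them, would keep the original data recoverable.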
08-28-2012, 04:00 AM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,764

If you have a control lane that is unaffected then Simon already indicated the easiest option. I concur with him about trying the "--control-lane=X" option first.

The other solution suggested by techsupport is "extreme".

Quote:
Originally Posted by victorsor
We have a control lane unaffected by the bias, so I think that "redefining" the matrix is not an option. Is that correct?
Tech support suggested an option that I think is like one of those you proposed:
Delete the data for the initial 5 cycles (C1.1, C2.1, C3.1, C4.1, C5.1 in all 8 lane directories) and then rename the cycle directories so that cycle 6 becomes cycle 1 (C6.1 -> C1.1, C7.1 -> C2.1, etc.). The final dataset is a few cycles short. Therefore, if you previously had 100+7+100 you can change this to 95+7+100 using the following option with the bustard.py command

--cycles=1-95,96-102,103-202

Last edited by GenoMax; 08-28-2012 at 04:03 AM.
08-28-2012, 05:29 AM   #13
victorsor
Member
 
Location: Madrid

Join Date: Dec 2011
Posts: 13

I probably did not explain this well. A control lane is already defined in the current run.
08-28-2012, 09:21 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,764

Is this a GAIIx run (based on your comments about saving "images", I assume it is)?

The option "--control-lane=X" is to be used with bustard after the run completes. This does require that the lane "X" be marked as "control" during the run (which you have done).

If the cluster detection is messed up (are you able to see well separated clusters or are they very dense in the images, sort of like overgrown bacterial colonies) then your options are limited for salvaging this run, as Simon has already indicated.


Quote:
Originally Posted by victorsor
I probably did not explain this well. A control lane is already defined in the current run.
08-28-2012, 12:35 PM   #15
victorsor
Member
 
Location: Madrid

Join Date: Dec 2011
Posts: 13

Quote:
Originally Posted by GenoMax
Is this a GAIIx run (based on your comments about saving "images", I assume it is)?
Yes, it is a GAIIx run.

Quote:
Originally Posted by GenoMax
The option "--control-lane=X" is to be used with bustard after the run completes. This does require that the lane "X" be marked as "control" during the run (which you have done).
Thanks for your comments. Anything that can help us avoid wasting all the data is appreciated.
However, if a control lane was already used in the current run, how would reprocessing with the same option, without any other change, help us?

Sorry if I have misunderstood you.

Quote:
Originally Posted by GenoMax
If the cluster detection is messed up (are you able to see well separated clusters or are they very dense in the images, sort of like overgrown bacterial colonies) then your options are limited for salvaging this run, as Simon has already indicated.
It is difficult to distinguish between a high number of clusters and overloading. I do not think overloading is the problem, because in another lane with a similar library mixed 50%-50% with a non-biased sample we obtained the expected number of clusters.
Additionally, in the first cycles during the incorporation step we saw a very, very high number of spots in red and far fewer in green. After a few cycles, the numbers of red and green spots look similar.
08-29-2012, 01:04 AM   #16
sklages
Senior Member
 
Location: Berlin, DE

Join Date: May 2008
Posts: 620

Quote:
Originally Posted by GenoMax
The option "--control-lane=X" is to be used with bustard after the run completes. This does require that the lane "X" be marked as "control" during the run (which you have done).
IMHO this is not necessary; Lane X should look good, that's more important :-) We do this from time to time with chipseq data.
08-29-2012, 01:27 AM   #17
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 869

Quote:
Originally Posted by victorsor
I probably did not explain this well. A control lane is already defined in the current run.
If there was a control lane defined then there should be no need to go messing about with matrix or phasing settings (assuming the control lane looked OK).

If you've still got no clusters then the only remaining remedy would be to run the bclToFastq with the --no-eamss option to see if the qualities temporarily dip and then recover.
08-29-2012, 04:33 AM   #18
victorsor
Member
 
Location: Madrid

Join Date: Dec 2011
Posts: 13

Thank you all. We will try your suggestions when the run finishes. I hope that not all the data goes to waste...