SEQanswers

Old 01-28-2009, 02:43 AM   #1
dvh
Member
 
Location: london, uk

Join Date: Jul 2008
Posts: 35
Default CASAVA, Pipeline 1.3

I've just looked through the newly released CASAVA manual. Whilst it would seem to have some new tools for visualising/calling SNPs and RNAseq, it seems totally dependent on ELAND alignments.

We haven't used ELAND since we moved to read lengths of 45bp+. We didn't find it very good for reads >32bp.

Am I missing something here?

david
Old 01-28-2009, 05:41 AM   #2
rs705
Junior Member
 
Location: USA

Join Date: Sep 2008
Posts: 6
Default

Are you committed to CASAVA? If not, can you tell me what applications you are interested in?
Old 01-28-2009, 11:23 AM   #3
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Couldn't you use something like Bowtie, which yields a similar kind of output, and bend it into ELAND format?
Old 01-28-2009, 12:51 PM   #4
apfejes
Senior Member
 
Location: Oakland, California

Join Date: Feb 2008
Posts: 236
Default

I thought CASAVA was an Illumina product, as is Eland. I don't think you're missing anything - of course they want you to use their products end to end. (= On the other hand, even the WTSS SNP & Exon expression software I wrote handles more than one input format, so I think it's just Illumina trying to bring people back into the Eland fold.

Frankly, there are so many SNP callers out there that, until I see some solid reason to switch to CASAVA (and back to Eland), it's not even on my radar.
__________________
The more you know, the more you know you don't know. —Aristotle
Old 01-29-2009, 09:51 AM   #5
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482
Default

I agree with you.

But being quick, one-stop and vendor-supported, with visualization support for investigators, are some reasons in its favor. Maybe, at least; I have not had the chance to look yet.
Old 01-29-2009, 09:55 AM   #6
apfejes
Senior Member
 
Location: Oakland, California

Join Date: Feb 2008
Posts: 236
Default

Well, for people just getting into the game, I'm sure it'll be easy to set up and get running.

That's how Microsoft managed to get 95% of the internet-using population using Internet Explorer for a while.... (-;
Old 02-01-2009, 07:40 AM   #7
GRT
Junior Member
 
Location: UK

Join Date: Jul 2008
Posts: 5
Default qseq.txt format

Quote:
Originally Posted by dvh View Post
I've just looked through the newly released CASAVA manual. Whilst it would seem to have some new tools for visualising/calling SNPs and RNAseq, it seems totally dependent on ELAND alignments.

We haven't used ELAND since we moved to read lengths of 45bp+. We didn't find it very good for reads >32bp.

Am I missing something here?

david
Also, seq.txt & prb.txt are now "optional" Bustard output, the default being qseq.txt, but there is not much info on this format in the pipeline manual. As we haven't updated the software yet, does anyone have some new qseq.txt files to play with, or information on the q scores used?
Old 02-01-2009, 04:44 PM   #8
zee
NGS specialist
 
Location: Malaysia

Join Date: Apr 2008
Posts: 249
Default

For RNAseq there are systems such as ERANGE and FindFeatures (Vancouver SR package).
ERANGE seems quite limited to specific genomes and I'm working with certain genomes that have no reference sequence.
I have not tried FindFeatures.

It would be good to have a generic system to do tag counting in samples given a set of known exon positions and mapping results from alignment to whole genome, mRNA and exon junctions.
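As a rough idea of what such a generic tag counter could look like, here is a minimal Python sketch. Everything in it is an assumption for illustration: the function name, the (name, start, end) exon tuples, inclusive coordinates on a single chromosome, non-overlapping exons, and assigning each read by its aligned start position only. It is not taken from ERANGE, FindFeatures or any other package.

```python
from bisect import bisect_right
from collections import Counter

def count_tags_per_exon(exons, read_starts):
    """Count reads whose start position falls inside each exon.

    exons: list of (name, start, end), inclusive, non-overlapping (assumed).
    read_starts: iterable of aligned read start positions on the same chromosome.
    """
    exons = sorted(exons, key=lambda e: e[1])
    starts = [e[1] for e in exons]
    counts = Counter()
    for pos in read_starts:
        # Index of the last exon starting at or before this read.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos <= exons[i][2]:
            counts[exons[i][0]] += 1
    return counts

if __name__ == "__main__":
    exons = [("e1", 100, 200), ("e2", 300, 400)]
    print(count_tags_per_exon(exons, [150, 199, 250, 300, 401]))
```

Reads mapped to mRNA or exon-junction references would need their coordinates translated back to the genome before counting, which this sketch does not attempt.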
Old 02-01-2009, 06:25 PM   #9
apfejes
Senior Member
 
Location: Oakland, California

Join Date: Feb 2008
Posts: 236
Default

FindFeatures is a fairly simple program. I don't think anyone outside of the BC Genome Science Centre is using it - although if anyone has the urge to try it, I'm more than happy to provide support.

Anthony
Old 02-09-2009, 02:56 AM   #10
coxtonyj
Junior Member
 
Location: Cambridge, UK

Join Date: Apr 2008
Posts: 8
Default

Quote:
Originally Posted by apfejes View Post
Well, for people just getting into the game, I'm sure it'll be easy to set up and get running.

That's how Microsoft managed to get 95% of the internet using population using Internet Explorer for a while.... (-;
Hi apfejes

Disclaimer: I work at Illumina and am one of the developers of CASAVA, but these are my personal opinions.

As I see it, the beauty of sequencing data is that once you've got it into As, Cs, Gs and Ts it becomes a 'commodity item'. I think trying to compete with the combined brainpower of the entire sequencing community by trying to 'lock users in' beyond that stage would be extremely tough, and it's not clear to me that we would gain much by doing so.

CASAVA is meant more to make it easier to process datasets on 'human genome resequencing' scales. A human genome at, say, 30x sequence coverage presents logistical issues beyond those associated with, say, a ChIP-Seq dataset of a couple of Gbases (and I in no way wish to trivialize those; I know this is already a dauntingly large dataset in many ways). Now that we are not so far away from "1 run (from whatever platform) = 1 genome", we don't want these issues to stand in the way of the science. Ideally algorithm developers would be able to concentrate on algorithms and not file formats and so forth.

The idea is that 'under the hood' CASAVA handles the necessary sorting, binning and filtering of reads. SNP callers and other downstream applications then access the alignment data they need by making function calls to a library.
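To make the general idea concrete, here is a purely illustrative Python sketch of binning sorted reads and letting a downstream caller fetch a region through a function call. The class name, method names and the 1000-base bin size are all hypothetical; this is not CASAVA's actual library interface or on-disk layout, and the filtering step is left out.

```python
from collections import defaultdict

BIN_SIZE = 1000  # hypothetical bin width in bases

class BinnedAlignments:
    """Toy store that bins reads by position and serves region queries."""

    def __init__(self, bin_size=BIN_SIZE):
        self.bin_size = bin_size
        self.bins = defaultdict(list)

    def add(self, chrom, pos, read):
        # Binning: each read goes into the bin covering its start position.
        self.bins[(chrom, pos // self.bin_size)].append((pos, read))

    def finalize(self):
        # Sorting: order reads by position within every bin.
        for reads in self.bins.values():
            reads.sort()

    def fetch(self, chrom, start, end):
        """Return (pos, read) pairs with start <= pos < end, in position order."""
        out = []
        first, last = start // self.bin_size, (end - 1) // self.bin_size
        for b in range(first, last + 1):
            out.extend(r for r in self.bins.get((chrom, b), []) if start <= r[0] < end)
        return out

if __name__ == "__main__":
    store = BinnedAlignments()
    store.add("chr1", 1500, "r1")
    store.add("chr1", 500, "r2")
    store.finalize()
    print(store.fetch("chr1", 400, 1600))
```

A SNP caller written against something like `fetch` would only ever see the reads for the region it is working on, whichever aligner produced them.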

The software evolved from the code we used for our Yoruba genome analysis and can be used as a standalone genome analysis tool. The currently released version only includes the SNP calling module but internally we have modules for e.g. short indel and structural variant detection that we are looking to move towards release. CASAVA is also used as a backend to provide input data for the Genome Studio software we are releasing.

I would actually be very happy if people were to use CASAVA to process MAQ and/or Bowtie data, and I imagine it would be quite straightforward to write a parser; lack of time is the only reason we haven't looked at this ourselves.

Cheers

Tony
Old 02-09-2009, 10:47 AM   #11
apfejes
Senior Member
 
Location: Oakland, California

Join Date: Feb 2008
Posts: 236
Default

Hi Tony,

Thanks for the reply - I hadn't meant to imply that Illumina was working towards some grand evil plan to take over the sequence analysis space, as Microsoft has done in the past with the Windows desktop - only that Illumina is providing a tool the way that Microsoft did, where it will now be easier to use the one that comes with the tool "out of the box" than to move on to something else. (And that's not necessarily a bad thing.)

As far as it not having parsers because you haven't had time to write them, I certainly understand the phenomenon - I've run into it several times myself. If the software were open source, or the source code were publicly available, others might be willing to contribute those missing parts, which would be an option for allowing other aligners to be used. (I suspect that's not in Illumina's best interest, however, so I'm not really expecting to see that.)

In any case, I think the major issue I have is that I have only heard much about CASAVA second hand in meetings and otherwise, so I'm likely missing key information. Perhaps you can point us to some literature on the web that would be able to fill in the missing pieces for the rest of us. I'd certainly appreciate reading more than just marketing pieces - which I haven't yet come across. Is there something I've missed out there?

Anthony
Old 02-13-2009, 04:29 AM   #12
coxtonyj
Junior Member
 
Location: Cambridge, UK

Join Date: Apr 2008
Posts: 8
Default

Hi apfejes

Thanks for the reply, you make several good points. At the moment the software is available on the same basis as our existing 'analysis pipeline' software package - i.e. instrument owners can download it free, including access to the source code. Unfortunately (much as I might like to) it's not for me to comment on whether our policy on that might change in the future.

We've presented posters on it at a couple of conferences recently and there's a sizeable manual that comes with it. As it's a new venture I think we're adopting somewhat of a softly softly approach to releasing it - some people will try it whether you publicize it or not, and that gives us feedback that we can add to the ideas we already have about how it can evolve to best meet users' needs. I think you're right though that a tech note aimed at the kind of folks who read this board would be a good idea.

We're not really proprietary about which aligners or other tools people use - it's their data after all. Personally I see things moving towards more of a decoupling between alignment tools and downstream tools (SNP callers and so forth) that use alignments. I think the SAMTools project is a very positive step in that direction, it seems to me it has many of the same aims as CASAVA.

Cheers

Tony
Old 02-23-2009, 06:15 PM   #13
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default

Hi Tony,
I've been given a couple of qseq.txt files to align for clients and the format looks pretty simple except for the quality values. I'm seeing a lot of B's in the quality string and it looks like this is the lowest quality value. In earlier _sequence.txt files quality values were in the form -10*log10(p/(1-p)) + '@' and codes went as low as ';'
These qseq.txt files look like you may be using Phred-type -10*log10(p) + '@'. Any chance you could enlighten us?

Thanks, Colin

Last edited by sparks; 02-23-2009 at 06:15 PM. Reason: formula correction
Old 02-24-2009, 12:25 AM   #14
coxtonyj
Junior Member
 
Location: Cambridge, UK

Join Date: Apr 2008
Posts: 8
Default

Hi Colin

You have it spot on, they are now in Phred format. Just to state it fully for the benefit of others: ASCII='@'+10*log10(1/p), p being the estimated probability of the base being in error. This change was made as of Pipeline 1.3.
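For anyone who wants to check this against their own qseq.txt files, here is a minimal Python sketch of the formula above. The function names are made up for illustration (they are not from any Illumina tool), and it assumes the quality string really is in the Pipeline 1.3 Phred+64 encoding.

```python
def phred64_to_q(ch):
    """Quality score for one Phred+64 character: Q = ord(ch) - ord('@')."""
    return ord(ch) - 64

def q_to_error_prob(q):
    """Invert Q = 10*log10(1/p) to recover the error probability p."""
    return 10 ** (-q / 10.0)

def decode_quality_string(qual):
    """Return a list of (Q, p) pairs, one per character of the quality string."""
    return [(phred64_to_q(c), q_to_error_prob(phred64_to_q(c))) for c in qual]

if __name__ == "__main__":
    for q, p in decode_quality_string("Bh^T"):
        print(q, round(p, 4))
```

With this encoding 'B' decodes to Q2 and 'h' to Q40.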

Cheers

Tony
Old 02-24-2009, 06:26 AM   #15
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default

Hi Tony,
Thanks for that, I'm sure you are right, though some Illumina documentation being sent out with export files still talks about -5 being a valid quality value, so you guys should check your documentation.
I've also noticed in the qseq files I have that the lowest code is a B, which translates to a Phred score of 2. This happens even for bases called as '.'. If Perr was 0.75 then Phred would be 1.24, so it looks like you round up to 2. This might be of interest to people who are using qualities in alignment and in SNP calling. I did like the previous Solexa scale as it gave a finer resolution for higher Perr values.
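A quick way to see the difference between the two scales at high Perr is the Python sketch below. The function names are just for illustration, and the formulas are the standard Phred and Solexa log-odds definitions rather than anything copied from the pipeline code.

```python
import math

def phred_score(p_err):
    """Phred scale: Q = -10*log10(p)."""
    return -10 * math.log10(p_err)

def solexa_score(p_err):
    """Solexa log-odds scale: Q = -10*log10(p / (1 - p))."""
    return -10 * math.log10(p_err / (1 - p_err))

# Print both scores side by side for a range of error probabilities.
for p in (0.75, 0.5, 0.25, 0.1, 0.01):
    print(p, round(phred_score(p), 2), round(solexa_score(p), 2))
```

Between Perr = 0.75 and Perr = 0.5 the Phred score only moves from about 1.2 to about 3.0, while the Solexa score moves from about -4.8 to 0. By Perr = 0.01 the two scales are nearly identical.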

Thanks again, Colin
Old 02-26-2009, 01:20 AM   #16
coxtonyj
Junior Member
 
Location: Cambridge, UK

Join Date: Apr 2008
Posts: 8
Default

Hi Colin

Quote:
Originally Posted by sparks View Post
Hi Tony,
Thanks for that, I'm sure you are right, though some Illumina documentation being sent out with export files still talks about -5 being a valid quality value, so you guys should check your documentation.
OK, thanks for bringing that to my attention.

Quote:
Originally Posted by sparks View Post
I've also noticed in the qseq files I have that the lowest code is a B, which translates to a Phred score of 2. This happens even for bases called as '.'. If Perr was 0.75 then Phred would be 1.24, so it looks like you round up to 2. This might be of interest to people who are using qualities in alignment and in SNP calling. I did like the previous Solexa scale as it gave a finer resolution for higher Perr values.

Thanks again, Colin
Yes, Phred Q1 translates to just over 20% probability of the called base being correct. In the absence of further information, the natural assumption is that the three non-called bases are equiprobable, but that then means for a Q1 base the three non-called bases are each more likely than the called base - this can mess up your stats! It probably doesn't matter so much what the Q-value of an 'N' is set to, but I guess they are being set to Q2 for consistency.
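The arithmetic behind that can be sketched in a few lines of Python. This assumes, as above, that the error probability is split equally between the three non-called bases; the function name is just for illustration.

```python
def base_probabilities(q):
    """Return (P(called base correct), P(each non-called base)) for Phred score q.

    Assumes the error probability is shared equally by the three other bases.
    """
    p_err = 10 ** (-q / 10.0)
    return 1 - p_err, p_err / 3

if __name__ == "__main__":
    for q in (1, 2):
        p_called, p_other = base_probabilities(q)
        print(q, round(p_called, 3), round(p_other, 3))
```

At Q1 the called base comes out at roughly 0.21 against roughly 0.26 for each alternative, whereas at Q2 it is roughly 0.37 against 0.21. The crossover is at an error probability of 0.75, i.e. about Q1.25.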

Personally I've tended to find that if the error probability is high enough for the divergence of the scoring schemes to be an issue then the base is probably best ignored for many purposes.

There are certainly plusses and minuses to both scoring schemes. The original reason for going with the 'Solexa' log-odds scheme was that, unlike the Phred scheme, it naturally extends to a 4-values-per-base scoring scheme. We've ended up using only a single value per base, but I know some folks in the community remain keen on having more than one qv per base.

Cheers

Tony
Old 02-26-2009, 12:35 PM   #17
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,177
Default

Quote:
Originally Posted by coxtonyj View Post
Hi Colin

You have it spot on, they are now in Phred format. Just to state it fully for the benefit of others: ASCII='@'+10*log10(1/p), p being the estimated probability of the base being in error. This change was made as of Pipeline 1.3.

Cheers

Tony
Out of curiosity, why did you stick with ASCII(Q+64) instead of the standard ASCII(Q+33)? It results in the minor annoyance of having to remember to convert before use in programs which are expecting Sanger FASTQ. It also means that there are now three types of FASTQ files floating about: standard Sanger FASTQ with quality scores expressed as ASCII(Qphred+33), Solexa FASTQ with ASCII(Qsolexa+64) and Solexa FASTQ with ASCII(Qphred+64).
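For reference, converting the two +64 variants to Sanger could look something like the Python sketch below. The function names are hypothetical, the Solexa-to-Phred step uses the standard log-odds conversion, and the caller has to know which variant a file is in, since nothing in the file says so.

```python
import math

def phred64_to_sanger(qual):
    """ASCII(Qphred+64) -> ASCII(Qphred+33): shift every character down by 31."""
    return "".join(chr(ord(c) - 31) for c in qual)

def solexa64_to_sanger(qual):
    """ASCII(Qsolexa+64) -> ASCII(Qphred+33) via Qphred = 10*log10(10^(Qsolexa/10) + 1)."""
    out = []
    for c in qual:
        q_solexa = ord(c) - 64
        q_phred = 10 * math.log10(10 ** (q_solexa / 10.0) + 1)
        out.append(chr(int(round(q_phred)) + 33))
    return "".join(out)

if __name__ == "__main__":
    print(phred64_to_sanger("Bh"))
    print(solexa64_to_sanger("@h"))
```

Only the older Solexa-scaled files need the logarithm; the Qphred+64 files just need the offset changed.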

Last edited by kmcarr; 02-26-2009 at 12:39 PM. Reason: Added thought
Old 03-02-2009, 08:00 AM   #18
coxtonyj
Junior Member
 
Location: Cambridge, UK

Join Date: Apr 2008
Posts: 8
Default

That is a fair point. The need to convert has always been present of course. We did give this some thought at the time and as I recall the rationale was that any code (ours or others) that was expecting Qsolexa+64 would probably still work if given Qphred+64, but that the conversion to Qphred+33 was at least now just a simple subtraction. But perhaps we should have bitten the bullet and gone with Qphred+33.
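To put rough numbers on the "would probably still work" part, here is a small Python sketch of what happens when a score that is already Phred is pushed through a Solexa-to-Phred conversion by mistake. The function names are hypothetical and the rounding is a simple round-half-up, so this illustrates the standard log-odds mapping rather than reproducing any particular tool's code.

```python
import math

def solexa_to_phred(q_solexa):
    """Standard log-odds conversion, rounded to the nearest integer."""
    return int(10 * math.log10(10 ** (q_solexa / 10.0) + 1) + 0.5)

def misconverted(q_phred):
    """Score you end up with if a Phred value is wrongly treated as Solexa."""
    return solexa_to_phred(q_phred)

# Compare the true Phred score with the mistakenly converted one.
for q in (2, 5, 10, 20, 40):
    print(q, misconverted(q))
```

Under these assumptions the high scores come through unchanged (Q20 and Q40 stay as they are), while the low ones get inflated slightly (Q2 becomes Q4) and some distinct values collapse together (Q3 and Q4 both become Q5). So the damage is confined to bases that are of little use anyway, but it cannot be undone exactly; going back to the original Qphred+64 strings and subtracting 31 from each character is the safer route to Qphred+33.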
Old 03-30-2009, 11:25 PM   #19
fadista
Member
 
Location: Malmö

Join Date: Sep 2008
Posts: 37
Default sol2sanger

Hi,

Just want to be sure here:

1 - Is the sol2sanger function of maq 0.7.1 not working for solexa pipeline 1.3?

2 - If not, how can I convert the scores that I already computed (sol2sanger of maq 0.7.1 with solexa pipeline 1.3) to the sanger phred score system?


Best regards,
João
Old 04-14-2009, 08:07 AM   #20
acnoll
Member
 
Location: Kansas City

Join Date: Mar 2008
Posts: 14
Default non unique sequences in sorted.txt file?

When working in a tag counting context there will be many instances of a given read sequence (e.g. for digital gene expression). I have noticed an odd behavior from the eland/GA pipeline from glancing at the s_N_sorted.txt files (SE reads). There are cases where eland reports different locations for a specific sequence but the pipeline still includes it as part of the sorted.txt file. Could this be due to differences in base quality for different instances of the sequence, or perhaps even the way the genome was squashed? Has anyone else seen this?
All times are GMT -8.