Hi people, just wanted some advice on a situation.
I have a 454 genomic sequencing dataset that was enriched for gene-rich regions using biotinylated RNA as bait, i.e. anything that sticks to the RNA gets selected for. The captured DNA was then sequenced on a 454.
I know that normally people would use a reference genome, but there isn't one and won't be one for the foreseeable future. I also can't map to a transcriptome, both because there isn't one and because the point of the exercise is to improve a bunch of gene models I have, which are based on ESTs and are thus likely to be missing introns.
So I am stuck with de novo assembly. The initial assemblies I have done (using Velvet) are not awful, but the coverage is obviously highly heterogeneous. I know a lot of assembly programs assume homogeneous coverage (e.g. CABOG states flat-out that it's rubbish for exome sequencing and other heterogeneous-coverage sequencing).
I am basically just preprocessing on quality scores, splitting the reads where there's a probable homopolymer error caused by 454-ness, and doing assemblies in Velvet using a very low estimated coverage (across a range of k-mer sizes).
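To make the read-splitting step concrete, here is a minimal Python sketch of the general idea, not my exact code. The function name, the run-length threshold of 6 and the minimum fragment length of 30 are all illustrative assumptions, the homopolymer run itself is simply thrown away, and quality scores are ignored to keep it short.

```python
import re


def split_at_homopolymers(seq, min_run=6, min_len=30):
    """Split a read at every homopolymer run of at least min_run bases.

    Assumptions (illustrative only): the run itself is discarded, and
    fragments shorter than min_len are dropped as too short to assemble.
    """
    # One base followed by at least (min_run - 1) copies of the same base.
    run = re.compile(r"([ACGT])\1{%d,}" % (min_run - 1))
    seq = seq.upper()

    fragments = []
    start = 0
    for match in run.finditer(seq):
        fragments.append(seq[start:match.start()])
        start = match.end()
    fragments.append(seq[start:])

    return [frag for frag in fragments if len(frag) >= min_len]


if __name__ == "__main__":
    read = "ACGT" * 10 + "A" * 8 + "CGTA" * 10
    for fragment in split_at_homopolymers(read):
        print(len(fragment), fragment)
```

The resulting fragments would then go into the assembler as independent reads.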
Then I assess the quality of each assembly based on whether a bunch of known-to-be-good gene models are in there, since N50 etc. isn't really applicable in this case... or is it?
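For reference, the two metrics I am weighing up can be sketched in Python roughly as follows. This is only an illustration under assumptions: `n50` takes a plain list of contig lengths, and `fraction_recovered` takes a hypothetical dict mapping each trusted gene model to the fraction of its length covered by its best contig hit (computed separately, e.g. from an alignment). The 0.9 cutoff is arbitrary.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list).

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def fraction_recovered(best_coverage, min_cov=0.9):
    """Fraction of trusted gene models covered to at least min_cov.

    best_coverage is a hypothetical mapping: gene model ID -> fraction
    of that model's length covered by its best-matching contig.
    """
    if not best_coverage:
        return 0.0
    hits = sum(1 for cov in best_coverage.values() if cov >= min_cov)
    return hits / len(best_coverage)


if __name__ == "__main__":
    contigs = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print("N50:", n50(contigs))
    models = {"gene1": 0.95, "gene2": 0.50, "gene3": 0.90}
    print("recovered:", fraction_recovered(models))
```

Comparing both numbers across the k-mer range would at least show whether N50 and gene-model recovery agree on which assembly is best.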
Is this an OK approach? Any comments / suggestions appreciated.
Cheers!