#1 |
Junior Member
Location: MGEL Join Date: Feb 2011
Posts: 4
So we want to develop a sequencing strategy for a 50 Mbp Ascomycota genome (plant pathogen). There is no reference genome for this species, and the size estimate is based on other Ascomycota genome sizes. We do not know the G+C content or how much variation there is within the species.

Can you help me develop a sequencing strategy on a budget? Illumina GAIIx seems to be the most widely used and best supported, so it will likely be the platform of choice. It seems that a single lane of GAIIx data will be enough for a draft assembly: about 96X coverage, assuming 30-50X coverage is required for assembly.

Our goals are to create a draft assembly and ultimately a final high-quality assembly; find microsatellite markers to identify variation within and among the species; possibly find SNPs for the same purpose or for QTL; determine gene structure for later RNA-Seq or EST analysis; and compare genome-wide relationships with other fungi. Anything else? Our ultimate goal is to find host-pathogen relationships, which will help eliminate the pathogen in the host species.

So as far as I can tell:

#1 Isolate the genomic DNA from a single haploid culture of the fungus. I think that starting from a single haploid culture will help the assembly process, but will eliminate the possibility of finding SNPs. Will this also eliminate finding any microsatellites? Should I instead combine many isolates, since a single lane from the GAIIx will yield 96X coverage?

#2 Will using paired-end sequences provide a better assembly? Yes, right? Will paired-end reads provide better microsatellite detection? Is it worth the cost for our immediate goals of microsatellite detection and determining gene structure?

#3 After you receive the sequence data, you must filter and trim the data based on quality scores. This keeps bad sequences from confusing the assembly programs, right? Anyone have any favorite programs for this? Galaxy? FASTX?

#4 Once the sequences are “cleaned”, you must remove the repeat regions, right? This reduces the complexity for the assembly programs, right? Anyone have any favorite programs? RepeatMasker? Will de novo repeat finders essentially find what I am looking for (microsatellites)?
http://nar.oxfordjournals.org/conten...ks981.abstract
http://www.ncbi.nlm.nih.gov/pubmed/2...?dopt=Abstract
http://www.ncbi.nlm.nih.gov/pubmed/2...?dopt=Abstract
http://www.ncbi.nlm.nih.gov/pubmed/18782453

#5 I believe that our collaborators are familiar with Velvet and Abyss, so these programs should be able to assemble the genome. Any other favorite assemblers?

But are there better options for variant detection? Genotyping-by-sequencing? cortex_var? RAD-sequencing? I know these require a different experimental design than the one being proposed, but are they cost effective?

Please correct me on any mistake in judgment.

Thank you
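For readers following the coverage numbers in this post, here is a minimal sketch of the arithmetic in Python. The 50 Mbp genome size, the 96X lane figure, and the 30-50X assembly range come from the post; the function names are made up for illustration, and the estimate ignores reads lost to quality filtering, so treat the results as upper bounds.

```python
def expected_coverage(total_bases: float, genome_size: float) -> float:
    """Average depth of coverage = sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def bases_needed(target_coverage: float, genome_size: float) -> float:
    """Raw bases required to reach a target average coverage."""
    if target_coverage < 0 or genome_size <= 0:
        raise ValueError("coverage must be >= 0 and genome_size > 0")
    return target_coverage * genome_size


GENOME_SIZE = 50e6  # 50 Mbp, the estimate given in the post

# Yield implied by "96X from one lane" on a 50 Mbp genome.
lane_bases = bases_needed(96, GENOME_SIZE)
print(f"Implied lane yield: {lane_bases / 1e9:.1f} Gbp")

# How many isolates would fit in that lane at the 30X and 50X bounds,
# assuming (hypothetically) each isolate is sequenced to the same depth.
for per_isolate in (30, 50):
    n = int(expected_coverage(lane_bases, GENOME_SIZE) // per_isolate)
    print(f"{per_isolate}X per isolate -> {n} isolate(s) per lane")
```

The same two functions can be reused for any platform by swapping in its expected yield.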
#2 |
Senior Member
Location: Boston area Join Date: Nov 2007
Posts: 747
With the new 2x250 MiSeq chemistry, you might actually get better assemblies for around the same price as one lane of GAIIx; it would be worth asking around. One flowcell on MiSeq with that chemistry should be enough to get a draft assembly.

Paired end will definitely yield better assemblies. What you really want to shoot for is to size the fragments so that the two reads overlap in the middle by about 50-60 bp. Clearly, you'll need to pick a chemistry before you can do that sizing. The huge benefit is that you can then use a tool such as FLASH to merge many of the read pairs; if you size carefully, it may well be 75% or more. This means you have a lot of very long reads, plus their quality is improved in the overlap region (where it would otherwise be very low). I haven't experimented with combining FLASH with trimming; I think in general you don't want to trim first, though you might want to trim the reads that can't be merged.

Yes, a single haploid genome will by definition eliminate SNPs and any other true genetic variants; indeed, that data will be a good test of background noise in your variant calling scheme. Haploid is definitely easier to assemble, and, as suggested before, easier to debug.

For SNPs, you may well want to think about RAD-Seq or similar approaches with a pool of DNA from diverse samples; mapping these reads back to the haploid reference will mine a lot more variants than a single diploid could produce.

Given that the cost of library preparation has come down a lot, you might also contemplate sequencing multiple diverse haploid strains. An interesting question, which I have not explored, is whether in this case you are better off doing one ~100X genome or assembling 2 individual 50X genomes and then merging the assemblies with Minimus2 or similar.

Ray is an excellent assembler for large datasets, particularly if you have access to a cluster. If you don't have access to a cluster, it is pretty easy to set one up on the Amazon cloud using StarCluster and run very briefly there.

Unless it has changed substantially (I haven't used it in half a decade), RepeatMasker isn't suitable for discovering repeats; it's a tool for applying a known repeat library to clear out repeats. I suppose simple repeats are universal, and perhaps microsatellites as well. There are tools out there for repeat discovery, but I don't claim any familiarity with them.

Good luck!
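The fragment sizing described above comes down to simple arithmetic: if both reads of a pair have length L and should overlap by V bases, the fragment (excluding adapters) is 2L - V long, and a merged read spans the whole fragment. A rough sketch in Python follows; the function name is hypothetical, and real libraries have a size distribution around the target, so this only gives the value to aim for.

```python
def fragment_length_for_overlap(read_len: int, overlap: int) -> int:
    """Target fragment length (adapters excluded) so that two reads of
    length `read_len` overlap by `overlap` bases in the middle."""
    if read_len <= 0:
        raise ValueError("read_len must be positive")
    if not 0 < overlap <= read_len:
        raise ValueError("overlap must be between 1 and read_len")
    return 2 * read_len - overlap


# Target fragment sizes for the 50-60 bp overlap suggested in the post,
# for the read lengths discussed in this thread.
for read_len in (100, 150, 250):
    longest = fragment_length_for_overlap(read_len, 50)
    shortest = fragment_length_for_overlap(read_len, 60)
    print(f"2x{read_len}: aim for fragments of {shortest}-{longest} bp")
```

For 2x250 reads this works out to fragments of 440-450 bp, which is also the approximate length of the merged reads.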
#3 | |
Senior Member
Location: Purdue University, West Lafayette, Indiana Join Date: Aug 2008
Posts: 2,315
For a fungal genome, 20% of a 2x100 PE HiSeq lane will generally produce on the order of 5-8 billion bases of sequence. That is comfortably in the 100x range. Not sure how much it would cost to buy that amount of sequence in general, but you would be looking at $275 in reagents. Reagents for a 2x250 MiSeq run, by contrast, would run $1000 and generate (well, with the v2 upgrade) the same amount of sequence. Of course, the MiSeq run will take a couple of days, whereas the HiSeq run is closer to 2 weeks.

BTW, for HiSeq 2x100 PE data we often get our best ABySS assemblies at kmers around 80, which, if I am not mistaken, is a much higher kmer than most would consider.

-- Phillip
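One way to see why a k-mer as high as 80 can still work at this depth is the usual relation between base coverage C and k-mer coverage Ck for reads of length L (it is given in the Velvet manual): Ck = C * (L - k + 1) / L. The sketch below applies it to the numbers in this post; it is an illustration only, assumes error-free reads of uniform length, and is not something ABySS itself reports.

```python
def kmer_coverage(base_coverage: float, read_len: int, k: int) -> float:
    """k-mer coverage Ck = C * (L - k + 1) / L for reads of length L."""
    if not 0 < k <= read_len:
        raise ValueError("k must be between 1 and read_len")
    return base_coverage * (read_len - k + 1) / read_len


# Roughly 100x base coverage with 100 bp reads, as described in the post.
BASE_COVERAGE = 100
READ_LEN = 100

for k in (31, 51, 64, 80):
    ck = kmer_coverage(BASE_COVERAGE, READ_LEN, k)
    print(f"k={k}: k-mer coverage ~ {ck:.0f}x")
```

At 100x base coverage, k=80 still leaves about 21x k-mer coverage, whereas the same k at 30x base coverage would leave only about 6x.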
#4 | ||
Junior Member
Location: MGEL Join Date: Feb 2011
Posts: 4
OK, thanks for your advice. So from what I understand now (after talking to a fellow grad student):

The GAIIx is falling out of favor due to its high cost compared to the HiSeq and MiSeq. He told me that I would have a hard time finding a partner to share a flowcell on the GAIIx, so it's either fill up all lanes of the flow cell or choose another sequencer.
Last edited by bdbart; 10-24-2012 at 01:47 PM. |
#5 |
Senior Member
Location: Boston area Join Date: Nov 2007
Posts: 747
Alas, I don't have any good data for 2x250. The first couple of runs didn't work very well, as the genomes I'm working on have crazy %GC and the v2 chemistry didn't do well. We'll try again at some point, perhaps suffering a big phiX spike-in.

I should do a proper assessment, but I do believe from assemblies I've run that 2x150 assemblies at high coverage (with FLASH) are superior to 2x100 assemblies at similar coverage. But it is definitely true that if you can ride along with someone else's run, you'll save a lot of money using HiSeq, and you could put that money towards something more valuable in your project (such as sequencing multiple strains).
#6 |
Member
Location: Flagstaff, AZ Join Date: Sep 2012
Posts: 55
Haploid data will still find your repeats. I like Imperfect repeat finder (http://ssr.nwisrl.ars.usda.gov/), as many useful SSRs don't have a perfect repeat motif, but I think it has a limit on the amount of sequence it will process at once. Another option is WebSat (http://wsmartins.net/websat/), but it only finds perfect repeats and, if memory serves, it processes even less data than Imperfect repeat finder.

Once you get a draft assembly, plop in portions of your contigs and the program will show you where you have repeats and what they are. Sample broadly across your assembly and you should cover as much of the genome with your markers as you desire.

One advantage I forgot to mention about WebSat is that it has wonderful integration with Primer3, so designing primers for your new markers is outrageously simplified.
Andy |
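To make "perfect repeats" concrete, here is a small Python sketch that scans a sequence for perfect microsatellites with a regular expression. It is not how WebSat or Imperfect repeat finder work internally; the function names, the 2-6 bp motif range, and the minimum of 4 repeats are assumptions chosen for illustration, and imperfect repeats are not detected at all.

```python
import re


def is_primitive(motif: str) -> bool:
    """True if the motif is not itself a repetition of a shorter unit
    (e.g. 'ACAC' is just 'AC' twice, so it is not primitive)."""
    n = len(motif)
    return not any(
        n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n)
    )


def find_perfect_ssrs(seq, min_motif=2, max_motif=6, min_repeats=4):
    """Return (start, end, motif, repeat_count) for each perfect SSR.

    Coordinates are 0-based, end-exclusive. Only A/C/G/T motifs are
    considered, so runs containing N are skipped.
    """
    seq = seq.upper()
    hits = []
    for k in range(min_motif, max_motif + 1):
        # A k-mer followed by at least (min_repeats - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // k))
    return sorted(hits)


# Toy contig fragment: an (AC)6 and a (CAG)5 repeat with short flanks.
contig = "TTG" + "AC" * 6 + "GGT" + "CAG" * 5 + "TTA"
for start, end, motif, count in find_perfect_ssrs(contig):
    print(f"{start}-{end}: ({motif}){count}")
```

Run over the contigs of a draft assembly, the reported coordinates could then be used to pull flanking sequence for primer design.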