SEQanswers

10-22-2012, 04:34 PM   #1
bdbart
Junior Member
 
Location: MGEL

Join Date: Feb 2011
Posts: 4
50Mbp fungal genome strategy

So we want to develop a sequencing strategy for a 50 Mbp Ascomycota genome (plant pathogen). There is no reference genome for it, and the size estimate is based on other Ascomycota genome sizes. We do not know how much variation there is within the species, or its G+C content.

Can you help me develop a sequencing strategy on a budget???

Illumina GAIIx seems to be the most widely used and best supported... So this will likely be the platform of choice.

It seems that a single lane of data from the GAIIx will be sufficient in achieving enough data for a draft assembly....96X coverage.... assuming 30-50X coverage is required for assembly.
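For anyone checking the numbers, the coverage arithmetic behind that estimate can be sketched roughly as below. The function names are made up for illustration, and the 50 Mbp genome size is itself only a guess from related Ascomycota, so treat the output as ballpark figures.

```python
GENOME_SIZE = 50_000_000  # assumed 50 Mbp, estimated from related Ascomycota


def fold_coverage(total_bases, genome_size):
    """Average depth: sequenced bases divided by genome size."""
    return total_bases / genome_size


def bases_needed(target_coverage, genome_size):
    """Total sequenced bases required to reach a target average depth."""
    return target_coverage * genome_size


# 96X over 50 Mbp implies roughly 4.8 Gbp of usable sequence from the lane.
lane_bases = bases_needed(96, GENOME_SIZE)
print(f"lane yield implied by 96X: {lane_bases / 1e9:.1f} Gbp")

# The 30-50X range assumed to be enough for assembly needs far less than that.
for target in (30, 50):
    print(f"{target}X needs {bases_needed(target, GENOME_SIZE) / 1e9:.1f} Gbp")
```

If the real genome turns out larger or smaller than 50 Mbp, the same lane simply gives proportionally less or more depth.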

Our goals are to create a draft assembly and ultimately a final high-quality assembly.... find microsatellite markers to identify variation within and among the species.... possibly find SNPs for the same purpose or for QTL.... determine gene structure for later RNA-Seq or EST analysis.... comparison of genome-wide relationships with other fungi....???? Anything else????

Our ultimate goal is to find host-pathogen relationships.... Which will help eliminate the pathogen in the host species

So as far as I can tell.....

#1 Isolate the genomic DNA from a single haploid culture of the fungus

I think that coming from a single haploid culture will help in the assembly process....but will eliminate the possibility of finding SNPs. Will this also eliminate finding any microsatellites???

Should I instead combine many isolates, since a single lane from the GAIIx will yield 96X coverage???

#2 Will using paired-end sequences provide for a better assembly? Yes...right???

Will paired-end reads provide better microsatellite detection?? Is it worth the cost for our immediate goals of microsatellite detection and determining gene structure???

#3 After you receive the sequence data, you must filter and trim the data based on quality scores...this helps eliminate bad sequences from confusing the assembly programs....right???

Anyone have any favorite programs for this.... Galaxy...FASTX....????
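To make the idea in #3 concrete, here is a minimal sketch of 3'-end quality trimming followed by a length filter. It is not how Galaxy or FASTX implement it; the function names, the Q20 threshold and the 30 bp minimum length are illustrative assumptions, and it assumes Phred+33 quality encoding (older Illumina pipelines wrote Phred+64, so check your files).

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end until the last base has quality >= min_q.

    `qual` is the FASTQ quality string; `offset` is the ASCII offset of the
    encoding (33 assumed here).
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def filter_read(seq, qual, min_q=20, min_len=30, offset=33):
    """Trim the read, then discard it (return None) if it became too short."""
    seq, qual = trim_3prime(seq, qual, min_q, offset)
    if len(seq) < min_len:
        return None
    return seq, qual


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
print(trim_3prime("ACGTACGT", "IIIII###"))
```

Only the tail is trimmed here; low-quality bases in the middle of a read are left alone, which is a deliberate simplification.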

#4 Once the sequences are “cleaned”...you must remove the repeat regions.... right?? This reduces the complexity of assembly programs....right??

Anyone have any favorite programs???....RepeatMasker

Will de novo repeat finders essentially find what I am looking for....microsatellites???

de novo repeat finders???

http://nar.oxfordjournals.org/conten...ks981.abstract
http://www.ncbi.nlm.nih.gov/pubmed/2...?dopt=Abstract
http://www.ncbi.nlm.nih.gov/pubmed/2...?dopt=Abstract
http://www.ncbi.nlm.nih.gov/pubmed/18782453


#5 I believe that our collaborators are familiar with Velvet and Abyss, so these programs should be able to assemble the genome.....

Any other favorite assemblers???

But are there better options for variant detection?
genotyping-by-sequencing??
cortex_var???
RAD-sequencing??

These require a different experimental design than the one being proposed...I know...but are they cost effective???

Please correct me on any mistake in judgment.... Thank you
10-22-2012, 07:06 PM   #2
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747

With the new 2x250 MiSeq chemistry, you might actually get better assemblies for around the same price as one lane of GAIIx -- it would be worth asking around. One flowcell on MiSeq with that chemistry should be enough to get a draft assembly.

Paired end will definitely yield better assemblies. What you really want to shoot for is to size the fragments so that they overlap in the middle by about 50-60 bp. Clearly, you'll need to pick a chemistry before you can do that sizing. The huge benefit is that you can then use a tool such as FLASH to merge many of the reads; if you size carefully it may well be 75% or more. This means you have a lot of very long reads, plus their quality is improved in the overlap region (where it would otherwise be very low). I haven't experimented with combining FLASH with trimming; I think in general you don't want to trim first, though you might want to trim the reads that can't be paired.
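The sizing rule above can be written down as a small sketch, together with a deliberately naive merge of an overlapping pair. This is not FLASH's algorithm (FLASH tolerates mismatches and uses quality scores); the function names and the exact-match requirement are assumptions made purely for illustration.

```python
def target_fragment_length(read_len, overlap):
    """Fragment size at which two reads of read_len overlap by `overlap` bases."""
    return 2 * read_len - overlap


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def naive_merge(read1, read2, min_overlap=10):
    """Merge a pair on the longest exact overlap between the 3' end of read1
    and the 5' end of the reverse-complemented read2.

    Returns the merged sequence, or None if no overlap of at least
    min_overlap bases is found.
    """
    r2 = revcomp(read2)
    for n in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        if read1[-n:] == r2[:n]:
            return read1 + r2[n:]
    return None


# For 2x250 reads, a 50-60 bp overlap means fragments of about 440-450 bp.
for overlap in (50, 60):
    print(overlap, target_fragment_length(250, overlap))
```

In practice the library has a spread of fragment sizes around the target, which is why only a fraction of pairs end up mergeable.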

Yes, a single haploid genome will by definition eliminate SNPs and any other true genetic variants; indeed, that data will be a good test of background noise in your variant calling scheme. Haploid is definitely easier to assemble, and as suggested before easier to debug.

For SNPs, you may well want to think about RAD-Seq or similar approaches with a pool of DNA from diverse samples; mapping these reads back to the haploid reference will mine a lot more variants than a single diploid could produce. Given that the cost of library preparation has come down a lot, you might also contemplate sequencing multiple diverse haploid strains. An interesting question, which I have not explored, is whether in this case you are better off doing one ~100X genome or assembling 2 individual 50X genomes and then merging the assemblies with Minimus2 or similar.


Ray is an excellent assembler for large datasets, particularly if you have access to a cluster. If you don't have access to a cluster, it is pretty easy to set one up on the Amazon cloud using StarCluster and run it very briefly there.

Unless it has changed substantially (I haven't used it in half a decade), RepeatMasker isn't suitable for discovering repeats; it's a tool for applying a known repeat library to clear out repeats. I suppose simple repeats are universal, and perhaps microsatellites as well. There are tools out there for repeat discovery, but I don't claim any familiarity with them.

Good luck!
10-23-2012, 04:41 AM   #3
pmiguel
Senior Member
 
Location: Purdue University, West Lafayette, Indiana

Join Date: Aug 2008
Posts: 2,315

Quote:
Originally Posted by krobison
With the new 2x250 MiSeq chemistry, you might actually get better assemblies for around the same price as one lane of GAIIx -- it would be worth asking around. One flowcell on MiSeq with that chemistry should be enough to get a draft assembly.

Paired end will definitely yield better assemblies. What you really want to shoot for is to size the fragments so that they overlap in the middle by about 50-60 bp. Clearly, you'll need to pick a chemistry before you can do that sizing. The huge benefit is that you can then use a tool such as FLASH to merge many of the reads; if you size carefully it may well be 75% or more. This means you have a lot of very long reads, plus their quality is improved in the overlap region (where it would otherwise be very low).
Not sure I agree with this assessment. I want to believe it, because we have MiSeqs. But it seems based on the idea that lower coverage with longer (lower quality) reads will yield better results than shorter higher quality reads. Do you have any evidence to support this?

For a fungal genome 20% of a 2x100 PE HiSeq lane will generally produce on the order of 5-8 billion bases of sequence. That is comfortably in the 100x range. Not sure how much it would cost to buy that amount of sequence in general, but you would be looking at $275 in reagents. Whereas reagents for a 2x250 MiSeq run would run $1000 and generate -- well with the v2 upgrade the same amount of sequence.
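Put as reagent cost per 1X of coverage, the comparison looks roughly like the sketch below. It uses only the figures quoted above ($275 vs $1000, 5-8 Gbp) and assumes the 50 Mbp genome size from the original question and that the MiSeq v2 run really does land in the same yield range; the function name is made up for illustration.

```python
GENOME_SIZE = 50_000_000  # assumed 50 Mbp, as in the original question


def cost_per_fold(reagent_cost, total_bases, genome_size=GENOME_SIZE):
    """Reagent dollars per 1X of average coverage."""
    return reagent_cost / (total_bases / genome_size)


# Reagent costs quoted in the post; both options assumed to yield 5-8 Gbp.
options = {
    "HiSeq 2x100 (20% of a lane)": 275,
    "MiSeq 2x250 v2 (one run)": 1000,
}

for name, cost in options.items():
    best = cost_per_fold(cost, 8e9)   # high end of the yield range
    worst = cost_per_fold(cost, 5e9)  # low end of the yield range
    print(f"{name}: ${best:.2f} to ${worst:.2f} per 1X")
```

This ignores library prep, labor and facility markup, so it only compares the reagent line.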

Of course the MiSeq run will take a couple of days, whereas the HiSeq run is closer to 2 weeks.

BTW, for HiSeq 2x100 PE data we often get our best ABySS assemblies at kmers around 80. Which, if I am not mistaken, is a much higher kmer than most would consider.
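One way to see why such a high kmer needs deep coverage is the usual k-mer coverage relation, Ck = C * (L - k + 1) / L, where C is base coverage and L is read length. A minimal sketch, with a made-up function name and ignoring sequencing errors and uneven coverage:

```python
def kmer_coverage(base_coverage, read_len, k):
    """Expected k-mer coverage: each read of length L contributes L - k + 1
    k-mers, so Ck = C * (L - k + 1) / L."""
    return base_coverage * (read_len - k + 1) / read_len


# With 100 bp reads at 100X, compare a typical small k with k = 80.
for k in (31, 51, 80):
    print(k, kmer_coverage(100, 100, k))
```

At k=80 only about a fifth of the base coverage survives as k-mer coverage, so a kmer that high is probably only workable because the run is in the 100X range to begin with.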

--
Phillip
10-24-2012, 01:44 PM   #4
bdbart
Junior Member
 
Location: MGEL

Join Date: Feb 2011
Posts: 4

OK thanks for your advice... So from what I understand now (after talking to a fellow grad-student)...

The GAIIx is falling out of favor....due to its high cost compared to the HiSeq and MiSeq

He told me that I would have a hard time finding a partner to share a flowcell on the GAIIx.... So it's either fill up all lanes of the flow cell or choose another sequencer.

Quote:
For a fungal genome 20% of a 2x100 PE HiSeq lane will generally produce on the order of 5-8 billion bases of sequence. That is comfortably in the 100x range.
So I would have to share a lane of data with someone else??? Is that common practice?? Or would I have to develop a strategy to efficiently use an entire lane of data?....i.e.... Sequence multiple isolates...etc...

Quote:
Not sure I agree with this assessment. I want to believe it, because we have MiSeqs. But it seems based on the idea that lower coverage with longer (lower quality) reads will yield better results than shorter higher quality reads. Do you have any evidence to support this?
I'm not sure what you don't agree with.... But from what I understand...longer reads should generate better assemblies.... and his paired-end strategy will essentially create longer reads.... quality is not being reduced, quality is only being enhanced in the overlap regions

Last edited by bdbart; 10-24-2012 at 01:47 PM.
10-25-2012, 06:56 PM   #5
krobison
Senior Member
 
Location: Boston area

Join Date: Nov 2007
Posts: 747

Alas, I don't have any good data for 2x250 -- the first couple runs didn't work very well, as the genomes I'm working on have crazy %GC & the v2 chemistry didn't do well. We'll try again at some point, perhaps suffering a big phiX spike-in.

I should do a proper assessment, but I do believe from assemblies I've run that 2x150 assemblies at high coverage (with FLASH) are superior to 2x100 assemblies at similar coverage. But it is definitely true that if you can ride along with someone else's run, you'll save a lot of money using HiSeq, and you could put that money towards something more valuable in your project (such as sequencing multiple strains).
11-30-2012, 06:01 AM   #6
LVAndrews
Member
 
Location: Flagstaff, AZ

Join Date: Sep 2012
Posts: 55

Haploid data will still find your repeats. I like Imperfect repeat finder (http://ssr.nwisrl.ars.usda.gov/) as many useful SSRs don't have a perfect repeat motif, but I think it has a limit on the amount of sequence it will process at once. Another option is WebSat (http://wsmartins.net/websat/), but it only finds perfect repeats and if memory serves, it processes even less data than Imperfect repeat finder. Once you get a draft assembly, plop in portions of your contigs and the program will show you where you have repeats and what they are. Sample broadly across your assembly and you should cover as much of the genome with your markers as you desire. One advantage I forgot to mention about Websat is it has wonderful integration with Primer3 so designing primers for your new markers is outrageously simplified.
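If you want a quick look at perfect repeats in your contigs before pasting them into a web tool, a regular expression gets you surprisingly far. The sketch below is not what WebSat or Imperfect repeat finder do internally; the function name, the 2-6 bp motif range and the minimum of 5 repeats are assumptions for illustration, and it finds perfect repeats only.

```python
import re


def find_perfect_ssrs(seq, min_unit=2, max_unit=6, min_repeats=5):
    """Return (start, motif, repeat_count) for perfect tandem repeats.

    Motifs of min_unit..max_unit bases repeated at least min_repeats times
    are reported; the shortest motif is preferred, and motifs made of a
    single base (homopolymer runs) are skipped.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        if len(set(unit)) == 1:  # e.g. "AA": really a homopolymer run
            continue
        hits.append((m.start(), unit, len(m.group(0)) // len(unit)))
    return hits


contig = "GGCT" + "AC" * 6 + "TTGACG" + "GAT" * 5 + "CCA"
for start, motif, count in find_perfect_ssrs(contig):
    print(start, motif, count)
```

Start positions are 0-based. You would still want flanking sequence around each hit for primer design, which is where the Primer3 integration mentioned above comes in.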

Andy