**krobison** · 10-22-2012, 07:06 PM

With the new 2x250 MiSeq chemistry, you might actually get better assemblies for around the same price as one lane of GAIIx -- it would be worth asking around. One flowcell on MiSeq with that chemistry should be enough to get a draft assembly.

Paired end will definitely yield better assemblies. What you really want to shoot for is to size the fragments so that they overlap in the middle by about 50-60 bp. Clearly, you'll need to pick a chemistry before you can do that sizing. The huge benefit is that you can then use a tool such as FLASH to merge many of the reads; if you size carefully it may well be 75% or more. This means you have a lot of very long reads, plus their quality is improved in the overlap region (where it would otherwise be very low). I haven't experimented with combining FLASH with trimming; I think in general you don't want to trim first, though you might want to trim the reads that can't be paired.
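The sizing arithmetic above can be sketched in a few lines. This is only an illustration, assuming fragment length means the sequenced insert without adapters; the function name `target_fragment_length` is made up for the example and is not part of FLASH or any other tool.

```python
def target_fragment_length(read_length: int, overlap: int) -> int:
    """Insert size at which the two reads of a pair overlap by `overlap` bases.

    Two reads of length R read inward from opposite ends of a fragment of
    length F, so they share 2*R - F bases; solving for F gives 2*R - overlap.
    """
    if not 0 < overlap <= read_length:
        raise ValueError("overlap must be between 1 and the read length")
    return 2 * read_length - overlap


if __name__ == "__main__":
    # 50-60 bp is the overlap range suggested in the post above.
    for read_length in (100, 150, 250):
        shortest = target_fragment_length(read_length, 60)
        longest = target_fragment_length(read_length, 50)
        print(f"2x{read_length}: aim for fragments of about {shortest}-{longest} bp")
```

A successfully merged pair is as long as the fragment itself, so under these assumptions a 2x250 run sized this way would give merged reads of roughly 440-450 bp.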

Yes, a single haploid genome will by definition eliminate SNPs and any other true genetic variants; indeed, that data will be a good test of background noise in your variant calling scheme. Haploid is definitely easier to assemble and, as suggested before, easier to debug.

For SNPs, you may well want to think about RAD-Seq or similar approaches with a pool of DNA from diverse samples; mapping these reads back to the haploid reference will mine a lot more variants than a single diploid could produce. Given that the cost of library preparation has come down a lot, you might also contemplate sequencing multiple diverse haploid strains. An interesting question, which I have not explored, is whether in this case you are better off doing one ~100X genome or assembling two individual 50X genomes and then merging the assemblies with Minimus2 or similar.

Ray is an excellent assembler for large datasets, particularly if you have access to a cluster. If you don't have access to a cluster, it is pretty easy to set one up on the Amazon cloud using StarCluster and run very briefly there.

Unless it has changed substantially (I haven't used it in half a decade), RepeatMasker isn't suitable for discovering repeats; it's a tool for applying a known repeat library to clear out repeats. I suppose simple repeats are universal, and perhaps microsatellites as well. There are tools out there for repeat discovery, but I don't claim any familiarity with them.

Good luck!

**pmiguel** · 10-23-2012, 04:41 AM

Originally posted by krobison:

> With the new 2x250 MiSeq chemistry, you might actually get better assemblies for around the same price as one lane of GAIIx -- it would be worth asking around. One flowcell on MiSeq with that chemistry should be enough to get a draft assembly.
>
> Paired end will definitely yield better assemblies. What you really want to shoot for is to size the fragments so that they overlap in the middle by about 50-60 bp. Clearly, you'll need to pick a chemistry before you can do that sizing. The huge benefit is that you can then use a tool such as FLASH to merge many of the reads; if you size carefully it may well be 75% or more. This means you have a lot of very long reads, plus their quality is improved in the overlap region (where it would otherwise be very low).

Not sure I agree with this assessment. I want to believe it, because we have MiSeqs. But it seems to be based on the idea that lower coverage with longer (lower quality) reads will yield better results than shorter, higher quality reads. Do you have any evidence to support this?

For a fungal genome, 20% of a 2x100 PE HiSeq lane will generally produce on the order of 5-8 billion bases of sequence. That is comfortably in the 100x range. Not sure how much it would cost to buy that amount of sequence in general, but you would be looking at $275 in reagents, whereas reagents for a 2x250 MiSeq run would run $1000 and generate (well, with the v2 upgrade) the same amount of sequence.
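To make the numbers above concrete, here is a rough back-of-the-envelope sketch. It assumes the 50 Mbp genome size from the thread title and takes the 5-8 Gb yield and the reagent prices straight from the paragraph above; the helper names `coverage` and `cost_per_gb` are just illustrative.

```python
GENOME_SIZE_BP = 50e6  # assumption: 50 Mbp, taken from the thread title


def coverage(total_bases: float, genome_size: float = GENOME_SIZE_BP) -> float:
    """Average depth of coverage: sequenced bases divided by genome size."""
    return total_bases / genome_size


def cost_per_gb(reagent_cost_usd: float, total_bases: float) -> float:
    """Reagent cost in dollars per billion bases sequenced."""
    return reagent_cost_usd / (total_bases / 1e9)


if __name__ == "__main__":
    # Yield range and reagent prices as quoted in the post; both platforms
    # are assumed to deliver the same 5-8 Gb, as the post suggests.
    for total_bases in (5e9, 8e9):
        print(
            f"{total_bases / 1e9:.0f} Gb: {coverage(total_bases):.0f}x coverage, "
            f"HiSeq share ${cost_per_gb(275, total_bases):.2f}/Gb, "
            f"MiSeq run ${cost_per_gb(1000, total_bases):.2f}/Gb"
        )
```

Under these assumptions the quoted yield works out to about 100-160x for a 50 Mbp genome, which is where the "comfortably in the 100x range" figure comes from.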

Of course the MiSeq run will take a couple of days, whereas the HiSeq run is closer to 2 weeks.

BTW, for HiSeq 2x100 PE data we often get our best ABySS assemblies at kmers around 80. Which, if I am not mistaken, is a much higher kmer than most would consider.

--
Phillip

**bdbart** · 10-24-2012, 01:44 PM

OK thanks for your advice... So from what I understand now (after talking to a fellow grad-student)...

The GAIIx is falling out of favor... due to its high cost compared to the HiSeq and MiSeq.

He told me that I would have a hard time finding a partner to share a flowcell on the GAIIx... So it's either fill up all lanes of the flow cell or choose another sequencer.

> For a fungal genome 20% of a 2x100 PE HiSeq lane will generally produce on the order of 5-8 billion bases of sequence. That is comfortably in the 100x range.

So I would have to share a lane of data with someone else? Is that common practice? Or would I have to develop a strategy to efficiently use an entire lane of data, i.e. sequence multiple isolates, etc.?

> Not sure I agree with this assessment. I want to believe it, because we have MiSeqs. But it seems based on that idea that lower coverage with longer (lower quality) reads will yield better results than shorter higher quality reads. Do you have any evidence to support this?

I'm not sure what you don't agree with... But from what I understand, longer reads should generate better assemblies, and his paired-end strategy will essentially create longer reads. Quality is not being reduced; quality is only being enhanced in the overlap regions.

**krobison** · 10-25-2012, 06:56 PM

Alas, I don't have any good data for 2x250 -- the first couple of runs didn't work very well, as the genomes I'm working on have crazy %GC and the v2 chemistry didn't do well. We'll try again at some point, perhaps suffering a big phiX spike-in.

I should do a proper assessment, but I do believe from assemblies I've run that 2x150 assemblies at high coverage (with FLASH) are superior to 2x100 assemblies at similar coverage. But it is definitely true that if you can ride along with someone else's run, you'll save a lot of money using HiSeq, and you could put that money towards something more valuable in your project (such as sequencing multiple strains).

**LVAndrews** · 11-30-2012, 07:01 AM

Haploid data will still find your repeats. I like Imperfect repeat finder (http://ssr.nwisrl.ars.usda.gov/), as many useful SSRs don't have a perfect repeat motif, but I think it has a limit on the amount of sequence it will process at once. Another option is WebSat (http://wsmartins.net/websat/), but it only finds perfect repeats and, if memory serves, it processes even less data than Imperfect repeat finder. Once you get a draft assembly, plop in portions of your contigs and the program will show you where you have repeats and what they are. Sample broadly across your assembly and you should cover as much of the genome with your markers as you desire. One advantage of WebSat I forgot to mention is its wonderful integration with Primer3, so designing primers for your new markers is outrageously simplified.

Andy

50Mbp fungal genome strategy