  • 50Mbp fungal genome strategy

    So we want to develop a sequencing strategy for a 50 Mbp Ascomycota genome (plant pathogen). There is no reference genome, and the size estimate is based on other Ascomycota genome sizes. We do not know how much variation there is within the species, or the G+C content.

    Can you help me develop a sequencing strategy on a budget???

    Illumina GAIIx seems to be the most widely used and best supported... So this will likely be the platform of choice.

    It seems that a single lane of data from the GAIIx will be sufficient in achieving enough data for a draft assembly....96X coverage.... assuming 30-50X coverage is required for assembly.
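As a rough check of the coverage arithmetic above, here is a minimal Python sketch. The 4.8 Gbp lane yield is an assumption back-calculated from the 96X figure and the 50 Mbp estimate, not a vendor specification, and the function names are made up for illustration.

```python
def coverage(total_bases: float, genome_size: float) -> float:
    """Average depth of coverage = sequenced bases / genome size."""
    return total_bases / genome_size


def max_isolates(total_bases: float, genome_size: float, target_depth: float) -> int:
    """How many genomes of this size fit in the yield at a given depth each."""
    return int(total_bases // (genome_size * target_depth))


GENOME_SIZE = 50e6  # 50 Mbp estimate from the post
LANE_YIELD = 4.8e9  # ASSUMED lane yield, implied by 96X * 50 Mbp

print(f"{coverage(LANE_YIELD, GENOME_SIZE):.0f}X for a single isolate")
for depth in (30, 50):
    n = max_isolates(LANE_YIELD, GENOME_SIZE, depth)
    print(f"{n} isolate(s) at {depth}X each")
```

Under those assumptions, one lane holds about three isolates at 30X each but only one at 50X, which is the trade-off behind the pooling question in #1 below.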

    Our goals are to create a draft assembly and ultimately a final high-quality assembly.... find microsatellite markers to identify variation within and among the species.... possibly find SNPs for the same purpose or QTL.... determine gene structure for later RNA-Seq or EST analysis.... comparison of genome-wide relationships with other fungi....???? Anything else????

    Our ultimate goal is to find host-pathogen relationships.... Which will help eliminate the pathogen in the host species

    So as far as I can tell.....

    #1 Isolate the genomic DNA from a single haploid culture of the fungus

    I think that coming from a single haploid culture will help in the assembly process....but will eliminate the possibility of finding SNPs. Will this also eliminate finding any microsatellites???

    Should I instead combine many isolates, since a single lane from the GAIIx will yield 96X coverage???

    #2 Will using paired-end sequences provide for a better assembly? Yes...right???

    Will paired-end reads provide better microsatellite detection?? Is it worth the cost for our immediate goals of microsatellite detection and determining gene structure???

    #3 After you receive the sequence data, you must filter and trim the data based on quality scores...this helps eliminate bad sequences from confusing the assembly programs....right???

    Anyone have any favorite programs for this.... Galaxy...FASTX....????
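To make the filter-and-trim idea in #3 concrete, here is a minimal Python sketch of 3' quality trimming. It is an illustration only, not how Galaxy or FASTX do it. It assumes Phred+33 quality encoding (some older Illumina pipelines wrote Phred+64, so check your files), and the thresholds `min_q=20` and `min_len=30` are arbitrary example values.

```python
def phred33(qual: str) -> list[int]:
    """Decode a FASTQ quality string, ASSUMING Phred+33 encoding."""
    return [ord(c) - 33 for c in qual]


def trim_3prime(seq: str, qual: str, min_q: int = 20) -> tuple[str, str]:
    """Drop bases from the 3' end until one has quality >= min_q."""
    scores = phred33(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq: str, min_len: int = 30) -> bool:
    """Discard reads that are too short after trimming."""
    return len(seq) >= min_len


# 'I' decodes to Q40 and '#' to Q2 under Phred+33
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIIII##")
print(trimmed_seq, trimmed_qual, keep_read(trimmed_seq, min_len=5))
```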

    #4 Once the sequences are “cleaned”...you must remove the repeat regions.... right?? This reduces the complexity of assembly programs....right??

    Anyone have any favorite programs???....RepeatMasker

    Will de novo repeat finders essentially find what I am looking for....microsatellites???

    de novo repeat finders???


    Polymorphic microsatellite markers from Korean water deer were successfully identified using NGS without any prior sequence information and deposited into the public database. Thus, the methods described herein represent a rapid and low-cost way to investigate the population genetics of endangered/n …

    Identification of microsatellites, or simple sequence repeats (SSRs), can be a time-consuming and costly investment requiring enrichment, cloning, and sequencing of candidate loci. Recently, however, high throughput sequencing (with or without prior enrichment for specific SSR loci) has been utilize …

    The ReRep approach for identification of repetitive elements in GSS datasets proved to be straightforward and efficient. Several potential repetitive sequences were found in a L. braziliensis GSS dataset generated in our laboratory, and further validated by the analysis of a more complete genomic da …



    #5 I believe that our collaborators are familiar with Velvet and ABySS, so these programs should be able to assemble the genome.....

    Any other favorite assemblers???

    But are there better options for variant detection?
    genotyping-by-sequencing??
    cortex_var???
    RAD-sequencing??

    These require a different experimental design than the one being proposed...I know...but are they cost effective???

    Please correct me on any mistake in judgment.... Thank you

  • #2
    With the new 2x250 MiSeq chemistry, you might actually get better assemblies for around the same price as one lane of GAIIx -- it would be worth asking around. One flowcell on MiSeq with that chemistry should be enough to get a draft assembly.

    Paired end will definitely yield better assemblies. What you really want to shoot for is to size the fragments so that they overlap in the middle by about 50-60 bp. Clearly, you'll need to pick a chemistry before you can do that sizing. The huge benefit is that you can then use a tool such as FLASH to merge many of the reads; if you size carefully it may well be 75% or more. This means you have a lot of very long reads, plus their quality is improved in the overlap region (where it would otherwise be very low). I haven't experimented with combining FLASH with trimming; I think in general you don't want to trim first, though you might want to trim the reads that can't be paired.
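The fragment sizing described above is simple arithmetic: for two reads of length L that overlap by v bases, the fragment (insert, excluding adapters) is 2L - v long, and that is also the length of the merged read. A minimal Python sketch follows; the function name is hypothetical and the read lengths are just the chemistries mentioned in this thread.

```python
def target_fragment_length(read_length: int, overlap: int) -> int:
    """Insert size (without adapters) at which two reads of read_length
    overlap by `overlap` bases; also the length of the merged read."""
    if not 0 <= overlap <= read_length:
        raise ValueError("overlap must be between 0 and read_length")
    return 2 * read_length - overlap


for read_length in (100, 150, 250):
    low = target_fragment_length(read_length, 60)
    high = target_fragment_length(read_length, 50)
    print(f"2x{read_length}: aim for fragments of about {low}-{high} bp")
```

Real libraries have a spread of fragment sizes around the target, which is why only a fraction of the pairs end up mergeable.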

    Yes, a single haploid genome will by definition eliminate SNPs and any other true genetic variants; indeed, that data will be a good test of background noise in your variant calling scheme. Haploid is definitely easier to assemble, and as suggested before easier to debug.

    For SNPs, you may well want to think about RAD-Seq or similar approaches with a pool of DNA from diverse samples; mapping these reads back to the haploid reference will mine a lot more variants than a single diploid could produce. Given that the cost of library preparation has come down a lot, you might also contemplate sequencing multiple diverse haploid strains. An interesting question, which I have not explored, is whether in this case you are better doing one ~100X genome or assembling 2 individual 50X genomes and then merging the assemblies with Minimus2 or similar.


    Ray is an excellent assembler for large datasets, particularly if you have access to a cluster. If you don't have access to a cluster, it is pretty easy to set one up on the Amazon cloud using StarCluster and run very briefly there.

    Unless it has changed substantially (I haven't used it in half a decade), RepeatMasker isn't suitable for discovering repeats; it's a tool for applying a known repeat library to clear out repeats. I suppose simple repeats are universal, and perhaps microsatellites as well. There are tools out there for repeat discovery, but I don't claim any familiarity with them.

    Good luck!

    • #3
      Originally posted by krobison View Post
      With the new 2x250 MiSeq chemistry, you might actually get better assemblies for around the same price as one lane of GAIIx -- it would be worth asking around. One flowcell on MiSeq with that chemistry should be enough to get a draft assembly.

      Paired end will definitely yield better assemblies. What you really want to shoot for is to size the fragments so that they overlap in the middle by about 50-60 bp. Clearly, you'll need to pick a chemistry before you can do that sizing. The huge benefit is that you can then use a tool such as FLASH to merge many of the reads; if you size carefully it may well be 75% or more. This means you have a lot of very long reads, plus their quality is improved in the overlap region (where it would otherwise be very low).
      Not sure I agree with this assessment. I want to believe it, because we have MiSeqs. But it seems based on the idea that lower coverage with longer (lower quality) reads will yield better results than shorter, higher quality reads. Do you have any evidence to support this?

      For a fungal genome 20% of a 2x100 PE HiSeq lane will generally produce on the order of 5-8 billion bases of sequence. That is comfortably in the 100x range. Not sure how much it would cost to buy that amount of sequence in general, but you would be looking at $275 in reagents. Whereas reagents for a 2x250 MiSeq run would run $1000 and generate -- well with the v2 upgrade the same amount of sequence.
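The numbers in that comparison can be laid out with a short Python sketch. It uses only the figures quoted above (5-8 Gbp, $275 vs. $1000 in reagents) plus the 50 Mbp genome estimate from the first post. Treating the MiSeq yield as the same 5-8 Gbp range is an assumption read from "the same amount of sequence", and reagent cost is not the price a core facility would charge.

```python
GENOME_SIZE = 50e6  # 50 Mbp estimate from the first post


def depth(yield_bases: float) -> float:
    """Average coverage for the assumed genome size."""
    return yield_bases / GENOME_SIZE


def cost_per_gbp(reagent_cost: float, yield_bases: float) -> float:
    """Reagent dollars per gigabase of raw sequence."""
    return reagent_cost / (yield_bases / 1e9)


options = {
    "20% of a HiSeq 2x100 lane": 275,  # reagent cost quoted above
    "one MiSeq 2x250 run": 1000,       # reagent cost quoted above
}
for name, cost in options.items():
    for yield_bases in (5e9, 8e9):     # ASSUMED same yield range for both
        print(f"{name}: {depth(yield_bases):.0f}X, "
              f"${cost_per_gbp(cost, yield_bases):.0f}/Gbp")
```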

      Of course the MiSeq run will take a couple of days, whereas the HiSeq run is closer to 2 weeks.

      BTW, for HiSeq 2x100 PE data we often get our best ABySS assemblies at kmers around 80. Which, if I am not mistaken, is a much higher kmer than most would consider.
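One way to see why such a high k can still work at this depth is the usual k-mer coverage relation, Ck = C * (L - k + 1) / L, as given in the Velvet manual. Here is a minimal Python sketch of it; the 100x base coverage is taken from the estimate above, and whether about 20x k-mer coverage is enough for a particular genome is something you would have to test.

```python
def kmer_coverage(base_coverage: float, read_length: int, k: int) -> float:
    """k-mer coverage Ck = C * (L - k + 1) / L for reads of length L."""
    if not 1 <= k <= read_length:
        raise ValueError("k must be between 1 and read_length")
    return base_coverage * (read_length - k + 1) / read_length


for k in (31, 51, 80):
    ck = kmer_coverage(100, 100, k)
    print(f"k={k}: about {ck:.0f}x k-mer coverage from 100x of 100 bp reads")
```

With only 21 k-mers per 100 bp read at k=80, high base coverage is what keeps the k-mer coverage usable.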

      --
      Phillip

      • #4
        OK thanks for your advice... So from what I understand now (after talking to a fellow grad-student)...

        The GAIIx is falling out of favor....due to its high cost compared to the HiSeq and MiSeq

        He told me that I would have a hard time finding a partner to share a flowcell on the GAIIx.... So it's either fill up all lanes of the flow cell or choose another sequencer.

        For a fungal genome 20% of a 2x100 PE HiSeq lane will generally produce on the order of 5-8 billion bases of sequence. That is comfortably in the 100x range.
        So I would have to share a lane of data with someone else??? Is that common practice?? Or would I have to develop a strategy to efficiently use an entire lane of data?....i.e.... Sequence multiple isolates...etc...

        Not sure I agree with this assessment. I want to believe it, because we have MiSeqs. But it seems based on that idea that lower coverage with longer (lower quality) reads will yield better results than shorter higher quality reads. Do you have any evidence to support this?
        I'm not sure what you don't agree with.... But from what I understand...longer reads should generate better assemblies.... and his paired-end strategy will essentially create longer reads.... quality is not being reduced, quality is only being enhanced in the overlap regions
        Last edited by bdbart; 10-24-2012, 01:47 PM.

        • #5
          Alas, I don't have any good data for 2x250 -- the first couple runs didn't work very well, as the genomes I'm working on have crazy %GC & the v2 chemistry didn't do well. We'll try again at some point, perhaps suffering a big phiX spike-in.

          I should do a proper assessment, but I do believe from assemblies I've run that 2x150 assemblies at high coverage (with FLASH) are superior to 2x100 assemblies at similar coverage. But it is definitely true that if you can ride along with someone else's run, you'll save a lot of money using HiSeq, and you could put that money towards something more valuable in your project (such as sequencing multiple strains).

          • #6
            Haploid data will still find your repeats. I like Imperfect repeat finder (http://ssr.nwisrl.ars.usda.gov/) as many useful SSRs don't have a perfect repeat motif, but I think it has a limit on the amount of sequence it will process at once. Another option is WebSat (http://wsmartins.net/websat/), but it only finds perfect repeats and, if memory serves, it processes even less data than Imperfect repeat finder. Once you get a draft assembly, plop in portions of your contigs and the program will show you where you have repeats and what they are. Sample broadly across your assembly and you should cover as much of the genome with your markers as you desire. One advantage I forgot to mention about WebSat is that it has wonderful integration with Primer3, so designing primers for your new markers is outrageously simplified.
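For a sense of what "perfect repeats" means in practice, here is a minimal Python sketch that scans a contig for perfect SSRs with a regular expression. It is not the algorithm used by either tool above. The function name and the default thresholds (motifs of 2-6 bp, at least 5 copies) are illustrative assumptions, and it will miss the imperfect repeats that Imperfect repeat finder is designed to catch.

```python
import re


def find_perfect_ssrs(seq: str, min_unit: int = 2, max_unit: int = 6,
                      min_repeats: int = 5) -> list[tuple[int, str, int]]:
    """Return (start, motif, copies) for each perfect tandem repeat found.

    The lazy quantifier tries the shortest motif first, so ACACAC... is
    reported with motif AC rather than ACAC.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip homopolymer runs such as AAAAAAAAAA
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits


contig = "GGG" + "AC" * 6 + "TTT"  # toy sequence, not real data
for start, motif, copies in find_perfect_ssrs(contig):
    print(f"({motif}){copies} at position {start}")
```

The reported motif can be a rotation of the one you would write by eye (TGA rather than GAT, for example), depending on where the scan first matches.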

            Andy
