  • Plantagora

    Hi All-

    I'm new, but I thought it would be good to let people know about Plantagora, a project that I've been part of for the past year. Its purpose is to find the best approaches to sequencing a new genome using next-gen sequencing and whole genome assembly. It is oriented towards plant genomes, but for the most part the information, tools, etc. apply to all species. Its inspiration was the realization that even with a lot of good sequencing coverage, it can still be difficult or impossible to come up with a good genome sequence.

    For the Plantagora project, we created simulated reads modeling those from the Illumina or 454 sequencing platforms. The source of the sequences was primarily rice chromosome one, but we also used some whole plant genomes. We used several different assemblers, depending on the data, e.g. Newbler, ABySS, and SOAPdenovo. The resulting assemblies are evaluated using a very long list of metrics: some are statistics about the contigs and scaffolds, and others are derived by aligning the assembly to the original sequence to measure its fidelity.

    The results of these studies, of which there are thousands, are entered into a database that is available for download. There is also a graphing tool, so that you can generate custom graphs from the data. The tools used to create the data are also posted. All of this is more or less now available on our new website: plantagora.org (http://www.plantagora.org/). We hope people will make use of it, because that's what it's there for! The project was funded by NSF, but is now being taken over by the iPlant Collaborative, another NSF-funded project. It should be of great use to those considering a new genome sequencing project, and to those of you working on whole genome assembly.

  • #2
    Nice tool

    Plantagora is a really useful tool for simulations. I want to use the scripts for de novo assembly of a genome, though. If I provide the assembly_run.sh script with my own sets of reads, would it assemble them into a regular assembly?

    Also, does Plantagora use ABySS for hybrid Illumina+454 assembly? How do I set it up for that?

    Thanks
    Flobpf

    • #3
      Hi-
      Thanks for the comment. The assembly_run.sh script (which I didn't write) was designed to work with the Plantagora datasets and to launch many assembly runs. The most useful approach may be to read through the script and (if you can; I'm not exactly an expert at this) either edit it to suit your needs or try it on your own datasets with your own inputs and settings. If you're not doing a lot of different runs, though, you can simply take the commands as they are written in the script, substitute your own settings, and run the assemblies directly.

      For example, the ABySS command in the script is: time mpirun -np 4 abyss-pe $params name=$header. You can leave out the time command if you don't want to time the run, and in some cases you may not want or need MPI for a parallel run (in this example it is set to run on 4 processors). You have to have Open MPI installed to use it. I have been studying ABySS, and it relies on a lot of subprograms, one of which is abyss-pe. ABySS has to be installed, and abyss-pe has to be on your PATH, for the command to run. Beyond that, you can try running it with just --help and it will tell you about the options. The options set in the file as it is distributed on the website are (I think) -j2 n=2 k=$k, where k is the k-mer size, which is worth trying to optimize because it can make a big difference. You may already know a lot of this, but some of it is not too obvious when you first look at ABySS.
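
      To try several k-mer sizes without editing assembly_run.sh, something like the following might be a starting point. It is only a sketch: the read file names, the run-name prefix and the list of k values are made up for illustration, and in= is used as the abyss-pe parameter for the input reads, which may not match what $params holds in the Plantagora script. It only prints the commands, so you can inspect them before running anything.

```shell
#!/bin/sh
# Sketch: print one abyss-pe command per candidate k-mer size.
# Hypothetical values (not from the Plantagora script): read files, prefix, k list.
reads="reads_1.fq reads_2.fq"
prefix="myasm"

build_cmd() {
    # $1 = k-mer size; -j2 and n=2 mirror the settings quoted in the post.
    printf "abyss-pe -j2 n=2 k=%s name=%s-k%s in='%s'\n" "$1" "$prefix" "$1" "$reads"
}

# Print the commands only; nothing is executed here.
for k in 25 31 41; do
    build_cmd "$k"
done
```

      Once the printed commands look right, each line can be pasted into a shell (or prefixed with time, as in the original script) to run the actual assembly.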

      The interesting thing about abyss-pe is that it is a makefile. It can be edited, and you can also run the commands it uses independently, because it really just runs through a series of commands that invoke the other subprograms. Those subprograms also have to be on your PATH for abyss-pe to run properly; they live in a bunch of subfolders of the ABySS install. I believe the default series of commands will be printed out if you give it the option --dry-run. You can break down the commands and even replace some of them with other aligners or mappers, like bowtie. I'm still trying to figure out how best to use this.
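
      As a rough sketch of that idea, the snippet below first checks that abyss-pe can be found on the PATH and only then asks it for a dry run. The helper name need_tool, the k value, the run name and the read file names are all hypothetical, and what the dry run prints will depend on your ABySS version and inputs.

```shell
#!/bin/sh
# need_tool NAME: succeed if NAME is on PATH, otherwise report and fail.
# (Hypothetical helper, not part of ABySS or Plantagora.)
need_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "$1 not found on PATH" >&2
    return 1
}

# Hypothetical inputs, for illustration only.
k=31
name=test
reads='reads_1.fq reads_2.fq'

# abyss-pe is a makefile, so --dry-run lists the steps without running them.
if need_tool abyss-pe; then
    abyss-pe --dry-run k="$k" name="$name" in="$reads"
fi
```

      The same check could be repeated for mpirun before attempting a parallel run.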

      In any case, Plantagora uses ABySS for the hybrid Illumina+454 assemblies, and some of them produce scaffolds of more than 100,000 bp, although the scaffold N50 values are considerably lower than this. ABySS is one of the few assemblers that can readily make use of the combined data. I have been told by another group that you can convert Illumina reads to .sff files and use them with Newbler. They have had trouble running the combination so far, because its memory usage is really heavy. I don't know how efficiently Newbler can use the shorter reads, either. It does not use a de Bruijn graph or k-mers, as the short-read assemblers generally do. But it may work fine under some conditions.
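
      For anyone unsure why the longest scaffold can be so much larger than the N50, here is a small sketch of the usual N50 calculation: sort the lengths from longest to shortest and report the length at which the running total first reaches half of the assembly. The function name n50 and the example lengths are made up, and Plantagora's own metric scripts may define or compute it differently.

```shell
#!/bin/sh
# n50: read one sequence length per line on stdin and print the N50.
# (Hypothetical helper; uses the common "at least half of total length" definition.)
n50() {
    sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            half = total / 2
            run = 0
            for (i = 1; i <= NR; i++) {
                run += len[i]
                if (run >= half) { print len[i]; exit }
            }
        }'
}

# Example with made-up lengths: total is 1500, and 500 + 400 passes 750.
printf '%s\n' 100 200 300 400 500 | n50   # prints 400
```

      So a single 100,000 bp scaffold does not raise the N50 much if most of the assembly sits in shorter pieces.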

      • #4
        Thanks Roger. That answers a lot of my questions.
