  • De novo assembly with MIRA and 454 single-end reads: too many contigs

    Hi everybody!
    I'm new to NGS data processing and to this great community.

    I have two sets of 454 single-end reads from a Roche GS Junior; the source organism is a yeast with an expected genome size of 12 Mb. I need to assemble these reads and obtain a draft of the genome. The problem is that I am getting too many contigs.

    I extracted the reads from the SFF files with the sff_extract script and got 211981 reads of 40-630 bp with an average length of ~500 bp. For quality control, I trimmed the read ends using a quality threshold of 35 and discarded all reads shorter than 100 bp. That left 180585 reads of 100-469 bp (the sequence length distribution is quite irregular).
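The end-trimming step described above can be sketched roughly as follows. This is only an illustration, not the tool actually used in the post: the function name `trim_ends`, the base-by-base end-trimming rule, and the Phred+33 quality encoding are all assumptions.

```python
def trim_ends(seq, qual, min_q=35, min_len=100, offset=33):
    """Trim low-quality bases from both read ends (hypothetical helper).

    seq and qual are the FASTQ sequence and quality strings.
    Returns (trimmed_seq, trimmed_qual), or None if the trimmed read
    is shorter than min_len and should be discarded.
    """
    # Decode the quality string into Phred scores (assumes Phred+33).
    scores = [ord(c) - offset for c in qual]
    start, end = 0, len(scores)
    # Walk inwards from the 5' end until a base reaches the threshold.
    while start < end and scores[start] < min_q:
        start += 1
    # Walk inwards from the 3' end in the same way.
    while end > start and scores[end - 1] < min_q:
        end -= 1
    if end - start < min_len:
        return None
    return seq[start:end], qual[start:end]
```

With a threshold as strict as 35, a rule like this can remove a large part of each read, which is consistent with the maximum read length dropping from 630 bp to 469 bp.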

    Then I tried a de novo assembly with MIRA. As a first attempt, I ran:

    mira --job=denovo,genome,accurate,454 --project=yeast 454_SETTINGS -LR:ft=fastq -LR:fqqo=33

    and I got:

    Avg. total coverage: 35.45

    Large contigs:
    ============
    Number of contigs: 17
    Total consensus: 58823
    Largest contig: 9561
    N50 contig size: 6322
    N90 contig size: 1575
    N95 contig size: 1281

    All contigs:
    ============
    Number of contigs: 10941
    Total consensus: 6832983
    Largest contig: 9561
    N50 contig size: 724
    N90 contig size: 357
    N95 contig size: 301
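For readers unfamiliar with the N50/N90 figures in these reports, here is a minimal sketch of how such a statistic is commonly computed from a list of contig lengths. The function name `nx` is hypothetical, and MIRA's own implementation may differ in details such as tie handling.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx contig size (fraction=0.5 gives N50, 0.9 gives N90).

    Contigs are sorted from longest to shortest; the result is the
    length of the contig at which the running total first reaches the
    given fraction of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return 0  # empty input


contigs = [10, 8, 6, 4, 2]  # toy contig lengths, not data from the post
print(nx(contigs, 0.5), nx(contigs, 0.9))
```

An N50 of 724 bp over "all contigs" therefore means that half of the assembled bases sit in contigs of 724 bp or shorter, which is barely longer than a single 454 read.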

    Since there were too many contigs, I adjusted several MIRA parameters to get better results. In general, it looks like if I sacrifice coverage I get more "total consensus", but still a lot of contigs.

    Here is one of the better assemblies I got:

    mira --job=denovo,genome,accurate,454 --project=yeast 454_SETTINGS -LR:ft=fastq -SKr=60 -AL:mrs=60 -AS:mrpc=10


    Avg. total coverage: 18.45

    Large contigs:
    =============
    Number of contigs: 655
    Total consensus: 665610
    Largest contig: 8794
    N50 contig size: 1043
    N90 contig size: 572
    N95 contig size: 539

    All contigs:
    ============
    Number of contigs: 5699
    Total consensus: 4835995
    Largest contig: 8794
    N50 contig size: 900
    N90 contig size: 532
    N95 contig size: 468

    I still think there are too many contigs. I thought these two sets of reads (supposedly two sequencing runs) would be enough for a good draft, but now I doubt it. I hope you can help me.

    Greetings and thanks in advance.
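Whether two runs are enough can be checked with a back-of-the-envelope coverage estimate. The sketch below uses only the figures quoted in the post (211981 raw reads, ~500 bp average length, 12 Mb expected genome size); the function name `expected_coverage` is hypothetical. The mean length after trimming is not given, so the post-trimming value is not computed here, but it would be lower.

```python
def expected_coverage(n_reads, mean_read_len, genome_size):
    """Naive expected coverage: total sequenced bases / genome size."""
    return n_reads * mean_read_len / genome_size


# Raw reads before trimming, using the approximate mean length from the post.
raw = expected_coverage(211_981, 500, 12_000_000)
print(f"{raw:.1f}x")  # prints 8.8x
```

This figure is measured against the expected genome size, so it is not directly comparable to the "Avg. total coverage" that MIRA reports, which is presumably computed over the assembled contigs. Under this naive estimate, the data amounts to roughly 9x of a 12 Mb genome before any trimming.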

  • #2
    Your best bet is to join the MIRA mailing list. It is very active and the author will most likely respond very quickly.


    • #3
      Originally posted by JackieBadger:
      Your best bet is to join the MIRA mailing list. It is very active and the author will most likely respond very quickly.
      Hi JackieBadger, thanks for the advice, I'm going to do that too.


      • #4
        Have you tried using Newbler?

        In my experience it didn't produce as many small contigs as MIRA.


        • #5
          In addition, it looks like you are using v3.4.
          The latest development releases are a "totally different beast" and are well worth checking out.

          I also think that many people advise letting the assembler do most or all of the trimming if possible.


          • #6
            I am not sure that the number of contigs or the total consensus length is the right way to look at the problem. The question should be: which assembly is the right one for my question? Long contigs may contain misassemblies, for example.

            If your yeast has a close relative with a finished genome, you could compare your assembly to it and score completeness, errors, etc. Or use reference-free assembly validation tools.


            • #7
              Hi everyone, and thanks for your answers. I tried the solutions suggested here and combined them with some ideas from the MIRA mailing list.

              1) Do not trim the reads myself. I did this and indeed got better results, but still a lot of contigs (4643 contigs; 11082975 bp of total consensus; N50 3348. This is for 'large contigs'; 'all contigs' is very similar).

              2) As already suggested here, try Newbler. I got results only slightly better than with MIRA in 1) (3905 contigs; 11135342 bp of total consensus; N50 4044. This is for large contigs).

              3) Try a reference-guided assembly. I haven't done this yet, but I think it is my best bet.
