  • vanillasky
    Member
    • Mar 2014
    • 42

    Assembled contigs vs short reads

    I recently finished assembling some metagenome sequences, and after assigning functions to my contigs I see that most genes belong to three specific types of microorganisms. I also submitted the unassembled short reads to MG-RAST to get an overview of functional genes. However, when I look through the MG-RAST results, the most abundant genes are not necessarily the ones that dominate the assembled contigs. Why wouldn't the two types of information match?
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    The number of reads is based on the abundance of specific community members, while the number of contigs is based on the overall diversity of the community. If 99% of the organisms are one species of bacteria with gene X, then gene X might be the most abundant gene based on read mapping. But if there are 1000 other species in the community making up the other 1% of the population, and none of them have gene X but all of them have gene Y, then you might get 1000 different versions of gene Y contigs.

    Also, sometimes the dominant organism does not assemble very well because it may have lots of different strains, which confuse the assembler.
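
To make the reads-versus-contigs arithmetic in #2 concrete, here is a minimal sketch in Python. The cell counts are invented to match the 99% / 1% example, and the model assumes that every species assembles, that each gene gives one contig per species, and that read counts scale with cell counts. None of this comes from the original poster's data.

```python
from collections import Counter

def gene_signals(community):
    """Tally two signals per gene for a toy community.

    community: iterable of (cell_count, genes) pairs, one pair per species.
    Returns (reads_per_gene, contigs_per_gene), where reads are assumed to be
    proportional to cell counts and each species contributes one contig per gene.
    """
    reads_per_gene = Counter()
    contigs_per_gene = Counter()
    for cell_count, genes in community:
        for gene in genes:
            reads_per_gene[gene] += cell_count  # abundance-driven signal
            contigs_per_gene[gene] += 1         # diversity-driven signal
    return reads_per_gene, contigs_per_gene

# One dominant species carrying gene X (99% of cells),
# plus 1000 rare species carrying gene Y (1% of cells in total).
community = [(990_000, {"X"})] + [(10, {"Y"}) for _ in range(1000)]
reads, contigs = gene_signals(community)

print(reads.most_common(1))    # [('X', 990000)]
print(contigs.most_common(1))  # [('Y', 1000)]
```

Gene X wins by read count while gene Y wins by contig count, which is the mismatch described above.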

    • vanillasky
      Member
      • Mar 2014
      • 42

      #3
      Thank you for your response. In this case I know from the short-read information that the sample is very diverse, with many genes (x, y, z, etc.) with different functional roles, and that one set of genes with a specific functional role shows up as highly abundant. When I look at the assembled results, most of the genes with functional roles are from three microorganisms. Why is there a difference between the number of functional genes with different roles in the short-read analysis and in the assembled one?

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        Well... another possibility is that most of the metagenome simply didn't assemble due to low depth. Sometimes it can be useful to normalize the data prior to assembly, or use an iterative approach where you subsample, assemble, map to the assembly, then assemble the unmapped reads. Or use a different assembler. How did you do the assembly?
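
As a rough illustration of what "normalize the data prior to assembly" can mean, the sketch below is a toy filter in the spirit of digital normalization: a read is kept only while the median count of its k-mers among the already-kept reads is below a target depth. The function name, the parameter values, and the two example reads are all hypothetical, and this is not how any particular normalization tool is implemented.

```python
from collections import Counter
from statistics import median

def normalize_reads(reads, k=5, target=3):
    """Toy depth normalization: drop reads from regions already seen `target` times."""
    kmer_counts = Counter()  # k-mer counts over the reads kept so far
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read shorter than k, nothing to evaluate
        # Median k-mer count approximates the depth this read already has.
        if median(kmer_counts[km] for km in kmers) < target:
            kept.append(read)
            kmer_counts.update(kmers)
    return kept

abundant = "ACGGTCATTG"  # stands in for a read from a dominant organism
rare = "TTGACCATGA"      # stands in for a read from a rare organism
reads = [abundant] * 10 + [rare] * 2

kept = normalize_reads(reads)
print(kept.count(abundant), kept.count(rare))  # 3 2
```

The abundant read is capped at the target depth while both copies of the rare read survive, so the depth presented to the assembler is more even.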

        • vanillasky
          Member
          • Mar 2014
          • 42

          #5
          I used Velvet and MetaVelvet for the assemblies and KmerGenie to find the coverage cutoff to use in the assembly. About 12 million reads went into the assembly, with read lengths between 70 and 110 bp. I ended up using a coverage cutoff of 6, a k-mer length of 33, and an insert length cutoff of 400, and I opted for scaffolding. This combination gave me the longest contigs and an N50 of 350 bp, which is the best I could get.
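
For reference, the N50 quoted in #5 is the contig length at which contigs of that length or longer hold at least half of the assembled bases. A minimal Python sketch of that standard definition follows. The function name and the example lengths are made up for illustration, and assemblers may differ in details such as minimum contig length filters.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:  # this contig crosses the halfway point
            return length
    raise ValueError("no contig lengths given")

# Total is 1500 bp; 500 + 400 = 900 bp is the first running sum to reach 750 bp.
print(n50([100, 200, 300, 400, 500]))  # 400
```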

          • Brian Bushnell
            Super Moderator
            • Jan 2014
            • 2709

            #6
            Ahh... that's a very low-coverage metagenome; I'm not surprised only the most abundant organisms assembled. Metagenomes (especially complex ones) are much harder to assemble than isolates, and thus place greater demands on the data: high depth, long reads, low error rates. You should probably try to get more data, or else try a different metagenome assembler such as MEGAHIT.
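
A back-of-the-envelope calculation shows why this read set counts as low coverage. The sketch below estimates per-organism depth as total sequenced bases times the organism's share of the reads, divided by its genome size. The 12 million reads come from #5; the 90 bp mean read length (midpoint of 70 to 110 bp), the 4 Mbp genome size, and the read shares are assumptions chosen only for illustration.

```python
def expected_depth(n_reads, mean_read_len, read_fraction, genome_size):
    """Approximate average depth for one organism in a metagenome."""
    total_bases = n_reads * mean_read_len
    return total_bases * read_fraction / genome_size

# Assumed values: 90 bp mean read length, 4 Mbp genome.
for share in (0.30, 0.01, 0.001):
    depth = expected_depth(12_000_000, 90, share, 4_000_000)
    print(f"{share:.1%} of reads -> about {depth:.1f}x")
```

Under these assumptions an organism contributing 1% of the reads ends up at roughly 2.7x, below the coverage cutoff of 6 used in the assembly, so only a few abundant organisms would be expected to assemble.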
