  • Gene duplication events vs. duplicate scaffolds

    Hi all,
    Recently I've been looking at a draft genome assembly a collaborator has sent me. The first thing I notice is that the assembly is really big: 3-4x the estimated genome size.

    I filter out some contaminant scaffolds, but this doesn't account for much of the size difference.

    I then do an all-vs-all BLAST and find that there are a number (thousands) of smaller scaffolds (6-14 kb) that have ~100% identity matches to larger scaffolds for >95% of their length. For near-perfect matches (99-100% identity) it seems reasonable to assume these are 'duplicate scaffolds' and I can set them aside.

    I have about 7 Mb of such scaffolds with 100% identity.
    If I lower the stringency to 99% identity or better I have ~30 Mb.

    I am wondering if anyone has insight into a good way to set a cutoff for percent identity and percent length for these 'duplicate' scaffolds.


    I don't want to be throwing out recent gene duplications - but at the same time I don't want to confound gene family expansion analysis with repeat scaffolds and alleles.

    Anyone have any insight?

    cheers,
    t
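
    The filtering step described in this post can be sketched roughly as follows. This is only an illustration, assuming the all-vs-all BLAST was written in the standard 12-column tabular format (`-outfmt 6`) and that scaffold lengths are available in a separate dictionary. The names `flag_duplicates`, `merged_length`, and `lengths` are hypothetical, and the default cutoffs (99% identity, 95% of the shorter scaffold covered) are just the numbers mentioned above, not recommended values.

```python
from collections import defaultdict


def merged_length(intervals):
    """Number of bases covered by a list of 1-based inclusive (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end + 1:
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def flag_duplicates(blast_lines, lengths, min_ident=99.0, min_cov=0.95):
    """Flag shorter scaffolds that are almost entirely covered by a longer one.

    blast_lines: iterable of BLAST tabular lines (qseqid sseqid pident length
                 mismatch gapopen qstart qend sstart send evalue bitscore).
    lengths:     dict mapping scaffold name to its length in bases (assumed input).
    Returns a dict: flagged scaffold -> (matching scaffold, identity %, coverage).
    """
    hsps = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        query, subject = f[0], f[1]
        if query == subject:
            continue  # ignore self-hits
        qstart, qend = sorted((int(f[6]), int(f[7])))
        hsps[(query, subject)].append((qstart, qend, float(f[2]), int(f[3])))

    flagged = {}
    for (query, subject), hits in hsps.items():
        # Only the shorter scaffold of a pair is a removal candidate;
        # equal-length pairs are broken by name so that one copy survives.
        if lengths[query] > lengths[subject]:
            continue
        if lengths[query] == lengths[subject] and query < subject:
            continue
        coverage = merged_length([(h[0], h[1]) for h in hits]) / lengths[query]
        # Alignment-length-weighted mean identity over all HSPs of the pair.
        identity = sum(h[2] * h[3] for h in hits) / sum(h[3] for h in hits)
        if identity >= min_ident and coverage >= min_cov and query not in flagged:
            flagged[query] = (subject, identity, coverage)
    return flagged
```

    Running it with several `min_ident` values and summing the lengths of the flagged scaffolds would reproduce the "7 Mb at 100%, ~30 Mb at 99%" kind of comparison, which may help in seeing how sensitive the result is to the cutoff.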

  • #2
    What kind of organism is it? Some are very repetitive. Also, highly heterozygous polyploid organisms can generate multiple redundant scaffolds.

    It could be that you need to start over from the reads and do your own assembly with an assembler (like dipSpades) or settings that are better designed for the input data.


    • #3
      Brian,
      I appreciate your input. We have now gone through a few rounds of sequencing and assembly - our latest assembly is using Illumina 'longreads' >1500kb in addition to a number of SIMP PE libraries for scaffolding.

      The organism has 5 chromosomes and is diploid. It's supposed to have a small ~70 Mb genome, so I had originally suspected that most genes would occur in single copy.

      Also, the organism in question is parthenogenic and the cultures used were supposedly generated from a single individual.

      Maybe I'll rephrase my question as this: "Is there some strategy or set of criteria for telling whether two very similar scaffolds are the result of sequencing error (or some other artificial source) rather than alleles or gene duplications?"


      • #4
        Unfortunately, the answer to that question is "no". At least, not objectively with complete confidence. But if you look at the rate at which exact (or almost-exact) repeats occur in related organisms, as a function of size, you should be able to get an idea of whether that's realistic. I would not expect there to be thousands of 6k+ exact repeats in any organism, for example, but there are several branches of life I have not worked on.

        I think organisms with simpler expression control tend to require more copies of genes that need regulation (or need to be highly expressed), but that wouldn't explain replications of 6-14k sequences that are (presumably) substantially larger than single genes. I encourage you to generate and post a kmer frequency histogram; that will allow visual estimation of the genome size, and expected genome fraction with 1 copy, 2 copies, 3 copies, etc.
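
        As a rough illustration of the k-mer frequency histogram suggested here, the following sketch counts canonical k-mers in a set of reads and makes a naive genome size estimate from the main peak. The function names, `k=21`, and the `min_depth` error cutoff are assumptions for the example. On real read sets a dedicated k-mer counter is the practical choice, since this pure-Python version holds every k-mer in memory.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmer_histogram(reads, k=21):
    """Return a Counter: depth -> number of distinct k-mers seen at that depth."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:
                counts[canonical(kmer)] += 1
    return Counter(counts.values())


def estimate_genome_size(hist, min_depth=5):
    """Naive estimate: total k-mers above the error cutoff divided by the peak depth.

    min_depth is an assumed cutoff for discarding low-depth (likely error) k-mers.
    Returns (peak_depth, estimated_size).
    """
    solid = {depth: n for depth, n in hist.items() if depth >= min_depth}
    peak = max(solid, key=solid.get)
    total_kmers = sum(depth * n for depth, n in solid.items())
    return peak, round(total_kmers / peak)
```

        Note that this picks the tallest peak. In a heterozygous diploid the histogram usually shows two peaks (one at roughly half the depth of the other), so the peak should be checked by eye on a plot before trusting the number.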

        By the way, I have a program called "dedupe" that's explicitly designed to remove redundant scaffolds (exact duplicates, containments, and optionally inexact containments down to some minimum percent identity) from assemblies, and we run it routinely. Why? Well... in a metagenome, for example, they don't add any useful information. And usually (with some of our protocols), such things are assembly artifacts. I don't know why your assembly is 3-4x your target size, but if it were around double, I would have suggested you simply remove all duplicates on the assumption they are caused by the ploidy.
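
        To make "exact duplicates" and "containments" concrete, here is a toy sketch of the idea. It is not how the dedupe program works internally and says nothing about its options; `remove_contained` is a hypothetical helper that does a simple quadratic substring search, which is only workable for small inputs, and it handles exact matches only (no inexact containments).

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def remove_contained(scaffolds):
    """Drop scaffolds that are exact copies or exact substrings of a longer one.

    scaffolds: dict mapping name -> sequence.
    Returns (kept, removed), where removed maps each dropped name to the
    name of the scaffold that contains it (on either strand).
    """
    # Visit longest first so that a container is always seen before its contents.
    order = sorted(scaffolds, key=lambda name: (-len(scaffolds[name]), name))
    kept = {}
    removed = {}
    for name in order:
        seq = scaffolds[name].upper()
        rc = revcomp(seq)
        container = next(
            (k for k, s in kept.items() if seq in s or rc in s), None
        )
        if container is None:
            kept[name] = seq
        else:
            removed[name] = container
    return kept, removed
```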


        • #5
          Brian - thanks for your insightful response. I'll go ahead and try generating a kmer plot.

          Your comment brings up a point that I am still a bit stuck on: what to call a "duplicate." I was speaking with a genomics person at my institution whose opinion was that any large scaffold (6-14 kb) with >95% identity to another scaffold is probably a product of heterozygosity and should be thrown out.

          On the flip side of that coin, he thought anything with <95% identity could be a potential paralog and should be retained.

          Since it sounds like you do this type of filtering a lot, I am curious whether these percentages sound good to you.


          • #6
            95% sounds pretty low; I think we normally use a cutoff of around 99% (depending on the situation; other times we use 100%). For example, in transcriptome assembly, I think our cutoff is 99% identity and the shorter contig must be at least 80% of the length of the longer one to be discarded. But I don't know why those numbers were chosen or if they're optimal.
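
            Written out as a rule, the transcriptome example above looks something like this. It is a minimal sketch with a hypothetical function name; the defaults are just the 99% identity and 80% length figures quoted in this post.

```python
def is_redundant(identity, len_a, len_b, min_identity=0.99, min_length_ratio=0.80):
    """Decide whether the shorter of two contigs would be discarded.

    identity: fractional identity of the alignment between the two contigs (0-1).
    len_a, len_b: contig lengths in bases, in either order.
    """
    shorter, longer = sorted((len_a, len_b))
    return identity >= min_identity and shorter / longer >= min_length_ratio
```

            For example, `is_redundant(0.995, 900, 1000)` is true, while a 700-base contig against a 1000-base one fails the length test, and a pair at 98% identity fails the identity test.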

            For genomes, it depends on the organism - humans have a SNP rate of around 1/1000, while some fungi can have a rate of over 1/50. I think one we sequenced even had a rate of 1/22. So, you could look at the heterozygosity of related organisms and establish your cutoff that way; if relatives tend to have chromosomes with 99.5% identity to each other, then tossing things with only 95% identity may not be wise.
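
            The SNP rates above translate directly into the identity one would expect between the two haplotypes. A quick sketch of the arithmetic, counting substitutions only (indels and structural differences would lower the identity further):

```python
def expected_identity(snp_rate):
    """Expected percent identity between haplotypes for a given per-base SNP rate."""
    return 100.0 * (1.0 - snp_rate)


for label, rate in [("1/1000", 1 / 1000), ("1/50", 1 / 50), ("1/22", 1 / 22)]:
    print(f"SNP rate {label}: ~{expected_identity(rate):.2f}% identity")
# SNP rate 1/1000: ~99.90% identity
# SNP rate 1/50: ~98.00% identity
# SNP rate 1/22: ~95.45% identity
```

            So a 95% cutoff only lines up with allelic divergence for something as heterozygous as the 1/22 case; for an organism closer to 1/1000, scaffolds at 95% identity are far more diverged than alleles would be.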

            And as I mentioned, heterozygosity explains an assembly 2x the size you expected, but not 4x, so it's still a little mysterious.

            Another thing you might consider - sometimes chromosomes get duplicated, and then the two copies evolve into different chromosomes. But for a while, the two copies will be very similar. So you could, for example, have an organism with 5 chromosomes, 2 of which are near-identical because the duplication was recent. That's another thing you may be able to check by looking at related species, and the numbers of chromosomes they have.
