  • Genes with multiple copies assembling as single contig

    Hi all,

    I'm doing a de novo assembly of a cyanobacterial genome with SPAdes. All is working well, but when there are multiple copies of a gene (e.g. the 16S rRNA gene), it appears that all reads associated with that gene are being assembled into a single contig.

    Coverage of these contigs appears to correspond quite well to the number of expected copies in the genome (e.g. normal coverage is ~50x; for a contig containing a gene with four copies, coverage is ~200x).
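The coverage arithmetic above can be written out as a small sketch. It assumes that reads are sampled evenly across the genome, so a collapsed contig's coverage scales with the number of copies; the function name is hypothetical and not part of any assembler.

```python
def estimate_copy_number(contig_coverage: float, baseline_coverage: float) -> int:
    """Estimate how many genomic copies were collapsed into one contig.

    Assumes even read sampling, so coverage scales linearly with copy number.
    """
    if baseline_coverage <= 0:
        raise ValueError("baseline_coverage must be positive")
    # A contig that exists in the assembly represents at least one copy.
    return max(1, round(contig_coverage / baseline_coverage))


# Figures from the post: ~50x baseline, ~200x on the collapsed contig.
print(estimate_copy_number(200, 50))  # 4
```

This only gives a rough count; uneven coverage (for example from GC bias) will blur the ratio.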

    Does anyone know of a method to prevent this from happening, so that each copy assembles separately into its own contig?

    Cheers
    N

  • #2
    Hi Cyanoevo,

    I have the exact same problem. Did you ever find an answer?

    Cheers,

    Eduardo

    • #3
      You can try mapping reads to a 16S copy, then clustering the reads that mapped, then assembling the clusters. This will work if the reads are sufficiently long (for Illumina, merging them may be useful) and the 16S copies are sufficiently different. If not, you'll just get one cluster. You probably need overlapping 2x250bp reads at a minimum (insert size around 400bp+) to have a good chance.

      You can cluster like this with Dedupe (packaged with BBMap):

      dedupe.sh in=merged.fq -Xmx30g am=f ac=f fo c rnc=f mcs=50 mo=350 pto pattern=cluster_%.fq

      The "mo=350" specifies a min overlap of 350bp. This should be around 80%-90% of your read length. If you have single-ended 250bp reads, set it to 200; if you have merged reads with an insert size of around 400bp, try 350. If you have 100bp non-overlapping reads, don't bother, they're too short.
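The 80%-90% rule of thumb can be sketched as a small helper that computes the suggested range for "mo" and rebuilds the command above. The function names are hypothetical; the flags are copied verbatim from the command in this post, and only "in" and "mo" are parameterised.

```python
def min_overlap_range(read_length: int) -> tuple[int, int]:
    """Return (low, high) bounds for mo: 80% and 90% of the read length."""
    low = read_length * 80 // 100
    high = read_length * 90 // 100
    return low, high


def dedupe_command(reads: str, min_overlap: int) -> list[str]:
    """Rebuild the Dedupe command from the post with a chosen min overlap."""
    return [
        "dedupe.sh", f"in={reads}", "-Xmx30g", "am=f", "ac=f", "fo", "c",
        "rnc=f", "mcs=50", f"mo={min_overlap}", "pto", "pattern=cluster_%.fq",
    ]


print(min_overlap_range(250))  # single-ended 250bp reads
print(min_overlap_range(400))  # merged reads, ~400bp insert
print(" ".join(dedupe_command("merged.fq", 350)))
```

For 250bp reads the lower bound is 200, and 350 falls inside the range for a 400bp insert, matching the values suggested above.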

      For this kind of situation, which is very sensitive to chimeras, I recommend merging reads with BBMerge using the "vstrict" flag.
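As a rough sketch of that merging step, the helper below assembles a BBMerge command line with the "vstrict" flag. The function name and all file names are hypothetical placeholders; the command is only built and printed, not run, so check the flags against your installed BBTools version before using it.

```python
def bbmerge_command(r1: str, r2: str, merged: str) -> list[str]:
    """Build a BBMerge call that merges read pairs with very strict settings."""
    return [
        "bbmerge.sh",
        f"in1={r1}",      # forward reads (placeholder file name)
        f"in2={r2}",      # reverse reads (placeholder file name)
        f"out={merged}",  # merged reads, the input to the clustering step
        "vstrict",        # stricter merging, fewer false merges
    ]


print(" ".join(bbmerge_command("reads_R1.fq", "reads_R2.fq", "merged.fq")))
```

The output name merged.fq is chosen to match the input of the Dedupe command above.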

      • #4
        Thanks Brian, I'll give it a try. I anticipate that I'm going to get one cluster because the reads are seemingly identical. It's suggestive that the coverage for the rRNA operon is about 3 times the coverage of the neighboring genes so at a minimum I'll report that in the submission.
        I guess the alternative would be going back to the wet lab to check how many copies there are.

        Cheers,

        Eduardo

        • #5
          Originally posted by ecastron
          Thanks Brian, I'll give it a try. I anticipate that I'm going to get one cluster because the reads are seemingly identical. It's suggestive that the coverage for the rRNA operon is about 3 times the coverage of the neighboring genes so at a minimum I'll report that in the submission.
          I guess the alternative would be going back to the wet lab to check how many copies there are.
          Eduardo
          Since you have multiple copies of the 16S gene, you need a sufficiently long insert length to allow the assembler to resolve this repetitive region. Otherwise, indeed, all copies will collapse into a single contig. Given the length of the 16S gene, you'd need at least mate pairs with an insert length of > 2-3kb, or long reads (PacBio / Nanopore).
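The reasoning here can be illustrated with a minimal sketch: a read or read pair can only place a repeat copy if it spans the whole repeat plus some unique sequence on each side. The function name, the ~1,500bp 16S length, and the 100bp anchor are assumptions for illustration, not values an assembler actually uses.

```python
def can_resolve_repeat(span_bp: int, repeat_bp: int, anchor_bp: int = 100) -> bool:
    """True if a read or insert spans the repeat plus a unique anchor on each side."""
    return span_bp >= repeat_bp + 2 * anchor_bp


REPEAT_16S_BP = 1500  # assumed approximate length of a 16S rRNA gene

for span in (300, 3000):
    print(span, can_resolve_repeat(span, REPEAT_16S_BP))
```

If the copies are identical across the whole rRNA operon rather than just the 16S gene, the repeat is longer and the required span grows accordingly.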

          • #6
            Thanks for the reply! That was my impression: that I wouldn't be able to resolve it with a 300bp insert library, only with mate pairs or long-read technology.

            Cheers,

            @ecastron
