  • vallejov
    Member
    • Jul 2011
    • 10

    How to determine chimeras in my de novo assembly?

    Hi all,

    I would like to QC my de novo transcriptome assembly (no reference genome available) by looking for chimeric transcripts. Ideally, I would like to calculate the percentage of chimeric transcripts present in my assembly. I would appreciate any and all suggestions about how I might go about doing this.

    Thanks,
    Veronica
  • kmcarr
    Senior Member
    • May 2008
    • 1181

    #2
    Hi Veronica,

    I'm sorry to say that I don't have a good, easy method to identify chimeras in de novo assembled putative transcripts. To be honest, normally I acknowledge that it is likely there will be chimeras but don't do anything to identify them.

    Here are some theoretical methods:

    Use BLASTX alignment of a reference protein set and examine results to see if multiple proteins align to different segments of the putative transcript.

    Use ORF prediction software on the putative transcripts. If multiple large ORFs are identified, BLAST the translated protein sequences to test if all of them are consistent (i.e. the multiple ORFs in different frames on the same strand may result from frame shifts introduced by misassembly).

    Align the original RNA-Seq reads to your putative transcripts and examine how even the depth of coverage is across the length of the transcript. A contig with dramatically different coverage at one end vs. the other, or one whose two ends have deep coverage separated by a region of very shallow coverage, may be a chimera.
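
    The first check above (several different proteins aligning to different segments of one contig) could be sketched roughly as follows. This is only an illustration, assuming BLASTX was run with the standard 12-column tabular output (`-outfmt 6`); the function names and the `max_overlap` cutoff of 30 nt are hypothetical choices, not part of any tool. Hits to different subjects can also come from paralogs or multi-domain proteins, so flagged contigs are candidates to inspect, not confirmed chimeras.

```python
from collections import defaultdict


def parse_outfmt6(lines):
    """Yield (query, subject, q_lo, q_hi) from BLAST tabular (-outfmt 6) lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qstart, qend = int(fields[6]), int(fields[7])
        # BLASTX reports qstart > qend for reverse-frame hits, so normalise.
        yield fields[0], fields[1], min(qstart, qend), max(qstart, qend)


def candidate_chimeras(hits, max_overlap=30):
    """Map each contig to pairs of different subjects whose query intervals
    overlap by at most max_overlap bases (hypothetical cutoff)."""
    by_query = defaultdict(list)
    for query, subject, lo, hi in hits:
        by_query[query].append((lo, hi, subject))

    flagged = {}
    for query, intervals in by_query.items():
        intervals.sort()
        pairs = []
        for i in range(len(intervals)):
            for j in range(i + 1, len(intervals)):
                lo1, hi1, s1 = intervals[i]
                lo2, hi2, s2 = intervals[j]
                if s1 == s2:
                    continue  # same protein hit twice is not evidence here
                overlap = min(hi1, hi2) - max(lo1, lo2) + 1
                if overlap <= max_overlap:
                    pairs.append((s1, s2))
        if pairs:
            flagged[query] = pairs
    return flagged
```

    Dividing the number of flagged contigs by the number of contigs with any hit would give a rough upper-bound estimate of the chimera rate among annotated contigs, under the assumptions stated above.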


    • jordi
      Member
      • Apr 2009
      • 49

      #3
      Hi all!
      I am now dealing with this issue, i.e. how to detect chimeras in an RNA-Seq assembly without a reference transcriptome.
      I've read that chimeras formed from different genes cannot be tackled without a reference. Self-chimeras, on the other hand, can be detected by looking for repeated regions within the same contig.
      However, a sudden change in coverage along a contig could help to estimate the number of chimeras in an assembly project. Here is my question: given the coverage of a transcript, how do I set a threshold to decide that a change in coverage points to a chimera? Regarding kmcarr's comment: what counts as "dramatically different coverage", and how can it be determined?
      Thank you very much for your help!!
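
      One way to make the coverage question concrete is to score each contig by its largest left-vs-right coverage shift and then look at how that score is distributed across the whole assembly. The sketch below is a minimal illustration, not an established method: it assumes per-base depth for one contig is available as a list of integers (for example, parsed from `samtools depth -a` output), and the names `max_coverage_shift` and `is_candidate`, the 50-base window, the pseudocount, and the `threshold` of 2.0 (a 4-fold change) are all hypothetical. Coverage is also uneven for reasons unrelated to chimeras (3' bias, repeats, low expression), so any cutoff would need to be calibrated on your own data.

```python
import math


def max_coverage_shift(depth, window=50, pseudocount=1.0):
    """Return (position, log2_ratio) for the largest coverage shift.

    depth: per-base read depth along one contig.
    The ratio compares mean depth in the window right of the position
    with the window left of it. Returns None for contigs shorter than
    2 * window.
    """
    n = len(depth)
    if n < 2 * window:
        return None

    # Prefix sums let each window mean be computed in constant time.
    prefix = [0]
    for d in depth:
        prefix.append(prefix[-1] + d)

    best = (window, 0.0)
    for i in range(window, n - window + 1):
        left = (prefix[i] - prefix[i - window]) / window
        right = (prefix[i + window] - prefix[i]) / window
        ratio = math.log2((right + pseudocount) / (left + pseudocount))
        if abs(ratio) > abs(best[1]):
            best = (i, ratio)
    return best


def is_candidate(depth, threshold=2.0, window=50, pseudocount=1.0):
    """Flag a contig whose largest shift reaches the (hypothetical) threshold."""
    result = max_coverage_shift(depth, window=window, pseudocount=pseudocount)
    return result is not None and abs(result[1]) >= threshold
```

      A contig with deep coverage at both ends and a shallow middle would be caught at the first drop. Plotting the score for all contigs and inspecting the tail by eye may be a more defensible way to pick a cutoff than fixing one in advance.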


      • martin2
        Member
        • Nov 2010
        • 42

        #4
        Originally posted by kmcarr
        I'm sorry to say that I don't have a good, easy method to identify chimeras in de novo assembled putative transcripts. To be honest, normally I acknowledge that it is likely there will be chimeras but don't do anything to identify them.

        Here are some theoretical methods:

        Use BLASTX alignment of a reference protein set and examine results to see if multiple proteins align to different segments of the putative transcript.

        Use ORF prediction software on the putative transcripts. If multiple large ORFs are identified, BLAST the translated protein sequences to test if all of them are consistent (i.e. the multiple ORFs in different frames on the same strand may result from frame shifts introduced by misassembly).
        The most important check is whether you have full-length matches. Often, an N/C-terminus will be placed on a different contig/scaffold than the core of the protein. In diploid/polyploid organisms, because of sequencing errors, you won't even get a definite answer as to whether a fragment of a transcript originated from locus 1, 2 or 3, provided they all have 95-100% identity (and they do, at least in some places, thanks to recent whole-genome duplication events). There are many cases like this.

        This is one of the reasons why I always say that with NGS one can never, ever, get a correct answer for alternatively spliced genes. Unless we sequence a transcript as a whole piece, it is all just guesswork. A short, 80 nt overlap between two reads does not justify the conclusion that exons C and D are present in the same transcript. An assembler will always propose that A-B-C-D-E-F-G are in one transcript but will hardly ever reveal that actually only A-B-E-F and A-B-C-F-G are expressed. With high coverage the situation could be more optimistic, but that depends on the number of biological and lab replicates, not just on the number of emPCR droplets or clusters derived from the same PCR experiment. Although instructing an assembler to watch uniformity of coverage is cheating a bit, I believe it helps at least in some cases.

        A second important check is for seemingly new exon extensions or truncations, and for seemingly "unremoved" introns, which simply break a multiple sequence alignment of your favourite gene.

        Originally posted by kmcarr
        Align the original RNA-Seq reads to your putative transcripts and examine how even the depth of coverage is across the length of the transcript. A contig which has dramatically different coverage at one end vs. the other, or if the two ends have deep coverage separated by a region of very shallow coverage between them may be a chimera.
        In my experience, chimeras remain even in combined shotgun + paired-end datasets, and even when technologies are combined, like Illumina+454. That is a nightmare for me. I have some idea why that is and what "requirements" need to be fulfilled for them to remain. Luckily, in other cases removal of chimeras results in longer contigs/scaffolds, lower contig/scaffold counts, and better N50/N90 numbers. But the numbers are not 10x better: you have to understand that if you ban one chimeric join you split 1 contig into 2, so the assembler starts with a much worse outlook and has to find completely different assembly paths. Once you accept the situation, it is pleasing that in the end one receives a somewhat better assembly in terms of these semi-useful numbers. But the scaffolds/contigs are different.

        It depends on what lab protocol you used to obtain the data, but maybe you would appreciate a commercial service from me? I can properly trim datasets from some complex protocols, with almost no overtrimming and no misses. See http://www.bioinformatics.cz/softwar...rted-protocols . Although I developed that for 454-based datasets, I could help with data from some other technologies, depending on the case.
        Last edited by martin2; 03-04-2014, 06:28 AM. Reason: Check for exon/introns lengths as well.


        • JackieBadger
          Senior Member
          • Mar 2009
          • 385

          #5
          The MIRA assembler can detect chimeras, I believe.

