  • How to determine chimeras in my de novo assembly?

    Hi all,

    I would like to QC my de novo transcriptome assembly (no reference genome is available) by looking for chimeric transcripts. Ideally, I would like to calculate the percentage of chimeric transcripts in my assembly. I would appreciate any and all suggestions on how to go about this.

    Thanks,
    Veronica

  • #2
    Hi Veronica,

    I'm sorry to say that I don't have a good, easy method to identify chimeras in de novo assembled putative transcripts. To be honest, normally I acknowledge that it is likely there will be chimeras but don't do anything to identify them.

    Here are some theoretical methods:

    Use BLASTX alignment of a reference protein set and examine results to see if multiple proteins align to different segments of the putative transcript.
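
    The BLASTX check above could be sketched roughly as follows. This assumes the BLASTX results were saved in the default 12-column tabular format (outfmt 6); the function names and the overlap and bit-score cut-offs are placeholders chosen for illustration, not recommended values. A flagged contig is only a candidate for manual inspection, since partial reference proteins or multi-domain proteins can produce the same pattern.

```python
from collections import defaultdict


def parse_outfmt6(lines):
    """Yield (query, subject, q_lo, q_hi, bitscore) from BLAST tabular lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        qstart, qend = int(f[6]), int(f[7])
        # BLASTX reports reverse-frame hits with qstart > qend, so normalise.
        yield f[0], f[1], min(qstart, qend), max(qstart, qend), float(f[11])


def candidate_chimeras(lines, max_overlap=30, min_bitscore=50.0):
    """Return {contig: [(protein, lo, hi), ...]} for contigs on which two or
    more different proteins align to (nearly) non-overlapping segments.

    max_overlap and min_bitscore are arbitrary illustrative thresholds.
    """
    best = defaultdict(dict)  # contig -> protein -> (lo, hi, bitscore)
    for q, s, lo, hi, bits in parse_outfmt6(lines):
        if bits < min_bitscore:
            continue
        if s not in best[q] or bits > best[q][s][2]:
            best[q][s] = (lo, hi, bits)

    flagged = {}
    for q, hits in best.items():
        chosen = []
        # Greedily keep the strongest hits that do not overlap earlier picks.
        for s, (lo, hi, _) in sorted(hits.items(), key=lambda kv: -kv[1][2]):
            if all(min(hi, h) - max(lo, l) + 1 <= max_overlap
                   for _, l, h in chosen):
                chosen.append((s, lo, hi))
        if len(chosen) >= 2:
            flagged[q] = sorted(chosen, key=lambda t: t[1])
    return flagged
```

    Dividing the number of flagged contigs by the number of contigs with any hit would give a rough percentage, limited to transcripts that have protein homologs in the chosen reference set.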

    Use ORF prediction software on the putative transcripts. If multiple large ORFs are identified, BLAST the translated protein sequences to test if all of them are consistent (i.e. the multiple ORFs in different frames on the same strand may result from frame shifts introduced by misassembly).
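
    As a stand-in for dedicated ORF prediction software, a naive six-frame ORF scan could look like the sketch below. It only recognises ATG-to-stop ORFs with the standard stop codons, and the name `large_orfs` and the `min_codons` default are assumptions made for the example. Contigs that return two or more ORFs are the ones whose translated sequences would then go to BLAST.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string (non-ACGT characters are kept)."""
    return seq.upper().translate(COMP)[::-1]


def large_orfs(seq, min_codons=100):
    """Return (strand, frame, start, end) for ATG..stop ORFs with at least
    min_codons codons before the stop.

    Coordinates are 0-based, end-exclusive, and for strand "-" they refer
    to the reverse-complemented sequence. min_codons is an arbitrary cut-off.
    """
    found = []
    for strand, s in (("+", seq.upper()), ("-", revcomp(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG opens the ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        found.append((strand, frame, start, i + 3))
                    start = None  # look for the next ORF in this frame
    return found
```

    Two large ORFs on the same strand in different frames that both hit the same protein would point to a frameshift rather than a chimera; hits to unrelated proteins are the more suspicious case.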

    Align the original RNA-Seq reads to your putative transcripts and examine how even the depth of coverage is along each transcript. A contig with dramatically different coverage at one end versus the other, or with two deeply covered ends separated by a region of very shallow coverage, may be a chimera.
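
    One way to turn the coverage check into something scriptable is sketched below. It assumes you already have a per-base depth list for each contig (for example from the third column of `samtools depth -a` output). The window size, the 10-fold threshold and the pseudocount are arbitrary illustrative choices, not validated cut-offs, and uneven coverage can have other causes, such as 3' bias in the library.

```python
from statistics import median


def window_medians(depth, window=100):
    """Median depth in consecutive non-overlapping windows (tail dropped)."""
    return [median(depth[i:i + window])
            for i in range(0, len(depth) - window + 1, window)]


def coverage_flags(depth, window=100, fold=10.0, pseudo=1.0):
    """Flag the two coverage patterns described above for one contig.

    fold and pseudo are placeholder values; tune them on your own data.
    """
    meds = window_medians(depth, window)
    if len(meds) < 2:
        return {"end_ratio": None, "end_imbalance": False,
                "internal_dip": False}

    # Pattern 1: one end much deeper than the other.
    first, last = meds[0] + pseudo, meds[-1] + pseudo
    end_ratio = max(first, last) / min(first, last)

    # Pattern 2: a shallow window with deep coverage on both sides of it.
    dip = False
    for i in range(1, len(meds) - 1):
        mid = meds[i] + pseudo
        left = max(meds[:i]) + pseudo
        right = max(meds[i + 1:]) + pseudo
        if left / mid >= fold and right / mid >= fold:
            dip = True
            break

    return {"end_ratio": end_ratio, "end_imbalance": end_ratio >= fold,
            "internal_dip": dip}
```

    Looking at the distribution of `end_ratio` over all contigs before picking a threshold is probably more informative than any fixed number.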

    • #3
      Hi all!
      I am now dealing with this same issue: how to detect chimeras in an RNA-Seq assembly without a reference transcriptome.
      I've read that chimeras formed from different genes cannot be identified without a reference, whereas self-chimeras can be detected through repeated regions within the same contig.
      However, a sudden change in coverage along a contig could help estimate the number of chimeras in an assembly. Here is my question: given the coverage of a transcript, how do I set a threshold for deciding that a change in coverage points to a chimera? Regarding kmcarr's comment: what counts as "dramatically different coverage", and how can it be determined?
      Thank you very much for your help!!
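
      For the self-chimera idea (repeated regions within one contig), a rough k-mer sketch is given below. It reports how much of a contig is repeated in the same orientation and how much also occurs as its own reverse complement. The function name and the choice of k are assumptions for illustration; k should be odd so that no k-mer is its own reverse complement, and genuine repeats will also raise both values.

```python
from collections import Counter

COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.upper().translate(COMP)[::-1]


def repeat_fractions(seq, k=25):
    """Return (direct, inverted) for one contig.

    direct:   fraction of distinct k-mers occurring at two or more positions.
    inverted: fraction of distinct k-mers whose reverse complement is also
              present in the contig.
    Use an odd k (the default here is only an illustrative choice).
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    if not counts:
        return 0.0, 0.0
    n = len(counts)
    direct = sum(1 for c in counts.values() if c >= 2) / n
    inverted = sum(1 for km in counts if revcomp(km) in counts) / n
    return direct, inverted
```

      A contig made of a sequence joined to its own reverse complement would give an inverted fraction close to 1, while an ordinary transcript should stay near 0 for both values.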

      • #4
        Originally posted by kmcarr
        I'm sorry to say that I don't have a good, easy method to identify chimeras in de novo assembled putative transcripts. To be honest, normally I acknowledge that it is likely there will be chimeras but don't do anything to identify them.

        Here are some theoretical methods:

        Use BLASTX alignment of a reference protein set and examine results to see if multiple proteins align to different segments of the putative transcript.

        Use ORF prediction software on the putative transcripts. If multiple large ORFs are identified, BLAST the translated protein sequences to test if all of them are consistent (i.e. the multiple ORFs in different frames on the same strand may result from frame shifts introduced by misassembly).
        The most important check is whether you have full-length matches. Often, an N- or C-terminus will be placed on a different contig/scaffold from the core of the protein. In diploid/polyploid organisms, sequencing errors mean you cannot even determine whether a fragment of a transcript originated from locus 1, 2, or 3 when they all share 95-100% identity (and they do, at least in some regions, thanks to recent whole-genome duplication events). There are many cases like this. This is one of the reasons why I always say that with NGS one can never get a correct answer for alternatively spliced genes. Unless we sequence a transcript as a whole piece, it is all just guesswork. A short, 80 nt overlap between two reads does not justify the conclusion that exons C and D are present in the same transcript. An assembler will always propose that A-B-C-D-E-F-G are in one transcript but will hardly ever reveal that only A-B-E-F and A-B-C-F-G are actually expressed. With high coverage the situation could be more optimistic, but that depends on the number of biological and lab replicates, not just on the number of emPCR droplets or clusters derived from the same PCR experiment. Although instructing an assembler to watch the uniformity of coverage is cheating a bit, I believe it helps at least in some cases.

        A second important check is for seemingly new exon extensions or truncations, and for seemingly "unremoved" introns, any of which will simply break a multiple sequence alignment of your favourite gene.

        Originally posted by kmcarr
        Align the original RNA-Seq reads to your putative transcripts and examine how even the depth of coverage is along each transcript. A contig with dramatically different coverage at one end versus the other, or with two deeply covered ends separated by a region of very shallow coverage, may be a chimera.
        In my experience, chimeras remain even in combined shotgun + paired-end datasets, and even when technologies are combined, such as Illumina + 454. That is a nightmare for me. I have some idea why that is and what "requirements" need to be fulfilled for them to remain. Luckily, in other cases removing chimeras results in longer contigs/scaffolds, lower contig/scaffold counts, and better N50/N90 numbers. But the numbers are not 10x better: if you ban one chimeric join you split 1 contig into 2, so the assembler starts from a much worse position and has to find completely different assembly paths. Once you accept that, it is pleasing that in the end one gets a slightly better assembly in terms of these semi-useful numbers. But the scaffolds/contigs are different.

        It depends on which lab protocol you used to obtain the data, but maybe you would appreciate a commercial service from me? I can properly trim datasets from some complex protocols, with almost no overtrimming and no misses. See http://www.bioinformatics.cz/softwar...rted-protocols . Although I developed it for 454-based datasets, I could help with data from some other technologies. It depends.
        Last edited by martin2; 03-04-2014, 06:28 AM. Reason: Check for exon/introns lengths as well.

        • #5
          I believe the MIRA assembler can detect chimeras.
