SEQanswers

01-21-2013, 09:01 AM   #1
vallejov
Member
 
Location: Michigan

Join Date: Jul 2011
Posts: 10
How to determine chimeras in my de novo assembly?

Hi all,

I would like to QC my de novo transcriptome assembly (no reference genome available) by looking for chimeric transcripts. Ideally, I would like to estimate the percentage of chimeric transcripts in my assembly. I would appreciate any and all suggestions on how I might go about doing this.

Thanks,
Veronica
01-30-2013, 12:21 PM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,173

Hi Veronica,

I'm sorry to say that I don't have a good, easy method to identify chimeras in de novo assembled putative transcripts. To be honest, normally I acknowledge that it is likely there will be chimeras but don't do anything to identify them.

Here are some theoretical methods:

Use BLASTX alignment of a reference protein set and examine results to see if multiple proteins align to different segments of the putative transcript.

Use ORF prediction software on the putative transcripts. If multiple large ORFs are identified, BLAST the translated protein sequences to test if all of them are consistent (i.e. the multiple ORFs in different frames on the same strand may result from frame shifts introduced by misassembly).

Align the original RNA-Seq reads to your putative transcripts and examine how even the depth of coverage is across the length of the transcript. A contig that has dramatically different coverage at one end vs. the other, or whose two ends have deep coverage separated by a region of very shallow coverage, may be a chimera.
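
The first suggestion above (BLASTX against a reference protein set) could be screened automatically along these lines. This is only a minimal sketch, assuming the BLASTX results were saved in the standard 12-column tabular format; the function names and the `max_overlap` value of 30 nt are illustrative choices, not anything from this thread. It keeps the best-scoring hits that occupy (nearly) non-overlapping parts of each contig and flags contigs where those segments belong to different proteins. A flagged contig is only a candidate: a genuine multi-domain transcript can produce the same pattern.

```python
from collections import defaultdict

def parse_blast_tab(lines):
    """Yield (query, subject, qstart, qend, bitscore) from 12-column BLAST tabular lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        qs, qe = int(f[6]), int(f[7])
        # BLASTX reports qstart > qend for minus-frame hits, so normalise the order
        yield f[0], f[1], min(qs, qe), max(qs, qe), float(f[11])

def candidate_chimeras(lines, max_overlap=30):
    """Return {contig: [(subject, start, end), ...]} for contigs whose best hits
    to different proteins lie on (nearly) non-overlapping segments.
    max_overlap is an arbitrary illustrative tolerance in nucleotides."""
    hits = defaultdict(list)
    for q, s, start, end, score in parse_blast_tab(lines):
        hits[q].append((score, s, start, end))
    flagged = {}
    for q, hs in hits.items():
        kept = []
        # visit hits from highest to lowest bit score
        for score, s, start, end in sorted(hs, reverse=True):
            # keep a hit only if it barely overlaps every segment kept so far
            if all(min(end, ke) - max(start, ks) + 1 <= max_overlap
                   for _, ks, ke in kept):
                kept.append((s, start, end))
        if len({s for s, _, _ in kept}) > 1:
            flagged[q] = sorted(kept, key=lambda h: h[1])
    return flagged
```

Dividing the number of flagged contigs by the number of contigs with at least one hit gives a rough candidate rate, which is an upper-bound style estimate rather than a true chimera percentage.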
02-13-2014, 05:13 AM   #3
jordi
Member
 
Location: València, Spain

Join Date: Apr 2009
Posts: 48

Hi all!
I am now dealing with this same issue: how to detect chimeras in an RNA-Seq assembly without a reference transcriptome.
I've read that chimeras joining different genes cannot be resolved without a reference. Self-chimeras, by contrast, can be detected by looking for repeated regions within the same contig.
However, a sudden change in coverage along a contig could help to estimate the number of chimeras in an assembly project. Here is my question: given the coverage of a transcript, how do I set a threshold for deciding that a change in coverage points to a chimera? Regarding kmcarr's comment: what counts as "dramatically different coverage", and how can it be determined?
Thank you very much for your help!!
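
No cutoff is given anywhere in this thread, so the following is only a sketch of how the question could be made concrete. It assumes per-base depth in the three-column layout (contig, position, depth) written by `samtools depth -a`, sorted by position. The window size, the 10-fold threshold and the pseudocount are arbitrary illustrative values that would need tuning on real data; uneven coverage also arises for reasons unrelated to chimeras (for example 3' bias).

```python
from statistics import median

def read_depth_table(lines):
    """Group per-base depths by contig from tab-separated lines: contig, position, depth."""
    depths = {}
    for line in lines:
        if not line.strip():
            continue
        contig, _pos, depth = line.split("\t")[:3]
        depths.setdefault(contig, []).append(int(depth))
    return depths

def coverage_flags(depth, window=100, fold=10.0, pseudo=1.0):
    """Return (end_ratio, dip_ratio, suspicious) for one contig's depth list,
    or None if the contig is too short to compare three windows.
    fold, window and pseudo are illustrative assumptions, not validated cutoffs."""
    if len(depth) < 3 * window:
        return None
    left = median(depth[:window]) + pseudo
    right = median(depth[-window:]) + pseudo
    # lowest windowed median found between the two end windows
    step = window // 2 or 1
    interior = min(
        median(depth[i:i + window])
        for i in range(window, len(depth) - 2 * window + 1, step)
    ) + pseudo
    end_ratio = max(left, right) / min(left, right)  # one end vs. the other
    dip_ratio = min(left, right) / interior          # ends vs. shallowest interior window
    return end_ratio, dip_ratio, end_ratio >= fold or dip_ratio >= fold
```

Looking at the distribution of `end_ratio` and `dip_ratio` over all contigs before choosing `fold` is probably more defensible than fixing a number in advance.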
03-04-2014, 06:26 AM   #4
martin2
Member
 
Location: Prague, Czech Republic

Join Date: Nov 2010
Posts: 40

Quote:
Originally Posted by kmcarr
I'm sorry to say that I don't have a good, easy method to identify chimeras in de novo assembled putative transcripts. To be honest, normally I acknowledge that it is likely there will be chimeras but don't do anything to identify them.

Here are some theoretical methods:

Use BLASTX alignment of a reference protein set and examine results to see if multiple proteins align to different segments of the putative transcript.

Use ORF prediction software on the putative transcripts. If multiple large ORFs are identified, BLAST the translated protein sequences to test if all of them are consistent (i.e. the multiple ORFs in different frames on the same strand may result from frame shifts introduced by misassembly).
The most important check is whether you have full-length matches. Often, an N- or C-terminus will be placed on a different contig/scaffold from the core of the protein. In diploid/polyploid organisms, because of sequencing errors, you won't even get a definite answer as to whether a fragment of a transcript originated from locus 1, 2 or 3, provided they all have 95-100% identity (and they do, at least in some places, thanks to recent whole-genome duplication events). There are many cases like this. This is one of the reasons why I always say that with NGS one can never, ever, get a correct answer for alternatively spliced genes. Unless we sequence a transcript as a whole piece, it is all just guesswork. A short, 80 nt overlap between two reads does not justify the conclusion that exons C and D are present in the same transcript. An assembler will always propose that A-B-C-D-E-F-G are in one transcript but will hardly ever reveal that actually only A-B-E-F and A-B-C-F-G are expressed. With high coverage the situation could be more optimistic, but that depends on the number of biological and lab replicates, not just on the number of emPCR droplets or clusters derived from the same PCR experiment. Although instructing an assembler to watch the uniformity of coverage is cheating a bit, I believe it helps at least in some cases.
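
The full-length check mentioned at the start of this reply could be approximated as follows. This is a rough sketch, assuming BLASTX tabular output with the subject length appended as a 13th column (BLAST+ can write this with `-outfmt "6 std slen"`); the function name is hypothetical. For every contig/protein pair it merges the aligned intervals on the protein and reports the fraction of the protein that is covered.

```python
def subject_coverage(lines):
    """Return {(query, subject): fraction of the protein covered by its HSPs}.
    Expects 13 tab-separated columns: the 12 standard fields plus subject length."""
    intervals, slen = {}, {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        key = (f[0], f[1])
        s, e = sorted((int(f[8]), int(f[9])))  # sstart, send on the protein
        intervals.setdefault(key, []).append((s, e))
        slen[key] = int(f[12])
    cov = {}
    for key, ivs in intervals.items():
        covered, cur_s, cur_e = 0, None, None
        # merge overlapping HSP intervals so shared residues are counted once
        for s, e in sorted(ivs):
            if cur_e is None or s > cur_e:
                if cur_e is not None:
                    covered += cur_e - cur_s + 1
                cur_s, cur_e = s, e
            else:
                cur_e = max(cur_e, e)
        covered += cur_e - cur_s + 1
        cov[key] = covered / slen[key]
    return cov
```

A protein whose termini and core reach high coverage only when hits from several contigs are combined would fit the split-across-contigs situation described above.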

A second important check is for seemingly new exon extensions or truncations, and for seemingly "unremoved" introns, which show up as breaks in a multiple sequence alignment of your favourite gene.

Quote:
Originally Posted by kmcarr
Align the original RNA-Seq reads to your putative transcripts and examine how even the depth of coverage is across the length of the transcript. A contig that has dramatically different coverage at one end vs. the other, or whose two ends have deep coverage separated by a region of very shallow coverage, may be a chimera.
From my experience, chimeras remain even in combined shotgun + paired-end datasets, and even when technologies are combined, e.g. Illumina + 454. That is a nightmare for me. I have some idea why that is and what "requirements" need to be fulfilled for them to remain. Luckily, in other cases removal of chimeras results in longer contigs/scaffolds, lower contig/scaffold counts, and better N50/N90 numbers. But the numbers are not 10x better; you have to understand that if you ban one chimeric join you split 1 contig into 2, so the assembler starts with a much worse outlook and has to find completely different assembly paths. Once you accept the situation, it is pleasing that in the end one gets a somewhat better assembly in terms of these semi-useful numbers. But the scaffolds/contigs are different.

It depends on what lab protocol you used to obtain the data, but maybe you would appreciate a commercial service from me? I can properly trim datasets from some complex protocols, with almost no overtrimming and no misses. See http://www.bioinformatics.cz/softwar...rted-protocols . Although I developed that for 454-based datasets, I could help with data from some other technologies. It depends.

Last edited by martin2; 03-04-2014 at 06:28 AM. Reason: Check for exon/introns lengths as well.
03-04-2014, 07:06 AM   #5
JackieBadger
Senior Member
 
Location: Halifax, Nova Scotia

Join Date: Mar 2009
Posts: 381

I believe the MIRA assembler can detect chimeras.
All times are GMT -8.