SEQanswers

11-29-2013, 12:27 AM   #1
blue mood
Member
Location: China
Join Date: Jul 2011
Posts: 22
454 RNA assembly

Hello all,

I'm working on a transcriptome that was sequenced with Roche 454. At first I assembled it with Newbler, but the result was awful. Then I tried iAssembler and got a result that looks all right. But when I mapped the reads back, only about 30% of them aligned to the assembly.

Is there anyone who can help me with this issue? Many thanks.
11-29-2013, 12:37 AM   #2
sisch
Member
Location: Dusseldorf, Germany
Join Date: Jun 2011
Posts: 29

We investigated the effect of the assembly algorithm on the assembly quality of 454 data some time ago (see J. Exp. Bot.: "Critical assessment of assembly strategies for non-model species mRNA-Seq data and application of next-generation sequencing to the comparison of C(3) and C(4) species").

However, to cut a long story short, I would suggest 1) read cleaning with a decent quality cut-off (Phred 20 or higher) and 2) assembly with CAP3 or TGICL, which essentially is CAP3 on preclustered reads.
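
As a rough illustration of step 1, the sketch below trims low-quality bases from the 3' end of each read and drops reads that end up too short. It is only a sketch under stated assumptions, not the pipeline used in the paper: it assumes the 454 reads have already been converted to Phred+33 FASTQ, and the function names and the minimum length of 50 are hypothetical choices.

```python
from typing import Iterable, Iterator, Tuple

PHRED_OFFSET = 33   # assumption: Phred+33 encoded FASTQ
MIN_QUALITY = 20    # the Phred 20 cut-off suggested above
MIN_LENGTH = 50     # hypothetical minimum length for a read to be kept


def trim_3prime(seq: str, qual: str, min_q: int = MIN_QUALITY) -> Tuple[str, str]:
    """Remove bases from the 3' end until the last base has quality >= min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - PHRED_OFFSET < min_q:
        end -= 1
    return seq[:end], qual[:end]


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from well-formed 4-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between or after records
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line, ignored
        qual = next(it).rstrip("\n")
        yield header, seq, qual


def clean_reads(lines: Iterable[str],
                min_q: int = MIN_QUALITY,
                min_len: int = MIN_LENGTH) -> Iterator[Tuple[str, str, str]]:
    """Trim each read at the 3' end and keep it only if it is still long enough."""
    for header, seq, qual in read_fastq(lines):
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len:
            yield header, seq, qual
```

With a file handle this could be used as `for header, seq, qual in clean_reads(open("reads.fastq")): ...`, where `reads.fastq` is a placeholder file name. Only the 3' end is trimmed here; a low-quality base in the middle of a read is left alone.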

Best
Simon
11-29-2013, 12:52 AM   #3
blue mood
Member
Location: China
Join Date: Jul 2011
Posts: 22

Hi Simon,

Thanks for your suggestions.
Can you recommend some software that can do the quality control? I used SeqClean to trim vectors, but it can't do quality control.
iAssembler uses MIRA and CAP3. I also ran TGICL after that. But I still don't understand why so few reads align to the assembly.
11-29-2013, 02:14 AM   #4
sisch
Member
Location: Dusseldorf, Germany
Join Date: Jun 2011
Posts: 29

Sorry, I just realized my post was a bit short.

We usually use the FASTQ/FASTA quality trimming and filtering tools from the Galaxy FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/).
There are multiple possible reasons for your low mapping efficiency:
1) Mapping tool too stringent
2) Assembly errors
3) Only unique mapping reads reported, while having many repetitive contigs
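
To see which of these reasons applies, it can help to break the mapping result down before switching tools. The following is a minimal sketch, assuming the mapper writes a SAM file that still contains the unmapped reads. The function name and the MAPQ threshold are illustrative assumptions, and the meaning of MAPQ differs between aligners, so the "low MAPQ" count is only a hint at ambiguous placements.

```python
from typing import Dict, Iterable

FLAG_UNMAPPED = 0x4         # SAM flag bit: read is unmapped
FLAG_SECONDARY = 0x100      # SAM flag bit: secondary alignment
FLAG_SUPPLEMENTARY = 0x800  # SAM flag bit: supplementary alignment


def mapping_summary(sam_lines: Iterable[str], low_mapq: int = 1) -> Dict[str, float]:
    """Count mapped, unmapped and low-MAPQ reads, one primary record per read."""
    total = mapped = low = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue  # header or blank line
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & (FLAG_SECONDARY | FLAG_SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        total += 1
        if flag & FLAG_UNMAPPED:
            continue
        mapped += 1
        if int(fields[4]) < low_mapq:
            low += 1  # possibly placed ambiguously (aligner-dependent)
    return {
        "reads": total,
        "mapped": mapped,
        "unmapped": total - mapped,
        "low_mapq": low,
        "mapped_fraction": mapped / total if total else 0.0,
    }
```

If the "reads" count is clearly lower than the number of reads given to the mapper, the output probably omits unmapped or multi-mapping reads, which points towards reason 3. A large "low_mapq" share would hint at redundant or repetitive contigs.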

As there is no golden rule in transcriptomics (as far as I know), you might want to share some information about your project and discuss it with the people here (e.g. model vs. non-model, animal/plant/etc., experimental design: qualitative vs. quantitative, etc.). The more you are able and willing to share, the more you can get out of it here.

Best,
Simon
11-29-2013, 07:19 PM   #5
blue mood
Member
Location: China
Join Date: Jul 2011
Posts: 22

About my project: I have two shrimp samples, one infected with a virus and the other serving as a control. My aim is to find gene expression changes under infection.
Thanks

Robin
12-02-2013, 03:58 AM   #6
flxlex
Moderator
Location: Oslo, Norway
Join Date: Nov 2008
Posts: 415

454's own program, Newbler (gsAssembler, runAssembly), has a '-cdna' option. Did you use that? It is very unfortunate that this program was not tested in the study by @sisch, as on paper it seems a nice approach.
12-04-2013, 02:52 AM   #7
sisch
Member
Location: Dusseldorf, Germany
Join Date: Jun 2011
Posts: 29

@flxlex Yes, it is a pity that Newbler was not included.

@blue_mood
I don't really have an overview of your field, as I come from plant science, so please bear with my ignorance. Just to add some more thoughts:

- Is there ANY genome available that is reasonably closely related? In plants, a reference that diverged as long as 50 million years ago may still yield good results.
- Do you know the viral genome (i.e. can you tell which transcripts are of viral origin)?
- I've seen datasets with virus-induced transcriptional changes where many of the differences were in splice variants. These are hard to detect with de novo approaches, so you should have a genome reference to fall back on.
12-07-2013, 01:03 PM   #8
martin2
Member
Location: Prague, Czech Republic
Join Date: Nov 2010
Posts: 40

Hi Robin,
I developed my own tool for QC and adapter/artifact/MID removal, and I offer cleanup work as a commercial service. I think I may dare to say that I really have an overview, based on more than 1700 454-based datasets from around the world. I collected lots of artifacts from all those datasets and learned what one has to remove to get better assemblies. I also found several funny mistakes that happen from time to time in some labs, some software-driven errors and, notably, bad designs of certain lab protocols.

With respect to publicly available tools ... and your question about what you could use for your work: none. They just don't do the right thing at all. Nobody told those programmers what to look for, and because I know what they are missing, I can only say that they never tested their software properly. In contrast, I have been constantly rewriting my tool for several years and kept hitting yet another new artifact or new adapter in datasets every week. Frustrating. Nowadays I perform at least several hundred queries for each read in a dataset before deciding what is in a particular read. Sadly, in some cases it takes a few thousand queries before I can judge what to trim and how. On top of that, I have to use several aligners to find what I am looking for, because it is sometimes obscured too much by sequencing errors.

Forget about QC based on PHRED values. It is simply useless. CAP3 is the usual trick to squash reads into some contiguous sequence once you realize you are unable to get the reads merged together. It is good for getting contigs that are 400 nt long on average for the purpose of your paper. Interestingly, reviewers let such papers get published although the raw read length was, for example, 310 nt on average (FLXti). Anyway, CAP3 merges closely related copies of the same gene together. Just think how many whole-genome duplications your bug has already undergone. Do you have 4 or 8 loci of your favourite gene in the genome? Why do you think CAP3 would not mix them into a single bunch of splice variants? Maybe it even makes up just one or two splice variants, discards 3 SNPs and drops 7 of those 8 3'-UTR exons. Applying CAP3 to next-gen data should have been banned long ago. It is the last resort you should ever try.

I see several shrimp datasets in NCBI SRA, including one infected with some virus. Is that the dataset you are talking about here? Drop me an email if you want me to clean up the data for you.

All times are GMT -8.

