**dnusol** · 07-07-2011, 02:57 AM

Hi blindtiger454,
Have you had any luck with this? I am in the same situation. I have run Velvet with 4 different k-mers and want to run Oases on the combination, but I am not sure whether the merging is done before or after Oases, or what is to be merged and how. The Oases manual does not specify how to do this. I have asked on the Oases mailing list, with no answer so far.

**blindtiger454** · 07-07-2011, 11:18 AM

We merged the separate oases output. It created longer transcripts, but you lose the loci/transcript information, and are left with many sequences to annotate. We used CD-HIT-EST to merge, and are writing a simple script that will recapture the loci/transcript information.

**Jenzo** · 07-11-2011, 12:12 AM

Originally posted by blindtiger454:

We merged the separate oases output. It created longer transcripts, but you lose the loci/transcript information, and are left with many sequences to annotate. We used CD-HIT-EST to merge, and are writing a simple script that will recapture the loci/transcript information.

I would also have merged the Oases output, since Oases runs on velvetg's output, and merging contigs.fa from several velvetg runs will not allow Oases to work as intended.
Perhaps you could try merging with the TGICL/CAP3 package, since in our experience CD-HIT-EST does not work optimally, i.e. it is not able to eliminate all redundancy.

How are you planning to resolve the loci/transcript information?

**vbiaudet** · 07-11-2011, 12:41 AM

oases + TGICL

Hi,
We also used Oases + TGICL on RNA-Seq data, and it seems to work. We started with 73400 contigs from Oases and got 39000 contigs after TGICL, but we don't know whether the result is correct: the Oases contigs contained Ns, and we are not sure that TGICL handles Ns well.
bye, VB

**blindtiger454** · 07-13-2011, 06:04 PM

We are dealing with a plant transcriptome, and there is a lot of small variability between paralogues and allelic sister genes. We didn't want to lose that variability, so we stuck with strict CD-HIT alignment parameters (100%). However, we are trying the TGICL tool as I write this. But I agree with vbiaudet, the N's produced by Oases scare me.
Regarding CD-HIT, it outputs a file listing which sequences collapsed into which representative sequence. We appended the k-mer value to each transcript name. We basically just parse the file: if a locus # from a certain k-mer was collapsed but has other transcript variants that weren't flagged as redundant, we assign a gene ID linking that transcript to the representative sequence. It is hard to explain, but trivial to understand. I have never used TGICL before, so maybe it will provide better cluster information than CD-HIT did.
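The parsing step described above could look roughly like the sketch below. It is not the poster's script. It assumes the usual CD-HIT `.clstr` layout (a `>Cluster N` header followed by member lines, with the representative marked by `*`) and a hypothetical naming scheme of the form `k29_Locus_1_Transcript_1/2`, where the `k29_` prefix stands in for the appended k-mer value. The function names are made up for illustration.

```python
import re
from collections import defaultdict

# One member line of a CD-HIT .clstr file, e.g.
# "0\t1500nt, >k29_Locus_1_Transcript_1/2... *"
MEMBER = re.compile(r"^\d+\s+\d+(?:nt|aa),\s+>(.+?)\.\.\.\s+(\*|at\s.+)$")

# Hypothetical naming scheme (an assumption, not from the thread):
# "k<kmer>_Locus_<n>_Transcript_<i>/<total>"
NAME = re.compile(r"^k(\d+)_Locus_(\d+)_Transcript_")


def parse_clstr(lines):
    """Return {representative_name: [member_names]} from .clstr lines."""
    clusters = {}
    rep, members = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # A new cluster starts: store the previous one, if any.
            if rep is not None:
                clusters[rep] = members
            rep, members = None, []
            continue
        match = MEMBER.match(line)
        if match is None:
            continue
        name, flag = match.groups()
        members.append(name)
        if flag == "*":
            rep = name
    if rep is not None:
        clusters[rep] = members
    return clusters


def locus_to_representatives(clusters):
    """Map each (kmer, locus) pair to the representatives its transcripts joined."""
    genes = defaultdict(set)
    for rep, members in clusters.items():
        for name in members:
            match = NAME.match(name)
            if match:
                key = (int(match.group(1)), int(match.group(2)))
                genes[key].add(rep)
    return dict(genes)
```

A locus that maps to more than one representative is one whose transcript variants were split across clusters, which is the case where a shared gene ID would be needed.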

**masterpiece** · 07-18-2011, 08:19 PM

oases + cap3

Hi blindtiger454 and all

I'm also working with the same thing, currently running my dataset on oases.

I believe most papers used Velvet instead of Oases because Oases was still new at the time. Maybe they were more comfortable with Velvet. But the paper "De Novo Assembly of Chickpea Transcriptome Using Short Reads for Gene Discovery and Marker Identification" shows that Oases does better than Velvet on the assembly. Because of this finding, I've decided to go with Oases, then merge the generated transcripts using CAP3.

Just one question: how do you guys select "the best k-mers"?

kamal

**blindtiger454** · 07-19-2011, 12:27 PM

The best k-mer value will usually be about 2/3 the length of your reads. For us, we considered the best k-mer to be the one whose assembly used the most reads. We have a powerful computer though, and a Velvet or Oases assembly only took about 2 hours. We have 55bp reads, so we did assemblies for k-mer values from 29 to 51. Combining all the transcripts from all k-mer assemblies gave us ~1,000,000 transcripts. Running CD-HIT-EST (using a strict 100% identity for the alignment) reduced the dataset to about 500,000 transcripts. We then ran that dataset through the TGICL/CAP3 pipeline, which reduced it to ~175,000 transcripts. We made the clustering parameters stricter in TGICL. Normally it uses 94% identity, but we raised it to 97% because plants have many paralogues which have high sequence identity.
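As a rough illustration of the k-mer selection described above, here is a minimal sketch. It assumes only odd k values are tried (Velvet expects an odd hash length) and that the number of reads used per assembly has already been collected into a dictionary. The function names and the dictionary are hypothetical.

```python
def candidate_kmers(read_length, k_min, k_max):
    """Odd k values in [k_min, k_max] that are shorter than the read."""
    start = k_min if k_min % 2 == 1 else k_min + 1
    return [k for k in range(start, k_max + 1, 2) if k < read_length]


def rule_of_thumb_k(read_length):
    """An odd k close to 2/3 of the read length."""
    k = round(2 * read_length / 3)
    return k if k % 2 == 1 else k + 1


def best_by_reads_used(reads_used):
    """Pick the k whose assembly used the most reads.

    reads_used maps k -> number of reads used (collected by the caller).
    """
    return max(reads_used, key=reads_used.get)


if __name__ == "__main__":
    # 55 bp reads with k from 29 to 51, as in the post above.
    print(candidate_kmers(55, 29, 51))
    print(rule_of_thumb_k(55))
```

The rule of thumb only gives a starting point. The "most reads used" criterion still requires running each assembly and reading the count from its log.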

**Jenzo** · 07-19-2011, 10:17 PM

Originally posted by blindtiger454:

Combining all the transcripts from all kmer assemblies gave us about ~1,000,000 transcripts. After running CD-HIT-EST (using a strict 100% identity for the alignment), it reduced the dataset to about 500,000 transcripts. We then ran that dataset through the TGICL/CAP3 pipeline, and it reduced the dataset to ~175,000 transcripts. We made the clustering parameters more strict in TGICL. Normally it uses 94% identity, but we raised it to 97% because plants have many paralogues which have high sequence identity.

Quite good! 175000 sounds more realistic than 500000 transcripts. In our case, TGICL+CAP3, also run with stricter parameters (-p 98), was able to reduce the dataset from 500000 to 120000 transcripts. Best wishes

**blindtiger454** · 07-20-2011, 11:42 AM

Watch out for hairpins & palindromes in your assembly. I've heard that Oases can create palindromes, and that CD-HIT-EST or CAP3 will favor these sequences because they are longer. I noticed them during annotation: there will be a full protein-coding sequence on the negative strand, but further down the sequence, the exact same protein sequence will be on the positive strand. So we have to devise a way to extract the protein sequence with accurate UTRs.
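One simple way to flag candidate palindromic transcripts is sketched below. It is a heuristic offered as an assumption, not something used in this thread: it measures what fraction of a sequence's distinct k-mers also occur in the same sequence as their reverse complement. The function names, the default `k`, and any cutoff applied to the score are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA string (N is kept as N)."""
    return seq.translate(COMPLEMENT)[::-1]


def self_revcomp_fraction(seq, k=25):
    """Fraction of distinct k-mers whose reverse complement also occurs in seq.

    Values near 1.0 suggest the sequence folds back on itself;
    values near 0.0 are what an ordinary transcript should give.
    An odd k guarantees no k-mer is its own reverse complement.
    """
    seq = seq.upper()
    if len(seq) < k:
        return 0.0
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    hits = sum(1 for kmer in kmers if reverse_complement(kmer) in kmers)
    return hits / len(kmers)
```

Transcripts with a high score could then be inspected by hand or split at the midpoint before annotation. A sensible threshold would have to be calibrated on real data.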

**masterpiece** · 07-31-2011, 07:25 PM

Guys,

Have you heard of iAssembler? It was initially used to align 454 reads and EST data. I'm not sure if it can be used for combining the Velvet transcripts, but theoretically I think it can be done. Check this doc out: http://bioinfo.bti.cornell.edu/tool/...iAssembler.pdf. It can handle type I and II errors pretty well, better than TGICL and CAP3. I will give it a try if I get time.

kamal

**mcastro** · 11-30-2011, 06:47 AM

Hi,

Recently I used this de novo assembly approach to build the transcriptome of a non-sequenced eukaryote: velvet/oases with multiple k-mers, plus Trinity. I combined the different transcriptomes into one file, obtaining ~ 1.000.000 sequences with an N50 of 500. After that I used cd-hit-est and tgicl to obtain a final transcriptome of 250.000 sequences with an N50 of 900.

It was a great improvement, but I think that I still have too many sequences. Do you recommend using other programs to reduce the number of sequences, maybe iAssembler? How can I check the quality of this transcriptome? I am worried about the presence of Ns (0.007%) in the final transcriptome. What do you think? The goal is to use this transcriptome to compare SNPs/indels between different samples.
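For reference, the two numbers quoted above (N50 and the percentage of Ns) can be computed as in the minimal sketch below. It assumes the sequences have already been read into memory as plain strings. The function names are hypothetical, and these metrics describe contiguity and gaps only, not correctness.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def n_fraction(seqs):
    """Fraction of all bases that are N (multiply by 100 for a percentage)."""
    total = 0
    n_count = 0
    for seq in seqs:
        total += len(seq)
        n_count += seq.upper().count("N")
    return n_count / total if total else 0.0
```

Comparing these values before and after each merging step shows whether the step lengthened transcripts or only removed duplicates.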

multiple k-mer & oases