  • multiple k-mer & oases

    I have yet to try Oases, but I am curious about what exactly it does. Most importantly, will it join contigs that are output from a Velvet assembly (even if the data is single-end)? Or does it simply cluster sequences with similar regions (e.g. the same exons) and assign locus and transcript IDs to these clusters? Basically, will Oases create longer contigs from my Velvet assembly?
    It seems like the best technique for transcriptome assembly is the multiple k-mer method. But all the papers I have read simply use the output of Velvet when merging. Would it be smarter to run Oases after each Velvet run, and then merge the Oases output instead?

  • #2
    Hi blindtiger454,
    have you had any luck with this? I am in the same situation. I have run Velvet with 4 different k-mers and want to run Oases on the combination, but I am not sure whether the merging is done before or after Oases, or what should be merged and how. The Oases manual does not specify how to do this. I have asked on the Oases mailing list, with no answer so far.

    • #3
      We merged the separate Oases outputs. This created longer transcripts, but you lose the loci/transcript information and are left with many sequences to annotate. We used CD-HIT-EST to merge, and are writing a simple script that will recapture the loci/transcript information.

      • #4
        Originally posted by blindtiger454
        We merged the separate Oases outputs. This created longer transcripts, but you lose the loci/transcript information and are left with many sequences to annotate. We used CD-HIT-EST to merge, and are writing a simple script that will recapture the loci/transcript information.
        I would also have merged the Oases output, since Oases runs on velvetg's output, and merging contigs.fa from several velvetg runs would not allow Oases to work as intended.
        Perhaps you could try merging with the TGICL/CAP3 package, since in our experience CD-HIT-EST does not work optimally, i.e. it is not able to eliminate all redundancy.

        How are you planning to resolve the loci/transcript information?

        • #5
          oases + TGICL

          Hi,
          We also used Oases + TGICL on RNA-Seq data, and it seems to work: we started with 73400 contigs from Oases and ended up with 39000 contigs after TGICL. But we don't know if the result is correct, because the Oases contigs contained Ns and we are not sure that TGICL handles Ns well.
          bye, VB

          • #6
            We are dealing with a plant transcriptome, where there is a lot of small variability between paralogues and allelic sister genes. We didn't want to lose that variability, so we stuck with strict CD-HIT alignment parameters (100% identity). However, we are trying the TGICL tool as I write this. But I agree with vbiaudet, the Ns produced by Oases scare me.
            Regarding CD-HIT, it outputs a file listing which sequences collapsed into which representative sequence. We appended the k-mer value to each transcript name. We basically just parse that file: if a locus from a certain k-mer was collapsed but has other transcript variants that were not flagged as redundant, we assign a gene ID linking those transcripts to the representative sequence. It is hard to explain, but trivial to understand. I have never used TGICL before, so maybe it will provide better cluster information than CD-HIT did.
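
The parsing described above could be sketched roughly as follows in Python. This is only an illustration, not the poster's actual script. It assumes the usual CD-HIT `.clstr` layout (a `>Cluster N` header followed by member lines, with the representative marked by `*`) and a hypothetical naming scheme in which the k-mer is prefixed to the Oases name, e.g. `k29_Locus_7_Transcript_1/2`. The linking rule used here (loci that end up sharing a cluster get one gene ID) is one possible reading of the description; `parse_clstr`, `locus_key` and `assign_genes` are made-up names.

```python
import re

# One member line of a .clstr file, e.g.
#   "1\t2799nt, >k29_Locus_7_Transcript_1/2... at +/100.00%"
MEMBER = re.compile(r"^\d+\s+\d+(?:nt|aa),\s+>(.+?)\.\.\.\s*(\*|at\s.+)$")

def parse_clstr(lines):
    """Return {representative_name: [names collapsed into it]}."""
    clusters = {}
    rep, members = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            if rep is not None:
                clusters[rep] = members
            rep, members = None, []
            continue
        m = MEMBER.match(line)
        if m is None:
            continue  # blank or unrecognised line
        name, status = m.groups()
        if status == "*":
            rep = name
        else:
            members.append(name)
    if rep is not None:
        clusters[rep] = members
    return clusters

def locus_key(name):
    """(k-mer, locus number) from the ASSUMED 'k29_Locus_7_Transcript_...' name."""
    m = re.match(r"(k\d+)_Locus_(\d+)_Transcript_", name)
    return (m.group(1), int(m.group(2))) if m else (name, 0)

def assign_genes(clusters):
    """Give one gene ID to all representatives whose loci were linked by a collapse."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for rep, members in clusters.items():
        root = find(locus_key(rep))
        for mem in members:
            other = find(locus_key(mem))
            if other != root:
                parent[other] = root

    gene_ids, genes = {}, {}
    for rep in clusters:
        root = find(locus_key(rep))
        genes[rep] = gene_ids.setdefault(root, f"gene_{len(gene_ids) + 1}")
    return genes
```

With this, a transcript such as `k29_Locus_7_Transcript_2/2` that survived clustering would receive the same gene ID as the representative that absorbed `k29_Locus_7_Transcript_1/2`.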

            • #7
              Oases + CAP3

              Hi blindtiger454 and all

              I'm also working on the same thing, and am currently running my dataset through Oases.

              I believe most papers used Velvet instead of Oases because Oases was still new at that time; maybe the authors were more comfortable with Velvet. But the paper "De Novo Assembly of Chickpea Transcriptome Using Short Reads for Gene Discovery and Marker Identification" shows that Oases does better than Velvet on the assembly. Because of this finding, I've decided to go with Oases, then merge the generated transcripts using CAP3.

              Just one question here: how do you guys select "the best k-mers"?

              kamal
              Last edited by masterpiece; 07-19-2011, 07:13 AM.

              • #8
                The best k-mer value will usually be about 2/3 the length of your reads. For us, we considered the best k-mer to be the one whose assembly used the most reads. We have a powerful computer though, and a Velvet or Oases assembly only took about 2 hours. We have 55 bp reads, so we did assemblies for k-mers 29 through 51. Combining all the transcripts from all k-mer assemblies gave us about 1,000,000 transcripts. After running CD-HIT-EST (using a strict 100% identity for the alignment), the dataset was reduced to about 500,000 transcripts. We then ran that dataset through the TGICL/CAP3 pipeline, which reduced it to ~175,000 transcripts. We made the clustering parameters stricter in TGICL. Normally it uses 94% identity, but we raised it to 97% because plants have many paralogues with high sequence identity.
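
As a small worked example of the rule of thumb above, the following Python sketch computes the "about 2/3 of the read length" starting point and lists the k-mer values for a multi-k run. The function names are made up for illustration. The only outside fact relied on is that Velvet needs an odd k, so the sketch steps even values down by one; the post does not say whether every value between 29 and 51 was used or only the odd ones.

```python
def rule_of_thumb_k(read_length):
    """Roughly 2/3 of the read length, forced to an odd value."""
    k = round(read_length * 2 / 3)
    return k if k % 2 == 1 else k - 1

def odd_kmers(k_min, k_max):
    """All odd k values in the inclusive range [k_min, k_max]."""
    start = k_min if k_min % 2 == 1 else k_min + 1
    return list(range(start, k_max + 1, 2))

print(rule_of_thumb_k(55))   # 55 bp reads, as in the post
print(odd_kmers(29, 51))     # the k-mer range used in the post
```

For 55 bp reads this gives 37 as the starting point, and 29 to 51 yields twelve odd k values, i.e. twelve separate assemblies to merge.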

                • #9
                  Originally posted by blindtiger454
                  Combining all the transcripts from all k-mer assemblies gave us about 1,000,000 transcripts. After running CD-HIT-EST (using a strict 100% identity for the alignment), the dataset was reduced to about 500,000 transcripts. We then ran that dataset through the TGICL/CAP3 pipeline, which reduced it to ~175,000 transcripts. We made the clustering parameters stricter in TGICL. Normally it uses 94% identity, but we raised it to 97% because plants have many paralogues with high sequence identity.
                  Quite good! 175000 sounds more realistic than 500000 transcripts. In our case, TGICL+CAP3, also used with stricter parameters (-p 98), was able to reduce 500000 transcripts to 120000. Best wishes

                  • #10
                    Watch out for hairpins and palindromes in your assembly. I've heard that Oases will create palindromes, and that CD-HIT-EST or CAP3 will favor these sequences because they are longer. I noticed them in the annotation process: there will be a full protein-coding sequence on the negative strand, but further down the sequence, the exact same protein sequence will be on the positive strand. So we have to devise a way to extract the protein sequence with accurate UTRs.
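
One simple way to flag candidates for this kind of artifact is to check how much of a transcript also appears in its own reverse complement. The Python sketch below is a heuristic only, not something taken from Oases, CD-HIT-EST or CAP3: `self_rc_fraction` and the default `k=25` are arbitrary choices for illustration, and any cutoff would have to be tuned on real data. An odd k is used because an odd-length k-mer can never be its own reverse complement.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]

def self_rc_fraction(seq, k=25):
    """Fraction of distinct k-mers in seq that also occur in its reverse complement.

    Values near 0 are expected for ordinary transcripts; values near 1
    suggest that one part of the sequence mirrors another (hairpin/palindrome).
    """
    seq = seq.upper()
    if len(seq) < k:
        return 0.0
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    rc = reverse_complement(seq)
    rc_kmers = {rc[i:i + k] for i in range(len(rc) - k + 1)}
    return len(kmers & rc_kmers) / len(kmers)
```

Transcripts with a high fraction could then be inspected by hand or split at the mirror point before annotation.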

                    • #11
                      Guys,

                      Have you heard of iAssembler? It was initially used to align 454 reads and EST data. I'm not sure if it can be used for combining the Velvet transcripts, but in theory I think it can be done. Check this doc out: http://bioinfo.bti.cornell.edu/tool/...iAssembler.pdf. It can handle type I and type II errors pretty well, better than TGICL and CAP3. I will give this a try if I get time to do it.

                      kamal

                      • #12
                        Hi,

                        Recently I used this de novo assembly approach to build the transcriptome of a non-sequenced eukaryote: Velvet/Oases with multiple k-mers, plus Trinity. I combined the different transcriptomes into a single file, obtaining ~1,000,000 sequences with an N50 of 500. After that I used CD-HIT-EST and TGICL to obtain a final transcriptome of 250,000 sequences with an N50 of 900.

                        It was a great improvement, but I think that I still have too many sequences. Do you recommend using other programs to reduce the number of sequences, maybe iAssembler? How can I check the quality of this transcriptome? I am worried about the presence of Ns (0.007%) in the final transcriptome. What do you think? The objective of this transcriptome is to use it to compare SNPs/indels from different samples.
                        Last edited by mcastro; 11-30-2011, 07:08 AM.
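
For reference, the two summary numbers quoted in this post (N50 and the percentage of N bases) can be computed with a few lines of Python. This is a minimal sketch under the usual definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of all bases); the function names are made up, and sequences are assumed to be already loaded as plain strings.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input

def n_fraction(seqs):
    """Fraction of all bases that are N (case-insensitive)."""
    total = sum(len(s) for s in seqs)
    n_count = sum(s.upper().count("N") for s in seqs)
    return n_count / total if total else 0.0

lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(lengths))                              # 8
print(100 * n_fraction(["ACGN", "NNAA"]))        # 37.5 (percent)
```

Neither number says anything about correctness on its own; they are only useful for comparing merge strategies on the same dataset.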
