  • TGICL for denovo transcriptome

    Hi everyone,

    We have RNA-seq data (no reference genome). We ran several assemblers and then merged all the results with CAP3, which gave ~60,000 contigs.
    The problem is that these contigs are redundant, so the percentage of uniquely aligned reads is very small.

    We tried to use TGICL, as suggested by one of the members of this forum.

    The TGICL output contains an ACE file and a singlets file. The problem is that ~40,000 contigs that are in the input file do not appear in either the ACE file or the singlets file.

    Any help with the TGICL output, or other suggestions on how to map the reads to the "redundant" contigs, will be appreciated.
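One way to narrow down the missing-contig question is to list exactly which input IDs are absent from both output files. The following is a minimal sketch, not part of TGICL. It assumes that member sequences appear on `AF` lines of the ACE file (as in the standard ACE format) and that the singlets file is either a plain list of IDs or a FASTA file. The function names and the tiny in-memory example data are hypothetical.

```python
def fasta_ids(lines):
    """Collect the first word after '>' from every FASTA header line."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def ace_member_ids(lines):
    """Collect sequence names from 'AF <name> <C|U> <start>' lines of an ACE file."""
    ids = set()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "AF":
            ids.add(fields[1])
    return ids


def singlet_ids(lines):
    """Assumption: singlets are either FASTA records or one ID per line."""
    lines = list(lines)
    if any(line.startswith(">") for line in lines):
        return fasta_ids(lines)
    return {line.split()[0] for line in lines if line.split()}


def missing_ids(fasta_lines, ace_lines, singlet_lines):
    """IDs present in the input FASTA but in neither the ACE nor the singlets."""
    accounted = ace_member_ids(ace_lines) | singlet_ids(singlet_lines)
    return fasta_ids(fasta_lines) - accounted


# Hypothetical toy data standing in for the real files.
example_fasta = [">c1 len=4", "ACGT", ">c2", "GGCC", ">c3", "TTAA", ">c4", "CCGG"]
example_ace = ["AS 1 2", "", "CO CL1Contig1 8 2 1 U", "ACGTGGCC", "", "AF c1 U 1", "AF c2 U 5"]
example_singlets = ["c3"]

print(sorted(missing_ids(example_fasta, example_ace, example_singlets)))
```

With real data the three arguments could be open file handles. Inspecting a handful of the missing IDs by hand (length, low-complexity content) may show why they were dropped.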

  • #2
    Hi gfmgfm,
    I have the same problem and haven't found an answer yet. Could you tell me how you resolved it? Thank you!

    • #3
      You might want to have a look at cd-hit (http://code.google.com/p/cdhit/) to remove some redundancy.
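To illustrate the general idea of redundancy removal, here is a toy sketch. It is not cd-hit's algorithm: cd-hit-est clusters by a sequence identity threshold, whereas this simplified stand-in only drops contigs that are exactly contained (on either strand) in a longer contig that was already kept. The function names and example sequences are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def drop_contained(contigs):
    """Greedy, longest-first filter: keep a contig unless it (or its reverse
    complement) is an exact substring of a contig that was already kept."""
    kept = {}
    by_length = sorted(contigs.items(), key=lambda item: len(item[1]), reverse=True)
    for name, seq in by_length:
        seq = seq.upper()
        rc = reverse_complement(seq)
        if any(seq in rep or rc in rep for rep in kept.values()):
            continue  # redundant with a longer representative
        kept[name] = seq
    return kept


# Hypothetical toy contigs.
example_contigs = {
    "a": "ACGTACGTTT",
    "b": "GTACG",    # exact substring of "a"
    "c": "AAACGT",   # reverse complement is a substring of "a"
    "d": "GGGGG",    # unrelated, so it is kept
}
print(sorted(drop_contained(example_contigs)))
```

The all-against-kept substring scan is quadratic, so this is only suitable for small examples. For tens of thousands of contigs a dedicated tool is the practical choice.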

      • #4
        Thank you for your advice.

        Originally posted by sklages
        You might want to have a look at cd-hit (http://code.google.com/p/cdhit/) to remove some redundancy.
        Recently I used Trinity to assemble a de novo transcriptome from 25,804,627 paired reads of 90 bp. Trinity produced 67,683 ESTs. When I clustered them with TGICL using default parameters, I got only 205 clusters and 67,247 singletons. So few ESTs were clustered that I think something is wrong, but I can't figure out what. Can someone give me some advice?

        • #5
          TGICL for denovo transcriptome

          Recently I used Trinity to assemble a de novo transcriptome from 25,804,627 paired reads of 90 bp. Trinity produced 67,683 ESTs. When I clustered them with TGICL using default parameters, I got only 205 clusters and 67,247 singletons. So few ESTs were clustered that I think something is wrong, but I can't figure out what. Can someone give me some advice?

          • #6
            Trinity is already geared towards low redundancy, that's why you won't gain much by clustering the contigs with cd-hit-est afterwards - the numbers you gave sound reasonable.
            67 k transcript contigs sounds reasonable for a sample from a heterozygous eukaryote. How many of your reads multi-map? Did you try to discard low-support contigs by checking the RSEM support (see the Trinity website for details)?

            • #7
              Originally posted by arvid
              Trinity is already geared towards low redundancy, that's why you won't gain much by clustering the contigs with cd-hit-est afterwards - the numbers you gave sound reasonable.
              67 k transcript contigs sounds reasonable for a sample from a heterozygous eukaryote. How many of your reads multi-map? Did you try to discard low-support contigs by checking the RSEM support (see the Trinity website for details)?
              Thanks for your advice. I mapped 25,668,103 paired reads to the reference clustered by TGICL. 20,387,812 (79.43%) paired reads and 3,690,778 (7.19%) single reads mapped to the reference. Do you think this result is reasonable? I also have a question: why should I discard low-support contigs? Do low-support contigs affect the gene expression analysis?

              • #8
                Originally posted by pardonliang
                Thanks for your advice. I mapped 25,668,103 paired reads to the reference clustered by TGICL. 20,387,812 (79.43%) paired reads and 3,690,778 (7.19%) single reads mapped to the reference. Do you think this result is reasonable? I also have a question: why should I discard low-support contigs? Do low-support contigs affect the gene expression analysis?
                If I understand your numbers, you're saying that ~80 % of your paired reads map (as pairs), and ~7 % map as singles. If so, I think that is very reasonable. You might want to try to scaffold your contigs with the 7 % of the reads that only map as singles, but you risk getting more chimeras, and I guess the benefit for expression analysis is marginal.
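The quoted percentages can be checked with a few lines of arithmetic. This sketch assumes that the pair percentage is taken relative to the number of read pairs, and that the single-read percentage is taken relative to the total number of reads (two per pair). That denominator is an inference, chosen because it reproduces the reported 7.19%. The function name is hypothetical.

```python
def mapping_summary(total_pairs, mapped_pairs, mapped_singles):
    """Return (% pairs mapped, % reads mapped as singles, % reads mapped overall)."""
    total_reads = 2 * total_pairs
    pct_pairs = 100.0 * mapped_pairs / total_pairs
    # Assumption: singles are reported as a fraction of all individual reads.
    pct_singles = 100.0 * mapped_singles / total_reads
    pct_reads = 100.0 * (2 * mapped_pairs + mapped_singles) / total_reads
    return pct_pairs, pct_singles, pct_reads


pct_pairs, pct_singles, pct_reads = mapping_summary(25668103, 20387812, 3690778)
print(f"pairs: {pct_pairs:.2f}%  singles: {pct_singles:.2f}%  all reads: {pct_reads:.2f}%")
```

Under these assumptions the overall fraction of reads that map is simply the sum of the two reported percentages.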

                Contigs with low read support shouldn't interfere with the expression analysis, but they will slow it down (and slow down other downstream analyses). I usually discard contigs to which RSEM assigns no reads.
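A minimal sketch of that filtering step, assuming a tab-separated RSEM `isoforms.results` table with `transcript_id` and `expected_count` columns. The function name, the `min_count` parameter, and the example rows are hypothetical.

```python
import csv
import io


def supported_transcripts(rsem_lines, min_count=0.0):
    """Return transcript IDs whose RSEM expected_count exceeds min_count."""
    reader = csv.DictReader(rsem_lines, delimiter="\t")
    return [
        row["transcript_id"]
        for row in reader
        if float(row["expected_count"]) > min_count
    ]


# Hypothetical stand-in for an RSEM isoforms.results file.
example_table = io.StringIO(
    "transcript_id\tgene_id\tlength\teffective_length\texpected_count\tTPM\tFPKM\tIsoPct\n"
    "comp1_c0_seq1\tcomp1_c0\t1200\t1100.0\t350.00\t12.5\t10.1\t100.00\n"
    "comp2_c0_seq1\tcomp2_c0\t400\t300.0\t0.00\t0.0\t0.0\t0.00\n"
    "comp3_c0_seq1\tcomp3_c0\t800\t700.0\t4.00\t0.3\t0.2\t100.00\n"
)
print(supported_transcripts(example_table))
```

The resulting ID list can then be used to subset the assembly FASTA. Raising `min_count` makes the filter stricter, at the cost of losing weakly expressed transcripts.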

                • #9
                  Clustering of different transcriptomes

                  Thanks, arvid, but I have another question. I have sequenced two transcriptomes of the same species. I used Trinity to assemble the transcriptomes of the two species separately. I want to find the differentially expressed transcripts between the susceptible and resistant species. I used clustering software to merge the two assemblies so that the reads could be mapped to the same reference, but very few of the assembled ESTs clustered. So I used the assembled contigs of the susceptible and of the resistant species as references for the differential expression analysis, and I found an interesting result. With the assembled contigs of the susceptible species as reference, there were 211 up expressed transcripts, but with those of the resistant species there were 3021 up expressed transcripts. I don't know what I can do to get correct up expressed transcripts.

                  • #10
                    Originally posted by pardonliang
                    Thanks, arvid, but I have another question. I have sequenced two transcriptomes of the same species. I used Trinity to assemble the transcriptomes of the two species separately. I want to find the differentially expressed transcripts between the susceptible and resistant species. I used clustering software to merge the two assemblies so that the reads could be mapped to the same reference, but very few of the assembled ESTs clustered. So I used the assembled contigs of the susceptible and of the resistant species as references for the differential expression analysis, and I found an interesting result. With the assembled contigs of the susceptible species as reference, there were 211 up expressed transcripts, but with those of the resistant species there were 3021 up expressed transcripts. I don't know what I can do to get correct up expressed transcripts.
                    Now it is much clearer what you are trying to achieve - previously you didn't mention that you have samples from genetically diverging material (if you are talking about the same experiment). IMHO "correct up expressed transcripts" in that context will be very difficult to define, unless you have useful prior information from both strains/ecotypes/species (your information here is not clear: first you say two of the same species, then you say resistant and susceptible species).
                    If your strains/ecotypes/species are very closely related (only small indels and SNPs) you might be fine using one of the transcriptomes as reference like you did, provided that your alignment allows for such sequence variation (still, I would carefully study the alignments from both samples on transcripts with DE calls). If they are not closely related this changes everything, so please let us know.
                    The number of differentially expressed transcripts is not relevant to the interest of the experiment and cannot be judged unless you give an exact description of how you came up with that number: the software and statistics used, the amount and type of replicates, the biological system you are working on, and the way the samples were collected. On a genome-wide level, it is easy to find ~3000 differentially expressed transcripts - the more important question is which of them are actually differentially expressed for biological reasons and are interesting to study further. That might be all, 1000, 100, 10, 1 or none of them.

                    • #11
                      I'm sorry for my unclear description

                      I'm sorry for my unclear description. For example, I have two Drosophila melanogaster species. The susceptible species has been bred in the lab without insecticide for several years. The resistant species survived selection with a high concentration of insecticide. Both species were sequenced with Solexa. I want to find the up expressed transcripts. I used SOAPaligner to map the reads to the reference and calculated Unigene expression with the RPKM method. The formula is shown below.
                      RPKM = 1000000 * C / (N * L / 1000)
                      Here RPKM is the expression of Unigene A, C is the number of reads that align uniquely to Unigene A, N is the total number of reads that align uniquely to all Unigenes, and L is the number of bases in the CDS of Unigene A. The RPKM method eliminates the influence of gene length and sequencing depth on the calculated gene expression, so the calculated expression can be used directly to compare gene expression between samples.
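The formula above can be written as a small function. This is a direct transcription of the stated definition, using the same symbols C, N and L. The function name and the example numbers are hypothetical.

```python
def rpkm(c, n, l):
    """RPKM = 1000000 * C / (N * L / 1000).

    c: reads uniquely aligned to the Unigene
    n: total reads uniquely aligned to all Unigenes
    l: number of bases in the CDS of the Unigene
    """
    if n <= 0 or l <= 0:
        raise ValueError("N and L must be positive")
    return 1_000_000 * c / (n * l / 1000)


# Hypothetical example: 500 reads on a 2,000-base Unigene, 20 million mapped reads.
print(rpkm(500, 20_000_000, 2000))
```

Because N differs depending on which reference is used for mapping, RPKM values computed against different references are not directly comparable transcript by transcript.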

                      I used the susceptible assembly, the resistant assembly, and the combined susceptible-and-resistant assembly as references, and compared the numbers of paired mapping reads, single mapping reads, and up expressed transcripts. The result of the comparison is shown below.

                      Based on the result, I want to use the combined susceptible-and-resistant assembly as the reference to obtain the up and down expressed transcripts, because it gave the most mapped reads and the most differentially expressed transcripts. I don't know whether this choice is a problem. Thanks, arvid.
                      Last edited by pardonliang; 03-19-2012, 12:34 AM.

                      • #12
                        I'm no fly researcher, so I can't judge your choices based on the information you gave above. How divergent are your Drosophila melanogaster strains from the publicly available sequence references at FlyBase? I guess you would get further in resolving common transcripts with a genome reference-based mapping and assembly approach (e.g. TopHat-Cufflinks). In your case I would definitely try that in addition to the de novo assembly approach you used, or at least map the assembled transcripts to the reference genome.
                        I just see an image with a :-( smiley and something written in Chinese (I guess this is an error from an image host?), and since you didn't say how you compared your numbers or how you replicated your samples (or if you used any statistics), there is no way to tell whether your comparison method is sound.
                        In any case, you need to examine your candidate differentially expressed transcripts for sequence variants between the samples!

                        • #13
                          Thank you for your advice.

                          Thank you for your advice, arvid. I have uploaded the picture again and hope you can see it. Because the data have not yet been submitted to a journal, I only used Drosophila melanogaster as an example. The species I am actually studying is an agricultural insect that has no reference genome. Thank you so much for your valuable advice.

                          • #14
                            Originally posted by pardonliang
                            Thank you for your advice, arvid. I have uploaded the picture again and hope you can see it. Because the data have not yet been submitted to a journal, I only used Drosophila melanogaster as an example. The species I am actually studying is an agricultural insect that has no reference genome. Thank you so much for your valuable advice.
                            There is still no image. And please, please stop providing incorrect information about your experiment - just say up front that you can't talk about the details. I guess the best thing you can do at the moment is to assemble the transcriptomes together and analyze the differential expression of the transcripts that have good read support from both samples, but bear in mind that your analysis can't be really quantitative.
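As a rough sketch of the suggestion above, the following keeps only transcripts of a joint assembly that have read support in both samples. The `min_reads` threshold of 10 is an arbitrary illustrative value, not a recommendation from this thread, and the function name and example counts are hypothetical.

```python
def shared_support(counts_a, counts_b, min_reads=10):
    """Return transcript IDs with at least min_reads mapped reads in both samples.

    counts_a, counts_b: dicts mapping transcript ID -> mapped read count
    for each sample, both mapped against the same joint assembly.
    """
    shared = counts_a.keys() & counts_b.keys()
    return sorted(
        t for t in shared
        if counts_a[t] >= min_reads and counts_b[t] >= min_reads
    )


# Hypothetical read counts for the two samples.
susceptible = {"t1": 120, "t2": 3, "t3": 45, "t4": 0}
resistant = {"t1": 300, "t2": 80, "t3": 12, "t5": 60}
print(shared_support(susceptible, resistant))
```

Transcripts dropped by this filter (present or supported in only one sample) are the ones most likely to reflect assembly or sequence differences rather than expression differences, so they deserve a separate look rather than a fold-change estimate.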
                            Last edited by arvid; 03-19-2012, 12:51 AM.

                            • #15
                              I'm so sorry for my behavior.
                              I only wanted to understand what happened to my data. I'm sorry that the link to my image was unreachable, and I apologize to arvid. I didn't want to waste your time or your patience.
                              I used the susceptible assembly, the resistant assembly, and the combined susceptible-and-resistant assembly as references to compare the numbers of paired mapping reads, single mapping reads, and up expressed transcripts. The results are in the attached files.
                              Regarding your questions (the software and statistics used, the amount and type of replicates, the biological system you are working on, and the way the samples were collected): I haven't used statistics or replicates yet. The susceptible species has been bred in the lab without insecticide for ten years. The resistant species was collected in the field and selected with a high concentration of one insecticide. Finally, the susceptible insects and the insects that survived the insecticide selection, both collected at the same growth stage, were sequenced.
                              Thanks so much for arvid's advice.

                              Originally posted by arvid
                              There is still no image. And please, please stop providing incorrect information about your experiment - just say up front that you can't talk about the details. I guess the best thing you can do at the moment is to assemble the transcriptomes together and analyze the differential expression of the transcripts that have good read support from both samples, but bear in mind that your analysis can't be really quantitative.
                              Attached Files
