  • How to minimize the number of transcripts in a de novo transcriptome assembly

    Hi all,

    I am using Trinity for de novo assembly of mRNA sequenced on an Illumina HiSeq 2000 (100 bp paired-end). We have more than 11M clean reads per sample and 139M clean reads in total (about 28G of clean data). Read quality is good: according to FastQC, all bases have quality scores above 30.

    I ran the de novo assembly in Trinity with default parameters and got 920k transcripts (201 bp to 24477 bp; counting only transcripts from 500 bp to 24477 bp gives 350k). I then filtered the assembly with CD-HIT-EST, which left 789k clusters. That is still too many transcripts.
    The clusters range from 201 bp to 24477 bp, with an N50 of 2549 bp.


    I then divided the data into two groups of 57.5M and 81M reads. Assembling the 57.5M reads alone gave 518387 transcripts in 358886 components; the longest is 17339 bp and the N50 is 2900 bp. After filtering with CD-HIT-EST, there are 443765 transcripts in 358695 components and the N50 is 1618 bp, which is shorter than before.

    Does anybody have suggestions on how to reduce the number of transcripts?

    Is it OK that the N50 became shorter after filtering (2900 bp to 1618 bp)?
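On the N50 question, a minimal Python sketch may help show why the value can drop after clustering. The function name `n50` and the toy lengths are made up for illustration; they are not taken from this assembly. The idea is that several near-identical long isoforms each count toward the total, so collapsing them to one representative shifts the half-way point toward shorter contigs.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together hold at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contig lengths given")


# Toy data (hypothetical): four near-identical 3000 bp isoforms
# plus eight unrelated 500 bp fragments.
before = [3000] * 4 + [500] * 8
# After clustering, assume the four isoforms collapse to one sequence.
after = [3000] + [500] * 8

print("N50 before clustering:", n50(before))
print("N50 after clustering: ", n50(after))
```

In this toy case nothing was lost except redundant copies, yet the N50 falls. A lower N50 after CD-HIT-EST is therefore not, on its own, evidence of a worse assembly.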

  • #2
    We have filtered transcripts by the length of the longest ORF. PASA also tries to reduce the number of transcripts.
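As a rough illustration of filtering by longest ORF, here is a small pure-Python sketch. It is not TiborNagy's actual procedure or any tool's implementation: the function names, the choice to count only complete ATG-to-stop ORFs, and the 300 nt default threshold are all assumptions made for the example.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Reverse-complement a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_length(seq):
    """Length in nucleotides (stop codon included) of the longest
    complete ATG...stop ORF across all six reading frames."""
    best = 0
    for strand in (seq.upper(), reverse_complement(seq).upper()):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    best = max(best, i + 3 - start)
                    start = None
    return best


def filter_by_orf(transcripts, min_orf_nt=300):
    """Keep transcripts whose longest ORF is at least min_orf_nt long.
    transcripts: dict mapping transcript name -> sequence.
    min_orf_nt=300 is an arbitrary example threshold."""
    return {name: seq for name, seq in transcripts.items()
            if longest_orf_length(seq) >= min_orf_nt}


toy = {"with_orf": "CCATGAAATAGCC", "no_orf": "CCCCCCCCC"}
print(filter_by_orf(toy, min_orf_nt=9))
```

For real data, a dedicated ORF predictor is probably a better choice, since it can also handle partial ORFs at transcript ends, which this sketch ignores.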



    • #3
      Here are some steps I use:
      1. Perform digital normalization
      2. Set --min_kmer_cov to >= 2
      3. Use --triplet_lock or --extended_lock for Butterfly reconstruction

      Then use CD-HIT-EST or TGICL to cluster transcripts. Or, as TiborNagy mentioned, do a CDS/ORF prediction and pick the longest.

      There could be some other methods, but I only have experience with the ones above.



      • #4
        Originally posted by TiborNagy
        We have filtered transcripts by the length of the longest ORF. PASA also tries to reduce the number of transcripts.
        Thank you, TiborNagy! I am a newbie in NGS data analysis, so could you explain in detail how to filter transcripts by the length of the longest ORF? I will read up on PASA and follow its instructions.

        Thank you!



        • #5
          Originally posted by yueluo
          Here are some steps I use:
          1. Perform digital normalization
          2. Set --min_kmer_cov to >= 2
          3. Use --triplet_lock or --extended_lock for Butterfly reconstruction

          Then use CD-HIT-EST or TGICL to cluster transcripts. Or, as TiborNagy mentioned, do a CDS/ORF prediction and pick the longest.

          There could be some other methods, but I only have experience with the ones above.
          Thank you very much, yueluo, for sharing your experience!

          As for --triplet_lock and --extended_lock, which is better? Or do you recommend using both parameters at the same time?

          I will set --min_kmer_cov to >= 2 and run Trinity again.

          As for "Perform digital normalization", could you show me how in detail? Thanks!

          Best,



          • #6
            --triplet_lock and --extended_lock are two stringency levels in Butterfly reconstruction. --extended_lock is the stricter of the two, so it produces fewer transcripts. You can use either one.

            Digital normalization is included in the Trinity package; this is from its website:
            http://trinityrnaseq.sourceforge.net...alization.html



            • #7
              Originally posted by yueluo
              --triplet_lock and --extended_lock are two stringency levels in Butterfly reconstruction. --extended_lock is the stricter of the two, so it produces fewer transcripts. You can use either one.

              Digital normalization is included in the Trinity package; this is from its website:
              http://trinityrnaseq.sourceforge.net...alization.html
              Many thanks indeed, yueluo! I will try it and hope to get fewer transcripts this time.

              Best,



              • #8
                I assembled the transcriptome again, this time following yueluo's suggestions.

                First, I ran Trinity's digital normalization with this command:
                util/normalize_by_kmer_coverage.pl --seqType fq --JM 80G --max_cov 30 --min_kmer_cov 2 --left_list R1.list --right_list R2.list --pairs_together --PARALLEL_STATS --JELLY_CPU 10
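For readers wondering what a coverage cap such as --max_cov 30 achieves, here is a toy Python sketch of the general digital-normalization idea: a read is kept only while the median abundance of its k-mers is still below the cap. This is a simplified streaming variant written for illustration, and it should not be read as what normalize_by_kmer_coverage.pl does internally. The names `kmers` and `normalize` are hypothetical; the defaults k=25 and max_cov=30 simply mirror the K25 and C30 in the output file names in this post.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """All overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def normalize(reads, k=25, max_cov=30):
    """Keep a read only if the median count of its k-mers, among the
    reads kept so far, is below max_cov (toy illustration only)."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        if median(counts[km] for km in read_kmers) < max_cov:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Toy input: one sequence seen 10 times and one seen once.
reads = ["ACGTTGCAAG"] * 10 + ["TTTTGGGGCC"]
kept = normalize(reads, k=4, max_cov=3)
print(len(reads), "reads in,", len(kept), "reads kept")
```

Highly covered transcripts lose most of their redundant reads while rare ones are kept in full, which is why normalization reduces memory and run time without discarding low-abundance transcripts.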

                Then I ran the assembly with this command:
                Trinity.pl --seqType fq --JM 80G --min_kmer_cov 2 --extended_lock --left R1.list.normalized_K25_C30_pctSD200.fq --right R2.list.normalized_K25_C30_pctSD200.fq --CPU 10
                This gave 467208 transcripts in 324216 components, with an N50 of 1578.

                After this, I clustered the transcripts with CD-HIT-EST. The results are below:
                Total trinity transcripts: 413673
                Total trinity components: 324161
                Percent GC: 48.94

                ########################################
                Stats based on ALL transcript contigs:
                ########################################

                Contig N10: 3891
                Contig N20: 2679
                Contig N30: 1930
                Contig N40: 1371
                Contig N50: 970

                Median contig length: 349
                Average contig: 638.18
                Total assembled bases: 263995964
                There are still too many transcripts. Does anybody have suggestions for reducing the number further? Thank you!



                • #9
                  What is it you’re trying to do with them that makes this too many? Trinity has good tools for differential expression that will work at the gene level with these numbers of transcripts.

                  To try to answer your question though, about the only other thought I had was to expand the gene-level classification through Trinity’s Trinotate pipeline (http://trinotate.sourceforge.net). Maybe you can further collapse things that hit the same orthologue. If you only care about protein-coding genes, you could also just get rid of anything that doesn’t hit an orthologue or doesn’t hit a vertebrate protein (assuming you’re working on vertebrates).

                  But de novo transcriptome assembly is just kinda messy, and this is what everyone deals with when doing it. I don’t think there are many options beyond what’s listed here that will work for you, unless you have a reference genome to align the transcripts back to.
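To make the gene-level idea concrete, here is a minimal Python sketch that groups transcripts by Trinity component and keeps the longest one per component. It assumes transcript IDs of the form `comp12_c0_seq3`, where everything before `_seq<N>` names the component; that layout is an assumption about this Trinity version, and the function names are hypothetical. With the numbers reported earlier in this thread, such a step would take 413673 transcripts down to 324161 component representatives.

```python
import re

# Assumed ID layout: "<component>_seq<N>", e.g. "comp12_c0_seq3".
SEQ_SUFFIX = re.compile(r"_seq\d+$")


def component_of(transcript_id):
    """Strip the trailing _seq<N> to get the component name."""
    return SEQ_SUFFIX.sub("", transcript_id)


def longest_per_component(lengths):
    """lengths: dict mapping transcript ID -> length in bp.
    Returns dict mapping component -> (transcript ID, length)
    for the longest transcript in each component."""
    best = {}
    for transcript_id, length in lengths.items():
        component = component_of(transcript_id)
        if component not in best or length > best[component][1]:
            best[component] = (transcript_id, length)
    return best


# Toy lengths, invented for the example.
toy = {
    "comp1_c0_seq1": 500,
    "comp1_c0_seq2": 1200,
    "comp2_c0_seq1": 300,
}
for component, (tid, length) in longest_per_component(toy).items():
    print(component, tid, length)
```

The longest isoform is not always the most biologically relevant one, so this is best treated as a convenience for annotation or counting rather than a statement about which isoform is real.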



                  • #10
                    Originally posted by Wallysb01
                    What is it you’re trying to do with them that makes this too many? Trinity has good tools for differential expression that will work at the gene level with these numbers of transcripts.

                    To try to answer your question though, about the only other thought I had was to expand the gene-level classification through Trinity’s Trinotate pipeline (http://trinotate.sourceforge.net). Maybe you can further collapse things that hit the same orthologue. If you only care about protein-coding genes, you could also just get rid of anything that doesn’t hit an orthologue or doesn’t hit a vertebrate protein (assuming you’re working on vertebrates).

                    But de novo transcriptome assembly is just kinda messy, and this is what everyone deals with when doing it. I don’t think there are many options beyond what’s listed here that will work for you, unless you have a reference genome to align the transcripts back to.
                    Hi Wallysb01, my original plan was to download the nr, Swiss-Prot, KEGG, and COG databases and BLAST my transcripts against them. The Trinotate pipeline seems to recommend Swiss-Prot and Pfam. Do you think this is enough? I am confused about the next steps. Thanks!
