  • De novo RNA-Seq Assembly using Trinity gives too many unigenes.

    Hi,
    I have done a de novo RNA-Seq assembly of my fish data using Trinity, but I get 333,843 unigenes with an N50 length of 396. I think that is too many unigenes. Which parameters should I set to reduce the number of unigenes and increase the N50 length?
    My Trinity version is: Trinityrnaseq_r20131110
    All parameters are at their defaults.
    Thanks a lot for any suggestions!
    happy
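For readers unfamiliar with the N50 figure quoted above, here is a minimal Python sketch of how it is commonly computed from a list of contig lengths. The function name `n50` and the toy lengths are illustrative assumptions, not part of Trinity. The sketch also shows why N50 alone can mislead: simply discarding short contigs raises it without changing the remaining contigs at all.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together contain at least half of all assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy contig lengths (hypothetical, for illustration only).
contig_lengths = [1200, 800, 400, 300, 250, 210, 205]
print("N50, all contigs:", n50(contig_lengths))

# Dropping contigs shorter than 500 bp raises N50 although the
# surviving contigs are unchanged.
long_only = [n for n in contig_lengths if n >= 500]
print("N50, contigs >= 500 bp:", n50(long_only))
```

With many short contigs in the set, a low N50 such as 396 is therefore expected, and it is worth looking at it together with the contig count.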

  • #2
    Hi, assuming you first filtered your data by quality using a Phred score of Q>20, you certainly could set the minimum length for acceptable contigs to 500 bp, for example, using --min_contig_length 500. If you did not filter your reads by quality, I encourage you to use Trimmomatic, for example, to eliminate sequencing errors, which can affect the resolution of the de Bruijn graphs and create many unresolved bubbles.


    • #3
      Thanks a lot @arthurmelo.
      We have already filtered out low-quality reads by removing those in which fewer than 80% of bases have a Phred score of Q>20. As you suggest, we had better filter the reads again with Trimmomatic. I'll give it a try. Thanks a lot!
      Because setting the minimum contig length to 500 bp would lose too many sequences, I'll try it only if I have no other choice.
      BTW, is there any other way to reduce the number of unigenes, or any other way to filter the data?
      Last edited by huan; 12-22-2015, 05:31 PM.
      happy
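The read filter described in this post (keep a read only if at least 80% of its bases have Q>20) can be sketched in Python as follows. This is a minimal illustration, assuming Phred+33 encoded quality strings; the function names are hypothetical, and it does not perform the adapter removal or sliding-window trimming that a tool such as Trimmomatic provides.

```python
def fraction_above_q(quality_string, min_q=20, offset=33):
    """Fraction of bases whose Phred score is strictly greater than min_q.

    Assumes Phred+33 encoding (offset=33), which is an assumption about
    the input data, not something stated in the thread.
    """
    if not quality_string:
        return 0.0
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(q > min_q for q in scores) / len(scores)


def keep_read(quality_string, min_q=20, min_fraction=0.8, offset=33):
    """Keep a read if at least min_fraction of its bases exceed min_q."""
    return fraction_above_q(quality_string, min_q, offset) >= min_fraction


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
for qual in ["IIIIIIIIII", "IIIIIIII##", "IIIII#####"]:
    print(qual, fraction_above_q(qual), keep_read(qual))
```

A whole-read filter like this keeps reads whose low-quality bases are concentrated at the 3' end, which is one reason an additional trimming pass can still help.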


      • #4
        Originally posted by huan
        Thanks a lot @arthurmelo.
        We have already filtered out low-quality reads by removing those in which fewer than 80% of bases have a Phred score of Q>20. As you suggest, we had better filter the reads again with Trimmomatic. I'll give it a try. Thanks a lot!
        Because setting the minimum contig length to 500 bp would lose too many sequences, I'll try it only if I have no other choice.
        BTW, is there any other way to reduce the number of unigenes, or any other way to filter the data?
        Having ~300K contigs as the initial, raw output of Trinity isn't terribly surprising. Trinity includes some scripts for filtering this initial contig file; the most useful filtering rule is to remove contigs that are supported by only a very small number of reads. To do this, you first use the script 'align_and_estimate_abundance.pl' (found in the Trinity /util folder) to align your reads back to the assembled contigs and to calculate, with RSEM, the relative abundance of reads for each contig. Finally, you run the script 'filter_fasta_by_rsem_values.pl' (also in the /util folder). Filtering out contigs with very low read support can have a dramatic impact on the total number.

        Also, in my opinion, 500 bp is much too high a contig length threshold. I routinely use 200 bp.


        • #5
          Thanks a lot @kmcarr.
          That's quite a good idea. But I am not sure whether "removing contigs" means removing the Inchworm result contigs with '--no_run_butterfly --no_run_quantifygraph' or not.
          Thanks a lot again.
          happy


          • #6
            Originally posted by huan
            Thanks a lot @kmcarr.
            That's quite a good idea. But I am not sure whether "removing contigs" means removing the Inchworm result contigs with '--no_run_butterfly --no_run_quantifygraph' or not.
            Thanks a lot again.
            The contig filtering I am describing takes place after the Trinity assembly is complete, not during the assembly phase.

            The basic process is to use the align_and_estimate_abundance.pl script found in the Trinity util/ folder to align your input reads to the 333K contigs in your final Trinity.fasta file. It also creates abundance estimates (FPKM) using RSEM.

            Following this step, you use the filter_fasta_by_rsem_values.pl script (also in util/) to filter out contigs with low FPKM (the threshold value is adjustable). The contigs you are removing are from the final set in the Trinity.fasta file. The output will be a new, smaller file (e.g. Trinity_filtered.fasta).
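To make the logic of this post-assembly step concrete, here is a rough Python sketch of abundance-based FASTA filtering. It is not the Trinity script and does not reproduce its options; it assumes a tab-separated abundance table with header columns named `transcript_id` and `FPKM` (check the column names in your own RSEM output), and the function names and toy records are hypothetical.

```python
import csv
import io


def read_fpkm(table_handle, id_column="transcript_id", value_column="FPKM"):
    """Map each contig ID to its FPKM from a tab-separated table.

    The column names are assumptions; adjust them to match your file.
    """
    reader = csv.DictReader(table_handle, delimiter="\t")
    return {row[id_column]: float(row[value_column]) for row in reader}


def filter_fasta(fasta_handle, fpkm_by_id, min_fpkm=1.0):
    """Yield the lines of FASTA records whose contig has FPKM >= min_fpkm.

    Contigs missing from the table are treated as having FPKM 0.
    """
    keep = False
    for line in fasta_handle:
        if line.startswith(">"):
            parts = line[1:].split()
            record_id = parts[0] if parts else ""
            keep = fpkm_by_id.get(record_id, 0.0) >= min_fpkm
        if keep:
            yield line


# Toy inputs (hypothetical IDs and values) held in memory for illustration.
table = io.StringIO("transcript_id\tFPKM\nc1_g1_i1\t12.5\nc2_g1_i1\t0.4\n")
fasta = io.StringIO(">c1_g1_i1 len=8\nACGTACGT\n>c2_g1_i1 len=4\nACGT\n")

filtered = "".join(filter_fasta(fasta, read_fpkm(table), min_fpkm=1.0))
print(filtered, end="")
```

With real files, the two `io.StringIO` objects would be replaced by opened file handles, and the output written to a new file such as the Trinity_filtered.fasta mentioned above.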


            • #7
              I got it. Thanks kmcarr.
              But I wonder whether I will lose a lot of lowly expressed transcripts if I filter after the Trinity assembly is complete. Would that influence my analysis? BTW, if I filter before assembly, the lowly expressed transcripts can still be retained. Will that give a better result?
              happy


              • #8
                Originally posted by huan
                I got it. Thanks kmcarr.
                But I wonder whether I will lose a lot of lowly expressed transcripts if I filter after the Trinity assembly is complete. Would that influence my analysis? BTW, if I filter before assembly, the lowly expressed transcripts can still be retained. Will that give a better result?
                When you say "filter before assembly", do you mean performing quality/adapter trimming and filtering of your raw reads? You should pretty much always do that for any data and any downstream analysis.

                Even so, it is still a good idea to perform the abundance filtering I described post assembly. What we are talking about is removing contigs with virtually no support in the underlying data, only 1-2 FPKM. Think about that for a second. These "contigs" are supported by roughly 1 out of 1,000,000 reads (FPKM is normalized per kilobase of contig length, so this figure holds for a contig about 1 kb long). Is that a genuine, meaningful transcript? Probably not. More likely it is sequencing error masquerading as a unique transcript.

                Researchers are often far too worried about removing what is essentially noise in their data, demanding that the bioinformatician KEEP EVERYTHING for fear of missing some big, important discovery. In reality, it just confounds the analysis.
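The "1-2 FPKM" threshold discussed in this post can be put into numbers with the standard FPKM definition (fragments per kilobase of transcript per million mapped fragments). The following Python sketch is illustrative only; the function names and the example library sizes are assumptions, and RSEM's own estimates use effective lengths and expected counts rather than this simplified formula.

```python
def fpkm(fragments, contig_length_bp, total_mapped_fragments):
    """Simplified FPKM: fragments per kilobase per million mapped fragments."""
    if contig_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (contig_length_bp * total_mapped_fragments)


def fragments_for_fpkm(target_fpkm, contig_length_bp, total_mapped_fragments):
    """How many fragments a contig needs to reach a given FPKM."""
    return target_fpkm * contig_length_bp * total_mapped_fragments / 1e9


# A 1 kb contig hit by 1 fragment out of 1,000,000 has an FPKM of 1.
print(fpkm(1, 1000, 1_000_000))

# Hypothetical library of 20 million mapped fragments: a 200 bp contig
# reaches 1 FPKM with only a handful of fragments.
print(fragments_for_fpkm(1.0, 200, 20_000_000))
```

Seen this way, a contig at the threshold rests on very few fragments, which is the point being made about noise.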


                • #9
                  Thanks a lot kmcarr. Your answer really works great.
                  happy
