  • Minimum contig length for analysis

    Hi,

    I have a Contigs.fasta file generated by Meta-Ray. It contains many short contigs that I want to remove, but I do not know what minimum length to use as the filtering cutoff. I need to use the filtered Contigs.fasta in another analysis, for example a BLAST search of the multi-FASTA file to look for SNPs.

    Thanks.
    Last edited by laboder; 07-09-2013, 10:33 AM. Reason: Correction

  • #2
    length filtering

    There are a bunch of scripts that can do the job if you just want to filter by length, e.g. a very short Perl script.
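
A length filter of the kind mentioned above can be sketched in a few lines of Python. This is only an illustration, not any particular published script: the function names (`read_fasta`, `filter_by_length`, `write_fasta`) are made up for this example, and the cutoff value is left to the caller.

```python
from typing import Iterable, Iterator, TextIO, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(handle: TextIO) -> Iterator[Record]:
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Record], min_length: int) -> Iterator[Record]:
    """Keep only records whose sequence is at least min_length long."""
    for header, seq in records:
        if len(seq) >= min_length:
            yield header, seq


def write_fasta(records: Iterable[Record], handle: TextIO, width: int = 60) -> None:
    """Write records as FASTA, wrapping sequence lines at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

To apply it, open the input and output files and chain the three functions, e.g. `write_fasta(filter_by_length(read_fasta(fin), 200), fout)`, where 200 is just a placeholder cutoff.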


    • #3
      Thanks,
      I have a script to do the filtering, but I don't know what minimum sequence length to use for the analysis.


      • #4
        If you have EMBOSS (http://emboss.sourceforge.net/), use infoseq to list the lengths of all your sequences. You could then sort the output with UNIX sort (or copy and paste it into Excel and sort it there).
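
If EMBOSS is not at hand, the same length listing can be approximated in plain Python. The sketch below is an assumption-laden stand-in, not a reimplementation of infoseq: `sequence_lengths` and `survivors_per_cutoff` are hypothetical helper names, and the sequence name is taken to be the first whitespace-delimited word of each header.

```python
from typing import Dict, Iterable, List, TextIO, Tuple


def sequence_lengths(handle: TextIO) -> List[Tuple[str, int]]:
    """Return (name, length) for every record in a FASTA text handle."""
    records = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            # Name = first word of the header (empty string if header is bare).
            records.append([parts[0] if parts else "", 0])
        elif records:
            records[-1][1] += len(line)
    return [(name, length) for name, length in records]


def survivors_per_cutoff(
    lengths: Iterable[Tuple[str, int]], cutoffs: Iterable[int]
) -> Dict[int, int]:
    """Count how many sequences would remain at each candidate cutoff."""
    lengths = list(lengths)
    return {c: sum(1 for _, n in lengths if n >= c) for c in cutoffs}
```

Sorting is then a one-liner, `sorted(lengths, key=lambda item: item[1])`, and `survivors_per_cutoff` gives a quick view of how many contigs each candidate cutoff would keep.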


        • #5
          OK, my 2 cents!
          The metrics I use include: 1) the raw read length: with 100 bp paired-end reads, 200 bp is the length of the two reads concatenated and so a natural minimum, though not a required one; 2) the "cutoff length" in your contig length distribution, which is tricky; in my case a big jump at ~1000 bp was observed across all the contigs, so that value was selected; 3) as long as you are consistent, it does not seem to matter whether you filter out the short ones or not; 4) N50 and N80; 5) total length after filtering; 6) total length of contigs >5 kb, >1 kb, etc.
          It depends on the purpose of your assembly.
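
Several of the metrics listed above (N50, N80, total length after filtering, total length above a size threshold) can be computed from a plain list of contig lengths. The following is a rough sketch using the common definition of Nx (the length of the contig at which the running sum of lengths, sorted longest first, reaches x% of the total); the names `nx` and `summarize` and the dictionary keys are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Sequence


def nx(lengths: Sequence[int], fraction: float) -> int:
    """Nx statistic: fraction=0.5 gives N50, fraction=0.8 gives N80."""
    if not lengths:
        return 0
    target = sum(lengths) * fraction
    running = 0
    for n in sorted(lengths, reverse=True):
        running += n
        if running >= target:
            return n
    return 0  # only reached if fraction > 1


def summarize(
    lengths: Iterable[int],
    min_length: int = 0,
    thresholds: Sequence[int] = (1000, 5000),
) -> Dict[str, int]:
    """Summary statistics for the contigs that pass the min_length cutoff."""
    kept: List[int] = [n for n in lengths if n >= min_length]
    stats = {
        "count": len(kept),
        "total_length": sum(kept),
        "N50": nx(kept, 0.5),
        "N80": nx(kept, 0.8),
    }
    for t in thresholds:
        # Total length contributed by contigs of at least t bases.
        stats[f"total_ge_{t}"] = sum(n for n in kept if n >= t)
    return stats
```

Running `summarize` at a few candidate cutoffs and comparing the results is one way to see how sensitive the assembly statistics are to the choice.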
          Last edited by yifangt; 07-09-2013, 03:10 PM.


          • #6
            There is some debate as to what the minimal biologically important protein/contig length is.
            I routinely use 200 nt (~60 to 70 aa) as the cutoff for a transcriptome assembly; some use 300 nt, but that is for transcriptomes. I do see, though, that genome assemblers routinely use 200 nt as their cutoff.
            It might be worth looking at the default contig cutoff values of several assemblers.


            • #7
              Originally posted by laboder
              Hi,

              I have a Contigs.fasta file generated by Meta-Ray. It contains many short contigs that I want to remove, but I do not know what minimum length to use as the filtering cutoff. I need to use the filtered Contigs.fasta in another analysis, for example a BLAST search of the multi-FASTA file to look for SNPs.

              Thanks.
              How can anyone but you and the people familiar with your research know the answer to this question? We don't know what was sequenced and how and what it is exactly that you want to do with your data. How do you suppose then we should be able to answer your question?
              savetherhino.org


              • #8
                I was looking for a starting point, thanks for the suggestions.
                I will try sorting my file and using a cutoff length of 200 bp.

                Thanks.


                • #9
                  Another way to choose a minimum contig length would be to get an idea of the average (or even minimum) exon length in your target species. If you are trying to capture all genes in an assembly, you might want to keep short contigs if they could potentially contain exons.
