  • Length distribution range in a FASTA file

    Hi folks,

    I have to calculate the length distribution range in a FASTA file: for example, how many sequences are shorter than 100 bp, how many are 101 to 300 bp long, how many are 301 to 500 bp long, and so on. Is there a script or tool for doing this job?

    Thanks in advance!

  • #2
    If you index it with "samtools faidx" the resulting ".fai" file will be a text file containing the length of each of the sequences (among other information). You could then plot the distribution in R with whatever binning strategy you want.
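
For readers who would rather count the bins directly than plot them in R, here is a minimal Python sketch of the same idea. It assumes the index is whitespace-delimited with the sequence name in the first column and the length in the second, as described above. The function names and the bin edges (100, 300, 500, taken from the original question) are illustrative, and a length of exactly 100 is counted in the first bin, which is an assumption about what the question intended.

```python
import bisect
from collections import Counter


def read_fai_lengths(path):
    """Return the sequence lengths stored in the second column of a .fai file."""
    lengths = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:  # skip blank or malformed lines
                lengths.append(int(fields[1]))
    return lengths


def bin_lengths(lengths, upper_edges):
    """Count lengths per bin.

    upper_edges must be sorted; bin i holds lengths up to and including
    upper_edges[i], and a final overflow bin holds everything longer.
    """
    labels = []
    lower = 1
    for edge in upper_edges:
        labels.append(f"{lower}-{edge}")
        lower = edge + 1
    labels.append(f">{upper_edges[-1]}")

    counts = Counter()
    for n in lengths:
        # bisect_left puts a length equal to an edge into the bin ending there
        counts[labels[bisect.bisect_left(upper_edges, n)]] += 1
    return [(label, counts[label]) for label in labels]
```

Calling `bin_lengths(read_fai_lengths("genes.fa.fai"), [100, 300, 500])` (the file name is hypothetical) would return one `(label, count)` pair per range, ready to print or plot.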

    • #3
      Dear dpryan,

      I have retrieved the ".fai" file and now have the lengths of the genes I wanted (a two-column file: the first column holds the gene names and the second the gene lengths, as in the following):

      gene_120397 43056
      gene_240653 224380
      gene_150423 68254
      gene_143456 10090
      gene_141140 15291
      gene_253613 3088

      Could you please suggest some R code for plotting these distributions? I am not very familiar with plotting in R.

      Thank you!!!

      • #4
        mydata <- read.table("inputfile.txt")
        plot(mydata)

        • #5
          FastQC Length Distribution

          Hello.

          I believe FastQC reports this information, but not for a FASTA file.

          If anyone has used FastQC to find the length distribution, I am wondering under what conditions it considers the distribution a pass or a fail.

          Any advice?

          • #6
            Originally posted by arcolombo698:
            Hello.

            I believe FastQC reports this information, but not for a FASTA file.

            If anyone has used FastQC to find the length distribution, I am wondering under what conditions it considers the distribution a pass or a fail.

            Any advice?

            From the documentation:

            Warning

            This module will raise a warning if all sequences are not the same length.

            Failure

            This module will raise an error if any of the sequences have zero length.
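
The two quoted rules can be written out as a small Python sketch. This is not FastQC's own code; it only restates the documentation text above, and the function name and the "pass"/"warn"/"fail" return values are made up for illustration. An empty input is reported as "pass" here, which is an assumption the quoted text does not cover.

```python
def length_distribution_status(lengths):
    """Classify a list of sequence lengths using the two quoted rules."""
    # Failure: any sequence has zero length.
    if any(n == 0 for n in lengths):
        return "fail"
    # Warning: the sequences are not all the same length.
    if len(set(lengths)) > 1:
        return "warn"
    return "pass"
```

By these rules, almost any set of genes or assembled contigs would get a warning, since their lengths naturally differ.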

            • #7
              Hi @antoza,
              Did you find a way to solve your problem? If you did, could you share your experience?
              I am facing the same task right now.
              Thanks.

              • #8
                FastQC does plot this information, so you can see the length distribution visually. This is a quick and easy approach.

                If you use samtools instead, you can plot the lengths in R.

                • #9
                  The BBMap package has a couple programs for this purpose:

                  stats.sh in=file.fasta shist=shist.txt
                  (only works on fasta input)

                  readlength.sh in=file.fasta out=hist.txt

                  (works on fasta, fastq, or sam)

                  The way they display output is a little different, but both are easy to plot.
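
If neither BBMap nor samtools is at hand, the per-sequence lengths can also be computed directly from the FASTA text. The following Python sketch is not how either tool works internally; it is a plain reader that assumes each record starts with a ">" header line and that the sequence name is the first whitespace-delimited word of that header. The function name is illustrative.

```python
def fasta_lengths(lines):
    """Yield (name, length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, length
            header = line[1:].split()
            name, length = (header[0] if header else ""), 0
        elif name is not None:
            # Sequence may be wrapped over several lines; add them up.
            length += len(line)
    if name is not None:
        yield name, length
```

Because the function accepts any iterable of lines, an open file handle can be passed to it directly, so large files are read one line at a time.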

                  • #10
                    Yeah, that will work. Thanks @Brian Bushnell

                    • #11
                      Hi @arcolombo698,
                      Could you be a little more specific?
                      How can I do what I want with FastQC when my input file is FASTA?

                      • #12
                        Hey @arcolombo698,
                        I used samtools and got the .fai file. Before plotting the length distribution in R, I renamed my gene.fa.fai file to gene.txt. Then I used this command, as mentioned by @rnaeye:

                        mydata <- read.table("gene.txt")
                        plot(mydata)

                        and got this error:

                        Code:
                        > mydata <- read.table("gene.txt")
                        > plot(mydata)
                        Error: cannot allocate vector of size 156.2 Gb
                        I am new to R, so could you explain where it went wrong?
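
Whatever the exact cause of the R error, a length histogram only needs the length column, not the gene names or any other column in the file. As a rough cross-check outside R, here is a Python sketch that bins a list of lengths into fixed-width bins and draws a text histogram. The function name, the bin width, and the bar scaling are illustrative choices, not anything prescribed in this thread.

```python
from collections import Counter


def text_histogram(lengths, bin_width, max_bar=40):
    """Return one tab-separated row per bin: range, count, bar of '#'."""
    counts = Counter(n // bin_width for n in lengths)
    if not counts:
        return []
    peak = max(counts.values())
    rows = []
    # Walk every bin between the smallest and largest, including empty ones.
    for b in range(min(counts), max(counts) + 1):
        count = counts.get(b, 0)
        bar = "#" * round(max_bar * count / peak)
        low, high = b * bin_width, (b + 1) * bin_width - 1
        rows.append(f"{low}-{high}\t{count}\t{bar}")
    return rows


# The six gene lengths posted in #3, in 50 kb bins.
example = [43056, 224380, 68254, 10090, 15291, 3088]
for row in text_histogram(example, 50000):
    print(row)
```

The lengths themselves can come from the second column of the .fai file; only that column is passed in, so the memory used grows with the number of sequences rather than with anything derived from the gene names.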
