  • Length distribution range in a FASTA file

    Hi folks,

    I have to calculate the length distribution in a FASTA file by range: for example, how many sequences are 100 bp or shorter, how many are from 101 to 300 bp, how many from 301 to 500 bp, and so on. Is there a script or tool for doing this job?

    Thanks in advance!
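
The first step for this kind of task is getting one length per sequence. A minimal sketch with awk, assuming a plain (uncompressed) FASTA file in which the sequence ID is the header text up to the first space; `fasta_lengths` is a hypothetical helper name used only for illustration.

```shell
# fasta_lengths (hypothetical helper): print "name<TAB>length" per record.
# Sequences wrapped over several lines are summed; whitespace is ignored.
fasta_lengths() {
    awk '
        /^>/ {                                  # a header starts a new record
            if (seen) print name "\t" len       # flush the previous record
            name = substr($1, 2)                # ID: header text after ">" up to first space
            len = 0
            seen = 1
            next
        }
        seen {
            gsub(/[ \t\r]/, "")                 # drop spaces, tabs, carriage returns
            len += length($0)                   # add the bases on this line
        }
        END { if (seen) print name "\t" len }   # flush the last record
    ' "$1"
}
```

Running `fasta_lengths file.fasta > lengths.txt` would give a two-column table (name, length) that can then be binned or plotted.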

  • #2
    If you index it with "samtools faidx" the resulting ".fai" file will be a text file containing the length of each of the sequences (among other information). You could then plot the distribution in R with whatever binning strategy you want.
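
Since the second column of a `.fai` index is the sequence length, the ranges from the original question can also be counted directly on the command line. The sketch below assumes ascending, inclusive upper bin edges passed as arguments; `bin_lengths` is a hypothetical helper name, not a samtools command.

```shell
# bin_lengths (hypothetical helper): count sequences per length range.
# $1             = file whose 2nd column is a length (for example a .fai index)
# remaining args = ascending, inclusive upper bin edges, e.g. 100 300 500
bin_lengths() {
    local file=$1
    shift
    awk -v edges="$*" '
        BEGIN { n = split(edges, e, " ") }
        {
            len = $2 + 0
            # find the first edge that is >= len; i ends at n+1 if none is
            for (i = 1; i <= n; i++) if (len <= e[i] + 0) break
            count[i]++
        }
        END {
            lo = 0
            for (i = 1; i <= n; i++) {
                printf "%d-%d\t%d\n", lo, e[i], count[i]
                lo = e[i] + 1
            }
            printf ">%d\t%d\n", e[n], count[n + 1]   # everything above the last edge
        }
    ' "$file"
}
```

For example, `bin_lengths file.fasta.fai 100 300 500` would print counts for 0-100, 101-300, 301-500, and >500 bp. The resulting small table is easy to plot in R or a spreadsheet.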


    • #3
      Dear dpryan,

      I have retrieved the ".fai" file and now have the lengths of the genes I wanted (a two-column file: gene names in the first column, gene lengths in the second, as follows):

      gene_120397 43056
      gene_240653 224380
      gene_150423 68254
      gene_143456 10090
      gene_141140 15291
      gene_253613 3088

      Could you please suggest some R code for plotting these distributions? I am not very familiar with plotting in R.

      Thank you!!!


      • #4
        mydata <- read.table("inputfile.txt")
        plot(mydata)


        • #5
          FastQC Length Distribution

          Hello.

          I believe that FastQC reports this information, but not for a FASTA file.

          If anyone has used FastQC to find the length distribution, I am wondering under what conditions it considers the distribution a fail or a pass.

          Any advice?


          • #6
            Originally posted by arcolombo698
            Hello.

            I believe that FastQC reports this information, but not for a FASTA file.

            If anyone has used FastQC to find the length distribution, I am wondering under what conditions it considers the distribution a fail or a pass.

            Any advice?

            From the documentation:

            Warning

            This module will raise a warning if all sequences are not the same length.

            Failure

            This module will raise an error if any of the sequences have zero length.
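
The two rules quoted above can be applied to any list of lengths. This is only a sketch of the quoted wording, not FastQC's own code; `length_module_status` is a hypothetical helper that reads one length per line on standard input.

```shell
# length_module_status (hypothetical helper, not part of FastQC):
# prints fail if any length is zero, warn if the lengths are not all
# identical, and pass otherwise.
length_module_status() {
    awk '
        NF == 0      { next }                     # skip blank lines
                     { len = $1 + 0 }
        len == 0     { zero = 1 }                 # zero-length sequence seen
        !seen        { first = len; seen = 1 }    # remember the first length
        len != first { mixed = 1 }                # lengths are not all the same
        END {
            if (zero)       print "fail"
            else if (mixed) print "warn"
            else            print "pass"
        }
    '
}
```

With a `.fai` index, something like `cut -f2 file.fasta.fai | length_module_status` would report the status. By these rules, any set of sequences with varying lengths (such as genes or assembled contigs) would get a warning.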


            • #7
              Hi @antoza,
              Did you find a way to solve your problem? If you did, could you share your experience?
              I am facing the same task right now.
              Thanks.


              • #8
                FastQC does plot this information, so you can see the length distribution visually. This is a quick and easy approach.

                If you use samtools, you can plot the lengths using R.


                • #9
                  The BBMap package has a couple programs for this purpose:

                  stats.sh in=file.fasta shist=shist.txt
                  (only works on fasta input)

                  readlength.sh in=file.fasta out=hist.txt
                  (works on fasta, fastq, or sam)

                  The way they display output is a little different, but both are easy to plot.


                  • #10
                    Yeah, that will work. Thanks @Brian Bushnell


                    • #11
                      Hi @arcolombo698,
                      Could you be a little more specific?
                      How can I do what I want with FastQC when my input file is FASTA?


                      • #12
                        Hey @arcolombo698,
                        I used samtools and got the .fai file. To plot the length distribution in R, I renamed my gene.fa.fai file to gene.txt and then used this command, as mentioned by @rnaeye:

                        mydata <- read.table("gene.txt")
                        plot(mydata)

                        and got this error:

                        Code:
                        > mydata <- read.table("gene.txt")
                        > plot(mydata)
                        Error: cannot allocate vector of size 156.2 Gb
                        I am new to R, so could you explain what went wrong?
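
The thread does not record an answer to this error, so the following is something to check rather than a confirmed diagnosis. A `.fai` index has five tab-separated columns (name, length, offset, bases per line, bytes per line), so the renamed file gives `read.table` a five-column table, and `plot(mydata)` is handed all of it rather than just the lengths. A minimal sketch that keeps only the length column; `fai_length_column` is a hypothetical helper name.

```shell
# fai_length_column (hypothetical helper): print only column 2 (the length)
# of a tab-separated .fai index. Returns a non-zero exit status if a line
# has fewer than two tab-separated fields, i.e. the input is not an index.
fai_length_column() {
    awk -F '\t' '
        NF < 2 { bad = 1; exit }     # stop early: not a .fai-style line
        { print $2 }
        END { exit bad + 0 }
    ' "$1"
}
```

After `fai_length_column gene.fa.fai > lengths.txt`, the one-column file can be read in R and passed to `hist()`, which draws a length histogram instead of plotting every column against every other.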
