Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calculating Percentage of n's for Each Transcript in FASTA File

    Hello,

    I am completely new to next generation sequencing and need some help.

    I have a multi-sequence FASTA file from Cufflinks (a small portion of the file is below), and I would like to calculate the percentage of n's in each transcript. Is there an easy way to do the calculation for all 11,804 transcripts at once?

    Thanks so much for your help!!


    >NM_001034679 loc:chr1|327877-339078|+ exons:327877-328043,331233-331406,333963-334122,337306-339078 segs:1-167,168-341,342-501,502-2274 FPKM=0.0000000000 frac=0.000000 conf_lo=0.000000 conf_hi=0.000000 cov=0.000000 full_read_support=yes
    AACAGCTTCGGGTAAAGgaactccatccacttggggcttgaccgcgggagtcgctgtagctctcactgtg
    agaaagcaagatgcactttagaaattttaactacagttttagctccctgattgcctgtgtggcaaacagt
    gagatcttcagcgaaagtgaaaccagggccaaatttgaatccctctttaggacttatgacaaagatatca
    cctttcagtattttaaaagcttcaaacgggtcagaataaacttcagcaaccccttatctgcagcagatgc
    caggctccagctacataagactgaatttcttggaaaggaaatgaaattatattttgctcagactttacac
    ataggcagttcgcacctggcccctccgaatccagacaaacagttcctcatctctcctcctgcctcaccgc
    cggtcggctggaaacaagtggaagacgctactcccgtcataaattatgaccttttatacgctatctccaa
    gctagggccaggggaaaagtatgaactgcatgcagccaccgacaccacccccagcgtggtggtccacgtc
    tgtgagagcgatcaggagaacgaggaagaagacgagatggaaagaatgaagagaccgaagcccaaaattatccagaccaggaggccagagtacacgcccatccacttaagctgaaccggcgcccggacgaagacgtgctc
    caaaccatgctcgcaagaaggcatcttttactgtggaagcagccggtcacagctttggaggcggcagccg
    tgaccgctgtggcggaaattccagttcacgttgctcagaagagaatcgaggcttcgtcccctggttctaa
    cgctgcgcctcagtcagtgttcgaggctcctggccaggccccgagccaatcactgagcttggggtgatcg
    cacaaggacatctgggagcatcgcgggaaaaccaataatgatagtcttttgtacttgttctcttctggta
    ggttctgtcttggccaggggcagattgatccgtgggccccggggagagtctttgtgtttaatcagtctac
    aaggtagacgcactctctctCCTGGTGGGAAAAGGCGCCACGnnnnnnnnnnnngCGTCTGGTGCAGAAAGGTTGTGAAAGCAACCGTGCAACGTggaaactgtagcgtttcaatttcccccttcatgttctgatgTTTGTGCATGTGTATTACTgatTTCTCAGAACTAACCTTTGTTTGTATGTAGAGTTGCGCCACTGCTGTTTTACATCTTCTGGGGAGATAAGAAGGCATctgtgaagtctgtcacctttgcagattcGTGACTGTCTTTGCAAGGGCACCCACGGCGGGGTGGAGGGGATTCTACCTGGAATACACACACCATTCCGCATCCTGTCCGATGCGACCAAnnnnnCGTGTTTTTGCAAAAGAAgtcgatctggaattcctgtgtagcgtttcgcttataaaattcagaaaatagcactttcactgccaactactagtgggtgagaaattttagtttagatgttttagatcaggcaatacgtaggtttcatttgtttctttgacgtggtggtttatataacatgaatcatagccaaaacccttttcggggggaatagtcagttgagatcattaatttttttacccccactaatacatcaagataaacttgtaaataaagccggtagtatatattcacacctgttgtgcacttgggtgagacatatatggccagggaagactagggtcagatgtgttgacctccccgtgaatcatatgttgtagaaaatgcctttcagatgtttgatgggacttgaattcaaagcacgtgaagtggatagtggatataagaagggtgcagtgcctttcccattaattcctggtggagttgtcacactaggttaacgtttgtaatttttttctagtgtccTGTGTATGTsTGGTCGATGGGTACTCCCTTTTGGCCTTACAAtattgtaacaatgtttgtccttttgaaatacctaatgccaagtaacagtgcatgctttagaaaaggggaagagggctttctttaagaagtaaaggcgtttggctgttcctgtcaagaaactgactgaatggtctccaaaccctgtttacaggacctggtggggtgtgggggacaaatgagcaagagatGCGTGCATAGTCGTTCAAGTGTTCGTAGTTCAGTGCTTTTAAACTGGGGAGGCTAACCACGAGATATTTTTTTTAACTGCATTCTCTAATAAATCGACGCAATATGCTCTTTA
    >NM_001077977 loc:chr1|377376-383871|+ exons:377376-377447,378427-378521,383243-383871 segs:1-72,73-167,168-796 FPKM=0.0000000000 frac=0.000000 conf_lo=0.000000 conf_hi=0.000000 cov=0.000000 full_read_support=yes
    nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnngCACCCGGCTCGCTCCGCGCCTCAGACCCGgttctgaggaagcggattgttttgaaaataaagtacaccgtgaagacctccaaggaactcgctcttgtcatctcaagtctctgaaggcagagctgacagttttcctgggaagttcaccctccgggagtgaaacctgactctcaggatgatcctgtctaacgccacggccgtgacacctgtgctgaccaagttatggcaggggacagtt
    caacagggcagcaacacatctgggctggcccgcaggttcccaggccacgaggacggcaagctggcggcgc
    tctacatcctcatggtcctcggctttttcggcttcttcaccctgggcatcatgctgagttacatccgctc
    caagaaactggagcactcccatgacccatacaacgtgtacattgagtctgacacttggcaggagcaggac
    aaggcgtacttccaggcccggattctggagagctgcagggcgtgttacgtcattgagaaccaactggctg
    tagagcgacccagtgcataccttcctgagatgaagcggtcgtcctgaccccaggaccagtcaaaactgga
    cagagcctccctgatgagctgatttttctaatcacatgttccttttttctttattgtatgagtattattg
    gggtttttgtctatcataaggggtgaaagggggatttaatatcactatatttctaaaatcacattccttc
    tataatagattgtcagtcattcccca

  • #2
    This can be done with Biopieces (www.biopieces.org)

    Code:
    read_fasta -i test.fna -n 5 |
    analyze_seq |
    write_tab -ck SEQ_NAME,HARD_MASK% -x
    
    #SEQ_NAME       HARD_MASK%
    ILLUMINA-52179E_0004:2:1:1040:5263#TTAGGC/1     0.0
    ILLUMINA-52179E_0004:2:1:1041:14486#TTAGGC/1    0.0
    ILLUMINA-52179E_0004:2:1:1043:19446#TTAGGC/1    0.0
    ILLUMINA-52179E_0004:2:1:1044:7943#TTAGGC/1     4.0
    ILLUMINA-52179E_0004:2:1:1045:16499#TTAGGC/1    4.0

    HARD_MASK% is the percentage of Ns in a sequence:



    Cheers,


    Martin

    Comment


    • #3
      C fu for the peoples ...

      #include <stdio.h>
      #include <string.h>
      #include <ctype.h>

      unsigned long int sum[5];
      unsigned long int bp;
      char s[512];
      int main()
      {
      int i,j;
      char ch;

      bp = 0;
      memset(sum,0,sizeof(sum));
      while (gets(s))
      {
      if (s[0] == '>') continue; // skip fasta entry header
      for (i=0;i<s[i];i++)
      {
      ch = toupper(s[i]);
      if (ch == 'A') { sum[0]++; bp++; }
      else if (ch == 'C') { sum[1]++; bp++; }
      else if (ch == 'G') { sum[2]++; bp++; }
      else if (ch == 'T') { sum[3]++; bp++; }
      else if (ch == 'N') { sum[4]++; bp++; }
      }
      memset(s,0,sizeof(s));
      }
      for (j=0;j<5;j++)
      {
      if (j == 0) printf("A ");
      else if (j == 1) printf("C ");
      else if (j == 2) printf("G ");
      else if (j == 3) printf("T ");
      else if (j == 4) printf("N ");
      printf("%ld ",sum[j]);
      printf("\n");
      }
      printf("bases = %ld \n",bp);
      return 0;
      }

      compilation: gcc acgt.c

      example execution of resulting a.out file on hg19.fa ..

      cat hg19.fa | ./a.out
      A 844868045
      C 585017944
      G 585360436
      T 846097277
      N 234350281
      bases = 3095693983


      If you're clever you can change the program to do the final algebra. Hurried mortals will just do the final algebra themselves.

      Note in my output that A~=T and C~=G . Amazing!

      Comment


      • #4
        I couldn't resist - a Biopython solution writing a tab separated output file giving the percentage of N (or n) characters in each FASTA sequence:

        Code:
        from Bio import SeqIO
        
        in_file = "test.fna"
        out_file = "test.tab"
        
        out_handle = open(out_file, "w")
        for record in SeqIO.parse(in_file, "fasta"):
            n_count = record.seq.count("N") + record.seq.count("n")
            out_handle.write("%s\t%0.2f\n" % (record.id, n_count*100.0/len(record)))
        out_handle.close()
        Note the 100.0 is there for two reasons. First this gives the answer as a percentage between 0 and 100, rather than a fraction between 0 and 1. The second reason is to ensure floating point division rather than integer division.

        I'm sure this can be done just as easily in Perl or Ruby too (using BioPerl and BioRuby). It may not run as fast as the C version, but it was much quicker to write, and is much shorter.

        Comment


        • #5
          Syntax Error in Biopython Script

          Thanks so much for the replies. I really liked the simplicity of the Biopython code, so I wanted to start there. I had problems installing Biopython on my Mac, but a co-worker was able to install it on her PC.

          We entered the code that you kindly provided; however, we got a syntax error on the last line [out_handle.close()].

          We also tried modifying the following code to calculate the n percentage and got a syntax error on the output line [output_file.close()] again.


          input_file = open('NC_005213.ffn', 'r')
          output_file = open('nucleotide_counts.tsv','w')
          output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')
          from Bio import SeqIO
          for cur_record in SeqIO.parse(input_file, "fasta") :
          #count nucleotides in this record...
          gene_name = cur_record.name
          A_count = cur_record.seq.count('A')
          C_count = cur_record.seq.count('C')
          G_count = cur_record.seq.count('G')
          T_count = cur_record.seq.count('T')
          length = len(cur_record.seq)
          cg_percentage = float(C_count + G_count) / length
          output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
          (gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
          output_file.write(output_line)
          output_file.close()
          input_file.close()



          We are trying to figure out what is going wrong with the output line and would appreciate any help you can provide!

          Thanks so much!!!
          Kim

          Comment


          • #6
            Originally posted by KDS View Post
            Thanks so much for the replies. I really liked the simplicity of the Biopython code, so I wanted to start there. I had problems installing Biopython on my Mac, but a co-worker was able to install it on her PC.
            Apple seem to have changed things with the latest version of X Code

            When posting examples on the forum, use the [ code ] ... [ /code ] tags to preserve the formatting. This is especially important for Python's indentation. I could see what you wrote when I quoted your reply:

            Code:
            input_file = open('NC_005213.ffn', 'r')
            output_file = open('nucleotide_counts.tsv','w')
            output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')
            from Bio import SeqIO
            for cur_record in SeqIO.parse(input_file, "fasta") :
                     #count nucleotides in this record...
                     gene_name = cur_record.name
                     A_count = cur_record.seq.count('A')
                     C_count = cur_record.seq.count('C')
                     G_count = cur_record.seq.count('G')
                     T_count = cur_record.seq.count('T')
                     length = len(cur_record.seq)
                     cg_percentage = float(C_count + G_count) / length 
                     output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
                     (gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
                     output_file.write(output_line)
            output_file.close()
            input_file.close()
            What exactly was the error message? If it was this:
            Code:
            ...
            SyntaxError: unexpected character after line continuation character
            check the output_line = ... bit does NOT have a space after the final slash. That slash means the command continues on the next line, but it has to be right at the end.

            Other than that (and this could be from copy/paste via the forum), it worked here.

            Peter

            Comment


            • #7
              Thanks so much for your help!

              The error message we were getting was

              Code:
              File "<stdin>", line 13
                  output_file.close()
                               ^
               
              SyntaxError: invalid syntax        
              >>> input_file.close()

              However, we were able to run the script when we removed the last two lines

              Code:
              output_file.close()
              input_file.close()
              or when we put a space in front of the last two lines:

              Code:
              input_file = open('NC_005213.ffn', 'r')
              output_file = open('nucleotide_counts.tsv','w')
              output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')
              from Bio import SeqIO
              for cur_record in SeqIO.parse(input_file, "fasta") :
                       #count nucleotides in this record...
                       gene_name = cur_record.name
                       A_count = cur_record.seq.count('A')
                       C_count = cur_record.seq.count('C')
                       G_count = cur_record.seq.count('G')
                       T_count = cur_record.seq.count('T')
                       length = len(cur_record.seq)
                       cg_percentage = float(C_count + G_count) / length 
                       output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
                       (gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
                       output_file.write(output_line)
              
              output_file.close()
              input_file.close()

              Apple seem to have changed things with the latest version of X Code
              I am still determined to install Biopython on my Mac. Do you recommend deleting Xcode 4 and installing Xcode 3?

              Thanks again for your help!

              Kim

              Comment


              • #8
                Originally posted by KDS View Post
                I am still determined to install Biopython on my Mac. Do you recommend deleting Xcode 4 and installing Xcode 3?
                That has certainly been reported to work, see:

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Essential Discoveries and Tools in Epitranscriptomics
                  by seqadmin




                  The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                  Yesterday, 07:01 AM
                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                58 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                54 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                45 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                55 views
                0 likes
                Last Post seqadmin  
                Working...
                X