SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Getting transcript lengths from GFF file Siva Bioinformatics 20 04-30-2018 09:02 PM
Whitespace in FASTA file mnkyboy Bioinformatics 2 11-13-2011 07:49 PM
Calculating the number of contigs in a scaffold file avtsanger Bioinformatics 5 01-06-2011 12:43 AM
how does cuffcompare choose which transcript to put in combined.gtf file? d f Bioinformatics 0 11-09-2010 11:30 AM
Tophat - fasta file ytmnd85 Bioinformatics 0 01-19-2010 12:38 PM

Reply
 
Thread Tools
Old 05-26-2011, 11:45 AM   #1
KDS
Junior Member
 
Location: TX

Join Date: May 2011
Posts: 4
Default Calculating Percentage of n's for Each Transcript in FASTA File

Hello,

I am completely new to next generation sequencing and need some help.

I have a multi-sequence FASTA file from Cufflinks (a small portion of the file is below), and I would like to calculate the percentage of n's in each transcript. Is there an easy way to do the calculation for all 11,804 transcripts at once?

Thanks so much for your help!!


>NM_001034679 loc:chr1|327877-339078|+ exons:327877-328043,331233-331406,333963-334122,337306-339078 segs:1-167,168-341,342-501,502-2274 FPKM=0.0000000000 frac=0.000000 conf_lo=0.000000 conf_hi=0.000000 cov=0.000000 full_read_support=yes
AACAGCTTCGGGTAAAGgaactccatccacttggggcttgaccgcgggagtcgctgtagctctcactgtg
agaaagcaagatgcactttagaaattttaactacagttttagctccctgattgcctgtgtggcaaacagt
gagatcttcagcgaaagtgaaaccagggccaaatttgaatccctctttaggacttatgacaaagatatca
cctttcagtattttaaaagcttcaaacgggtcagaataaacttcagcaaccccttatctgcagcagatgc
caggctccagctacataagactgaatttcttggaaaggaaatgaaattatattttgctcagactttacac
ataggcagttcgcacctggcccctccgaatccagacaaacagttcctcatctctcctcctgcctcaccgc
cggtcggctggaaacaagtggaagacgctactcccgtcataaattatgaccttttatacgctatctccaa
gctagggccaggggaaaagtatgaactgcatgcagccaccgacaccacccccagcgtggtggtccacgtc
tgtgagagcgatcaggagaacgaggaagaagacgagatggaaagaatgaagagaccgaagcccaaaattatccagaccaggaggccagagtacacgcccatccacttaagctgaaccggcgcccggacgaagacgtgctc
caaaccatgctcgcaagaaggcatcttttactgtggaagcagccggtcacagctttggaggcggcagccg
tgaccgctgtggcggaaattccagttcacgttgctcagaagagaatcgaggcttcgtcccctggttctaa
cgctgcgcctcagtcagtgttcgaggctcctggccaggccccgagccaatcactgagcttggggtgatcg
cacaaggacatctgggagcatcgcgggaaaaccaataatgatagtcttttgtacttgttctcttctggta
ggttctgtcttggccaggggcagattgatccgtgggccccggggagagtctttgtgtttaatcagtctac
aaggtagacgcactctctctCCTGGTGGGAAAAGGCGCCACGnnnnnnnnnnnngCGTCTGGTGCAGAAAGGTTGTGAAAGCAACCGTGCAACGTggaaactgtagcgtttcaatttcccccttcatgttctgatgTTTGTGCATGTGTATTACTgatTTCTCAGAACTAACCTTTGTTTGTATGTAGAGTTGCGCCACTGCTGTTTTACATCTTCTGGGGAGATAAGAAGGCATctgtgaagtctgtcacctttgcagattcGTGACTGTCTTTGCAAGGGCACCCACGGCGGGGTGGAGGGGATTCTACCTGGAATACACACACCATTCCGCATCCTGTCCGATGCGACCAAnnnnnCGTGTTTTTGCAAAAGAAgtcgatctggaattcctgtgtagcgtttcgcttataaaattcagaaaatagcactttcactgccaactactagtgggtgagaaattttagtttagatgttttagatcaggcaatacgtaggtttcatttgtttctttgacgtggtggtttatataacatgaatcatagccaaaacccttttcggggggaatagtcagttgagatcattaatttttttacccccactaatacatcaagataaacttgtaaataaagccggtagtatatattcacacctgttgtgcacttgggtgagacatatatggccagggaagactagggtcagatgtgttgacctccccgtgaatcatatgttgtagaaaatgcctttcagatgtttgatgggacttgaattcaaagcacgtgaagtggatagtggatataagaagggtgcagtgcctttcccattaattcctggtggagttgtcacactaggttaacgtttgtaatttttttctagtgtccTGTGTATGTsTGGTCGATGGGTACTCCCTTTTGGCCTTACAAtattgtaacaatgtttgtccttttgaaatacctaatgccaagtaacagtgcatgctttagaaaaggggaagagggctttctttaagaagtaaaggcgtttggctgttcctgtcaagaaactgactgaatggtctccaaaccctgtttacaggacctggtggggtgtgggggacaaatgagcaagagatGCGTGCATAGTCGTTCAAGTGTTCGTAGTTCAGTGCTTTTAAACTGGGGAGGCTAACCACGAGATATTTTTTTTAACTGCATTCTCTAATAAATCGACGCAATATGCTCTTTA
>NM_001077977 loc:chr1|377376-383871|+ exons:377376-377447,378427-378521,383243-383871 segs:1-72,73-167,168-796 FPKM=0.0000000000 frac=0.000000 conf_lo=0.000000 conf_hi=0.000000 cov=0.000000 full_read_support=yes
nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnngCACCCGGCTCGCTCCGCGCCTCAGACCCGgttctgaggaagcggattgttttgaaaataaagtacaccgtgaagacctccaaggaactcgctcttgtcatctcaagtctctgaaggcagagctgacagttttcctgggaagttcaccctccgggagtgaaacctgactctcaggatgatcctgtctaacgccacggccgtgacacctgtgctgaccaagttatggcaggggacagtt
caacagggcagcaacacatctgggctggcccgcaggttcccaggccacgaggacggcaagctggcggcgc
tctacatcctcatggtcctcggctttttcggcttcttcaccctgggcatcatgctgagttacatccgctc
caagaaactggagcactcccatgacccatacaacgtgtacattgagtctgacacttggcaggagcaggac
aaggcgtacttccaggcccggattctggagagctgcagggcgtgttacgtcattgagaaccaactggctg
tagagcgacccagtgcataccttcctgagatgaagcggtcgtcctgaccccaggaccagtcaaaactgga
cagagcctccctgatgagctgatttttctaatcacatgttccttttttctttattgtatgagtattattg
gggtttttgtctatcataaggggtgaaagggggatttaatatcactatatttctaaaatcacattccttc
tataatagattgtcagtcattcccca
KDS is offline   Reply With Quote
Old 05-26-2011, 10:51 PM   #2
maasha
Senior Member
 
Location: Denmark

Join Date: Apr 2009
Posts: 153
Default

This can be done with Biopieces (www.biopieces.org)

Code:
read_fasta -i test.fna -n 5 |
analyze_seq |
write_tab -ck SEQ_NAME,HARD_MASK% -x

#SEQ_NAME       HARD_MASK%
ILLUMINA-52179E_0004:2:1:1040:5263#TTAGGC/1     0.0
ILLUMINA-52179E_0004:2:1:1041:14486#TTAGGC/1    0.0
ILLUMINA-52179E_0004:2:1:1043:19446#TTAGGC/1    0.0
ILLUMINA-52179E_0004:2:1:1044:7943#TTAGGC/1     4.0
ILLUMINA-52179E_0004:2:1:1045:16499#TTAGGC/1    4.0

HARD_MASK% is the percentage of Ns in a sequence:

http://code.google.com/p/biopieces/wiki/analyze_seq

Cheers,


Martin
maasha is offline   Reply With Quote
Old 05-27-2011, 06:00 AM   #3
Richard Finney
Senior Member
 
Location: bethesda

Join Date: Feb 2009
Posts: 700
Default

C fu for the peoples ...

#include <stdio.h>
#include <string.h>
#include <ctype.h>

unsigned long int sum[5];
unsigned long int bp;
char s[512];
int main()
{
int i,j;
char ch;

bp = 0;
memset(sum,0,sizeof(sum));
while (gets(s))
{
if (s[0] == '>') continue; // skip fasta entry header
for (i=0;i<s[i];i++)
{
ch = toupper(s[i]);
if (ch == 'A') { sum[0]++; bp++; }
else if (ch == 'C') { sum[1]++; bp++; }
else if (ch == 'G') { sum[2]++; bp++; }
else if (ch == 'T') { sum[3]++; bp++; }
else if (ch == 'N') { sum[4]++; bp++; }
}
memset(s,0,sizeof(s));
}
for (j=0;j<5;j++)
{
if (j == 0) printf("A ");
else if (j == 1) printf("C ");
else if (j == 2) printf("G ");
else if (j == 3) printf("T ");
else if (j == 4) printf("N ");
printf("%ld ",sum[j]);
printf("\n");
}
printf("bases = %ld \n",bp);
return 0;
}

compilation: gcc acgt.c

example execution of resulting a.out file on hg19.fa ..

cat hg19.fa | ./a.out
A 844868045
C 585017944
G 585360436
T 846097277
N 234350281
bases = 3095693983


If you're clever you can change the program to do the final algebra. Hurried mortals will just do the final algebra themselves.

Note in my output that A~=T and C~=G . Amazing!
Richard Finney is offline   Reply With Quote
Old 05-31-2011, 05:15 AM   #4
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

I couldn't resist - a Biopython solution writing a tab separated output file giving the percentage of N (or n) characters in each FASTA sequence:

Code:
from Bio import SeqIO

in_file = "test.fna"
out_file = "test.tab"

out_handle = open(out_file, "w")
for record in SeqIO.parse(in_file, "fasta"):
    n_count = record.seq.count("N") + record.seq.count("n")
    out_handle.write("%s\t%0.2f\n" % (record.id, n_count*100.0/len(record)))
out_handle.close()
Note the 100.0 is there for two reasons. First this gives the answer as a percentage between 0 and 100, rather than a fraction between 0 and 1. The second reason is to ensure floating point division rather than integer division.

I'm sure this can be done just as easily in Perl or Ruby too (using BioPerl and BioRuby). It may not run as fast as the C version, but it was much quicker to write, and is much shorter.
maubp is offline   Reply With Quote
Old 06-07-2011, 10:21 AM   #5
KDS
Junior Member
 
Location: TX

Join Date: May 2011
Posts: 4
Default Syntax Error in Biopython Script

Thanks so much for the replies. I really liked the simplicity of the Biopython code, so I wanted to start there. I had problems installing Biopython on my Mac, but a co-worker was able to install it on her PC.

We entered the code that you kindly provided; however, we got a syntax error on the last line [out_handle.close()].

We also tried modifying the following code to calculate the n percentage and got a syntax error on the output line [output_file.close()] again.


input_file = open('NC_005213.ffn', 'r')
output_file = open('nucleotide_counts.tsv','w')
output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')
from Bio import SeqIO
for cur_record in SeqIO.parse(input_file, "fasta") :
#count nucleotides in this record...
gene_name = cur_record.name
A_count = cur_record.seq.count('A')
C_count = cur_record.seq.count('C')
G_count = cur_record.seq.count('G')
T_count = cur_record.seq.count('T')
length = len(cur_record.seq)
cg_percentage = float(C_count + G_count) / length
output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
(gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
output_file.write(output_line)
output_file.close()
input_file.close()



We are trying to figure out what is going wrong with the output line and would appreciate any help you can provide!

Thanks so much!!!
Kim
KDS is offline   Reply With Quote
Old 06-07-2011, 10:38 AM   #6
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by KDS View Post
Thanks so much for the replies. I really liked the simplicity of the Biopython code, so I wanted to start there. I had problems installing Biopython on my Mac, but a co-worker was able to install it on her PC.
Apple seem to have changed things with the latest version of X Code

When posting examples on the forum, use the [ code ] ... [ /code ] tags to preserve the formatting. This is especially important for Python's indentation. I could see what you wrote when I quoted your reply:

Code:
input_file = open('NC_005213.ffn', 'r')
output_file = open('nucleotide_counts.tsv','w')
output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')
from Bio import SeqIO
for cur_record in SeqIO.parse(input_file, "fasta") :
         #count nucleotides in this record...
         gene_name = cur_record.name
         A_count = cur_record.seq.count('A')
         C_count = cur_record.seq.count('C')
         G_count = cur_record.seq.count('G')
         T_count = cur_record.seq.count('T')
         length = len(cur_record.seq)
         cg_percentage = float(C_count + G_count) / length 
         output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
         (gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
         output_file.write(output_line)
output_file.close()
input_file.close()
What exactly was the error message? If it was this:
Code:
...
SyntaxError: unexpected character after line continuation character
check the output_line = ... bit does NOT have a space after the final slash. That slash means the command continues on the next line, but it has to be right at the end.

Other than that (and this could be from copy/paste via the forum), it worked here.

Peter
maubp is offline   Reply With Quote
Old 06-07-2011, 02:37 PM   #7
KDS
Junior Member
 
Location: TX

Join Date: May 2011
Posts: 4
Default

Thanks so much for your help!

The error message we were getting was

Code:
File "<stdin>", line 13
    output_file.close()
                 ^
 
SyntaxError: invalid syntax        
>>> input_file.close()

However, we were able to run the script when we removed the last two lines

Code:
output_file.close()
input_file.close()
or when we put a space in front of the last two lines:

Code:
input_file = open('NC_005213.ffn', 'r')
output_file = open('nucleotide_counts.tsv','w')
output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')
from Bio import SeqIO
for cur_record in SeqIO.parse(input_file, "fasta") :
         #count nucleotides in this record...
         gene_name = cur_record.name
         A_count = cur_record.seq.count('A')
         C_count = cur_record.seq.count('C')
         G_count = cur_record.seq.count('G')
         T_count = cur_record.seq.count('T')
         length = len(cur_record.seq)
         cg_percentage = float(C_count + G_count) / length 
         output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
         (gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
         output_file.write(output_line)

output_file.close()
input_file.close()

Quote:
Apple seem to have changed things with the latest version of X Code
I am still determined to install Biopython on my Mac. Do you recommend deleting Xcode 4 and installing Xcode 3?

Thanks again for your help!

Kim
KDS is offline   Reply With Quote
Old 06-07-2011, 11:06 PM   #8
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,543
Default

Quote:
Originally Posted by KDS View Post
I am still determined to install Biopython on my Mac. Do you recommend deleting Xcode 4 and installing Xcode 3?
That has certainly been reported to work, see:
http://lists.open-bio.org/pipermail/...ne/007320.html
maubp is offline   Reply With Quote
Reply

Tags
bioinformatics, cufflinks, percent calculation, sequence quality

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:38 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO