Seqanswers Leaderboard Ad

**maasha** · 05-26-2011, 10:51 PM

This can be done with Biopieces (www.biopieces.org)

Code:

read_fasta -i test.fna -n 5 |
analyze_seq |
write_tab -ck SEQ_NAME,HARD_MASK% -x

#SEQ_NAME       HARD_MASK%
ILLUMINA-52179E_0004:2:1:1040:5263#TTAGGC/1     0.0
ILLUMINA-52179E_0004:2:1:1041:14486#TTAGGC/1    0.0
ILLUMINA-52179E_0004:2:1:1043:19446#TTAGGC/1    0.0
ILLUMINA-52179E_0004:2:1:1044:7943#TTAGGC/1     4.0
ILLUMINA-52179E_0004:2:1:1045:16499#TTAGGC/1    4.0

HARD_MASK% is the percentage of Ns in a sequence:

Google Code Archive - Long-term storage for Google Code Project Hosting.

http://code.google.com/p/biopieces/wiki/analyze_seq

Cheers,

Martin

**Richard Finney** · 05-27-2011, 06:00 AM

C fu for the peoples ...

#include <stdio.h>
#include <string.h>
#include <ctype.h>

unsigned long int sum[5];
unsigned long int bp;
char s[512];
int main()
{
int i,j;
char ch;

bp = 0;
memset(sum,0,sizeof(sum));
while (gets(s))
{
if (s[0] == '>') continue; // skip fasta entry header
for (i=0;i<s[i];i++)
{
ch = toupper(s[i]);
if (ch == 'A') { sum[0]++; bp++; }
else if (ch == 'C') { sum[1]++; bp++; }
else if (ch == 'G') { sum[2]++; bp++; }
else if (ch == 'T') { sum[3]++; bp++; }
else if (ch == 'N') { sum[4]++; bp++; }
}
memset(s,0,sizeof(s));
}
for (j=0;j<5;j++)
{
if (j == 0) printf("A ");
else if (j == 1) printf("C ");
else if (j == 2) printf("G ");
else if (j == 3) printf("T ");
else if (j == 4) printf("N ");
printf("%ld ",sum[j]);
printf("\n");
}
printf("bases = %ld \n",bp);
return 0;
}

compilation: gcc acgt.c

example execution of resulting a.out file on hg19.fa ..

cat hg19.fa | ./a.out
A 844868045
C 585017944
G 585360436
T 846097277
N 234350281
bases = 3095693983

If you're clever you can change the program to do the final algebra. Hurried mortals will just do the final algebra themselves.

Note in my output that A~=T and C~=G . Amazing!

**maubp** · 05-31-2011, 05:15 AM

I couldn't resist - a Biopython solution writing a tab separated output file giving the percentage of N (or n) characters in each FASTA sequence:

Code:

from Bio import SeqIO

in_file = "test.fna"
out_file = "test.tab"

out_handle = open(out_file, "w")
for record in SeqIO.parse(in_file, "fasta"):
    n_count = record.seq.count("N") + record.seq.count("n")
    out_handle.write("%s\t%0.2f\n" % (record.id, n_count*100.0/len(record)))
out_handle.close()

Note the 100.0 is there for two reasons. First this gives the answer as a percentage between 0 and 100, rather than a fraction between 0 and 1. The second reason is to ensure floating point division rather than integer division.

I'm sure this can be done just as easily in Perl or Ruby too (using BioPerl and BioRuby). It may not run as fast as the C version, but it was much quicker to write, and is much shorter.

**KDS** · 06-07-2011, 10:21 AM

Syntax Error in Biopython Script

Thanks so much for the replies. I really liked the simplicity of the Biopython code, so I wanted to start there. I had problems installing Biopython on my Mac, but a co-worker was able to install it on her PC.

We entered the code that you kindly provided; however, we got a syntax error on the last line [out_handle.close()].

We also tried modifying the following code to calculate the n percentage and got a syntax error on the output line [output_file.close()] again.

input_file = open('NC_005213.ffn', 'r')
output_file = open('nucleotide_counts.tsv','w')
output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')
from Bio import SeqIO
for cur_record in SeqIO.parse(input_file, "fasta") :
#count nucleotides in this record...
gene_name = cur_record.name
A_count = cur_record.seq.count('A')
C_count = cur_record.seq.count('C')
G_count = cur_record.seq.count('G')
T_count = cur_record.seq.count('T')
length = len(cur_record.seq)
cg_percentage = float(C_count + G_count) / length
output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
(gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
output_file.write(output_line)
output_file.close()
input_file.close()

We are trying to figure out what is going wrong with the output line and would appreciate any help you can provide!

Thanks so much!!!
Kim

**maubp** · 06-07-2011, 10:38 AM

Originally posted by KDS View Post

Thanks so much for the replies. I really liked the simplicity of the Biopython code, so I wanted to start there. I had problems installing Biopython on my Mac, but a co-worker was able to install it on her PC.

Apple seem to have changed things with the latest version of X Code

When posting examples on the forum, use the [ code ] ... [ /code ] tags to preserve the formatting. This is especially important for Python's indentation. I could see what you wrote when I quoted your reply:

Code:

input_file = open('NC_005213.ffn', 'r')
output_file = open('nucleotide_counts.tsv','w')
output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')
from Bio import SeqIO
for cur_record in SeqIO.parse(input_file, "fasta") :
         #count nucleotides in this record...
         gene_name = cur_record.name
         A_count = cur_record.seq.count('A')
         C_count = cur_record.seq.count('C')
         G_count = cur_record.seq.count('G')
         T_count = cur_record.seq.count('T')
         length = len(cur_record.seq)
         cg_percentage = float(C_count + G_count) / length 
         output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
         (gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
         output_file.write(output_line)
output_file.close()
input_file.close()

What exactly was the error message? If it was this:

Code:

...
SyntaxError: unexpected character after line continuation character

check the output_line = ... bit does NOT have a space after the final slash. That slash means the command continues on the next line, but it has to be right at the end.

Other than that (and this could be from copy/paste via the forum), it worked here.

Peter

**KDS** · 06-07-2011, 02:37 PM

Thanks so much for your help!

The error message we were getting was

Code:

File "<stdin>", line 13
    output_file.close()
                 ^
 
SyntaxError: invalid syntax        
>>> input_file.close()

However, we were able to run the script when we removed the last two lines

Code:

output_file.close()
input_file.close()

or when we put a space in front of the last two lines:

Code:

input_file = open('NC_005213.ffn', 'r')
output_file = open('nucleotide_counts.tsv','w')
output_file.write('Gene\tA\tC\tG\tT\tLength\tCG%\n')
from Bio import SeqIO
for cur_record in SeqIO.parse(input_file, "fasta") :
         #count nucleotides in this record...
         gene_name = cur_record.name
         A_count = cur_record.seq.count('A')
         C_count = cur_record.seq.count('C')
         G_count = cur_record.seq.count('G')
         T_count = cur_record.seq.count('T')
         length = len(cur_record.seq)
         cg_percentage = float(C_count + G_count) / length 
         output_line = '%s\t%i\t%i\t%i\t%i\t%i\t%f\n' % \
         (gene_name, A_count, C_count, G_count, T_count, length, cg_percentage)
         output_file.write(output_line)

output_file.close()
input_file.close()

Apple seem to have changed things with the latest version of X Code

I am still determined to install Biopython on my Mac. Do you recommend deleting Xcode 4 and installing Xcode 3?

Thanks again for your help!

Kim

**maubp** · 06-07-2011, 11:06 PM

Originally posted by KDS View Post

I am still determined to install Biopython on my Mac. Do you recommend deleting Xcode 4 and installing Xcode 3?

That has certainly been reported to work, see:

[Biopython] installation failure on mac OS10.6.7

http://lists.open-bio.org/pipermail/biopython/2011-June/007320.html

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Yesterday, 11:49 AM	0 responses 13 views 0 likes	Last Post by seqadmin Yesterday, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Calculating Percentage of n's for Each Transcript in FASTA File

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News