
**Cogitare** · 08-20-2013, 06:02 AM

Perhaps this

It would be helpful if you gave more information about how the program fails. However, one issue I think is this line:

infile = open(sys.argv[0], "rU")

Should be:

infile = open(sys.argv[1], "rU")

In this case sys.argv[0] = your script name, whereas sys.argv[1] = infile.fasta. See here for more information:

python: sys.argv[0] meaning in official documentation

http://stackoverflow.com/questions/5222408/python-sys-argv0-meaning-in-official-documentation

Quoting from docs.python.org: "sys.argv The list of command line arguments passed to a Python script. argv[0] is the script name (it is operating system dependent whether this is a full pathname o...
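As a small illustration of the point above, the sketch below splits an argv-style list into the script name and the remaining arguments. The helper name `split_argv` is made up for this example, and the list is hard-coded to mimic the usage line from the thread; in a real script you would pass `sys.argv` itself.

```python
def split_argv(argv):
    """Return (script_name, arguments) from a sys.argv-style list."""
    return argv[0], argv[1:]

# Simulates: python countbases.py infile.fasta
script, args = split_argv(["countbases.py", "infile.fasta"])
print(script)  # countbases.py
print(args)    # ['infile.fasta']
```

With `sys.argv[0]`, the original script would have opened its own source file rather than the FASTA file.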

**illinu** · 08-20-2013, 06:15 AM

Originally posted by Cogitare

It would be helpful if you gave more information about how the program fails. However, one issue I think is this line:

infile = open(sys.argv[0], "rU")

Should be:

infile = open(sys.argv[1], "rU")

In this case sys.argv[0] = your script name, whereas sys.argv[1] = infile.fasta. See here for more information:
http://stackoverflow.com/questions/5...-documentation

Thank you. I also tried with sys.argv[1].
The program does not return any error; it just does not return anything. If I run it in IDLE, it crashes, and if I run it in a terminal, it jumps to the next line as if it were running but stays there forever. I am testing with very small files, so if it worked it should finish quickly instead of hanging for a long time.

**dpryan** · 08-20-2013, 06:27 AM

Originally posted by illinu

I'm new to Python but trying to learn. Here I wrote a program to count the nucleotides in a FASTA file. The approach is that when a line does not contain a ">" symbol, the program counts nucleotides one by one and adds them to the variable sum. The program does not work and I don't know why. Please help!

#!/usr/bin/env python
import sys

# Takes fasta file and counts total bases

infile = open(sys.argv[0], "rU")
sum = 0
line = infile.readlines()
while ">" not in lline:

for i in line:

sum += 1

print (sum)

usage: $ python countbases.py infile.fasta

Aside from the argv[0] issue, you're also looking for things in "lline" rather than "line"...
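A likely reason the quoted script seems to hang, assuming `lline` is read as `line` and the `for` loop is the body of the `while`: `readlines()` returns a list, so `">" not in line` compares ">" against whole lines rather than looking inside them. The condition stays true, and nothing in the loop changes `line`, so the `while` never ends. The sample lines below are made up for illustration.

```python
# Stand-in for infile.readlines(); the records are invented.
lines = [">seq1\n", "ACGT\n", "GGCC\n"]

# "in" on a list compares whole elements, so this asks whether some line
# is exactly ">", which is not the case for a header such as ">seq1\n".
header_found_in_list = ">" in lines

# "in" on a string is a substring test, which is what the check needs.
header_found_in_line = ">" in lines[0]

print(header_found_in_list)  # False
print(header_found_in_line)  # True
```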

**illinu** · 08-20-2013, 06:32 AM

Originally posted by dpryan

Aside from the argv[0] issue, you're also looking for things in "lline" rather than "line"...

That's a typo here, in the program it's ok. So you're saying that apart from that the program seems ok and it should work?

**dpryan** · 08-20-2013, 06:49 AM

Well, you're also trying to iterate through a list in an odd way that I'm not sure would work. You might take a rather more efficient approach:

Code:

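#!/usr/bin/env python
import sys

# "rU" requests universal newlines; on Python 3.11+ use plain open(path) instead.
f = open(sys.argv[1], "rU")
total = 0
for line in f:
    # Skip FASTA header lines
    if line[0] == ">":
        continue

    # Subtract 1 for the trailing newline
    total += len(line) - 1
print("%s has %i nucleotides" % (sys.argv[1], total))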

With readlines(), you end up reading the whole file into a list, which is really memory inefficient if you're dealing with a whole genome.
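For readers on current Python 3, here is a minimal sketch of the same line-by-line idea. It is not the exact code from the post: the function name `count_bases` and the demo file contents are made up for illustration. It uses `rstrip()` instead of `len(line)-1`, so a final line without a trailing newline is still counted in full, and it drops the `"rU"` mode that appears in this thread, since that mode was removed in Python 3.11.

```python
import os
import tempfile

def count_bases(path):
    """Count sequence characters in a FASTA file, skipping header lines."""
    total = 0
    # Iterating over the file object reads one line at a time,
    # so the whole file is never held in memory.
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                continue
            # rstrip() drops the newline and any trailing whitespace.
            total += len(line.rstrip())
    return total

# Demo on a small invented file; the last line has no trailing newline.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1\nACGT\nGG\n>seq2\nTTT")
print(count_bases(tmp.name))  # 9
os.remove(tmp.name)
```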

**illinu** · 08-20-2013, 07:11 AM

Impressed! it works like a charm

Thanks a million

**dpryan** · 08-20-2013, 07:16 AM

No problem. In a real version, you'd probably want to use argparse (so you can easily give help and usage information) and check to see that the fasta file exists before running further. You also might want to add the number of bases per chromosome, likely only printed if the user specifies.
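One way the argparse suggestion could look is sketched below. The argument name `fasta` and the `--per-sequence` flag are hypothetical choices for this example, not something specified in the thread. The existence check uses `parser.error`, which prints the usage message and exits.

```python
import argparse
import os
import tempfile

def parse_args(argv=None):
    """Parse command-line options; argv=None means use the real command line."""
    parser = argparse.ArgumentParser(description="Count bases in a FASTA file.")
    parser.add_argument("fasta", help="path to the input FASTA file")
    # Hypothetical flag for the optional per-sequence report.
    parser.add_argument("--per-sequence", action="store_true",
                        help="also report the number of bases per sequence")
    args = parser.parse_args(argv)
    # Stop early with a usage message if the input file is missing.
    if not os.path.isfile(args.fasta):
        parser.error("no such file: %s" % args.fasta)
    return args

# Demo with an explicit argument list and an empty temporary file.
with tempfile.NamedTemporaryFile(suffix=".fasta", delete=False) as tmp:
    pass
args = parse_args([tmp.name, "--per-sequence"])
print(args.per_sequence)  # True
os.remove(tmp.name)
```

Running such a script with `-h` would print the help text that argparse builds from the `help` strings.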

**brentp** · 08-20-2013, 12:22 PM

import sys
print(sum(len(l.rstrip()) for l in open(sys.argv[1]) if l[0] != ">"))

**CHObot** · 08-22-2013, 08:17 AM

Yet another way (and more general) is to use the BioPython module. You will want to get familiar with handling sequences like this anyway, it is much more convenient to have a data structure. It would be this:

Code:

import sys
from Bio import SeqIO

for seq_record in SeqIO.parse(sys.argv[1], "fasta"):
    print(seq_record.id)
    print(len(seq_record))

And that would handle multifasta files (with multiple sequences in them) which you will no doubt encounter eventually.
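To show what per-record counting involves when the library is not available, here is a plain-Python sketch that is only similar in spirit to the loop above. It is not Biopython's parser: the function name `record_lengths` is made up, taking the first word after ">" as the record ID is an assumption of this example, and the sample records are invented.

```python
def record_lengths(lines):
    """Map each FASTA record ID to the length of its sequence."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # Use the first word after ">" as the ID (assumption of this sketch).
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths

# Invented multi-FASTA content; a file object could be passed the same way.
sample = [">chr1 test\n", "ACGT\n", "AC\n", ">chr2\n", "GGG\n"]
print(record_lengths(sample))  # {'chr1': 6, 'chr2': 3}
```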


Python counting bases fasta file