SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
counting tags in bam file james4545 Bioinformatics 2 01-10-2013 08:03 AM
VarScan shift in counting of bases! AlbertZANG Bioinformatics 0 06-18-2012 10:29 AM
Counting coverage bases batsal Bioinformatics 0 02-17-2012 10:46 AM
Counting mapped nucleotides in a bam file moriah Bioinformatics 7 08-22-2011 01:30 AM
Unique bases from .gff file Khanjan Bioinformatics 2 12-01-2010 12:21 PM

Reply
 
Thread Tools
Old 08-20-2013, 05:44 AM   #1
illinu
Member
 
Location: US

Join Date: Jul 2013
Posts: 55
Default Python counting bases fasta file

New to python but trying to learn, here I wrote a program to count the nucleotides in a fasta file. The approach is that when the line does not contain a ">" symbol then the program counts nucleotides one by one and adds it to the variable sum. The program does not work and I don't know why. Please help!

#!/usr/bin/env python
import sys

# Takes fasta file and counts total bases

infile = open(sys.argv[0], "rU")
sum = 0
line = infile.readlines()
while ">" not in lline:
for i in line:
sum += 1
print (sum)

usage: $ python countbases.py infile.fasta
illinu is offline   Reply With Quote
Old 08-20-2013, 06:02 AM   #2
Cogitare
Junior Member
 
Location: US

Join Date: Aug 2013
Posts: 1
Default Perhaps this

It would be helpful if you gave more information about how the program fails. However, one issue I think is this line:

infile = open(sys.argv[0], "rU")

Should be:

infile = open(sys.argv[1], "rU")

In this case sys.argv[0] = your script name, whereas sys.argv[1] = infile.fasta. See here for more information:
http://stackoverflow.com/questions/5...-documentation
Cogitare is offline   Reply With Quote
Old 08-20-2013, 06:15 AM   #3
illinu
Member
 
Location: US

Join Date: Jul 2013
Posts: 55
Default

Quote:
Originally Posted by Cogitare View Post
It would be helpful if you gave more information about how the program fails. However, one issue I think is this line:

infile = open(sys.argv[0], "rU")

Should be:

infile = open(sys.argv[1], "rU")

In this case sys.argv[0] = your script name, whereas sys.argv[1] = infile.fasta. See here for more information:
http://stackoverflow.com/questions/5...-documentation
Thank you. I also tried with sys.argv[1]
The program does not return any error, it just does not return anything. If I run it in Idle, it crashes and if I run it in terminal, it jumps to the next line as if it was running but it stays there forever. I am trying with very small test files and if it worked it should go fairly fast instead of staying hanged for a long time.
illinu is offline   Reply With Quote
Old 08-20-2013, 06:27 AM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by illinu View Post
New to python but trying to learn, here I wrote a program to count the nucleotides in a fasta file. The approach is that when the line does not contain a ">" symbol then the program counts nucleotides one by one and adds it to the variable sum. The program does not work and I don't know why. Please help!

#!/usr/bin/env python
import sys

# Takes fasta file and counts total bases

infile = open(sys.argv[0], "rU")
sum = 0
line = infile.readlines()
while ">" not in lline:
for i in line:
sum += 1
print (sum)

usage: $ python countbases.py infile.fasta
Aside from the argv[0], issue, you're also looking for things in "lline" rather than "line"...
dpryan is offline   Reply With Quote
Old 08-20-2013, 06:32 AM   #5
illinu
Member
 
Location: US

Join Date: Jul 2013
Posts: 55
Default

Quote:
Originally Posted by dpryan View Post
Aside from the argv[0], issue, you're also looking for things in "lline" rather than "line"...
That's a typo here , in the program it's ok. So you're saying that apart from that the program seems ok and it should work?
illinu is offline   Reply With Quote
Old 08-20-2013, 06:49 AM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Well, you're also trying to iterate through a list in an odd way that I'm not sure would work. You might take a rather more efficient approach:
Code:
#!/usr/bin/env python
import sys

f = open(sys.argv[1], "rU")
total=0
for line in f :
    if(line[0] == ">") :
        continue

    total += len(line)-1
print("%s has %i nucleotides" % (sys.argv[1], total))
With readlines(), you end up reading the whole file into a list, which is really memory inefficient if you're dealing with a whole genome.

Last edited by dpryan; 08-20-2013 at 06:49 AM. Reason: typo
dpryan is offline   Reply With Quote
Old 08-20-2013, 07:11 AM   #7
illinu
Member
 
Location: US

Join Date: Jul 2013
Posts: 55
Default

Impressed! it works like a charm

Thanks a million
illinu is offline   Reply With Quote
Old 08-20-2013, 07:16 AM   #8
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

No problem. In a real version, you'd probably want to use argparse (so you can easily give help and usage information) and check to see that the fasta file exists before running further. You also might want to add the number of bases per chromosome, likely only printed if the user specifies.
dpryan is offline   Reply With Quote
Old 08-20-2013, 12:22 PM   #9
brentp
Member
 
Location: salt lake city, UT

Join Date: Apr 2010
Posts: 72
Default

import sys
print sum(len(l.rstrip()) for l in open(sys.argv[1]) if l[0] != ">")
brentp is offline   Reply With Quote
Old 08-22-2013, 08:17 AM   #10
CHObot
Junior Member
 
Location: Greater Boston Area

Join Date: May 2013
Posts: 8
Default

Yet another way (and more general) is to use the BioPython module. You will want to get familiar with handling sequences like this anyway, it is much more convenient to have a data structure. It would be this:

Code:
import sys
from Bio import SeqIO

for seq_record in SeqIO.parse(sys.argv[1], "fasta"):
    print seq_record.id
    print len(seq_record)
And that would handle multifasta files (with multiple sequences in them) which you will no doubt encounter eventually.
CHObot is offline   Reply With Quote
Reply

Tags
bases, count, python

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:15 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO