**NicoBxl** · 03-28-2011, 02:23 AM

In R it's easy:

Code:

# Returns TRUE if the sequence contains exactly four "N" characters.
countN <- function(a){
  return(sum(strsplit(a,"")[[1]]=="N")==4)
}

a <- "ATGCNNNN"
b <- "ATGCNN"
x <- c(a,b)
sapply(x,countN)

ATGCNNNN   ATGCNN 
    TRUE    FALSE 
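
For comparison, the same check can be sketched in Python. The R function above tests for exactly four Ns, while the thread title asks about reads with more than four, so this sketch uses the "more than" reading. The helper name `has_too_many_n` and its default threshold are assumptions for illustration, not something from the posts.

```python
def has_too_many_n(seq, max_n=4, char="N"):
    """Return True if seq contains more than max_n occurrences of char."""
    return seq.count(char) > max_n


# Hypothetical example reads: 4 Ns, 2 Ns and 5 Ns respectively.
reads = ["ATGCNNNN", "ATGCNN", "NNNNNACGT"]
flags = {read: has_too_many_n(read) for read in reads}
print(flags)
```

Only the last read exceeds the threshold here; changing `max_n` or using `>=` would give the "exactly or more" behaviour if that is what is wanted.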

**marcela** · 03-28-2011, 02:27 AM

Quite simple! Thanks!

**NicoBxl** · 03-28-2011, 02:27 AM

you're welcome

**schmima** · 03-28-2011, 02:44 AM

Hm, I don't know of any, but it's a relatively easy task. Here's a quick and dirty script (it assumes that your infile has no header). It's most probably not too fast due to simultaneous reading and writing, but it should work (I did not test it; it was originally something else):

[EDIT: the tabs may not be printed correctly here, and I don't know if they will survive copying, so check the indentation.]

############

import sys

# Expect four arguments: infile, outfile, the character to count, and the threshold.
if len(sys.argv) < 5:
    print('usage:\n\tpython %s infile outfile char maxnum' % sys.argv[0])
    sys.exit(0)

in_file = sys.argv[1]
out_file = sys.argv[2]
char = sys.argv[3]
maxnum = int(sys.argv[4])

infile = open(in_file, 'r')
outfile = open(out_file, 'w')

# Collect lines in groups of four (one FASTQ record) and write the record
# only if its sequence contains at most maxnum occurrences of char.
prevLines = []
for line in infile:
    prevLines.append(line.rstrip('\n'))
    if len(prevLines) == 4:
        nucID, nucSeq, qualID, qualSeq = prevLines
        if nucSeq.count(char) <= maxnum:
            outstring = '\n'.join([nucID, nucSeq, qualID, qualSeq]) + '\n'
            outfile.write(outstring)
        prevLines = []

outfile.close()
infile.close()

##############

Copy the part between the ###### lines into an empty text file, save it as XY.py and run:
python XY.py pathtoinfile pathtooutfile characteryouwanttoremove threshold

e.g.
python XY.py /home/me/myreads.fastq /home/me/myreads_filtered.fastq N 4

(This removes all reads with more than 4 Ns, i.e. it writes the others to the outfile.)
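
The same filtering idea can also be sketched as two small Python 3 functions, which makes it easy to try on a few records before running it on a real file. This is only an illustration under the same assumption as the script (every record is exactly four lines); the names `read_fastq` and `filter_reads` and the in-memory example data are hypothetical.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, separator, quality) tuples from a 4-line FASTQ stream."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if len(record) < 4:
            return  # end of input; an incomplete trailing record is dropped
        yield tuple(record)


def filter_reads(records, char="N", maxnum=4):
    """Keep records whose sequence has at most maxnum occurrences of char."""
    for record in records:
        if record[1].count(char) <= maxnum:
            yield record


# Small in-memory example; real use would pass an open file handle instead.
fastq = io.StringIO(
    "@read1\nACGTNNNNN\n+\nIIIIIIIII\n"
    "@read2\nACGTNNACG\n+\nIIIIIIIII\n"
)
kept = list(filter_reads(read_fastq(fastq), char="N", maxnum=4))
for record in kept:
    print("\n".join(record))
```

Here read1 has five Ns and is dropped, while read2 has two and is kept. Because both functions are generators, only one record is held in memory at a time.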

**schmima** · 03-28-2011, 02:47 AM

Regarding the tabs: if necessary, use the code below and replace every * with one tab; ** stands for two tabs, and so on:

import sys

if len(sys.argv) < 5:
*print 'usage:\n\tpython %s infile outfile char maxnum' % (str(sys.argv[0]))
*sys.exit(0)

in_file = sys.argv[1]
out_file = sys.argv[2]
char = sys.argv[3]
maxnum = int(sys.argv[4])

infile = open(in_file, 'r')
outfile = open(out_file, 'w')

prevLines = []
for line in infile:
*prevLines.append(str(line[:-1]))
*if len(prevLines) == 4:
**nucID = prevLines[0]
**nucSeq = prevLines[1]
**qualID = prevLines[2]
**qualSeq = prevLines[3]
**if nucSeq.count(char) <= maxnum:
***outstring = '\n'.join([nucID,nucSeq,qualID,qualSeq])+'\n'
***outfile.write(outstring)
**prevLines = []

if len(prevLines) == 4:
*charcounter = 0
*nucID = prevLines[0]
*nucSeq = prevLines[1]
*qualID = prevLines[2]
*qualSeq = prevLines[3]
*if nucSeq.count(char) <= maxnum:
**outstring = '\n'.join([nucID,nucSeq,qualID,qualSeq])+'\n'
**outfile.write(outstring)

outfile.close()
infile.close()
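
If replacing the asterisks by hand is tedious, the substitution described above could be done with a few lines of Python. This is a minimal sketch: the helper name `stars_to_tabs` is hypothetical, and it assumes that asterisks at the start of a line are always indentation markers (which holds for the script above, but not for code in general).

```python
import re


def stars_to_tabs(text):
    """Replace each run of leading '*' on a line with the same number of tabs."""
    return re.sub(
        r"^\*+",
        lambda match: "\t" * len(match.group(0)),
        text,
        flags=re.MULTILINE,
    )


# Hypothetical snippet in the asterisk notation: one '*' per indentation level.
snippet = "for line in infile:\n*if line:\n**print(line)\n"
print(stars_to_tabs(snippet))
```

Only asterisks at the beginning of a line are converted, so a `*` used elsewhere on a line (for example as a multiplication sign) is left alone.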

Removing reads with "N" > 4
