Seqanswers Leaderboard Ad

**jgibbons1** · 03-15-2013, 01:22 PM

For anyone interested...

Good ole reliable grep:
$ grep -B 1 'GGTATTCTATTGGTA' reads.fq | grep -v '\-\-' | sed 's/@HWI/>HWI/g'

**Richard Finney** · 03-15-2013, 01:46 PM

What about the reverse complement ?

_

notes ...

-bash-3.00$ /h1/finneyr/bin/revcomp GGTATTCTATTGGTA
TACCAATAGAATACC
-bash-3.00$ grep -B 1 'GGTATTCTATTGGTA' FASTQS/98023.1.fq
@NCI-GA4_1:2:65:1268:1284/1
TTCATGATGGAATCATTGATCCATTGACCAGTTGAGAAATGATTGATGCTGAAGAGGTGGTATTCTATTGGTAAG
-bash-3.00$ grep -B 1 'TACCAATAGAATACC' FASTQS/98023.1.fq
@NCI-GA4_1:2:48:975:432/1
TGGATAAGTTTTTGGTAGCCTGCTTCATTATCGGAAGTTTGGTAATAAGCCTTACCAATAGAATACCACATCTTC

**jgibbons1** · 03-18-2013, 07:57 AM

Thanks Richard...

I'm trying to automate this because I'll be doing it with many files/comparisons.

I've tried to embed a command line perl command into a shell script. Unfortunately, I'm not very good at shell scripting or perl scripting, but I've patched together this script:

Syntax: kmer_RC.sh KMERFILE > KMER_RC

#!/bin/bash
FILENAME=$1
while read LINE
do
perl -p -e 'tr/ATCG/TAGC/' $1 | perl -l -n -e 'print scalar reverse $_';

done < $FILENAME

Here are the contents of my TEST_kmerlist file (the input file):
AAAAGTTGCAGTGGA
AAAGTGCAAATGTGT
AAAGTTGCAGTGGAC
AAATGTGTTGTTAGA
AACAAGATGAAAGTG

However, when I execute the command, it is printing the reverse complements 5 time...
TCCACTGCAACTTTT
ACACATTTGCACTTT
GTCCACTGCAACTTT
TCTAACAACACATTT
CACTTTCATCTTGTT
TCCACTGCAACTTTT
ACACATTTGCACTTT
GTCCACTGCAACTTT
TCTAACAACACATTT
CACTTTCATCTTGTT
TCCACTGCAACTTTT
ACACATTTGCACTTT
GTCCACTGCAACTTT
TCTAACAACACATTT
CACTTTCATCTTGTT
TCCACTGCAACTTTT
ACACATTTGCACTTT
GTCCACTGCAACTTT
TCTAACAACACATTT
CACTTTCATCTTGTT
TCCACTGCAACTTTT
ACACATTTGCACTTT
GTCCACTGCAACTTT
TCTAACAACACATTT
CACTTTCATCTTGTT

For simplicity:
1a) TCCACTGCAACTTTT
2a) ACACATTTGCACTTT
3a) GTCCACTGCAACTTT
4a) TCTAACAACACATTT
5a) CACTTTCATCTTGTT
1b)TCCACTGCAACTTTT
2b) ACACATTTGCACTTT
3b) GTCCACTGCAACTTT
4b) TCTAACAACACATTT
5b) CACTTTCATCTTGTT
1c) TCCACTGCAACTTTT
2c) ACACATTTGCACTTT
3c) GTCCACTGCAACTTT
4c) TCTAACAACACATTT
5c) CACTTTCATCTTGTT
1d) TCCACTGCAACTTTT
2d) ACACATTTGCACTTT
3d) GTCCACTGCAACTTT
4d) TCTAACAACACATTT
5d) CACTTTCATCTTGTT
1e) TCCACTGCAACTTTT
2e) ACACATTTGCACTTT
3e) GTCCACTGCAACTTT
4e) TCTAACAACACATTT
5e) CACTTTCATCTTGTT

For the life of me, I can't figure out why this is repeating exactly 4 times. When I run this as the perl command only, (perl -p -e 'tr/ATCG/TAGC/' KMERFILE | perl -l -n -e 'print scalar reverse $_'

it works perfectly.

Any ideas?

Thanks,

John

**mastal** · 03-18-2013, 08:57 AM

retrieve reads containing kmer

Hi,

when you run the shell script, the perl command is inside a while loop,
so it is executing the perl command each time it goes through the loop.

You have 5 lines in your test file, each time the shell script reads a line,
perl processes the whole input file ($1).

Best wishes,
Maria

**jgibbons1** · 03-18-2013, 09:03 AM

Ahh...that explains it! Thank you Maria.

Is there a simple way to modify the script so that I can corretc this?

Thanks...and apologies for my naivety

**mastal** · 03-18-2013, 11:01 AM

retrieve reads containing kmer

try replacing $1 in the perl command with $LINE

**mastal** · 03-18-2013, 02:59 PM

retrieve reads containing kmer

Originally posted by mastal View Post

try replacing $1 in the perl command with $LINE

Actually, my previous suggestion doesn't work,

but what does work is removing the shell while loop - the 3 lines beginning with
while, do, and done,

but I don't see this as being a solution for working with a full-size sequencing file, it would be much better to have a script that processes the file line by line.

maria

**jgibbons1** · 03-19-2013, 06:13 AM

Hi Maria,
Sorry -- I meant to reply earlier. Yes, the script works fine just with the command line perl statement. Luckily, I am not doing this on the whole genome scale. In brief, I'm intereste din a subset of Kmers that vary between samples -- so I am pulling only reads which contain these kmers out. In total, its about 2,000 kmers and I am getting a couple hundred thousand reads from an already reduced data set of 8 million reads.
Thanks for you suggestions!

**A.N.Other** · 03-19-2013, 09:03 AM

Looks like you have it sorted. Nevertheless, for others doing this, here's a 2 minute python (v2 or 3) that will do the same with fastq or fasta (single line sequences). Very quick so no controlling for bad inputs - use at own risk, modify as you like etc etc.

Code:

from sys import argv, version_info
from os.path import realpath, splitext
from itertools import islice
if version_info[0] == 2: import string as str

f_name, kmer = realpath(argv[1]), argv[2].upper()
f_basename, f_ext = splitext(f_name)

atgc_complement = str.maketrans('AaTtUuCcGgXxNnRrYyWwSsKkMmDdVvHhBb-.',
                                 'TTAAAAGGCXXXNNYYRRWWSSMMKKHHBBDDVV-.')

kmers = [kmer, kmer[::-1].translate(atgc_complement)]

if '.fq' in f_name or '.fastq' in f_name:
    chunksize = 4
elif '.fa' in f_name or '.fasta' in f_name:
    chunksize = 2

with open(f_name, 'r') as f_obj:
    with open(f_basename + '_' + kmer + f_ext, 'w') as outfile:
        while True:
            lines = list(islice(f_obj, chunksize))
            if not lines: break
            seq = lines[1].strip().upper()
            for kmer in kmers:
                if kmer in seq:
                    outfile.writelines(lines)
                    break

Run as ...

Code:

python <scriptname.py> <FQorFAfile> <kmer>
e.g.
python kmercollect.py /data/hello.fq ATCATCGTACGACT

**jgibbons1** · 03-19-2013, 09:08 AM

This is fantastic. Thanks for sharing. I'm sure it will be faster than my pipeline (which takes about 6 hours from the data set I described).

**A.N.Other** · 03-19-2013, 09:22 AM

Sorry, just changed it again, actually. Much faster now - I was doing it in a daft way as had just pulled it together for you from separate scripts.

**A.N.Other** · 03-19-2013, 09:50 AM

Noticed you want to do it for a list of kmers. Assuming you have them in a text file, 1 per line, then this will pull out everything at once for you:

Code:

from sys import argv, version_info
from os.path import realpath, splitext
from itertools import islice
if version_info[0] == 2: import string as str

f_name, kfile_name = realpath(argv[1]), realpath(argv[2])
f_basename, f_ext = splitext(f_name)

atgc_complement = str.maketrans('AaTtUuCcGgXxNnRrYyWwSsKkMmDdVvHhBb-.',
                                'TTAAAAGGCXXXNNYYRRWWSSMMKKHHBBDDVV-.')

with open(kfile_name, 'r') as f_obj: kmers = [line.strip().upper() for line in f_obj if line.strip()]
kmers += [kmer[::-1].translate(atgc_complement) for kmer in kmers]

if '.fq' in f_name or '.fastq' in f_name:
    chunksize = 4
elif '.fa' in f_name or '.fasta' in f_name:
    chunksize = 2

with open(f_name, 'r') as f_obj:
    with open(f_basename + '_kmers' + f_ext, 'w') as outfile:
        while True:
            lines = list(islice(f_obj, chunksize))
            if not lines: break
            seq = lines[1].strip().upper()
            for kmer in kmers:
                if kmer in seq:
                    outfile.writelines(lines)
                    break

Run as ...

Code:

python kmercollect.py /data/hello.fq /data/kmers.txt

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Today, 08:47 AM	0 responses 12 views 0 likes	Last Post by seqadmin Today, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

retrieve reads containing kmer

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News