Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • retrieve reads containing kmer

    Hey folks,
    Wondering if anyone has written a script to do this?

    Basically, I has a list of kmers that are over represented in a specific sample. I would like to retrieve all of my reads containing this kmer (and the kmers reverse complement) from my fasta/fastq file retaining the read identifier information.

    So, input would be a specific kmer and a read file (in fasta or fastq) and the output would be a fasta/fastq of the reads containing my kmer.

    Thanks

  • #2
    For anyone interested...

    Good ole reliable grep:
    $ grep -B 1 'GGTATTCTATTGGTA' reads.fq | grep -v '\-\-' | sed 's/@HWI/>HWI/g'

    Comment


    • #3
      What about the reverse complement ?

      _

      notes ...

      -bash-3.00$ /h1/finneyr/bin/revcomp GGTATTCTATTGGTA
      TACCAATAGAATACC
      -bash-3.00$ grep -B 1 'GGTATTCTATTGGTA' FASTQS/98023.1.fq
      @NCI-GA4_1:2:65:1268:1284/1
      TTCATGATGGAATCATTGATCCATTGACCAGTTGAGAAATGATTGATGCTGAAGAGGTGGTATTCTATTGGTAAG
      -bash-3.00$ grep -B 1 'TACCAATAGAATACC' FASTQS/98023.1.fq
      @NCI-GA4_1:2:48:975:432/1
      TGGATAAGTTTTTGGTAGCCTGCTTCATTATCGGAAGTTTGGTAATAAGCCTTACCAATAGAATACCACATCTTC

      Comment


      • #4
        Thanks Richard...

        I'm trying to automate this because I'll be doing it with many files/comparisons.

        I've tried to embed a command line perl command into a shell script. Unfortunately, I'm not very good at shell scripting or perl scripting, but I've patched together this script:

        Syntax: kmer_RC.sh KMERFILE > KMER_RC

        #!/bin/bash
        FILENAME=$1
        while read LINE
        do
        perl -p -e 'tr/ATCG/TAGC/' $1 | perl -l -n -e 'print scalar reverse $_';

        done < $FILENAME


        Here are the contents of my TEST_kmerlist file (the input file):
        AAAAGTTGCAGTGGA
        AAAGTGCAAATGTGT
        AAAGTTGCAGTGGAC
        AAATGTGTTGTTAGA
        AACAAGATGAAAGTG

        However, when I execute the command, it is printing the reverse complements 5 time...
        TCCACTGCAACTTTT
        ACACATTTGCACTTT
        GTCCACTGCAACTTT
        TCTAACAACACATTT
        CACTTTCATCTTGTT
        TCCACTGCAACTTTT
        ACACATTTGCACTTT
        GTCCACTGCAACTTT
        TCTAACAACACATTT
        CACTTTCATCTTGTT
        TCCACTGCAACTTTT
        ACACATTTGCACTTT
        GTCCACTGCAACTTT
        TCTAACAACACATTT
        CACTTTCATCTTGTT
        TCCACTGCAACTTTT
        ACACATTTGCACTTT
        GTCCACTGCAACTTT
        TCTAACAACACATTT
        CACTTTCATCTTGTT
        TCCACTGCAACTTTT
        ACACATTTGCACTTT
        GTCCACTGCAACTTT
        TCTAACAACACATTT
        CACTTTCATCTTGTT

        For simplicity:
        1a) TCCACTGCAACTTTT
        2a) ACACATTTGCACTTT
        3a) GTCCACTGCAACTTT
        4a) TCTAACAACACATTT
        5a) CACTTTCATCTTGTT
        1b)TCCACTGCAACTTTT
        2b) ACACATTTGCACTTT
        3b) GTCCACTGCAACTTT
        4b) TCTAACAACACATTT
        5b) CACTTTCATCTTGTT
        1c) TCCACTGCAACTTTT
        2c) ACACATTTGCACTTT
        3c) GTCCACTGCAACTTT
        4c) TCTAACAACACATTT
        5c) CACTTTCATCTTGTT
        1d) TCCACTGCAACTTTT
        2d) ACACATTTGCACTTT
        3d) GTCCACTGCAACTTT
        4d) TCTAACAACACATTT
        5d) CACTTTCATCTTGTT
        1e) TCCACTGCAACTTTT
        2e) ACACATTTGCACTTT
        3e) GTCCACTGCAACTTT
        4e) TCTAACAACACATTT
        5e) CACTTTCATCTTGTT

        For the life of me, I can't figure out why this is repeating exactly 4 times. When I run this as the perl command only, (perl -p -e 'tr/ATCG/TAGC/' KMERFILE | perl -l -n -e 'print scalar reverse $_' it works perfectly.

        Any ideas?

        Thanks,

        John

        Comment


        • #5
          retrieve reads containing kmer

          Hi,

          when you run the shell script, the perl command is inside a while loop,
          so it is executing the perl command each time it goes through the loop.

          You have 5 lines in your test file, each time the shell script reads a line,
          perl processes the whole input file ($1).

          Best wishes,
          Maria

          Comment


          • #6
            Ahh...that explains it! Thank you Maria.

            Is there a simple way to modify the script so that I can corretc this?

            Thanks...and apologies for my naivety

            Comment


            • #7
              retrieve reads containing kmer

              try replacing $1 in the perl command with $LINE

              Comment


              • #8
                retrieve reads containing kmer

                Originally posted by mastal View Post
                try replacing $1 in the perl command with $LINE
                Actually, my previous suggestion doesn't work,

                but what does work is removing the shell while loop - the 3 lines beginning with
                while, do, and done,

                but I don't see this as being a solution for working with a full-size sequencing file, it would be much better to have a script that processes the file line by line.

                maria

                Comment


                • #9
                  Hi Maria,
                  Sorry -- I meant to reply earlier. Yes, the script works fine just with the command line perl statement. Luckily, I am not doing this on the whole genome scale. In brief, I'm intereste din a subset of Kmers that vary between samples -- so I am pulling only reads which contain these kmers out. In total, its about 2,000 kmers and I am getting a couple hundred thousand reads from an already reduced data set of 8 million reads.
                  Thanks for you suggestions!

                  Comment


                  • #10
                    Looks like you have it sorted. Nevertheless, for others doing this, here's a 2 minute python (v2 or 3) that will do the same with fastq or fasta (single line sequences). Very quick so no controlling for bad inputs - use at own risk, modify as you like etc etc.

                    Code:
                    from sys import argv, version_info
                    from os.path import realpath, splitext
                    from itertools import islice
                    if version_info[0] == 2: import string as str
                    
                    f_name, kmer = realpath(argv[1]), argv[2].upper()
                    f_basename, f_ext = splitext(f_name)
                    
                    atgc_complement = str.maketrans('AaTtUuCcGgXxNnRrYyWwSsKkMmDdVvHhBb-.',
                                                     'TTAAAAGGCXXXNNYYRRWWSSMMKKHHBBDDVV-.')
                    
                    kmers = [kmer, kmer[::-1].translate(atgc_complement)]
                    
                    if '.fq' in f_name or '.fastq' in f_name:
                        chunksize = 4
                    elif '.fa' in f_name or '.fasta' in f_name:
                        chunksize = 2
                    
                    with open(f_name, 'r') as f_obj:
                        with open(f_basename + '_' + kmer + f_ext, 'w') as outfile:
                            while True:
                                lines = list(islice(f_obj, chunksize))
                                if not lines: break
                                seq = lines[1].strip().upper()
                                for kmer in kmers:
                                    if kmer in seq:
                                        outfile.writelines(lines)
                                        break
                    Run as ...

                    Code:
                    python <scriptname.py> <FQorFAfile> <kmer>
                    e.g.
                    python kmercollect.py /data/hello.fq ATCATCGTACGACT
                    Last edited by A.N.Other; 03-19-2013, 10:06 AM.

                    Comment


                    • #11
                      This is fantastic. Thanks for sharing. I'm sure it will be faster than my pipeline (which takes about 6 hours from the data set I described).

                      Comment


                      • #12
                        Sorry, just changed it again, actually. Much faster now - I was doing it in a daft way as had just pulled it together for you from separate scripts.

                        Comment


                        • #13
                          Noticed you want to do it for a list of kmers. Assuming you have them in a text file, 1 per line, then this will pull out everything at once for you:

                          Code:
                          from sys import argv, version_info
                          from os.path import realpath, splitext
                          from itertools import islice
                          if version_info[0] == 2: import string as str
                          
                          f_name, kfile_name = realpath(argv[1]), realpath(argv[2])
                          f_basename, f_ext = splitext(f_name)
                          
                          atgc_complement = str.maketrans('AaTtUuCcGgXxNnRrYyWwSsKkMmDdVvHhBb-.',
                                                          'TTAAAAGGCXXXNNYYRRWWSSMMKKHHBBDDVV-.')
                          
                          with open(kfile_name, 'r') as f_obj: kmers = [line.strip().upper() for line in f_obj if line.strip()]
                          kmers += [kmer[::-1].translate(atgc_complement) for kmer in kmers]
                          
                          if '.fq' in f_name or '.fastq' in f_name:
                              chunksize = 4
                          elif '.fa' in f_name or '.fasta' in f_name:
                              chunksize = 2
                          
                          with open(f_name, 'r') as f_obj:
                              with open(f_basename + '_kmers' + f_ext, 'w') as outfile:
                                  while True:
                                      lines = list(islice(f_obj, chunksize))
                                      if not lines: break
                                      seq = lines[1].strip().upper()
                                      for kmer in kmers:
                                          if kmer in seq:
                                              outfile.writelines(lines)
                                              break
                          Run as ...

                          Code:
                          python kmercollect.py /data/hello.fq /data/kmers.txt
                          Last edited by A.N.Other; 03-19-2013, 10:04 AM.

                          Comment

                          Latest Articles

                          Collapse

                          • seqadmin
                            Essential Discoveries and Tools in Epitranscriptomics
                            by seqadmin




                            The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                            04-22-2024, 07:01 AM
                          • seqadmin
                            Current Approaches to Protein Sequencing
                            by seqadmin


                            Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                            04-04-2024, 04:25 PM

                          ad_right_rmr

                          Collapse

                          News

                          Collapse

                          Topics Statistics Last Post
                          Started by seqadmin, Today, 08:47 AM
                          0 responses
                          12 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-11-2024, 12:08 PM
                          0 responses
                          60 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 10:19 PM
                          0 responses
                          59 views
                          0 likes
                          Last Post seqadmin  
                          Started by seqadmin, 04-10-2024, 09:21 AM
                          0 responses
                          54 views
                          0 likes
                          Last Post seqadmin  
                          Working...
                          X