Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting specific sequence stretches in fastq files

    Hey all,

    For a while ive been trying to find a way how to count specific sequence stretches directly from fastq files. For example i would like to see how many reads contain the following sequence: "TGACCCGATAGA". It seems to work with grep -c when i use fasta files but not fastq files.

    Is there anything im missing?

    Best,
    Exo

  • #2
    You can still use grep -c, but pipe the file through awk first:

    Code:
    awk '{if(NR%4 == 2) print $0}' sample.fastq | grep -c ...

    Comment


    • #3
      This is a perfect application for BBDuk. I would not recommend grep for this approach with multiline fasta as it does not distinguish between lines and reads, and cannot handle newlines in the middle of the target sequence. And I have no idea why grep would fail with fastq files; that should not be a problem. But anyway -

      bbduk.sh in=reads.fq literal=TGACCCGATAGA k=12 mm=f rcomp=f int=f

      That will tell you exactly how many reads contain that sequence. If you are also interested in reads containing the reverse-complement of the sequence, remove the flag "rcomp=f". The "int=f" flag tells it to treat reads individually rather than as pairs, if your reads are interleaved. k is the kmer length, which in this case is the length of your sequence, since you want an exact match. Incidentally, BBDuk can also tell you how many reads contain a sequence that is 1 substitution away from the target with the flag "hdist=1", and so forth. It can also handle degenerate symbols like 'N' if you add the flag "copyundefined", which is occasionally useful when dealing with amplification primers.
      Last edited by Brian Bushnell; 06-12-2016, 09:19 AM.

      Comment


      • #4
        Originally posted by dpryan View Post
        You can still use grep -c, but pipe the file through awk first:

        Code:
        awk '{if(NR%4 == 2) print $0}' sample.fastq | grep -c ...
        Returns at max 1 hit per sequence though..
        savetherhino.org

        Comment


        • #5
          Originally posted by rhinoceros View Post
          Returns at max 1 hit per sequence though..
          Yes, though that's typically what's desired.

          Comment


          • #6
            Originally posted by rhinoceros View Post
            Returns at max 1 hit per sequence though..
            Hi- Is that a problem? The OP asks for the number of reads containing pattern. So if a read contains the pattern twice it should still be counted as 1, which is what grep -c does, isn't it?

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Essential Discoveries and Tools in Epitranscriptomics
              by seqadmin


              The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist on Modified Bases...
              Yesterday, 07:01 AM
            • seqadmin
              Current Approaches to Protein Sequencing
              by seqadmin


              Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
              04-04-2024, 04:25 PM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, 04-11-2024, 12:08 PM
            0 responses
            55 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 10:19 PM
            0 responses
            51 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-10-2024, 09:21 AM
            0 responses
            45 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 04-04-2024, 09:00 AM
            0 responses
            55 views
            0 likes
            Last Post seqadmin  
            Working...
            X