Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting specific sequence stretches in fastq files

    Hey all,

    For a while ive been trying to find a way how to count specific sequence stretches directly from fastq files. For example i would like to see how many reads contain the following sequence: "TGACCCGATAGA". It seems to work with grep -c when i use fasta files but not fastq files.

    Is there anything im missing?

    Best,
    Exo

  • #2
    You can still use grep -c, but pipe the file through awk first:

    Code:
    awk '{if(NR%4 == 2) print $0}' sample.fastq | grep -c ...

    Comment


    • #3
      This is a perfect application for BBDuk. I would not recommend grep for this approach with multiline fasta as it does not distinguish between lines and reads, and cannot handle newlines in the middle of the target sequence. And I have no idea why grep would fail with fastq files; that should not be a problem. But anyway -

      bbduk.sh in=reads.fq literal=TGACCCGATAGA k=12 mm=f rcomp=f int=f

      That will tell you exactly how many reads contain that sequence. If you are also interested in reads containing the reverse-complement of the sequence, remove the flag "rcomp=f". The "int=f" flag tells it to treat reads individually rather than as pairs, if your reads are interleaved. k is the kmer length, which in this case is the length of your sequence, since you want an exact match. Incidentally, BBDuk can also tell you how many reads contain a sequence that is 1 substitution away from the target with the flag "hdist=1", and so forth. It can also handle degenerate symbols like 'N' if you add the flag "copyundefined", which is occasionally useful when dealing with amplification primers.
      Last edited by Brian Bushnell; 06-12-2016, 09:19 AM.

      Comment


      • #4
        Originally posted by dpryan View Post
        You can still use grep -c, but pipe the file through awk first:

        Code:
        awk '{if(NR%4 == 2) print $0}' sample.fastq | grep -c ...
        Returns at max 1 hit per sequence though..
        savetherhino.org

        Comment


        • #5
          Originally posted by rhinoceros View Post
          Returns at max 1 hit per sequence though..
          Yes, though that's typically what's desired.

          Comment


          • #6
            Originally posted by rhinoceros View Post
            Returns at max 1 hit per sequence though..
            Hi- Is that a problem? The OP asks for the number of reads containing pattern. So if a read contains the pattern twice it should still be counted as 1, which is what grep -c does, isn't it?

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Strategies for Sequencing Challenging Samples
              by seqadmin


              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
              03-22-2024, 06:39 AM
            • seqadmin
              Techniques and Challenges in Conservation Genomics
              by seqadmin



              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

              Avian Conservation
              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
              03-08-2024, 10:41 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, Yesterday, 06:37 PM
            0 responses
            10 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, Yesterday, 06:07 PM
            0 responses
            9 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-22-2024, 10:03 AM
            0 responses
            49 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-21-2024, 07:32 AM
            0 responses
            67 views
            0 likes
            Last Post seqadmin  
            Working...
            X