Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • prep_reads file in Tophat run shows a different number of reads

    Hi All,

    I have a very naive question. When I use "grep" on one of my fastq files to check the number of reads I get a certain value.

    Code:
    grep "@" SRR072905.fastq | wc -l
    40042015
    However, When I check the prep_reads.log file after a tophat run I get a different number of reads

    Code:
    more prep_reads.log 
    prep_reads v1.1.4 (1709)
    ---------------------------
    4060 out of 26976249 reads have been filtered out
    40042015 are in the fastq file, but Tophat says the number is 26976249. Is the program not taking all the reads as input?

    Please help so I know if I have to run the program again.

    Thanks
    Abhijit

  • #2
    Using grep "@" to count reads in a fastq file will not work the way you think it will.

    Read this thread for an explanation.

    Comment


    • #3
      But fastq is a set of four lines. The "@" line is followed by the sequence, while the "+" line is followed by the sequence quality. Therefore, counting the "@" or the "+" sign should give you the number of reads. Am I wrong?

      Comment


      • #4
        Ignore my previous comment kmcarr. I figured out my mistake. I have to use the "^" at the beginning of the search while using "grep". Thnaks

        Comment


        • #5
          The problem is that will flag every line with the @ character in it, so if your quality strings have that character you will double count. grep ^@ instead, or just count lines and divide by 4.

          Comment


          • #6
            Originally posted by gen2prot View Post
            Ignore my previous comment kmcarr. I figured out my mistake. I have to use the "^" at the beginning of the search while using "grep". Thnaks
            Even using grep ^@ will not work perfectly. As the thread linked above notes, "@" is a valid quality character in Illumina FASTQ files which may even appear at the beginning of a quality line. grep ^@ will count these as well.

            You need to use a search pattern for grep which will be absolutely unique to the header line. I typically use ^"@HW". The read IDs start with the machine names; our machine names all start with HW and "W" is not a valid quality character so the string @HW can only appear in read IDs.

            Comment


            • #7
              Couldn't you just do:

              Code:
              wc -l foo.fastq
              and then divide by 4?

              That should run faster too.

              Comment


              • #8
                Originally posted by jasonwood
                just count lines and divide by 4.
                Originally posted by cram View Post
                Code:
                wc -l foo.fastq
                and then divide by 4?
                Not if the fastq is generated with line wrapping - I've seen one of the bwa tool, qualfa2fq.pl doing this on the quality.

                But you may overestimate the number of reads if the fastq is encoding quality Sanger's like (in quality line, "@" is available in Sanger's encoding ASCII+33)

                You could try to export from fastq-sanger to fastq-illumina, do count with
                Code:
                grep -c ^@ yourfile.fastq
                and then you're sure about the number of sequences.

                A shorter way could be to grep some elements of the title (flowcell name, delimitation character(s)...).
                Originally posted by kmcarr
                You need to use a search pattern for grep which will be absolutely unique to the header line.
                Can you show us the head of your fastq ?
                Last edited by nicolallias; 03-24-2011, 01:52 AM.

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM
                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                25 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                28 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                24 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                52 views
                0 likes
                Last Post seqadmin  
                Working...
                X