
**kmcarr** · 03-23-2011, 09:55 AM

Using grep "@" to count reads in a fastq file will not work the way you think it will.

Read this thread for an explanation.

**gen2prot** · 03-23-2011, 10:20 AM

But fastq is a set of four lines. The "@" line is followed by the sequence, while the "+" line is followed by the sequence quality. Therefore, counting the "@" or the "+" sign should give you the number of reads. Am I wrong?

**gen2prot** · 03-23-2011, 10:30 AM

Ignore my previous comment kmcarr. I figured out my mistake. I have to use the "^" at the beginning of the search while using "grep". Thanks

**jasonwood** · 03-23-2011, 10:36 AM

The problem is that it will flag every line with the @ character in it, so if your quality strings contain that character you will overcount. grep ^@ instead, or just count lines and divide by 4.

**kmcarr** · 03-23-2011, 10:57 AM

Originally posted by gen2prot

Ignore my previous comment kmcarr. I figured out my mistake. I have to use the "^" at the beginning of the search while using "grep". Thanks

Even using grep ^@ will not work perfectly. As the thread linked above notes, "@" is a valid quality character in Illumina FASTQ files which may even appear at the beginning of a quality line. grep ^@ will count these as well.
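To make the difference concrete, here is a minimal sketch using a made-up two-read FASTQ (the read names and quality strings are invented for illustration). One quality line starts with "@" and another contains it in the middle, so both patterns overcount.

```shell
# Build a tiny two-read FASTQ (invented data) whose quality lines contain "@".
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
@read1
ACGT
+
@IIH
@read2
TTGA
+
II@I
EOF

any_at=$(grep -c '@' "$tmp")      # counts every line containing "@"
anchored=$(grep -c '^@' "$tmp")   # counts only lines that start with "@"

# The file holds 2 reads, yet this prints: any_at=4 anchored=3
echo "any_at=$any_at anchored=$anchored"
rm -f "$tmp"
```

The anchored pattern is closer, but the quality line `@IIH` is still counted as a header.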

You need to use a search pattern for grep which will be absolutely unique to the header line. I typically use ^"@HW". The read IDs start with the machine names; our machine names all start with HW and "W" is not a valid quality character so the string @HW can only appear in read IDs.
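A sketch of that approach on invented data follows. The `@HW` prefix and the read IDs are assumptions taken from the description above; whether a given prefix is really unique depends on your instrument names and quality encoding, so check it against your own files.

```shell
# Two invented reads whose IDs start with "@HW"; one quality line starts with "@".
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
@HWI-EAS000:1:1:10:20
ACGT
+
@IIH
@HWI-EAS000:1:1:11:21
TTGA
+
IIII
EOF

prefix='@HW'                             # assumption: every read ID starts with this
headers=$(grep -c "^${prefix}" "$tmp")   # header lines matching the prefix
by_lines=$(( $(wc -l < "$tmp") / 4 ))    # cross-check: total lines divided by 4

# Prints: headers=2 by_lines=2
echo "headers=$headers by_lines=$by_lines"
rm -f "$tmp"
```

If the two numbers disagree on a real file, either the prefix is not unique to header lines or the records are not strictly four lines each.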

**cram** · 03-23-2011, 01:50 PM

Couldn't you just do:

Code:

wc -l foo.fastq

and then divide by 4?

That should run faster too.
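Spelled out with the division included, a minimal sketch on an invented two-read file (this assumes every record is exactly four lines):

```shell
# Two invented four-line records.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
@read1
ACGT
+
@IIH
@read2
TTGA
+
II@I
EOF

lines=$(wc -l < "$tmp")      # reading from stdin makes wc print only the number
reads=$((lines / 4))         # integer division
leftover=$((lines % 4))      # non-zero means the file is not a multiple of 4 lines

# Prints: reads=2 leftover=0
echo "reads=$reads leftover=$leftover"
rm -f "$tmp"
```

Note that `wc -l` counts newline characters, so a file missing its final newline reports one line fewer; a non-zero `leftover` is a cheap warning sign for that or for wrapped records.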

**nicolallias** · 03-24-2011, 01:50 AM

Originally posted by jasonwood

just count lines and divide by 4.

Originally posted by cram

Code:

wc -l foo.fastq

and then divide by 4?

Not if the FASTQ was generated with line wrapping. I've seen one of the bwa tools, qualfa2fq.pl, do this to the quality lines.
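For wrapped files, one option is to count records rather than lines. The awk sketch below is only an illustration on invented data: it assumes well-formed records in which the quality string has the same total length as the sequence, and it reads quality lines until that length is reached, so an "@" at the start of a quality line is not mistaken for a header.

```shell
# Invented FASTQ: read1 is wrapped over two sequence and two quality lines.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
@read1
ACGTAC
GT
+
@IIHII
@I
@read2
TTGA
+
II@I
EOF

naive=$(( $(wc -l < "$tmp") / 4 ))   # lines divided by 4, wrong for wrapped files

records=$(awk '
BEGIN { state = "header"; reads = 0 }
{
    if (state == "header") {
        # A new record starts at a line beginning with "@".
        if (substr($0, 1, 1) == "@") { reads++; seqlen = 0; quallen = 0; state = "seq" }
    } else if (state == "seq") {
        # Sequence lines continue until the "+" separator line.
        if (substr($0, 1, 1) == "+") state = "qual"
        else seqlen += length($0)
    } else {
        # Quality lines continue until they cover the whole sequence.
        quallen += length($0)
        if (quallen >= seqlen) state = "header"
    }
}
END { print reads }
' "$tmp")

# The 10 lines give naive=2 only by truncation (10 / 4 = 2.5); records=2 is the real count.
echo "naive=$naive records=$records"
rm -f "$tmp"
```

On this file `grep -c '^@'` would report 4, since two of the quality lines start with "@".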

With grep ^@, however, you may overestimate the number of reads if the FASTQ uses Sanger-style quality encoding (ASCII+33), where "@" is a valid character in the quality line.

You could try converting from fastq-sanger to fastq-illumina, then count with

Code:

grep -c ^@ yourfile.fastq

and then you're sure about the number of sequences.

A shorter way could be to grep for some element of the title line (flowcell name, delimiter characters, ...).

Originally posted by kmcarr

You need to use a search pattern for grep which will be absolutely unique to the header line.

Can you show us the head of your FASTQ?


prep_reads file in Tophat run shows a different number of reads
