SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
TopHat/Bowtie - number of reads aligned mgibson Bioinformatics 7 10-22-2011 09:04 PM
How to run Tophat with annotation file masylichu Bioinformatics 2 09-06-2011 08:25 PM
Different number of unique reads using TopHat -g reut Bioinformatics 0 08-29-2011 06:22 AM
set up TOPHAT run with paired end reads PFS Bioinformatics 1 03-08-2011 05:45 PM
Split accepted_hits.bam file after Tophat run? hong_sunwoo Bioinformatics 6 10-18-2010 01:06 AM

Reply
 
Thread Tools
Old 03-23-2011, 09:11 AM   #1
gen2prot
Member
 
Location: Hyderabad, India

Join Date: Apr 2010
Posts: 66
Default prep_reads file in Tophat run shows a different number of reads

Hi All,

I have a very naive question. When I use "grep" on one of my fastq files to check the number of reads I get a certain value.

Code:
grep "@" SRR072905.fastq | wc -l
40042015
However, When I check the prep_reads.log file after a tophat run I get a different number of reads

Code:
more prep_reads.log 
prep_reads v1.1.4 (1709)
---------------------------
4060 out of 26976249 reads have been filtered out
40042015 are in the fastq file, but Tophat says the number is 26976249. Is the program not taking all the reads as input?

Please help so I know if I have to run the program again.

Thanks
Abhijit
gen2prot is offline   Reply With Quote
Old 03-23-2011, 10:55 AM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Using grep "@" to count reads in a fastq file will not work the way you think it will.

Read this thread for an explanation.
kmcarr is offline   Reply With Quote
Old 03-23-2011, 11:20 AM   #3
gen2prot
Member
 
Location: Hyderabad, India

Join Date: Apr 2010
Posts: 66
Default

But fastq is a set of four lines. The "@" line is followed by the sequence, while the "+" line is followed by the sequence quality. Therefore, counting the "@" or the "+" sign should give you the number of reads. Am I wrong?
gen2prot is offline   Reply With Quote
Old 03-23-2011, 11:30 AM   #4
gen2prot
Member
 
Location: Hyderabad, India

Join Date: Apr 2010
Posts: 66
Default

Ignore my previous comment kmcarr. I figured out my mistake. I have to use the "^" at the beginning of the search while using "grep". Thnaks
gen2prot is offline   Reply With Quote
Old 03-23-2011, 11:36 AM   #5
jasonwood
Member
 
Location: RI

Join Date: May 2010
Posts: 10
Default

The problem is that will flag every line with the @ character in it, so if your quality strings have that character you will double count. grep ^@ instead, or just count lines and divide by 4.
jasonwood is offline   Reply With Quote
Old 03-23-2011, 11:57 AM   #6
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,178
Default

Quote:
Originally Posted by gen2prot View Post
Ignore my previous comment kmcarr. I figured out my mistake. I have to use the "^" at the beginning of the search while using "grep". Thnaks
Even using grep ^@ will not work perfectly. As the thread linked above notes, "@" is a valid quality character in Illumina FASTQ files which may even appear at the beginning of a quality line. grep ^@ will count these as well.

You need to use a search pattern for grep which will be absolutely unique to the header line. I typically use ^"@HW". The read IDs start with the machine names; our machine names all start with HW and "W" is not a valid quality character so the string @HW can only appear in read IDs.
kmcarr is offline   Reply With Quote
Old 03-23-2011, 02:50 PM   #7
cram
Member
 
Location: Canada

Join Date: Nov 2008
Posts: 16
Default

Couldn't you just do:

Code:
wc -l foo.fastq
and then divide by 4?

That should run faster too.
cram is offline   Reply With Quote
Old 03-24-2011, 02:50 AM   #8
nicolallias
Member
 
Location: France

Join Date: Jan 2010
Posts: 23
Smile

Quote:
Originally Posted by jasonwood
just count lines and divide by 4.
Quote:
Originally Posted by cram View Post
Code:
wc -l foo.fastq
and then divide by 4?
Not if the fastq is generated with line wrapping - I've seen one of the bwa tool, qualfa2fq.pl doing this on the quality.

But you may overestimate the number of reads if the fastq is encoding quality Sanger's like (in quality line, "@" is available in Sanger's encoding ASCII+33)

You could try to export from fastq-sanger to fastq-illumina, do count with
Code:
grep -c ^@ yourfile.fastq
and then you're sure about the number of sequences.

A shorter way could be to grep some elements of the title (flowcell name, delimitation character(s)...).
Quote:
Originally Posted by kmcarr
You need to use a search pattern for grep which will be absolutely unique to the header line.
Can you show us the head of your fastq ?

Last edited by nicolallias; 03-24-2011 at 02:52 AM.
nicolallias is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:46 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO