#1 |
Member
Location: Hyderabad, India Join Date: Apr 2010
Posts: 66
Hi All,
I have a very naive question. When I use "grep" on one of my fastq files to check the number of reads, I get a certain value.
Code:
grep "@" SRR072905.fastq | wc -l
40042015
Code:
more prep_reads.log
prep_reads v1.1.4 (1709)
---------------------------
4060 out of 26976249 reads have been filtered out
Please help so I know if I have to run the program again.
Thanks
Abhijit
#3 |
Member
Location: Hyderabad, India Join Date: Apr 2010
Posts: 66
But a fastq record is a set of four lines. The "@" line is followed by the sequence, while the "+" line is followed by the sequence quality. Therefore, counting the "@" or the "+" signs should give you the number of reads. Am I wrong?
#4 |
Member
Location: Hyderabad, India Join Date: Apr 2010
Posts: 66
Ignore my previous comment, kmcarr. I figured out my mistake. I have to use "^" at the beginning of the search pattern when using "grep". Thanks
#5 |
Member
Location: RI Join Date: May 2010
Posts: 10
The problem is that it will flag every line with the @ character in it, so if your quality strings contain that character you will overcount. Use grep ^@ instead, or just count lines and divide by 4.
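To make the two suggestions above concrete, here is a minimal sketch on a tiny made-up FASTQ file. The file contents and the `count_reads` helper are hypothetical, and the line-count approach assumes every record is exactly four lines (no wrapped sequences). Note that anchoring with `^@` can still overcount when a quality string happens to start with "@".

```shell
# Build a tiny two-read FASTQ file (hypothetical data). The second
# quality string starts with "@", which is a valid Phred+33 character.
demo_fastq="$(mktemp)"
printf '%s\n' \
  '@read1' 'ACGT' '+' 'IIII' \
  '@read2' 'ACGT' '+' '@III' > "$demo_fastq"

# Count reads as total lines divided by 4.
# Assumes exactly four lines per record.
count_reads() {
  echo $(( $(wc -l < "$1") / 4 ))
}

count_reads "$demo_fastq"      # prints 2
grep -c '^@' "$demo_fastq"     # prints 3: the "@III" quality line also matches
```

On this file the line count gives the true number of reads, while the anchored grep is off by one.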
#6 | |
Senior Member
Location: USA, Midwest Join Date: May 2008
Posts: 1,178
You need to use a search pattern for grep which will be absolutely unique to the header line. I typically use ^"@HW". The read IDs start with the machine names; our machine names all start with HW and "W" is not a valid quality character so the string @HW can only appear in read IDs.
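A minimal sketch of the prefix idea, assuming the read IDs share a known prefix. The `count_by_prefix` helper and the demo reads are hypothetical, and `HW` is only an example; other files will need a different prefix. Which characters can occur in a quality string depends on the quality encoding, so treat this as a heuristic rather than a guarantee.

```shell
# Count header lines whose read ID starts with a known prefix.
# Usage: count_by_prefix PREFIX FILE  (PREFIX is used as a grep pattern)
count_by_prefix() {
  grep -c "^@$1" "$2"
}

# Hypothetical demo data: two reads; the second quality string starts with "@".
demo_fastq="$(mktemp)"
printf '%s\n' \
  '@HWI-EAS1:1:1' 'ACGT' '+' 'IIII' \
  '@HWI-EAS1:1:2' 'ACGT' '+' '@III' > "$demo_fastq"

count_by_prefix HW "$demo_fastq"   # prints 2
```

Here the "@III" quality line is not counted because it does not continue with the prefix.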
#7 |
Member
Location: Canada Join Date: Nov 2008
Posts: 16
Couldn't you just do:
Code:
wc -l foo.fastq
That should run faster too.
#8 | ||
Member
Location: France Join Date: Jan 2010
Posts: 23
But you may overestimate the number of reads if the fastq quality scores use Sanger-style encoding (in the quality line, "@" is a valid character in Sanger's ASCII+33 encoding). You could try converting from fastq-sanger to fastq-illumina, then count with
Code:
grep -c ^@ yourfile.fastq
A shorter way could be to grep for some elements of the title (flowcell name, delimiter character(s)...).
Last edited by nicolallias; 03-24-2011 at 02:52 AM. |
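For anyone who wants a count that does not depend on the quality encoding at all, one option is to count records by position rather than by pattern. This is only a sketch: `count_records_strict` is a hypothetical helper, and it assumes unwrapped four-line records. It also checks that lines 1, 5, 9, ... start with "@" and lines 3, 7, 11, ... start with "+", so a truncated or malformed file is reported instead of silently miscounted.

```shell
# Count FASTQ records by line position instead of by pattern matching.
# Assumes unwrapped four-line records; returns non-zero on a malformed file.
count_records_strict() {
  awk '
    NR % 4 == 1 { if (substr($0, 1, 1) != "@") { bad = NR; exit } n++ }
    NR % 4 == 3 { if (substr($0, 1, 1) != "+") { bad = NR; exit } }
    END {
      if (bad) { print "malformed record at line " bad > "/dev/stderr"; exit 1 }
      if (NR % 4 != 0) { print "line count is not a multiple of 4" > "/dev/stderr"; exit 1 }
      print n + 0
    }
  ' "$1"
}
```

Usage would be `count_records_strict yourfile.fastq`; quality lines that begin with "@" are not counted because only every fourth line is treated as a header.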