Hi All,
I'm probably missing something really obvious here, but I can't figure it out...
The problem: tophat2 doesn't seem to read correctly the two lists of fastq files passed as comma-separated lists. In particular, it ignores the last file of the second list (the second mate).
In more detail: I have a library that was sequenced on two lanes in paired-end mode, so I have two pairs of fastq files. For testing purposes I reduced the number of reads in each file as follows:
## Lane 3:
s_3_1.fq.gz: 25000 reads (mate 1)
s_3_4.fq.gz: 25000 reads (mate 2)
## Lane 4:
s_4_1.fq.gz: 50000 reads (mate 1)
s_4_4.fq.gz: 50000 reads (mate 2)
Now, if I run tophat like this:
Code:
tophat2 -o both1 -r 100 --mate-std-dev 80 --library-type fr-unstranded -G ${annotgtf} ${bwtidx} \
    s_3_1.fq.gz,s_4_1.fq.gz \
    s_3_4.fq.gz,s_4_4.fq.gz
It appears that the "left" reads total 74991 (25K + 50K), while the "right" reads are only 24992 (that is, only from s_3_4.fq.gz).
Here's the first part of the output:
Code:
[2012-07-09 15:21:43] Beginning TopHat run (v2.0.4)
-----------------------------------------------
[2012-07-09 15:21:43] Checking for Bowtie
    Bowtie version: 2.0.0.5
[2012-07-09 15:21:43] Checking for Samtools
    Samtools version: 0.1.18.0
[2012-07-09 15:21:43] Checking for Bowtie index files
[2012-07-09 15:21:43] Checking for reference FASTA file
[2012-07-09 15:21:43] Generating SAM header for /lustre/sblab/berald01/reference_data/genomes/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome
    format: fastq
    quality scale: phred33 (default)
[2012-07-09 15:22:10] Reading known junctions from GTF file
[2012-07-09 15:22:16] Preparing reads
    left reads: min. length=20, max. length=100, 74991 kept reads (9 discarded)
    right reads: min. length=20, max. length=100, 24992 kept reads (8 discarded)
So I tried inverting the order in which the fastq files are passed in the last and second-to-last arguments:
Code:
tophat2 -o both2 -r 100 --mate-std-dev 80 --library-type fr-unstranded -G ${annotgtf} ${bwtidx} \
    s_4_1.fq.gz,s_3_1.fq.gz \
    s_4_4.fq.gz,s_3_4.fq.gz
Again, the "left" reads are ~75000, but now the "right" reads are ~50000 (only from s_4_4.fq.gz, now ignoring s_3_4.fq.gz):
Code:
[2012-07-09 15:21:42] Beginning TopHat run (v2.0.4)
...
[2012-07-09 15:21:50] Preparing reads
    left reads: min. length=20, max. length=100, 74991 kept reads (9 discarded)
    right reads: min. length=20, max. length=100, 49949 kept reads (51 discarded)
Running the two pairs of files separately produces the expected number of left and right reads, as does concatenating the fastq files belonging to the same mate. So I guess the problem is not with the files themselves (which look fine to me anyway).
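For reference, the read counts per file and the concatenation of the fastq files belonging to the same mate can be sketched roughly as below. This is only a minimal sketch: it assumes gzip-compressed FASTQ with exactly four lines per record, the helper names count_reads and merge_fastq are hypothetical, and so are the output names mate1.fq.gz and mate2.fq.gz in the usage comments.

```shell
# Count reads in a gzip-compressed FASTQ file.
# Assumption: every record spans exactly 4 lines (no wrapped sequences).
count_reads() {
    gzip -dc "$1" | awk 'END { print NR / 4 }'
}

# Merge several gzipped FASTQ files into a single gzipped file,
# keeping the reads in the order the inputs are given.
# Usage: merge_fastq OUTPUT INPUT [INPUT...]
merge_fastq() {
    out=$1
    shift
    gzip -dc "$@" | gzip > "$out"
}

# Example with the files from this post (output names are hypothetical):
#   merge_fastq mate1.fq.gz s_3_1.fq.gz s_4_1.fq.gz
#   merge_fastq mate2.fq.gz s_3_4.fq.gz s_4_4.fq.gz
#   count_reads mate1.fq.gz
#   count_reads mate2.fq.gz
```

The lanes have to be given in the same order for both mates, so that read N in the merged mate 1 file still pairs with read N in the merged mate 2 file.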
Any ideas what's happening?
Many thanks
Dario