Hi,
I've recently discovered some strange behaviour in TopHat: it can sometimes give incomplete (or even incorrect) results because of an error while running Bowtie on the junction sequence database.
I'm using TopHat 1.0.13 with Bowtie 0.12.5 (or 0.12.3) on a Linux x86_64 compute cluster with a Lustre filesystem. The data are single-end 76 bp reads. I'm running TopHat with the following parameters:
-p 1 -a 8 -i 40 -m 1 -I 1000000 -F 0 --coverage-search --microexon-search
For one of the runs that gave incomplete results, I noticed something odd in the output:
[Thu May 27 17:25:49 2010] Mapping reads against segment_juncs with Bowtie
[Thu May 27 17:25:50 2010] Mapping reads against segment_juncs with Bowtie
[Thu May 27 17:25:51 2010] Mapping reads against segment_juncs with Bowtie
What is odd is that each of these steps finishes within about a second, whereas mapping the reads against segment_juncs should take much longer, since I have about 20 million reads. I first suspected an error in building the Bowtie index for the splice junctions, but bowtie_build.log shows no error. However, I found errors of the following type in some other log files from the run:
############################################
filebd4xji.log
Error reading ebwt array: returned 41750080, length was 168445184
Your index files may be corrupt; please try re-building or re-downloading.
A complete index consists of 6 files: XYZ.1.ebwt, XYZ.2.ebwt, XYZ.3.ebwt,
XYZ.4.ebwt, XYZ.rev.1.ebwt, and XYZ.rev.2.ebwt. The XYZ.1.ebwt and
XYZ.rev.1.ebwt files should have the same size, as should the XYZ.2.ebwt and
XYZ.rev.2.ebwt files.
############################################
So it seems that even though the Bowtie index for the junction sequences was built correctly, the alignment of reads against the junction index fails. I've run several series of tests and found that this Bowtie error does not occur every time (it seems to be more or less random), but it is quite frequent for large datasets. It is not yet clear why this happens (it might be OS-specific or filesystem-specific), so I am currently testing several ways to fix it (see also the parallel thread "Bowtie can't read index files").
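The near-instant segment_juncs steps in the run output above are an easy symptom to test for. Below is a minimal Python sketch, not part of TopHat, that parses the bracketed timestamps from a saved copy of the run output and reports steps that finished suspiciously fast. The function names, the 60-second threshold and the idea of saving the output to a file are my assumptions; the timestamp format is the one shown in the output above.

```python
import re
from datetime import datetime

# Matches lines such as:
# [Thu May 27 17:25:49 2010] Mapping reads against segment_juncs with Bowtie
STEP_LINE = re.compile(r"^\[(?P<stamp>[^\]]+)\]\s+(?P<message>.*)$")


def step_durations(lines):
    """Return (message, seconds) for each timestamped step except the last.

    A step's duration is the gap to the next timestamped line, so the
    final step has no duration and is not reported.
    Assumes English day and month names in the timestamps.
    """
    steps = []
    for line in lines:
        match = STEP_LINE.match(line.strip())
        if match:
            when = datetime.strptime(match.group("stamp"), "%a %b %d %H:%M:%S %Y")
            steps.append((when, match.group("message")))
    return [
        (message, (next_when - when).total_seconds())
        for (when, message), (next_when, _) in zip(steps, steps[1:])
    ]


def suspicious_steps(lines, keyword="segment_juncs", min_seconds=60):
    """Return steps mentioning `keyword` that took less than `min_seconds`.

    The 60-second default is an arbitrary assumption; pick a value that
    suits the size of your dataset.
    """
    return [
        (message, seconds)
        for message, seconds in step_durations(lines)
        if keyword in message and seconds < min_seconds
    ]
```

Feeding it the three lines quoted above flags the first two steps, each lasting one second. A fast step is only a hint, not proof of failure, so the log files still need to be checked.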
However, the bigger issue here is that TopHat does not catch the error thrown by Bowtie, and finishes with apparent success, while giving only an incomplete set of exon-exon junctions. This is quite dangerous, since most users will not search for "Error" messages in the log files if TopHat has finished successfully. So I would advise TopHat users to check the log files for Bowtie errors before proceeding with their analyses.
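The log check suggested above can be automated. Here is a minimal Python sketch, not part of TopHat, that scans every *.log file under a directory for lines resembling the Bowtie error quoted earlier. The function name, the search patterns and the directory you pass in are assumptions; adjust them to your own setup.

```python
import re
from pathlib import Path

# Patterns based on the Bowtie message quoted in this post.
# This is an assumed, non-exhaustive list; extend it as needed.
ERROR_PATTERN = re.compile(r"error|may be corrupt", re.IGNORECASE)


def find_log_errors(log_dir):
    """Return (file name, line number, text) for each suspicious log line."""
    hits = []
    for log_file in sorted(Path(log_dir).rglob("*.log")):
        # errors="replace" keeps the scan going if a log has odd bytes.
        with open(log_file, errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                if ERROR_PATTERN.search(line):
                    hits.append((log_file.name, line_no, line.strip()))
    return hits
```

Calling find_log_errors on the directory that holds the run's log files returns an empty list when nothing matches; any other result is worth inspecting before using the reported junctions. The pattern is deliberately broad, so expect occasional false positives.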
Any comments or suggestions on how to solve this problem would be much appreciated.
Best wishes,
Anamaria