SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Tophat prep_reads error ega2d RNA Sequencing 4 12-07-2012 08:17 AM
Tophat~Error retrieving prep_reads info ruc9 Bioinformatics 6 02-28-2012 07:56 AM
tophat Error running running 'prep_reads' victoryhe Bioinformatics 2 10-17-2011 04:53 AM
TOPHAT ,Error locating program: prep_reads chenyao Bioinformatics 2 08-17-2011 04:34 AM
another tophat "could not execute prep_reads" error James Bioinformatics 7 11-17-2010 04:49 AM

Reply
 
Thread Tools
Old 04-29-2010, 04:02 PM   #1
bzhang
Member
 
Location: California

Join Date: Apr 2010
Posts: 10
Default prep_reads error when running Tophat

I am running tophat on a test reads and got the following error,

Thu Apr 29 16:48:07 2010] Beginning TopHat run (v1.0.13)
-----------------------------------------------
[Thu Apr 29 16:48:07 2010] Preparing output location ./tophat_out/
[Thu Apr 29 16:48:07 2010] Checking for Bowtie index files
[Thu Apr 29 16:48:07 2010] Checking for reference FASTA file
[Thu Apr 29 16:48:07 2010] Checking for Bowtie
Bowtie version: 0.12.5.0
[Thu Apr 29 16:48:07 2010] Checking reads
seed length: 101bp
format: fastq
quality scale: phred33 (default)
[FAILED]
Error: could not execute prep_reads

The prep_reads.log file has this information,

rep_reads v1.0.13
---------------------------
Saw ASCII character 10 but expected 33-based Phred qual.
terminate called after throwing an instance of 'int'

I looked through data and the only ASCII character 10s I could find are the newlines at the end of each line. The test data is attached. Can someone help?
Attached Files
File Type: txt test.txt (492 Bytes, 124 views)
bzhang is offline   Reply With Quote
Old 04-29-2010, 04:18 PM   #2
shurjo
Senior Member
 
Location: Rockville, MD

Join Date: Jan 2009
Posts: 126
Default

If this is Illumina data, were your reads processed with pipeline v1.3 or later? If so, you have to include the --solexa-quals option in your TopHat run.
shurjo is offline   Reply With Quote
Old 04-29-2010, 04:24 PM   #3
bzhang
Member
 
Location: California

Join Date: Apr 2010
Posts: 10
Default

This is Illumina data. What I received was sequence.txt file and I have converted it into fastq (sanger) format. Do I still need to use --solexa-quals?
bzhang is offline   Reply With Quote
Old 04-29-2010, 04:26 PM   #4
shurjo
Senior Member
 
Location: Rockville, MD

Join Date: Jan 2009
Posts: 126
Default

Fastq files include quality scores, so the answer would be yes (once again, only if your reads were processed with pipeline v1.3 or later).
shurjo is offline   Reply With Quote
Old 04-29-2010, 04:33 PM   #5
bzhang
Member
 
Location: California

Join Date: Apr 2010
Posts: 10
Default

I have already converted the Illumina quality score to Sanger standard quality score (shift each character by 31). Do I still need to use the option?
bzhang is offline   Reply With Quote
Old 04-29-2010, 08:58 PM   #6
shurjo
Senior Member
 
Location: Rockville, MD

Join Date: Jan 2009
Posts: 126
Default

I guess not. At this point my knowledge ends and I would go running to the nearest full-time bioinformatics geek. One last thing though: I do see an extra newline at the end of the sample you posted, so I would double check your input file once to make sure that you dont have any in there.

Sorry and best of luck,

Shurjo
shurjo is offline   Reply With Quote
Old 04-29-2010, 09:25 PM   #7
bzhang
Member
 
Location: California

Join Date: Apr 2010
Posts: 10
Default

Shurjo, Thanks for the help. I have checked the file again to make sure there is no extra newline. These two reads were taken out from a large data file. The prep_reads apparently runs fine for the first 200,000 some reads and then choke on these two and I just could not see how they are different from other reads.
bzhang is offline   Reply With Quote
Old 04-29-2010, 10:05 PM   #8
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Can you verify that the FASTQ file is correctly formatted? The fact that TopHat is choosing a seed length of 101bp tells me something's up with that file. The seed length ought to be 25 for 50bp reads or longer. TopHat's FASTQ parser occasionally screws up when FASTQ records are incorrectly formatted or when the read and/or quality sequences span more than one line in the file. We plan to replace the parser in an upcoming version to make it more robust to this kind of thing.
Cole Trapnell is offline   Reply With Quote
Old 04-29-2010, 11:13 PM   #9
bzhang
Member
 
Location: California

Join Date: Apr 2010
Posts: 10
Default

Cole, could you take a look at the fastq file I attached? The original fastq file was converted from the Illumina SCARF format and contains millions of reads. prep_reads gave the error after 10 minutes, and the two reads I attached seem to be responsible for the problem.
bzhang is offline   Reply With Quote
Old 04-30-2010, 01:00 AM   #10
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541
Default

Quote:
Originally Posted by bzhang View Post
Saw ASCII character 10 but expected 33-based Phred qual.
terminate called after throwing an instance of 'int'

I looked through data and the only ASCII character 10s I could find are the newlines at the end of each line. The test data is attached. Can someone help?
Are you on Linux/Unix? It sounds like the file has DOS/Windows new lines (CR, LF - i.e. ASCII 10, 13) rather than Unix style (LF only). Try using dos2unix on it (or a similar tool).
maubp is offline   Reply With Quote
Old 04-30-2010, 11:06 AM   #11
bzhang
Member
 
Location: California

Join Date: Apr 2010
Posts: 10
Default

I think I figured out the problem. The Illumina sequence file uses '.' for undetermined bases and prep_reads filters this out when reading the sequence. This creates a mismatch between the sequences and the quality scores. For the problematic reads I attached, the first sequence contains 11 '.'s, so prep_reads reads in 90 bases. There happens to be a '@' in the quality scores after 90 and prep_reads treats it as the start of a new record, and this messes up the next record and hence the error. I don't know if using '.' in the sequences is a new convention adopted by Illumina or not. I am surprised that I am the first one to encounter this problem. For now I guess I'll just convert all those '.'s into 'N's, but prep_reads can certainly be more robust.

I am sort of lucky in a sense that my data contains enough reads to see this problem. If I only have 200,000 reads, I may not see the problem and happily carry on the downstream analysis unaware of the mismatch between the sequences and the quality scores.
bzhang is offline   Reply With Quote
Old 04-30-2010, 11:09 AM   #12
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Thanks for the heads up. We'll add the bug to our tracker and address it in the next release. Others are likely to have this problem.
Cole Trapnell is offline   Reply With Quote
Old 05-01-2010, 04:19 AM   #13
darked89
Member
 
Location: Barcelona, Spain

Join Date: Jun 2009
Posts: 36
Question

Quote:
Originally Posted by Cole Trapnell View Post
Can you verify that the FASTQ file is correctly formatted? The fact that TopHat is choosing a seed length of 101bp tells me something's up with that file. The seed length ought to be 25 for 50bp reads or longer.
I am also getting seed lengths = read_length (54, 76bp). Tophat runs fine till the end, but the accepted_hits.sam has zero spliced reads (for 76bp run). I run it in paired end mode, therefore assumed that something is wrong with my --mate-inner-dist / --mate-std-dev values (60, 20). Checked with the lab corrected these (20, 20), but still got no splices. Input FASTQ files were filtered using R ShortRead package. The same files seem to be doing OK with other mappers (SOAP, GEM).

Is there any way I can check that my FASTQ files are Tophat compatible?
darked89 is offline   Reply With Quote
Old 05-01-2010, 08:30 AM   #14
bzhang
Member
 
Location: California

Join Date: Apr 2010
Posts: 10
Default

From what I understand by reading the code, at least in the recent versions, the seed length is equal to the shortest read length. So if all the reads are of the same length, the seed length is set to the read length. I am not sure about the impact of setting seed length this way, guess I have to read more paper to understand this.
bzhang is offline   Reply With Quote
Old 05-01-2010, 01:23 PM   #15
bzhang
Member
 
Location: California

Join Date: Apr 2010
Posts: 10
Default

Quote:
Originally Posted by darked89 View Post
I am also getting seed lengths = read_length (54, 76bp). Tophat runs fine till the end, but the accepted_hits.sam has zero spliced reads (for 76bp run). I run it in paired end mode, therefore assumed that something is wrong with my --mate-inner-dist / --mate-std-dev values (60, 20). Checked with the lab corrected these (20, 20), but still got no splices. Input FASTQ files were filtered using R ShortRead package. The same files seem to be doing OK with other mappers (SOAP, GEM).

Is there any way I can check that my FASTQ files are Tophat compatible?
It seems tophat calls bowtie with option -v 2, which, according to the manual, means at most 2 mismatches allowed and the option -l (which specifies seed length) is ignored. I think your fastq files are fine as long as they don't contain non-alphabetical characters in the sequences.
bzhang is offline   Reply With Quote
Old 05-05-2010, 12:36 PM   #16
rcorbett
Member
 
Location: canada

Join Date: Sep 2009
Posts: 29
Default

I am getting the same error:
This is with version tophat-1.0.13. Has anyone coded a fix for this?

@2065
AGGCCGCTCCGGGCGCTGGACGTTGGGCTCCTGGCGAACCTCTCGGCGCTGGAGACTGGATATAACACAAC
+SOLEXA3_1:1:1:22:173/2
EDHEGHHCDHHGHHHHHCG/HHHHGEDHH=HHHGGHF<?HHHEGHDCHHADB#6@=#8AAG:@@E=1#@>#
Saw ASCII character 10 but expected 33-based Phred qual.
terminate called after throwing an instance of 'int'
Aborted
rcorbett is offline   Reply With Quote
Old 05-05-2010, 01:13 PM   #17
rcorbett
Member
 
Location: canada

Join Date: Sep 2009
Posts: 29
Default

I tried a workaround of changing all the "." characters in the sequence to "N" with awk:

awk '{if ( (NR-2)%4==0) {gsub(/\./,"N",$0); print $0} else { print $0}}'

Which gets me through prep_reads without any errors. Can anyone comment on if this is a safe workaround?
rcorbett is offline   Reply With Quote
Old 05-05-2010, 02:36 PM   #18
bzhang
Member
 
Location: California

Join Date: Apr 2010
Posts: 10
Default

Quote:
Originally Posted by rcorbett View Post
I tried a workaround of changing all the "." characters in the sequence to "N" with awk:

awk '{if ( (NR-2)%4==0) {gsub(/\./,"N",$0); print $0} else { print $0}}'

Which gets me through prep_reads without any errors. Can anyone comment on if this is a safe workaround?
I did the same conversion and was able to run the downstream analysis with cufflinks and the result seems to be fine. I think this is a safe workaround as 'N' is a legitimate character in the fasta sequence and I assume the alignment software (bowtie) treats it intelligently.
bzhang is offline   Reply With Quote
Old 05-06-2010, 05:30 AM   #19
maubp
Peter (Biopython etc)
 
Location: Dundee, Scotland, UK

Join Date: Jul 2009
Posts: 1,541
Default

I've added [ code ] tags round your FASTQ example for clarity - otherwise the forum messes things up.
Quote:
Originally Posted by rcorbett View Post
I am getting the same error:
This is with version tophat-1.0.13. Has anyone coded a fix for this?

Code:
@2065
AGGCCGCTCCGGGCGCTGGACGTTGGGCTCCTGGCGAACCTCTCGGCGCTGGAGACTGGATATAACACAAC
+SOLEXA3_1:1:1:22:173/2
EDHEGHHCDHHGHHHHHCG/HHHHGEDHH=HHHGGHF<?HHHEGHDCHHADB#6@=#8AAG:@@E=1#@>#
Saw ASCII character 10 but expected 33-based Phred qual.
terminate called after throwing an instance of 'int'
Aborted
The (optional repeated) identifier on the + line doesn't match the (mandatory) identifier on the @ line. Assuming nothing went wrong in the cut and paste into the forum, it looks like something is very wrong with your FASTQ file. This may be what is upsetting tophat.
maubp is offline   Reply With Quote
Old 05-07-2010, 06:15 AM   #20
clariet
Member
 
Location: NJ

Join Date: Mar 2010
Posts: 18
Default

Hi, Cole_Trapnell

Does current version of Tophat support SOLiD data? Thanks

Clariet

Quote:
Originally Posted by Cole Trapnell View Post
Thanks for the heads up. We'll add the bug to our tracker and address it in the next release. Others are likely to have this problem.
clariet is offline   Reply With Quote
Reply

Tags
bowtie, error, prep_reads, tophat

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:05 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO