**maubp** · 10-07-2010, 02:31 AM

The errors suggest you broke the FASTQ file while editing it. Could you post the first few reads? Use the [ code ] and [ /code ] tags on the forum (or the # icon in the advanced editing mode).

This should answer my next question: Is your FASTQ file in colour space or has it been converted to sequence space (the NCBI can do this for display)?

P.S. Are you using this file (uncompressed)?
ftp://ftp.ncbi.nlm.nih.gov/sra/SeqSa...0361.fastq.bz2

P.P.S. Probably a silly question, but have you checked that you are passing the -C/--color switch to tophat?

**maubp** · 10-07-2010, 05:49 AM

The OP has seen it, but for anyone else reading, see also this thread:

Reformating SOLiD input for TopHat 1.1 - SEQanswers

http://seqanswers.com/forums/showthread.php?t=7206

**krobison** · 10-07-2010, 06:45 AM

Silly of me not to have posted the head of my FASTQ! And yes, it is the file you have linked to (but I have decompressed it).

Note that it doesn't seem to matter what sequences I use; as far as I can tell I get the same error messages out.

Code:

@SRR040361.1 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_49_696 length=50
T12213101232031112231111223021120221322222222202222
+SRR040361.1 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_49_696 length=50
&3)+.(>=:&)-&5&)3('*0()&//5/&&+&71&&$1*6%+7)3%.82*
@SRR040361.3 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_50_1372 length=50
T13300301112110223302310003221022222201220122222222
+SRR040361.3 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_50_1372 length=50
)620:77744/:94/=12)0);/:7756&,&56&%&/,'/'19/&,24,6
@SRR040361.12 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_53_334 length=50
T31123230002223111100312233113231220332210022103222
+SRR040361.12 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_53_334 length=50
=:==9;2>==>5<7;>9;<-,8<;475<1989./27*9&++68,)&%802
@SRR040361.13 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_53_1091 length=50
T13011331210033320001032333320201230312111322113211
+SRR040361.13 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_53_1091 length=50
A>5>A:=$:<:<$;;7<#,;&?#670<#&9)7*3/.1+5=':,07-&,&4
@SRR040361.14 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_54_127 length=50
T32110313302221222331121332211111100021123122131232
+SRR040361.14 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_54_127 length=50
;)@1//3)&<&1,/)1>&)(:)64&&&:;2',1(&,&.&5$'8650/45(
@SRR040361.15 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_54_311 length=50
T23332102212232122131103321013321221002223122103222
+SRR040361.15 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_54_311 length=50
<<::>@>9;7?6;><A<7<>>97)(<9'71>9/35;3$*/655/3788+)

**maubp** · 10-07-2010, 06:54 AM

Well that does look wrong - the sequences are length 51 (including the leading letter) while the qualities are just length 50.

This is the start of the original FASTQ file from the NCBI,

Code:

@SRR040361.1 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_49_696
T12213101232031112231111223021120221322222222202222
+
!&3)+.(>=:&)-&5&)3('*0()&//5/&&+&71&&$1*6%+7)3%.82*

Both the sequence and the quality strings are length 51.

This is the start of your conversion:

Code:

@SRR040361.1 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_49_696 length=50
T12213101232031112231111223021120221322222222202222
+SRR040361.1 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_49_696 length=50
&3)+.(>=:&)-&5&)3('*0()&//5/&&+&71&&$1*6%+7)3%.82*

You have removed the first quality character but not the first character of the sequence. I'd have expected this:

Code:

@SRR040361.1 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_49_696 length=50
12213101232031112231111223021120221322222222202222
+
&3)+.(>=:&)-&5&)3('*0()&//5/&&+&71&&$1*6%+7)3%.82*

(Note that repeating the ID and description on the plus line is optional, and is usually considered a waste of disk space.)
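The point above is that the primer base and its placeholder quality have to be removed together, or not at all. A minimal Python sketch of that rule follows. The helper name `trim_colour_space_record` and the test for what counts as a colour-space read are assumptions made for illustration, and whether TopHat itself wants the primer base removed is not settled in this thread.

```python
def trim_colour_space_record(header, seq, plus, qual):
    """Drop the leading primer base and its placeholder quality together.

    Hypothetical helper: the name and the colour-space test are assumptions.
    """
    # Assumption: a colour-space read is one primer letter followed by
    # colour calls written as the digits 0-3 (or '.' for a missing call).
    has_primer = (
        len(seq) > 1
        and seq[0].isalpha()
        and all(c in "0123." for c in seq[1:])
    )
    # Only trim when both strings still carry their leading character.
    if has_primer and len(seq) == len(qual):
        seq, qual = seq[1:], qual[1:]
    if len(seq) != len(qual):
        raise ValueError(
            f"{header}: seq length {len(seq)} != qual length {len(qual)}"
        )
    # A bare '+' is enough; repeating the header there is optional.
    return header, seq, "+", qual


# Sequence and quality strings of the first record in the original NCBI file.
original = (
    "@SRR040361.1",
    "T12213101232031112231111223021120221322222222202222",
    "+",
    "!&3)+.(>=:&)-&5&)3('*0()&//5/&&+&71&&$1*6%+7)3%.82*",
)
trimmed = trim_colour_space_record(*original)
print(len(trimmed[1]), len(trimmed[3]))
```

A record like the broken conversion, where only the quality string lost its first character, raises `ValueError` instead of being passed on silently.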

**aforntacc** · 07-19-2013, 10:41 AM

Hello all,
I am very new to TopHat and need some help, because I ran into the error below. What should I do, please?
Thanks.
Error running 'prep_reads'
Error: qual length (131) differs from seq length (100) for fastq record HWI-ST365_0157:7:2101:9222:152711#GCGGTC/2!

**aforntacc** · 07-19-2013, 10:48 AM

Sorry, this is the head of my FASTQ file:
@HWI-ST365_0157:7:1101:1818:2058#GCGGTC/2
AGAGAAGGAGGCGATTGGGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNNNN
+HWI-ST365_0157:7:1101:1818:2058#GCGGTC/2
bb_eeeeegggggaghfiibgBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-ST365_0157:7:1101:1915:2059#GCGGGC/2
CTTGGGAGAATTTTGAAAAGAACCATTTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNTTGTTAATCTNNNNNNNNNNNNNNN
+HWI-ST365_0157:7:1101:1915:2059#GCGGGC/2
_a_ceeccgggggf]egfbeJR`JJ`XbBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
@HWI-ST365_0157:7:1101:1933:2060#GCGGTC/2
GAACTGATAGTACATCCACCTGAGGTGGGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNNNNNNNTTTGCAATTANTNNNNNNNNNNNNN

So what do I do now?
Thanks.

**GenoMax** · 07-19-2013, 11:00 AM

As mastal suggested in the other thread you can examine the offending record by pulling it out of your file like this:

Code:

$ cat (or zcat) fastq_file_name | grep "HWI-ST365_0157:7:2101:9222:152711#GCGGTC/2" -A 3
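One caveat with grep: when the '+' line repeats the read ID, that line matches as well, so a few extra lines are printed. A minimal Python sketch of the same lookup is given here; it only matches header lines and reports where the record starts. It assumes strict four-line records, and `find_record` is a hypothetical helper name, not part of any tool mentioned in this thread.

```python
def find_record(lines, read_id):
    """Return (first_line_number, header, seq, qual) for the first record
    whose ID equals read_id, or None if there is no such record.

    Assumption: every record occupies exactly four lines.
    """
    it = iter(lines)
    # zip over the same iterator four times yields one record per step.
    for index, record in enumerate(zip(it, it, it, it)):
        header, seq, plus, qual = (line.rstrip("\r\n") for line in record)
        # The ID is the header text after '@', up to the first whitespace.
        if header.startswith("@") and header[1:].split(None, 1)[:1] == [read_id]:
            return index * 4 + 1, header, seq, qual
    return None


# Tiny in-memory example; any iterable of lines works, including an open
# file handle or gzip.open(path, "rt").
demo = [
    "@read1/2\n", "ACGT\n", "+read1/2\n", "IIII\n",
    "@read2/2\n", "ACG\n", "+read2/2\n", "III\n",
]
hit = find_record(demo, "read2/2")
if hit is not None:
    line_no, header, seq, qual = hit
    print(line_no, header, len(seq), len(qual))
```

Printing both lengths next to the line number makes it easy to compare against the numbers in the prep_reads error message.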

**aforntacc** · 07-20-2013, 01:14 AM

OK, I have checked the record named in the error message. Here it is:
@HWI-ST365_0157:7:2101:9222:152711#GCGGTC/2
CTGCACCAGCCCGTCGAAGACACATCAGTGACTCCATCATGACTTTTTCTTCATCAATCATTTTGAGAACAGCACCAGCCTTGATCATCGAGTATTCACC
+HWI-ST365_0157:7:2101:9222:152711#GCGGTC/2
_bbeeeeeg^ecggfhhiiffhihihihffggiihgfhhbghifiidgefdeghffhhiiiiiiiefegga_cebcbcca^`bcccdccb`a``_bcY_b

They are both 100 long, so I do not know why I get the error.
I am virtualizing Ubuntu on Windows 7; could this be an issue?
Kindly assist.

**ronaldrcutler** · 10-20-2016, 09:37 AM

Same Problem, different situation.

Originally posted by GenoMax:

As mastal suggested in the other thread you can examine the offending record by pulling it out of your file like this:

Code:

$ cat (or zcat) fastq_file_name | grep "HWI-ST365_0157:7:2101:9222:152711#GCGGTC/2" -A 3

Hi Genomax,

I have the same error in a file of mine:

Code:

Error: qual length (214) differs from seq length (140) for fastq record !

When I try your suggested command on a file that has been having this problem I end up with this output:

Code:

@J00138:68:HCWKCBBXX:1:1102:25418:27109 2:N:0:ATTACTCG+GCCTCTAT
TTTAAATCGGTGGTTAAGAGCCAAATGTATGACTACAGGGAACTTCTAGGCATAGTTAACATATAAGTTAGAGCAT
+
AAFAFFJJJJJJJJJJJFJJJJJJJJJJJJJFJJJJJFFJJJJJJJJJJJJJJJJJJJJJJJJJJF<-FJFJJJJJ

This does not seem to show the 'bad apple'. Any help with this?

Head of the file:

Code:

@J00138:68:HCWKCBBXX:1:1101:24718:1068 2:N:0:NTTACTCG+NCCTCTAT
NACTTTTTTTTCCATTTGAGAGATGAAAACACAGGAAGAAGTGAAGGTCTGGAGTTTGATCGCCAGACAAATGACC
+
#AAAF-<JJ-----7FF-<<<-F-<7<JF<F-FF-<-FF--<-<7<AF----<--<7-A-<-------<<-AA7-7
@J00138:68:HCWKCBBXX:1:1101:24941:1068 2:N:0:NTTACTCG+NCCTCTAT
NATAAGTCACTGCAGAGAGAGGTGGAGGAATTGAACGGTGAAAATGGGCAGCTTGAATCCGCTTTGGCTCTTGCAA
+
#AAFFJJJFJJJJJJFJAJ<JJ-FFFJFJJFFFJJJJJJJJJFJJJJJJJJJ7-FJJJJJJJFFFFFFJFJJAJJJ
@J00138:68:HCWKCBBXX:1:1101:24962:1068 2:N:0:NTTACTCG+NCCTCTAT
NAGCGCTCTTATCAGTCGTCTGCAAGCCTATATAGAGGAACACGGTTCGGAAGACCTTCTGCTTAATACTGAAGAA
+
#A-A<--AAFFJFF-FF<FFJAJJJ<FJ<F--<<JA-FFJJJJ7<7AFFFJFFFJAFFAA-JJ<-AFJJFAJJFF<
@J00138:68:HCWKCBBXX:1:1101:25002:1068 2:N:0:NTTACTCG+NCCTCTAT
NGGTCGGGCAATTAGTTTGGTGACCCCCGTGAGTATAAGCACTAACCATAGGGGGTGCCTGAGAATTTGGTGACCC
+
#AAA<F<JJJFJJJJAJJFFJJJFJJJJJJFFJ-<A-FJJJFFJJJJJJJ7AAAFJJJF<77A<FFJJJ7<AJF7A

Thanks!
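When the error message carries no read name, as in the case above, searching for an ID cannot locate the problem record. A minimal Python sketch is given here that walks the file four lines at a time and reports the line number of anything that looks malformed. The function name `check_fastq` is hypothetical, and the sketch assumes strict four-line records, so once a file loses or gains a line every later record will also be reported.

```python
from itertools import zip_longest


def check_fastq(lines):
    """Yield (line_number, message) for every 4-line record that looks malformed.

    Assumption: each record is exactly four lines (no wrapped sequences).
    """
    it = iter(lines)
    # zip_longest pads a final, incomplete record with None.
    for index, record in enumerate(zip_longest(it, it, it, it)):
        first_line = index * 4 + 1
        if None in record:
            yield first_line, "truncated record (fewer than 4 lines)"
            continue
        header, seq, plus, qual = (line.rstrip("\r\n") for line in record)
        if not header.startswith("@"):
            yield first_line, "header does not start with '@'"
        elif not plus.startswith("+"):
            yield first_line + 2, "separator line does not start with '+'"
        elif len(seq) != len(qual):
            yield first_line + 1, (
                f"seq length {len(seq)} differs from qual length {len(qual)}"
            )


# Tiny in-memory example; pass an open file handle (or gzip.open(path, "rt"))
# to scan a real file.
demo = [
    "@read1\n", "ACGT\n", "+\n", "IIII\n",
    "@read2\n", "ACGT\n", "+\n", "III\n",
]
for line_no, message in check_fastq(demo):
    print(line_no, message)
```

The first reported line number is usually the one worth inspecting, for example with `sed -n` or a text editor, since later reports may only be knock-on effects.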

**GenoMax** · 10-20-2016, 09:57 AM

Use the repair.sh tool from BBMap to take out the problem/malformed reads from your files.

**ronaldrcutler** · 10-28-2016, 09:00 AM

Thanks, worked well.

**ronaldrcutler** · 10-28-2016, 09:19 AM

repair.sh

I have now run repair.sh with the default parameters on a pair of read files and got the error below. I thought that if repair.sh encountered a malformed read like this, it would simply remove it?

Code:

Mismatch between length of bases and qualities for read 33584341 (id=J00138:68:HCWKCBBXX:3:2218:11728:42565 1:N:0:GAATTCGT+ACGTCCTG).
# qualities=42, # bases=132

AAFFFJFJJJ70:42565 1:N:0:GAATTCGT+ACGTCCTG
ACCAACTATATTAAAAAAAAATAAGGGCATC0J221TTTACGCJJJJJJTGATTAJJJJJJAAGCAGAAATATTGAAATTJJJJJJJJJGGCTTAAGGCTATCTTGAGTTTTCGTTGGAGGTCACTCCAGCA

	at stream.Read.validate(Read.java:114)
	at stream.Read.<init>(Read.java:78)
	at stream.Read.<init>(Read.java:61)
	at stream.FASTQ.quadToRead(FASTQ.java:862)
	at stream.FASTQ.toReadList(FASTQ.java:696)
	at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:111)
	at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:96)
	at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
	at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)

**GenoMax** · 10-28-2016, 09:27 AM

Post your full repair.sh command.

**ronaldrcutler** · 10-28-2016, 09:29 AM

Full repair.sh command:

Code:

repair.sh in1=47_R1_001.fastq.gz_recovered in2=47_R2_001.fastq.gz_recovered out1=47_R1_001.fastq.gz_recovered_repaired out2=47_R2_001.fastq.gz_recovered_repaired outs=47_001_singeltons_repair repair -Xmx24g


Strange Tophat prep_reads behavior on small files
