  • Strange Tophat prep_reads behavior on small files

    I'm trying to get my SE colorspace dataset from SRA (SRR040361) to work. SRA datasets apparently have an extra quality value which Tophat doesn't like, so I deleted the first character of each quality string.

    That still led to an error message from prep_reads, so I'm probing prep_reads with very small files and my very rusty C++ knowledge. That leads to a very weird response:

    With only one sequence in my file
    Code:
    Error: qual length (1) differs from seq length (51) for fastq record SRR040361.2
    With two sequences (the same ones!)
    Code:
    Error: qual length (9) differs from seq length (51) for fastq record SRR040361.2
    With three
    Code:
    Error: qual length (17) differs from seq length (51) for fastq record SRR040361.2
    Etc.
    Here is the table:
    Code:
    #seqs  reported qual length
    1       1
    2       9
    3      17
    4      33
    5      41
    6      49
    7      52
    8      52

    BTW, the URL http://randspringer.de/bam referred to in a config error message does not exist.

  • #2
    The errors suggest you have broken the FASTQ file in your editing. Could you post the first few reads? Use the [ code ] and [ /code ] tags on the forum (or the # icon in the advanced editing mode).

    This should answer my next question: Is your FASTQ file in colour space or has it been converted to sequence space (the NCBI can do this for display)?

    P.S. Are you using this file (uncompressed)?
    ftp://ftp.ncbi.nlm.nih.gov/sra/SeqSa...0361.fastq.bz2

    P.P.S. Probably a silly question, but check that you are using the -C/--color switch to tophat.

    • #3
      The OP has seen it, but for anyone else reading, see also this thread:
      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

      • #4
        Silly of me not to have posted the head of my FASTQ! And yes, it is the file you have linked to (but I have decompressed it).

        Note that it doesn't seem to matter what sequences I use; as far as I can tell I get the same error messages out.
        Code:
        @SRR040361.1 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_49_696 length=50
        T12213101232031112231111223021120221322222222202222
        +SRR040361.1 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_49_696 length=50
        &3)+.(>=:&)-&5&)3('*0()&//5/&&+&71&&$1*6%+7)3%.82*
        @SRR040361.3 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_50_1372 length=50
        T13300301112110223302310003221022222201220122222222
        +SRR040361.3 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_50_1372 length=50
        )620:77744/:94/=12)0);/:7756&,&56&%&/,'/'19/&,24,6
        @SRR040361.12 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_53_334 length=50
        T31123230002223111100312233113231220332210022103222
        +SRR040361.12 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_53_334 length=50
        =:==9;2>==>5<7;>9;<-,8<;475<1989./27*9&++68,)&%802
        @SRR040361.13 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_53_1091 length=50
        T13011331210033320001032333320201230312111322113211
        +SRR040361.13 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_53_1091 length=50
        A>5>A:=$:<:<$;;7<#,;&?#670<#&9)7*3/.1+5=':,07-&,&4
        @SRR040361.14 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_54_127 length=50
        T32110313302221222331121332211111100021123122131232
        +SRR040361.14 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_54_127 length=50
        ;)@1//3)&<&1,/)1>&)(:)64&&&:;2',1(&,&.&5$'8650/45(
        @SRR040361.15 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_54_311 length=50
        T23332102212232122131103321013321221002223122103222
        +SRR040361.15 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_54_311 length=50
        <<::>@>9;7?6;><A<7<>>97)(<9'71>9/35;3$*/655/3788+)

        • #5
          Well that does look wrong - the sequences are length 51 (including the leading letter) while the qualities are just length 50.

          This is the start of the original FASTQ file from the NCBI:

          Code:
          @SRR040361.1 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_49_696
          T12213101232031112231111223021120221322222222202222
          +
          !&3)+.(>=:&)-&5&)3('*0()&//5/&&+&71&&$1*6%+7)3%.82*
          Both the sequence and the quality strings are length 51.

          This is the start of your conversion:

          Code:
          @SRR040361.1 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_49_696 length=50
          T12213101232031112231111223021120221322222222202222
          +SRR040361.1 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_49_696 length=50
          &3)+.(>=:&)-&5&)3('*0()&//5/&&+&71&&$1*6%+7)3%.82*
          You have removed the first quality character but not the first character of the sequence. I'd have expected this:

          Code:
          @SRR040361.1 VAB_ugc_85__100_137__138_121__123_bc_Frag50_solid0032_20090715_ugc_121__1231_49_696 length=50
          12213101232031112231111223021120221322222222202222
          +
          &3)+.(>=:&)-&5&)3('*0()&//5/&&+&71&&$1*6%+7)3%.82*
          (Note that repeating the ID and description on the plus line is usually considered optional, and a waste of disk space.)
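
The length comparison described above can also be checked mechanically. What follows is a minimal Python sketch, not part of TopHat or prep_reads. It assumes unwrapped four-line FASTQ records, and the function name `fastq_length_mismatches` is made up for illustration. The example record is the edited one quoted above, with its description shortened.

```python
def fastq_length_mismatches(lines):
    """Yield (read_id, seq_len, qual_len) for records whose lengths differ.

    Assumes strict four-line FASTQ records (no line wrapping).
    """
    lines = [ln.rstrip("\r\n") for ln in lines]
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        if len(seq) != len(qual):
            # The read ID is the first whitespace-delimited token after '@'.
            read_id = (header[1:].split() or [""])[0]
            yield read_id, len(seq), len(qual)


# Edited record from the thread (description shortened for the example).
edited = [
    "@SRR040361.1 length=50",
    "T12213101232031112231111223021120221322222222202222",
    "+",
    "&3)+.(>=:&)-&5&)3('*0()&//5/&&+&71&&$1*6%+7)3%.82*",
]
print(list(fastq_length_mismatches(edited)))  # -> [('SRR040361.1', 51, 50)]
```

Dropping the leading `T` from the sequence line, as suggested above, makes both strings 50 characters long and the check reports nothing.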

          • #6
            Hello all,
            I am very new to TopHat and need some help because I ran into this error. What should I do?
            Thanks.
            Error running 'prep_reads'
            Error: qual length (131) differs from seq length (100) for fastq record HWI-ST365_0157:7:2101:9222:152711#GCGGTC/2!

            • #7
              Sorry, this is the head of my FASTQ file:
              @HWI-ST365_0157:7:1101:1818:2058#GCGGTC/2
              AGAGAAGGAGGCGATTGGGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNNNN
              +HWI-ST365_0157:7:1101:1818:2058#GCGGTC/2
              bb_eeeeegggggaghfiibgBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
              @HWI-ST365_0157:7:1101:1915:2059#GCGGGC/2
              CTTGGGAGAATTTTGAAAAGAACCATTTTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNNTTGTTAATCTNNNNNNNNNNNNNNN
              +HWI-ST365_0157:7:1101:1915:2059#GCGGGC/2
              _a_ceeccgggggf]egfbeJR`JJ`XbBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
              @HWI-ST365_0157:7:1101:1933:2060#GCGGTC/2
              GAACTGATAGTACATCCACCTGAGGTGGGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNNNNNNNTTTGCAATTANTNNNNNNNNNNNNN

              So what do I do now?
              Thanks.

              • #8
                As mastal suggested in the other thread you can examine the offending record by pulling it out of your file like this:

                Code:
                $ grep -A 3 "HWI-ST365_0157:7:2101:9222:152711#GCGGTC/2" fastq_file_name
                $ zcat fastq_file_name.gz | grep -A 3 "HWI-ST365_0157:7:2101:9222:152711#GCGGTC/2"  # if the file is gzipped
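
For readers who would rather not rely on grep, the same lookup can be sketched in Python. This is only an illustration under stated assumptions: the helper names `open_fastq` and `find_record` are made up, compression is guessed from a `.gz` suffix, and records are assumed to be unwrapped four-line FASTQ. The demo records are invented.

```python
import gzip


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file as text (decided by suffix)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def find_record(lines, read_id):
    """Return the four lines of the first record whose header contains read_id.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Returns None when no header matches.
    """
    it = iter(lines)
    for line in it:
        if line.startswith("@") and read_id in line:
            # Take the next three lines, like `grep -A 3` does.
            rest = [next(it, "") for _ in range(3)]
            return [s.rstrip("\r\n") for s in [line] + rest]
    return None


# Made-up records for demonstration only.
demo = [
    "@read1 made-up example\n", "ACGT\n", "+\n", "IIII\n",
    "@read2 made-up example\n", "GGCC\n", "+\n", "JJJJ\n",
]
print(find_record(demo, "read2"))
```

With a real file this would be called as `with open_fastq(path) as handle: find_record(handle, read_id)`. Like the grep command, it matches substrings, so a short ID can also match longer IDs that contain it.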

                • #9
                  OK, I have checked the record named in the error message. Here it is:
                  @HWI-ST365_0157:7:2101:9222:152711#GCGGTC/2
                  CTGCACCAGCCCGTCGAAGACACATCAGTGACTCCATCATGACTTTTTCTTCATCAATCATTTTGAGAACAGCACCAGCCTTGATCATCGAGTATTCACC
                  +HWI-ST365_0157:7:2101:9222:152711#GCGGTC/2
                  _bbeeeeeg^ecggfhhiiffhihihihffggiihgfhhbghifiidgefdeghffhhiiiiiiiefegga_cebcbcca^`bcccdccb`a``_bcY_b

                  They are both 100 long, so I do not know why I get the error.
                  I am virtualizing Ubuntu on Windows 7. Could this be an issue?
                  Kindly assist.

                  • #10
                    Same Problem, different situation.

                    Originally posted by GenoMax
                    As mastal suggested in the other thread you can examine the offending record by pulling it out of your file like this:

                    Code:
                    $ grep -A 3 "HWI-ST365_0157:7:2101:9222:152711#GCGGTC/2" fastq_file_name
                    $ zcat fastq_file_name.gz | grep -A 3 "HWI-ST365_0157:7:2101:9222:152711#GCGGTC/2"  # if the file is gzipped
                    Hi GenoMax,

                    I have the same error in a file of mine:
                    Code:
                    Error: qual length (214) differs from seq length (140) for fastq record !
                    When I try your suggested command on a file that has been having this problem I end up with this output:
                    Code:
                    @J00138:68:HCWKCBBXX:1:1102:25418:27109 2:N:0:ATTACTCG+GCCTCTAT
                    TTTAAATCGGTGGTTAAGAGCCAAATGTATGACTACAGGGAACTTCTAGGCATAGTTAACATATAAGTTAGAGCAT
                    +
                    AAFAFFJJJJJJJJJJJFJJJJJJJJJJJJJFJJJJJFFJJJJJJJJJJJJJJJJJJJJJJJJJJF<-FJFJJJJJ
                    This does not seem to show the 'bad apple'. Any help with this?

                    Head of the file:
                    Code:
                    @J00138:68:HCWKCBBXX:1:1101:24718:1068 2:N:0:NTTACTCG+NCCTCTAT
                    NACTTTTTTTTCCATTTGAGAGATGAAAACACAGGAAGAAGTGAAGGTCTGGAGTTTGATCGCCAGACAAATGACC
                    +
                    #AAAF-<JJ-----7FF-<<<-F-<7<JF<F-FF-<-FF--<-<7<AF----<--<7-A-<-------<<-AA7-7
                    @J00138:68:HCWKCBBXX:1:1101:24941:1068 2:N:0:NTTACTCG+NCCTCTAT
                    NATAAGTCACTGCAGAGAGAGGTGGAGGAATTGAACGGTGAAAATGGGCAGCTTGAATCCGCTTTGGCTCTTGCAA
                    +
                    #AAFFJJJFJJJJJJFJAJ<JJ-FFFJFJJFFFJJJJJJJJJFJJJJJJJJJ7-FJJJJJJJFFFFFFJFJJAJJJ
                    @J00138:68:HCWKCBBXX:1:1101:24962:1068 2:N:0:NTTACTCG+NCCTCTAT
                    NAGCGCTCTTATCAGTCGTCTGCAAGCCTATATAGAGGAACACGGTTCGGAAGACCTTCTGCTTAATACTGAAGAA
                    +
                    #A-A<--AAFFJFF-FF<FFJAJJJ<FJ<F--<<JA-FFJJJJ7<7AFFFJFFFJAFFAA-JJ<-AFJJFAJJFF<
                    @J00138:68:HCWKCBBXX:1:1101:25002:1068 2:N:0:NTTACTCG+NCCTCTAT
                    NGGTCGGGCAATTAGTTTGGTGACCCCCGTGAGTATAAGCACTAACCATAGGGGGTGCCTGAGAATTTGGTGACCC
                    +
                    #AAA<F<JJJFJJJJAJJFFJJJFJJJJJJFFJ-<A-FJJJFFJJJJJJJ7AAAFJJJF<77A<FFJJJ7<AJF7A
                    Thanks!
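
When a lookup by read ID returns a record that looks well formed, the damage may sit elsewhere in the file. One way to hunt for it is to scan the file's structure and report the first line that breaks the four-line pattern. The sketch below is a hedged illustration, not how prep_reads or repair.sh work internally. It assumes unwrapped four-line records, and the name `first_bad_record` and the demo records are invented.

```python
from itertools import zip_longest


def first_bad_record(lines):
    """Return (line_number, reason) for the first malformed record, else None.

    Assumes unwrapped four-line records; line numbers are 1-based.
    """
    stripped = (line.rstrip("\r\n") for line in lines)
    # Passing the same generator four times makes zip_longest read four lines per step.
    for index, record in enumerate(zip_longest(*[stripped] * 4)):
        header, seq, plus, qual = record
        first_line = index * 4 + 1
        if qual is None:
            return first_line, "truncated record at end of input"
        if not header.startswith("@"):
            return first_line, "header does not start with '@'"
        if not plus.startswith("+"):
            return first_line + 2, "separator does not start with '+'"
        if len(seq) != len(qual):
            return first_line + 3, f"seq length {len(seq)} != qual length {len(qual)}"
    return None


# Made-up records: the second one has a shortened quality line.
demo = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "ACGT\n", "+\n", "II\n"]
print(first_bad_record(demo))  # -> (8, 'seq length 4 != qual length 2')
```

Because the input is consumed line by line, an open file handle can be passed directly without loading the whole file into memory.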

                    • #11
                      Use the repair.sh tool from BBMap to take out the problem/malformed reads from your files.

                      • #12
                        Thanks, worked well.

                        • #13
                          repair.sh

                          I have run repair.sh with the default parameters on a pair of read files and got the error below. I thought that if repair.sh saw a problem like this, it would just remove the read?

                          Code:
                          Mismatch between length of bases and qualities for read 33584341 (id=J00138:68:HCWKCBBXX:3:2218:11728:42565 1:N:0:GAATTCGT+ACGTCCTG).
                          # qualities=42, # bases=132
                          
                          AAFFFJFJJJ70:42565 1:N:0:GAATTCGT+ACGTCCTG
                          ACCAACTATATTAAAAAAAAATAAGGGCATC0J221TTTACGCJJJJJJTGATTAJJJJJJAAGCAGAAATATTGAAATTJJJJJJJJJGGCTTAAGGCTATCTTGAGTTTTCGTTGGAGGTCACTCCAGCA
                          
                          	at stream.Read.validate(Read.java:114)
                          	at stream.Read.<init>(Read.java:78)
                          	at stream.Read.<init>(Read.java:61)
                          	at stream.FASTQ.quadToRead(FASTQ.java:862)
                          	at stream.FASTQ.toReadList(FASTQ.java:696)
                          	at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:111)
                          	at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:96)
                          	at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
                          	at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
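
In the record printed in the message above, the line reported as bases contains characters such as `J`, `0` and `2`, which are not nucleotide codes, so quality characters appear to have been mixed into the sequence line. A minimal sketch for flagging such lines follows. It assumes base-space reads limited to A, C, G, T and N (other IUPAC codes or colour-space digits would need a different alphabet), and the function name `non_base_characters` is made up.

```python
VALID_BASES = set("ACGTN")  # assumption: uppercase base-space reads only


def non_base_characters(sequence_line):
    """Return the sorted characters in a sequence line that are not A, C, G, T or N."""
    return sorted(set(sequence_line.strip()) - VALID_BASES)


# Excerpt of the "bases" line from the error message above.
excerpt = "GGGCATC0J221TTTACGCJJJJJJTGATTA"
print(non_base_characters(excerpt))        # -> ['0', '1', '2', 'J']
print(non_base_characters("ACGTNNACGT"))   # -> []
```

Running a check like this over every sequence line would show how many records are affected, which helps decide whether the input files themselves are corrupted.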

                          • #14
                            Post your full repair.sh command.

                            • #15
                              Full repair.sh command
                              Code:
                              repair.sh in1=47_R1_001.fastq.gz_recovered in2=47_R2_001.fastq.gz_recovered out1=47_R1_001.fastq.gz_recovered_repaired out2=47_R2_001.fastq.gz_recovered_repaired outs=47_001_singeltons_repair repair -Xmx24g
