Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Reformating SOLiD input for TopHat 1.1

    Hello all,

    I have now spent some time reformatting my .csfasta and .qual files to make them compatible to the new TopHat v 1.1.

    I would be interested to see your approaches!

    First I used the tools supplied with MAQ (fq_all2std.pl csfa2std and solid2fastq.pl). However these did not work, since they converted my colorspace code to nucleotide code.

    Then I chose to use solid2fastq.c from the BFAST package and this seems to work.

    I also converted the "." characters in my .csfasta to "N"s using:

    Code:
    awk '{if ( (NR-2)%4==0) {gsub(/\./,"N",$0); print $0} else { print $0}}'
    and changed the negative values in the .qual file of -1 to 1:

    Code:
    sed -i 's_-1_1_g' Sample_14_Qual_minus.qual
    The result looks like this:

    Code:
    @853_8_25
    T32NN232N3N10NN123NN202NN101NN010N1120NN012N0002N0N
    +
    +*""%%%"%"%%""%%%""%)&""%%%""%%+"&'%&""'(%"'))'"&"
    @853_8_35
    T31NN012N0N01NN011NN112NN300NNNN2N2001NN1N0N1N1NN3N
    +
    ?:""=54"@"=+""A98""745"";98""""2"=@>8""<"4"<"6"";"
    Even though TopHat now accepts this file, I still get an error:


    Code:
    [Thu Oct  7 09:32:30 2010] Beginning TopHat run (v1.1.0)
    -----------------------------------------------
    [Thu Oct  7 09:32:30 2010] Preparing output location /home/schaefer/tophat/RBM20/Sample14/
    [Thu Oct  7 09:32:30 2010] Checking for Bowtie index files
    [Thu Oct  7 09:32:30 2010] Checking for reference FASTA file
    [Thu Oct  7 09:32:30 2010] Checking for Bowtie
    	Bowtie version:			 0.12.3.0
    [Thu Oct  7 09:32:30 2010] Checking for Samtools
    	Samtools version:		 0.1.8.0
    [Thu Oct  7 09:32:37 2010] Checking reads
    	min read length: 50bp, max read length: 50bp
    	format:		 fastq
    	quality scale:	 phred33 (default)
    	[FAILED]
    Error: could not execute prep_reads
    using
    Code:
    /usr/local/tophat_1.1.0/tophat --integer-quals -o /home/schaefer/tophat/Sample14 -C rn4_c /home/schaefer/tophat/fastq/Sample14Ns.fastq
    with or without the option --integer-quals.

    Now I wonder if this is due to my input file or due to other problems mentioned here: http://seqanswers.com/forums/showthread.php?t=7141

    What input files have worked for you and how did you generate them?

    All the best,
    Sebastian

  • #2
    Check in the logs subdirectory of the directory you told Tophat to work in; there will be a prep_reads.log there -- post that & we can see if it is the same trouble I am running into or something I haven't (yet) butted heads against.

    Comment


    • #3
      There is a prep_reads.log file in the logs directory, but it is empty.

      The run.log shows how prep_reads was started:

      Code:
      /usr/local/tophat_1.1.0/prep_reads --min-anchor 8 --splice-mismatches 0 --min-report-intron 50 --max-report-intron 500000 --min-isoform-fraction 0.15 --output-dir /home/schaefer/tophat/Sample14/ --max-multihits 40 --segment-length 25 --segment-mismatches 2 --min-closure-exon 100 --min-closure-intron 50 --max-closure-intron 5000 --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 --sam-header /home/schaefer/tophat/Sample14/tmp/stub_header.sam --no-microexon-search --integer-quals --color --fastq /home/schaefer/tophat/fastq/Sample14Ns.fastq

      Comment


      • #4
        I am running into the same problem. I converted from csfasta and qual to fastq using solid2fastq from the Bfast package.

        The prep_reads.log says

        prep_reads v1.1.0 (1604)
        ---------------------------
        Saw ASCII character 29 but expected 33-based Phred qual.
        terminate called after throwing an instance of 'int'


        Running prep_reads directly give me the following error

        ~/Desktop > prep_reads test.fastq
        prep_reads v1.1.0 (1604)
        ---------------------------
        Error: qual length (62) differs from seq length (51) for fastq record !
        Last edited by dsidote; 10-07-2010, 06:23 AM.

        Comment


        • #5
          Since your quality and sequence files differ in length, maybe you have not changed the negative values in your .qual file? (see above command)

          Comment


          • #6
            Does someone (such as the Tophat team) have a small colorspace dataset which works in Tophat that they'd be willing & able to make public? Having a positive control would be awfully handy.

            Comment


            • #7
              Originally posted by DerSeb View Post
              Since your quality and sequence files differ in length, maybe you have not changed the negative values in your .qual file? (see above command)
              Since I have many more reads then I need I just removed all of the reads with missed colorcalls from the csfasta and qual files. It was about 1% of the total number of reads that got removed.

              Comment


              • #8
                I emailed Daehwan and he said the command is

                Code:
                tophat --color --quals bowtie_color_index 1.csfasta 1.qual
                You don't have to convert to fastq, but you do have to remove the header lines from the csfasta and qual files before running tophat. Also, you have to replace the '.' with N in the csfasta and replace '-1' with 0 in the qual file.

                It seems to be running fine following the above.

                Comment


                • #9
                  Here is a patch fixing all the colorspace bugs I've encountered so far:
                  1. prep_reads failing on -1 qualities
                  2. prep_reads failing on comments
                  3. prep_reads converting . to N only on the last read

                  Change a few lines of the source code seemed preferable to changing the reads.

                  I haven't finished a run yet so I'm not sure there if there are other bugs.

                  To apply the patch,
                  Code:
                  tar xzf tophat-1.1.0.tar.gz
                  # put tophat-1.1.0.colorspace.patch.gz in tophat-1.1.0
                  cd tophat-1.1.0
                  gunzip tophat-1.1.0.colorspace.patch.gz
                  patch -p0 < tophat-1.1.0.colorspace.patch
                  ./configure ; make ; make install
                  Attached Files

                  Comment


                  • #10
                    Thanks dcjones. Much easier than messing with multi-GB files.

                    Comment


                    • #11
                      Originally posted by dcjones View Post
                      Here is a patch fixing all the colorspace bugs I've encountered so far:
                      1. prep_reads failing on -1 qualities
                      2. prep_reads failing on comments
                      3. prep_reads converting . to N only on the last read

                      Change a few lines of the source code seemed preferable to changing the reads.

                      I haven't finished a run yet so I'm not sure there if there are other bugs.

                      To apply the patch,
                      Code:
                      tar xzf tophat-1.1.0.tar.gz
                      # put tophat-1.1.0.colorspace.patch.gz in tophat-1.1.0
                      cd tophat-1.1.0
                      gunzip tophat-1.1.0.colorspace.patch.gz
                      patch -p0 < tophat-1.1.0.colorspace.patch
                      ./configure ; make ; make install
                      Thanks so much for doing this dcjones.

                      I was finally able to process a small subset of solid data without tophat crashing! One note, comment lines still cause the tophat error

                      Code:
                      [Thu Oct  7 14:02:44 2010] Beginning TopHat run (v1.1.0)
                      -----------------------------------------------
                      [Thu Oct  7 14:02:44 2010] Preparing output location tophat-out/short/
                      [Thu Oct  7 14:02:44 2010] Checking for Bowtie index files
                      [Thu Oct  7 14:02:44 2010] Checking for reference FASTA file
                      [Thu Oct  7 14:02:44 2010] Checking for Bowtie
                              Bowtie version:                  0.12.7.0
                      [Thu Oct  7 14:02:44 2010] Checking for Samtools
                              Samtools version:                0.1.8.0
                      [Thu Oct  7 14:03:09 2010] Checking reads
                      Error: file short.csfasta does not appear to be a valid FASTA or FASTQ file
                      
                      Error encountered parsing file short.csfasta:
                       Records in Fastq files should start with '@' character

                      Comment


                      • #12
                        hello and thx for all the ideas and help. I don't want to mess around with the installation so far and still try to convert my csfasta to Ns only. however, when I use this:

                        Code:
                        awk '{if ( (NR-2)%2==0) {gsub(/\./,"N",$0); print $0} else { print $0}}'
                        to change . to N and this

                        Code:
                        sed '1,3d'
                        to remove header lines

                        to make my .csfasta:

                        Code:
                        >853_8_25_F3
                        T32NN232N3N10NN123NN202NN101NN010N1120NN012N0002N0N
                        >853_8_35_F3
                        T31NN012N0N01NN011NN112NN300NNNN2N2001NN1N0N1N1NN3N
                        >853_8_75_F3
                        T32NN011N1N31NN001NN301NN120NN232N2201NN231N1202N1N
                        >853_8_96_F3
                        T32NN212N2N30NN113NN201NN002NNN20N3010NN3N0N0N0NN0N
                        >853_8_114_F3
                        T30NN111N1N00NN212NN133NN331NNNN1N2333NN1N3N2N2NN0N
                        and this

                        Code:
                        sed -i 's_-1_0_g'
                        to get my .qual file:

                        Code:
                        >853_8_25_F3
                        10 9 0 0 4 4 4 0 4 0 4 4 0 0 4 4 4 0 0 4 8 5 0 0 4 4 4 0 0 4 4 10 0 5 6 4 5 0 0 6 7 4 0 6 8 8 6 0 5 0 
                        >853_8_35_F3
                        30 25 0 0 28 20 19 0 31 0 28 10 0 0 32 24 23 0 0 22 19 20 0 0 26 24 23 0 0 0 0 17 0 28 31 29 23 0 0 27 0 19 0 27 0 21 0 0 26 0 
                        >853_8_75_F3
                        30 28 0 0 26 27 22 0 15 0 26 19 0 0 28 26 25 0 0 29 26 28 0 0 24 19 27 0 0 13 11 20 0 13 26 17 21 0 0 24 4 8 0 7 4 7 7 0 7 0 
                        >853_8_96_F3
                        8 4 0 0 7 4 5 0 4 0 4 4 0 0 7 4 4 0 0 4 4 4 0 0 5 4 4 0 0 0 6 5 0 6 5 4 5 0 0 4 0 4 0 6 0 10 0 0 4 0 
                        >853_8_114_F3
                        30 31 0 0 19 19 21 0 28 0 28 27 0 0 6 14 6 0 0 7 25 22 0 0 4 29 16 0 0 0 0 4 0 17 4 27 16 0 0 5 0 19 0 8 0 7 0 0 15 0
                        I still get the same error... The log file of prep_reads is empty. There is also an empty file called left_kept_reads.fq in the output directory.

                        Code:
                        [Fri Oct  8 11:17:26 2010] Beginning TopHat run (v1.1.0)
                        -----------------------------------------------
                        [Fri Oct  8 11:17:26 2010] Preparing output location /home/schaefer/tophat/RBM20/Sample14/
                        [Fri Oct  8 11:17:26 2010] Checking for Bowtie index files
                        [Fri Oct  8 11:17:26 2010] Checking for reference FASTA file
                        [Fri Oct  8 11:17:26 2010] Checking for Bowtie
                        	Bowtie version:			 0.12.3.0
                        [Fri Oct  8 11:17:26 2010] Checking for Samtools
                        	Samtools version:		 0.1.8.0
                        [Fri Oct  8 11:17:32 2010] Checking reads
                        	min read length: 50bp, max read length: 50bp
                        	format:		 fasta
                        	[FAILED]
                        Error: could not execute prep_reads
                        Also both .csfasta and .qual have the same number of lines.

                        I guess my files should be alright and it must be a problem with the installation?
                        Did you all compile tophat on your system or did you just copy pre-compiled files?

                        Comment


                        • #13
                          I used the precompiled version and it worked. Our sysadmin is recompiling the code with dcjones patch, so as soon as that is done I will test it with unmodified data.

                          DerSeb: Did you try removing the reads with the missed colorcalls instead of converting to 'N' to see if the mixed colorspace-basespace is the issue?
                          Last edited by dsidote; 10-08-2010, 05:38 AM.

                          Comment


                          • #14
                            Originally posted by DerSeb View Post
                            hello and thx for all the ideas and help. I don't want to mess around with the installation so far and still try to convert my csfasta to Ns only. however, when I use this:
                            I don't thing there is a problem with the '.'s needing to be 'N's. It expects '.'s in colorspace reads. The problem is that tophat converts the '.'s to 'N's on exactly one read (the last read), and it should not.

                            I don't know that you can modify your reads to work around that.

                            Comment


                            • #15
                              I finally have worked out a full script which converts my single-end SOLiD data from the Short Read Archive in FASTQ (aside: it would be great if the TopHat team swiped the colorspace reading code from Bowtie, which is much more flexible in the format)
                              Code:
                              #!/usr/bin/perl
                              use strict;
                              
                              # reformat single-end SOLiD data from Short Read Archive
                              # to work successfully with patched version of TopHat 1.1.0
                              
                              # sample bash command line to run TopHat
                              # tophat --color -G hg18.ref-genes.gtf -o SRR040290-tophat -p 8 --quals hg18  SRR040290.csfasta  SRR040290.qual 1> tophat.out 2> tophat.err
                              
                              foreach my $arg(@ARGV)
                              {
                                  my ($stem)=($arg=~/(SRR[0-9]+)/);;
                                  die "Could not identify stem in $arg\n" unless (defined $stem);
                                  
                                  open(IN,$arg);
                                  open(FASTA,">$stem.csfasta");
                                  open(QUAL,">$stem.qual");
                                  while (my $idLineA=<IN>)
                                  {
                              	chomp($idLineA);
                              	my ($id)=($idLineA=~/^.([^ ]+)/);
                              	my $seqLine=<IN>;
                              	my $idLineB=<IN>;
                              	my $qualLine=<IN>;
                              	chomp($qualLine);
                              	my @qualVals=();
                              	foreach my $qualChar(split(//,$qualLine))
                              	{
                              	    my $qualVal=ord($qualChar)-33;
                              	    if ($qualVal<0)
                              	    {
                              		$qualVal=0;  # forbid negative qual values
                              		print STDERR "Negative qual implied by >$qualChar< for $idLineB\n";
                              	    }
                              	    push(@qualVals,$qualVal);
                              	}
                              	shift(@qualVals); # dump first qual val
                              	print FASTA ">$id\n";
                              	print FASTA $seqLine;
                              	print QUAL ">$id\n";
                              	print QUAL join(" ",@qualVals),"\n";
                                  }
                              }
                              So now TopHat runs and exits with success. The junctions.bed file has junctions (a few shown)
                              Code:
                              track name=junctions description="TopHat junctions"
                              chr20   204696  205452  JUNC00000001    1       -       204696  205452  255,0,0 2       31,19   0,737
                              chr20   205747  205877  JUNC00000002    1       -       205747  205877  255,0,0 2       39,11   0,119
                              chr20   218929  219262  JUNC00000003    1       -       218929  219262  255,0,0 2       19,31   0,302
                              chr20   275750  277924  JUNC00000004    3       +       275750  277924  255,0,0 2       36,32   0,2142
                              chr20   277975  278299  JUNC00000005    1       +       277975  278299  255,0,0 2       32,18   0,306
                              chr20   278448  281888  JUNC00000006    2       +       278448  281888  255,0,0 2       28,35   0,3405
                              chr20   309778  316694  JUNC00000007    8       +       309778  316694  255,0,0 2       35,40   0,6876
                              chr20   320182  324880  JUNC00000008    21      +       320182  324880  255,0,0 2       41,39   0,4659
                              BUT, the accepted_hits.bam file is empty! What did I do wrong this time?

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              23 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              21 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X