
  • Processing SOLiD data from SRA using Tophat

    Hello all,

    I'm attempting to run Tophat on SOLiD data from an SRA file and running into problems with the fastq file formatting.

    After running fastq-dump on the SRA file, I get the following format:

    @SRR1119927.1 solid309_20110721_FRAG_BC_yadegari_1_55_1170 length=50
    T000002201013000130000000.01...20...2....2.....2...
    +SRR1119927.1 solid309_20110721_FRAG_BC_yadegari_1_55_1170 length=50
    !+,0,,/'*&/)&&)2%&+2.0%37!7%!!!1%!!!%!!!!5!!!!!5!!!
    Executing Tophat like this:

    tophat -C -o output --bowtie1 ColorIndex SRR.fastq

    Results in the following error:

    Error running bowtie:
    Too few quality values for read: 2899T33
    are you sure this is a FASTQ-int file?
    I researched this error and found that I may need to use the --quals option and provide a separate quality file. So I split the FASTQ file into two separate files:

    @SRR1119927.1 solid309_20110721_FRAG_BC_yadegari_1_55_1170 length=50
    T000002201013000130000000.01...20...2....2.....2...
    +SRR1119927.1 solid309_20110721_FRAG_BC_yadegari_1_55_1170 length=50
    !+,0,,/'*&/)&&)2%&+2.0%37!7%!!!1%!!!%!!!!5!!!!!5!!!
    And ran:

    tophat -C --quals -o output --bowtie1 ColorIndex SRR.fastq SRR_qual.fastq

    That generates the following error:

    Error encountered parsing file SRR.fastq:
    Premature end of file (missing quality values for SRR1119927.1 solid309_20110721_FRAG_BC_yadegari_1_55_1170 length=50)
    I can't find any information on how to format the base and quality files, once they are separated, so that Tophat can read them. Is this my problem, or is it something else?

    <EDIT>

    I reformatted the two split files as FASTA:

    >SRR1119927.1 solid309_20110721_FRAG_BC_yadegari_1_55_1170 length=50
    T000002201013000130000000.01...20...2....2.....2...
    >SRR1119927.1 solid309_20110721_FRAG_BC_yadegari_1_55_1170 length=50
    !+,0,,/'*&/)&&)2%&+2.0%37!7%!!!1%!!!%!!!!5!!!!!5!!!
    But now get the following error:

    Error running 'prep_reads'
    Error: beginning of quality values record not found! (!'/,<&.&&*'%1*%.2(%&20%'&!')!!!%&!!!1!!!!1!!!!!%!!!)
    Last edited by Helical; 06-19-2014, 06:43 AM.

  • #2
    TopHat is probably expecting the data in two files, .csfasta and .qual.

    I think there is a command, 'abi-dump', that can be used instead of fastq-dump and will produce the file formats that you need.


    • #3
      Did you use fastq-dump or abi-dump to generate your original files? If the SRA submission was actually color-space reads, then you should use "abi-dump", NOT fastq-dump, from the SRA Toolkit. The abi-dump command will give you matched csfasta/csqual files.
      Michael Black, Ph.D.
      ScitoVation LLC. RTP, N.C.
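
Both replies point to abi-dump as the way to get matched color-space and quality files. For readers who only have the fastq-dump output, the Python sketch below illustrates the kind of split being discussed: one file with the color calls and one with numeric quality values. It rests on several assumptions that are not confirmed in this thread: that the qualities are Phred+33 encoded, that the quality file should hold space-separated integers under a `>` header, and that the extra leading quality character belongs to the primer base (the record in post #1 has 51 quality characters for 50 color calls). The function names are hypothetical, and the output has not been checked against TopHat.

```python
import io


def split_colorspace_fastq(fastq_lines):
    """Yield (name, colour_sequence, quality_integers) for each 4-line FASTQ record."""
    lines = [line.strip() for line in fastq_lines if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for start in range(0, len(lines), 4):
        header, seq, plus, qual = lines[start:start + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record number {start // 4 + 1}")
        # Assumption: as in the record quoted in post #1, the quality string
        # carries one extra leading character for the primer base; drop it so
        # that there is exactly one value per colour call.
        if len(qual) == len(seq):
            qual = qual[1:]
        if len(qual) != len(seq) - 1:
            raise ValueError(f"quality/colour length mismatch in {header}")
        # Assumption: Phred+33 encoding, so '!' maps to 0.
        yield header[1:], seq, [ord(ch) - 33 for ch in qual]


def write_csfasta_and_qual(fastq_lines, csfasta_out, qual_out):
    """Write one csfasta-style and one qual-style record per read; return the read count."""
    count = 0
    for name, seq, quals in split_colorspace_fastq(fastq_lines):
        qual_text = " ".join(str(q) for q in quals)
        csfasta_out.write(f">{name}\n{seq}\n")
        qual_out.write(f">{name}\n{qual_text}\n")
        count += 1
    return count


if __name__ == "__main__":
    # The single record quoted in post #1, used here as demo input.
    example = [
        "@SRR1119927.1 solid309_20110721_FRAG_BC_yadegari_1_55_1170 length=50",
        "T000002201013000130000000.01...20...2....2.....2...",
        "+SRR1119927.1 solid309_20110721_FRAG_BC_yadegari_1_55_1170 length=50",
        "!+,0,,/'*&/)&&)2%&+2.0%37!7%!!!1%!!!%!!!!5!!!!!5!!!",
    ]
    csfasta, qual = io.StringIO(), io.StringIO()
    write_csfasta_and_qual(example, csfasta, qual)
    print(csfasta.getvalue(), end="")
    print(qual.getvalue(), end="")
```

For real data, the in-memory buffers would be replaced by open file handles (for example an input FASTQ and two output files). If abi-dump is available, it is likely the safer route, since it is the approach recommended in the replies above.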
