  • SRA files for use in MIRA3 assembler

    Hi, I am new to bioinformatics and would greatly appreciate any guidance, as I'm largely going at this alone.
    Essentially, I want to use MIRA3 to assemble some 454 EST data that I downloaded from NCBI. It is in the full SRA format, and I was planning to use the SRA toolkit (sff-dump) to convert it into SFF format, then use sff_extract to create the FASTQ and XML files for the main MIRA3 assembler.

    Is this the best approach? I have tried sff-dump but am told that my data is not supported while constructing formatter within short read archive module, followed by my SRA file directory. The SRA toolkit seems to work in general, as I have successfully converted the SRA files into FASTQ files.

    Thanks in advance, seenstevo

  • #2
    Some 454 data is submitted to SRA as FASTQ. You will not be able to create SFF files from such data.


    • #3
      Is there any special reason you want to go the sff_extract way instead of directly using the FASTQ files from the SRA?


      • #4
        @srasdk: I have heard that some 454 data is submitted as FASTQ. However, since I was able to use the fastq-dump tool to create FASTQ files that I could then view (which I could not do before), I assumed the original files were not in FASTQ format. Running the file command on them tells me they are simply data files, so I am not sure what format they are in.


        • #5
          @BaCh: According to the MIRA3 guide to preparing 454 data, it wants the FASTQ files for the sequence and quality and the XML file for the clipping information. If my SRA data still contains lowercase base calls and N's, do I still need to do the clipping as suggested by MIRA3? The XML seems to be a required file.

          If there is another approach I could take using just the FASTQ files I've got, I'd welcome suggestions. Bear in mind that when I view the FASTQ files, all lowercase base calls have been converted to uppercase. I guess this would not be a problem if the quality information is retained in the FASTQ format.
          Thanks, seenstevo


          • #6
            The format you get from NCBI is SRA format. When an SRA file is created from FASTQ data, it lacks information needed to generate SFF, such as the 454 signal and the right quality clip.

            I am not familiar with MIRA3, but it sounds like you lack ready-to-use scripts to generate the required format.
            If you are handy with Perl, Python, awk, etc., you may be able to use the generic vdb-dump tool from the Toolkit and post-process its output.

            Example:
            ./vdb-dump -C NAME,READ,'(INSDC:quality:text:phred_33)QUALITY',READ_LEN SRR000001 -f tab | head -1
            EM7LVYS01C1LWG TCAGGGGGGAGCTTAAATTTGAAACTAGAAAAATTTTGAACAAAATAATCATAATTGTTAGCTGATGAAAAACTAGAAAAGATTTTCTGAGTGTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAACGGTATCCCGTAGTGTGCATTCATCCCTGCTCTGGATACAGTCAGCTCCCAAATTCCATAAACAACTCCTTTGTAAGTAACCTCCTTTTGACAGGGGGTACTGAGCGGGCTGGCAAGGCN =;8GC91*#==<C=EA.EA/<B=(<<:=HC90'FB5&;B:<GC6(=D=<<==C=C==B<=<<<=;<<GC8.#<<9=FB4%<8EA4%87:<<8=B;C<@8>5=C?*A<&A<&<=49/2A='@;#A<&<A9C=@9B::B:<;=C?+<<;<===<=;C<==<FB0=<=<<<D=9=;;=<=<=<;=FB2FB2C<C<;=FB0<C==;C<D@-<=B:<=C=C;<C=GD7*=;:=HD90'==<<=<=:FB0<<C<;C=C=<! 4, 88, 44, 119
            You get tab-separated output with one record per line.
            The first 3 columns are the name, the base calls, and the Phred quality in ASCII format with offset 33.
            The last column is the read layout:
            4 bases - TCAG primer
            88 bases - first mate
            44 bases - 454 mate linker
            119 bases - second mate

            For 454 fragment reads you will get 2 lengths: one for the primer and one for the fragment.
            Last edited by srasdk; 10-26-2011, 09:47 AM. Reason: formating
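
The post-processing suggested above could look roughly like the following Python sketch. It assumes the four tab-separated columns shown in the example (NAME, READ, QUALITY, READ_LEN) and that the READ_LEN values sum to the read length. The function names are hypothetical helpers for illustration, not part of the SRA Toolkit.

```python
def parse_vdb_dump_line(line):
    """Parse one tab-separated vdb-dump record: name, bases, qualities, layout."""
    name, bases, quals, read_len = line.rstrip("\n").split("\t")
    # READ_LEN is a comma-separated list such as "4, 88, 44, 119".
    lengths = [int(x) for x in read_len.split(",")]
    if len(bases) != len(quals):
        raise ValueError(f"{name}: base and quality strings differ in length")
    # Assumption: the layout segments cover the whole read.
    if sum(lengths) != len(bases):
        raise ValueError(f"{name}: READ_LEN does not sum to the read length")
    return name, bases, quals, lengths


def split_segments(bases, quals, lengths):
    """Cut bases and qualities into (bases, quals) pairs, one per layout segment."""
    segments = []
    start = 0
    for n in lengths:
        segments.append((bases[start:start + n], quals[start:start + n]))
        start += n
    return segments


def phred_scores(quals, offset=33):
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(c) - offset for c in quals]


def to_fastq(name, bases, quals):
    """Format one four-line FASTQ record."""
    return f"@{name}\n{bases}\n+\n{quals}\n"


if __name__ == "__main__":
    # Synthetic record, not real SRA data: primer, mate 1, linker, mate 2.
    record = "read1\tTCAGAAACCGG\tIIIIHHHGGFF\t4, 3, 2, 2"
    name, bases, quals, lengths = parse_vdb_dump_line(record)
    primer, mate1, linker, mate2 = split_segments(bases, quals, lengths)
    print(to_fastq(name + "/1", *mate1), end="")
    print(to_fastq(name + "/2", *mate2), end="")
```

For a fragment read, `split_segments` would return only two pairs (primer and fragment), so the unpacking in the usage part would need to be adjusted accordingly.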


            • #7
              Originally posted by seenstevo
              @BaCh: According to the MIRA3 guide to preparing 454 data, it wants the FASTQ files for the sequence and quality and the XML file for the clipping information. If my SRA data still contains lowercase base calls and N's, do I still need to do the clipping as suggested by MIRA3? The XML seems to be a required file.
              Having the XML is the best thing, but as long as the sequence data has "clippings" via lowercase/uppercase, MIRA will understand that. Just turn off the MIRA warning that it wants the XML.

              B.
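
To illustrate what clipping "via lowercase/uppercase" means, here is a minimal Python sketch that finds the span from the first to the last uppercase base in a read. It assumes clipped flanks are written in lowercase, as described above. `uppercase_span` is a hypothetical helper for illustration only and is not how MIRA itself is implemented.

```python
def uppercase_span(seq):
    """Return the 0-based, half-open (start, end) span from the first to the
    last uppercase base. Returns (0, 0) if the read has no uppercase bases."""
    upper_idx = [i for i, base in enumerate(seq) if base.isupper()]
    if not upper_idx:
        return (0, 0)
    return (upper_idx[0], upper_idx[-1] + 1)


if __name__ == "__main__":
    read = "tcagACGTNacgt"  # synthetic example, lowercase flanks are clipped
    start, end = uppercase_span(read)
    print(read[start:end])
    # Once a read has been uppercased, the case-based clip information is gone:
    print(uppercase_span(read.upper()))
```

The second call shows why a fully uppercased FASTQ can no longer carry this kind of clip information: the span then covers the whole read.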


              • #8
                That vdb-dump method looks a bit complicated for me, but I will collar someone who might know how to use it, thanks.

                @BaCh: When I viewed the FASTQ files I converted from the SRA files, they lacked the lowercase/uppercase information, as everything was simply put in uppercase. Does this mean that the information is lost, and is there any way to keep it in case I can't get the SRA files into SFF format?
                Cheers
