  • PacBio data - problem with SRA toolkit

    I am having trouble extracting FASTA from a PacBio SRA file with the SRA Toolkit. For example, the file SRR2003880.sra should contain about 163K sequences, but it yields only 46K, and their names do not correspond to the names shown on the NCBI SRA website. I can process other PacBio files successfully, and I am using the newest version of the SRA Toolkit with the following command line:

    sratoolkit.2.4.5-2-win64/bin/fastq-dump.exe --fasta SRR2003880.sra

    My best guess is that the data were uploaded to the NCBI SRA incorrectly. They have not answered me yet. I would very much appreciate anybody's help or opinion.

    Thank you.
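
One way to check how many sequences actually ended up in the dumped FASTA file is to count its header lines. The sketch below assumes a plain (uncompressed) FASTA file in which every record starts with `>`; the function name `count_fasta_records` and the file name in the usage comment are hypothetical.

```shell
# Count FASTA records by counting header lines (lines starting with '>').
# Assumes an uncompressed FASTA file; the function name is illustrative.
count_fasta_records() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # nothing matches, so "|| true" keeps the function's exit status clean.
    grep -c '^>' "$1" || true
}

# Example (hypothetical output file name):
# count_fasta_records SRR2003880.fasta
```

The printed number can then be compared with the spot count listed for the run on the SRA website.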

  • #2
    fastq-dump appears to be rejecting reads, as this message shows:

    "Rejected 117005 SPOTS because SPOTLEN < 1".

    These reads appear to have no sequence.

    You can confirm this yourself by doing

    Code:
    $ fastq-dump -M 0 -F SRR2003880
    You can download the original HDF5 files for this record (using the "Download" tab) and verify whether there are many zero-length sequences. You will need access to SMRTportal to properly process the raw data files.
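
To put a number on the empty reads after dumping with `-M 0`, the sequence lines of the resulting FASTQ can be inspected. This is a rough sketch that assumes a strict four-lines-per-record FASTQ layout (no wrapped sequences) and that empty reads do show up in the output as empty sequence lines; `count_empty_reads` is a made-up helper name.

```shell
# Count FASTQ records whose sequence line is empty.
# Assumes exactly four lines per record; the function name is illustrative.
count_empty_reads() {
    # Line 2 of every 4-line record holds the sequence.
    awk 'NR % 4 == 2 && length($0) == 0 { n++ } END { print n + 0 }' "$1"
}

# Example (hypothetical file name):
# count_empty_reads SRR2003880.fastq
```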


    • #3
      Thanks. But those reads show up on the NCBI website as non-empty.


      • #4
        It is possible that the download from SRA is corrupt. The best recourse is to wait to hear back from SRA support. In my experience, they generally fix these files.

        In the meantime, the HDF5 files from the Download tab are the original data from the submitter. They do not appear to include the metadata.xml file that SMRTportal requires, so you may not be able to use the original files right away.


        • #5
          ENA record appears to have the same number of spots: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/S...03880.fastq.gz


          • #6
            We downloaded the ENA FASTQ file. It is exactly what we get from the SRA Toolkit, so probably only 46K sequences are usable. What is still unclear is why the NCBI archive website shows the "zero" reads as sequences, e.g. SRA|SRR2003880.1
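
A quick way to make this kind of comparison is to count the records in both files. The sketch below assumes four lines per FASTQ record and that `gzip` is available for the compressed ENA file; the function name and the file names in the usage comment are hypothetical.

```shell
# Print the number of records in a FASTQ file, gzipped or not.
# Assumes four lines per record; the function name is illustrative.
count_fastq_records() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;   # decompress to stdout
        *)    cat "$1" ;;
    esac | awk 'END { print int(NR / 4) }'
}

# Example (hypothetical file names):
# count_fastq_records ena_download.fastq.gz
# count_fastq_records SRR2003880.fastq
```

Equal counts only show that the two files hold the same number of reads; they do not prove that the sequences are identical.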


            • #7
              Dear all,

              I am not sure whether you have resolved your problem, but I had a similar one with PacBio reads from a different data set. After reading this thread, I asked NCBI's Helpdesk, and they explained that PacBio data is special in that multiple error-prone reads are combined into consensus reads. It is these consensus reads that fastq-dump outputs when run with no options:

              Code:
              fastq-dump SRR2003880
              If the raw reads are required, you need to supply the --table SEQUENCE option, i.e.:

              Code:
              fastq-dump --table SEQUENCE SRR2003880
              I hope this helps someone!

              Ray
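
To keep the two kinds of output side by side, both dumps can be written to separate directories. This is only a sketch: `dump_both_tables`, the `FASTQ_DUMP` variable, and the directory names `consensus` and `raw` are assumptions made for illustration, while `--outdir` and `--table` are regular fastq-dump options.

```shell
# Command to run; override FASTQ_DUMP if the binary lives elsewhere.
FASTQ_DUMP=${FASTQ_DUMP:-fastq-dump}

# Dump the default (consensus) reads and the raw reads of one accession
# into two separate directories so the outputs do not overwrite each other.
# The function and directory names are illustrative.
dump_both_tables() {
    acc=$1
    "$FASTQ_DUMP" --outdir consensus "$acc" &&
    "$FASTQ_DUMP" --outdir raw --table SEQUENCE "$acc"
}

# Example:
# dump_both_tables SRR2003880
```

Routing the command through a variable also makes a dry run easy: with `FASTQ_DUMP=echo`, the function only prints the arguments it would pass.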


              • #8
                That is really helpful. Thank you, Ray.

                In my case (SRR1284074), if I use the default options:
                #fastq-dump SRR1284074
                Rejected 163480 SPOTS because SPOTLEN < 1
                Read 163482 spots for SRR1284074
                Written 2 spots for SRR1284074

                When I use "--table SEQUENCE" to dump SRR1284074, I still get 3 spots rejected:
                #fastq-dump --table SEQUENCE SRR1284074
                Rejected 3 SPOTS because SPOTLEN < 1
                Read 163482 spots for SRR1284074
                Written 163479 spots for SRR1284074
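
The three summary lines can be checked for consistency automatically: the number of spots read should equal the number written plus the number rejected. The sketch below assumes the summary has exactly the line formats quoted above and is fed in on standard input; `check_spot_summary` is a hypothetical helper name.

```shell
# Check that "Read" equals "Written" plus "Rejected" in a fastq-dump summary.
# Reads the summary text on stdin; assumes the line formats quoted above.
check_spot_summary() {
    awk '
        /^[ \t]*Rejected [0-9]+ SPOTS/ { rejected += $2 }
        /^[ \t]*Read [0-9]+ spots/     { read_n = $2 }
        /^[ \t]*Written [0-9]+ spots/  { written = $2 }
        END {
            printf "read=%d written=%d rejected=%d\n", read_n, written, rejected
            # Exit status 0 if the numbers add up, 1 otherwise.
            exit ((read_n == written + rejected) ? 0 : 1)
        }
    '
}

# Example (2>&1 because no assumption is made here about which stream
# the summary is written to):
# fastq-dump --table SEQUENCE SRR1284074 2>&1 | check_spot_summary
# For the second log above this prints: read=163482 written=163479 rejected=3
```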

                Any more suggestions or comments on this issue are very welcome.

                Last edited by ynwh; 12-04-2015, 07:00 AM.
