  • PacBio data - problem with SRA toolkit

    I am having problems extracting FASTA from a PacBio SRA file using the SRA Toolkit. For example, the file SRR2003880.sra should contain about 163K sequences, but it yields only 46K, and their names do not match those shown on the NCBI SRA website. I can successfully process other PacBio files, and I am using the newest version of the SRA Toolkit with the following command line:

    sratoolkit.2.4.5-2-win64/bin/fastq-dump.exe --fasta SRR2003880.sra

    My best guess is that the data were uploaded to NCBI SRA incorrectly. They have not answered me yet. I would very much appreciate anybody's help or opinion.

    Thank you.

  • #2
    fastq-dump appears to be rejecting reads, and reports this:

    "Rejected 117005 SPOTS because SPOTLEN < 1".

    These reads appear to have no sequence.

    You can confirm this yourself by doing

    Code:
    $ fastq-dump -M 0 -F SRR2003880
    You can download the original HDF5 files for this record (using the "Download" tab) and verify whether there are many zero-length sequences. You will need access to SMRTportal to properly process the raw data files.
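
    If you dump to FASTQ with -M 0 as above, a quick way to count the empty reads might look like the sketch below. The function name count_empty_reads and the demo file are made up for illustration, and it assumes the plain four-lines-per-record FASTQ layout (no wrapped sequence lines).

```shell
# Print "<total records> <records with an empty sequence line>" for a FASTQ file.
# Assumes four lines per record; the sequence is the 2nd line of each record.
count_empty_reads() {
    awk 'NR % 4 == 2 { total++; if (length($0) == 0) empty++ }
         END { printf "%d %d\n", total, empty }' "$1"
}

# Tiny made-up demo file: three records, one of them with no sequence.
demo=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\n\n+\n\n@r3\nGG\n+\nII\n' > "$demo"
count_empty_reads "$demo"   # prints: 3 1
rm -f "$demo"
```

    On a real dump you would pass the FASTQ file that fastq-dump wrote instead of the demo file.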

    • #3
      Thanks. But those reads show up in the NCBI website as not empty.

      • #4
        It is possible that the download from SRA is corrupt. The best recourse is to wait to hear back from SRA support. In my experience, they generally fix such files.

        In the meantime, the HDF5 files from the download tab are the original data from the submitter. They do not appear to include the metadata.xml file that SMRTportal requires, so you may not be able to use the original files right away.

        • #5
          The ENA record appears to have the same number of spots: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/S...03880.fastq.gz

          • #6
            We downloaded the ENA fastq file. It is exactly what we get from the SRA Toolkit, so probably only 46K sequences are usable. What is still unclear is why the NCBI archive website shows the "zero" reads as sequences, e.g. SRA|SRR2003880.1
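
            For anyone who wants to repeat this comparison, a rough sketch for counting records in the two files is below. The function names and the demo files are hypothetical, and the FASTQ count assumes four lines per record.

```shell
# Number of FASTA records = number of header lines starting with ">".
count_fasta() {
    grep -c '^>' "$1"
}

# Number of FASTQ records in a gzip-compressed file,
# assuming four lines per record.
count_fastq_gz() {
    gzip -dc "$1" | awk 'END { print int(NR / 4) }'
}

# Made-up demo data: two records in each file.
tmpdir=$(mktemp -d)
printf '>r1\nACGT\n>r2\nGG\n' > "$tmpdir/demo.fasta"
printf '@r1\nACGT\n+\nIIII\n@r2\nGG\n+\nII\n' | gzip -c > "$tmpdir/demo.fastq.gz"
echo "fasta: $(count_fasta "$tmpdir/demo.fasta")  fastq: $(count_fastq_gz "$tmpdir/demo.fastq.gz")"
rm -rf "$tmpdir"
```

            If the two counts agree for the real files, the toolkit output and the ENA file at least contain the same number of records.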

            • #7
              Dear all,

              I am not sure whether you have resolved your problem, but I had a similar one with PacBio reads from a different data set. After reading this thread, I asked NCBI's Helpdesk, and they explained that PacBio data is special in that multiple error-prone reads are combined to form consensus reads. It is these consensus reads that fastq-dump outputs when run with no options:

              Code:
              fastq-dump SRR2003880
              If the raw reads are required, you need to supply the --table SEQUENCE option. i.e.,

              Code:
              fastq-dump --table SEQUENCE SRR2003880
              I hope this helps someone!

              Ray

              • #8
                That is really helpful. Thank you, Ray.

                In my case (SRR1284074), if I use the default options:
                #fastq-dump SRR1284074
                Rejected 163480 SPOTS because SPOTLEN < 1
                Read 163482 spots for SRR1284074
                Written 2 spots for SRR1284074

                Using "--table SEQUENCE" to dump SRR1284074, I still got 3 spots rejected:
                #fastq-dump --table SEQUENCE SRR1284074
                Rejected 3 SPOTS because SPOTLEN < 1
                Read 163482 spots for SRR1284074
                Written 163479 spots for SRR1284074

                Any more suggestions or comments on this issue are very welcome.
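
                As a sanity check on summaries like the two above, here is a small sketch that reads the three summary lines from standard input and checks that rejected + written equals read. The function name check_spot_counts is made up, and it assumes the summary lines have exactly the wording shown above.

```shell
# Read fastq-dump summary lines on stdin and verify that
# rejected + written == read. Prints the parsed counts and returns
# 0 when they add up, 1 otherwise.
check_spot_counts() {
    awk '$1 == "Rejected" { rejected += $2 }
         $1 == "Read"     { nread = $2 }
         $1 == "Written"  { written = $2 }
         END {
             printf "read=%d written=%d rejected=%d\n", nread, written, rejected
             exit !(written + rejected == nread)
         }'
}

# Summary copied from the --table SEQUENCE run above.
check_spot_counts <<'EOF'
Rejected 3 SPOTS because SPOTLEN < 1
Read 163482 spots for SRR1284074
Written 163479 spots for SRR1284074
EOF
```

                Both runs above add up (163480 + 2 and 3 + 163479 both give 163482), so every spot is either written or rejected as zero-length.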

                Last edited by ynwh; 12-04-2015, 07:00 AM.
