  • FastQ decrypted from SRA toolkit with warnings: any loss of information?

    Hi,

    Recently we've been trying to decrypt some SRA files from a single project to obtain the FastQ data. We did get the FastQ files, but we also received some warnings, as shown below:
    Code:
    2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
    opening table within short read archive module - column LABEL
    2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
    opening table within short read archive module - column LABEL_START
    2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
    opening table within short read archive module - column LABEL_LEN
    For each SRA file we decrypted "successfully" (see below), we get exactly 5 copies of these warnings.

    A "successful" decryption here means that the FastQ files do contain read information and that their sizes seem reasonable. However, we're still not sure whether the decryption has lost any data, especially important information about the reads themselves (e.g. whether some reads are missing).

    So here are our questions:
    • Do FastQ files decrypted from SRA files with the warnings above differ in read information from those decrypted without them?
    • If so, what are the differences?


    Here are the details of our decryption:
    • sratoolkit used: version 2.3.2-5-centos_linux64 (the newest version when we downloaded the data and tried to decrypt them)
    • the decryption needs a repository key, and we set it up using the GUI started up by sratoolkit.jar
    • program used to decrypt SRA files: fastq-dump
    • command line used to decrypt SRA files: fastq-dump --outdir $OUTPUT_DIR --bzip2 --split-3 --keep-empty-files --log-level info $SRA_FILE
    • each SRA file contains paired-end RNA-Seq data from one biological sample, produced on an Illumina HiSeq 2000; the read length is always 76 bp.
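
A minimal sketch of one way to check the FastQ output for missing reads. It assumes the usual 4-line FASTQ record layout and that `--split-3` with `--bzip2` wrote the two mates to files ending in `_1.fastq.bz2` and `_2.fastq.bz2`. The helper names `fastq_stats` and `mates_match` are hypothetical and not part of the SRA toolkit.

```shell
#!/bin/sh
# Print "<read_count> <min_len> <max_len>" for FASTQ text read from stdin.
# Assumption: every record is exactly 4 lines (no wrapped sequence lines).
fastq_stats() {
    awk 'NR % 4 == 2 {
             n++
             len = length($0)
             if (n == 1 || len < min) min = len
             if (len > max) max = len
         }
         END { print n + 0, min + 0, max + 0 }'
}

# Succeed only if two bzip2-compressed FASTQ files hold the same number of reads.
# Arguments: path to the first-mate file, path to the second-mate file.
mates_match() {
    n1=$(bzcat "$1" | fastq_stats | cut -d' ' -f1)
    n2=$(bzcat "$2" | fastq_stats | cut -d' ' -f1)
    [ "$n1" -eq "$n2" ]
}
```

For example, `bzcat "$OUTPUT_DIR"/SRRXXXXXX_1.fastq.bz2 | fastq_stats` (where `SRRXXXXXX` is a placeholder accession) would be expected to report 76 as both the minimum and maximum length for the runs described above, and `mates_match` on the `_1`/`_2` pair would be expected to succeed.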


    Thanks in advance!

    Yang

  • #2
    I'm a little confused: I received the following reply from GenoMax by e-mail, but it doesn't appear on the forum. Anyway, here's the reply:

    Originally posted by GenoMax
    SRA toolkit error messages can be benign, dataset-specific, etc. Perhaps there is no problem here.

    It may not hurt to send a message to SRA support. Use the "Write to helpdesk" link at the bottom of the page for the toolkit download tab. Include the dataset you are using. It is the weekend, so you may not hear back until Monday. In the past they have sometimes confirmed whether there was a problem with a specific dataset.
    Thanks for the information. As for the NCBI help desk, we did write to them more than 2 weeks ago, but there was no reply. We suppose that something is wrong with the mail servers, and since we could not find any related topics or threads on the internet, yesterday we sent another message and also decided to ask the question here. However, as you have mentioned, maybe we should have included our dataset IDs to tell NCBI which ones we'd like to check.



    • #3
      The NCBI Help Desk replied to me a few days ago and helped resolve these issues. I think it would be good to share the answer with everyone, so here it is:
      • The data will always be valid/complete as long as fastq-dump does not produce any error messages. fastq-dump can produce a lot of warnings when operating on valid data, especially when the log-level is set to 5 (info); the default is 4 (warn).
      • The data will also always be valid/complete as long as it passes the vdb-validate program (i.e. all the outputs are "OK").
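
A rough sketch of how the first rule above could be checked automatically, assuming the fastq-dump messages were captured in a file and follow the layout of the excerpt in the first post (timestamp, tool name, then a severity tag such as `warn:`). The tags `err`, `int`, `sys` and `fatal` are taken from the toolkit's log-level names and are treated here as "error" lines; `count_level` and `log_is_clean` are hypothetical helper names.

```shell
#!/bin/sh
# Count log lines whose third whitespace-separated field is "<level>:".
# Arguments: severity name (e.g. warn, err), path to the captured log file.
count_level() {
    awk -v tag="$1:" '$3 == tag { n++ } END { print n + 0 }' "$2"
}

# Succeed when the log contains no err/int/sys/fatal lines.
# Warnings and info lines are ignored, per the help desk reply.
log_is_clean() {
    for level in err int sys fatal; do
        [ "$(count_level "$level" "$1")" -eq 0 ] || return 1
    done
    return 0
}
```

With the log saved as, say, `dump.log`, `count_level warn dump.log` gives the number of warnings and `log_is_clean dump.log` succeeds only when no error-level line is present. Wrapped continuation lines, like those in the excerpt, are not counted because their third field is not a severity tag.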



      • #4
        What happens if you try sam-dump on the same SRA files instead?



        • #5
          Originally posted by albireo
          What happens if you try sam-dump on the same SRA files instead?
          Hi albireo,

          Sorry for the late reply. These SRA files hold pure FastQ data, not SAM data, and even after reading the sam-dump help page I'm not sure which parameters I should set to decrypt them correctly with sam-dump. Could you tell me why you're interested in the output of sam-dump?



          • #6
            Thanks for sharing the information!
            May I ask why NCBI favors SRA instead of just keeping FASTQ?
            Last edited by shuoguo; 02-22-2014, 07:47 AM.



            • #7
              Originally posted by shuoguo
              Thanks for sharing the information!
              May I ask why NCBI favors SRA instead of just keeping FASTQ?
              As far as I know, FASTQ is a text-based format, so it makes sense to compress the files before distributing them to save time. I don't know why NCBI chose SRA over other popular compression formats, but I guess that by developing its own format, NCBI keeps full control over everything in files compressed this way, most importantly security.



              • #8
                Originally posted by Yang Ding
                As far as I know, FASTQ is a text-based format, so it makes sense to compress the files before distributing them to save time. I don't know why NCBI chose SRA over other popular compression formats, but I guess that by developing its own format, NCBI keeps full control over everything in files compressed this way, most importantly security.
                Thank you!
