  • FastQ decrypted from SRA toolkit with warnings: any loss of information?

    Hi,

    Recently we've been trying to decrypt some SRA files from the same project to get the FastQ data. We did get the FastQ files, but we also received some warnings, as shown below:
    Code:
    2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
    opening table within short read archive module - column LABEL
    2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
    opening table within short read archive module - column LABEL_START
    2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
    opening table within short read archive module - column LABEL_LEN
    For each SRA file that we decrypted "successfully" (see below), we get exactly 5 copies of these warnings.

    A "successful" decryption here means that the FastQ files do contain read information and that their sizes seem reasonable. However, we're still not sure whether the decryption has led to any loss of data, especially important information about the reads themselves (e.g. whether some reads have been lost).

    So here are the questions we'd like to ask:
    • Is there any difference with respect to read information between the FastQ files decrypted from SRA files with or without the warnings mentioned above?
    • If yes, what are the differences?


    Here are the details of our decryption:
    • sratoolkit used: version 2.3.2-5-centos_linux64 (the newest version when we downloaded the data and tried to decrypt them)
    • the decryption needs a repository key, and we set it up using the GUI started up by sratoolkit.jar
    • program used to decrypt SRA files: fastq-dump
    • command line used to decrypt SRA files: fastq-dump --outdir $OUTPUT_DIR --bzip2 --split-3 --keep-empty-files --log-level info $SRA_FILE
    • each SRA file contains paired-end RNA-Seq data from one biological sample, produced by an Illumina HiSeq 2000, and the read length is always 76 bp.
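One rough way to check the output for lost reads is to compare the record counts of the two mate files and to confirm that every read has the expected length of 76 bp. The following is only a minimal sketch: it assumes each FASTQ record spans exactly four lines, and the function names and the `sample_1.fastq.bz2` / `sample_2.fastq.bz2` file names are hypothetical placeholders, not names taken from the run above.

```shell
#!/bin/sh
# Print the number of FASTQ records read from stdin.
# Assumption: every record spans exactly four lines.
count_reads() {
    awk 'END { print NR / 4 }'
}

# Print the number of records whose sequence line is not $1 bases long.
count_wrong_length() {
    awk -v len="$1" 'NR % 4 == 2 && length($0) != len { n++ } END { print n + 0 }'
}

# Hypothetical usage on a pair of bzip2-compressed mate files:
# bzcat sample_1.fastq.bz2 | count_reads
# bzcat sample_2.fastq.bz2 | count_reads
# bzcat sample_1.fastq.bz2 | count_wrong_length 76
```

Equal counts for both mates and a wrong-length count of 0 would be consistent with a complete dump, although this does not prove that the counts match what was originally submitted to SRA.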


    Thanks in advance!

    Yang

  • #2
    I'm a little confused that I received the following reply from GenoMax by e-mail while there's none on the forum. Anyway, here's the reply:

    Originally posted by GenoMax
    SRA toolkit error messages can be benign, data set specific etc. Perhaps there is no problem here.

    It may not hurt to send a message to SRA support. Use the "Write to helpdesk" link at the bottom of the page for the toolkit download tab. Include the dataset you are using. It is the weekend, so you may not hear back till Monday. In the past they have sometimes confirmed whether there was a problem with a specific dataset.
    Thanks for the information. As for the NCBI help desk, we did write to them more than 2 weeks ago, but there was no reply. We suppose that something went wrong with the mail servers, and since we could not find any related topics or threads on the internet, yesterday we sent another message and also decided to ask the question here. However, as you have mentioned, maybe we should have included our dataset IDs to tell NCBI which ones we'd like checked.


    • #3
      The NCBI Help Desk replied to me a few days ago and helped resolve these issues. I think it would be good to share the solution with everyone, so here it is:
      • The data will always be valid/complete as long as fastq-dump does not produce any error messages. fastq-dump can produce a lot of warnings when operating on valid data, especially when the log-level is set to 5 (default is 4).
      • The data will also always be valid/complete as long as it passes the vdb-validate program (i.e. all the outputs are "OK").
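The first point can be checked mechanically by saving the messages from fastq-dump to a file and counting the lines logged at error severity or worse. This is only a sketch under an assumption: that such lines carry a level label (`fatal`, `sys`, `int` or `err`) in the same position where the warnings quoted in the first post show `warn:`. The function name and `dump.log` are hypothetical.

```shell
#!/bin/sh
# Count log lines at error severity or worse in a saved fastq-dump log.
# Assumption: severe lines carry a level label (fatal, sys, int, err)
# in the same position where the warnings show "warn:".
count_error_lines() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when that number is 0, hence the "|| true".
    grep -c -E ' (fatal|sys|int|err): ' "$1" || true
}

# Hypothetical usage: dump.log holds the saved messages of a fastq-dump run.
# if [ "$(count_error_lines dump.log)" -eq 0 ]; then
#     echo "only warnings or info messages were logged"
# fi
#
# The second point corresponds to running the validator on the file itself:
# vdb-validate "$SRA_FILE"
```

A count of 0 means the log contains warnings or informational lines only, which according to the reply above does not indicate a problem with the data.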


      • #4
        What happens if you try sam-dump on the same SRA files instead?


        • #5
          Originally posted by albireo
          What happens if you try sam-dump on the same SRA files instead?
          Hi albireo,

          Sorry for the late reply. These SRA files are pure FastQ files, not SAM files, and I'm not sure which parameters I should set to use sam-dump to decrypt these SRA files correctly even after I have read the help page of sam-dump. Could you tell me why you're interested in the output of sam-dump?


          • #6
            Thanks for sharing the information!
            May I ask why NCBI favors SRA instead of just keeping FASTQ?
            Last edited by shuoguo; 02-22-2014, 07:47 AM.


            • #7
              Originally posted by shuoguo
              Thanks for sharing the information!
              May I ask why NCBI favors SRA instead of just keeping FASTQ?
              As far as I know, FASTQ is a text-based format, so it is better to compress the files before distributing them to save transfer time. I don't know why NCBI chose SRA over other popular compression formats, but my guess is that by developing its own format, NCBI has full control over every aspect of files stored this way, the most important of which is probably security.


              • #8
                Originally posted by Yang Ding
                As far as I know, FASTQ is a text-based format, so it is better to compress the files before distributing them to save transfer time. I don't know why NCBI chose SRA over other popular compression formats, but my guess is that by developing its own format, NCBI has full control over every aspect of files stored this way, the most important of which is probably security.
                Thank you!
