  • FastQ decrypted from SRA toolkit with warnings: any loss of information?

    Hi,

    Recently we've been trying to decrypt some SRA files from the same project to get the FastQ data. We got the FastQ files, but we also received some warnings, as shown below:
    Code:
    2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
    opening table within short read archive module - column LABEL
    2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
    opening table within short read archive module - column LABEL_START
    2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
    opening table within short read archive module - column LABEL_LEN
    For each SRA file we decrypted "successfully" (see below), we get exactly 5 copies of these warnings.

    A "successful" decryption here means that the FastQ files do contain read information and that their sizes seem reasonable. However, we're still not sure whether the decryption has led to any loss of data, especially important information about the reads themselves (e.g. whether some reads have been lost).

    So here are the questions we'd like to ask:
    • Is there any difference with respect to read information between the FastQ files decrypted from SRA files with or without the warnings mentioned above?
    • If yes, what are the differences?


    Here are the details of our decryption:
    • sratoolkit used: version 2.3.2-5-centos_linux64 (the newest version when we downloaded the data and tried to decrypt them)
    • the decryption needs a repository key, and we set it up using the GUI started up by sratoolkit.jar
    • program used to decrypt SRA files: fastq-dump
    • command line used to decrypt SRA files: fastq-dump --outdir $OUTPUT_DIR --bzip2 --split-3 --keep-empty-files --log-level info $SRA_FILE
    • each SRA file contains paired-end RNA-Seq data from one biological sample, produced by Illumina HiSeq 2000, and the read length is always 76 bp.
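
One quick sanity check for lost reads on paired-end output like the above is to compare read counts between the two mate files. This is a minimal sketch, not an official SRA Toolkit procedure. It assumes each FASTQ record takes exactly 4 lines, and the helper names `count_reads` and `mates_match` are made up for illustration.

```shell
# Count reads in a FASTQ file (plain or .bz2).
# Assumption: every record is exactly 4 lines (header, bases, '+', qualities).
count_reads() {
    local file="$1" lines
    case "$file" in
        *.bz2) lines=$(bzcat "$file" | wc -l) ;;
        *)     lines=$(wc -l < "$file") ;;
    esac
    echo $(( lines / 4 ))
}

# Print both counts and succeed only if the two mate files agree.
mates_match() {
    local n1 n2
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "mate1=$n1 mate2=$n2"
    [ "$n1" -eq "$n2" ]
}
```

For example, `mates_match sample_1.fastq.bz2 sample_2.fastq.bz2` (hypothetical file names) returns non-zero if the counts differ. Equal counts do not prove that nothing was lost, but unequal counts in the paired files would be a clear red flag.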


    Thanks in advance!

    Yang

  • #2
    I'm a little confused that I received the following reply from GenoMax by e-mail while there's none on the forum. Anyway, here's the reply:

    Originally posted by GenoMax
    SRA toolkit error messages can be benign, data set specific etc. Perhaps there is no problem here.

    It may not hurt to send a message to SRA support. Use the "Write to helpdesk" link at the bottom of the page for the toolkit download tab. Include the dataset you are using. It is the weekend, so you may not hear back till Monday. In the past they have sometimes confirmed whether there was a problem with a specific dataset.
    Thanks for the information. As for the NCBI help desk, we did write to them more than 2 weeks ago, but there was no reply. We suppose that something is wrong with the mail servers, and since we could not find any related topics or threads on the internet, yesterday we sent another message and also decided to ask the question here. However, as you mention, maybe we should have included our dataset IDs to tell NCBI which ones we'd like checked.


    • #3
      The NCBI Help Desk replied to me a few days ago and helped resolve these issues. I think it would be good to share the solution with everyone here, so here it is:
      • The data will always be valid/complete as long as fastq-dump does not produce any error messages. fastq-dump can produce a lot of warnings when operating on valid data, especially when the log level is set to 5 (the default is 4).
      • The data will also always be valid/complete as long as it passes the vdb-validate program (i.e. all the outputs are "OK").
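
The first rule above can be checked mechanically if the fastq-dump messages are saved to a file. Below is a minimal sketch that tallies warnings and errors in such a log. The `warn:` tag is taken from the log lines quoted in the first post. The `err:` tag for error lines is an assumption based on the same pattern, and the function names `count_level` and `log_is_clean` are hypothetical.

```shell
# Count log lines of one severity (e.g. "warn" or "err") in a saved fastq-dump log.
# Expected line shape, as in the first post:
#   2014-01-22T00:04:11 fastq-dump.2.3.2 warn: column not found while
count_level() {
    local level="$1" log="$2"
    # grep -c exits non-zero when the count is 0, so keep the function from failing.
    grep -c -E "^[^ ]+ fastq-dump[^ ]* ${level}: " "$log" || true
}

# Print a summary and succeed only if the log has no error lines.
# Assumption: error lines carry an "err:" tag in the same position as "warn:".
log_is_clean() {
    local log="$1" errs warns
    errs=$(count_level err "$log")
    warns=$(count_level warn "$log")
    echo "errors=$errs warnings=$warns"
    [ "$errs" -eq 0 ]
}
```

A log full of `warn:` lines still passes `log_is_clean`, which matches the help desk's explanation. Checking the exit status of fastq-dump itself and running `vdb-validate` on the SRA file, as described above, remain the more direct checks.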


      • #4
        What happens if you try sam-dump on the same SRA files instead?


        • #5
          Originally posted by albireo
          What happens if you try sam-dump on the same SRA files instead?
          Hi albireo,

          Sorry for the late reply. These SRA files hold pure FastQ data, not SAM data, and even after reading the sam-dump help page I'm not sure which parameters I should set to decrypt them correctly with sam-dump. Could you tell me why you're interested in the output of sam-dump?


          • #6
            Thanks for sharing the information!
            May I ask why NCBI favors SRA instead of just keeping FASTQ?
            Last edited by shuoguo; 02-22-2014, 07:47 AM.


            • #7
              Originally posted by shuoguo
              Thanks for sharing the information!
              May I ask why NCBI favors SRA instead of just keeping FASTQ?
              As far as I know, FASTQ is a text-based format, so it makes sense to compress the files before distributing them to save time. I don't know why NCBI chose SRA instead of another popular compression format, but my guess is that by developing its own format, NCBI gains full control over every aspect of files stored this way, the most important of which is probably security.


              • #8
                Originally posted by Yang Ding
                As far as I know, FASTQ is a text-based format, so it makes sense to compress the files before distributing them to save time. I don't know why NCBI chose SRA instead of another popular compression format, but my guess is that by developing its own format, NCBI gains full control over every aspect of files stored this way, the most important of which is probably security.
                Thank you!
