  • Generate a base level quality score file

    Dear all,
    Could anyone help me generate/extract a base-level quality score file from fastq files that can be used to submit data to NCBI?
    The file should look like this:

    >contig0001
    51 63 70 82 82 82 90 90 90 90 86 86
    86 86 86 86 90 90 90 90 90 86 86 78...

    My sequences were generated on a MiSeq, which I set up to produce fastq files, and were later assembled using CLC Genomics Workbench.
    I would appreciate any comments.
    Regards,

  • #2
    Hi Sergioo,

    I have a tool which can split fastq files into fasta+qual:

    reformat.sh in=assembly.fastq out=assembly.fasta qfout=assembly.qual qin=33

    -Brian
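
For readers who want to see what such a split involves, here is a minimal Python sketch of a FASTQ to FASTA + QUAL conversion. It is not reformat.sh and makes no claim about how that tool works internally; the function name `fastq_to_fasta_qual` and the `per_line` parameter are hypothetical, and it assumes strict 4-line FASTQ records with Phred+33 (ASCII-33) qualities.

```python
import io


def fastq_to_fasta_qual(fastq_handle, fasta_handle, qual_handle, offset=33, per_line=12):
    """Split 4-line FASTQ records into FASTA and space-separated QUAL records.

    Assumes each record is exactly four lines (no wrapped sequences) and that
    qualities use the given ASCII offset (33 by default).
    """
    while True:
        header = fastq_handle.readline().rstrip("\n")
        if not header:
            break  # end of input (or a blank trailing line)
        seq = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        qual = fastq_handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        name = header[1:]
        scores = [ord(ch) - offset for ch in qual]
        fasta_handle.write(f">{name}\n{seq}\n")
        qual_handle.write(f">{name}\n")
        # Wrap the scores so the QUAL file stays readable.
        for i in range(0, len(scores), per_line):
            qual_handle.write(" ".join(map(str, scores[i:i + per_line])) + "\n")


# Toy record; the name and qualities are invented for illustration.
fastq = io.StringIO("@contig0001\nACGTACGTACGTAC\n+\nIIIIIIIIIIII5?\n")
fasta_out, qual_out = io.StringIO(), io.StringIO()
fastq_to_fasta_qual(fastq, fasta_out, qual_out)
print(qual_out.getvalue())
```

With real files, the three handles would be ordinary `open(...)` file objects instead of `io.StringIO`.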

    • #3
      @Brian: Sergioo wants an "average" Q-score value for each position/column from a set of aligned reads in a contig (at least that is my interpretation).

      • #4
        Originally posted by GenoMax View Post
        @Brian: Sergioo wants an "average" Q-score value for each position/column from a set of aligned reads in a contig (at least that is my interpretation).
        Often, per-base quality is provided by the assembler; some assemblers generate a fastq assembly. I'm not really sure in this case, as I don't use CLC, but those quality values are way too high for an average of bases covering a location. Actually they are too high for any normal interpretation, but usually unrealistically high QVs like that come straight from an assembler.

        Sergioo, is your current assembly in fastq or fasta?

        • #5
          @Brian: Those values in the example above are made up (based on a PM). Sergioo wanted to show the format but chose values that are not on the Phred scale for that example.

          • #6
            Ah - OK, I don't know of anything that specifically accepts an assembly and reads, and outputs average quality scores by location. There is a program, ALE, that will generate per-base accuracy estimates based on the read mapping, though not in that format.

            • #7
              Thanks all for your comments, I am really stuck on this one. The following quoted message is from NCBI:
              "Sequence quality is used in Sanger sequencing. For the high throughput sequencing
              methods, you need to consult the instrument manufacture or the sequencing center
              for how to convert the Q-score, since it is machine/method specific."

              I will work on your suggestions and if I sort it out I will update here.
              Thanks again

              • #8
                Perhaps email CLC tech support to see if there is a way to export an average value for a column of aligned nucleotides in the contigs.

                • #9
                  I believe the MiSeq has always used ASCII-33 quality scores; all data is ASCII-33 except (at this point) really old Illumina data from before the MiSeq was released. That means that for every ASCII character in the quality string, you subtract 33 to find the quality value. Reformat.sh will automatically detect the quality encoding if you don't specify it.
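
The "subtract 33" rule can be sketched in a couple of lines of Python. The function name `decode_phred33` is hypothetical, and the sketch assumes the string really is ASCII-33 encoded.

```python
def decode_phred33(quality_string):
    """Convert an ASCII-33 (Phred+33) quality string to integer quality values."""
    return [ord(ch) - 33 for ch in quality_string]


# '!' is ASCII 33, '5' is 53, '?' is 63, 'I' is 73.
decoded = decode_phred33("!5?I")
print(decoded)  # [0, 20, 30, 40]
```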

                  However, can you paste or link to the specific text of the requirement you are trying to fulfill? I was under the impression that submitting a fasta was acceptable. NCBI has various odd requirements that are often ignored; some of them are along the lines of "fewer than 1 error per X bases", which is of course impossible to determine. But if they require per-base quality values in the assembly, it certainly does not make sense to derive them by averaging mapped values - in that case, an area supported by 100 Q30 reads would have lower quality than an area supported by only 1 read that was Q40, which is silly. So could you please clarify what is being requested?
                  Last edited by Brian Bushnell; 12-22-2014, 05:55 PM.
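
The objection to averaging can be checked with a small Python sketch. It uses the standard Phred relation, error probability = 10^(-Q/10); the column contents mirror the example in the post (100 reads at Q30 versus a single read at Q40), and the function and variable names are hypothetical.

```python
def phred_to_error_prob(q):
    """Standard Phred relation: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


deep_column = [30] * 100   # position covered by 100 reads, each Q30
shallow_column = [40]      # position covered by a single Q40 read

mean_deep = sum(deep_column) / len(deep_column)
mean_shallow = sum(shallow_column) / len(shallow_column)

print(f"per-read error at Q30: {phred_to_error_prob(30):.4f}")
print(f"per-read error at Q40: {phred_to_error_prob(40):.4f}")
print(f"mean Q, 100x coverage: {mean_deep}")
print(f"mean Q, 1x coverage:   {mean_shallow}")
# The naive mean ranks the 1x position above the 100x position, even though
# the 100x position has far more supporting evidence.
```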

                  • #10
                    Brian, do you have a way to average those scores for a column of aligned nucleotides from constituent reads in a contig? Sergioo wants an average score for a particular position (column-wise) to submit to NCBI for a consensus sequence generated from aligned reads. I am not sure if that is the way to do it but that is the request.
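
To make the request concrete, here is a minimal Python sketch of column-wise averaging. It is only an illustration of the mechanics, not an existing tool: the function name `columnwise_mean_quality` is hypothetical, the reads are toy data, and it assumes ungapped alignments given as (0-based start position, Phred+33 quality string) pairs. Whether such an average is a meaningful consensus quality is a separate question, as discussed elsewhere in this thread.

```python
def columnwise_mean_quality(aligned_reads, contig_length, offset=33):
    """Average the quality values of all reads covering each contig position.

    aligned_reads: iterable of (start, quality_string) for ungapped alignments.
    Returns a list with one mean per position, or None where coverage is zero.
    """
    totals = [0] * contig_length
    depths = [0] * contig_length
    for start, quality_string in aligned_reads:
        for i, ch in enumerate(quality_string):
            pos = start + i
            if 0 <= pos < contig_length:
                totals[pos] += ord(ch) - offset
                depths[pos] += 1
    return [t / d if d else None for t, d in zip(totals, depths)]


# Toy alignments: 'I' = Q40, '?' = Q30, '5' = Q20.
reads = [(0, "IIII"), (2, "??55"), (4, "5I")]
means = columnwise_mean_quality(reads, contig_length=7)
print(means)  # [40.0, 40.0, 35.0, 35.0, 20.0, 30.0, None]
```

Real alignments contain insertions, deletions and clipping, so a usable version would need to walk the alignment records rather than assume each read lines up base for base.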

                    • #11
                      Sorry, I don't. I may write something like that in the near future, because I need to analyze how NextSeq data accuracy and Q-scores are influenced by genome content (to determine whether the error is random or not), but I would not recommend that anyone wait on it.

                      • #12
                        Originally posted by Brian Bushnell View Post

                        However, can you paste or link to the specific text of the requirement you are trying to fulfill? I was under the impression that submitting a fasta was acceptable. NCBI has various odd requirements that are often ignored; some of them are along the lines of "fewer than 1 error per X bases", which is of course impossible to determine. But if they require per-base quality values in the assembly, it certainly does not make sense to derive them by averaging mapped values - in that case, an area supported by 100 Q30 reads would have lower quality than an area supported by only 1 read that was Q40, which is silly. So could you please clarify what is being requested?
                        Please see this link http://www.ncbi.nlm.nih.gov/assembly...ubmission/#ex1
                        (Submitting a haploid assembly: submitting WGS contigs only)

                        The quality score file is not mandatory.
                        I saw that they strongly recommend that submitters produce one, and I wanted to do so. Maybe I should just submit fasta files only, to make things easier.

                        • #13
                          Originally posted by GenoMax View Post
                          Brian, do you have a way to average those scores for a column of aligned nucleotides from constituent reads in a contig? Sergioo wants an average score for a particular position (column-wise) to submit to NCBI for a consensus sequence generated from aligned reads. I am not sure if that is the way to do it but that is the request.
                          At least that is how I understood it.
                          I'm sorry if I am confusing you all.
                          Thanks

                          • #14
                            OK, that is enlightening, thanks for sharing it.

                            The fact that NCBI "highly recommends" a quality file with any fasta submission indicates that whoever wrote that clause probably does not know anything about assembly. When you download a genome, nobody cares whether there are quality values associated with it; I cannot imagine why NCBI would make that specific silly requirement, but as I mentioned, they do have a lot of other silly requirements, so it's not surprising.

                            I suggest that you submit without qualities. They are only valid in the context of an assembler that assigns quality scores to the assembly and outputs a fastq assembly, or in post-evaluating the assembly based on the raw reads, but that is not common. I believe that many organizations submit assemblies with faked set QVs (such as Q40 for all bases) to get past such barriers to advancement. This would not be good advice in an ideal world, but in the real world, you cannot evaluate the actual quality of a new assembly, so you can choose:

                            a) Do nothing.
                            b) Release your organism with no quality scores.
                            c) Release your organism with fake quality scores.
                            d) Wait until someone else releases the genome of the organism you are working on. Then use mapping to decide that you are under a 1/1000 error rate in concordance with the already-released genome. Thus, conclude you can release it, and do so, but since it has already been released, nobody will care.

                            I recommend b - that you release with no quality scores, because the only realistic way to release 99% of genomes is with b) no quality values, c) fake quality values, or d) wrong quality values. Earth will be much better off if people choose b.

                            • #15
                              Originally posted by Brian Bushnell View Post

                              a) Do nothing.
                              b) Release your organism with no quality scores.
                              c) Release your organism with fake quality scores.
                              d) Wait until someone else releases the genome of the organism you are working on. Then use mapping to decide that you are under a 1/1000 error rate in concordance with the already-released genome. Thus, conclude you can release it, and do so, but since it has already been released, nobody will care.

                              I recommend b - that you release with no quality scores, because the only realistic way to release 99% of genomes is with b) no quality values, c) fake quality values, or d) wrong quality values. Earth will be much better off if people choose b.
                              Thanks for your recommendations, I will go on with "b".
