**Brian Bushnell** · 12-22-2014, 10:56 AM

Hi Sergioo,

I have a tool which can split fastq files into fasta+qual:

reformat.sh in=assembly.fastq out=assembly.fasta qfout=assembly.qual qin=33
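For anyone curious what that split produces, here is a minimal Python sketch of the same kind of conversion. It is not how reformat.sh is implemented; the function name `fastq_to_fasta_qual` is hypothetical, and it assumes strict 4-line FASTQ records with ASCII-33 qualities.

```python
import io

def fastq_to_fasta_qual(fastq_text, offset=33):
    """Split 4-line FASTQ records into FASTA text and QUAL text.

    Assumption: every record is exactly 4 lines (no wrapped sequences).
    """
    lines = [ln.rstrip("\n") for ln in io.StringIO(fastq_text) if ln.strip()]
    fasta, qual = [], []
    for i in range(0, len(lines), 4):
        header, seq, _plus, qstr = lines[i:i + 4]
        name = header[1:]  # drop the leading '@'
        scores = [ord(c) - offset for c in qstr]  # ASCII code minus offset
        fasta.append(f">{name}\n{seq}")
        qual.append(f">{name}\n" + " ".join(str(q) for q in scores))
    return "\n".join(fasta) + "\n", "\n".join(qual) + "\n"

# Made-up record, only to show the output layout.
example = "@contig_1\nACGT\n+\nII5!\n"
fasta_text, qual_text = fastq_to_fasta_qual(example)
print(fasta_text, end="")
print(qual_text, end="")
```

The QUAL output holds one space-separated integer per base under the same header as the FASTA record.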

-Brian

**GenoMax** · 12-22-2014, 11:13 AM

@Brian: Sergioo wants an "average" Q-score value for each position/column from a set of aligned reads in a contig (at least that is my interpretation).

**Brian Bushnell** · 12-22-2014, 11:18 AM

Originally posted by GenoMax:

@Brian: Sergioo wants an "average" Q-score value for each position/column from a set of aligned reads in a contig (at least that is my interpretation).

Often, per-base quality is provided by the assembler; some assemblers generate a fastq assembly. I'm not really sure in this case, as I don't use CLC, but those quality values are way too high for an average of bases covering a location. Actually they are too high for any normal interpretation, but usually unrealistically high QVs like that come straight from an assembler.

Sergioo, is your current assembly in fastq or fasta?

**GenoMax** · 12-22-2014, 11:59 AM

@Brian: The values in the example above are made up (based on a PM). Sergioo wanted to show the format but chose values that are not on the Phred scale.

**Brian Bushnell** · 12-22-2014, 01:44 PM

Ah - OK, I don't know of anything that specifically accepts an assembly and reads, and outputs average quality scores by location. There is a program, ALE, that will generate per-base accuracy estimates based on the read mapping, though not in that format.

**Sergioo** · 12-22-2014, 05:30 PM

Thanks all for your comments, I am really stuck on this one. The following quoted message is from NCBI:

"Sequence quality is used in Sanger sequencing. For the high throughput sequencing methods, you need to consult the instrument manufacture or the sequencing center for how to convert the Q-score, since it is machine/method specific".

I will work on your suggestions and if I sort it out I will update here.
Thanks again

**GenoMax** · 12-22-2014, 05:44 PM

You could email CLC tech support to ask whether there is a way to export an average value for a column of aligned nucleotides in the contigs.

**Brian Bushnell** · 12-22-2014, 05:50 PM

I believe the MiSeq has always used ASCII-33 quality scores; at this point all data is ASCII-33 except really old Illumina data, which predates the release of the MiSeq. That means for every ASCII character in the quality string, you subtract 33 to find the quality value. reformat.sh will automatically detect the quality encoding if you don't specify it.
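As a small illustration of the subtract-33 rule, here is a Python sketch. The function name `decode_phred` is hypothetical, and the offset of 64 used in the second call is my assumption about the older Illumina encoding mentioned above (Phred+64).

```python
def decode_phred(quality_string, offset=33):
    """Turn a FASTQ quality string into integer quality values."""
    return [ord(ch) - offset for ch in quality_string]

# ASCII-33: 'I' is code 73, '5' is code 53, '!' is code 33.
modern = decode_phred("II5!")

# Assumed older Illumina encoding with offset 64: 'h' is code 104.
legacy = decode_phred("hh", offset=64)

print(modern)
print(legacy)
```

Decoding with the wrong offset shifts every value by 31, which is one way to end up with scores that look impossibly high or go negative.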

However, can you paste or link to the specific text of the requirement you are trying to fulfill? I was under the impression that submitting a fasta was acceptable. NCBI has various odd requirements that are often ignored; some of them are along the lines of "fewer than 1 error per X bases", which is of course impossible to determine. But if they require per-base quality values in the assembly, it certainly does not make sense to derive them by averaging mapped values - in that case, an area supported by 100 Q30 reads would have lower quality than an area supported by only 1 read that was Q40, which is silly. So could you please clarify what is being requested?
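To make the averaging problem concrete, here is a minimal Python sketch. The function `mean_quality_per_column` and the `pileup` layout (position mapped to the Phred scores of the reads covering it) are hypothetical, chosen only to reproduce the 100 x Q30 versus 1 x Q40 case; this is not a recommended way to score an assembly.

```python
def mean_quality_per_column(pileup):
    """Average the Phred scores of all reads covering each position.

    pileup: dict mapping contig position -> list of Phred scores.
    """
    return {pos: sum(scores) / len(scores) for pos, scores in pileup.items()}

# Position 0 is covered by 100 reads at Q30, position 1 by a single Q40 read.
pileup = {0: [30] * 100, 1: [40]}
averaged = mean_quality_per_column(pileup)

for pos, q in sorted(averaged.items()):
    print(pos, len(pileup[pos]), q)
```

The plain mean ignores depth entirely, so the position backed by a single read ends up with the higher score.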

**GenoMax** · 12-22-2014, 06:00 PM

Brian, do you have a way to average those scores for a column of aligned nucleotides from constituent reads in a contig? Sergioo wants an average score for a particular position (column-wise) to submit to NCBI for a consensus sequence generated from aligned reads. I am not sure if that is the way to do it but that is the request.

**Brian Bushnell** · 12-22-2014, 06:05 PM

Sorry, I don't. I may write something like that in the near future, because I need to analyze how NextSeq data accuracy and Q-scores are influenced by genome content (to determine whether the error is random or not), but I would not recommend that anyone wait on it.

**Sergioo** · 12-22-2014, 06:56 PM

Originally posted by Brian Bushnell:

However, can you paste or link to the specific text of the requirement you are trying to fulfill? I was under the impression that submitting a fasta was acceptable. NCBI has various odd requirements that are often ignored; some of them are along the lines of "fewer than 1 error per X bases", which is of course impossible to determine. But if they require per-base quality values in the assembly, it certainly does not make sense to derive them by averaging mapped values - in that case, an area supported by 100 Q30 reads would have lower quality than an area supported by only 1 read that was Q40, which is silly. So could you please clarify what is being requested?

Please see this link http://www.ncbi.nlm.nih.gov/assembly...ubmission/#ex1
(Submitting a haploid assembly: submitting WGS contigs only)

The quality score file is not mandatory.
I saw that they strongly recommend that submitters produce one, and I wanted to do so. Maybe I should just go ahead and submit fasta files only, to make things easier.

**Sergioo** · 12-22-2014, 06:58 PM

Originally posted by GenoMax:

Brian, do you have a way to average those scores for a column of aligned nucleotides from constituent reads in a contig? Sergioo wants an average score for a particular position (column-wise) to submit to NCBI for a consensus sequence generated from aligned reads. I am not sure if that is the way to do it but that is the request.

At least that is how I understood it.

I'm sorry if I am confusing you all
Thanks

**Brian Bushnell** · 12-22-2014, 07:21 PM

OK, that is enlightening, thanks for sharing it.

The fact that NCBI "highly recommends" a quality file with any fasta submission indicates that whoever wrote that clause probably does not know anything about assembly. When you download a genome, nobody cares whether there are quality values associated with it; I cannot imagine why NCBI would make that specific silly requirement, but as I mentioned, they do have a lot of other silly requirements, so it's not surprising.

I suggest that you submit without qualities. They are only valid in the context of an assembler that assigns quality scores to the assembly and outputs a fastq assembly, or when the assembly is evaluated afterwards against the raw reads, but that is not common. I believe that many organizations submit assemblies with fake, fixed QVs (such as Q40 for all bases) to get past such barriers to advancement. This would not be good advice in an ideal world, but in the real world, you cannot evaluate the actual quality of a new assembly, so you can choose:

a) Do nothing.
b) Release your organism with no quality scores.
c) Release your organism with fake quality scores.
d) Wait until someone else releases the genome of the organism you are working on. Then use mapping to decide that you are under a 1/1000 error rate in concordance with the already-released genome. Thus, conclude you can release it, and do so, but since it has already been released, nobody will care.

I recommend b - that you release with no quality scores, because the only realistic way to release 99% of genomes is with b) no quality values, c) fake quality values, or d) wrong quality values. Earth will be much better off if people choose b.

**Sergioo** · 12-22-2014, 08:06 PM

Originally posted by Brian Bushnell:

a) Do nothing.
b) Release your organism with no quality scores.
c) Release your organism with fake quality scores.
d) Wait until someone else releases the genome of the organism you are working on. Then use mapping to decide that you are under a 1/1000 error rate in concordance with the already-released genome. Thus, conclude you can release it, and do so, but since it has already been released, nobody will care.

I recommend b - that you release with no quality scores, because the only realistic way to release 99% of genomes is with b) no quality values, c) fake quality values, or d) wrong quality values. Earth will be much better off if people choose b.

Thanks for your recommendations, I will go on with "b".

Generate a base level quality score file

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News