Seqanswers Leaderboard Ad

**David_H** · 04-29-2015, 07:47 AM

Hi Brian,

This looks great ... it might be very useful for me.

A couple of questions/comments,

First, I am, for my sins, trying to do de novo assembly with data generated by the NextSeq V1 chemistry. I have initial (fairly poor) assemblies made using SPAdes after adapter trimming, overlap of PE reads, quality trimming and error correction (all done with your tools). Is it a good idea to use such an assembly to build the recalibration matrices for the raw reads (and then go back and start the whole assembly process again)?

As an alternative, I was wondering whether the error/QS profiles for the V1 chemistry are sufficiently uniform that a general model could be used (for example the matrices that you made on your data). I suspect not as my initial QS by position distros were quite different from yours (less good in general and with a "bell shaped curve" where the median QS was highest around position 50), but I'd be interested to hear your opinion.

Anyway, I'll go and play with this new toy and let you know how it goes.

best

David

**Brian Bushnell** · 04-29-2015, 09:24 AM

Hi David,

The general plan is to be able to do one of two things:

1) If in the same run you ran PhiX or some other haploid with a good reference, use that for calibration of all the data.
2) If all your data is from a new organism that you are assembling denovo, assemble it, map to the assembly, use that sam file to calibrate, then reassemble the calibrated data.

So, yes, I would suggest trying option 2 in this case, if you did not spike in at least a few million PhiX reads. It would be nice to have a generic set of matrices that could recalibrate all data from a given platform, but I don't know if that's wise. I suspect that even run-to-run variation on the same machine with the same reagent batches would make it optimal to always calibrate on data from the same run; I've seen substantially different quality accuracy profiles between runs on our NextSeq.

I've been spending time perfecting the recalibration and have not yet had a chance to do any testing on how well it improves assemblies or really the effects on anything (other than noting that it slightly increases the BBMerge merge rate), so I would be interested to hear if it improves your assemblies. I do plan to do some extensive testing soon, though, as we have several hundred single cells done on NextSeq V1 chemistry waiting to be assembled, and I want to develop an optimal methodology first.

We have noted that assembling NextSeq V1 amplified single cells with SPAdes, quality-trimming to Q15 then normalizing seem to greatly improve the assembly (compared to no quality trim and no normalization), but that's taken from a sample size of 1.

-Brian

**David_H** · 04-29-2015, 09:38 AM

Hi Brian,

Unfortunately no PhiX spike in (or at least not available to me, this is a project where we were roped in AFTER experimental design and execution). I'll try correcting QS on the original assembly, I'm just concerned what the impact of really systematic errors might be (but I don't know exactly how the recalibration works ... is it "ACGT specific"?). I agree with you on the general model, but we have 40 or so small fungal genomes done by the same provider, so maybe there is enough data to do some comparisons later, for now i have got to make the best use of the existing data and advise whether to repeat with V2 (obviously desirable, but we are talking about italian funding here!). In any case, this particular problem should be relatively transient if the V2 chem really represents a significant improvement.

Thanks again and I'll get back to you in the (relatively) near future.

D

**Brian Bushnell** · 04-29-2015, 10:01 AM

Hi David,

You can correct in various ways. There are multiple calibration matrices generated; by default, 2 are used:

qp: Tracks match and mismatch rates by quality score and position in read (tuples like (37,120).
qb012: Tracks match and mismatch rates by quality score, current base, and the preceding 2 bases. (tuples like {37,A,G,A}).

In the current version (and I believe in that version you are using) you can enable/disable matrices with BBDuk using flags like "loadqb012=f loadqp=f loadq102=t".

That would disable the quality/position matrix and quality/bases matrix, and enable the matrix that calibrates using the quality score and the trailing and leading quality scores, which would make it completely sequence-independent. With extreme-GC organisms, it may be better to disable the qb012 matrix; I just need to do more testing. I'll post a description of how the recalibration works later. The results of the recalibration vary depending on which matrices are used.

**Brian Bushnell** · 05-01-2015, 04:34 PM

CalcTrueQuality now has a new version that should be substantially more accurate. It tracks read 1 and read 2 independently, is able to take into account indel-type errors, and can internally run 2-pass recalibration (though I still think that's generally overkill except for producing pretty graphs). Also, sam/bam files can now be recalibrated (before they were needed to make the matrices, but only fastq files could be recalibrated). The ability to handle indels may seem unimportant for Illumina, but with very low quality reads, sometimes substitution errors will be called as indels instead if it gives a better alignment score. So, ignoring indels and reads with indels in them can lead to inflated scores. But also, it can now be used with 454 and IonTorrent data. New results, for the same dataset:

The weighted average deviation from the correct quality score is now 0.25, down from 4.15 in the raw data. And even though this is NextSeq V1 data, the largest bin is for Q41.

Attached Files

qahist_3492.png (23.4 KB, 446 views)

**micheal_your** · 11-09-2015, 07:03 AM

Hi Brian,
This is Michael You. I am recently processing pacbio CCS reads. I find it is different from Illumina reads. I want to get the average quality and average coverage of each CCS reads. Does the BBtools can be help to achieve my aims， or could you tell me some other ways to do that?
Thank you very much!

Best,
Michael You

**GenoMax** · 11-09-2015, 07:19 AM

@Michael: Average Q scores does not make logical sense for NGS reads. What do you mean by average coverage of each CCS reads? Are you analyzing this data using SMRTportal or you just have the sequence files?

**micheal_your** · 11-09-2015, 01:26 PM

Hi GenoMax,

Thank you timely reply! The average coverage that I mean is the pass times of each CCS reads. Due to the higher error tendency than Hi-seq, I want to obtain the CCS reads with promise by find parameters to filter the CCS reads at first. I have the raw data and the CCS file and I don't use SMRTportal. Could you give me some advice?

Best,
Michael You

**Brian Bushnell** · 11-09-2015, 01:51 PM

Michael,

Unless it's somewhere in the header of the reads, there's no way to calculate how many passes there were for each CCS read - you'd have to look in the H5 file or else use SMRTportal.

You can, however, calculate the error rates of the CCS reads, as claimed by the qualities:

reformat.sh in=ccs.fastq qhist=qhist.txt aqhist=aqhist.txt

...or filter to remove reads above a certain error rate:

reformat.sh in=ccs.fastq out=filtered.fq maq=20

That will remove anything with an expected average error rate over 1% (phred-scaled Q20).

You cannot use CalcTrueQuality on PacBio reads, but if you have a reference, you can still map them with BBMap to calculate the actual error rates using mhist and qhist:

mapPacBio.sh in=ccs.fq ref=ref.fa mhist=mhist.txt qhist=qhist.txt maxlen=2000

**micheal_your** · 11-09-2015, 02:45 PM

Hi Brain,

Thank you for your kindly help! I have run the reformat.sh bash, and the obtained result is a statistic file (aqhist). Can I get the qualities of each CCS reads as follwing example?

>m150911_033912_42177R_c100856912550000001823190901241657_s1_p0/8/ccs avg score: 41.978088
>m150911_033912_42177R_c100856912550000001823190901241657_s1_p0/15/ccs avg score: 29.492865
>m150911_033912_42177R_c100856912550000001823190901241657_s1_p0/16/ccs avg score: 41.947678

By the way, as for the pass times of each CCS read, can I estimate them by using the length of subreads/length of CCS reads?

Thank you very much!

Best,
Michael You

**Brian Bushnell** · 11-09-2015, 03:11 PM

Hi Michael,

In answer to your first question, no. Reformat can filter based on average quality, but it will not append the quality to the read name.
As for the second question, the number of passes is the length of all of the subreads divided by the length of the consensus, so if you have both of those, then yes.

**N311V** · 11-09-2015, 03:29 PM

Is this only a problem for the V1 kit?

**Brian Bushnell** · 11-09-2015, 03:33 PM

V2 has better quality than V1, and the scores are more accurate (though not perfect), but they are still quantized, unfortunately. They do not allow you to turn that off.

That said, I have not yet had a chance to test any V2 reads that we produced since their latest software upgrade. We went back to V1 and stayed with it until very recently because the software for calling barcodes was broken in V2 for a long time, and thus demultiplexing did not work correctly with V2. Supposedly it does now. At any rate, my statement is based on older data with the original version of their V2 kit software.

P.S. Regarding this thread:

My testing of the HiSeq4000 showed very bad performance at 2x150bp. I would not recommend using it above 2x100, at least with the current chemistry.

**N311V** · 11-10-2015, 04:20 PM

Hi Brian,

Thanks for the reply and heads up regarding the HiSeq4000.

I have some V2 RNA-seq reads I'd like to run through CalcTrueQuality but they've been mapped with TopHat. Can I remap them with bbmap to get cigar strings with "=" and "X" symbols for match and mismatch or is there a faster/better alternative? Does anyone know if STAR produces the needed cigars?

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Yesterday, 11:49 AM	0 responses 15 views 0 likes	Last Post by seqadmin Yesterday, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Introducing CalcTrueQuality, a quality-score recalibrator

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News