SEQanswers

Go Back   SEQanswers > Sequencing Technologies/Companies > Pacific Biosciences



Similar Threads
Thread Thread Starter Forum Replies Last Post
Quiver Gators Pacific Biosciences 3 08-15-2016 09:35 AM
SAM header tags with lowercase letters wolma Bioinformatics 1 07-28-2014 01:16 AM
Help needed to polish pacbiotoca assembly with Quiver cwzkevin Bioinformatics 4 01-13-2014 05:24 PM
Need help in running quiver julio514 Pacific Biosciences 1 11-08-2013 04:32 PM
lowercase and n in consensus sequences by mpileup ymwur Genomic Resequencing 0 05-27-2013 07:58 PM

Reply
 
Thread Tools
Old 08-04-2014, 12:22 PM   #1
jbadalam
Member
 
Location: Saint Paul, MN

Join Date: Jan 2014
Posts: 10
Default lowercase output from quiver

I recently completed resequencing of a bacterial strain with a complete reference genome. The HGAP3 assembly resulted in a single contig, but there were several areas with lowercase bases (~500 bp each). SMRTview showed that I had uniform coverage of unambiguously mapped subreads spanning these regions. No big deal, I thought--I'll just convert them to all caps and run another pass through quiver, but the same regions came back lowercase, suggesting quiver didn't touch them at all.

What's interesting is that if I BLAST one lowercase region with ~200bp flanking, I hit 960/961 bases from the published reference strain (with the single mismatch being a potential deletion in a homopolymer stretch of 8 C's). Most (but not all) of the other lowercase regions seem to have one deletion at a homopolymer with respect to the published reference. Could homopolymers alone be throwing off the read mapping and somehow tagging that local region as "suspect"? Either way I don't think these lowercase regions are worrisome, but I'm still confused as to how they arise and how quiver behaves with respect to them.

My questions are:
1) what causes quiver to produce a consensus with areas of lowercase bases, even if subreads unambiguously map to these regions without dips or spikes in coverage? Did these areas not get polished at all--and if so, how can I determine consensus accuracy for the genome if indels might still be present in lowercase regions?
2) do lowercase bases in the consensus.fasta output factor into the calculation of consensus accuracy?
3) does quiver ignore lowercase bases in a reference that is uploaded for the RS_Resequencing pipeline--meaning should I make sure all bases are uppercase before uploading a reference to SMRTportal?
jbadalam is offline   Reply With Quote
Old 08-05-2014, 11:12 AM   #2
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 318
Default

Homopolymers errors such as this should not effect mapping, they are present because quiver has not had a chance to correct them, they are the most common error in the consensus called during assembly.
Quote:
1) what causes quiver to produce a consensus with areas of lowercase bases, even if subreads unambiguously map to these regions without dips or spikes in coverage? Did these areas not get polished at all--and if so, how can I determine consensus accuracy for the genome if indels might still be present in lowercase regions?
By default a lowercase base indicates no consensus call has been made and the base is just the reference sequence, this occurs when a 500bp region does not have any unambiguously mapped spanning reads. I'm don't know how this would happen if there is coverage and high mapqv. One possibility is that the mapping all terminate at a single point, i.e. the reads have no evidence for the reference to be joined at that position, the coverage and mapQV could look good, but the alignment should show the blunt join. Is the lowercase region in the middle of the sequence? before circularization the end of the sequence will be lowercase due to the overlap and ambiguous mapping.
If the lowercase is due to ambiguous mapping, most likely, then the option to ignore ambiguous mapped reads can be turned off in the protocol settings.
Quote:
2) do lowercase bases in the consensus.fasta output factor into the calculation of consensus accuracy?
The consensus (concordance) accuracy is calculated between the reference and the consensus, but I'm not sure if lowercase bases in the consensus sequence are ignored, I'll have to check.
Quote:
3) does quiver ignore lowercase bases in a reference that is uploaded for the RS_Resequencing pipeline--meaning should I make sure all bases are uppercase before uploading a reference to SMRTportal?
No lowercase bases are not treated any differently when uploaded as a reference.

Last edited by rhall; 08-05-2014 at 11:14 AM. Reason: typo
rhall is offline   Reply With Quote
Old 08-05-2014, 01:37 PM   #3
jbadalam
Member
 
Location: Saint Paul, MN

Join Date: Jan 2014
Posts: 10
Default

The lowercase regions all occur within the sequence. One HGAP run produced a single contig with self-overlapping ends which I removed before uploading to SMRT Portal as a reference for running subsequent passes through quiver. I iterated a few times to get the consensus QV >50 and after each pass the same regions came back lowercase in the consensus, even if I changed to uppercase before uploading the reference.

The lowercase regions occur throughout the assembly, probably 10-15 or so per megabase. All of them are spanned by unambiguously mapped reads, most of which are significantly longer than the lowercase regions themselves, so I highly doubt these are misassemblies, especially since BLASTing them with flanking uppercase sequence hits the published reference genome across the entire query.

Is it possible that the chemistry information provided in the raw data is different from what was actually used during the sequencing run? I know that quiver behaves differently with different chemistries.
jbadalam is offline   Reply With Quote
Old 08-05-2014, 01:48 PM   #4
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 318
Default

The lowercase is independent of chemistry, and will only occur if the region does not have highQV spanning reads. Try running with allow ambiguous mapped reads to see if the lowercase bases disappear. Are the lowercase regions always 500bp?
rhall is offline   Reply With Quote
Old 08-05-2014, 02:29 PM   #5
jbadalam
Member
 
Location: Saint Paul, MN

Join Date: Jan 2014
Posts: 10
Default

Thanks, I'll try allowing ambiguously mapped reads and post back.

I didn't count the lengths of every occurrence, only the first 10 or so, and yes, they are all exactly 500 bp.
jbadalam is offline   Reply With Quote
Old 08-12-2014, 02:51 PM   #6
jbadalam
Member
 
Location: Saint Paul, MN

Join Date: Jan 2014
Posts: 10
Default

I've looked into this further, and what appears to be happening is that quiver is calling variants (quite a few of them on the first pass, as expected), but not capturing all of these variants in the output consensus.fasta file. For example, if quiver calls 297 variants (284 insertions, 13 deletions), I would assume the total length of the consensus sequence should increase by 271 bp relative to the reference - is this correct? On another pass I actually noticed the consensus get smaller despite ~50 net insertions being called. This has happened for multiple genomes and so I don't think that one particular dataset is causing this odd behavior.

Subsequent quiver passes (taking the consensus from one as the input to the next) show a similar trend in that the consensus sequence length does not change as predicted by the variants called. And while the 500-bp lowercase regions keep showing up, their position and number appear to change randomly. Some are 1000 bp and in total they comprise about 1% of the genome.

I'm wondering what might be causing quiver to defer to the reference for consensus calling at these regions given that their position along the genome seems to be arbitrary with each pass. I haven't checked each lowercase region but for the ones I have I see plenty of coverage in spanning reads and no breakpoints at the lowercase junction.

For one genome, essentially all of the variants that quiver called appear to be correct (backed up by short reads), but in another, it seems to be missing 2/3 variants (if the short read data is to be trusted)--and only some, but not all, of the variants that do get called are showing up in the consensus.
jbadalam is offline   Reply With Quote
Old 08-12-2014, 04:46 PM   #7
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 318
Default

Variants can be multiple bases.
The 500 multiple lowercase regions are due to low mapQV reads, probably due to repetitive elements. These change in multiple rounds because you are not converging on the correct consensus.
rhall is offline   Reply With Quote
Old 08-13-2014, 06:59 PM   #8
jbadalam
Member
 
Location: Saint Paul, MN

Join Date: Jan 2014
Posts: 10
Default

Unfortunately that doesn't really explain everything I'm seeing. On the first quiver pass, there were 365 total variants, of which 17 were single base deletions, and 346 were single base insertions. The other two variants were multiple base insertions totaling 5 bp. The reference length was 3819884 bp, and the quiver consensus was 3820109 bp. So while the consensus did expand by 225 bp, shouldn't it have been 334 bp longer? Maybe I'm missing something?

The 500-bp lowercase regions are showing up where I have >130x spanning coverage of high map QV reads. Some might have 1 or 2 low map QV reads but this is also true for other parts of the genome that aren't lowercase.

The positions of repeat sequences (>500 bp, >85% identity, determined with nucmer) have no correlation with the location of lowercase regions. The repeats are also spanned by high map QV reads.

If I re-run quiver (taking the consensus as the new reference), the 500-bp lowercase regions move to other non-repetitive regions of the genome that have similar coverage of spanning high map QV reads. At this point I'm lost as to whether I might benefit from having more data for polishing or if this is a bug.

Last edited by jbadalam; 08-14-2014 at 07:14 AM. Reason: typo
jbadalam is offline   Reply With Quote
Old 08-13-2014, 09:07 PM   #9
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 318
Default

To double check, this is within SMRT Portal? What version are you using? If you don't mind sharing your data, I'd gladly take a look to see if we can get to the bottom of this strange behavior. I'll send you a message with my email.
rhall is offline   Reply With Quote
Old 08-14-2014, 07:13 AM   #10
jbadalam
Member
 
Location: Saint Paul, MN

Join Date: Jan 2014
Posts: 10
Default

Thanks! Will send my data over to you.

Yes, this is within SMRT Portal. Since I work only with microbial genomes we are running it locally on one standalone machine (8 cores, 32 Gb RAM). I am using:
SMRTAnalysis version v2.2.0.p3 build 137015
Daemon version v2.2.0 build 132105
SMRTpipe version v2.2.0 build 132739
SMRT Portal version v2.2.0 build 133335
SMRT View version v2.2.0 build 132578
jbadalam is offline   Reply With Quote
Old 08-25-2014, 05:41 PM   #11
rhall
Senior Member
 
Location: San Francisco

Join Date: Aug 2012
Posts: 318
Default

Keeping the thread up to date.
The problems with the consensus resulted from quiver not converging due to low quality data, likely resulting from the overloading of sample on SMRT Cells, http://www.pacb.com/samplenet/SMRTCe...mendations.pdf. Updates have been made to the quiver code that result in better performance on data such as this, https://github.com/PacificBioscience...ster/CHANGELOG.
rhall is offline   Reply With Quote
Old 03-21-2016, 07:02 AM   #12
Marcela Uliano
Member
 
Location: Berlin, Germany

Join Date: Apr 2012
Posts: 18
Default

Hey guys,

I have a related question, although it's different case, maybe you will be able to give me some inputs.
So, I have a hybrid assembly (illumina + pacbio subreads - DBG2OCL) for an eukaryotic genome, and now I'm in doubt if I should quiver this assembly or not.
Once, as I read, quiver calls consensus based on the pacbio quality of the aligned PacBio, what it would do to the regions that are covered only by my Illumina contigs?
Does it make sense to run quiver in a hybrid assembly at all?

Thanks a lot for the help! =)
Marcela Uliano is offline   Reply With Quote
Old 03-21-2016, 09:40 AM   #13
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,800
Default

Quote:
Originally Posted by Marcela Uliano View Post
Hey guys,

I have a related question, although it's different case, maybe you will be able to give me some inputs.
So, I have a hybrid assembly (illumina + pacbio subreads - DBG2OCL) for an eukaryotic genome, and now I'm in doubt if I should quiver this assembly or not.
Once, as I read, quiver calls consensus based on the pacbio quality of the aligned PacBio, what it would do to the regions that are covered only by my Illumina contigs?
Does it make sense to run quiver in a hybrid assembly at all?

Thanks a lot for the help! =)
For reference cross-posted: https://www.biostars.org/p/182647/
GenoMax is offline   Reply With Quote
Old 04-16-2017, 06:01 AM   #14
feixue1039
Member
 
Location: China

Join Date: Mar 2011
Posts: 18
Default

I find that after Quiver, sequence length of each contig increases ~0.12%. If you are looking at the the mapping result with IGV and load the reference sequence (before polish analysis) and pbalign output bam file as input, you may need to take the coordinate shift into account.
feixue1039 is offline   Reply With Quote
Reply

Tags
hgap, homopolymer, lowercase, pacbio, quiver

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:52 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO