Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • lowercase output from quiver

    I recently completed resequencing of a bacterial strain with a complete reference genome. The HGAP3 assembly resulted in a single contig, but there were several areas with lowercase bases (~500 bp each). SMRTview showed that I had uniform coverage of unambiguously mapped subreads spanning these regions. No big deal, I thought--I'll just convert them to all caps and run another pass through quiver, but the same regions came back lowercase, suggesting quiver didn't touch them at all.

    What's interesting is that if I BLAST one lowercase region with ~200bp flanking, I hit 960/961 bases from the published reference strain (with the single mismatch being a potential deletion in a homopolymer stretch of 8 C's). Most (but not all) of the other lowercase regions seem to have one deletion at a homopolymer with respect to the published reference. Could homopolymers alone be throwing off the read mapping and somehow tagging that local region as "suspect"? Either way I don't think these lowercase regions are worrisome, but I'm still confused as to how they arise and how quiver behaves with respect to them.

    My questions are:
    1) what causes quiver to produce a consensus with areas of lowercase bases, even if subreads unambiguously map to these regions without dips or spikes in coverage? Did these areas not get polished at all--and if so, how can I determine consensus accuracy for the genome if indels might still be present in lowercase regions?
    2) do lowercase bases in the consensus.fasta output factor into the calculation of consensus accuracy?
    3) does quiver ignore lowercase bases in a reference that is uploaded for the RS_Resequencing pipeline--meaning should I make sure all bases are uppercase before uploading a reference to SMRTportal?

  • #2
    Homopolymers errors such as this should not effect mapping, they are present because quiver has not had a chance to correct them, they are the most common error in the consensus called during assembly.
    1) what causes quiver to produce a consensus with areas of lowercase bases, even if subreads unambiguously map to these regions without dips or spikes in coverage? Did these areas not get polished at all--and if so, how can I determine consensus accuracy for the genome if indels might still be present in lowercase regions?
    By default a lowercase base indicates no consensus call has been made and the base is just the reference sequence, this occurs when a 500bp region does not have any unambiguously mapped spanning reads. I'm don't know how this would happen if there is coverage and high mapqv. One possibility is that the mapping all terminate at a single point, i.e. the reads have no evidence for the reference to be joined at that position, the coverage and mapQV could look good, but the alignment should show the blunt join. Is the lowercase region in the middle of the sequence? before circularization the end of the sequence will be lowercase due to the overlap and ambiguous mapping.
    If the lowercase is due to ambiguous mapping, most likely, then the option to ignore ambiguous mapped reads can be turned off in the protocol settings.
    2) do lowercase bases in the consensus.fasta output factor into the calculation of consensus accuracy?
    The consensus (concordance) accuracy is calculated between the reference and the consensus, but I'm not sure if lowercase bases in the consensus sequence are ignored, I'll have to check.
    3) does quiver ignore lowercase bases in a reference that is uploaded for the RS_Resequencing pipeline--meaning should I make sure all bases are uppercase before uploading a reference to SMRTportal?
    No lowercase bases are not treated any differently when uploaded as a reference.
    Last edited by rhall; 08-05-2014, 10:14 AM. Reason: typo

    Comment


    • #3
      The lowercase regions all occur within the sequence. One HGAP run produced a single contig with self-overlapping ends which I removed before uploading to SMRT Portal as a reference for running subsequent passes through quiver. I iterated a few times to get the consensus QV >50 and after each pass the same regions came back lowercase in the consensus, even if I changed to uppercase before uploading the reference.

      The lowercase regions occur throughout the assembly, probably 10-15 or so per megabase. All of them are spanned by unambiguously mapped reads, most of which are significantly longer than the lowercase regions themselves, so I highly doubt these are misassemblies, especially since BLASTing them with flanking uppercase sequence hits the published reference genome across the entire query.

      Is it possible that the chemistry information provided in the raw data is different from what was actually used during the sequencing run? I know that quiver behaves differently with different chemistries.

      Comment


      • #4
        The lowercase is independent of chemistry, and will only occur if the region does not have highQV spanning reads. Try running with allow ambiguous mapped reads to see if the lowercase bases disappear. Are the lowercase regions always 500bp?

        Comment


        • #5
          Thanks, I'll try allowing ambiguously mapped reads and post back.

          I didn't count the lengths of every occurrence, only the first 10 or so, and yes, they are all exactly 500 bp.

          Comment


          • #6
            I've looked into this further, and what appears to be happening is that quiver is calling variants (quite a few of them on the first pass, as expected), but not capturing all of these variants in the output consensus.fasta file. For example, if quiver calls 297 variants (284 insertions, 13 deletions), I would assume the total length of the consensus sequence should increase by 271 bp relative to the reference - is this correct? On another pass I actually noticed the consensus get smaller despite ~50 net insertions being called. This has happened for multiple genomes and so I don't think that one particular dataset is causing this odd behavior.

            Subsequent quiver passes (taking the consensus from one as the input to the next) show a similar trend in that the consensus sequence length does not change as predicted by the variants called. And while the 500-bp lowercase regions keep showing up, their position and number appear to change randomly. Some are 1000 bp and in total they comprise about 1% of the genome.

            I'm wondering what might be causing quiver to defer to the reference for consensus calling at these regions given that their position along the genome seems to be arbitrary with each pass. I haven't checked each lowercase region but for the ones I have I see plenty of coverage in spanning reads and no breakpoints at the lowercase junction.

            For one genome, essentially all of the variants that quiver called appear to be correct (backed up by short reads), but in another, it seems to be missing 2/3 variants (if the short read data is to be trusted)--and only some, but not all, of the variants that do get called are showing up in the consensus.

            Comment


            • #7
              Variants can be multiple bases.
              The 500 multiple lowercase regions are due to low mapQV reads, probably due to repetitive elements. These change in multiple rounds because you are not converging on the correct consensus.

              Comment


              • #8
                Unfortunately that doesn't really explain everything I'm seeing. On the first quiver pass, there were 365 total variants, of which 17 were single base deletions, and 346 were single base insertions. The other two variants were multiple base insertions totaling 5 bp. The reference length was 3819884 bp, and the quiver consensus was 3820109 bp. So while the consensus did expand by 225 bp, shouldn't it have been 334 bp longer? Maybe I'm missing something?

                The 500-bp lowercase regions are showing up where I have >130x spanning coverage of high map QV reads. Some might have 1 or 2 low map QV reads but this is also true for other parts of the genome that aren't lowercase.

                The positions of repeat sequences (>500 bp, >85% identity, determined with nucmer) have no correlation with the location of lowercase regions. The repeats are also spanned by high map QV reads.

                If I re-run quiver (taking the consensus as the new reference), the 500-bp lowercase regions move to other non-repetitive regions of the genome that have similar coverage of spanning high map QV reads. At this point I'm lost as to whether I might benefit from having more data for polishing or if this is a bug.
                Last edited by jbadalam; 08-14-2014, 06:14 AM. Reason: typo

                Comment


                • #9
                  To double check, this is within SMRT Portal? What version are you using? If you don't mind sharing your data, I'd gladly take a look to see if we can get to the bottom of this strange behavior. I'll send you a message with my email.

                  Comment


                  • #10
                    Thanks! Will send my data over to you.

                    Yes, this is within SMRT Portal. Since I work only with microbial genomes we are running it locally on one standalone machine (8 cores, 32 Gb RAM). I am using:
                    SMRTAnalysis version v2.2.0.p3 build 137015
                    Daemon version v2.2.0 build 132105
                    SMRTpipe version v2.2.0 build 132739
                    SMRT Portal version v2.2.0 build 133335
                    SMRT View version v2.2.0 build 132578

                    Comment


                    • #11
                      Keeping the thread up to date.
                      The problems with the consensus resulted from quiver not converging due to low quality data, likely resulting from the overloading of sample on SMRT Cells, http://www.pacb.com/samplenet/SMRTCe...mendations.pdf. Updates have been made to the quiver code that result in better performance on data such as this, https://github.com/PacificBioscience...ster/CHANGELOG.

                      Comment


                      • #12
                        Hey guys,

                        I have a related question, although it's different case, maybe you will be able to give me some inputs.
                        So, I have a hybrid assembly (illumina + pacbio subreads - DBG2OCL) for an eukaryotic genome, and now I'm in doubt if I should quiver this assembly or not.
                        Once, as I read, quiver calls consensus based on the pacbio quality of the aligned PacBio, what it would do to the regions that are covered only by my Illumina contigs?
                        Does it make sense to run quiver in a hybrid assembly at all?

                        Thanks a lot for the help! =)

                        Comment


                        • #13
                          Originally posted by Marcela Uliano View Post
                          Hey guys,

                          I have a related question, although it's different case, maybe you will be able to give me some inputs.
                          So, I have a hybrid assembly (illumina + pacbio subreads - DBG2OCL) for an eukaryotic genome, and now I'm in doubt if I should quiver this assembly or not.
                          Once, as I read, quiver calls consensus based on the pacbio quality of the aligned PacBio, what it would do to the regions that are covered only by my Illumina contigs?
                          Does it make sense to run quiver in a hybrid assembly at all?

                          Thanks a lot for the help! =)
                          For reference cross-posted: https://www.biostars.org/p/182647/

                          Comment


                          • #14
                            I find that after Quiver, sequence length of each contig increases ~0.12%. If you are looking at the the mapping result with IGV and load the reference sequence (before polish analysis) and pbalign output bam file as input, you may need to take the coordinate shift into account.

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM
                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, Yesterday, 06:37 PM
                            0 responses
                            7 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, Yesterday, 06:07 PM
                            0 responses
                            7 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-22-2024, 10:03 AM
                            0 responses
                            49 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-21-2024, 07:32 AM
                            0 responses
                            66 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X