  • Are gaps in split alignments reported in mpileup output?

    Will a split alignment covering a large indel report the intervening bases as gaps, or '-' alleles, in an mpileup output from that alignment?

    I have two pooled sequencing samples that represent a mix of two genomes. I want to determine the relative allele frequencies in the two pools. The two genomes have extensive structural variation that I would like to avoid ignoring in the analysis.

    I generated a mixed reference by aligning the two genomes and replacing any gap in one sequence with the corresponding allele from the other genome.

    I then aligned the reads from the pools to the mixed reference using bwa to allow for split alignments, ran mpileup, and determined allele frequencies (I ended up using a script from popoolation to parse the VCF and get these values). I filtered out anything below Q20 mapping quality before determining allele frequencies.
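
    As a rough illustration of the allele-counting step described above, here is a minimal Python sketch that tallies alleles from a single mpileup read-bases string. This is not the popoolation script; the function name `count_pileup_alleles` and the example string are hypothetical, and the sketch assumes the standard samtools pileup encoding, in which a base deleted within a single gapped alignment appears as `*` (or `#` when `--reverse-del` is used).

```python
import re
from collections import Counter


def count_pileup_alleles(ref_base, bases):
    """Count alleles in one mpileup read-bases string (hypothetical helper).

    Deleted-base placeholders ('*' or '#') are counted as the '-' allele.
    """
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # Start of a read: '^' is followed by one mapping-quality character.
            i += 2
        elif c == "$":
            # End of a read: carries no allele information.
            i += 1
        elif c in "+-":
            # Indel starting after this position, e.g. '-2AT'.
            # It is skipped here; the deleted bases show up as '*' downstream.
            digits = re.match(r"\d+", bases[i + 1:]).group()
            i += 1 + len(digits) + int(digits)
        elif c in ".,":
            # Match to the reference on the forward or reverse strand.
            counts[ref_base.upper()] += 1
            i += 1
        elif c in "*#":
            # Placeholder for a base deleted inside a gapped alignment.
            counts["-"] += 1
            i += 1
        elif c in "<>":
            # Reference skip (CIGAR N): not an allele.
            i += 1
        else:
            # Mismatch base (A, C, G, T, N in either case).
            counts[c.upper()] += 1
            i += 1
    return counts


# Hypothetical example string, not taken from real data.
counts = count_pileup_alleles("G", "..,,*^].A$-2AT,*")
gap_frequency = counts["-"] / sum(counts.values())
print(dict(counts), gap_frequency)
```

    Under this encoding, a `-` allele can only be counted at a site if some alignment record spans that site with a deletion.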

    However, whenever I have a site that is in a large indel that has a base in Genome A, but no base in Genome B, the '-' alleles are not being reported.

    Does the alignment contain this information? Are the '-' alleles not actually recorded when a read has a split alignment?

    Short indels appear to be working correctly, but not long ones. Do I need to be using a different set of tools to analyze these regions separately from the SNPs and short indels?

    Any help would be greatly appreciated.

  • #2
    You can capture long deletions when mapping with BBMap; it places them in a single gapped alignment rather than as two alignments, allowing you to analyze with the same tools you use for short indels. Just add the flag "maxindel=200000" (the default is 16000).
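
    To illustrate the difference between a single gapped alignment and a split alignment, here is a minimal Python sketch that lists the reference positions covered by deletion (`D`) operations in a CIGAR string. The function name `deleted_ref_positions` and the example coordinates are hypothetical. The sketch assumes the usual SAM conventions: a 1-based leftmost position, and `M`, `D`, `N`, `=`, and `X` as the operations that consume reference bases.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def deleted_ref_positions(pos, cigar):
    """Return the 1-based reference positions covered by D operations."""
    deleted = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "D":
            deleted.extend(range(ref, ref + n))
        if op in "MDN=X":
            # Only these operations advance along the reference.
            ref += n
    return deleted


# A single gapped alignment spans the deletion, so the deleted positions
# belong to the record and can be reported as gap alleles.
print(deleted_ref_positions(100, "50M5D50M"))

# A split alignment stores two records (clipped primary + supplementary);
# neither record covers the bases in between, so nothing is listed.
print(deleted_ref_positions(100, "50M50S"))
print(deleted_ref_positions(155, "50H50M"))
```

    With the split representation, the intervening bases are simply not covered by either record, which is consistent with the missing `-` alleles described in the first post.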

    • #3
      Thanks for the tip!

      I will give it a shot as that would be much simpler than having to map separately to each genome, undergo the analysis, and then synthesize the results.

      • #4
        I hate to double post, but I thought this would fit better in this context than in the main BBMap support thread.

        When I align my sequencing reads from the pool to the mixed reference, I get a lot of gap ('-') alleles being reported in regions that are actually almost completely conserved in the two parental genomes. The conserved parental allele is still found, but almost half of the alleles at these sites are reported as a gap allele.

        This does not happen when mapping with BWA to one of the parental genomes (though that has its own set of issues).

        This occurs even when I toss ambiguous mappings and reduce the max indel size to 5000, which is more realistic for these small (~85 kb) genomes.

        Unfortunately, this results in extensive false positives in the final results. Any insight into why this might be happening?

        • #5
          How do you know these are false positives? And, can you explain in more detail what you are doing? Like, what do you mean by the parent genomes, for example, and where the reads came from... only viruses have genomes ~85kb, but they don't have parents that I'm aware of. Does BBMap yield the same gaps when mapped to the parent genomes? I guess, a more thorough explanation of the experiment would be helpful.

          • #6
            Apologies for the lack of clarity on my part; I have limited experience with short-read alignment. Thank you for your super fast replies.

            Originally posted by Brian Bushnell:
            Like, what do you mean by the parent genomes, for example, and where the reads came from... only viruses have genomes ~85kb, but they don't have parents that I'm aware of.
            The "parental" genomes are yeast mitochondrial genomes, which explains the short (though long compared to metazoan mtDNA) genome length of ~85 kb; sizes do vary quite a bit between strains. These genomes are problematic for alignment in some ways because of large, repetitive AT-rich intergenic regions interspersed with short (30 bp) repetitive GC-rich elements, but the repeat lengths are generally much shorter than the read length. I can imagine the low complexity causing issues in the AT-rich areas.

            I refer to them as parents because I conducted an experiment in which two haploid strains with identical nuclear genomes but different mtDNAs were allowed to mate, which in yeast can produce recombinant mtDNAs. I have the full mitochondrial genome sequences of both parents (and I am reasonably confident of their accuracy).

            Originally posted by Brian Bushnell:
            I guess, a more thorough explanation of the experiment would be helpful.
            I sequenced one pool in which I selected for diploid cells after mating the two parental haploids, and sequenced a second pool in which an aliquot of the first pool had undergone a selection. The reads come from these two pools and each pool is being mapped to the same reference separately.

            My end goal is to analyze the difference in allele frequencies between the selected and unselected pools to help map the genetic basis of phenotypic differences between the two parental mtDNAs.
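
            The comparison described here can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names and the per-site counts are hypothetical, each pool is summarized as a mapping from allele to read count at one site, and no statistical test or coverage correction is applied.

```python
def allele_frequencies(counts):
    """Convert per-allele read counts at one site into frequencies."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}


def frequency_shift(unselected, selected):
    """Per-allele frequency change from the unselected to the selected pool."""
    f_unsel = allele_frequencies(unselected)
    f_sel = allele_frequencies(selected)
    alleles = sorted(set(f_unsel) | set(f_sel))
    return {a: f_sel.get(a, 0.0) - f_unsel.get(a, 0.0) for a in alleles}


# Hypothetical counts at a single site ('-' is the gap allele).
unselected_pool = {"A": 50, "-": 50}
selected_pool = {"A": 80, "-": 20}
print(frequency_shift(unselected_pool, selected_pool))
```

            A positive value means the allele is more frequent in the selected pool. Spurious `-` alleles would distort both pools' frequencies, which is why the alignment question matters for this comparison.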

            Originally posted by Brian Bushnell:
            How do you know these are false positives? And, can you explain in more detail what you are doing? Does BBMap yield the same gaps when mapped to the parent genomes?
            I am not totally certain these are false positives, and I can think of a couple of biological reasons for this to occur. I think it's more straightforward to explain what I do know that makes me suspect these alignments are spurious.

            1. The parental references are nearly invariant in this region. The region does contain introns known to be absent in many strains, so I cannot rule out such a change without further investigation.
            2. The alignments are showing gaps mapping across exons of COX1, a critical respiratory gene. The second pool was grown in conditions requiring respiration to grow, but the alignment of these reads also shows a high frequency of '-' alleles in this region spanning across exons.

            I am currently mapping the reads to one of the parental genomes and will update this post, or add one below, as soon as I have that information.

            Edit: The '-' alleles are also appearing when aligning directly to one of the parental genomes.

            Frankly, not nearly enough is known about the recombination dynamics of yeast mtDNA. If these alignments are not spurious that could prove to be extremely interesting.
            Last edited by JWolters; 09-02-2016, 10:10 AM.
