  • Are gaps in split alignments reported in mpileup output?

    Will a split alignment covering a large indel report the intervening bases as gaps, or '-' alleles, in an mpileup output from that alignment?

    I have two pooled sequencing samples that represent a mix of two genomes. I want to determine the relative allele frequencies in the two pools. The two genomes have extensive structural variation that I would like to avoid ignoring in the analysis.

    I generated a mixed reference by aligning the two genomes and replacing any gap in one sequence of the alignment with the corresponding allele from the other genome.
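A minimal sketch of the merge step described above, assuming the pairwise alignment is available as two equal-length gapped strings. The function name is hypothetical, and keeping the Genome A base wherever both genomes have one is an assumption (the post does not say which allele is used at mismatches).

```python
def build_mixed_reference(aligned_a: str, aligned_b: str) -> str:
    """Merge two gapped alignment rows into one gap-free 'mixed' sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    mixed = []
    for base_a, base_b in zip(aligned_a, aligned_b):
        if base_a != "-":
            # Assumption: prefer the Genome A base when both rows have a base.
            mixed.append(base_a)
        elif base_b != "-":
            # Gap in A: fill it with the corresponding Genome B base.
            mixed.append(base_b)
        # Columns that are gaps in both rows contribute nothing.
    return "".join(mixed)


if __name__ == "__main__":
    print(build_mixed_reference("AC--GT", "ACTTG-"))
```

With this layout every base present in either genome has a coordinate in the mixed reference, so a read from the genome lacking an insertion has to be aligned with a deletion (or be split) across it.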

    I then aligned the reads from each pool to the mixed reference with bwa (to allow for split alignments), ran mpileup, and determined allele frequencies (I ended up using a script from popoolation to parse the VCF and get these values). I filtered out anything below Q20 mapping quality before determining allele frequencies.

    However, at any site inside a large indel, where Genome A has a base but Genome B does not, the '-' alleles are not being reported.
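For reference, allele counts (including gap alleles) can also be read directly from the read-bases column of samtools mpileup text output. The sketch below is not the popoolation script; the function name is hypothetical, and it assumes the usual pileup encoding: `.`/`,` for reference matches, `*` (or `#` on the reverse strand in newer samtools versions) as a placeholder for a deleted base, `^` plus one quality character at a read start, `$` at a read end, and `+N`/`-N` followed by N bases for indels announced at the preceding position.

```python
import re
from collections import Counter


def count_pileup_alleles(ref_base: str, bases: str) -> Counter:
    """Count alleles in one pileup column; deleted bases are counted as '-'."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            i += 2  # read start marker plus its mapping-quality character
        elif c == "$":
            i += 1  # read end marker
        elif c in "+-":
            # Indel announcement: sign, length, then that many bases.
            m = re.match(r"\d+", bases[i + 1:])
            if m is None:
                raise ValueError("malformed indel in pileup string")
            i += 1 + len(m.group()) + int(m.group())
        elif c in ".,":
            counts[ref_base.upper()] += 1
            i += 1
        elif c in "*#":
            counts["-"] += 1  # placeholder for a base deleted in this read
            i += 1
        elif c in "<>":
            i += 1  # reference skip (CIGAR N); not counted as an allele
        else:
            counts[c.upper()] += 1  # mismatch, either strand
            i += 1
    return counts


if __name__ == "__main__":
    print(count_pileup_alleles("A", ".,-2at*^].G$#"))
```

Only reads whose own alignment record spans the site can contribute a `*` here, which is relevant to the question about split alignments.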

    Does the alignment contain this information? Are the '-' alleles not actually recorded when a read has a split alignment?
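As far as I understand, pileup builds each column from individual alignment records, so a deleted base only shows up when a single record's CIGAR carries a D operation across it. When a read is split into a primary and a supplementary record, the interval between the two pieces is not covered by either record. The sketch below illustrates the difference; the function name, coordinates, and CIGAR strings are made up for illustration.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_blocks(pos: int, cigar: str):
    """Return (aligned, deleted) lists of 1-based inclusive reference intervals."""
    aligned, deleted = [], []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op in "M=X":
            aligned.append((ref, ref + n - 1))
            ref += n
        elif op == "D":
            deleted.append((ref, ref + n - 1))
            ref += n
        elif op == "N":
            ref += n  # reference skip: consumes reference, not a deletion
        # I, S, H and P do not consume reference positions.
    return aligned, deleted


if __name__ == "__main__":
    # One gapped record spanning a 2 kb deletion.
    print(reference_blocks(100, "50M2000D50M"))
    # The same read reported as a split alignment: two clipped records.
    print(reference_blocks(100, "50M50S"))
    print(reference_blocks(2150, "50H50M"))
```

In the gapped case positions 150 to 2149 are recorded as deleted; in the split case no record says anything about them, so there is nothing for a per-site caller to count as a '-' allele.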

    Short indels appear to be working correctly, but not long ones. Do I need to be using a different set of tools to analyze these regions separately from the SNPs and short indels?

    Any help would be greatly appreciated.

  • #2
    You can capture long deletions when mapping with BBMap; it places them in a single gapped alignment rather than as two alignments, allowing you to analyze with the same tools you use for short indels. Just add the flag "maxindel=200000" (the default is 16000).
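A small sketch of how such a run could be assembled from Python. The file names are placeholders and the helper name is hypothetical; `ref=`, `in=`, `out=` and `maxindel=` are BBMap's key=value parameters, and `bbmap.sh` is assumed to be on the PATH.

```python
import shlex


def bbmap_command(ref: str, reads: str, out: str, maxindel: int = 200000) -> list:
    """Build a bbmap.sh argument list that allows long gapped alignments."""
    return [
        "bbmap.sh",
        f"ref={ref}",
        f"in={reads}",
        f"out={out}",
        f"maxindel={maxindel}",  # raised from the 16000 default mentioned above
    ]


if __name__ == "__main__":
    cmd = bbmap_command("mixed_ref.fa", "pool1.fq", "pool1.sam")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```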



    • #3
      Thanks for the tip!

      I will give it a shot as that would be much simpler than having to map separately to each genome, undergo the analysis, and then synthesize the results.



      • #4
        Hate to double post, but I thought this would be better posted here than in the main BBMap support thread.

        When I align the sequencing reads from the pool to the mixed reference, many gap ('-') alleles are reported in regions that are almost completely conserved between the two parental genomes. The conserved parental allele is still found, but almost half of the alleles at these sites are reported as a gap allele.

        This does not happen when mapping with BWA to one of the parental genomes (though that has its own set of issues).

        This occurs even when I toss ambiguous mappings and reduce the max indel size to 5000, which is more realistic for these small (~85 kb) genomes.

        Unfortunately, this results in extensive false positives in the final results. Any insight into why this might be happening?



        • #5
          How do you know these are false positives? And, can you explain in more detail what you are doing? Like, what do you mean by the parent genomes, for example, and where the reads came from... only viruses have genomes ~85kb, but they don't have parents that I'm aware of. Does BBMap yield the same gaps when mapped to the parent genomes? I guess, a more thorough explanation of the experiment would be helpful.



          • #6
            Apologies for the lack of clarity on my part; I have limited experience with short-read alignment. Thank you for your super fast replies.

            Originally posted by Brian Bushnell:
            "Like, what do you mean by the parent genomes, for example, and where the reads came from... only viruses have genomes ~85kb, but they don't have parents that I'm aware of."

            The "parental" genomes are yeast mitochondrial genomes, which explains the genome length of ~85 kb (short, but long compared to metazoan mtDNA), though size varies quite a bit between strains. These genomes are fairly problematic for alignment in some ways: large, repetitive, AT-rich intergenic areas are interspersed with short (30 bp) repetitive GC-rich areas, although the repeat lengths are generally much shorter than the read length. I can imagine the low complexity giving rise to issues in the AT-rich areas.

            I refer to them as parents because I conducted an experiment in which two haploid strains with identical nuclear genomes but different mtDNAs are allowed to mate, which in yeast can produce recombinant mtDNAs. I have the full mitochondrial genome of each parent (and I am reasonably confident of their accuracy).

            Originally posted by Brian Bushnell:
            "I guess, a more thorough explanation of the experiment would be helpful."

            I sequenced one pool in which I selected for diploid cells after mating the two parental haploids, and a second pool in which an aliquot of the first pool had undergone selection. The reads come from these two pools, and each pool is mapped separately to the same reference.

            My end goal is to analyze the difference in allele frequencies between the selected and unselected pools, to help map the genetic basis of the phenotypic differences between the two parental mtDNAs.
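A minimal sketch of the per-site comparison described above, assuming allele counts for one site are already available for each pool as dictionaries (with '-' as the gap allele). The function names and the example counts are hypothetical, and this computes only a raw frequency difference, not a significance test.

```python
def allele_frequency(counts: dict, allele: str) -> float:
    """Fraction of reads at a site that carry the given allele."""
    total = sum(counts.values())
    if total == 0:
        return float("nan")  # no coverage at this site
    return counts.get(allele, 0) / total


def frequency_shift(unselected: dict, selected: dict, allele: str) -> float:
    """Allele frequency in the selected pool minus that in the unselected pool."""
    return allele_frequency(selected, allele) - allele_frequency(unselected, allele)


if __name__ == "__main__":
    unselected = {"A": 60, "G": 40}
    selected = {"A": 90, "G": 10}
    print(frequency_shift(unselected, selected, "A"))
```

A positive value means the allele is enriched after selection. Sites with low coverage in either pool would need separate handling before the shifts are compared across the genome.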

            Originally posted by Brian Bushnell:
            "How do you know these are false positives? And, can you explain in more detail what you are doing? Does BBMap yield the same gaps when mapped to the parent genomes?"

            I am not totally certain these are false positives, and I can think of a couple of biological reasons for this to occur. I think it's more straightforward to explain what I do know that makes me suspect these alignments are spurious.

            1. The parental references are nearly invariant in this region. The region does contain introns known to be absent in many strains, so I cannot rule out such a change without further investigation.
            2. The alignments show gaps spanning exons of COX1, a critical respiratory gene. The second pool was grown under conditions that require respiration for growth, yet the alignment of these reads also shows a high frequency of '-' alleles spanning exons in this region.

            I am currently mapping the reads to one of the parental genomes and will update this post, or add one below, as soon as I have that information.

            Edit: The '-' alleles are also appearing when aligning directly to one of the parental genomes.

            Frankly, not nearly enough is known about the recombination dynamics of yeast mtDNA. If these alignments are not spurious, that could prove to be extremely interesting.
            Last edited by JWolters; 09-02-2016, 10:10 AM.
