  • kintany
    Junior Member
    • May 2012
    • 5

    Allelic Imbalance with GSNAP

    Hi,

    I'm working on a simple pipeline for allelic imbalance analysis for our lab. We want to analyze allelic imbalance in F1 crosses between two mouse strains (CAST-EiJ and 129S1-SvlmJ), so one allele comes from the CAST-EiJ genome and the other from 129S1-SvlmJ. I need to map reads to the two alleles and then run different tests.
    The first issue is reference mapping bias. To overcome it, I want to use the variant-aware aligner GSNAP. It requires one fasta sequence (treated as the 'reference' for this task) and a list of SNPs between the two alleles (in my case, this is the same as the SNPs between the two strains). I have a fasta file for each mouse strain. I also downloaded VCF files for both strains, but these describe the differences between each strain and the reference genome, mm10, not the differences between the two strains. So I probably need to create my own VCF file from the two fasta sequences. Could you please help me here? Do I need to write my own script to do this (probably align the sequences and list the differences), or is there a program that does it?
    Or maybe I'm complicating things and all this can be done another way? Thanks a lot!
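One possible shortcut for the SNP list described above is to compare the two per-strain VCFs directly, since both are called against mm10. The sketch below is only an illustration under several assumptions that are not from the thread: the VCFs are uncompressed, every call is homozygous (plausible for inbred strains, but worth checking), only single-base substitutions are kept, and a site missing from one VCF is taken to carry the mm10 allele. The function name `strain_diff` and any file names are hypothetical. The output uses mm10 coordinates, so it would only suit a setup in which the mapping reference shares those coordinates.

```shell
# strain_diff A.vcf B.vcf
# Prints: chrom, pos, allele in strain A, allele in strain B
# for every SNP site where the two strains differ.
# Assumptions: uncompressed VCFs against the same reference, homozygous calls,
# single-base REF/ALT only, missing site = reference allele.
strain_diff() {
  awk -F'\t' '
    /^#/ { next }                                 # skip VCF header lines
    length($4) != 1 || length($5) != 1 { next }   # keep simple SNPs only
    { key = $1 "\t" $2; ref[key] = $4 }           # remember the reference base
    FNR == NR { a[key] = $5; next }               # first file: strain A
    { b[key] = $5 }                               # second file: strain B
    END {
      for (k in ref) {
        alleleA = (k in a) ? a[k] : ref[k]
        alleleB = (k in b) ? b[k] : ref[k]
        if (alleleA != alleleB) print k "\t" alleleA "\t" alleleB
      }
    }
  ' "$1" "$2" | sort -k1,1 -k2,2n
}
```

Sites where both strains carry the same non-reference allele are dropped, because they do not distinguish the two alleles of the F1.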
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    I think, maybe, you are over-complicating it. The point of mapping is to place reads on their origin; discarding reads with low identity to their origin incurs ref-bias. So, reference-bias is generally an artifact of mapping programs that have insufficient sensitivity. I suggest you try BBMap, which has very high sensitivity (meaning, it can align reads with low identity to the reference).

    In your specific case, since you have fasta files for two different mouse strains, I suggest using BBSplit to allocate the reads to the different strains, then use BBMap to map to each strain independently.


    • kintany
      Junior Member
      • May 2012
      • 5

      #3
      Brian, thank you for the answer!

      Reference bias is generally not about sensitivity. It is allelic mapping bias: a read carrying the alternative allele of a variant has at least one mismatch, and thus has a lower probability of aligning correctly than a read carrying the reference allele. This would be true regardless of the overall sensitivity of the aligner.

      BBMap sounds good, thank you. You wrote

      "I suggest using BBSplit to allocate the reads to the different strains".

      What do you mean by "allocate"? Are you suggesting that I map the reads to each genome individually? And would Pileup.sh then be able to calculate coverage for the two alleles? Thank you!


      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        Originally posted by kintany:
        Reference bias is generally not about sensitivity. It is allelic mapping bias: a read carrying the alternative allele of a variant has at least one mismatch, and thus has a lower probability of aligning correctly than a read carrying the reference allele.
        I'm not sure I agree with that. Basically, a perfect aligner would map all reads somewhere. So even if a read has some mismatches, it should get mapped to its origin, as long as the sensitivity is sufficient. In rare cases, changes would make it map to somewhere else better, which would incur ref bias; but in my experience, the leading cause of ref bias is mapper insensitivity (meaning, reads that don't match the reference simply don't get mapped), rather than coincidental matches to other parts of the genome due to mutations or errors.
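The two positions above can be compared with a toy calculation. The sketch assumes a deliberately simplified aligner that maps a read only if it has at most a fixed number of mismatches; real aligners, including BBMap and GSNAP, score alignments differently, so the numbers are illustrative only. The function name `p_at_most`, the read length of 100, and the 1% error rate are all hypothetical choices. A read carrying the alternative allele spends one of its allowed mismatches on the SNP, so it tolerates one fewer sequencing error.

```shell
# p_at_most K N E
# Probability of at most K sequencing errors in a read of N bases
# with per-base error rate E (binomial model, toy assumption).
p_at_most() {
  awk -v k="$1" -v n="$2" -v e="$3" 'BEGIN {
    term = (1 - e) ^ n                            # P(exactly 0 errors)
    total = 0
    for (i = 0; i <= k; i++) {
      total += term
      term *= (n - i) / (i + 1) * e / (1 - e)     # P(i+1 errors) from P(i errors)
    }
    printf "%.4f\n", total
  }'
}

# Cutoff of 3 mismatches, 100 bp reads, 1% error rate:
p_at_most 3 100 0.01   # reference-allele read: up to 3 errors tolerated
p_at_most 2 100 0.01   # alternative-allele read: the SNP uses one mismatch
```

Under this model the alternative-allele read always has the lower mapping probability, which is the allelic bias kintany describes, while raising the cutoff (a more sensitive aligner) pushes both probabilities toward 1 and shrinks the gap, which is the point made in the reply.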

        Originally posted by kintany:
        BBMap sounds good, thank you. You wrote

        "I suggest using BBSplit to allocate the reads to the different strains".

        What do you mean by "allocate"? Are you suggesting that I map the reads to each genome individually? And would Pileup.sh then be able to calculate coverage for the two alleles? Thank you!
        If you give BBSplit multiple reference fastas, it will take a single input fastq (or two paired fastqs) and produce multiple output fastqs, one per reference. The outputs will be the reads that best match each reference. You can specify what should be done with reads matching multiple references equally well with the "ambig2" flag (ambig2=toss, ambig2=all, etc).
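The split-then-map workflow described above might be written roughly as follows. This is a dry-run sketch that only prints the commands; the file names (`reads.fq`, `CAST.fa`, `129S1.fa`) and the helper names `bbsplit_cmd` and `bbmap_cmd` are hypothetical, and the output names assume that BBSplit substitutes each reference's file name for `%` in `basename`. Check the flags against the usage text of your BBTools version before running anything.

```shell
# Print a BBSplit command: $1 reads, $2 and $3 reference fastas,
# $4 ambig2 mode (defaults to toss).
bbsplit_cmd() {
  printf 'bbsplit.sh in=%s ref=%s,%s basename=split_%%.fq ambig2=%s\n' \
    "$1" "$2" "$3" "${4:-toss}"
}

# Print a BBMap command: $1 reads, $2 reference fasta, $3 output sam.
# nodisk=t keeps the index in memory so the two references do not
# overwrite each other's on-disk index.
bbmap_cmd() {
  printf 'bbmap.sh in=%s ref=%s out=%s nodisk=t\n' "$1" "$2" "$3"
}

bbsplit_cmd reads.fq CAST.fa 129S1.fa toss
bbmap_cmd split_CAST.fq CAST.fa CAST.sam
bbmap_cmd split_129S1.fq 129S1.fa 129S1.sam
```

Printing the commands first makes it easy to review them; piping the output to `sh` would execute them once they look right.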

        As for Pileup - all it does is calculate the coverage according to a sam/bam file. So, for example, if all reads were mapped correctly:

        Code:
        pileup.sh in=mapped.sam out=stats.txt
        That would tell you the coverage on a per-scaffold basis. It does not have any understanding of multiple alleles, but it will correctly report the coverage of a sam/bam file that was mapped to multiple concatenated references representing different alleles.
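If the reads were mapped to a concatenated reference, the per-scaffold statistics could be summarized per strain along these lines. The sketch makes assumptions that are not from the thread: scaffolds were renamed with a strain prefix before concatenation (for example `CAST_chr1` and `129S1_chr1`), and the stats file is tab-delimited with a header line starting with `#ID` that contains `Avg_fold` and `Length` columns. The function name `allele_coverage` is hypothetical.

```shell
# allele_coverage stats.txt
# Prints the length-weighted mean coverage per strain prefix.
# Assumes scaffold names look like <strain>_<scaffold> and a "#ID" header
# naming the Avg_fold and Length columns.
allele_coverage() {
  awk -F'\t' '
    /^#ID/ { for (i = 1; i <= NF; i++) col[$i] = i; next }  # header name -> column number
    /^#/   { next }                                         # skip other comment lines
    {
      split($1, parts, "_"); strain = parts[1]              # strain prefix of the scaffold name
      len = $(col["Length"])
      bases[strain] += $(col["Avg_fold"]) * len
      total[strain] += len
    }
    END { for (s in total) printf "%s\t%.2f\n", s, bases[s] / total[s] }
  ' "$1" | sort
}
```

This gives only a genome-wide view; testing allelic imbalance for individual genes or SNPs would need coverage at a finer resolution than whole scaffolds.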
