SEQanswers

12-09-2016, 08:49 AM   #1
kintany
Junior Member
 
Location: Boston

Join Date: May 2012
Posts: 5
Allelic Imbalance with GSNAP

Hi,

I'm working on creating a simple pipeline for allelic imbalance analysis for our lab. We want to analyze allelic imbalance in F1 crosses between two mouse strains (CAST-EiJ and 129S1-SvlmJ), so one allele comes from the CAST-EiJ genome and the other from the 129S1-SvlmJ genome. I need to map reads to the two alleles and then run different tests.
The first issue is reference mapping bias. To overcome it, I want to use the variant-aware aligner GSNAP. It requires one fasta sequence (treated as the 'reference' for this task) and a list of SNPs between the two alleles (in my case, this is the same as the SNPs between the two strains). I have two fasta files with the sequences of the two mouse strains. I also downloaded VCF files for both strains, but these VCF files describe the differences between each strain and the reference genome, mm10. So I probably need to create my own VCF file from just the two fasta sequences. Could you please help me here? Do I need to write my own script to do this (probably align the sequences and list the differences), or is there a program that does this?
Or maybe I'm complicating things and all this can be done some other way? Thanks a lot!
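
One possible starting point, as a minimal sketch rather than a tested pipeline: since both downloaded VCF files are relative to mm10, the strain-vs-strain SNPs can be derived from them directly, without aligning the two fasta files. The function name `strain_diff_snps` and the output layout (chromosome, position, allele in strain A, allele in strain B) are hypothetical. The sketch assumes uncompressed, non-empty, single-sample VCFs from fully inbred (homozygous) strains, considers only single-base SNPs, and treats a site missing from one VCF as matching mm10 in that strain, which ignores no-call regions. Converting the result into the SNP format GSNAP expects is not shown.

```shell
# Hypothetical helper: list sites where two strains differ, given one VCF per
# strain, both called against the same reference (e.g. mm10).
# Usage: strain_diff_snps strainA.vcf strainB.vcf
# Output (tab-separated): chrom, pos, allele in strain A, allele in strain B
strain_diff_snps() {
  awk '
    BEGIN { FS = OFS = "\t" }
    /^#/ { next }                                   # skip VCF header lines
    length($4) != 1 || length($5) != 1 { next }     # keep single-base SNPs only
    FNR == NR {                                     # first file: strain A
      alt_a[$1 FS $2] = $5
      ref_a[$1 FS $2] = $4
      next
    }
    {                                               # second file: strain B
      key = $1 FS $2
      if (key in alt_a) {
        # Variant in both strains: informative only if the alleles differ.
        if (alt_a[key] != $5) print $1, $2, alt_a[key], $5
        delete alt_a[key]
      } else {
        # Variant only in strain B: strain A is assumed to carry the reference base.
        print $1, $2, $4, $5
      }
    }
    END {
      # Variants only in strain A: strain B is assumed to carry the reference base.
      for (key in alt_a) print key, alt_a[key], ref_a[key]
    }
  ' "$1" "$2" | sort -k1,1 -k2,2n
}
```

Sites where both strains carry the same non-reference allele are dropped on purpose, because they cannot distinguish the two alleles in the F1.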
12-09-2016, 09:00 AM   #2
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

I think, maybe, you are over-complicating it. The point of mapping is to place reads at their origin; discarding reads with low identity to their origin incurs reference bias. So, reference bias is generally an artifact of mapping programs that have insufficient sensitivity. I suggest you try BBMap, which has very high sensitivity (meaning, it can align reads with low identity to the reference).

In your specific case, since you have fasta files for two different mouse strains, I suggest using BBSplit to allocate the reads to the different strains, then use BBMap to map to each strain independently.
12-09-2016, 01:12 PM   #3
kintany
Junior Member
 
Location: Boston

Join Date: May 2012
Posts: 5

Brian, thank you for the answer!

Reference bias is generally not about sensitivity. It is allelic mapping bias: a read carrying the alternative allele of a variant has at least one mismatch, and thus has a lower probability of aligning correctly than a read carrying the reference allele. And this would be true regardless of the overall sensitivity of the aligner.

BBMap sounds good, thank you. You write

"I suggest using BBSplit to allocate the reads to the different strains".

What do you mean by "allocate"? You are suggesting mapping reads to the individual genomes, right? And is pileup.sh then able to calculate coverage for the two alleles? Thank you!
12-09-2016, 02:13 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Quote:
Originally Posted by kintany
Reference bias is generally not about sensitivity. It is allelic mapping bias: a read carrying the alternative allele of a variant has at least one mismatch, and thus has a lower probability of aligning correctly than a read carrying the reference allele.
I'm not sure I agree with that. Basically, a perfect aligner would map all reads somewhere. So even if a read has some mismatches, it should get mapped to its origin, as long as the sensitivity is sufficient. In rare cases, changes would make it map to somewhere else better, which would incur ref bias; but in my experience, the leading cause of ref bias is mapper insensitivity (meaning, reads that don't match the reference simply don't get mapped), rather than coincidental matches to other parts of the genome due to mutations or errors.

Quote:
BBMap sounds good, thank you. You write

"I suggest using BBSplit to allocate the reads to the different strains".

What do you mean by "allocate"? You are suggesting mapping reads to the individual genomes, right? And is pileup.sh then able to calculate coverage for the two alleles? Thank you!
If you give BBSplit multiple reference fastas, it will take a single input fastq (or two paired fastqs) and produce multiple output fastqs, one per reference. The outputs will be the reads that best match each reference. You can specify what should be done with reads matching multiple references equally well with the "ambig2" flag (ambig2=toss, ambig2=all, etc).
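
The BBSplit step described above, followed by mapping each output to its own strain with BBMap, might look roughly like the sketch below. The file names and the `run`, `split_by_strain`, and `map_to_strain` helpers are hypothetical, and `nodisk` (keep the index in memory instead of writing it to disk) is an optional choice that is not part of the advice in this thread. With `DRY_RUN=1` the helpers only print the commands, which is handy for checking them before a long run.

```shell
# Run a command, or just print it when DRY_RUN=1 (hypothetical helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Split one fastq into one output fastq per reference.
# Usage: split_by_strain reads.fq strainA.fa strainB.fa
split_by_strain() {
  # basename=...%... : BBSplit replaces % with the name of each reference.
  # ambig2=toss      : discard reads that match both references equally well.
  run bbsplit.sh in="$1" ref="$2,$3" basename=split_%.fq ambig2=toss
}

# Map one set of reads to a single strain genome.
# Usage: map_to_strain reads.fq strain.fa mapped.sam
map_to_strain() {
  run bbmap.sh in="$1" ref="$2" out="$3" nodisk
}
```

For example, `DRY_RUN=1 split_by_strain reads.fq CAST.fa 129S1.fa` prints the BBSplit command without running it. Whether `ambig2=toss` or `ambig2=all` is the better choice depends on how ambiguous reads should count in the downstream allelic test.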

As for Pileup - all it does is calculate the coverage according to a sam/bam file. So, for example, if all reads were mapped correctly:

Code:
pileup.sh in=mapped.sam out=stats.txt
That would tell you the coverage on a per-scaffold basis. It does not have any understanding of multiple alleles, but it will correctly report the coverage of a sam/bam file that was mapped to multiple concatenated references representing different alleles.
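
If the reads were mapped to concatenated references, the per-scaffold stats can be summed per strain afterwards. The sketch below assumes two things that are not stated in this thread: that scaffold names were given a strain prefix before concatenation (for example `CAST_chr1` and `129S1_chr1`, a hypothetical naming scheme), and that the stats file is tab-separated with a header row containing `Plus_reads` and `Minus_reads` columns. The columns are looked up by name, so the function prints nothing if the header differs. `allele_read_counts` is a hypothetical helper name.

```shell
# Sum mapped reads per strain from a per-scaffold stats file.
# Usage: allele_read_counts stats.txt
# Output (tab-separated): strain prefix, total reads (plus + minus strand)
allele_read_counts() {
  awk '
    BEGIN { FS = "\t" }
    NR == 1 {
      # Locate the read-count columns by header name instead of by position.
      for (i = 1; i <= NF; i++) {
        if ($i == "Plus_reads")  p = i
        if ($i == "Minus_reads") m = i
      }
      next
    }
    p && m {
      split($1, parts, "_")            # strain prefix = text before the first "_"
      total[parts[1]] += $p + $m
    }
    END { for (s in total) print s "\t" total[s] }
  ' "$1" | sort
}
```

These totals are raw read counts per strain; a per-gene or per-SNP comparison would need counts restricted to the regions of interest.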

Tags
allele-specific rna-seq, rna-seq aligners
