SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Quantifying allele-specific expression imbalance using NGS jastop Bioinformatics 2 08-29-2012 08:44 AM
allele-specific expression baohua100 Bioinformatics 0 05-11-2011 12:32 AM
Test of allele-specific expression by genome and transcritome sequencing baohua100 RNA Sequencing 1 05-02-2011 07:00 AM
allele-specific expression by direct genome and transcritpome sequencing baohua100 Bioinformatics 0 05-01-2011 02:02 AM
PubMed: Allele-specific expression assays using Solexa. Newsbot! Literature Watch 0 09-11-2009 03:01 AM

Reply
 
Thread Tools
Old 07-28-2010, 07:58 AM   #1
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default RNAseq: Pipeline to detect allele specific expression

Hello,

I wanted to ask your advice about a pipeline to detect differences in allelic expression between RNAseq libraries.

I have two RNAseq libraries, say Control and Treated. The questions I would like to ask are:
1) For a given transcripts with heterozygous SNPs, are some alleles more expressed than others?
2) Are some alleles induced/repressed in the Treated library?

The pipeline I plan to use to go about this is:

- For each library, align reads against the reference genome using bowtie/tophat (output in SAM format)
- Use samtools to produce a pileup file including the consensus calling. E.g:
Code:
samtools pileup -cvf reference.fa alignment.sorted.bam > ouput.pileup
- Extract from the pileup files the heterozygous SNPs. I.e. those where the consensus column is one of R (A or G), Y (C or T), S (G or C), W (A or T), K (G or T), M (A or C) and SNP phred score is >=30 (or >=20?).
- For each selected SNP in each library: Do a binomial test to see if allele1:allele2 != 50:50. Correct the p-values to have FDR<0.1
- For each selected SNP: Compared the two libraries with Fisher's exact test to see if one allele is more expressed in one of the two library (i.e. if there is inducible allele specific expression). Correct the p-values to have FDR<0.1.

Some pitfalls I can think of:
- Reads with SNP alleles not in the reference have less chance to be aligned thus introducing a bias in favour of the reference alleles. How dramatic can this bias be?
- Is the SNP filtering above correct? (Stringent enough?)
- Some transcripts will have more than one SNP locus. However the pipeline above tests each SNP independently. Is there any (...manageable) way to account for this? (I guess a better way would be to test the allelic expression based on haplotypes rather than individual SNPs).

Any and all comments welcome. Hopefully these will be useful to others.

Dario
dariober is offline   Reply With Quote
Old 07-28-2010, 09:13 AM   #2
epigen
Senior Member
 
Location: Germany

Join Date: May 2010
Posts: 101
Default

Your pipeline is almost the same as mine. I used different aligners (SOLiD data) and required Phred score >=20 and additionally 20<= number of reads<= 100 for the SNP position. Also excluded positions with indels and >1/4 at the end of a read.

I read some papers that mentioned the bias for the reference sequence but ended up ignoring it because it seemed there is no easy and reliable way to get rid of it. How bad the bias is was left unclear to me.
The papers are:
Heap et al. 2010, Hum Mol Genet 19:122
Tuch et al. 2010, PLoSOne 0009317
Degner et al. 2009, Bioinformatics 25:3207
Chepelev et al. 2009, Nucl Acids Res 37:e105

Since was interested in alleles that are ~50:50 in the treated sample but biased for one allele in the control, I had rather few candidate SNPs and can't give advice on multiple SNPs per transcript. But I'd be very interested in learning more!

Barbara
epigen is offline   Reply With Quote
Old 08-25-2010, 05:06 AM   #3
GoneSouth
Member
 
Location: Vienna

Join Date: Aug 2008
Posts: 11
Default

If I understand your pipeline correctly, you will not get the SNPs which are the most differentiated between the two samples.
If you have only allele eg: 'T' in one sample and only allele 'A' in the other?
GoneSouth is offline   Reply With Quote
Old 08-25-2010, 05:16 AM   #4
epigen
Senior Member
 
Location: Germany

Join Date: May 2010
Posts: 101
Default

As I mentioned, we wanted heterozygous SNPs in the treated sample (e.g. A and T allele present at ~50:50) that were homozygous in the control sample (only A or T allele present). So a total switch from T to A was not interesting for me.
epigen is offline   Reply With Quote
Old 11-03-2010, 02:41 AM   #5
sparks
Senior Member
 
Location: Kuala Lumpur, Malaysia

Join Date: Mar 2008
Posts: 126
Default

Hi, Novoalign can align reads against a reference genome with IUB ambiguous NA codes. This should remove any allelic bias, you just need to code SNPs into the genome using the IUB codes.
Colin
sparks is offline   Reply With Quote
Old 01-15-2011, 03:22 AM   #6
polarise
Member
 
Location: Ireland

Join Date: Jan 2011
Posts: 13
Default

Hi,

I hope it's not too late to comment on this post.

Two things: one, I think there needs to be a threshold or number of reads that align to a particular site for it to be called 'potential ASE'. There is a paper by Fontanillas et al. (2010) in which they show that the number of reads affects the power to detect ASE.

Secondly, I propose that, where possible, there should be reference sequences biased to the population such that there would not be a bias in the Degner- sense (see Degner et a. 2009 paper).

Otherwise, I agree with the pipeline.

P. Korir
polarise is offline   Reply With Quote
Old 01-17-2011, 05:58 AM   #7
dariober
Senior Member
 
Location: Cambridge, UK

Join Date: May 2010
Posts: 311
Default

Quote:
Originally Posted by polarise View Post
I hope it's not too late to comment on this post.
No, it's not! Many thanks to everyone for sharing their comments and expertise!

Dario
dariober is offline   Reply With Quote
Old 01-17-2011, 07:45 AM   #8
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Since this thread has been revived not too long ago I thought I might ask a question about allele specific alignments as well.

We were thinking of taking a slightly different approach to analyse sequencing data in an allele-specific manner and I would like to know if someone can spot any obvious problems with it.

We are mainly working with mouse models from back-crossed strains for which we have a known set of genome-wide SNPs available, so the approach I'll describe here is probably only useful if detailed genome information is available. To avoid some of the pitfalls Dario mentioned above (especially biasing alignments towards the reference alleles without SNPs) we wrote a program which takes in sequencing file(s) and aligns the sequence (or sequence pair) to two reference genomes in parallel (using Bowtie). In order for this to work we made up individual versions of mouse (or yeast) genomes and indexed them separately (this should also work with e.g. the normal mouse genome and one chromosome of a different strain).

We then analyse the alignment outputs on the fly and decide whether an alignment is specific for one or the other genome (i.e. has a uniquely best alignment for one of them) or if it aligns to both genomes equally well (e.g. if it doesn't contain a SNP which allowing us to say something about the origin). It will then print the alignments to different files (unique 1, unique 2, aligns to both) with a number of useful information (mapping positions, genomic sequences and mismatch information). All downstream analysis would then start from these files (such as percent sequence from which allele, basecall/alignment qualities at positions of SNPs and so on).

In its initial version ASAP (for allele specific alignment pipeline) is intended to be used for ChIP-Seq data, but we were thinking of extending this to RNA-Seq data as well. This of course would have the disadvantage that - at least in its current form - it doesn't work for spliced mapping, however it should still work for the majority of reads which do not span splice junctions anyway.

I would be happy to know if people think this is a useful tool or if you can see major flaws which did not occur to us so far. I should mention that ASAP is not fully ready yet, but single-end alignments are nearly working and the paired-end option is yet to be implemented (overall progress is I'd say around 87%. Roughly).

thanks,
Felix
fkrueger is offline   Reply With Quote
Old 01-19-2011, 03:50 AM   #9
epigen
Senior Member
 
Location: Germany

Join Date: May 2010
Posts: 101
Default candidates not allele-specific

Finally we got the results from the validation of our candidates - and a big disappointment: almost all turned out to be expressed perfectly 50:50 instead of allele-specific!
Since in some papers they only consider reads with different start sites, I next tried duplicate removal with Picard, which left me with less than half of the reads (~20 Mio remaining per sample), and used the new samtools pileup version to call SNPs. This resulted in only a handful of new candidates, which are currently being checked in the lab.
I guess we'll have to sequence more because with 20 Mio reads left, we can only detect highly expressed genes. The problem I see in duplicate removal is that true PCR duplicates can't be distinguished from reads that align to the same position by chance just because the gene/allele is so highly expressed. And duplicate removal will of course keep the read with the highest mapping score, which will be the one with the reference SNP.

Has any of you worked out a successful approach?

@fkrueger:
To me ASAP sounds like a clever idea. Unfortunately it's not applicable for humans (my data). Even if you sequence the individual's DNA first to find the SNPs, you never know which SNP belongs to which chromosome so you can't figure out the correct combinations ...
epigen is offline   Reply With Quote
Old 07-17-2015, 01:46 PM   #10
metheuse
Member
 
Location: US

Join Date: Jan 2013
Posts: 78
Default

I'm also trying to measure allele-specific-expression by exome-seq and RNA-seq, with the former to find heteroSNPs and alleles, and the latter to calculate ASE on the identified positions. I found this post to be quite helpful and it's just what I was proposing to do.

I also did some method search and found some programs specifically designed for ASE (MMSeq, eXpress, AlleleSeq, etc), but they all require construction of haplotypes (two versions of reference genome/transcriptome), which makes a lot of sense and should be the most accurate way to avoid mapping bias and help integrating SNP-level ASE into gene-level ASE. However in most of the cases, people don't have sequencing data from trio. There are several tools for haplotyping from exome-seq data, like GATK's ReadBackedPhasing and HapCompass, but they can only phase some blocks of the variants which are close to each other and thus were covered by continuous reads.

I got to know an aligner called GSNAP that can align RNA-seq reads with tolerance of SNPs. I'm gonna try this to see how the output is different from Tophat alignment.
metheuse is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:37 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO