SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
cufflinks: Intron not within scaffold kenphi Bioinformatics 0 12-05-2012 04:54 AM
intron phase crayfish Bioinformatics 0 07-17-2012 06:53 AM
Intron coverage in RNAseq dariober Bioinformatics 19 11-29-2011 02:51 AM
Data retention policy for Solexa bioinfosm Illumina/Solexa 1 10-17-2008 03:25 AM
Data retention dsturgill Bioinformatics 6 09-17-2008 12:18 AM

Reply
 
Thread Tools
Old 04-14-2014, 02:36 PM   #1
xguo
Member
 
Location: Maryland

Join Date: Jul 2008
Posts: 48
Default DEXSeq for intron retention

Dear all,
I'm trying to use DEXSeq to detect intron retention events in diseased human samples. Essentially we need to count how many reads map to each intron and then use DEXSeq to do statistical comparison. My question is how we shall prepare the intron annotation and count files for DEXSeq. One way is to retrieve intron gtf file from UCSC table browser, and collapse them into intronic counting bins. The other way is to extract intron coordinates based on exon gtf file prepared by dexseq_prepare_annotation.py included in DEXSeq.

Does anyone have experience in identification of differential intron retention using DEXSeq? It'll be great if you can share your insights in the preparation of intron count files.

thanks
Xiang
xguo is offline   Reply With Quote
Old 04-15-2014, 01:35 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Here's an example script I wrote in R that takes an annotation GFF file prepared by DEXSeq and adds "intronic_part" records with associated gene_ids. You'll find both the original exonic bins and the newer intronic bins in the resulting file (you'd need to change the file names), which you may or may not want. You could also modify this script to make the intronic bins "exonic bins" and just look at exons and introns at once (I don't know how well that'd work) or remove the old exonic bins and simply label the intronic_parts exonic_parts, which would probably make your life easier in DEXSeq (I just modified DEXSeq). For the latter, just change the "asGFF2()" function.

Code:
#!/usr/bin/Rscript
library(GenomicRanges)
library(rtracklayer)
library(parallel)

gff <- import.gff2("Mus_musculus.GRCm38.71.DEXSeq.gff", asRangedData=F)
#Add a level for "intronic_part"
elementMetadata(gff)$type <- factor(elementMetadata(gff)$type, levels=c(levels(elementMetadata(gff)$type), "intronic_part"))
#Fix the exonic_part_number to be a properly formatted character
USE <- which(!is.na(elementMetadata(gff)$exonic_part_number))
exonic_parts <- sprintf("%03i", elementMetadata(gff)$exonic_part_number[USE])
elementMetadata(gff)$exonic_part_number <- as.character(elementMetadata(gff)$exonic_part_number)
elementMetadata(gff)$exonic_part_number[USE] <- exonic_parts

#Split by gene_id
grl <- split(gff, elementMetadata(gff)$gene_id)
#Add the introns as "intronic_parts
add_introns <- function(gr) {
    exons <- gr[which(elementMetadata(gr)$type=="exonic_part"),]
    if(length(exons) > 1) {
        seqname <- seqnames(exons)[-1]
        starts <- end(exons)+1
        starts <- starts[-length(starts)]
        ends <- start(exons)-1
        ends <- ends[-1]
        bounds <- IRanges(start=starts, end=ends)
        strand <- strand(exons)[-1]
        introns <- GRanges(seqnames=seqname, ranges=bounds, strand=strand)
        intron_ids <- sprintf("%03i", c(1:length(introns)))
        #Remove 0-width introns
        DISCARD <- which(width(introns) <= 0)
        if(length(DISCARD) > 0) {
            introns <- introns[-DISCARD]
            intron_ids <- intron_ids[-DISCARD] #Set intron numbers so they follow their respective exonic parts
        }
        if(length(introns) > 0) {
            #create the meta-data
            df <- as.data.frame(elementMetadata(exons))
            nrows <- length(introns)
            metadf <- df[1:nrows,] #does this need to deal with gene_id and transcripts differently?
            metadf <- transform(metadf, gene_id=as.character(gene_id), transcripts=as.character(transcripts))
            metadf$transcripts <- as.character(c(rep(NA, nrows)))
            metadf$type <- factor(c(rep("intronic_part", nrows)), levels=levels(metadf$type))
            metadf$exonic_part_number <- intron_ids
            elementMetadata(introns) <- metadf
            #Merge the GRanges
            gr <- append(gr, introns)
            gr <- gr[order(start(gr), elementMetadata(gr)$type),] #resort
        }
    }
    return(gr)
}
with_introns <- endoapply(grl, add_introns) 
#reorder things
chroms <- sapply(with_introns, function(x) as.factor(seqnames(x))[1])
starts <- sapply(with_introns, function(x) start(x)[1])
o <- order(chroms, starts)
with_introns2 <- with_introns[o]
##Merge into a GRange
#with_introns2 <- unlist(with_introns2, use.names=F, recursive=T)
#Create GFF formatted output
asGFF2 <- function(x) {
    df <- as.data.frame(x)
    aggregates <- which(df$type == "aggregate_gene")
    meta <- character(nrow(df))
    meta[aggregates] <- sprintf("gene_id \"%s\"", df$gene_id[aggregates])
    #This gives introns a transcript "NA" field, which may not be ideal
    meta[-aggregates] <- sprintf("transcripts \"%s\"; exonic_part_number \"%s\"; gene_id \"%s\"", df$transcripts[-aggregates], df$exonic_part_number[-aggregates], df$gene_id[-aggregates])
    paste(df$seqnames, "dexseq_prepare_annotation.py", df$type, df$start, df$end, ".", df$strand, ".", meta, sep="\t")
}
outputGFF <- unlist(lapply(with_introns2, asGFF2))
write.table(outputGFF, file="Mus_musculus.GRCm38.71.DEXSeq.introns.gff", row.names=F, col.names=F, quote=F)
dpryan is offline   Reply With Quote
Old 04-15-2014, 06:04 PM   #3
xguo
Member
 
Location: Maryland

Join Date: Jul 2008
Posts: 48
Default

Dear Devon,
Thanks a lot for sharing your code. I'll give it a try.

regards
-Xiang
xguo is offline   Reply With Quote
Old 04-16-2014, 12:40 AM   #4
areyes
Senior Member
 
Location: Heidelberg

Join Date: Aug 2010
Posts: 165
Default

The perl version of @dpryan's script (be careful, its perl!, :P)

Quote:
open(FILE, "Homo_sapiens.GRCh37.68.DEXSeq.gtf");
my $previousGene = "";
while(<FILE>){
next if $_ =~ /aggregate\_gene/;
$_ =~ /gene_id \"(\S+)\"/;
$currentGene = $1;
@lineinfo = split( /\t/, $_ );
$currentStart = $lineinfo[3];
$currentEnd = $lineinfo [4];
$_ =~ /transcripts \"(\S+)\"/;
$transcripts = $1;
$_ =~ /exonic\_part\_number \"(\S+)\"/;
$exonPart = $1;
$_ =~ /gene\_id \"(\S+)\"/;
$geneID = $1;
if( $previousGene eq $currentGene ){
if( $currentStart - $previousEnd > 1 ){
$exonPart = $exonPart - 1;
$exonPart = sprintf( "%3.3d", $exonPart );
$nPart = $exonPart."i";
$end = $currentStart - 1;
$start = $previousEnd + 1;
print "$lineinfo[0]\t$lineinfo[1]\t$lineinfo[2]\t$start\t$end\t.\t$lineinfo[6]\t.\ttranscripts \"$transcripts\"; exonic_part_number \"$nPart\"; gene_id \"$geneID\"\n";
}
}
print $_;
$previousGene = $currentGene;
$previousEnd = $currentEnd;
}
close(FILE);
areyes is offline   Reply With Quote
Old 04-16-2014, 07:46 AM   #5
xguo
Member
 
Location: Maryland

Join Date: Jul 2008
Posts: 48
Default

Hi, Alejandro,
Thanks for posting the perl script. It is simple and works great. Can we still generate visualization graph with plotDEXSeq if we use the annotation file with both exons and introns?

thanks
Xiang
xguo is offline   Reply With Quote
Old 11-20-2014, 08:45 AM   #6
kajot
Member
 
Location: Germany

Join Date: Dec 2013
Posts: 16
Default

Hi

I tried the script from Devon, thanks AGAIN for help It works nicely and I managed to run DEXSeq with the new intronic annotation. I decided to delete exons and rename introns to exonic_part, to fool DEXSeq into thinking that we are dealing with a normal everyday situation...

I am looking at the result table, filtered with padj < 0.01 and discarding all entries with NA padj or log2FC. I am still dealing with around 270 reported hits. However, after inspecting some of those hits visually in IGV and also by looking at read count, I can see that the majority of events comes from low-coverage regions with most of the events having 0 - 100 counts.

The dataset I am looking at doesn't have a staggering depth, it's a mammalian tissue with ~45M paired-end reads on average per sample (polyA selected) so simply looking at intron retention here might be not the most exciting thing to do, but it just makes me wonder:

How prone is DEXSeq to report an intron as a hit (differential usage) if the read counts are overall low ? Doesn't the FDR go very high with the decreasing sequencing depth in case of splicing/intron retention analysis ? Since DEXSeq is fitting the model based on dispersion which is also affected by read count (as far as I understand it), is it also taking the low coverage into account while reporting the p.adjusted values ?

Last edited by kajot; 11-20-2014 at 09:00 AM. Reason: added RNAseq prep detail
kajot is offline   Reply With Quote
Old 11-20-2014, 09:09 AM   #7
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

What sort of fold-changes are you seeing for those questionable cases?
dpryan is offline   Reply With Quote
Old 11-20-2014, 09:45 AM   #8
kajot
Member
 
Location: Germany

Join Date: Dec 2013
Posts: 16
Default

I am not sure if my quick coverage look-up for those introns makes sense, maybe this would be important - I took an average of counts from all conditions for any given intron and divided this value by the width of the intron (all based on values in DEXSeq output).
I think it gives a rough estimate about the abundance of reads for a given event ?

Below is a plot of those events. There is a cutoff for abs(log2FC) of 0.5. The y-axis is the value I just mentioned (rough estimate of coverage).

kajot is offline   Reply With Quote
Old 11-21-2014, 12:26 AM   #9
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

So you basically plotted an averaged RPKM variant versus fold-change. Statistical power ends up coming from the absolute number of counts in each sample and is independent of the length of an intron (or exon or gene). I realize that this may seem somewhat counter-intuitive, but that's the way the math works.

One things to look at is the distribution of reads in an intron. If they tend to be clumpy or mostly clustered near the exon:intron bounds, then I would guess that they're not accurately representing actual exon inclusion events. I don't know of a simple metric for that, though I think RNAseqQC has a coverage metrics of some sort that can indicate bias...maybe that'd be useful in this case.
dpryan is offline   Reply With Quote
Old 11-21-2014, 08:48 AM   #10
kajot
Member
 
Location: Germany

Join Date: Dec 2013
Posts: 16
Default

So the more "flat" coverage I get towards the center of the intron, the more reliable indication this is that I am dealing with a legitimate intron retention ? Is my understanding right ?

And related to this, are you aware of any clean way of turning the HTSeq counts into RPKM ? I could write a script myself, but why re-invent the wheel.
If there is nothing I could use then what would be the best value to normalize (the "M" part in RPKM) ? Million of aligned pairs ? Million of "counts" from HTSeq ?


Have a great weekend!

Piotr
kajot is offline   Reply With Quote
Old 11-21-2014, 09:52 AM   #11
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

I would say yes, the flatter the coverage the more likely it is to be real.

Regarding RPKMs, you just need the gene (or other feature) lengths. I'm pretty sure I've posted a script to take a GTF and output union gene model lengths before. I'll have to look around for it (if nothing else, I'll just post it to github). You can then divide the normalized counts by those values and then divide by a million and you'll have RPKM (or FPKM if you used paired-end reads).
dpryan is offline   Reply With Quote
Old 11-21-2014, 12:12 PM   #12
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

I posted an example R script here. This was originally written in the context of making an input file for CQN, so you can comment out all of the GC% related lines if that really matters to you.
dpryan is offline   Reply With Quote
Old 06-30-2015, 01:24 AM   #13
sheibani
Junior Member
 
Location: Vienna

Join Date: Jun 2014
Posts: 6
Default

---------------------------------------------------

Last edited by sheibani; 07-30-2015 at 03:02 AM.
sheibani is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:05 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO