SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   Low number of replicates DESeq (http://seqanswers.com/forums/showthread.php?t=41155)

uqfgaiti 02-22-2014 08:36 PM

Low number of replicates DESeq
 
Hi all,

I am using DESeq for DGE analysis.

I have STRANDED RNA-Seq data for 4 developmental stages with no replicates.
To have a more reliable DGE I should have replicates and so I obtained (from another lab member) UNSTRANDED RNA-Seq data with 3 replicates per stage.

Before doing a DGE, I thought to test the correlation between these samples, just to show that similar samples “cluster” together. If so, I can then use the unstranded data for my DGE analysis to have more replicates per each stage.

I mapped the raw reads to the genome using TOPHAT, sorted the bam files by name and used htseq-count to get the raw reads counts for both the data. For the stranded data I used the option -s yes and for the unstranded data I used -s no.

I used DESeq to include metadata and for normalization, and I removed the genes that always have a 0 value. I then calcualted the correlation which was really low.

I then tried to use htseq-count with the option -s reverse for the stranded data and still got really low correlation.

So I reran htseq-count on the stranded data selecting the option -s no and in this way I got a very similar number of total counts between the unstranded and stranded data (while both cases before the stranded ones were double in number). I then included metadata, estimated the new size factors, normalized and calculated the new correlation. Both Pearson and Spearman performed pretty well, confirmed by both a PCA and correlogram.

Though, I'd still like to figure out a way to use the stranded counts. I am not sure if I lose some information running htseq-count using -s no on the stranded data.

What I had in mind was using unstranded data to estimate the level of variation to get a threshold for DE detection but still use the stranded data as expression values. Not sure if I can do that though given one is stranded and the other is not.

I would like to hear from you if you have any thoughts about this.

Let me know if you need more information to better understand the issue.

Thanks a lot
Federico

blanco 02-25-2014 03:35 AM

Hi Federico,
have you confirmed that the data you have is indeed strand specific?
You can use the infer_experiment.py script from the RSeQC package to check it out - https://code.google.com/p/rseqc/

Depending on the library preparation protocol you either use the '-s reverse' or '-s yes' option. Do you know the specifics of the library prep? In any case, you will find out if you manage to run 'infer_experiment.py'.

/blanco

uqfgaiti 02-25-2014 03:46 AM

Hi Blanco,

to be sure I did a quick test using infer_experiment.py,it is indeed forward-reverse. I also did a blat to the genome and I got the same result.

Quote:

This is PairEnd Data
Fraction of reads explained by "1++,1--,2+-,2-+": 0.9189
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0811
Fraction of reads explained by other combinations: 0.0000

1++,1–,2+-,2-+

read1 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
read1 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
read2 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
read2 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
Based on this I ran TOPHAT with fr-secondstrand and htseq-count with -s yes

Simon Anders 02-25-2014 10:39 PM

Have you had a look at the alignments yet? Pick a two or three genes for which the counts between stranded and unstranded libraries differ strongly and two or three genes for which they agree reasonably well, then look at the eads at these loci with a genome browser (e.g., IGV). Do you see antisense transcripts which should not be there? Do you see many reads with low alignment quality? Any other peculiarity?

uqfgaiti 02-25-2014 11:24 PM

Hi Simon,

Good idea!
I am sorting my bam files at the moment. I will write back once I had a look at the alignment

In the meantime thanks for the suggestion

uqfgaiti 02-26-2014 04:30 PM

Hi Simon,

I have had a look at the alignments.
When the counts between stranded and unstraded librarires agree reasonably well, it looks alright. I can see that when they differ strongly it is because the locus is particulary complex, with a lot of antisense transcription and overlapping genes, which are pickep up only by the stranded libraries.

I've also noticed that in a lot of cases, genes that are well supported by alignment reads, have a count of 0. This happens more often for unstranded libraries, I guess particularly when there is antisense transcription. But it also happens for stranded libraries. Highly expressed genes with good alignment support have a count 0.

I wonder if running htseq-count with
Quote:

-m union
could have influenced this.
For each developmental stage, around 15-20 million counts are ambigous.

Would -m intersection_nonempty more appropriate in your opinion?

Thanks for help
Federico

Simon Anders 02-27-2014 12:38 AM

Does your GTF file contain "exon" lines for all the anti-sense transcripts? If so, this would explain the discrepancies.

htseq-count considers a read as "ambiguous" if it maps to more than one feature (gene). So, if a read maps to a position which is covered by a regular gene on one strand and by an overlapping antisense transcript on the other strand, then this read will be counted as ambiguous if you have set "stranded" to "no", because there is no information to decide whether the read originated from the sense of from the antisense transcript. For "stranded=yes", however, the read will be counted for the feature that is on the same strand as the read.

uqfgaiti 02-27-2014 01:02 AM

yes my gtf file contains all exon lines for antisense transcripts as well.
this explains the discrepancy since I used -s yes for the stranded libraries.
But why sometimes the counts are zero when the read maps to a position which is covered by a regular gene on one strand and a overlapping antisense gene on the opposite strand?

selecting -stranded yes should account for this,shouldn't it?

thanks

Simon Anders 02-27-2014 01:11 AM

Write down the names of a few such reads, then use the '-o' option to get an output SAM file, grep in it for the read names and find out this way what htseq-count has done with these reads.

uqfgaiti 02-27-2014 02:16 AM

Ok, thanks again for help!

Let you know how it goes

uqfgaiti 02-27-2014 03:19 AM

Hi Simon,

I might have figured this out.

I think the issue is that I have particular loci in the genome that have overlapping genes on the same strand. Though, these genes have a different ID so that's way I get a count of 0 .

If a read maps to an exon shared by several genes, this will appear to htseq-count as ambigous.

So at the moment I am having zero counts for all these particular loci, which sometimes contain also really highly expressed genes.

Do you have any idea on how I could account for this?

Let me know if I'm not clear

Thanks
Federico

ret4andry 02-27-2014 03:55 AM

I am now indexing my reference with your suggestions above. Let me know if I'm not clear

http://watchfree.me/17/w.png

Simon Anders 02-27-2014 09:16 AM

Crossposting a question on two forums is never a good idea. At least the link to the other thread should be provided: https://stat.ethz.ch/pipermail/bioco...ry/057986.html

uqfgaiti 02-27-2014 01:26 PM

Hi Simon,

sorry for that. I'm still a beginner in this field and i'm still learning how all this works.
The thread on bioconductor list is more focused on DESeq2 analysis than htseq-count but I was indeed going to provide the link to this thread since the discussion was shifting towards htseq-count again.

Did you have a chance to read my previous post?

I think the issue I'm having (other than having the correct design for DESeq2) is that there are particular loci in the genome that have overlapping genes on the same strand. Though, these genes have a different ID so that's way I get a count of 0 .

As explaind in htseq manual If a read maps to an exon shared by several genes, this will appear to htseq-count as ambigous.

So at the moment I am having zero counts for all these particular loci, which sometimes contain also really highly expressed genes that I'd like to account for in the subsequent DGE analysis.

Do you have any idea on how I could account for this?

Thanks

Simon Anders 02-27-2014 10:00 PM

Overlapping genes on the same strand cannot be distinguished in principle -- at least not by reads that map to the overlapping part. So, if the overlap is small, these will only be a few read, and you could ignore the issue. If, however, the overlap is big, you should consider the two genes as different isoforms of the same gene, and hence give them the same name.

So, I would somehow edit the GTF file such that whenever there are two genes A and B with large overlap, you replace all occurances of the gene IDs "A" and "B" with something like "A+B".

uqfgaiti 02-27-2014 10:02 PM

Hi Simon,

that's the way I was thinking too.

It's nice to hear this from you. I'm not too off

Thanks again for help
Federico


All times are GMT -8. The time now is 09:12 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.