SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
DESeq without replicates austinpa Bioinformatics 43 07-15-2014 07:38 PM
Help Using DESeq without replicates aprilw Bioinformatics 6 12-19-2013 08:07 AM
DESeq without replicates gfmgfm Bioinformatics 2 07-09-2013 03:11 AM
Biological replicates with very different number of reads mblue RNA Sequencing 3 02-13-2012 08:08 AM
DESeq: question about with replicates and without any replicates. nb509 RNA Sequencing 2 10-25-2011 07:04 AM

Reply
 
Thread Tools
Old 02-22-2014, 09:36 PM   #1
uqfgaiti
Member
 
Location: Australia, Brisbane

Join Date: Nov 2012
Posts: 13
Default Low number of replicates DESeq

Hi all,

I am using DESeq for DGE analysis.

I have STRANDED RNA-Seq data for 4 developmental stages with no replicates.
To have a more reliable DGE I should have replicates and so I obtained (from another lab member) UNSTRANDED RNA-Seq data with 3 replicates per stage.

Before doing a DGE, I thought to test the correlation between these samples, just to show that similar samples “cluster” together. If so, I can then use the unstranded data for my DGE analysis to have more replicates per each stage.

I mapped the raw reads to the genome using TOPHAT, sorted the bam files by name and used htseq-count to get the raw reads counts for both the data. For the stranded data I used the option -s yes and for the unstranded data I used -s no.

I used DESeq to include metadata and for normalization, and I removed the genes that always have a 0 value. I then calcualted the correlation which was really low.

I then tried to use htseq-count with the option -s reverse for the stranded data and still got really low correlation.

So I reran htseq-count on the stranded data selecting the option -s no and in this way I got a very similar number of total counts between the unstranded and stranded data (while both cases before the stranded ones were double in number). I then included metadata, estimated the new size factors, normalized and calculated the new correlation. Both Pearson and Spearman performed pretty well, confirmed by both a PCA and correlogram.

Though, I'd still like to figure out a way to use the stranded counts. I am not sure if I lose some information running htseq-count using -s no on the stranded data.

What I had in mind was using unstranded data to estimate the level of variation to get a threshold for DE detection but still use the stranded data as expression values. Not sure if I can do that though given one is stranded and the other is not.

I would like to hear from you if you have any thoughts about this.

Let me know if you need more information to better understand the issue.

Thanks a lot
Federico
uqfgaiti is offline   Reply With Quote
Old 02-25-2014, 04:35 AM   #2
blanco
Member
 
Location: Iceland

Join Date: Apr 2012
Posts: 28
Default

Hi Federico,
have you confirmed that the data you have is indeed strand specific?
You can use the infer_experiment.py script from the RSeQC package to check it out - https://code.google.com/p/rseqc/

Depending on the library preparation protocol you either use the '-s reverse' or '-s yes' option. Do you know the specifics of the library prep? In any case, you will find out if you manage to run 'infer_experiment.py'.

/blanco
blanco is offline   Reply With Quote
Old 02-25-2014, 04:46 AM   #3
uqfgaiti
Member
 
Location: Australia, Brisbane

Join Date: Nov 2012
Posts: 13
Default

Hi Blanco,

to be sure I did a quick test using infer_experiment.py,it is indeed forward-reverse. I also did a blat to the genome and I got the same result.

Quote:
This is PairEnd Data
Fraction of reads explained by "1++,1--,2+-,2-+": 0.9189
Fraction of reads explained by "1+-,1-+,2++,2--": 0.0811
Fraction of reads explained by other combinations: 0.0000

1++,1–,2+-,2-+

read1 mapped to ‘+’ strand indicates parental gene on ‘+’ strand
read1 mapped to ‘-‘ strand indicates parental gene on ‘-‘ strand
read2 mapped to ‘+’ strand indicates parental gene on ‘-‘ strand
read2 mapped to ‘-‘ strand indicates parental gene on ‘+’ strand
Based on this I ran TOPHAT with fr-secondstrand and htseq-count with -s yes
uqfgaiti is offline   Reply With Quote
Old 02-25-2014, 11:39 PM   #4
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 991
Default

Have you had a look at the alignments yet? Pick a two or three genes for which the counts between stranded and unstranded libraries differ strongly and two or three genes for which they agree reasonably well, then look at the eads at these loci with a genome browser (e.g., IGV). Do you see antisense transcripts which should not be there? Do you see many reads with low alignment quality? Any other peculiarity?
Simon Anders is offline   Reply With Quote
Old 02-26-2014, 12:24 AM   #5
uqfgaiti
Member
 
Location: Australia, Brisbane

Join Date: Nov 2012
Posts: 13
Default

Hi Simon,

Good idea!
I am sorting my bam files at the moment. I will write back once I had a look at the alignment

In the meantime thanks for the suggestion
uqfgaiti is offline   Reply With Quote
Old 02-26-2014, 05:30 PM   #6
uqfgaiti
Member
 
Location: Australia, Brisbane

Join Date: Nov 2012
Posts: 13
Default

Hi Simon,

I have had a look at the alignments.
When the counts between stranded and unstraded librarires agree reasonably well, it looks alright. I can see that when they differ strongly it is because the locus is particulary complex, with a lot of antisense transcription and overlapping genes, which are pickep up only by the stranded libraries.

I've also noticed that in a lot of cases, genes that are well supported by alignment reads, have a count of 0. This happens more often for unstranded libraries, I guess particularly when there is antisense transcription. But it also happens for stranded libraries. Highly expressed genes with good alignment support have a count 0.

I wonder if running htseq-count with
Quote:
-m union
could have influenced this.
For each developmental stage, around 15-20 million counts are ambigous.

Would -m intersection_nonempty more appropriate in your opinion?

Thanks for help
Federico
uqfgaiti is offline   Reply With Quote
Old 02-27-2014, 01:38 AM   #7
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 991
Default

Does your GTF file contain "exon" lines for all the anti-sense transcripts? If so, this would explain the discrepancies.

htseq-count considers a read as "ambiguous" if it maps to more than one feature (gene). So, if a read maps to a position which is covered by a regular gene on one strand and by an overlapping antisense transcript on the other strand, then this read will be counted as ambiguous if you have set "stranded" to "no", because there is no information to decide whether the read originated from the sense of from the antisense transcript. For "stranded=yes", however, the read will be counted for the feature that is on the same strand as the read.
Simon Anders is offline   Reply With Quote
Old 02-27-2014, 02:02 AM   #8
uqfgaiti
Member
 
Location: Australia, Brisbane

Join Date: Nov 2012
Posts: 13
Default

yes my gtf file contains all exon lines for antisense transcripts as well.
this explains the discrepancy since I used -s yes for the stranded libraries.
But why sometimes the counts are zero when the read maps to a position which is covered by a regular gene on one strand and a overlapping antisense gene on the opposite strand?

selecting -stranded yes should account for this,shouldn't it?

thanks
uqfgaiti is offline   Reply With Quote
Old 02-27-2014, 02:11 AM   #9
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 991
Default

Write down the names of a few such reads, then use the '-o' option to get an output SAM file, grep in it for the read names and find out this way what htseq-count has done with these reads.
Simon Anders is offline   Reply With Quote
Old 02-27-2014, 03:16 AM   #10
uqfgaiti
Member
 
Location: Australia, Brisbane

Join Date: Nov 2012
Posts: 13
Default

Ok, thanks again for help!

Let you know how it goes
uqfgaiti is offline   Reply With Quote
Old 02-27-2014, 04:19 AM   #11
uqfgaiti
Member
 
Location: Australia, Brisbane

Join Date: Nov 2012
Posts: 13
Default

Hi Simon,

I might have figured this out.

I think the issue is that I have particular loci in the genome that have overlapping genes on the same strand. Though, these genes have a different ID so that's way I get a count of 0 .

If a read maps to an exon shared by several genes, this will appear to htseq-count as ambigous.

So at the moment I am having zero counts for all these particular loci, which sometimes contain also really highly expressed genes.

Do you have any idea on how I could account for this?

Let me know if I'm not clear

Thanks
Federico
uqfgaiti is offline   Reply With Quote
Old 02-27-2014, 04:55 AM   #12
ret4andry
Junior Member
 
Location: N virginia

Join Date: Feb 2014
Posts: 3
Default

I am now indexing my reference with your suggestions above. Let me know if I'm not clear

ret4andry is offline   Reply With Quote
Old 02-27-2014, 10:16 AM   #13
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 991
Default

Crossposting a question on two forums is never a good idea. At least the link to the other thread should be provided: https://stat.ethz.ch/pipermail/bioco...ry/057986.html
Simon Anders is offline   Reply With Quote
Old 02-27-2014, 02:26 PM   #14
uqfgaiti
Member
 
Location: Australia, Brisbane

Join Date: Nov 2012
Posts: 13
Default

Hi Simon,

sorry for that. I'm still a beginner in this field and i'm still learning how all this works.
The thread on bioconductor list is more focused on DESeq2 analysis than htseq-count but I was indeed going to provide the link to this thread since the discussion was shifting towards htseq-count again.

Did you have a chance to read my previous post?

I think the issue I'm having (other than having the correct design for DESeq2) is that there are particular loci in the genome that have overlapping genes on the same strand. Though, these genes have a different ID so that's way I get a count of 0 .

As explaind in htseq manual If a read maps to an exon shared by several genes, this will appear to htseq-count as ambigous.

So at the moment I am having zero counts for all these particular loci, which sometimes contain also really highly expressed genes that I'd like to account for in the subsequent DGE analysis.

Do you have any idea on how I could account for this?

Thanks
uqfgaiti is offline   Reply With Quote
Old 02-27-2014, 11:00 PM   #15
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 991
Default

Overlapping genes on the same strand cannot be distinguished in principle -- at least not by reads that map to the overlapping part. So, if the overlap is small, these will only be a few read, and you could ignore the issue. If, however, the overlap is big, you should consider the two genes as different isoforms of the same gene, and hence give them the same name.

So, I would somehow edit the GTF file such that whenever there are two genes A and B with large overlap, you replace all occurances of the gene IDs "A" and "B" with something like "A+B".
Simon Anders is offline   Reply With Quote
Old 02-27-2014, 11:02 PM   #16
uqfgaiti
Member
 
Location: Australia, Brisbane

Join Date: Nov 2012
Posts: 13
Default

Hi Simon,

that's the way I was thinking too.

It's nice to hear this from you. I'm not too off

Thanks again for help
Federico
uqfgaiti is offline   Reply With Quote
Reply

Tags
deseq, htseq-count, replicates

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:32 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO