SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
RNA-Seq: Differential expression analysis for paired RNA-seq data. Newsbot! Literature Watch 0 03-28-2013 02:00 AM
lncRNA differential expression using RNA-seq LP_SEP23 Bioinformatics 1 11-18-2012 02:00 PM
RNA-Seq: Differential expression--the next generation and beyond. Newsbot! Literature Watch 0 01-03-2012 02:40 AM
RNA-seq for differential expression Inti RNA Sequencing 0 01-17-2011 06:46 AM

Reply
 
Thread Tools
Old 05-13-2013, 09:02 AM   #1
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59
Default RNA-Seq, lower coverage shows more differential expression

Hi All,

I've recently being assessing some RNA-Seq data that I have from a pilot project. When we begin our actual study though we'll have much more samples and hence we want to use less coverage to save money where possible. Basically we're looking at differential expression between 2 conditions, however when I randomly extracted 1/3 of the reads for each file and mapped them (using the exact same pipeline as the whole files) and then looked at differential expression, I found about 4 times more differentially expressed genes than with the all of the data.

Any ideas why? I've been doing this using DESeq

I also performed some clustering and found that the samples from the pilot study tend to fall into 2 fairly distinct groups, but look at differential expression within those groups isn't viable because of the sample size (only 3 samples in each group) and so DESeq doesn't detect anything as being significantly differentially expressed.
bob-loblaw is offline   Reply With Quote
Old 05-20-2013, 01:26 AM   #2
Wolfgang Huber
Senior Member
 
Location: Heidelberg, Germany

Join Date: Aug 2009
Posts: 109
Default

Dear bob loblaw

thank you. The behaviour you report is not reasonable, and somewhere in your workflow or tool chain there must be a mistake. Can you report the sequence of steps (script) that you perform, to reproduce your observation?

Best wishes
Wolfgang
__________________
Wolfgang Huber
EMBL
Wolfgang Huber is offline   Reply With Quote
Old 05-20-2013, 06:32 AM   #3
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59
Default

Hi Wolfgang,

I've located at one problem in my workflow that resulted in only fraction of my reads mapping. I'm hoping that this problem can be attributed to this. I'm currently re-processing my samples now. Thanks though!
bob-loblaw is offline   Reply With Quote
Old 05-20-2013, 10:48 AM   #4
colaneri
Member
 
Location: Durham

Join Date: Jul 2012
Posts: 30
Default Normalizing libraries sequenced a different deep

I have a question somehow related to this discussion.
If I have libraries to compare that have been sequenced with different deep:
example:
Lib1; 12,000,000 reads
Lib2; 17,000,000 reads
Lib3; 9,000,000 reads

It is ok to randomly pick up 9,000,000 reads from the Lib1 and 2 raw data?
colaneri is offline   Reply With Quote
Old 05-21-2013, 02:05 AM   #5
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59
Default

Quote:
Originally Posted by colaneri View Post
I have a question somehow related to this discussion.
If I have libraries to compare that have been sequenced with different deep:
example:
Lib1; 12,000,000 reads
Lib2; 17,000,000 reads
Lib3; 9,000,000 reads

It is ok to randomly pick up 9,000,000 reads from the Lib1 and 2 raw data?
I think so, assuming that aside from varying coverages they are all prepared in exactly the same way.
bob-loblaw is offline   Reply With Quote
Old 05-21-2013, 02:20 AM   #6
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,168
Default

Quote:
Originally Posted by colaneri View Post
I have a question somehow related to this discussion.
If I have libraries to compare that have been sequenced with different deep:
example:
Lib1; 12,000,000 reads
Lib2; 17,000,000 reads
Lib3; 9,000,000 reads

It is ok to randomly pick up 9,000,000 reads from the Lib1 and 2 raw data?
But why do this? All well designed software for differential expression analysis will account for varying library depth.
kmcarr is offline   Reply With Quote
Old 05-21-2013, 02:34 AM   #7
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59
Default

Quote:
Originally Posted by kmcarr View Post
But why do this? All well designed software for differential expression analysis will account for varying library depth.
Thats a good point, although the poster didn't say how they were comparing the libraries.
bob-loblaw is offline   Reply With Quote
Old 05-21-2013, 01:23 PM   #8
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

i good rule of thumb when doing random subsetting of data is to do it more than once so you can observe that the results are stable and not jumping around due to the random sampling. So...sure go ahead and subset down to 9,000,000 reads but run a few iterations of the entire pipeline. if the results are stable you're good. if not then subsetting may not be appropriate, for whatever reason.
sdriscoll is offline   Reply With Quote
Old 05-29-2013, 05:44 AM   #9
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59
Default

Quote:
Originally Posted by Wolfgang Huber View Post
Dear bob loblaw

thank you. The behaviour you report is not reasonable, and somewhere in your workflow or tool chain there must be a mistake. Can you report the sequence of steps (script) that you perform, to reproduce your observation?

Best wishes
Wolfgang
Well my workflow is very simple. I'm using bowtie2 to map my reads to a reference database, and then I'm using samtools idxstats function to create a count table. Then I merge rows with duplicate IDs, then I put it into R and run DESeq on it. It's a very simple workflow.

I've also noticed that for the random subset of reads and the whole file the exact same proportion of reads is mapped (i.e. if 67.4% is mapped for the whole file, then in the subset file 67.4% is also mapped) which at least indicates that this difference in the number of differentially expressed genes between the whole file and subset isn't down to some non random effect in the subsetting process.

Any ideas?
Thanks
bob-loblaw is offline   Reply With Quote
Old 05-29-2013, 07:19 AM   #10
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

So, to rephrase, you're mapping to a transcriptome reference and not a genome, correct? When you do this its very important that the rows you merge be all features that share exonic sequence - as in all alternative isoforms of a gene and even multi copy genes that reside in separate loci. Otherwise you're going to run into some confusion for sure. Bowtie is not designed nor capable of making alignment decisions between features with shared sequence beyond total random selection. It does make the same decision each time you run it but that's by design...they do a random number seeding trick to ensure this happens. So even if you're using an Ensemble annotation for mouse or human simply merging counts based on any of the provided ids isn't enough to remove the ambiguity problem.
sdriscoll is offline   Reply With Quote
Old 05-29-2013, 07:22 AM   #11
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59
Default

Quote:
Originally Posted by sdriscoll View Post
So, to rephrase, you're mapping to a transcriptome reference and not a genome, correct? When you do this its very important that the rows you merge be all features that share exonic sequence - as in all alternative isoforms of a gene and even multi copy genes that reside in separate loci. Otherwise you're going to run into some confusion for sure. Bowtie is not designed nor capable of making alignment decisions between features with shared sequence beyond total random selection. It does make the same decision each time you run it but that's by design...they do a random number seeding trick to ensure this happens. So even if you're using an Ensemble annotation for mouse or human simply merging counts based on any of the provided ids isn't enough to remove the ambiguity problem.
I'm mapping against the DNA sequences of predicted proteins from sequenced genomes. The only rows that I merge are the ones which have identical IDs I hadn't even considered merging isoforms...

We originally did it to cut down on computational time more than anything (although it actually doesn't make that big of a difference in terms of the size of the count table)

Last edited by bob-loblaw; 05-29-2013 at 09:12 AM.
bob-loblaw is offline   Reply With Quote
Old 05-29-2013, 08:30 AM   #12
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

This is all sounds as if some mistake happens during counting. Your approach of using samtools idxstats is rathe runorthodox, and I wonder if it is correct. It might be much safer to use some well-tested tool to obtain a count table instead of using some home-brewn solution.
Simon Anders is offline   Reply With Quote
Old 05-29-2013, 08:34 AM   #13
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59
Default

Quote:
Originally Posted by Simon Anders View Post
This is all sounds as if some mistake happens during counting. Your approach of using samtools idxstats is rathe runorthodox, and I wonder if it is correct. It might be much safer to use some well-tested tool to obtain a count table instead of using some home-brewn solution.
From what I can see idxstats works perfectly, it needs to a bit of tweeking in R so that DESeq can read it, but nothing major.

I am open to suggestions though, can you give some examples of those well tested tools? Thanks

Last edited by bob-loblaw; 05-29-2013 at 08:36 AM.
bob-loblaw is offline   Reply With Quote
Old 05-29-2013, 08:37 AM   #14
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Actually, I don't get it. How do you get per-gene count with idxstat? I thought it only tells you the number of reads mapped to each reference sequence, i.e., to each chromosome. How do you get individual genes?
Simon Anders is offline   Reply With Quote
Old 05-29-2013, 08:45 AM   #15
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59
Default

Quote:
Originally Posted by Simon Anders View Post
Actually, I don't get it. How do you get per-gene count with idxstat? I thought it only tells you the number of reads mapped to each reference sequence, i.e., to each chromosome. How do you get individual genes?

Maybe that depends on the reference that you've mapped to?

For me it throws out a table which has every gene ID from the genome I'm mapping to, along with it's length, and how many reads are mapped to that reference. For some of them it just gives me and organism and coordinates for the genome, but I can get more information about that from a GFF file I have of the annotations.
bob-loblaw is offline   Reply With Quote
Old 05-29-2013, 08:52 AM   #16
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

This is what sdriscoll meant, when he said that you are mapping to a transcriptome. You have provided your aligner with a FASTA file which did not contain one sequence for each chromosome but one sequence for each transcript. (Otherwise, how would samtools know where the transcripts are, as you haven't supplied a GFF file.) This is known as "mapping against the transcriptome" and it is "bad" if you don't know exactly what you are doing, for various reasons that you'll find in old threads here.
Simon Anders is offline   Reply With Quote
Old 05-29-2013, 09:07 AM   #17
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59
Default

Quote:
Originally Posted by Simon Anders View Post
This is what sdriscoll meant, when he said that you are mapping to a transcriptome. You have provided your aligner with a FASTA file which did not contain one sequence for each chromosome but one sequence for each transcript. (Otherwise, how would samtools know where the transcripts are, as you haven't supplied a GFF file.) This is known as "mapping against the transcriptome" and it is "bad" if you don't know exactly what you are doing, for various reasons that you'll find in old threads here.
Ah okay. I misunderstood, I thought sdriscoll was asking if I was mapping against a reference assembled from transcriptomics data, as opposed to the DNA sequences of predicted proteins from sequenced genomes (Which is what I'm using). Sorry for being a "student" and for for making a "mistake". I'll correct that post
bob-loblaw is offline   Reply With Quote
Old 05-29-2013, 09:15 AM   #18
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Exactly as Simon said. If you were mapping to a genome reference then idxstats would return read counts per chromosome. It's absolutely more complicated to map to a transcriptome reference. A couple tools for that are eXpress and RSEM but neither of those will help you get counts at the gene level without you providing some knowledge of which references are from the same gene.

Probably the most straightforward approach is to align your reads to a genome reference (full chromosome sequences) with Tophat or STAR, if you have the RAM for it, then to count hits to genes with something like htseq-count which can find overlaps of genomic coordinates with gene features annotated in a GTF file.
sdriscoll is offline   Reply With Quote
Old 05-29-2013, 09:16 AM   #19
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

If you really want to align to this database you're using I suggest trying RSEM.
sdriscoll is offline   Reply With Quote
Old 05-29-2013, 09:17 AM   #20
bob-loblaw
Member
 
Location: /home/bob

Join Date: Jun 2012
Posts: 59
Default

Quote:
Originally Posted by sdriscoll View Post
If you really want to align to this database you're using I suggest trying RSEM.
I'll look into it. Thank you.
bob-loblaw is offline   Reply With Quote
Reply

Tags
coverage, deseq, rna-seq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 02:43 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO