SEQanswers

Old 02-24-2014, 03:19 AM   #1
dalesan
Member
 
Location: portugal

Join Date: Feb 2011
Posts: 15
Question Differential Expression: is it better to map reads to genome or transcriptome?

Hello, hello.

This may be a very naive question, but I haven't been able to find an established answer yet in the literature, nor here (bad search technique, perhaps).

I'm about to embark upon a differential expression analysis (Arabidopsis), but before doing so, I wanted to know whether any of you can comment on the potential benefits/drawbacks of using the transcriptome as a reference rather than the genome.

Presumably, by using the transcriptome, one circumvents having to deal with junction libraries and such. On the other hand, when using a transcriptome annotation, one will have to limit the mapping to the most representative transcript isoform so as to avoid multireads.

In your experience, if we choose to disregard splicing for the moment and focus only on DE, would you map your reads to the genome or the transcriptome, and why?

I'm thinking I might just do both and compare the results. But it would be nice to hear your thoughts.

Thanks!!
Old 02-24-2014, 04:49 AM   #2
TiborNagy
Senior Member
 
Location: Budapest

Join Date: Mar 2010
Posts: 329
Default

I prefer genome mapping, because I can check for annotation errors. However, some species do not have an assembled genome, only contigs; in that case, maybe the transcriptome is a better choice. On the other hand, the transcriptome is sometimes much smaller than the genome, so if you do not have enough computing power, you can choose the transcriptome.
Old 02-24-2014, 05:17 AM   #3
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250
Default

I think there is more to it than having or not having compute power. I've found mappings to transcriptomes to be much cleaner; the biggest culprit in genome mappings is pseudogenes. Furthermore, I'm not sure that there are statistically sound ways to address finding new transcripts at the same time as finding differential expression; they seem like two fundamentally different questions. For example, if one population had a different transcript than another, would there be any way to quantify that?
Old 02-24-2014, 02:23 PM   #4
Bukowski
Senior Member
 
Location: UK

Join Date: Jan 2010
Posts: 390
Default

My opinion is that the transcriptome is currently not well characterised enough to serve as a suitable reference for RNA-Seq. The genomes of the model organisms may have problems of incompleteness, but at least provide a scaffold to hang your RNA-Seq off and allow the discovery phase.

I agree that if all you're doing is DE of genes with a few million reads, you might as well just map to the transcriptome. But in my experience that's rarely what people want from an RNA-Seq experiment - because that's what arrays are for.
Old 02-24-2014, 03:40 PM   #5
dalesan
Member
 
Location: portugal

Join Date: Feb 2011
Posts: 15
Default

Thanks guys for the replies. I appreciate your input. At this point, I think I'm going to run a test comparison between using the genome vs transcriptome to see how congruent the results are when it comes to simple DE testing.

As Bukowski mentioned, mapping to the genome offers you much more information, including discovery of novel transcripts and isoforms. I do have another phase of my project that will consider alternative splicing and I'll definitely be mapping to the genome for this.
Old 02-24-2014, 04:51 PM   #6
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250
Default

Quote:
Originally Posted by Bukowski View Post
- because that's what arrays are for.
Right that's what arrays are for, but in a limited and expensive manner.
Old 03-14-2014, 09:27 AM   #7
sazz
Member
 
Location: Istanbul, Turkey

Join Date: Oct 2012
Posts: 28
Default

Quote:
Originally Posted by dalesan View Post
Thanks guys for the replies. I appreciate your input. At this point, I think I'm going to run a test comparison between using the genome vs transcriptome to see how congruent the results are when it comes to simple DE testing.

As Bukowski mentioned, mapping to the genome offers you much more information, including discovery of novel transcripts and isoforms. I do have another phase of my project that will consider alternative splicing and I'll definitely be mapping to the genome for this.
Dalesan,

I would appreciate if you can share your results, because I also wonder how much it differs when mapped on genome or transcriptome.
Old 03-15-2014, 01:37 AM   #8
dalesan
Member
 
Location: portugal

Join Date: Feb 2011
Posts: 15
Default

Quote:
Originally Posted by sazz View Post
Dalesan,

I would appreciate if you can share your results, because I also wonder how much it differs when mapped on genome or transcriptome.
Sure thing, sazz. Maybe by the end of next week I'll have something to share.
Old 03-15-2014, 11:29 AM   #9
sazz
Member
 
Location: Istanbul, Turkey

Join Date: Oct 2012
Posts: 28
Default

Quote:
Originally Posted by dalesan View Post
Sure thing, sazz. Maybe by the end of next week I'll have something to share.
Well, I have already made a comparison between genome and transcriptome mapping, with all the other parameters exactly the same.

First of all, in my experiment I have control and target shRNA-transduced cell lines (human), and for my RNA-seq I prepared 3 replicates of each. The total read number for all samples is around 110M (single-end, 50 bp).

I ran TopHat with the -g 1 option to get uniquely mapped reads (~70% of reads hit in the transcriptome mapping).

When I compared the CuffDiff output between those two approaches, there were 1983 significantly differentially expressed genes (q < 0.01) in the intersection, 107 only in the whole-genome mapping, and 94 only in the transcriptome mapping.

So for my data, if there is a difference, it seems like a small one, and I don't think it will make a change in the downstream analysis (I haven't tried yet).
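For clarity, here is the arithmetic behind those counts as a quick Python sketch; the totals and percentage are derived from the three counts above, not reported separately:

```python
# DEG overlap reported above (q < 0.01): 1983 genes found by both mappings,
# 107 only by whole-genome mapping, 94 only by transcriptome mapping.
shared, genome_only, txome_only = 1983, 107, 94

genome_total = shared + genome_only   # all DEGs from the genome mapping
txome_total = shared + txome_only     # all DEGs from the transcriptome mapping
union = shared + genome_only + txome_only

# Fraction of all DEGs (the union) recovered by both approaches
agreement = 100 * shared / union
print(genome_total, txome_total, union, round(agreement, 1))
# → 2090 2077 2184 90.8
```

So roughly 91% of all DEGs found by either approach were found by both, which supports the "small difference" conclusion.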
Old 03-15-2014, 09:20 PM   #10
Zapages
Member
 
Location: NJ

Join Date: Oct 2012
Posts: 94
Default

I used to map to transcripts, but I was told specifically never to use them again, as you will get more gene-isoform information by mapping to the genome.

Please test this out by taking the BAM files output by TopHat 2 and viewing them in IGV against the whole genome or the transcriptome.
Old 03-16-2014, 06:30 AM   #11
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250
Default

Quote:
Originally Posted by Zapages View Post
I used to map to transcripts, but I was told specifically never to use them again, as you will get more gene-isoform information by mapping to the genome.

Please test this out by taking the BAM files output by TopHat 2 and viewing them in IGV against the whole genome or the transcriptome.
Well, never do genome mapping, because you'll spend more time studying pseudogenes. Now what are you going to do?

Anyway, it doesn't make sense to me that you would get isoforms via genome mapping that you wouldn't get via transcriptome mapping. Furthermore, why would you be looking for different isoforms when you are quantifying relative expression? Is this one of those things where you are just answering the question you want to answer?
Old 03-16-2014, 11:07 AM   #12
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by rskr View Post
Well, never do genome mapping because you'll spend more time studying pseudo genes. Now, what are you going to do?
Those are not too hard to identify, as they lack introns and typically have a lot of SNPs relative to their parent genes. Anyway, pseudogenes also interfere with DNA mapping (in human, for example, many are not in HG19); should DNA mapping be done to the transcriptome as well, to avoid interference?

Quote:
Anyway it doesn't make sense to me that you would get isoforms via genome mapping that you wouldn't get via transcriptome mapping
Often the genome is fairly good, but transcriptomes of complex organisms are probably all incomplete. You can't expect a complete transcriptome from organisms with many life stages or tissue types when some isoforms and genes may only be expressed at certain times.

Quote:
furthermore why would you be looking for different isoforms when you are quantifying relative expression? Is this one of the things where you are just answering the question you want to answer?
Some isoforms are tissue- or condition-specific, and if a gene changes from 99% isoform A to 99% isoform B, that could be very important. Assuming that all the isoforms of a gene are functionally identical would mean there is no reason for alternative splicing to even exist.

Mapping to a transcriptome, you'll be somewhat limited to answering questions that have already been answered, or at least asked. It's like searching for minerals only using a map of known mineral deposits; you'll never discover anything truly novel.

Also, mapping to a genome is more objective and repeatable. Mapping to a transcriptome is very subjective, as there are a huge number of ways to design one. Add a single gene, or a single transcript, and the mappings of all reads may be affected. So, how do you choose which transcripts and isoforms to include? All of them? Just the longest for each gene? Just a full concatenation of all exons per gene? Just the ones that were known prior to date XYZ, or also the two new ones your lab found that you think are relevant? You'll get different results based on this purely subjective decision, possibly allowing results to be tweaked as desired.

Last edited by Brian Bushnell; 03-16-2014 at 11:11 AM.
Old 03-16-2014, 12:19 PM   #13
dalesan
Member
 
Location: portugal

Join Date: Feb 2011
Posts: 15
Default

Quote:
Originally Posted by Brian Bushnell View Post
Also, mapping to a genome is more objective and repeatable. Mapping to a transcriptome is very subjective, as there are a huge number of ways to design one. Add a single gene, or a single transcript, and the mappings of all reads may be affected. So, how do you choose which transcripts and isoforms to include? All of them? Just the longest for each gene? Just a full concatenation of all exons per gene? Just the ones that were known prior to date XYZ, or also the two new ones your lab found that you think are relevant? You'll get different results based on this purely subjective decision, possibly allowing results to be tweaked as desired.
Excellent points, Brian. I hadn't thought of it this way, in terms of the repeatability aspect. In my analysis I've limited the mapping to simply the longest isoform in the annotation. Nevertheless, I'm curious to see how the results compare when I get back to my desk tomorrow.
Old 03-16-2014, 05:26 PM   #14
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 836
Default

I would recommend mapping to the genome, but using the transcriptome as a mapping template to pick up splice boundaries, etc. In other words, something like what TopHat does. Mapping to the genome makes novel isoforms a bit easier to pick up, and mapping to the transcriptome will give you more descriptive output (e.g. proper gene names) with a bit less work. I would expect that thaliana should have a fairly well-annotated transcriptome, so you'd be losing a lot by ignoring the annotated genetic features.
Old 03-16-2014, 07:05 PM   #15
rskr
Senior Member
 
Location: Santa Fe, NM

Join Date: Oct 2010
Posts: 250
Default

Quote:
Originally Posted by gringer View Post
I would recommend mapping to the genome, but using the transcriptome as a mapping template to pick up splice boundaries, etc.. In other words, something like what Tophat does. Mapping to the genome makes novel isoforms a bit easier to pick up, and mapping to the transcriptome will give you more descriptive output (e.g. proper gene names) with a bit less work. I would expect that thaliana should have a fairly well-annotated transcriptome, so you'll be losing a lot by ignoring annotated genetic features.
IMO it is obvious that TopHat went to transcriptome mapping because they were unable to solve the pseudogene problem. What remains to be seen is whether using the genome actually brings anything to the table besides huge hardware requirements and short leading/trailing non-coding isoforms, and whether whatever it does bring could be done later, with the reads that don't map to a transcript, in an analysis separate from differential expression, like an isoform search.

Furthermore, I think most poorly characterized organisms get their transcriptomes done first, since they are easier and provide a majority of the useful information, which sort of renders the argument about uncharacterized organisms moot.
Old 03-19-2014, 03:56 AM   #16
dalesan
Member
 
Location: portugal

Join Date: Feb 2011
Posts: 15
Default Results of my comparison of mapping to transcriptome vs genome and DEG

So, I've finally gotten around to comparing the results of my differential gene expression analysis based on mapping to the transcriptome and genome of Arabidopsis.

I used bowtie v1.0.0 to map to a filtered transcriptome containing only the longest gene-model isoform of each gene. I have about 30 million paired-end reads for each of my 4 samples (2 control, 2 treated), and roughly 45-50% of these reads mapped to the transcriptome. For the genome alignment, I used TopHat v2.0.10 and observed that 75-80% of the reads aligned.
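In case it's useful to others, the longest-isoform filtering step can be sketched roughly like this in Python. This is a hypothetical illustration, not the exact script I used; it assumes AGI-style IDs such as AT1G01010.1, where the suffix after the dot numbers the isoform:

```python
# Keep only the longest transcript isoform per gene from a transcriptome FASTA.
# Assumes Arabidopsis AGI naming: the part of the ID before the dot is the locus.

def longest_isoform_per_gene(fasta_text):
    """Return {gene_id: (transcript_id, sequence)} keeping the longest isoform."""
    entries = []
    tid, seq = None, []
    for line in fasta_text.strip().splitlines():
        if line.startswith(">"):
            if tid is not None:
                entries.append((tid, "".join(seq)))
            tid, seq = line[1:].split()[0], []   # first token of the header
        else:
            seq.append(line.strip())
    if tid is not None:
        entries.append((tid, "".join(seq)))

    best = {}
    for tid, sequence in entries:
        gene = tid.split(".")[0]                 # AT1G01010.2 -> AT1G01010
        if gene not in best or len(sequence) > len(best[gene][1]):
            best[gene] = (tid, sequence)
    return best
```

Usage would be something like `longest_isoform_per_gene(open("transcripts.fa").read())` (filename hypothetical), writing the surviving records back out as the filtered reference.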

After summarizing counts, I used DESeq2 v1.2.10 for the DEG analysis.

Similar to sazz, I didn't observe a huge difference, but one certainly exists.

What I plan on doing is to work with the total number of DEGs found across both analyses combined (1764), rather than just the intersection (1278), or the 1536 and 1506 found with the transcriptome and genome mappings, respectively.

Can you think of any reason to object to this line of reasoning?
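For reference, the combined total checks out by inclusion-exclusion; a quick Python sketch using the counts above (the percentage gains are my own derivation from those counts):

```python
# DEG counts reported above
txome, genome, intersection = 1536, 1506, 1278

union = txome + genome - intersection   # inclusion-exclusion
assert union == 1764                    # matches the combined total in the post

# Gain from taking the union rather than a single analysis
gain_vs_txome = 100 * (union / txome - 1)
gain_vs_genome = 100 * (union / genome - 1)
print(round(gain_vs_txome, 1), round(gain_vs_genome, 1))
# → 14.8 17.1
```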
Old 03-19-2014, 04:16 AM   #17
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 836
Default

Quote:
Originally Posted by dalesan View Post
I used bowtie v 1.0.0 to map to a filtered transcriptome containing only the longest gene model isoform of each gene.... For the genome alignment, I used tophat v.2.0.10 and observed that 75-80% of the reads aligned.
Is there any particular reason why you used bowtie and not bowtie2? Were you specifically telling tophat to use bowtie, rather than bowtie2 (the default)?

I ask because if the bowtie versions are different, you'll be comparing a bit more than just genome vs transcriptome.

Additional question, did you use the transcriptome GTF file when mapping using tophat? I assume not, because that is likely to result in all transcriptome reads being picked up.
Old 03-19-2014, 04:40 AM   #18
dalesan
Member
 
Location: portugal

Join Date: Feb 2011
Posts: 15
Default

Quote:
Originally Posted by gringer View Post
Is there any particular reason why you used bowtie and not bowtie2? Were you specifically telling tophat to use bowtie, rather than bowtie2 (the default)?

I ask because if the bowtie versions are different, you'll be comparing a bit more than just genome vs transcriptome.

Additional question, did you use the transcriptome GTF file when mapping using tophat? I assume not, because that is likely to result in all transcriptome reads being picked up.
Actually, there was no particular reason I chose bowtie 1. I've just now read up on the differences between bowtie 1 and 2, and it seems bowtie2 has some nice improvements (affine gap penalties, local alignments, better handling of read pairs). Looks like I'll be re-running my pipeline one more time to see what difference the alignment method makes.

And yes, I did tell TopHat to use bowtie 1. I did in fact use the transcriptome GTF file when mapping with TopHat. I don't think it would pick up all transcripts, because my transcriptome index was built from a fasta file containing only the longest/most representative gene model for each gene.

In your experience, have you noticed a big difference between bowtie 1 and 2? Does it warrant that I re-do my analysis?

Thanks for your questions!
Old 03-25-2014, 05:27 AM   #19
dalesan
Member
 
Location: portugal

Join Date: Feb 2011
Posts: 15
Smile Updated results and summary

Hello All,

I thought I'd chime in again with some updated results after re-running my DEG analysis using bowtie2 to map reads to both the Arabidopsis transcriptome (1 representative isoform per locus) and Arabidopsis genome.

For transcriptome alignments, I used bowtie2 and allowed for local alignments. For the genome mapping, I used tophat (which only allows bowtie2 to run in end-to-end mode, i.e. no local alignments). I used DESeq2 for my DEG analyses.

My original question was to figure out whether it's "better" to just run a DE analysis on a well-characterized organism using its transcriptome, or whether it's "better" to use its genome. What I've come to discover is that, at least for Arabidopsis, if you have the time it's a good idea to do both, because in my case I was able to recover an additional 10-15% DEGs by considering the unique DEGs found in each of the mapping scenarios. For example, in my re-analysis using bowtie 2 instead of bowtie 1, I uncovered a total of 1667 DEGs, 1348 (~81%) of which were common to the transcriptome and genome mappings.

As I had previously conducted DEG analyses with bowtie 1 alignments, I decided to look at the differences in the DEGs found between the bowtie1 vs bowtie 2 mapping against the genome (using tophat) and bowtie1 vs bowtie 2 against the transcriptome.

I was very happy to find little difference in the genome mappings: 97.5% of the differentially expressed genes were shared across the two alignment versions, which is fantastic. It's important to note here that end-to-end alignment mode was used by default, as the local alignment option of bowtie2 isn't supported when mapping to the genome in TopHat.



I next checked the overlap between the bowtie1 and bowtie2 transcriptome alignments. Here there was less concordance: only 77.3% of the differentially expressed genes were shared across the two alignment versions.



I imagine this is largely attributable to my having used the local alignment option during the mapping. Notably, a greater percentage of my raw reads were mapped as a result of invoking local alignment: roughly 65-70% (bowtie2) versus 55-57% (bowtie1).

So the next question is, which of the two transcriptome alignments is more "trustworthy"? For now, I can't really say. If anyone has insight into whether or not it's worth using the local alignment option of bowtie2, I'd love to hear it.

In summary, I found 1764 DEGs using bowtie 1 and 1667 using bowtie 2, in a combined analysis of transcriptome + genome mappings. There was great agreement (97.5% shared DEGs) between bowtie 1 and bowtie 2 in the genome mappings, presumably because end-to-end mode was used during alignment. I observed considerably less concordance in the transcriptome alignments (77.3% shared DEGs), probably due to my invoking the local alignment option in bowtie 2. However, roughly 10% more of my raw reads were able to be mapped to the transcriptome using the local alignment option. As for my original question of whether it's better to use the transcriptome or the genome for mapping: I think if you have access to both, and the resources, use them both. I was able to recover an additional 10-15% DEGs by considering the unique DEGs found in each of the mapping scenarios. Going forward, I plan on using the DEGs from the bowtie2 mappings rather than the bowtie 1 mappings for all other downstream analyses.

I'd love to hear your feedback and I hope that this short comparison was somehow useful for someone.

Cheers,
Dale
Old 04-19-2014, 10:54 AM   #20
geneart
Member
 
Location: DC area

Join Date: Sep 2011
Posts: 42
Default Mapping uniquely

Hi, I have a very basic question about read mapping. For differential expression analysis of NGS data, many papers I have read mention that they discard non-uniquely mapping reads. However, I could not find a good summarized explanation for doing so. From what I gather, the more uniquely a read maps, the more certain you can be about calling its location, since the technique itself can introduce mismatches and bring about non-specific mapping; the depth at a unique location would still account for any naturally existing SNPs.
Have I understood this right, or is there a better explanation of why we take only uniquely mapping reads for differential expression?
Thanks in advance
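For what it's worth, one common way this filter is implemented in practice is to keep only alignments whose SAM NH tag (the number of reported alignments for that read) equals 1; TopHat writes this tag. Below is a rough pure-Python sketch of the idea; real pipelines would typically use samtools or pysam instead:

```python
# Filter a stream of SAM lines, keeping only uniquely mapped reads (NH:i:1).
# Header lines (starting with '@') are passed through unchanged.

def keep_unique(sam_lines):
    """Yield SAM header lines and alignments that map to exactly one location."""
    for line in sam_lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        tags = fields[11:]          # optional tags start at column 12
        if "NH:i:1" in tags:
            yield line
```

Reads without an NH tag would be dropped by this sketch, so it assumes an aligner that always emits NH; another common (aligner-specific) proxy is filtering on a minimum MAPQ with samtools.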