SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
HTSeq-count on bam dvanic Bioinformatics 18 11-12-2014 05:52 PM
htseq-count paolo.kunder Bioinformatics 10 10-22-2014 04:45 AM
htseq-count performance dglemay Bioinformatics 8 10-23-2012 07:08 PM
multiBamCov or htseq-count to count read per feature ? NicoBxl Bioinformatics 1 07-03-2012 02:05 AM
htseq-count error sissi Bioinformatics 0 03-20-2012 11:40 PM

Reply
 
Thread Tools
Old 02-03-2013, 11:58 AM   #1
syintel87
Member
 
Location: Universe

Join Date: Dec 2012
Posts: 81
Default Which ID should be used for HTSeq-count?

1. transcripts1.gtf , transcripts2.gtf , ... , transcripts20.gtf were produced by cufflinks.

2. I combined all of the gtf file to make my own annotation file, new.combined.gtf
cuffcompare -o new original.gtf transcripts1.gtf transcripts2.gtf ... transcripts20.gtf

3. I want to use HTSeq-count, by using new.combined.gtf.
However, in new.combined.gtf, among gene_id and oID and transcript_id and tss_id, which one do I have to choose? My thought is to use oID or tssID. But some reads that have different oId share the same tssID. In other words, some reads that have been previously annotated and other reads that have not been annotated have different oID but the same tssID.


Q: My goal is to see differentially expressed genes across the five different time points. To achieve this goal, when using HTSeq-count, which id is reasonable to choose as option?



Thank you in advance!
The below is the new.combined.gtf.

Quote:
Mt:3452:Mt3.5.1Chr1 E exon 154859 155160 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "Medtr1g005010.1"; tss_id "TSS1";
Mt:3452:Mt3.5.1Chr1 E exon 155201 161330 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "Medtr1g005020.1"; tss_id "TSS2";
Mt:3452:Mt3.5.1Chr1 E exon 161501 161880 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "Medtr1g005020.1"; tss_id "TSS2";
Mt:3452:Mt3.5.1Chr1 E exon 162000 162267 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "3"; oId "Medtr1g005020.1"; tss_id "TSS2";
Mt:3452:Mt3.5.1Chr1 E exon 162314 162479 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "4"; oId "Medtr1g005020.1"; tss_id "TSS2";
Mt:3452:Mt3.5.1Chr1 Cufflinks exon 161506 162151 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00004880"; exon_number "1"; oId "CUFF.2.1"; tss_id "TSS4";
Mt:3452:Mt3.5.1Chr1 Cufflinks exon 162327 162630 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00004881"; exon_number "1"; oId "CUFF.2.1"; tss_id "TSS5";
Mt:3452:Mt3.5.1Chr1 Cufflinks exon 174216 175572 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00007444"; exon_number "1"; oId "CUFF.9.1"; tss_id "TSS6";
Mt:3452:Mt3.5.1Chr1 Cufflinks exon 188340 188516 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00007444"; exon_number "2"; oId "CUFF.9.1"; tss_id "TSS6";
Mt:3452:Mt3.5.1Chr1 Cufflinks exon 175331 175619 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00014784"; exon_number "1"; oId "CUFF.5.1"; tss_id "TSS8";
Mt:3452:Mt3.5.1Chr1 Cufflinks exon 176045 176603 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00004882"; exon_number "1"; oId "CUFF.8.1"; tss_id "TSS9";
Mt:3452:Mt3.5.1Chr1 Cufflinks exon 176728 177280 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00004882"; exon_number "2"; oId "CUFF.8.1"; tss_id "TSS9";
Mt:3452:Mt3.5.1Chr1 Cufflinks exon 177480 177481 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00004882"; exon_number "3"; oId "CUFF.8.1"; tss_id "TSS9";
Mt:3452:Mt3.5.1Chr1 Cufflinks exon 176052 177755 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00007445"; exon_number "1"; oId "CUFF.10.1"; tss_id "TSS9";
Mt:3452:Mt3.5.1Chr1 E exon 176184 176603 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00000003"; exon_number "1"; oId "Medtr1g005100.1"; tss_id "TSS10";
Mt:3452:Mt3.5.1Chr1 E exon 176728 176892 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00000003"; exon_number "2"; oId "Medtr1g005100.1"; tss_id "TSS10";
Mt:3452:Mt3.5.1Chr1 E exon 177155 177280 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00000003"; exon_number "3"; oId "Medtr1g005100.1"; tss_id "TSS10";

Last edited by syintel87; 02-03-2013 at 02:19 PM.
syintel87 is offline   Reply With Quote
Old 02-03-2013, 04:43 PM   #2
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Quote:
Originally Posted by syintel87 View Post
I want to use HTSeq-count, by using new.combined.gtf.
However, in new.combined.gtf, among gene_id and oID and transcript_id and tss_id, which one do I have to choose? My thought is to use oID or tssID. But some reads that have different oID share the same tssID. In other words, some reads that have been previously annotated and other reads that have not been annotated have different oID but the same tssID.
TSS -> Transcript Start Site (according to convention, and cufflinks documentation). This is related to the gene_id. There will be multiple isoforms / transcripts that have the same transcript start site, and arise from different splicing events.

It looks like 'oID' is the name of the original transcript (or cufflink's name if it's found a novel transcript):

http://cufflinks.cbcb.umd.edu/manual...uffcomp_output

There also should be a .tmap file that links transcripts to genes. The first column of this file is "The gene_name attribute of the reference GTF record for this transcript, if present. Otherwise gene_id is used." If you're doing gene-level comparisons, that's probably what you want to use. For transcript-level comparisons, the second column names are probably appropriate.

Quote:
Q: My goal is to see differentially expressed genes across the five different time points. To achieve this goal, when using HTSeq-count, which id is reasonable to choose as option?
Any particular reason why you want to use HTSeq-count? What do you propose to use as your count data?

Cufflinks has its own recommended differential expression test method, cuffdiff, which identifies differences at both a gene level and an isoform level:

http://cufflinks.cbcb.umd.edu/manual.html#gene_exp_diff
gringer is offline   Reply With Quote
Old 02-04-2013, 12:56 AM   #3
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

You're best off using gene_id. tss_id will definitely work poorly (you could maybe use that with DEXSeq) and oId will also work poorly (though not nearly as bad as tss_id).

gringer: I've done what syintel87 is trying to do, so I'll give what my reason was. In my case, I had a lot of reads mapping to UTR regions (and the occasional exon) that weren't in the annotation that I was using. cufflinks can pick a lot of these up and spit out a merged GTF file (with exons added or extended, as appropriate) that better represented the apparent structure of the genes. In my case, that really didn't end up making much of a difference in the expression analysis (it just shuffled around genes on the margin of significance), but perhaps for others it proves more useful.
dpryan is offline   Reply With Quote
Old 02-04-2013, 11:47 AM   #4
vkartha
Member
 
Location: Boston

Join Date: Feb 2012
Posts: 28
Default

I have a question, and I would really appreciate it if I was to get a prompt reply!

I am performing an RNASeq analysis and I have paired-end 101 bp reads. I want to generate count data and perform a DE analysis using DESeq where I can identify genes that are differentially expressed between 2 conditions (for which we have replicates for).

I have run ht-seq count using an alignment file (sam file sorted by read name of course) and the latest version of Ensembl's annotation file as input to obtain this count data. I ran this using two different identifiers (-i option).

1) gene_id (default, and 'preferred' for RNASeq)
2) gene_name

My preference was that I wouldn't want to go through the headache of converting these gene-id's into gene symbols which are easier for everyone to look at and identify. But what I realized was that these 2 runs (keeping the mode and everything else constant) yielded different number of rows (entries) in the resulting count dataset.

Why is this? Isn't it a 1 to 1 (id to gene name) mapping/conversion? I'm not sure what the difference in it's working is between the two runs. Am I losing anything by using gene_name instead of gene_id?

Again, any help would be greatly appreciated.


Thanks
vkartha is offline   Reply With Quote
Old 02-05-2013, 12:27 AM   #5
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by vkartha View Post
Isn't it a 1 to 1 (id to gene name) mapping/conversion?
Doesn't seem so, does it?

I quickly checked with Biomart, and the first thing I noticed is that snoRNAs that have multiple copies appear with different ENSG gene IDs but the same gene symbol. I'm sure there are many such things, probably especially (but maybe only) regarding non-coding genes.

Sometimes it seems to me that the biggest everyday task in bioinformatics is struggling with ID conversions.
Simon Anders is offline   Reply With Quote
Old 02-06-2013, 12:56 AM   #6
syintel87
Member
 
Location: Universe

Join Date: Dec 2012
Posts: 81
Default

Quote:
Originally Posted by dpryan View Post
You're best off using gene_id. tss_id will definitely work poorly (you could maybe use that with DEXSeq) and oId will also work poorly (though not nearly as bad as tss_id).

gringer: I've done what syintel87 is trying to do, so I'll give what my reason was. In my case, I had a lot of reads mapping to UTR regions (and the occasional exon) that weren't in the annotation that I was using. cufflinks can pick a lot of these up and spit out a merged GTF file (with exons added or extended, as appropriate) that better represented the apparent structure of the genes. In my case, that really didn't end up making much of a difference in the expression analysis (it just shuffled around genes on the margin of significance), but perhaps for others it proves more useful.
1.
So, you mean tss_id and oId work poorly, so that gene_id has to be used?

2.
My concern is the probability that even in the same gene, some transcripts could be up-regulated and other transcripts could be down-regulated. This is why I want to analyze data based on the transcript level. Does my thought make sense biologically?

3.
My data is RNA-seq. And my goal is to see the differentially expressed genes by using edgeR. I am concerened about some reads that are mapped to reference genome but are not annotated, so that I want to use cuffcompare to combine original.gtf file and transcripts.gtf files.
Though considering my situation, since tss_id is working poorly, the way to go is to use gene_id, ignoring transcript levels?
syintel87 is offline   Reply With Quote
Old 02-06-2013, 01:03 AM   #7
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by syintel87 View Post
1.
So, you mean tss_id and oId work poorly, so that gene_id has to be used?
Yes
Quote:
2.
My concern is the probability that even in the same gene, some transcripts could be up-regulated and other transcripts could be down-regulated. This is why I want to analyze data based on the transcript level. Does my thought make sense biologically?
That's why you use something like DEXSeq in addition to something like edgeR (or DESeq, for similarity).

Quote:
3.
My data is RNA-seq. And my goal is to see the differentially expressed genes by using edgeR. I am concerened about some reads that are mapped to reference genome but are not annotated, so that I want to use cuffcompare to combine original.gtf file and transcripts.gtf files.
Though considering my situation, since tss_id is working poorly, the way to go is to use gene_id, ignoring transcript levels?
Gene and transcript level analyses needn't be the same. I wouldn't try to shoehorn an analysis into a pipeline designed for something else when more appropriate pipelines exist.
dpryan is offline   Reply With Quote
Old 02-06-2013, 01:23 AM   #8
syintel87
Member
 
Location: Universe

Join Date: Dec 2012
Posts: 81
Default

Quote:
Originally Posted by dpryan View Post
Yes

That's why you use something like DEXSeq in addition to something like edgeR (or DESeq, for similarity).



Gene and transcript level analyses needn't be the same. I wouldn't try to shoehorn an analysis into a pipeline designed for something else when more appropriate pipelines exist.
Thank you so much!!!!

So, you mean to see differential expression at gene level, gene_id has to be used in htseq count, and edgeR analysis will be fine.

Also, to see differential expression at transcript level, tss_id(???) has to be used in htseq count, and DEXSeq or something will be fine.

Do I understand correctly?

Last edited by syintel87; 02-06-2013 at 01:29 AM.
syintel87 is offline   Reply With Quote
Old 02-06-2013, 01:37 AM   #9
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by syintel87 View Post
Thank you so much!!!!

So, you mean to see differential expression at gene level, gene_id has to be used in htseq count, and edgeR analysis will be fine.
Exactly

Quote:
Also, to see differential expression at transcript level, tss_id has to be used in htseq count, and DEXSeq or something will be fine.

Do I understand correctly?
As the Germans I work with would say, jein ("yes and no"). For DEXSeq, you would use something like dexseq_count.py (read the DEXSeq vignette). DEXSeq also technically looks at differential exon usage, which is related to transcript level expression changes but not exactly the same. Browse through bioconductor (and the literature/this forum, since a lot of them are stand-alone apps) for other ways of looking at transcript level changes.
dpryan is offline   Reply With Quote
Old 02-06-2013, 04:46 AM   #10
syintel87
Member
 
Location: Universe

Join Date: Dec 2012
Posts: 81
Default

Quote:
Originally Posted by dpryan View Post
Exactly



As the Germans I work with would say, jein ("yes and no"). For DEXSeq, you would use something like dexseq_count.py (read the DEXSeq vignette). DEXSeq also technically looks at differential exon usage, which is related to transcript level expression changes but not exactly the same. Browse through bioconductor (and the literature/this forum, since a lot of them are stand-alone apps) for other ways of looking at transcript level changes.

Now I am thinking of two ways.

1.
1) original.gtf (original annotation ) + transcripts.gtf ( output from cufflinks ) => cuffcompare => combined.gtf
2) HTSeq-count by using combined.gtf and its "gene_id".

2.
1) original.gtf (original annotation ) + transcripts.gtf ( output from cufflinks ) => cuffcompare => combined.gtf
2) python dexseq_prepare_annotation.py combined.gtf combined_DEXSeq.gtf
3) python dexseq_count.py -p no -s no -a 10 combined_DEXSeq.gtf accepted_hits.sam treatment1_counts.txt
4) read.HTSeqCounts( treatment1_counts.txt, design, flattenedfile = "combined_DEXSeq.gtf" )

Would you please give me a tip about whetehr these two ways seem to make sense?

Thank you very much!!!
syintel87 is offline   Reply With Quote
Old 02-06-2013, 05:58 PM   #11
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 838
Default

Quote:
Originally Posted by syintel87 View Post
2) HTSeq-count by using combined.gtf and its "gene_id".
HTSeq-count on its own is not going to be all that useful due to normalisation issues. It needs to be paired up with another program that can filter out the noise and make the true positive results more obvious. At least use cuffdiff if you want some results directly from cufflinks' results files.
gringer is offline   Reply With Quote
Old 02-07-2013, 12:16 AM   #12
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Quote:
Originally Posted by gringer View Post
HTSeq-count on its own is not going to be all that useful due to normalisation issues. It needs to be paired up with another program that can filter out the noise and make the true positive results more obvious. At least use cuffdiff if you want some results directly from cufflinks' results files.
From earlier in the thread, there's an implicit "use edgeR" third step
dpryan is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:05 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO