![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
HTSeq-count on bam | dvanic | Bioinformatics | 18 | 11-12-2014 06:52 PM |
htseq-count | paolo.kunder | Bioinformatics | 10 | 10-22-2014 05:45 AM |
htseq-count performance | dglemay | Bioinformatics | 8 | 10-23-2012 08:08 PM |
multiBamCov or htseq-count to count read per feature ? | NicoBxl | Bioinformatics | 1 | 07-03-2012 03:05 AM |
htseq-count error | sissi | Bioinformatics | 0 | 03-21-2012 12:40 AM |
![]() |
|
Thread Tools |
![]() |
#1 | |
Member
Location: Universe Join Date: Dec 2012
Posts: 81
|
![]()
1. transcripts1.gtf , transcripts2.gtf , ... , transcripts20.gtf were produced by cufflinks.
2. I combined all of the gtf file to make my own annotation file, new.combined.gtf cuffcompare -o new original.gtf transcripts1.gtf transcripts2.gtf ... transcripts20.gtf 3. I want to use HTSeq-count, by using new.combined.gtf. However, in new.combined.gtf, among gene_id and oID and transcript_id and tss_id, which one do I have to choose? My thought is to use oID or tssID. But some reads that have different oId share the same tssID. In other words, some reads that have been previously annotated and other reads that have not been annotated have different oID but the same tssID. Q: My goal is to see differentially expressed genes across the five different time points. To achieve this goal, when using HTSeq-count, which id is reasonable to choose as option? Thank you in advance! The below is the new.combined.gtf. Quote:
Last edited by syintel87; 02-03-2013 at 03:19 PM. |
|
![]() |
![]() |
![]() |
#2 | ||
David Eccles (gringer)
Location: Wellington, New Zealand Join Date: May 2011
Posts: 838
|
![]() Quote:
It looks like 'oID' is the name of the original transcript (or cufflink's name if it's found a novel transcript): http://cufflinks.cbcb.umd.edu/manual...uffcomp_output There also should be a .tmap file that links transcripts to genes. The first column of this file is "The gene_name attribute of the reference GTF record for this transcript, if present. Otherwise gene_id is used." If you're doing gene-level comparisons, that's probably what you want to use. For transcript-level comparisons, the second column names are probably appropriate. Quote:
Cufflinks has its own recommended differential expression test method, cuffdiff, which identifies differences at both a gene level and an isoform level: http://cufflinks.cbcb.umd.edu/manual.html#gene_exp_diff |
||
![]() |
![]() |
![]() |
#3 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]()
You're best off using gene_id. tss_id will definitely work poorly (you could maybe use that with DEXSeq) and oId will also work poorly (though not nearly as bad as tss_id).
gringer: I've done what syintel87 is trying to do, so I'll give what my reason was. In my case, I had a lot of reads mapping to UTR regions (and the occasional exon) that weren't in the annotation that I was using. cufflinks can pick a lot of these up and spit out a merged GTF file (with exons added or extended, as appropriate) that better represented the apparent structure of the genes. In my case, that really didn't end up making much of a difference in the expression analysis (it just shuffled around genes on the margin of significance), but perhaps for others it proves more useful. |
![]() |
![]() |
![]() |
#4 |
Member
Location: Boston Join Date: Feb 2012
Posts: 28
|
![]()
I have a question, and I would really appreciate it if I was to get a prompt reply!
I am performing an RNASeq analysis and I have paired-end 101 bp reads. I want to generate count data and perform a DE analysis using DESeq where I can identify genes that are differentially expressed between 2 conditions (for which we have replicates for). I have run ht-seq count using an alignment file (sam file sorted by read name of course) and the latest version of Ensembl's annotation file as input to obtain this count data. I ran this using two different identifiers (-i option). 1) gene_id (default, and 'preferred' for RNASeq) 2) gene_name My preference was that I wouldn't want to go through the headache of converting these gene-id's into gene symbols which are easier for everyone to look at and identify. But what I realized was that these 2 runs (keeping the mode and everything else constant) yielded different number of rows (entries) in the resulting count dataset. Why is this? Isn't it a 1 to 1 (id to gene name) mapping/conversion? I'm not sure what the difference in it's working is between the two runs. Am I losing anything by using gene_name instead of gene_id? Again, any help would be greatly appreciated. Thanks |
![]() |
![]() |
![]() |
#5 |
Senior Member
Location: Heidelberg, Germany Join Date: Feb 2010
Posts: 994
|
![]()
Doesn't seem so, does it?
I quickly checked with Biomart, and the first thing I noticed is that snoRNAs that have multiple copies appear with different ENSG gene IDs but the same gene symbol. I'm sure there are many such things, probably especially (but maybe only) regarding non-coding genes. Sometimes it seems to me that the biggest everyday task in bioinformatics is struggling with ID conversions. |
![]() |
![]() |
![]() |
#6 | |
Member
Location: Universe Join Date: Dec 2012
Posts: 81
|
![]() Quote:
So, you mean tss_id and oId work poorly, so that gene_id has to be used? 2. My concern is the probability that even in the same gene, some transcripts could be up-regulated and other transcripts could be down-regulated. This is why I want to analyze data based on the transcript level. Does my thought make sense biologically? 3. My data is RNA-seq. And my goal is to see the differentially expressed genes by using edgeR. I am concerened about some reads that are mapped to reference genome but are not annotated, so that I want to use cuffcompare to combine original.gtf file and transcripts.gtf files. Though considering my situation, since tss_id is working poorly, the way to go is to use gene_id, ignoring transcript levels? |
|
![]() |
![]() |
![]() |
#7 | |||
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]() Quote:
Quote:
Quote:
|
|||
![]() |
![]() |
![]() |
#8 | |
Member
Location: Universe Join Date: Dec 2012
Posts: 81
|
![]() Quote:
So, you mean to see differential expression at gene level, gene_id has to be used in htseq count, and edgeR analysis will be fine. Also, to see differential expression at transcript level, tss_id(???) has to be used in htseq count, and DEXSeq or something will be fine. Do I understand correctly? Last edited by syintel87; 02-06-2013 at 02:29 AM. |
|
![]() |
![]() |
![]() |
#9 | ||
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]() Quote:
Quote:
|
||
![]() |
![]() |
![]() |
#10 | |
Member
Location: Universe Join Date: Dec 2012
Posts: 81
|
![]() Quote:
Now I am thinking of two ways. 1. 1) original.gtf (original annotation ) + transcripts.gtf ( output from cufflinks ) => cuffcompare => combined.gtf 2) HTSeq-count by using combined.gtf and its "gene_id". 2. 1) original.gtf (original annotation ) + transcripts.gtf ( output from cufflinks ) => cuffcompare => combined.gtf 2) python dexseq_prepare_annotation.py combined.gtf combined_DEXSeq.gtf 3) python dexseq_count.py -p no -s no -a 10 combined_DEXSeq.gtf accepted_hits.sam treatment1_counts.txt 4) read.HTSeqCounts( treatment1_counts.txt, design, flattenedfile = "combined_DEXSeq.gtf" ) Would you please give me a tip about whetehr these two ways seem to make sense? Thank you very much!!! |
|
![]() |
![]() |
![]() |
#11 |
David Eccles (gringer)
Location: Wellington, New Zealand Join Date: May 2011
Posts: 838
|
![]()
HTSeq-count on its own is not going to be all that useful due to normalisation issues. It needs to be paired up with another program that can filter out the noise and make the true positive results more obvious. At least use cuffdiff if you want some results directly from cufflinks' results files.
|
![]() |
![]() |
![]() |
#12 | |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]() Quote:
![]() |
|
![]() |
![]() |
![]() |
Thread Tools | |
|
|