![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
htseq-count | paolo.kunder | Bioinformatics | 10 | 10-22-2014 05:45 AM |
Output of HTseq-count? | kasutubh | Bioinformatics | 8 | 08-21-2012 11:39 PM |
multiBamCov or htseq-count to count read per feature ? | NicoBxl | Bioinformatics | 1 | 07-03-2012 03:05 AM |
analysis of single and PE reads with htseq-count and DESeq | dglemay | Bioinformatics | 1 | 04-27-2012 12:23 AM |
DEXSeq vs htseq-count/DESeq counting model | jdsv | Bioinformatics | 2 | 11-20-2011 08:48 PM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Sherbrooke / Canada Join Date: Jun 2012
Posts: 72
|
![]()
Hi all,
I have a little question concerning HTSeq and DESeq. I am dealing with some bacterial RNAseq data (control + test, 2 biological replicates each). Data were obtained from Illumina, single-end 50bp. I was able to map these reads to my reference genome and generate SAM and BAM files for the alignment. Now, I would like to get some statistics done over it in order to get a list of genes that are statistically up or down regulated in between my control and test condition. As I saw here, I should be able to do that with DESeq package. So, in order to use DESeq I need a well-organized table with genes count. I thought about using HTSeq for getting such a input table. So, I installed HTSeq and I run my first SAM file with following command line: htseq-count -m intersection-nonempty -s no -t gene -i locus_tag alignWT1.sam NC_013316.gff -o WT1_out After the run was complete, I had printed in the command line genes locus_tag with genes count and also no feature: 9034352 ambiguous: 958299 not aligned: 2097930 That's okay (at least I think) since I had more then 37 millions reads for this replicate. Now, the problem is what to do next! I looked at the header of my WT1_out file, and there is way more information then it should be if I want to use this table for DESeq. So, here are my questions:
Any help would be more then welcome. Thanks in advance guys! |
![]() |
![]() |
![]() |
#2 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]()
You don't need to do any sorting of the output files prior to using them in DESeq (the output files should actually be in the same order). I usually merge them into a single file and alter the header line to make things easier. I wrote a small program in C to do that for me (PM me if you want a copy, though it'd be very simple to write yourself in any language), but you could even just use Excel or Openoffice if you prefer (in reality, you could also just do it in R by merging the various data frames).
|
![]() |
![]() |
![]() |
#3 |
Member
Location: Sherbrooke / Canada Join Date: Jun 2012
Posts: 72
|
![]()
Thank you for your replay!
As I said, I'm a mere beginer in all aspects of bioinformatics, so writing scripts in C, R, Python and so on is out of reach for me (hopefully one day...). Also, the generated <out> files are text files of 15-30 Gb, so trying to manipulate them with Excell or any other conventional software is a suicide. I could use your C program (you run that under windows right?) if it's not too complicated or any other script for which I could get instructions how to use it. By the way, here is the header of my <out> file; does it sounds okay? ognjen@ThePresident-Studio-XPS-7100:~/Documents/RNAseq$ head WT2_out HWI-ST766:125:C0PRMACXX:6:1101:1955:2481 0 gi|260211391|emb|FN545816.1| 155731 255 50M * 0 0 ACCANCAGAAATAGTTGAAAAAGCTAAAAAAATTATTGCCGGAGAATTGG @@@D#2ABFDHHBHABHGGIIIGGFHEGHIHH@DGIIHIHIFAFHCC@GG XA:i:1 MD:Z:4G45 NM:i:1 XF:Z:CDR20291_0115 HWI-ST766:125:C0PRMACXX:6:1101:1915:2481 0 gi|260211391|emb|FN545816.1| 439745 255 50M * 0 0 GTAANGGTGAGAGAGTAACAGTAAGTCTTTTCCATACAGCTATATATGGA @@@F#2A=CFHHHJJCGHIJJCHIHEHGIIJIJIIIICHIGIIIIGIJJJ XA:i:1 MD:Z:4A45 NM:i:1 XF:Z:CDR20291_0366 HWI-ST766:125:C0PRMACXX:6:1101:1939:2483 16 gi|260211391|emb|FN545816.1| 251919 255 50M * 0 0 ACCTAGTTAAGTAAATGTGTAGGCTGAGTTATACTCCCTATGTGTTGGCT 0GD?HHCGBDGFGIIGGFIHEEE@HC<FHF<A2GBGHDDDHHDDDDD@@@ XA:i:0 MD:Z:50 NM:i:0 XF:Z:no_feature HWI-ST766:125:C0PRMACXX:6:1101:2708:2478 16 gi|260211391|emb|FN545816.1| 233067 255 50M * 0 0 TCAGAAGGTTCTTATGCTATATACAATGAAAATGGTTTAGATGGTNAAGC IIJJJJJIJJIJJIJJJJJJJIIJJJJJJJJJIJJJJHHHHHDA2#FCCC XA:i:1 MD:Z:45C4 NM:i:1 XF:Z:CDR20291_0180 HWI-ST766:125:C0PRMACXX:6:1101:2238:2479 0 gi|260211391|emb|FN545816.1| 94704 255 50M * 0 0 CTAGNAAAAGTTGGTAGATATAAATTCAATAAGAAACTTGCTTTATGTTA CCCF#4ADHHHHHJCGHIJJJJJJJJJJJJJJJJJJJJJJJJJJJIJHII XA:i:1 MD:Z:4C45 NM:i:1 XF:Z:CDR20291_0060 HWI-ST766:125:C0PRMACXX:6:1101:1837:2494 16 gi|260211391|emb|FN545816.1| 720993 255 50M * 0 0 ATATCTACTTCACTAGATTTTTCCTATGAACCTCTCTATGGAATAGATAG ##F<?:1?1*4D:*?3CGHFHBGH>FEBDA?<EAFBHFFHFD4ADDD@@@ XA:i:0 MD:Z:50 NM:i:0 XF:Z:CDR20291_0584 HWI-ST766:125:C0PRMACXX:6:1101:1993:2495 16 gi|260211391|emb|FN545816.1| 444348 255 50M * 0 0 GGTGTTGCTGCCTTAGCTCTAGGAATAGCTCAAGGTGCTTTAGATGAAGC JIJIJJJIGIJJJJJIJIJJJJIIJJJJJJJJJJJJJHHHHHFFFFFCCC XA:i:0 MD:Z:50 NM:i:0 XF:Z:CDR20291_0370 HWI-ST766:125:C0PRMACXX:6:1101:2798:2478 1:N:0:CGATGT 4 * 0 0 * * 0 0 CGGCNTGCTTTGAACACTCTAATTTTTTCAAAGTAAAAGTCCTGGTTTGC CCCF#2ADHHHHHJJJJJJJJJJJJJJJJJJJJGHIJJJGIJJJJGHIII XM:i:0 XF:Z:not_aligned HWI-ST766:125:C0PRMACXX:6:1101:2596:2496 16 gi|260211391|emb|FN545816.1| 1138777 255 50M * 0 0 AGTCACCCACCTGGGAGTGCCTAGGGTGTGAATTTACTTAGCCACCGGTG JIGGFJHGBJIGJJIGIEDJJJIJHGIJJJJGIIHFJHHHHFDDFDFC@@ XA:i:0 MD:Z:50 NM:i:0 XF:Z:no_feature HWI-ST766:125:C0PRMACXX:6:1101:3697:2475 16 gi|260211391|emb|FN545816.1| 623244 255 50M * 0 0 ATGTTATAAAAAGAAAATATGAACAATTATGTGAAAATTTAGATANTAAT IIIJHIGJJJJJGJJJIHIIIIIJJJIIIIIGJJJJJHHHHHDA2#FCC@ XA:i:1 MD:Z:45G4 NM:i:1 XF:Z:CDR20291_0513 what do you think? |
![]() |
![]() |
![]() |
#4 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]()
Ah, I misunderstood where your confusion was. That file that you were looking at is of no use to you in this case. Those are simply the reads that were kept by htseq-count. You need to redirect the output of htseq-count to a file, which will be much smaller than the one you're looking at and will be easier for you to deal with. So:
Code:
htseq-count -m intersection-nonempty -s no -t gene -i locus_tag alignWT1.sam NC_013316.gff > WT1_out |
![]() |
![]() |
![]() |
#5 |
Member
Location: Sherbrooke / Canada Join Date: Jun 2012
Posts: 72
|
![]()
Great!!! I'm going to try it right away!
![]() |
![]() |
![]() |
![]() |
#6 |
Member
Location: Sherbrooke / Canada Join Date: Jun 2012
Posts: 72
|
![]()
Oh my God, it does work! I love you Ryan!
![]() I'm going to try DESeq soon, hopefully, there won't be any problems! Thanks again!!!! |
![]() |
![]() |
![]() |
#7 |
Member
Location: EU Join Date: Apr 2012
Posts: 14
|
![]() |
![]() |
![]() |
![]() |
#8 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]() |
![]() |
![]() |
![]() |
#9 |
Member
Location: EU Join Date: Apr 2012
Posts: 14
|
![]()
I am not sure how to keep all different featuretypes using -t
second question is about ID, as I'd to keep more then one ID, eg. gene_id and/or transcript_id should I run HTseq twice first time having sam output and so on... |
![]() |
![]() |
![]() |
#10 | |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]() Quote:
Normally you wouldn't want to do that. I would recommend that you describe exactly what you're trying to do rather than us trying to figure it out from your so-far limited description. |
|
![]() |
![]() |
![]() |
#11 |
Member
Location: EU Join Date: Apr 2012
Posts: 14
|
![]()
Thanks,
so this i what I get: DDX11L1 17 and what I'd to have: gene_id my_id feature counts DDX11L1 145793 exon 17 13248 ncRNA 29 LOC100132062 557120 lincRNA 15 this is because I have merged various annotation types so sometimes there is no gene_id, can this be specified in HTSeq, as now I have to run it twice and merge tables |
![]() |
![]() |
![]() |
#12 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]()
So, in other words, your annotation file is kind of a mess and you're hoping to finagle a way around that with htseq-count. Perhaps you'll have luck using using the python interface directly, I've never needed to do that. Alternatively, you might do yourself a favour for the future and simply write a script to clean up your annotation file (if an entry lacks a gene_id, then add one, if it lacks an exon feature, then add such an entry, etc.).
|
![]() |
![]() |
![]() |
#13 |
Senior Member
Location: Baltimore Join Date: Mar 2012
Posts: 120
|
![]()
Hi guys,
I was using the Cufflinks->CuffDiff protocol, but I'm now going to try and use DESeq. I'm having the same problem as in the original post. Are you telling me that the way you have to run it to use DESeq is by using "> WTout" instead of "-o WTout"? If so, why didn't Simon put that in the HTseq website? http://www-huber.embl.de/users/ander...doc/count.html The vignette for DESeq is excellent and I just went through it, but now I want to run this with my data, and I haven't found a clear way to do so. I converted all my Tophat files to SAM and then sorted all of my outputted Tophat files by name. I then ran HT-Seq count on them using this code: python -m HTSeq.scripts.count -s no -o WT1 accepted_hits.sam annotation.gtf So the outputted files were huge, even after I filtered by removing "no feature, ambiguous, etc." by using grep. Now I want to use DESeq to analyze it, but I don't know the next steps. Do I need to rerun it using the ">" instead of "-o" option? After that, I guess I can then use Ethanol's R script to merge the files together? http://seqanswers.com/forums/showthr...ighlight=htseq Last edited by billstevens; 09-04-2012 at 08:32 AM. |
![]() |
![]() |
![]() |
#14 |
Member
Location: Sherbrooke / Canada Join Date: Jun 2012
Posts: 72
|
![]()
Hi,
Yeah, you have to use > FileOUT, and I have no clue why it isn't in the vignette. As you said, vignette is well written although some details are missing... Anyway, if you use > FileOUT you will get a small (less then 1Mb) text file with all your reads. For merging, I simply used excel and then saved as text ou tab-delimited text, don't remember any more. And then, you can use DESeq in R. Good luck |
![]() |
![]() |
![]() |
#15 |
Senior Member
Location: Baltimore Join Date: Mar 2012
Posts: 120
|
![]()
Well that could have saved me a week of time! On the other hand, what's a week when I've been working on this for almost a year?
Thanks! Last edited by billstevens; 09-04-2012 at 09:21 AM. |
![]() |
![]() |
![]() |
#16 |
Member
Location: Sherbrooke / Canada Join Date: Jun 2012
Posts: 72
|
![]()
Yeah, bioinformatics can be quite frustrating... Although, the feeling is great when you get it done and you see all those beautiful tables with genes, aligned with p values and so on
![]() |
![]() |
![]() |
![]() |
#17 |
Senior Member
Location: Baltimore Join Date: Mar 2012
Posts: 120
|
![]()
Hey guys, is there an extra step needed for DESeq to adjust for library size?
What I mean is, I have three replicates, and two of them have half as many reads as the first. So my file after pasting it into Excel, looks like this Untreated Treated Gene 1 6 3 3 8 4 4 Gene 2 10 5 5 12 6 6 ..... I then use this table to use for DESeq. Does DESeq adjust for each sequencing depth? |
![]() |
![]() |
![]() |
#18 |
Devon Ryan
Location: Freiburg, Germany Join Date: Jul 2011
Posts: 3,480
|
![]() |
![]() |
![]() |
![]() |
#19 |
Senior Member
Location: Baltimore Join Date: Mar 2012
Posts: 120
|
![]()
yeah sorry, I was just about to write that.
|
![]() |
![]() |
![]() |
Tags |
deseq, htseq |
Thread Tools | |
|
|