Seqanswers Leaderboard Ad

**dpryan** · 08-26-2012, 03:38 AM

You don't need to do any sorting of the output files prior to using them in DESeq (the output files should actually be in the same order). I usually merge them into a single file and alter the header line to make things easier. I wrote a small program in C to do that for me (PM me if you want a copy, though it'd be very simple to write yourself in any language), but you could even just use Excel or Openoffice if you prefer (in reality, you could also just do it in R by merging the various data frames).

**ThePresident** · 08-26-2012, 06:57 AM

Thank you for your replay!

As I said, I'm a mere beginer in all aspects of bioinformatics, so writing scripts in C, R, Python and so on is out of reach for me (hopefully one day...). Also, the generated <out> files are text files of 15-30 Gb, so trying to manipulate them with Excell or any other conventional software is a suicide.

I could use your C program (you run that under windows right?) if it's not too complicated or any other script for which I could get instructions how to use it.

By the way, here is the header of my <out> file; does it sounds okay?

ognjen@ThePresident-Studio-XPS-7100:~/Documents/RNAseq$ head WT2_out
HWI-ST766:125:C0PRMACXX:6:1101:1955:2481 0 gi|260211391|emb|FN545816.1| 155731 255 50M * 0 0 ACCANCAGAAATAGTTGAAAAAGCTAAAAAAATTATTGCCGGAGAATTGG @@@D#2ABFDHHBHABHGGIIIGGFHEGHIHH@DGIIHIHIFAFHCC@GG XA:i:1 MD:Z:4G45 NM:i:1 XF:Z:CDR20291_0115
HWI-ST766:125:C0PRMACXX:6:1101:1915:2481 0 gi|260211391|emb|FN545816.1| 439745 255 50M * 0 0 GTAANGGTGAGAGAGTAACAGTAAGTCTTTTCCATACAGCTATATATGGA @@@F#2A=CFHHHJJCGHIJJCHIHEHGIIJIJIIIICHIGIIIIGIJJJ XA:i:1 MD:Z:4A45 NM:i:1 XF:Z:CDR20291_0366
HWI-ST766:125:C0PRMACXX:6:1101:1939:2483 16 gi|260211391|emb|FN545816.1| 251919 255 50M * 0 0 ACCTAGTTAAGTAAATGTGTAGGCTGAGTTATACTCCCTATGTGTTGGCT 0GD?HHCGBDGFGIIGGFIHEEE@HC<FHF<A2GBGHDDDHHDDDDD@@@ XA:i:0 MD:Z:50 NM:i:0 XF:Z:no_feature
HWI-ST766:125:C0PRMACXX:6:1101:2708:2478 16 gi|260211391|emb|FN545816.1| 233067 255 50M * 0 0 TCAGAAGGTTCTTATGCTATATACAATGAAAATGGTTTAGATGGTNAAGC IIJJJJJIJJIJJIJJJJJJJIIJJJJJJJJJIJJJJHHHHHDA2#FCCC XA:i:1 MD:Z:45C4 NM:i:1 XF:Z:CDR20291_0180
HWI-ST766:125:C0PRMACXX:6:1101:2238:2479 0 gi|260211391|emb|FN545816.1| 94704 255 50M * 0 0 CTAGNAAAAGTTGGTAGATATAAATTCAATAAGAAACTTGCTTTATGTTA CCCF#4ADHHHHHJCGHIJJJJJJJJJJJJJJJJJJJJJJJJJJJIJHII XA:i:1 MD:Z:4C45 NM:i:1 XF:Z:CDR20291_0060
HWI-ST766:125:C0PRMACXX:6:1101:1837:2494 16 gi|260211391|emb|FN545816.1| 720993 255 50M * 0 0 ATATCTACTTCACTAGATTTTTCCTATGAACCTCTCTATGGAATAGATAG ##F<?:1?1*4D:*?3CGHFHBGH>FEBDA?<EAFBHFFHFD4ADDD@@@ XA:i:0 MD:Z:50 NM:i:0 XF:Z:CDR20291_0584
HWI-ST766:125:C0PRMACXX:6:1101:1993:2495 16 gi|260211391|emb|FN545816.1| 444348 255 50M * 0 0 GGTGTTGCTGCCTTAGCTCTAGGAATAGCTCAAGGTGCTTTAGATGAAGC JIJIJJJIGIJJJJJIJIJJJJIIJJJJJJJJJJJJJHHHHHFFFFFCCC XA:i:0 MD:Z:50 NM:i:0 XF:Z:CDR20291_0370
HWI-ST766:125:C0PRMACXX:6:1101:2798:2478 1:N:0:CGATGT 4 * 0 0 * * 0 0 CGGCNTGCTTTGAACACTCTAATTTTTTCAAAGTAAAAGTCCTGGTTTGC CCCF#2ADHHHHHJJJJJJJJJJJJJJJJJJJJGHIJJJGIJJJJGHIII XM:i:0 XF:Z:not_aligned
HWI-ST766:125:C0PRMACXX:6:1101:2596:2496 16 gi|260211391|emb|FN545816.1| 1138777 255 50M * 0 0 AGTCACCCACCTGGGAGTGCCTAGGGTGTGAATTTACTTAGCCACCGGTG JIGGFJHGBJIGJJIGIEDJJJIJHGIJJJJGIIHFJHHHHFDDFDFC@@ XA:i:0 MD:Z:50 NM:i:0 XF:Z:no_feature
HWI-ST766:125:C0PRMACXX:6:1101:3697:2475 16 gi|260211391|emb|FN545816.1| 623244 255 50M * 0 0 ATGTTATAAAAAGAAAATATGAACAATTATGTGAAAATTTAGATANTAAT IIIJHIGJJJJJGJJJIHIIIIIJJJIIIIIGJJJJJHHHHHDA2#FCC@ XA:i:1 MD:Z:45G4 NM:i:1 XF:Z:CDR20291_0513

what do you think?

**dpryan** · 08-26-2012, 07:06 AM

Ah, I misunderstood where your confusion was. That file that you were looking at is of no use to you in this case. Those are simply the reads that were kept by htseq-count. You need to redirect the output of htseq-count to a file, which will be much smaller than the one you're looking at and will be easier for you to deal with. So:

Code:

htseq-count -m intersection-nonempty -s no -t gene -i locus_tag alignWT1.sam NC_013316.gff > WT1_out

**ThePresident** · 08-26-2012, 07:09 AM

Great!!! I'm going to try it right away!

**ThePresident** · 08-26-2012, 08:08 AM

Oh my God, it does work! I love you Ryan!

I'm going to try DESeq soon, hopefully, there won't be any problems!

Thanks again!!!!

**ekimmike** · 08-27-2012, 06:45 AM

Originally posted by dpryan View Post

Code:

htseq-count -m intersection-nonempty -s no -t gene -i locus_tag alignWT1.sam NC_013316.gff > WT1_out

Could you explain your code, would be also quite useful for me?

thanks in advance

**dpryan** · 08-27-2012, 08:54 AM

Originally posted by ekimmike View Post

Could you explain your code, would be also quite useful for me?

thanks in advance

Which code in particular would you like to know about? Merging outputs from htseq-count?

**ekimmike** · 08-29-2012, 04:06 AM

I am not sure how to keep all different featuretypes using -t

second question is about ID, as I'd to keep more then one ID, eg. gene_id and/or transcript_id
should I run HTseq twice first time having sam output and so on...

**dpryan** · 08-29-2012, 04:16 AM

Originally posted by ekimmike View Post

I am not sure how to keep all different featuretypes using -t

second question is about ID, as I'd to keep more then one ID, eg. gene_id and/or transcript_id
should I run HTseq twice first time having sam output and so on...

Ah, I assumed you were asking about code, rather than how to use htseq-count, which I have nothing to do with aside from being one of very many users.

Normally you wouldn't want to do that. I would recommend that you describe exactly what you're trying to do rather than us trying to figure it out from your so-far limited description.

**ekimmike** · 08-29-2012, 04:36 AM

Thanks,

so this i what I get:
DDX11L1 17

and what I'd to have:
gene_id my_id feature counts
DDX11L1 145793 exon 17
13248 ncRNA 29
LOC100132062 557120 lincRNA 15

this is because I have merged various annotation types so sometimes there is no gene_id, can this be specified in HTSeq, as now I have to run it twice and merge tables

**dpryan** · 08-29-2012, 04:46 AM

So, in other words, your annotation file is kind of a mess and you're hoping to finagle a way around that with htseq-count. Perhaps you'll have luck using using the python interface directly, I've never needed to do that. Alternatively, you might do yourself a favour for the future and simply write a script to clean up your annotation file (if an entry lacks a gene_id, then add one, if it lacks an exon feature, then add such an entry, etc.).

**billstevens** · 09-04-2012, 07:18 AM

Hi guys,

I was using the Cufflinks->CuffDiff protocol, but I'm now going to try and use DESeq. I'm having the same problem as in the original post.

Are you telling me that the way you have to run it to use DESeq is by using "> WTout" instead of "-o WTout"? If so, why didn't Simon put that in the HTseq website?

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

The vignette for DESeq is excellent and I just went through it, but now I want to run this with my data, and I haven't found a clear way to do so.

I converted all my Tophat files to SAM and then sorted all of my outputted Tophat files by name. I then ran HT-Seq count on them using this code:
python -m HTSeq.scripts.count -s no -o WT1 accepted_hits.sam annotation.gtf

So the outputted files were huge, even after I filtered by removing "no feature, ambiguous, etc." by using grep.

Now I want to use DESeq to analyze it, but I don't know the next steps. Do I need to rerun it using the ">" instead of "-o" option? After that, I guess I can then use Ethanol's R script to merge the files together?

DE Seq - SEQanswers

http://seqanswers.com/forums/showthread.php?t=22936&highlight=htseq

Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

**ThePresident** · 09-04-2012, 07:37 AM

Hi,

Yeah, you have to use > FileOUT, and I have no clue why it isn't in the vignette. As you said, vignette is well written although some details are missing...

Anyway, if you use > FileOUT you will get a small (less then 1Mb) text file with all your reads. For merging, I simply used excel and then saved as text ou tab-delimited text, don't remember any more. And then, you can use DESeq in R. Good luck

**billstevens** · 09-04-2012, 08:16 AM

Well that could have saved me a week of time! On the other hand, what's a week when I've been working on this for almost a year?

Thanks!

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 17 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 46 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

HTseq count to DESeq (?)

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News