Seqanswers Leaderboard Ad

**dpryan** · 10-16-2013, 09:35 AM

Have you used htseq-count yet to get counts?

**sindrle** · 10-16-2013, 11:14 AM

Hi!
Do I have to? Im stuck here:

http://www.bioconductor.org/packages/2.13/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf

Im trying the "2 A First Example", but I dont understand what "features <- GRanges" is (seqnames/IRanges).

Im also not quite sure how to type in "group_id": I have 2 different groups at two timepoints (as you already know).

Thank you very much!

**dpryan** · 10-16-2013, 02:15 PM

Well, you don't have to do anything you don't want to do :P However, it's generally a bit easier to just use a quick command line script than to try to deal with GenomicRanges. The later can do it, but has a much steeper learning curve and will just produce the same results (unless you're trying to do something unusual).

Regarding your GRanges question, seqnames would be things like "chr1" (UCSC chromosome naming scheme) or "10" (Ensembl naming scheme). IRanges are genomic bounds, so just the 5' and 3' boundaries of an exon/CDS/whatever. In general, you can get that stuff all made for you by reading in a GTF or GFF file (I've used import.gff from rtracklayer). The "group_id" in the example is more of a place-holder for gene_id later on. In essence, it's probably meant to be the actual feature level that's counted over (though the don't do that in the example), so you don't actually need it that example. In a real case, you'd read in your annotation file into a GRanges object and then

Code:

 split(GRangesObject, mcols(GRangesObject)$label_of_interest)

to get a GRangesList that you would then summarizeOverlaps on. Here, label_of_interest would probably be the gene_id field of the annotation file (though it could be something else, such as sprintf("%s:E%03d", mcols(blah)$gene_id, mcols(blah)$exon), which is what you'd do for DEXSeq counts).

**sindrle** · 10-17-2013, 05:26 AM

Thank you so much!

I installed HTseq and have started HTseq-count on all 48 BAMs.

From one BAM, is this a problem?:

no_feature 7984183
ambiguous 521840
too_low_aQual 0
not_aligned 0
alignment_not_unique 13730878

And also, can you compress the HTseq-count output? Its 16gb for each file..

**dpryan** · 10-17-2013, 05:59 AM

Ah, you're saving the wrong file, don't use -o. The command to execute is something like:

Code:

samtools view namesorted.bam | htseq-count -m intersection-nonempty -s no -a 10 - GTF_FILE > namesorted.counts

That file will be much much much smaller and is all that you need.

**dpryan** · 10-17-2013, 06:01 AM

I should add, "namesorted.bam" and "GTF_FILE" above are just place-holder names. You'll want to change "namesorted.counts" to something meaningful so you can keep track of which counts are associated with which sample. Finally, those options for htseq-count are just examples, you can change them to fit your needs.

**sindrle** · 10-17-2013, 06:24 AM

Great!
I aborted the run and updated the CMD as you wrote it.

I was using the "union" option, but changed it to "intersection-nonempty". But I dont actually know why.. :P

Thanks again!

**frymor** · 10-23-2013, 03:14 AM

Originally posted by sindrle View Post

Great!
I aborted the run and updated the CMD as you wrote it.

I was using the "union" option, but changed it to "intersection-nonempty". But I dont actually know why.. :P

Thanks again!

take a look here to see the difference between the union and the intersection-nonempty.

It basically tells htseq-count how to deal with reads matched to two genes, or what to do if the overlap isn't complete.

**sindrle** · 10-23-2013, 03:51 AM

Thanks I have looked at that one, but what is the implications for differential expression?

**frymor** · 10-23-2013, 04:20 AM

As far as I understand it, depends on the option you take you'll be counting more or less reads.

If I am not mistaken, reads marked as ambiguous are not taken into consideration in the count of read numbers. So if you take union you'll have more ambiguous reads than the intersection options.

BTW, take a look at featureCounts from the sub read package. It works also very good with the data. It finds more results than htseq-counts and works faster.

Topics	Statistics	Last Post
Advanced Epigenome Editing Platform Explores Gene Regulation Mechanisms by seqadmin Started by seqadmin, Yesterday, 02:46 PM	0 responses 11 views 0 likes	Last Post by seqadmin Yesterday, 02:46 PM
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 13 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM
Enhanced Neoantigen Detection: Introducing NeoHunter by seqadmin Started by seqadmin, 05-06-2024, 07:17 AM	0 responses 17 views 0 likes	Last Post by seqadmin 05-06-2024, 07:17 AM
A Close Examination at Probiotic-Related Bacteremia by seqadmin Started by seqadmin, 05-02-2024, 08:06 AM	0 responses 23 views 0 likes	Last Post by seqadmin 05-02-2024, 08:06 AM

Seqanswers Leaderboard Ad

Announcement

DESeq2 SummarizedExperiment help

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News