
**dpryan** · 10-16-2013, 09:35 AM

Have you used htseq-count yet to get counts?

**sindrle** · 10-16-2013, 11:14 AM

Hi!
Do I have to? I'm stuck here:

http://www.bioconductor.org/packages/2.13/bioc/vignettes/GenomicRanges/inst/doc/summarizeOverlaps.pdf

I'm trying "2 A First Example", but I don't understand what "features <- GRanges" is (seqnames/IRanges).

I'm also not quite sure how to type in "group_id": I have 2 different groups at two time points (as you already know).

Thank you very much!

**dpryan** · 10-16-2013, 02:15 PM

Well, you don't have to do anything you don't want to do :P However, it's generally a bit easier to just use a quick command line script than to try to deal with GenomicRanges. The latter can do it, but has a much steeper learning curve and will just produce the same results (unless you're trying to do something unusual).

Regarding your GRanges question, seqnames would be things like "chr1" (UCSC chromosome naming scheme) or "10" (Ensembl naming scheme). IRanges are genomic bounds, so just the 5' and 3' boundaries of an exon/CDS/whatever. In general, you can get all of that made for you by reading in a GTF or GFF file (I've used import.gff from rtracklayer). The "group_id" in the example is more of a place-holder for gene_id later on. In essence, it's probably meant to be the actual feature level that's counted over (though they don't do that in the example), so you don't actually need it in that example. In a real case, you'd read your annotation file into a GRanges object and then

Code:

 split(GRangesObject, mcols(GRangesObject)$label_of_interest)

to get a GRangesList that you would then run summarizeOverlaps on. Here, label_of_interest would probably be the gene_id field of the annotation file (though it could be something else, such as sprintf("%s:E%03d", mcols(blah)$gene_id, mcols(blah)$exon), which is what you'd do for DEXSeq counts).

**sindrle** · 10-17-2013, 05:26 AM

Thank you so much!

I installed HTSeq and have started htseq-count on all 48 BAMs.

This is the output from one BAM. Is it a problem?

no_feature 7984183
ambiguous 521840
too_low_aQual 0
not_aligned 0
alignment_not_unique 13730878

Also, can you compress the htseq-count output? It's 16 GB for each file.

**dpryan** · 10-17-2013, 05:59 AM

Ah, you're saving the wrong file; don't use -o (that writes every alignment back out as a SAM file, which is why it's so large). The command to execute is something like:

Code:

samtools view namesorted.bam | htseq-count -m intersection-nonempty -s no -a 10 - GTF_FILE > namesorted.counts

That file will be much much much smaller and is all that you need.

**dpryan** · 10-17-2013, 06:01 AM

I should add, "namesorted.bam" and "GTF_FILE" above are just place-holder names. You'll want to change "namesorted.counts" to something meaningful so you can keep track of which counts are associated with which sample. Finally, those options for htseq-count are just examples, you can change them to fit your needs.
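
For 48 BAMs, a small wrapper can generate one such command per file. This is only a minimal sketch under some assumptions not stated in the thread: the BAMs are already name-sorted, sit in one directory, are named `<sample>.bam`, and have no spaces in their paths. The function names (`counts_name`, `count_cmd`, `count_all`) are made up for illustration, and the htseq-count options are simply copied from the command above.

```shell
#!/bin/sh
# Hypothetical helper: derive a per-sample counts filename from a BAM path,
# e.g. /data/sampleA.bam -> sampleA.counts
counts_name() {
    printf '%s.counts\n' "$(basename "$1" .bam)"
}

# Hypothetical helper: print (not run) the pipeline for one BAM file.
# The htseq-count options are the example values from the post above.
count_cmd() {
    bam=$1
    gtf=$2
    printf 'samtools view %s | htseq-count -m intersection-nonempty -s no -a 10 - %s > %s\n' \
        "$bam" "$gtf" "$(counts_name "$bam")"
}

# Print one command per BAM found in a directory.
count_all() {
    dir=$1
    gtf=$2
    for bam in "$dir"/*.bam; do
        [ -e "$bam" ] || continue   # skip if the glob matched nothing
        count_cmd "$bam" "$gtf"
    done
}
```

Calling `count_all /path/to/bams GTF_FILE` only prints the commands so they can be checked first; piping that output to `sh` would run them, writing each `.counts` file to the current directory.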

**sindrle** · 10-17-2013, 06:24 AM

Great!
I aborted the run and updated the CMD as you wrote it.

I was using the "union" option, but changed it to "intersection-nonempty". But I don't actually know why.. :P

Thanks again!

**frymor** · 10-23-2013, 03:14 AM

Take a look here to see the difference between union and intersection-nonempty.

It basically tells htseq-count how to deal with reads matched to two genes, or what to do if the overlap isn't complete.

**sindrle** · 10-23-2013, 03:51 AM

Thanks, I have looked at that one, but what are the implications for differential expression?

**frymor** · 10-23-2013, 04:20 AM

As far as I understand it, depending on the option you choose, you'll be counting more or fewer reads.

If I am not mistaken, reads marked as ambiguous are not included in the per-gene counts. So if you use union, you'll have more ambiguous reads than with the intersection options.
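
One way to see how much this matters for a given sample is to compare the reads assigned to genes against the special counters at the end of the htseq-count output. The following is a rough sketch, not part of HTSeq: `summarise_counts` is a made-up name, and it assumes a two-column output file (feature ID, count) in which the special counters appear either as shown earlier in this thread (e.g. `no_feature`) or with a leading `__`, as newer HTSeq versions write them.

```shell
#!/bin/sh
# Hypothetical helper: summarise one htseq-count output file.
# Prints the total of reads assigned to features, then each special counter.
summarise_counts() {
    awk '
        # Special counters, with or without a leading "__".
        $1 ~ /^(__)?(no_feature|ambiguous|too_low_aQual|not_aligned|alignment_not_unique)$/ {
            sub(/^__/, "", $1)
            special[$1] = $2
            next
        }
        # Every other line is a feature (gene) with its read count.
        { assigned += $2 }
        END {
            printf "assigned\t%.0f\n", assigned
            n = split("no_feature ambiguous too_low_aQual not_aligned alignment_not_unique", order, " ")
            for (i = 1; i <= n; i++)
                if (order[i] in special)
                    print order[i] "\t" special[order[i]]
        }
    ' "$1"
}
```

Running `summarise_counts sample.counts` on the output of a union run and of an intersection-nonempty run for the same BAM would show how many reads move between "assigned" and "ambiguous".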

BTW, take a look at featureCounts from the Subread package. It also works very well with this kind of data. It finds more results than htseq-count and runs faster.


DESeq2 SummarizedExperiment help
