Seqanswers Leaderboard Ad

**shi** · 01-28-2015, 01:50 PM

Originally posted by tamfernandez View Post

Dear Wei,

I would like to use featureCounts for counting multiple aligned reads, but instead of adding 1 to the count for each location y would like to add fractional counts (1/number of locations where the read mapped). Is this posible?

Thank you in advance for your reply.
Tamara

Hi Tamara,

Although this is something we are considering to add to featureCounts, I do not really think you will get much gain from doing this since this will add noise to your read count data, which may then compromise your downstream differential expression analysis.

If you worry about those reads overlapping with multiple genes, you should consider using a more conservative annotation such as RefSeq gene annotation which contains much less overlapping genes compared to other annotations such as Ensembl genes.

Wei

**tamfernandez** · 01-28-2015, 06:24 PM

Originally posted by shi View Post

Hi Tamara,

Although this is something we are considering to add to featureCounts, I do not really think you will get much gain from doing this since this will add noise to your read count data, which may then compromise your downstream differential expression analysis.

If you worry about those reads overlapping with multiple genes, you should consider using a more conservative annotation such as RefSeq gene annotation which contains much less overlapping genes compared to other annotations such as Ensembl genes.

Wei

Hi Wei,

Thank you for your reply! I absolutely agree with you but I'm trying to count short reads that map to some multicopy genes and some conserved genes like tRNAs among others. I thought that assigning a fraction to each gene copy could be a good solution to count reads that map specifically to these genes. Do you have any suggestions for dealing with this using featureCounts?

Again, thank you for your attention.
Tamara

**shi** · 01-28-2015, 06:48 PM

In feautureCounts built-in annotation, multicopy genes (with the same Entrez gene id) were combined to form one gene that contains exons from all its copies. Assigning 1 count for a read overlapping with this meta-gene should be equivalent to assigning a 1/n count to each of its n copies. You may try the inbuilt annotation to see it works for you.

Wei

**tamfernandez** · 01-29-2015, 03:58 AM

Originally posted by shi View Post

In feautureCounts built-in annotation, multicopy genes (with the same Entrez gene id) were combined to form one gene that contains exons from all its copies. Assigning 1 count for a read overlapping with this meta-gene should be equivalent to assigning a 1/n count to each of its n copies. You may try the inbuilt annotation to see it works for you.

Wei

Thank you Wei for your ansewer. This will solve part of the problem. Maybe for genes like tRNA-Pro isodecoders, which have almost the same sequence but still different ids, I can try to modify the id's so I can use what you suggested.

Thank you,
Tamara

**colindaven** · 01-29-2015, 04:15 AM

Originally posted by Gordon Smyth View Post

Counts are needed for a proper statistical analyses, as is explained in the publications behind edgeR, voom and DESeq.

However, generating RPKM from featureCounts is easy, see:

404 Not Found

http://bioinf.wehi.edu.au/RNAseqCaseStudy

for an example.

Dear Gordon,

thanks for your response and the link, I agree entirely. Incidentally I have been using edgeR and more recently voom/limma for statistical analysis for many years. However, my users are still very keen to see RPKM values for many, many samples.

RPKM, despite it's statistical weaknesses, is used widely. At the moment I do take the R route to generate RPM or RPKM, yet this is labour intensive for non-model organisms as I need to create multiple annotation objects. Seeing as all the information is available to featureCounts, I think there would be a demand for this.

Like you say, the downside is that people then might want to put RPKM into statistical packages like edgeR, but this topic has been dealt with repeatedly and might be avoided by an appropriate warning.

Thanks again,
Colin

**Gordon Smyth** · 01-29-2015, 09:47 PM

Originally posted by colindaven View Post

At the moment I do take the R route to generate RPM or RPKM, yet this is labour intensive for non-model organisms as I need to create multiple annotation objects. Seeing as all the information is available to featureCounts, I think there would be a demand for this.

Hi Colin,

We seem to be in complete agreement. Our group probably uses RPKM much the same as you do, for descriptive plots and expression summaries.

We don't output RPKM from featureCounts just on the principle that it seems redundant information. featureCounts already outputs all the information one needs to compute RPKM. You can use the rpkm() function in the edgeR package to extract the RPKM (as in the example I linked to) or else just do it yourself in a couple of lines:

fc <- featureCounts(stuff)
RPKM <- fc$counts / fc$annotation$Length * 1e3
RPKM <- t(t(RPKM) / colSums(fc$counts) * 1e6)

In our own analyses, we generally like to filter unwanted rows (mitochondrial genes, Ig DNA segments, ribosomal genes, non-expressed genes and what not) and estimate normalization factors before computing RPKM. So there are few steps to do before we extract RPKM.

BTW, did you know that voom and limma can work directly with RPKM? Just do this:

v <- voom(counts, design, plot=TRUE)
v$E <- v$E - log2(GeneLength/1e6)

Then v$E contains log2(RPKM), and all the limma analyses will work correctly in terms of these expression values.

**bruce01** · 03-23-2015, 04:47 AM

Strand issues

Hi Wei,

I have been using single end 50bp stranded library, and was getting low counts and losing quite a lot of data. I tried '-s =0, 1, 2' and found discrepancies with counts. I expected 1+2 to sum to output of 0, is this correct?

Code:

ENS_ID          0    1    2
ENSG00000000003 48   15   33
ENSG00000000005 44   17   27
ENSG00000000419 77   50   91
ENSG00000000457 235  72   193
ENSG00000000460 677  581  360

I have 23,038 matching, and 40,642 which do not sum to output from '-s 0'.

I have tried resorting (aligned with STAR giving 'sort-by-coord' option) which made no difference to featureCounts output, and my command is:

Code:

featureCounts -a ${HOME}/Homo_sapiens.GRCh37.75.saf -F SAF -t gene -o $out -s "$x" -T 8 -d 30 $1

Any ideas would be very welcome and I appreciate your continued support of you tools.

**shi** · 03-23-2015, 02:07 PM

Hi @bruce01,

The '-s 0' option is likely to give you less counts compared to the sum of counts from '-s 1' and '-s 2'. This is because reads are more likely to be found to overlap with more than one gene when '-s 0' is used (genes with overlapping coordinates but on different strands will be taken as overlapping genes at unstranded counting mode, but not taken as overlapping genes at stranded counting mode).

Wei

**bruce01** · 03-24-2015, 02:03 AM

Hi Wei,

that makes sense for s=0, of course. I suppose my real question is that if we know what strand the RNA originally comes from do we not always need to use s=1+2, seeing as some will be from '+' and some '-'? Is this something other people do?

Bruce.

**dpryan** · 03-24-2015, 02:22 AM

Strand refers to sense and antisense, not +/-.

**bruce01** · 03-24-2015, 02:49 AM

Originally posted by dpryan View Post

Strand refers to sense and antisense, not +/-.

Sorry about the nomenclature, I have +/- in my head as they are in GTF and that is how I denote it in there.

Can I ask your opinion on what way to run featureCounts with stranded libraries? Presumably if strand is known, then using both s=1 and s=2 gives the more accurate counts that using s=0, particularly for overlapping genes, when strand is known?

**dpryan** · 03-24-2015, 03:10 AM

If your library is stranded, then use either s=1 or s=2, as appropriate. It seems that s=2 would typically be correct, since most stranded kits use a dUTP-based method (so the sense strand is the opposite of how read #1 aligns). When in doubt, featureCounts is really fast, so you can run it with s=1 and s=2 separately and then just determine which one was the correct setting from the numbers output (afterall, sometimes you don't know the strandedness). The incorrect one will have vastly lower numbers.

**bruce01** · 03-24-2015, 03:39 AM

Originally posted by dpryan View Post

The incorrect one will have vastly lower numbers.

This is my problem, I have samples where s=1, which is incorrect based on the library prep protocol which was dUTP based, have more counts than s=2. This is from FFPE tumor samples, and so I expect it to be bad, but not like this. Will go with s=2 anyway, thanks for your help.

**shi** · 03-24-2015, 02:32 PM

Hi Bruce,

I agree with Devon that if your library is stranded you should count reads at the stranded mode (s=1 or s=2) rather than unstranded mode (s=0).

Also as I said earlier, my concern with performing stranded counting for unstranded reads is that you may have reduced capability to detect reads that overlap with more than one gene. This may result in reads being assigned to wrong genes.

Another thing you need to consider is that the annotated strands for genes might be incorrect, particularly for those computationally predicted genes. Stranded read counting will be problematic for such genes.

Wei

**GenoMax** · 10-22-2015, 10:09 AM

Hi Wei,

Do you have plans to add support for SAM v.1.4 format tags to featureCounts?

We use BBMap for alignments and it outputs SAM v. 1.4 tags by default. We have been having trouble assigning read counts using those alignments with featureCount. featureCount does not generate an error but only assigns a small number of (< 1%) reads.

Note: BBMap can output SAM v.1.3 tags and then the counting works, which we are using as a workaround for now.

Thanks.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 27 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News