Seqanswers Leaderboard Ad

**shi** · 05-21-2013, 03:43 PM

Dear Iris and Nicolas,

Thanks for letting me know it works for you now.

It is a non-trivial task to add an option to Subread to make it support the processing of gzipped FASTQ file directly. We are currently working on some further developments to Subread which takes a higher priority, but you may subscribe to the Subread updates (http://subread.sourceforge.net) so that you can get notified when we add this option.

Best wishes,

Wei

**iris_aurelia** · 05-28-2013, 01:01 AM

Hi Wei,

I compared HTseq-count with featureCounts, using a 50 bp single-end dataset, and noticed that for many genes featureCounts reports a higher count than HTseq-count does. It looks like that featureCounts does count reads mapped to multiple locations twice (or more), while HTseq-count excludes these reads. In principle we cannot say to which gene it actually belongs, so I would rahter discard those reads instead of counting them for each gene.
Wouldn't it be an option to include a parameter that can for example filter for low mapping quality reads or only use reads that are mapped uniquely?

Iris

**shi** · 05-28-2013, 05:02 PM

Hi Iris,

I'm a little bit confused here. Are you talking about the reads mapped to different chromosomal locations, or you are talking about the reads that have only one reported mapping location but were found to overlap with more than one gene?

If it is the latter, featureCounts does not count those reads overlapping with more than one gene in its default setting and that shouldn't be the reason why featureCounts counted more reads than HTseq-count. A possible reason for this is that HTseq-count takes the rightmost chromosomal location of the feature as a open position (ie. not included in the summarization) whereas featureCounts takes it as a closed position (ie. included in the summarization).

We have seen what you reported in our evaluation as well (http://arxiv.org/abs/1305.3347). In our evaluation, ~500 genes got more counts from featureCounts and ~150 genes got more counts from HTseq-count when summarizing single-end reads. Did you see HTseq-count give you more counts for some genes in your data?

Thanks for suggesting us to add a parameter to filter for low mapping quality reads. We are now adding this parameter and will let you know once this is done.

Best wishes,
Wei

**iris_aurelia** · 05-30-2013, 11:48 PM

Hi Wei,

I do mean the reads that are mapped to different chromosomal locations. So for example if a read maps to gene A on chr8 and maps to gene F on chr 9 the read is counted for both genes.
I've indeed read about the open and closed position what could result in a difference between the two tools, but that's not the case here.

I downloaded the newest version, subread-1.3.3-p4 and used the quality filter option to see if this will make a difference. When I only used the reads with the highest mapping qualities featureCounts reported exactly the same counts as HTseq-count does. (see attached figure). HTseq-count hower never reported more counts than featureCounts in my case.

Iris

Attached Files

featureCounts_vs_HTseq-count2.png (30.6 KB, 107 views)

**shi** · 05-31-2013, 07:45 PM

Dear Iris,

Thanks for your clarification and providing the comparison results.

However, we could not reproduce what you have observed. Attached is a toy example. In this example, a read was mapped to five different locations and all locations were reported in the SAM file. We found that featureCounts and HTseq-count gave the identical results when summarizing it.

Could you please provide the commands you used when running featureCounts and HTseq-count? It will also be very helpful if you can provide a runnable example dataset so that we can try to figure out what's going on.

Thanks,

Wei

Attached Files

data.zip (1.0 KB, 38 views)

**Nicolas Nalpas** · 06-01-2013, 02:48 AM

Dear Wei,

I just had another question about featureCounts and the strand specificity parameter.
We have a strand-specific RNA-seq dataset, on which we initially used HTseq-count, so as you know HTseq-count has three options for the strand-specific parameter: --stranded no or --stranded yes or --stranded reverse and in our case the appropriate options is actually the latter: --stranded reverse. So now we want to use featureCounts (as we found it very quick and accurate) on this dataset, however the current standed options are: unstranded by default or -s for stranded. So after testing we understood that featureCounts unstranded correspond to HTseq-count --stranded no and that featureCounts -s correspond to HTseq-count --stranded yes. So as I said above the option that we want for this dataset is really --stranded reverse, which we were not able to find in featureCounts.
So I was wondering if I missed something in the featureCounts users guide and this option is already available? Or if it is not possible to do that, I was wondering if you have any plan to incorporate such option?

Thanks a lot for your help.
Regards,
Nicolas

**shi** · 06-01-2013, 03:45 PM

Dear Nicolas,

We have just modified featureCounts to make it be able to perform reverse-stranded read counting. The latest version can be found in Subread v1.3.3-p5 (http://subread.sourceforge.net).

Providing a parameter '-s 2' to featureCounts will instruct it to perform reverse-stranded read counting.

Thank you for your suggestion and I hope it works for your dataset.

Best wishes,

Wei

**Nicolas Nalpas** · 06-02-2013, 04:41 AM

Dear Wei,
This is great, thanks a lot for being so helpful and available.
I will give this a try on our dataset.
Thanks a lot.
Regards,
Nicolas

**iris_aurelia** · 06-02-2013, 11:55 PM

Hi Wei,

I have tested your test-dataset and it indeed gave the same results for both HTseq-count as FeatureCounts.
I think the difference is made by the NH tag in my BAM/SAM file. When I tried a single read that mapped 5 times it gives a different result.
According the documentation of HTseq-count it excludes all reads with more than one reported alignment (whenever the NH tag is available), which I believe FeatureCounts does not.
Using the mapping quality filter in FeatureCounts, it will produce the same result, since reads with more than one reported alignment have a lower mapping quality.

I have used Tophat for the alignment of my reads, followed by HTseq-count as well as FeatureCounts. I've attached a test set with 5 multimapped hits from the data I got from TopHat.
The commands I used were:

HTseq-count:

Code:

htseq-count MMread.sam annotation.gtf -s no -q > genecounts_MMread.txt

FeatureCounts:

Code:

featureCounts -a annotation.gtf -t exon -g 'gene_id' -i MMread.sam -T 5 -o feature_counts_MMread.txt

Thanks for the quick responses!

Iris

Attached Files

testdataMM.zip (1.6 KB, 36 views)

**shi** · 06-03-2013, 06:34 PM

Dear Iris,

Thanks for looking into this deeply and providing a reproducible example.

You were correct that featureCounts did not exclude multi-mapping reads from the summarization. We have just added a '-M' option to featureCounts to enable it to disregard multi-mapping reads during read summarization. When '-M' is provided, featureCounts will use the 'NH' tag to identify multi-mapping reads and then exclude them.

These changes have been incorporated into the latest version (1.3.4) of the Subread package. You can download it from http://subread.sourceforge.net . We have tested it and it worked fine.

Please let me know if you run into any problems.

Best wishes,

Wei

**shi** · 06-03-2013, 08:57 PM

Originally posted by shi View Post

Dear Iris,

Thanks for looking into this deeply and providing a reproducible example.

You were correct that featureCounts did not exclude multi-mapping reads from the summarization. We have just added a '-M' option to featureCounts to enable it to disregard multi-mapping reads during read summarization. When '-M' is provided, featureCounts will use the 'NH' tag to identify multi-mapping reads and then exclude them.

These changes have been incorporated into the latest version (1.3.4) of the Subread package. You can download it from http://subread.sourceforge.net . We have tested it and it worked fine.

Please let me know if you run into any problems.

Best wishes,

Wei

Dear Iris,

Sorry, we have just flipped the meaning of the '-M' option. So now when running featureCounts in its default setting, multi-mapping reads will be excluded. The '-M' option is only useful if you want to count those multi-mapping reads (this is certainly not your case here).

The latest version is still 1.3.4 (the usage info for featureCounts and users guide has been updated).

Best wishes,

Wei

**iris_aurelia** · 06-03-2013, 11:03 PM

Hi Wei,

Thanks, I'm definitely gonna try this out, since featureCounts is much faster than HTseq-count.
I'll let you know if I run into any other problems!

Iris

**Nicolas Nalpas** · 06-11-2013, 09:15 AM

Dear Wei,

We have now tried featureCounts v1.3.5-p1 to perform unstranded, stranded and reverse-stranded read counting on our new strand-specific dataset. While the unstranded parameter for featureCounts gives results very similar to HTseq-count, it seems that when it comes to stranded or reverse-stranded parameters the results between featureCounts and HTseq-count are quite divergent (see pdf file enclosed). I am actually expecting results similar to the one provided by HTseq-count.

I may be wrong but I believe that at the moment, featureCounts is assigning reads only to genes present on the + strand when using the stranded (s 1) parameter. And when using the reverse stranded (s 2) parameter, reads will be assigned only to genes present on the – strand.

However, it does not correspond to what I would call strand-specific read counts (I cannot talk for other RNA-seq users, but other RNA-seq users in our lab agree with me). Indeed, s 1 and s 2 parameters should in fact be one single parameter since users want read counts summarisation for genes present on + and – strands. Strand-specific is not the difference between genes present on the + or – strand.

What strand-specific actually means is that:
- (1) A read mapping to the + strand (for example read with flag 0 or 99 [its mate having flag of 147]) can be assigned to a gene present on the + strand (in this case it is a sense read)
- (2) A read mapping to the – strand (for example read with flag 16 or 83 [its mate having flag of 163]) can be assigned to a gene present on the - strand (in this case it is also a sense read)
- (3) A read mapping to the + strand (for example read with flag 0 or 99 [its mate having flag of 147]) can be assigned to a gene present on the - strand (in this case it is an antisense read)
- (4) A read mapping to the – strand (for example read with flag 16 or 83 [its mate having flag of 163]) can be assigned to a gene present on the + strand (in this case it is also an antisense read)

So I believe that, firstly when using the unstranded parameter, gene counts summarisation should contain summarisation for all reads corresponding to these 4 possibilities listed above; secondly when using the stranded (s 1) parameter, gene counts summarisation should contain summarisation for all reads corresponding to possibility (1) and (2); and thirdly when using the reverse-stranded (s 2) parameter, gene counts summarisation should contain summarisation for all reads corresponding to possibility (3) and (4).

So it also means that if a read is overlapping a gene present on the + strand and a gene present on the – strand, this read should be considered as overlapped_genes (and not counted unless specified by –O parameter on command line). In other terms, when you add number of reads with accepted-genes obtained from s 1 and s 2, you should obtain very close to the number obtained from s 0, indeed a read counted in s 1 should not also be counted in s 2.

Thanks a lot for all your help with this, I hope it was understandable (not the easiest to explain via email, otherwise let me know). And if I may suggest, it will be great if you can check these unstranded, stranded and reverse-stranded parameters for read summarisation.
Thanks a lot,
Nicolas

Attached Files

Comparison_count.pdf (170.0 KB, 67 views)

**shi** · 06-11-2013, 04:05 PM

Dear Nicolas,

Thanks for reporting the problem in much details. I certainly agree with your explanation of the stranded and unstranded read summarization and this is actually the counting scheme we implemented in featureCounts program. So I'm a bit confused with what you have observed.

Is it possible that you can provide us the first 10 reads you used for summarization and also your command for running featureCounts? This will help us to figure out what went wrong. Thanks.

Best wishes,

Wei

**Nicolas Nalpas** · 06-12-2013, 12:31 AM

Dear Wei,

Thanks for your reply.
Enclosed is a file containing my first 5,000 mapped reads (forgot to say in my earlier post that I used STAR to align my reads against Bos taurus 3.1.71 genome).
I checked again featureCounts on this file and I was able to reproduce my results of yesterday (using Bos taurus 3.1.71 annotation).
Also here is my commands for unstranded, stranded and reverse stranded:

Code:

featureCounts -a /workspace/storage/genomes/bostaurus/UMD3.1.71/annotation_file/Bos_taurus.UMD3.1.71.gtf -t exon -g gene_id -i 5000_reads.sam -o test.unstranded -s 0 -R -p -B
featureCounts -a /workspace/storage/genomes/bostaurus/UMD3.1.71/annotation_file/Bos_taurus.UMD3.1.71.gtf -t exon -g gene_id -i 5000_reads.sam -o test.stranded -s 1 -R -p -B
featureCounts -a /workspace/storage/genomes/bostaurus/UMD3.1.71/annotation_file/Bos_taurus.UMD3.1.71.gtf -t exon -g gene_id -i 5000_reads.sam -o test.reverse -s 2 -R -p -B

I hope this help, thanks a lot for all your help.
Regards,
Nicolas

Attached Files

5000_reads.sam.gz (408.8 KB, 41 views)

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Yesterday, 08:47 AM	0 responses 15 views 0 likes	Last Post by seqadmin Yesterday, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News