SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
STAR: ultrafast universal RNA-seq aligner alexdobin Bioinformatics 218 04-02-2018 05:59 PM
Primer design Universal tailed Amplicon paloma84 454 Pyrosequencing 2 11-26-2012 11:33 AM
Program edit an ace file to simultaneously extract read information from all contig? cllorens Bioinformatics 3 06-27-2012 08:20 AM
PubMed: Target-Enrichment Through Amplification of Hairpin-Ligated Universal Targets Newsbot! Literature Watch 0 03-25-2011 06:10 AM
PubMed: Development and quantitative analyses of a universal rRNA-subtraction protoco Newsbot! Literature Watch 1 03-12-2010 05:41 PM

Reply
 
Thread Tools
Old 05-15-2013, 07:16 PM   #1
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default featureCounts: a universal read summarization program

Dear All,

I would like to formally introduce to you our featureCounts program, a software program we developed for summarizing the next-gen sequencing reads to genomic features such as genes, exons and promoters.

featureCounts is a light-weight read counting program written entirely using the C programming language. It can be used to count both gDNA-seq and RNA-seq reads for genomic features. It has the following features:
(1) It carries out precise and accurate read assignments by taking care of indels, junctions and fusions in the reads.
(2) It takes less than 4 minutes to summarize 20 million pairs of reads to 26k RefSeq genes using one thread, and only uses 40MB of memory (you can run it on a Mac laptop).
(3) It supports multi-threaded running, making it extremely fast for summarizing large datasets.
(4) It supports GTF format annotation and SAM/BAM read data.
(5) It supports strand-specific read summarization.
(6) It can perform read summarization at both feature level (eg. exons) and meta-feature level (eg. genes).
(7) It allows users to specify whether reads overlapping with more than one feature should be counted or not.
(8) It gives users full control on the summarization of paired-end reads, including allowing them to check if both ends are mapped and/or if the paired-end distances satisfy the distance criteria.
(9) It discriminates the features, which were overlapped by both ends from the same fragment, from those which were overlapped by only one end so as to get more fragments counted.
(10) It allows users to specify whether chimeric fragments should be counted.

For a quick start, have a look at our short tutorial - http://bioinf.wehi.edu.au/featureCounts/ . For more details, please refer to the users guide - http://bioinf.wehi.edu.au/featureCounts/usersguide.pdf (see Chapter 6).

We also compared featureCounts with other methods. The comparison results can be found in our manuscript - http://arxiv.org/abs/1305.3347.

The featureCounts program is part of the Subread package (http://subread.sourceforge.net), which includes a suite of programs for processing next-gen sequencing data such as read mapping and exon-exon junction detection. featureCounts can also be accessed from the development version of the Bioconductor R package Rsubread (http://bioconductor.org/packages/2.1.../Rsubread.html)

Please do not hesitate to contact me if you have any questions (shi@wehi.edu.au).

Best regards,
-------------------
Wei Shi, Ph.D
Bioinformatics Division
The Walter and Eliza Hall Institute of Medical Research
1G Royal Parade, Parkville, Victoria 3052
Australia

Last edited by shi; 05-16-2013 at 04:08 AM.
shi is offline   Reply With Quote
Old 05-16-2013, 12:16 AM   #2
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
Default

dear wei,

using featurecount I
get:

Code:
./count_featurCount.sh: line 15: 29340 Segmentation fault      (core dumped) featureCounts -p -B -C -a $gtf -t exon -g gene_id -i $SAMdir/$name/RUM.sam -o $name.fcount

I use:

Code:
featureCounts -p -B -C -a $gtf -t exon -g gene_id -i $SAMdir/$name/RUM.sam -o $name.fcount
example sam file:
Code:
seq.1	77	*	0	0	*	=	0	0	CGGGGCGGGATCCCGCCGCCTCTCCGGCCCGCCGGNNNNCCGCCACCGGCCCACTNTNCNCNNCCCCCCCNCGCGG	############################################################################
seq.1	141	*	0	0	*	=	0	0	NNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNTNNCNGNNTCNNNNNNNNNN	############################################################################
seq.2	73	chr2	89157089	25	76M	=	89157089	0	CCTCTCTGGGATAGAAGTTATTCAGCAGGCACACANNNNAGGCAGTTCCAGATTTNANCTGNNCATCAGANGGCGG	CCCCCCCDBCCCCBCAABBBCCBCCCCCCCBB@AA####/00000CCCCA##########################	XO:A:F	MD:Z:35ACAG16C1A3CT7T5	NM:i:9	IH:i:1	HI:i:1
seq.2	133	chr2	89157089	25	*	=	89157089	0	NNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNCNCCNCNNTGCNNNNNNNNN	############################################################################	XO:A:F	IH:i:1	HI:i:1
seq.3	77	*	0	0	*	=	0	0	CGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTNNNGCAGGAATGCCGAGACCGCNTGTANTCTCGTATGCGGG	BB>4B?>?3>@>@642::<7A#######################################################
seq.3	141	*	0	0	*	=	0	0	NNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNNNNNNNNNNNNNNNNNCNTGNCGNGCGNNNNNNNNN	############################################################################
seq.4	77	*	0	0	*	=	0	0	CGGTTCACCAGGAATGCCGAGACCGGAAGAGCGGTTCNGCAGGAATGCCGAGACCGCTCGTAATCTCGCATACCGG	<4=.+;**1*45052@@@@@########################################################
seq.4	141	*	0	0	*	=	0	0	CNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNGNCGNGGNGAGNNNNNNNNN	############################################################################
and example of the gtf-file (from gencode):
##description: evidence-based annotation of the human genome (GRCh37), version 15 (Ensembl 70)
Code:
##provider: GENCODE
##contact: gencode@sanger.ac.uk
##format: gtf
##date: 2013-01-21
chr1	HAVANA	gene	11869	14412	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1;  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	12613	12721	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 2;  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	HAVANA	exon	13221	14409	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 3;  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
chr1	ENSEMBL	transcript	11872	14412	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-201"; level 3; havana_gene "OTTHUMG00000000961.2";
chr1	ENSEMBL	exon	11872	12227	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-201"; exon_number 1;  level 3; havana_gene "OTTHUMG00000000961.2";
do you have any clue what could be wrong?

best wishes,

dietmar
dietmar13 is offline   Reply With Quote
Old 05-16-2013, 03:18 AM   #3
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Dear dietmar,

Thanks for providing the detailed information about the problem you encountered. This helped us to reproduce the problem.

The featureCounts program does not allow ## lines or comment lines to be included in the provided annotation file. The program crashed when such lines exist in the annotation file. Sorry about this. We are now working on this and will let you know once it is fixed.

In the meantime, you can get around this problem by removing those ## lines from your annotation file. I have tested this and it worked. Hope it will work for you. But please let me know if it doesn't. I will get back to you soon.


Best regards,
Wei
shi is offline   Reply With Quote
Old 05-16-2013, 10:37 PM   #4
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Hi dietmar,

We have fixed this and uploaded the latest version (v1.3.3-p1) to sourceforge (http://subread.sourceforge.net).

Hope this works for you.

Best wishes,
Wei
shi is offline   Reply With Quote
Old 05-19-2013, 06:19 AM   #5
dietmar13
Senior Member
 
Location: Vienna

Join Date: Mar 2010
Posts: 107
Default

dear wei,

thank you.

but now i get another error:

Code:
**********
**********
  WARNING
**********
No meta-feature id is found on the 6-th line. If it is a GTF file, you may need to check the name of the gene_id field and specify a correct field name using a '-g' option.
**********
**********
WARNING: the feature on the 31178-th line has zero coordinate or zero lengths
... many similar warnings ...
WARNING: the feature on the 2577002-th line has zero coordinate or zero lengths
There are 1201574 features loaded from the annotation file.
The 1201574 features are sorted.
Number of chromosomes included in the annotation is 25
./count_featurCount.sh: line 16: 24368 Segmentation fault      (core dumped) featureCounts -p -B -C -a $gtf -t exon -g gene_id -i $SAMdir/$name/RUM.sam -o $name.fcount

the 6th line has definitely a gene_id (probably a bug?) but the lines with zero length features are true.

the gtf-file:
Code:
##description: evidence-based annotation of the human genome (GRCh37), version 15 (Ensembl 70)
##provider: GENCODE
##contact: gencode@sanger.ac.uk
##format: gtf
##date: 2013-01-21
chr1	HAVANA	gene	11869	14412	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
...
31178-th line:
chr1	HAVANA	exon	19983359	19983359	.	+	.	gene_id "ENSG00000158747.9"; transcript_id "ENST00000439278.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NBL1"; transcript_type "protein_coding"; transcript_status "NOVEL"; transcript_name "NBL1-007"; exon_number 4;  level 1; tag "mRNA_end_NF"; tag "cds_end_NF"; tag "exp_conf"; havana_gene "OTTHUMG00000002700.5"; havana_transcript "OTTHUMT00000313743.1";
do you think the zero-length features are the problem, or is it possible that if the BAM files has additional chromosomes (like chr1_gl000191_random) which are not present in the gtf-annotation file your program has a problem. i think this should be catched, because this will happen often...

best wishes,

dietmar
dietmar13 is offline   Reply With Quote
Old 05-19-2013, 03:43 PM   #6
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Dear dietmar,

Although you got those warning messages, your annotation file was actually successfully processed according to the program output (sorry for the misleading warning message about gene id, we have fixed this and updated the program on sourceforge).

Did you provide a SAM or BAM file to featureCounts? If you provided a BAM file, you need to add an '-b' option to your command. Otherwise it will cause a segmentation fault.

We actually downloaded the ENCODE annotation file you used and tested it here. We found it worked with the version of featureCounts you are using for both SAM and BAM input.

Could you please check if you correctly specified the type of your read file? I think this was probably the reason why you got a segmentation fault.

Best regards,

Wei
shi is offline   Reply With Quote
Old 05-19-2013, 04:21 PM   #7
NateP
Junior Member
 
Location: Nebraska

Join Date: Sep 2011
Posts: 9
Default

I don't know if this an intentional limitation to the program, but here is an error I'm encountering.

The reference genome I am aligning reads to is a draft version, with 430k contigs.

The error I am getting when attempting to use featureCounts is:

"There are 290343 features loaded from the annotation file.
WARNING: THere are too many chromosomes in the annotations. The remainder of the annotation file is ignored.
The 8586 features are sorted.
Number of chromosomes inclued in the annotation is 499.

WARNING: There are too many reference sequences in the BAM file!
[repeated many times]
Segmentation fault"
NateP is offline   Reply With Quote
Old 05-19-2013, 10:38 PM   #8
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Dear @NateP,

featureCounts had hard-coded limits on the numbers of chromosome allowed to be included in the annotation file and in the BAM/SAM file. These limits were far less than the number of contigs you have here and that was the reason why featureCounts crashed.

We have now removed these limits and allowed any number of chromosomes (or contigs) to be included in the annotation file and in the SAM/BAM file. Please check out the latest version of the Subread package (1.3.3-p2) on sourceforge. The featureCounts program included in this version should work for your data now.

Please let me know if the problem persists.

Best wishes,

Wei
shi is offline   Reply With Quote
Old 05-20-2013, 06:53 AM   #9
NateP
Junior Member
 
Location: Nebraska

Join Date: Sep 2011
Posts: 9
Default

Wei,

I figured it was something along those lines.
The newest version works with my data sets, thanks for the great program!
NateP is offline   Reply With Quote
Old 05-20-2013, 02:41 PM   #10
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Thanks for letting me know, NateP.

Cheers,
Wei
shi is offline   Reply With Quote
Old 05-21-2013, 01:43 AM   #11
Nicolas Nalpas
Member
 
Location: Ireland

Join Date: Apr 2011
Posts: 19
Default

Dear Wei,
I was just wondering if you could give advise on a featureCounts job that I tried to perform, I have RNA-seq data aligned to bovine genome (UMD3.1.71 from ENSEMBL) and I used subread-1.3.3-p2 (which is the latest version I think).
The error I get is:
Code:
No meta-feature id is found on the 1-th line. If it is a GTF file, you may need to check the name of the gene_id field and specify a correct field name using a '-g' opt
ion.
My code is:
Code:
featureCounts -a /path/annotation.gtf -t exon -g gene_id -b -i /path/file.bam -o file.out -s -R -p -B
And if I understand correctly the outputs I obtained, the reads assignment was still performed against my feature (exon in this case, represented as LINE_*), however I thought the output should have been my counts per gene_id:
Code:
geneid  length  nreads
LINE_0000001    88      0
LINE_0000002    167     0
LINE_0000003    51      0
LINE_0000004    337     0
LINE_0000005    1399    0
LINE_0000006    107     0
LINE_0000007    110     0
LINE_0000008    110     0
LINE_0000009    133     0
LINE_0000010    169     0
LINE_0000011    204     0
LINE_0000012    211     0
LINE_0000013    229     0
LINE_0000014    45      0
LINE_0000015    98      0
LINE_0000016    183     0
LINE_0000017    139     0
LINE_0000018    1429    0
LINE_0000019    218     0
LINE_0000020    91      0
LINE_0000021    202     0
LINE_0000022    142     2
LINE_0000023    157     6
My gtf file looks like this:
Code:
4       protein_coding  exon    432160  432247  .       -       .        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "1"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201"; exon_id "ENSBTAE00000393952";
4       protein_coding  CDS     432160  432235  .       -       0        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "1"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201"; protein_id "ENSBTAP00000004727";
4       protein_coding  start_codon     432233  432235  .       -       0        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "1"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201";
4       protein_coding  exon    430223  430389  .       -       .        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "2"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201"; exon_id "ENSBTAE00000037981";
4       protein_coding  CDS     430223  430389  .       -       2        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "2"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201"; protein_id "ENSBTAP00000004727";
4       protein_coding  exon    429103  429153  .       -       .        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "3"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201"; exon_id "ENSBTAE00000369947";
4       protein_coding  CDS     429103  429153  .       -       0        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "3"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201"; protein_id "ENSBTAP00000004727";
4       protein_coding  exon    426030  426366  .       -       .        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "4"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201"; exon_id "ENSBTAE00000037983";
Any idea on how I can sort this out?
Thanks a lot for your help.
Regards,
Nicolas
Nicolas Nalpas is offline   Reply With Quote
Old 05-21-2013, 02:11 AM   #12
iris_aurelia
Member
 
Location: Netherlands

Join Date: Jul 2012
Posts: 22
Default

I tried subread-align and featureCount as well and encountered the same 'problem'. I was hoping that it would give me a gene identifier in the output file instead of line numbers. I've also used the -t and the -g option.
iris_aurelia is offline   Reply With Quote
Old 05-21-2013, 03:07 AM   #13
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Dear Nicolas and iris_aurelia,

Please replace -g gene-id with -g ' gene_id' in your command (note that there is a space right before gene_id in the new argument) and try it again.

I suspect the problem you encountered was because there is a space right before the gene_id attribute in the 9th column of your annotation file. I have seen this for the yeast GTF annotation file generated by Ensembl. This is not compliant with the GTF-format specification which requires all the columns to be tab-delimited.

However, the human GTF annotation generated by Ensembl is fine. So it seems this problem only occurred for some species. I'm going to contact Ensembl to let them know this issue.

Please let me know if the new argument does not work.

Best wishes,

Wei
shi is offline   Reply With Quote
Old 05-21-2013, 03:56 AM   #14
iris_aurelia
Member
 
Location: Netherlands

Join Date: Jul 2012
Posts: 22
Default

Hi Wei,

Thanks, this solution works! However, I was using the Human reference (version 66), which had this problem as well. Maybe newer versions of the GTF file do not have this problem..?

Another thing I bumped into was the lack of reading gzipped fastQ files. Is it possible to either read from standard in or to create an option to use gzipped files?

Iris

Last edited by iris_aurelia; 05-21-2013 at 04:09 AM.
iris_aurelia is offline   Reply With Quote
Old 05-21-2013, 05:27 AM   #15
Nicolas Nalpas
Member
 
Location: Ireland

Join Date: Apr 2011
Posts: 19
Default

Hi Wei,

Thanks a lot for all your help, indeed it works perfectly now and so fast.
This is great.
Thanks again,
Nicolas
Nicolas Nalpas is offline   Reply With Quote
Old 05-21-2013, 03:43 PM   #16
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Dear Iris and Nicolas,

Thanks for letting me know it works for you now.

It is a non-trivial task to add an option to Subread to make it support the processing of gzipped FASTQ file directly. We are currently working on some further developments to Subread which takes a higher priority, but you may subscribe to the Subread updates (http://subread.sourceforge.net) so that you can get notified when we add this option.

Best wishes,

Wei
shi is offline   Reply With Quote
Old 05-28-2013, 01:01 AM   #17
iris_aurelia
Member
 
Location: Netherlands

Join Date: Jul 2012
Posts: 22
Default

Hi Wei,

I compared HTseq-count with featureCounts, using a 50 bp single-end dataset, and noticed that for many genes featureCounts reports a higher count than HTseq-count does. It looks like that featureCounts does count reads mapped to multiple locations twice (or more), while HTseq-count excludes these reads. In principle we cannot say to which gene it actually belongs, so I would rahter discard those reads instead of counting them for each gene.
Wouldn't it be an option to include a parameter that can for example filter for low mapping quality reads or only use reads that are mapped uniquely?

Iris
iris_aurelia is offline   Reply With Quote
Old 05-28-2013, 05:02 PM   #18
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Hi Iris,

I'm a little bit confused here. Are you talking about the reads mapped to different chromosomal locations, or you are talking about the reads that have only one reported mapping location but were found to overlap with more than one gene?

If it is the latter, featureCounts does not count those reads overlapping with more than one gene in its default setting and that shouldn't be the reason why featureCounts counted more reads than HTseq-count. A possible reason for this is that HTseq-count takes the rightmost chromosomal location of the feature as a open position (ie. not included in the summarization) whereas featureCounts takes it as a closed position (ie. included in the summarization).

We have seen what you reported in our evaluation as well (http://arxiv.org/abs/1305.3347). In our evaluation, ~500 genes got more counts from featureCounts and ~150 genes got more counts from HTseq-count when summarizing single-end reads. Did you see HTseq-count give you more counts for some genes in your data?

Thanks for suggesting us to add a parameter to filter for low mapping quality reads. We are now adding this parameter and will let you know once this is done.

Best wishes,
Wei

Last edited by shi; 05-28-2013 at 05:27 PM.
shi is offline   Reply With Quote
Old 05-30-2013, 11:48 PM   #19
iris_aurelia
Member
 
Location: Netherlands

Join Date: Jul 2012
Posts: 22
Default

Hi Wei,

I do mean the reads that are mapped to different chromosomal locations. So for example if a read maps to gene A on chr8 and maps to gene F on chr 9 the read is counted for both genes.
I've indeed read about the open and closed position what could result in a difference between the two tools, but that's not the case here.

I downloaded the newest version, subread-1.3.3-p4 and used the quality filter option to see if this will make a difference. When I only used the reads with the highest mapping qualities featureCounts reported exactly the same counts as HTseq-count does. (see attached figure). HTseq-count hower never reported more counts than featureCounts in my case.

Iris
Attached Images
File Type: png featureCounts_vs_HTseq-count2.png (30.6 KB, 62 views)

Last edited by iris_aurelia; 05-31-2013 at 12:40 AM.
iris_aurelia is offline   Reply With Quote
Old 05-31-2013, 07:45 PM   #20
shi
Wei Shi
 
Location: Australia

Join Date: Feb 2010
Posts: 235
Default

Dear Iris,

Thanks for your clarification and providing the comparison results.

However, we could not reproduce what you have observed. Attached is a toy example. In this example, a read was mapped to five different locations and all locations were reported in the SAM file. We found that featureCounts and HTseq-count gave the identical results when summarizing it.

Could you please provide the commands you used when running featureCounts and HTseq-count? It will also be very helpful if you can provide a runnable example dataset so that we can try to figure out what's going on.

Thanks,

Wei
Attached Files
File Type: zip data.zip (1.0 KB, 13 views)
shi is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:15 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO