Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • featureCounts: a universal read summarization program

    Dear All,

    I would like to formally introduce to you our featureCounts program, a software program we developed for summarizing the next-gen sequencing reads to genomic features such as genes, exons and promoters.

    featureCounts is a light-weight read counting program written entirely using the C programming language. It can be used to count both gDNA-seq and RNA-seq reads for genomic features. It has the following features:
    (1) It carries out precise and accurate read assignments by taking care of indels, junctions and fusions in the reads.
    (2) It takes less than 4 minutes to summarize 20 million pairs of reads to 26k RefSeq genes using one thread, and only uses 40MB of memory (you can run it on a Mac laptop).
    (3) It supports multi-threaded running, making it extremely fast for summarizing large datasets.
    (4) It supports GTF format annotation and SAM/BAM read data.
    (5) It supports strand-specific read summarization.
    (6) It can perform read summarization at both feature level (eg. exons) and meta-feature level (eg. genes).
    (7) It allows users to specify whether reads overlapping with more than one feature should be counted or not.
    (8) It gives users full control on the summarization of paired-end reads, including allowing them to check if both ends are mapped and/or if the paired-end distances satisfy the distance criteria.
    (9) It discriminates the features, which were overlapped by both ends from the same fragment, from those which were overlapped by only one end so as to get more fragments counted.
    (10) It allows users to specify whether chimeric fragments should be counted.

    For a quick start, have a look at our short tutorial - http://bioinf.wehi.edu.au/featureCounts/ . For more details, please refer to the users guide - http://bioinf.wehi.edu.au/featureCounts/usersguide.pdf (see Chapter 6).

    We also compared featureCounts with other methods. The comparison results can be found in our manuscript - http://arxiv.org/abs/1305.3347.

    The featureCounts program is part of the Subread package (http://subread.sourceforge.net), which includes a suite of programs for processing next-gen sequencing data such as read mapping and exon-exon junction detection. featureCounts can also be accessed from the development version of the Bioconductor R package Rsubread (http://bioconductor.org/packages/2.1.../Rsubread.html)

    Please do not hesitate to contact me if you have any questions ([email protected]).

    Best regards,
    -------------------
    Wei Shi, Ph.D
    Bioinformatics Division
    The Walter and Eliza Hall Institute of Medical Research
    1G Royal Parade, Parkville, Victoria 3052
    Australia
    Last edited by shi; 05-16-2013, 04:08 AM.

  • #2
    dear wei,

    using featurecount I
    get:

    Code:
    ./count_featurCount.sh: line 15: 29340 Segmentation fault      (core dumped) featureCounts -p -B -C -a $gtf -t exon -g gene_id -i $SAMdir/$name/RUM.sam -o $name.fcount

    I use:

    Code:
    featureCounts -p -B -C -a $gtf -t exon -g gene_id -i $SAMdir/$name/RUM.sam -o $name.fcount
    example sam file:
    Code:
    seq.1	77	*	0	0	*	=	0	0	CGGGGCGGGATCCCGCCGCCTCTCCGGCCCGCCGGNNNNCCGCCACCGGCCCACTNTNCNCNNCCCCCCCNCGCGG	############################################################################
    seq.1	141	*	0	0	*	=	0	0	NNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNTNNCNGNNTCNNNNNNNNNN	############################################################################
    seq.2	73	chr2	89157089	25	76M	=	89157089	0	CCTCTCTGGGATAGAAGTTATTCAGCAGGCACACANNNNAGGCAGTTCCAGATTTNANCTGNNCATCAGANGGCGG	CCCCCCCDBCCCCBCAABBBCCBCCCCCCCBB@AA####/00000CCCCA##########################	XO:A:F	MD:Z:35ACAG16C1A3CT7T5	NM:i:9	IH:i:1	HI:i:1
    seq.2	133	chr2	89157089	25	*	=	89157089	0	NNTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNCNCCNCNNTGCNNNNNNNNN	############################################################################	XO:A:F	IH:i:1	HI:i:1
    seq.3	77	*	0	0	*	=	0	0	CGGTTCAGCAGGAATGCCGAGATCGGAAGAGCGGTNNNGCAGGAATGCCGAGACCGCNTGTANTCTCGTATGCGGG	BB>4B?>?3>@>@642::<7A#######################################################
    seq.3	141	*	0	0	*	=	0	0	NNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGNNNNNNNNNNNNNNNNNCNTGNCGNGCGNNNNNNNNN	############################################################################
    seq.4	77	*	0	0	*	=	0	0	CGGTTCACCAGGAATGCCGAGACCGGAAGAGCGGTTCNGCAGGAATGCCGAGACCGCTCGTAATCTCGCATACCGG	<4=.+;**1*45052@@@@@########################################################
    seq.4	141	*	0	0	*	=	0	0	CNGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNGNCGNGGNGAGNNNNNNNNN	############################################################################
    and example of the gtf-file (from gencode):
    ##description: evidence-based annotation of the human genome (GRCh37), version 15 (Ensembl 70)
    Code:
    ##provider: GENCODE
    ##contact: [email protected]
    ##format: gtf
    ##date: 2013-01-21
    chr1	HAVANA	gene	11869	14412	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
    chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
    chr1	HAVANA	exon	11869	12227	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 1;  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
    chr1	HAVANA	exon	12613	12721	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 2;  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
    chr1	HAVANA	exon	13221	14409	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; exon_number 3;  level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
    chr1	ENSEMBL	transcript	11872	14412	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-201"; level 3; havana_gene "OTTHUMG00000000961.2";
    chr1	ENSEMBL	exon	11872	12227	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000515242.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "unprocessed_pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1-201"; exon_number 1;  level 3; havana_gene "OTTHUMG00000000961.2";
    do you have any clue what could be wrong?

    best wishes,

    dietmar

    Comment


    • #3
      Dear dietmar,

      Thanks for providing the detailed information about the problem you encountered. This helped us to reproduce the problem.

      The featureCounts program does not allow ## lines or comment lines to be included in the provided annotation file. The program crashed when such lines exist in the annotation file. Sorry about this. We are now working on this and will let you know once it is fixed.

      In the meantime, you can get around this problem by removing those ## lines from your annotation file. I have tested this and it worked. Hope it will work for you. But please let me know if it doesn't. I will get back to you soon.


      Best regards,
      Wei

      Comment


      • #4
        Hi dietmar,

        We have fixed this and uploaded the latest version (v1.3.3-p1) to sourceforge (http://subread.sourceforge.net).

        Hope this works for you.

        Best wishes,
        Wei

        Comment


        • #5
          dear wei,

          thank you.

          but now i get another error:

          Code:
          **********
          **********
            WARNING
          **********
          No meta-feature id is found on the 6-th line. If it is a GTF file, you may need to check the name of the gene_id field and specify a correct field name using a '-g' option.
          **********
          **********
          WARNING: the feature on the 31178-th line has zero coordinate or zero lengths
          ... many similar warnings ...
          WARNING: the feature on the 2577002-th line has zero coordinate or zero lengths
          There are 1201574 features loaded from the annotation file.
          The 1201574 features are sorted.
          Number of chromosomes included in the annotation is 25
          ./count_featurCount.sh: line 16: 24368 Segmentation fault      (core dumped) featureCounts -p -B -C -a $gtf -t exon -g gene_id -i $SAMdir/$name/RUM.sam -o $name.fcount

          the 6th line has definitely a gene_id (probably a bug?) but the lines with zero length features are true.

          the gtf-file:
          Code:
          ##description: evidence-based annotation of the human genome (GRCh37), version 15 (Ensembl 70)
          ##provider: GENCODE
          ##contact: [email protected]
          ##format: gtf
          ##date: 2013-01-21
          chr1	HAVANA	gene	11869	14412	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENSG00000223972.4"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "pseudogene"; transcript_status "KNOWN"; transcript_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
          chr1	HAVANA	transcript	11869	14409	.	+	.	gene_id "ENSG00000223972.4"; transcript_id "ENST00000456328.2"; gene_type "pseudogene"; gene_status "KNOWN"; gene_name "DDX11L1"; transcript_type "processed_transcript"; transcript_status "KNOWN"; transcript_name "DDX11L1-002"; level 2; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
          ...
          31178-th line:
          chr1	HAVANA	exon	19983359	19983359	.	+	.	gene_id "ENSG00000158747.9"; transcript_id "ENST00000439278.1"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "NBL1"; transcript_type "protein_coding"; transcript_status "NOVEL"; transcript_name "NBL1-007"; exon_number 4;  level 1; tag "mRNA_end_NF"; tag "cds_end_NF"; tag "exp_conf"; havana_gene "OTTHUMG00000002700.5"; havana_transcript "OTTHUMT00000313743.1";
          do you think the zero-length features are the problem, or is it possible that if the BAM files has additional chromosomes (like chr1_gl000191_random) which are not present in the gtf-annotation file your program has a problem. i think this should be catched, because this will happen often...

          best wishes,

          dietmar

          Comment


          • #6
            Dear dietmar,

            Although you got those warning messages, your annotation file was actually successfully processed according to the program output (sorry for the misleading warning message about gene id, we have fixed this and updated the program on sourceforge).

            Did you provide a SAM or BAM file to featureCounts? If you provided a BAM file, you need to add an '-b' option to your command. Otherwise it will cause a segmentation fault.

            We actually downloaded the ENCODE annotation file you used and tested it here. We found it worked with the version of featureCounts you are using for both SAM and BAM input.

            Could you please check if you correctly specified the type of your read file? I think this was probably the reason why you got a segmentation fault.

            Best regards,

            Wei

            Comment


            • #7
              I don't know if this an intentional limitation to the program, but here is an error I'm encountering.

              The reference genome I am aligning reads to is a draft version, with 430k contigs.

              The error I am getting when attempting to use featureCounts is:

              "There are 290343 features loaded from the annotation file.
              WARNING: THere are too many chromosomes in the annotations. The remainder of the annotation file is ignored.
              The 8586 features are sorted.
              Number of chromosomes inclued in the annotation is 499.

              WARNING: There are too many reference sequences in the BAM file!
              [repeated many times]
              Segmentation fault"

              Comment


              • #8
                Dear @NateP,

                featureCounts had hard-coded limits on the numbers of chromosome allowed to be included in the annotation file and in the BAM/SAM file. These limits were far less than the number of contigs you have here and that was the reason why featureCounts crashed.

                We have now removed these limits and allowed any number of chromosomes (or contigs) to be included in the annotation file and in the SAM/BAM file. Please check out the latest version of the Subread package (1.3.3-p2) on sourceforge. The featureCounts program included in this version should work for your data now.

                Please let me know if the problem persists.

                Best wishes,

                Wei

                Comment


                • #9
                  Wei,

                  I figured it was something along those lines.
                  The newest version works with my data sets, thanks for the great program!

                  Comment


                  • #10
                    Thanks for letting me know, NateP.

                    Cheers,
                    Wei

                    Comment


                    • #11
                      Dear Wei,
                      I was just wondering if you could give advise on a featureCounts job that I tried to perform, I have RNA-seq data aligned to bovine genome (UMD3.1.71 from ENSEMBL) and I used subread-1.3.3-p2 (which is the latest version I think).
                      The error I get is:
                      Code:
                      No meta-feature id is found on the 1-th line. If it is a GTF file, you may need to check the name of the gene_id field and specify a correct field name using a '-g' opt
                      ion.
                      My code is:
                      Code:
                      featureCounts -a /path/annotation.gtf -t exon -g gene_id -b -i /path/file.bam -o file.out -s -R -p -B
                      And if I understand correctly the outputs I obtained, the reads assignment was still performed against my feature (exon in this case, represented as LINE_*), however I thought the output should have been my counts per gene_id:
                      Code:
                      geneid  length  nreads
                      LINE_0000001    88      0
                      LINE_0000002    167     0
                      LINE_0000003    51      0
                      LINE_0000004    337     0
                      LINE_0000005    1399    0
                      LINE_0000006    107     0
                      LINE_0000007    110     0
                      LINE_0000008    110     0
                      LINE_0000009    133     0
                      LINE_0000010    169     0
                      LINE_0000011    204     0
                      LINE_0000012    211     0
                      LINE_0000013    229     0
                      LINE_0000014    45      0
                      LINE_0000015    98      0
                      LINE_0000016    183     0
                      LINE_0000017    139     0
                      LINE_0000018    1429    0
                      LINE_0000019    218     0
                      LINE_0000020    91      0
                      LINE_0000021    202     0
                      LINE_0000022    142     2
                      LINE_0000023    157     6
                      My gtf file looks like this:
                      Code:
                      4       protein_coding  exon    432160  432247  .       -       .        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "1"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201"; exon_id "ENSBTAE00000393952";
                      4       protein_coding  CDS     432160  432235  .       -       0        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "1"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201"; protein_id "ENSBTAP00000004727";
                      4       protein_coding  start_codon     432233  432235  .       -       0        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "1"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201";
                      4       protein_coding  exon    430223  430389  .       -       .        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "2"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201"; exon_id "ENSBTAE00000037981";
                      4       protein_coding  CDS     430223  430389  .       -       2        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "2"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201"; protein_id "ENSBTAP00000004727";
                      4       protein_coding  exon    429103  429153  .       -       .        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "3"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201"; exon_id "ENSBTAE00000369947";
                      4       protein_coding  CDS     429103  429153  .       -       0        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "3"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201"; protein_id "ENSBTAP00000004727";
                      4       protein_coding  exon    426030  426366  .       -       .        gene_id "ENSBTAG00000003625"; transcript_id "ENSBTAT00000004727"; exon_number "4"; gene_name "BT.66099"; gene_biotype "protein_coding"; transcript_name "BT.66099-201"; exon_id "ENSBTAE00000037983";
                      Any idea on how I can sort this out?
                      Thanks a lot for your help.
                      Regards,
                      Nicolas

                      Comment


                      • #12
                        I tried subread-align and featureCount as well and encountered the same 'problem'. I was hoping that it would give me a gene identifier in the output file instead of line numbers. I've also used the -t and the -g option.

                        Comment


                        • #13
                          Dear Nicolas and iris_aurelia,

                          Please replace -g gene-id with -g ' gene_id' in your command (note that there is a space right before gene_id in the new argument) and try it again.

                          I suspect the problem you encountered was because there is a space right before the gene_id attribute in the 9th column of your annotation file. I have seen this for the yeast GTF annotation file generated by Ensembl. This is not compliant with the GTF-format specification which requires all the columns to be tab-delimited.

                          However, the human GTF annotation generated by Ensembl is fine. So it seems this problem only occurred for some species. I'm going to contact Ensembl to let them know this issue.

                          Please let me know if the new argument does not work.

                          Best wishes,

                          Wei

                          Comment


                          • #14
                            Hi Wei,

                            Thanks, this solution works! However, I was using the Human reference (version 66), which had this problem as well. Maybe newer versions of the GTF file do not have this problem..?

                            Another thing I bumped into was the lack of reading gzipped fastQ files. Is it possible to either read from standard in or to create an option to use gzipped files?

                            Iris
                            Last edited by iris_aurelia; 05-21-2013, 04:09 AM.

                            Comment


                            • #15
                              Hi Wei,

                              Thanks a lot for all your help, indeed it works perfectly now and so fast.
                              This is great.
                              Thanks again,
                              Nicolas

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              50 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X