Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • featureCounts error

    Hi,

    I have a problem with featureCounts gtf file.

    I want to count the miRNA reads using the sorted .bam file containing only mapped reads (all generated by samtools from Bowtie output --SAM file).
    For mapping I used the H. sapiens, NCBI v37 indexes, downloaded from bowtie homepage


    To get the gtf file for miRNA I used:
    Code:
    $ wget ftp://ftp.ensembl.org/pub/release-74/gtf/homo_sapiens/Homo_sapiens.GRCh37.74.gtf.gz
    
    $ gunzip Homo_sapiens.GRCh37.74.gtf.gz
    
    $ cat Homo_sapiens.GRCh37.74.gtf.gz | grep "miRNA" > Homo_sapiens_miRNA.gtf
    the generate file looks like that
    Code:
    $ head Homo_sapiens_miRNA.gtf  
    1	miRNA	exon	30366	30503	.	+	.	gene_id "ENSG00000243485"; transcript_id "ENST00000607096"; exon_number "1"; gene_name "MIR1302-11"; gene_biotype "lincRNA"; transcript_name "MIR1302-11-201"; exon_id "ENSE00003695741";
    1	miRNA	exon	1102484	1102578	.	+	.	gene_id "ENSG00000207730"; transcript_id "ENST00000384997"; exon_number "1"; gene_name "MIR200B"; gene_biotype "miRNA"; transcript_name "MIR200B-201"; exon_id "ENSE00001500004";
    1	miRNA	exon	1103243	1103332	.	+	.	gene_id "ENSG00000207607"; transcript_id "ENST00000384875"; exon_number "1"; gene_name "MIR200A"; gene_biotype "miRNA"; transcript_name "MIR200A-201"; exon_id "ENSE00001499882";
    1	miRNA	exon	1104385	1104467	.	+	.	gene_id "ENSG00000198976"; transcript_id "ENST00000362106"; exon_number "1"; gene_name "MIR429"; gene_biotype "miRNA"; transcript_name "MIR429-201"; exon_id "ENSE00001436869";
    1	miRNA	exon	3477259	3477354	.	-	.	gene_id "ENSG00000207776"; transcript_id "ENST00000385042"; exon_number "1"; gene_name "MIR551A"; gene_biotype "miRNA"; transcript_name "MIR551A-201"; exon_id "ENSE00001500049";
    1	miRNA	exon	3800628	3800697	.	+	.	gene_id "ENSG00000264428"; transcript_id "ENST00000579705"; exon_number "1"; gene_name "AL691523.1"; gene_biotype "miRNA"; transcript_name "AL691523.1-201"; exon_id "ENSE00002727084";
    in featureCounts (Subread package) I use:
    Code:
    featureCounts -a Homo_sapiens_miRNA.gtf -F GTF -t transcript_name -M -o sample_1_counts.txt input_file sample_1_mapped.bam
    ...it reads in the .bam file, but does not recognize the gtf of some reason.
    Error message:
    Code:
    Load annotation file Homo_sapiens_miRNA.gtf ...                            ||
    ||    Features : 0                                                            ||
    || WARNING no features were loaded in format GTF.                             ||
    ||         annotation format can be specified using '-F'.                     ||
    Failed to open the annotation file Homo_sapiens_miRNA.gtf, or its format is incorrect, or it contains no 'transcript_name' features.

    so, questions
    1) is my featureCounts code OK to find the miRNAs? In the .gtf file there are some transcripts, which are with a gene_biotype "miRNA", but he transcript_name is stated based on the gene name, rather than "MIR" (see above the last row of gtf file). Can someone explain what these are?

    2) why doesn't featureCount recognize the gtf? Any suggestions to solve the problem?

    Thanks in advance for all the repliers

  • #2
    hi heso,
    `Your script to run featureC has problem

    1) -t is the feature type. This is 3rd col of GTF. Its normally "exon" and it is so in the snapshot of your GTF too. So, specify -t exon and it should be fine.

    2) The -g parameter specifies the attribute used to group features ('exon' here). Since its miRNA so either gene or transcript would mean the same I think.
    So, you can specify -g gene_id
    OR, -g transcript_id

    Whatever you do in 2nd, the result file will have the primary identifier as that. Again the value for -g is in the 9th col of GTF

    Comment


    • #3
      Thank you for the answere.
      I now tried the following command:

      $ featureCounts -a Homo_sapiens_miRNA.gtf -F GTF -t exon -M -g gene_id -o Rd4_UCexo_r1_fcounts.txt input_files Rd4_UCexo_r1_seqok_n0l15abest_hg37_onlymapped.bam
      of some reason I still get 0% successfully assigned reads:
      //================================= Running ==================================\\
      || ||
      || Load annotation file Homo_sapiens_miRNA.gtf ... ||
      || Features : 3424 ||
      || Meta-features : 3411 ||
      || Chromosomes : 116 ||
      || ||
      || Process Unknown file input_files... ||
      || Single-end reads are included. ||
      || Failed to open file input_files ||
      || No counts were generated for this file. ||
      || ||
      || Process BAM file Rd4_UCexo_r1_seqok_n0l15abest_hg37_onlymapped.bam... ||
      || Single-end reads are included. ||
      || Assign reads to features... ||
      || Total reads : 92004855 ||
      || Successfully assigned reads : 0 (0.0%) ||
      || Running time : 3.18 minutes ||
      || ||
      || Read assignment finished. ||
      || ||
      \\===================== http://subread.sourceforge.net/ ======================//
      The output file looks like that:

      Geneid Chr Start End Strand Length Sample_1_hg37_onlymapped.bam
      ENSG00000243485 1 30366 30503 + 138
      ENSG00000207730 1 1102484 1102578 + 95
      ENSG00000207607 1 1103243 1103332 + 90
      ENSG00000198976 1 1104385 1104467 + 83
      ENSG00000207776 1 3477259 3477354 - 96
      ENSG00000264428 1 3800628 3800697 + 70
      ENSG00000264341 1 5624131 5624203 + 73
      ENSG00000216045 1 5875931 5876019 - 89
      ENSG00000264101 1 5922732 5922801 - 70
      ENSG00000266687 1 5953899 5953984 - 86
      ENSG00000265392 1 6489894 6489956 - 63
      ENSG00000207865 1 9211727 9211836 - 110
      ENSG00000252841 1 9338450 9338592 + 143
      What could be the problem?

      Comment


      • #4
        hi,
        In the cmdline, what is the input_files term doing there?
        featureCounts -a Homo_sapiens_miRNA.gtf -F GTF -t exon -M -g gene_id -o Rd4_UCexo_r1_fcounts.txt input_files Rd4_UCexo_r1_seqok_n0l15abest_hg37_onlymapped.bam

        Just remove that term. See in the diagnostic log file, featureC says
        || Process Unknown file input_files... ||

        Its not reading your BAM file. The correct cmd -

        featureCounts \
        -a Homo_sapiens_miRNA.gtf \
        -F GTF \
        -t exon \
        -M \
        -g gene_id \
        -o Rd4_UCexo_r1_fcounts.txt \
        Rd4_UCexo_r1_seqok_n0l15abest_hg37_onlymapped.bam

        (i have just formatted for easier readability)

        Comment


        • #5
          hi again,

          I tried You suggestion. Still no help


          Code:
          //================================= Running ==================================\\
          ||                                                                            ||
          || Load annotation file Homo_sapiens_miRNA.gtf ...                            ||
          ||    Features : 3424                                                         ||
          ||    Meta-features : 3411                                                    ||
          ||    Chromosomes : 116                                                       ||
          ||                                                                            ||
          || Process BAM file Rd4_UCexo_r1_seqok_n0l15abest_hg37_onlymapped.bam...      ||
          ||    Single-end reads are included.                                          ||
          ||    Assign reads to features...                                             ||
          ||    Total reads : 92004855                                                  ||
          ||    Successfully assigned reads : 0 (0.0%)                                  ||
          ||    Running time : 3.46 minutes                                             ||
          ||                                                                            ||
          ||                         Read assignment finished.                          ||
          ||                                                                            ||
          \\===================== [url]http://subread.sourceforge.net/[/url] ======================//
          And the output file:

          Code:
          # Program:featureCounts v1.4.6; Command:"featureCounts" "-a" "Homo_sapiens_miRNA.gtf" "-$
          Geneid  Chr     Start   End     Strand  Length  Rd4_UCexo_r1_seqok_n0l15abest_hg37_onlym$
          ENSG00000243485 1       30366   30503   +       138     0
          ENSG00000207730 1       1102484 1102578 +       95      0
          ENSG00000207607 1       1103243 1103332 +       90      0
          ENSG00000198976 1       1104385 1104467 +       83      0
          ENSG00000207776 1       3477259 3477354 -       96      0
          ENSG00000264428 1       3800628 3800697 +       70      0
          ENSG00000264341 1       5624131 5624203 +       73      0
          ENSG00000216045 1       5875931 5876019 -       89      0
          ENSG00000264101 1       5922732 5922801 -       70      0
          ENSG00000266687 1       5953899 5953984 -       86      0
          ENSG00000265392 1       6489894 6489956 -       63      0
          ENSG00000207865 1       9211727 9211836 -       110     0
          ENSG00000252841 1       9338450 9338592 +       143     0

          But generally, does the output file have to include only the gene_id's where it has found a 'hit' (read count) or all the gene_id's of the gtf file (even if no reads are found for the specific gene_id)?

          Comment


          • #6
            All the gene ids are included in the output no matter they got a hit or not.

            featureCounts also outputs a .summary file which summarizes the read counting results and lists the numbers of reads that were not assigned due to various reasons. Can you provide the content of that file please?

            You may also have a look at the Subread users guide which includes a section for mapping and counting miRNA reads using Subread and featureCounts (section 5.5):

            http://bioinf.wehi.edu.au/subread-pa...UsersGuide.pdf

            The miRBase database (http://www.mirbase.org/) contains miRNA gene annotations which can be used for read summarization as well.

            Comment


            • #7
              The .summary file contents is the following:
              Code:
              Status  Rd4_UCexo_r1_seqok_n0l15abest_hg37_onlymapped.bam
              Assigned        0
              Unassigned_Ambiguity    0
              Unassigned_MultiMapping 0
              Unassigned_NoFeatures   92004855
              Unassigned_Unmapped     0
              Unassigned_MappingQuality	0
              Unassigned_FragementLength	0
              Unassigned_Chimera	0
              Unassigned_Secondary    0
              Unassigned_Nonjunction  0
              Unassigned_Duplicate    0
              I have also tried to map the reads against the mature and hairpin miRNA sequences, which I got from miRbase homepage. (I built the bowtie indexes myself after grep'ing Human sequences, changing U's to T's and adding also 25G's to the end of the mature sequences to use them for bowtie)
              In this case for the same sample (which I above have used to map against the genome) I get 10% of the reads mapped to mature sequences and 6% mapped to the hairpin sequences. Therefore, I don't expect loads of miRNA's to pop up in featureCounts.
              However, I should get some hits! Now I get 0!!!!

              In addition, can someone explain me why I get a higher % of hits when mapping against hairpin than mature sequences??

              Comment


              • #8
                Post the first few alignments, perhaps you just have mismatching chromosome names (that's a common cause of this).

                Regarding mapping against the hairpin vs. mature sequence, those results sound unsurprising. The hairpin contains the mature sequence plus extra surrounding sequence. We know that the annotated mature sequences don't fully capture the heterogeneity of real miRNA bounds, but those bounds should generally be contained within the hairpin sequence. So, the hairpin will often yield higher mapping rates.

                Comment


                • #9
                  Yes, that's what it seemed to be.
                  I downloaded ebwt as well as gtf files from the UCSC site and now everything worked

                  Status /.............................................bam
                  Assigned 164848
                  Unassigned_Ambiguity 24960
                  Unassigned_MultiMapping 0
                  Unassigned_NoFeatures 91815047
                  Unassigned_Unmapped 0
                  Unassigned_MappingQuality 0
                  Unassigned_FragementLength 0
                  Unassigned_Chimera 0
                  Unassigned_Secondary 0
                  Unassigned_Nonjunction 0
                  Unassigned_Duplicate 0
                  Thanks again to everybody for helping out !

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM
                  • seqadmin
                    Techniques and Challenges in Conservation Genomics
                    by seqadmin



                    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                    Avian Conservation
                    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                    03-08-2024, 10:41 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, Yesterday, 06:37 PM
                  0 responses
                  10 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, Yesterday, 06:07 PM
                  0 responses
                  9 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-22-2024, 10:03 AM
                  0 responses
                  49 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-21-2024, 07:32 AM
                  0 responses
                  67 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X