Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • htseq-count on UTR

    Hi
    Does anybody know if htseq-count works on UTR? I tried it, only got gene list, there is no count in the output.

  • #2
    Read the instructions for htseq-count:



    -t <feature type>, --type=<feature type>

    feature type (3rd column in GFF file) to be used, all features of other type are ignored (default, suitable for RNA-Seq and Ensembl GTF files: exon)


    By using the -t, --type= argument you can tell what features you want htseq-count to use. The default is exon, so you cant use the defaults. The question then is do you want reads mapping only to the UTR or to the entire transcript, including the UTR? That will determine what feature type you give it.

    So say you want reads mapping only to UTRs. Whatever your UTRs are named in your GFF file, use that with the -t or --type= arguments when you run htseq-count.

    Comment


    • #3
      Hi chadn737,

      I used -t feature in the command line and set feature to utr, please see the command i used:
      htseq-count -s no -t utr -m intersection-nonempty file.sam utr.gff > urtcount.txt

      the output looks like:

      "06100073560ik"
      "06100071803ik"
      "0610008608Rik"
      "0610008008Rik"
      "0610008007Rik"
      "0610008007Rik"
      "0610009760Rik"
      "06100094371ik"
      "06100094166ik"
      "06100094858ik"
      "0610009020Rik"
      "06100091744ik"

      Only gene_id is there and no count. BTW, we used NCBI reference and modified it according to ensembl gff file format.

      THanks,

      Comment


      • #4
        Look in your gff file to find out specifically how the UTRs are labeled. I'm not familiar with your specific species/gff file. For instance in the TAIR10 GFF they are named as three_prime_UTR or five_prime_UTR. Assuming your gff file even has the UTRs listed? You may need to modify the names using sed or alternatively count reads for 5' UTRs and 3' UTRs seperately.

        Comment


        • #5
          Hi Chand737,

          The GFF used had been modified according to ensembl format as following

          1 bed2gff utr5 8794025 8794051 . - . transcript_id "NM_27671"; utr5_number "0016"; gene_id "NM_27671"
          1 bed2gff utr5 8872242 8872295 . - . transcript_id "NM_27671"; utr5_number "0017"; gene_id "NM_27671"
          1 bed2gff utr5 8989267 8989330 . - . transcript_id "NM_27671"; utr5_number "0018"; gene_id "NM_27671"
          1 bed2gff utr5 9193260 9193341 . - . transcript_id "NM_27671"; utr5_number "0019"; gene_id "NM_27671"
          1 bed2gff utr5 9288214 9289811 . - . transcript_id "NM_27671"; utr5_number "0020"; gene_id "NM_27671"

          total 9 columns. each columns separated by tab. Starting from transcript_id to gene_id is one column (the last column).
          I am using mouse genome. 5'UTR

          Thanks,
          Li
          Last edited by emilyjia2000; 04-06-2012, 06:51 AM.

          Comment


          • #6
            Originally posted by emilyjia2000 View Post
            Hi Chand737,

            The GFF used had been modified according to ensembl format as following

            1 bed2gff utr 8794025 8794051 . - . transcript_id "NM_27671"; utr_number "0016"; gene_id "NM_27671"
            1 bed2gff utr 8872242 8872295 . - . transcript_id "NM_27671"; utr_number "0017"; gene_id "NM_27671"
            1 bed2gff utr 8989267 8989330 . - . transcript_id "NM_27671"; utr_number "0018"; gene_id "NM_27671"
            1 bed2gff utr 9193260 9193341 . - . transcript_id "NM_27671"; utr_number "0019"; gene_id "NM_27671"
            1 bed2gff utr 9288214 9289811 . - . transcript_id "NM_27671"; utr_number "0020"; gene_id "NM_27671"

            I am using mouse genome.

            Thanks,
            Li
            Why does your output from your last post:

            "06100073560ik"
            "06100071803ik"
            "0610008608Rik"
            "0610008008Rik"
            "0610008007Rik"
            "0610008007Rik"
            "0610009760Rik"
            "06100094371ik"
            "06100094166ik"
            "06100094858ik"
            "0610009020Rik"
            "06100091744ik"

            Not match the gene_ids of your gff file?

            That suggests to me that the problem has to do with naming of something in one of your files.
            Last edited by chadn737; 04-06-2012, 11:10 AM.

            Comment


            • #7
              Sorry I have two annotation files, one is the posted, another is the one with geneid as listed in the output. two files have the same format.

              Sorry I didn't make it clear.

              Comment


              • #8
                Originally posted by emilyjia2000 View Post
                Hi Chand737,

                The GFF used had been modified according to ensembl format as following

                1 bed2gff utr5 8794025 8794051 . - . transcript_id "NM_27671"; utr5_number "0016"; gene_id "NM_27671"
                1 bed2gff utr5 8872242 8872295 . - . transcript_id "NM_27671"; utr5_number "0017"; gene_id "NM_27671"
                1 bed2gff utr5 8989267 8989330 . - . transcript_id "NM_27671"; utr5_number "0018"; gene_id "NM_27671"
                1 bed2gff utr5 9193260 9193341 . - . transcript_id "NM_27671"; utr5_number "0019"; gene_id "NM_27671"
                1 bed2gff utr5 9288214 9289811 . - . transcript_id "NM_27671"; utr5_number "0020"; gene_id "NM_27671"

                total 9 columns. each columns separated by tab. Starting from transcript_id to gene_id is one column (the last column).
                I am using mouse genome. 5'UTR

                Thanks,
                Li
                It looks like you edited this post while or after I replied. In this gff file your feature type is listed as "utr5" not utr.

                So try using -t utr5 instead of -t utr or alternatively --type=utr5.

                Comment


                • #9
                  Something else to check. At the very end of your output from htseq-count there should be the following categories:

                  no_feature
                  ambiguous
                  too_low_aQual
                  not_aligned
                  alignment_not_unique

                  How many reads fall into these categories? Assuming that everything in your sam and gff files is right, and the only problem is the feature type you give to htseq-count, then you should see read counts in these categories. Particularly since the majority of your reads are unlikely to be coming from UTR regions.

                  If these categories are empty, then something else is wrong.

                  Comment


                  • #10
                    Thanks Chand737, I tried both features: -t utr or -t utr5, both of them delivered the same results.

                    As you pointed out, I checked the category as follows:

                    no_feature 122533035
                    ambiguous 1417743
                    too_low_aQual 0
                    not_aligned 0
                    alignment_not_unique 20439904

                    Could you please let me know what's wrong with it?

                    Thanks

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Current Approaches to Protein Sequencing
                      by seqadmin


                      Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                      04-04-2024, 04:25 PM
                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, 04-11-2024, 12:08 PM
                    0 responses
                    27 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 10:19 PM
                    0 responses
                    30 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-10-2024, 09:21 AM
                    0 responses
                    26 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 04-04-2024, 09:00 AM
                    0 responses
                    52 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X