Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Is my parameter choice for HTSeq-count is right?

    I am trying to count single-end reads.

    Based on amel_OGSv3.2.gene.gff
    (Group1.1 amel_OGSv3.1 gene 507599 515039 1 - . ID=GB42155),
    I written the following command:

    python -m HTSeq.scripts.count -s no -t gene -i ID CK_2.sam /media/wenfu/LaCie/my_rnaseq_dat/amel_OGSv3.2.gene.gff > CK_2.gff

    Is there any one who can check if the command was written correctly, because I don't know whether “-s no -t gene -i ID” are right.

    Thanks a lot!!

    Richard

  • #2
    Well, I usually call HTSeq-counts like this:

    htseq-counts <flags/parameters> <input.sam> <input.gtf>

    Try:

    htseq-counts --help

    This will tell you about the flags you are interested in:
    -s yes/no is for stranded data. Is your data stranded? Ask the person who did library prep.
    -t is featureType, so usually 'exon' if you are doing RNAseq.
    -i is attribute, usually 'gene_id', but you could use 'transcript_id' I suppose, or 'gene_name'

    Hope that is some help to you?
    Last edited by bruce01; 11-06-2013, 08:45 AM. Reason: Mistyping

    Comment


    • #3
      bruce01,
      Thanks for your answer!
      I chose "-s no", because they are not stranded reads.
      As to -i and -ID, do you have any idea how to write the command according my gff file?

      Comment


      • #4
        Can you post the output from the command:

        head amel_OGSv3.2.gene.gff

        Comment


        • #5
          bruce01,
          In fact, amel_OGSv3.2.gene.gff is trimed gff file from the original one, amel_OGSv3.2.gff. I just grep the rows with gene. It seams that I need grepping the rows with exon, right?
          Here is output of the original one for one gene:

          Group1.1 amel_OGSv3.1 gene 507599 515039 1 - . ID=GB42155
          Group1.1 amel_OGSv3.1 mRNA 507599 515039 1 - . ID=GB42155-RA;Parent=GB42155
          Group1.1 amel_OGSv3.1 CDS 514009 514378 . - 0 Parent=GB42155-RA
          Group1.1 amel_OGSv3.1 CDS 513392 513906 . - 2 Parent=GB42155-RA
          Group1.1 amel_OGSv3.1 five_prime_UTR 514761 515039 1 - . Parent=GB42155-RA
          Group1.1 amel_OGSv3.1 five_prime_UTR 514379 514408 1 - . Parent=GB42155-RA
          Group1.1 amel_OGSv3.1 three_prime_UTR 512910 513391 1 - . Parent=GB42155-RA
          Group1.1 amel_OGSv3.1 three_prime_UTR 507599 509541 1 - . Parent=GB42155-RA
          Group1.1 amel_OGSv3.1 exon 507599 509541 1 - . Parent=GB42155-RA
          Group1.1 amel_OGSv3.1 exon 512910 513906 1 - . Parent=GB42155-RA
          Group1.1 amel_OGSv3.1 exon 514009 514408 1 - . Parent=GB42155-RA
          Group1.1 amel_OGSv3.1 exon 514761 515039 1 - . Parent=GB42155-RA
          Group1.1 amel_OGSv3.1 gene 940410 942604 1 + . ID=GB42182

          Thanks!!

          Comment


          • #6
            I don't use GFFs so not too sure about -i, but I think you should be using:

            -t exon -i ID

            This will mean that the the reads must overlap exons to be counted, but will be counted on a 'per ID' basis, so counts you get out in a table will be for the ID name. If you did:

            -t gene -i ID

            you would get any reads overlapping with any of the gene, including introns, which is probably not what you want(?)

            Give it a go, any errors or issues post them up.

            Comment


            • #7
              Do You mean the original gff file can be used dirrectedly for the counting?

              Comment


              • #8
                Yes, I think the -t flag is used to parse exon, or gene, or whatever you put in there, from the GFF so you don't really need to grep exon or gene.

                Comment


                • #9
                  Hi bruce01,
                  Using -t exon -i ID, I got the following information:

                  Error occured when processing GFF file (line 9 of file /media/wenfu/LaCie/my_rnaseq_dat/amel_OGSv3.2.gff3):
                  Feature GB42155-RA does not contain a 'ID' attribute
                  [Exception type: SystemExit, raised in count.py:55]

                  As you can see, it contains 'Parent' atribute.

                  Comment


                  • #10
                    Yes, was wondering if you should use -i ID or -i Parent. Try -i Parent now
                    Last edited by bruce01; 11-06-2013, 09:39 AM. Reason: Mistyping again!

                    Comment


                    • #11
                      I used -i Parent, it is running now.

                      Comment


                      • #12
                        Originally posted by wmseq View Post
                        I used -i Parent, it is running now.
                        This is going to result in an improper analysis. Your purpose is to count the number of reads uniquely aligned to each gene, but the "Parent" name identifies the transcript, not the gene. A gene with multiple isoforms will have multiple transcripts associated with it. This means that a single read will counted multiple times (once for each transcript which shares the exon which the read is aligned to). Reads which align to multiple entities are not counted by HT-Seq. As it stands the GFF file you have can not be used to properly count reads per GENE. To do that you need a GFF (or GTF) with a gene name (e.g. gene_id in ENSEMBL GTF files) associated with each exon feature.

                        Comment


                        • #13
                          @kmcarr How can you tell that 'Parent' is a transcript and not a gene? The "-RA"? It will only result in an improper analysis if multiple transcripts exist, so could test easily for that.

                          @wmseq are you working on Apis mellifera? Any reason not to use the GTF here?

                          Comment


                          • #14
                            Originally posted by bruce01 View Post
                            @kmcarr How can you tell that 'Parent' is a transcript and not a gene? The "-RA"?
                            Look at the second line of the GFF fragment wmseq posted:
                            Code:
                            Group1.1	amel_OGSv3.1	mRNA	507599	515039	1	-	.	ID=GB42155-RA;Parent=GB42155
                            and you can see the the ID GB42155-RA is assigned to an mRNA (transcript) not to a gene. The gene (Parent) of GB42155-RA has the ID GB42155.
                            It will only result in an improper analysis if multiple transcripts exist, so could test easily for that.
                            A majority of genes in complex eukaryotes have multiple isoforms so it is pretty much guaranteed that there will be a problem using the transcript ID with HTSeq-count.
                            @wmseq are you working on Apis mellifera? Any reason not to use the GTF here?
                            That annotation file is specific to the version 4.0 assembly of the A. mellifera genome. wmseq appears to be working with the v3.2 assembly. If (his/her) reads are really aligned to the 3.2 build then trying to do gene counts with a v4.0 annotation file will fail.

                            Comment


                            • #15
                              @kmcarr, thanks for the explanation, saw that line after I posted, not as familiar with GFF. Infact I distinctly dislike and avoid them! Just not intuitive to me.

                              A majority of genes in complex eukaryotes have multiple isoforms so it is pretty much guaranteed...
                              Yes, so one might test for multiple transcripts, instead of assume they are present. I was unsure of level of annotation and couldn't find the assembly for that GFF, hence looking at that v4.0 on Ensembl. Was a bit brief asking about using that, I did mean use the whole assembly, so realign etc.

                              An option otherwise is to remove the overlapping transcripts (if any) from GFF, thereby using a single transcript ID as a proxy gene ID. I have a script to do that for GTF, if you want PM me location of GFF for download and I can rewrite for GFF.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Essential Discoveries and Tools in Epitranscriptomics
                                by seqadmin




                                The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                                04-22-2024, 07:01 AM
                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Today, 08:47 AM
                              0 responses
                              12 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              60 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              59 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              54 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X