Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq exon counts to gene level

    Hi,

    I'm working with drosophila melanogaster RNA-Seq data and at least one of my genes of interest is overlapping an other gene so I used HTSeq with exon counts but now I don't know how to combine the individual exon counts to gene level (so all the exon counts of a gene are counted for that gene for further analysis). Is it possible (more easily than with manually counting them in an excel file or something like that)? I have 18 samples and overs 50000 exons so there should be an easier way. Can the combining be done in R or Unix environment?

    If the HTSeq exon counts can't be combined, what should be used instead of HTSeq when the exon has to be the starting point to show the expression of overlapping genes?

  • #2
    Why did you count exons if you want gene-level metrics. Htseq-count can deal with overlapping genes without issue.

    Comment


    • #3
      I'm actually just a summer worker now and my supervisor did the gene level counts but the most important gene didn't show any expression in the data. It is located in a promoter area of an other gene. With exon counts it shows the expected expression and some collaborator used the exon data but I can't get his code before he comes back from his holiday and I'm stuck.

      Comment


      • #4
        It sounds like your supervisor did it wrong. In any case, you might just want to redo the counting, either with htseq-count or with featureCounts (which is much faster).

        Comment


        • #5
          She used this code for the gene counts:
          python -m HTSeq.scripts.count -m union -s no -a 10 -t gene -i ID $name.n-sorted.q10.sam $ANNOT/$gf_gene > $name.n-sorted.q10.$out_gene

          The ANNOT is a directory path and $name is used to run the whole list with additional running code. What should I change to get the gene counts right?

          Comment


          • #6
            Without seeing the annotation file, my guess is that whatever gene you're getting 0 counts for overlaps slightly this other gene, so using "-m union" is losing a lot of counts. So, try "-m intersection_nonempty" instead.

            If one gene completely overlaps the other, then without a stranded dataset there's no way to say from which of the overlapping genes a given read originates, thus saying that it has 0 counts would be appropriate. Granted, you might be able to guess from the coverage profile that most/all of them actually come from one or the other gene, but then you're using an expectation maximization method rather than direct counting. Alternatively, just remove the overlapping portion of one of the genes and then mention that whenever the data gets published. You'd likely get hammered by the reviewers, but that would be completely appropriate.

            Comment


            • #7
              The overlap is complete, that really is the problem. Checking from the manual (before even running the exon counts) all three options for -m just say ambiguous. That's why I came here (after intense googling).

              Edit.
              As I know a little of programming I would assume that the exons could be combined with some script. My skills just aren't good enough yet, so any help with just transforming a gff file from:
              14-3-3epsilon:1 0
              14-3-3epsilon:2 0
              14-3-3epsilon:3 3527
              14-3-3epsilon:4 1343
              14-3-3epsilon:5 1
              14-3-3epsilon:6 0
              14-3-3epsilon:7 57
              14-3-3epsilon:8 0

              to:
              14-3-3epsilon 4928

              is appreciated.
              Last edited by Minty; 06-17-2014, 05:02 AM.

              Comment


              • #8
                No count-based method will ever return anything other than 0 in that circumstance, since doing otherwise would be wrong.

                If you look at the read distribution in a genome browser (IGV or similar) and it's very obvious in all samples that the reads aligning to this region only originate from one of the genes then you might be able to get away with just removing the overlapping portion of one of the genes, which would result in the other not having 0 counts. This is, of course, also a good way to produce results that don't match the underlying biology. One would hope that your collaborator, who did something similar, thought about this issue (I wouldn't hold my breath).

                Comment


                • #9
                  The gene of interest has it's exon in an area that the other gene doesn't have any exons so the removal of the overlapping part of the other gene could be an options if nothing else works.

                  Comment


                  • #10
                    Just recounting by genes would produce similar results.

                    Keeping in mind that counting by exons will over/under count things (depending on how spliced reads are treated):
                    Code:
                    cat foo.txt | sed -r 's/:[0-9]+//' | awk '{if($1 == gene) {count+=$2} else {if(gene != "") printf("%s\t%i\n",gene,count);gene=$1;count=$2}}END{printf("%s\t%i\n",gene,count)}' > foo.combined.txt

                    Comment


                    • #11
                      Thank you for the code. I'll keep in mind the affect of spliced reads.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM
                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      30 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      32 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 09:21 AM
                      0 responses
                      28 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-04-2024, 09:00 AM
                      0 responses
                      53 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X