Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq exon counts to gene level

    Hi,

    I'm working with drosophila melanogaster RNA-Seq data and at least one of my genes of interest is overlapping an other gene so I used HTSeq with exon counts but now I don't know how to combine the individual exon counts to gene level (so all the exon counts of a gene are counted for that gene for further analysis). Is it possible (more easily than with manually counting them in an excel file or something like that)? I have 18 samples and overs 50000 exons so there should be an easier way. Can the combining be done in R or Unix environment?

    If the HTSeq exon counts can't be combined, what should be used instead of HTSeq when the exon has to be the starting point to show the expression of overlapping genes?

  • #2
    Why did you count exons if you want gene-level metrics. Htseq-count can deal with overlapping genes without issue.

    Comment


    • #3
      I'm actually just a summer worker now and my supervisor did the gene level counts but the most important gene didn't show any expression in the data. It is located in a promoter area of an other gene. With exon counts it shows the expected expression and some collaborator used the exon data but I can't get his code before he comes back from his holiday and I'm stuck.

      Comment


      • #4
        It sounds like your supervisor did it wrong. In any case, you might just want to redo the counting, either with htseq-count or with featureCounts (which is much faster).

        Comment


        • #5
          She used this code for the gene counts:
          python -m HTSeq.scripts.count -m union -s no -a 10 -t gene -i ID $name.n-sorted.q10.sam $ANNOT/$gf_gene > $name.n-sorted.q10.$out_gene

          The ANNOT is a directory path and $name is used to run the whole list with additional running code. What should I change to get the gene counts right?

          Comment


          • #6
            Without seeing the annotation file, my guess is that whatever gene you're getting 0 counts for overlaps slightly this other gene, so using "-m union" is losing a lot of counts. So, try "-m intersection_nonempty" instead.

            If one gene completely overlaps the other, then without a stranded dataset there's no way to say from which of the overlapping genes a given read originates, thus saying that it has 0 counts would be appropriate. Granted, you might be able to guess from the coverage profile that most/all of them actually come from one or the other gene, but then you're using an expectation maximization method rather than direct counting. Alternatively, just remove the overlapping portion of one of the genes and then mention that whenever the data gets published. You'd likely get hammered by the reviewers, but that would be completely appropriate.

            Comment


            • #7
              The overlap is complete, that really is the problem. Checking from the manual (before even running the exon counts) all three options for -m just say ambiguous. That's why I came here (after intense googling).

              Edit.
              As I know a little of programming I would assume that the exons could be combined with some script. My skills just aren't good enough yet, so any help with just transforming a gff file from:
              14-3-3epsilon:1 0
              14-3-3epsilon:2 0
              14-3-3epsilon:3 3527
              14-3-3epsilon:4 1343
              14-3-3epsilon:5 1
              14-3-3epsilon:6 0
              14-3-3epsilon:7 57
              14-3-3epsilon:8 0

              to:
              14-3-3epsilon 4928

              is appreciated.
              Last edited by Minty; 06-17-2014, 05:02 AM.

              Comment


              • #8
                No count-based method will ever return anything other than 0 in that circumstance, since doing otherwise would be wrong.

                If you look at the read distribution in a genome browser (IGV or similar) and it's very obvious in all samples that the reads aligning to this region only originate from one of the genes then you might be able to get away with just removing the overlapping portion of one of the genes, which would result in the other not having 0 counts. This is, of course, also a good way to produce results that don't match the underlying biology. One would hope that your collaborator, who did something similar, thought about this issue (I wouldn't hold my breath).

                Comment


                • #9
                  The gene of interest has it's exon in an area that the other gene doesn't have any exons so the removal of the overlapping part of the other gene could be an options if nothing else works.

                  Comment


                  • #10
                    Just recounting by genes would produce similar results.

                    Keeping in mind that counting by exons will over/under count things (depending on how spliced reads are treated):
                    Code:
                    cat foo.txt | sed -r 's/:[0-9]+//' | awk '{if($1 == gene) {count+=$2} else {if(gene != "") printf("%s\t%i\n",gene,count);gene=$1;count=$2}}END{printf("%s\t%i\n",gene,count)}' > foo.combined.txt

                    Comment


                    • #11
                      Thank you for the code. I'll keep in mind the affect of spliced reads.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Advancing Precision Medicine for Rare Diseases in Children
                        by seqadmin




                        Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                        12-16-2024, 07:57 AM
                      • seqadmin
                        Recent Advances in Sequencing Technologies
                        by seqadmin



                        Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                        Long-Read Sequencing
                        Long-read sequencing has seen remarkable advancements,...
                        12-02-2024, 01:49 PM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 12-17-2024, 10:28 AM
                      0 responses
                      33 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 12-13-2024, 08:24 AM
                      0 responses
                      48 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 12-12-2024, 07:41 AM
                      0 responses
                      34 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 12-11-2024, 07:45 AM
                      0 responses
                      46 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X