Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • DESeq2 Anomaly/Issue?

    Hi,

    I am trying to identify patterns of gene expression changes (to group genes by their regulating TFs) in different genetic backgrounds. However I am getting one (at least) anomaly that I can't explain.

    This gene has one transcript. here are the cufflinks generated (from mapped BAM files of biological replicates) mean FPKM values:

    Wt: 139.2 (+/- 16.0) n = 4
    GOF: 318.9 (+/- 60.9) n = 4
    LOF: 6.8 (+/- 1.6) n = 4
    2x LOF: 11.7 (+/-16.3) n = 4

    It is upregulated when it should be (GOF) and off when it should be (LOF, 2x LOF), a pretty good example of an activated target gene for this group. (yes I know there seems to be an outlier replicate in 2x LOF)

    Yet every DESeq2 comparison gives NA across the board, because it reports the base mean expression is 0????

    Is there some problem with annotation of this gene going from the mapped read BAM files to HTSeqCount output file to DESeq2?

    I am working with a very well annotated genome - C. elegans ce10 version.

    Also side question: what exactly is the base mean expression/how is it calculated? is it the mean expression between the groups you are comparing?

    Thanks for any help!
    Last edited by The_Corsair; 02-27-2015, 08:43 PM.

  • #2
    Ok so I have been doing more digging.

    The problem seems to be when going from the mapped read BAM files to the HTSeqCount output file.

    I checked the HTSeqCount output files for the 4 Wt replicates and 4 GOF replicates and every single one has 0 counts for this gene.

    Can anyone tell me what is going on here? Why is HTSeqCount (or cuff links) miscounting the reads for this gene??

    Comment


    • #3
      The only thing I can think of is it has too much sequence similarity to other genes but of the 5 related genes in the genome, all have counts.

      Comment


      • #4
        It sounds like an HTSeqCount issue, my first guess would be that the GTF/gff file contains overlapping gene regions. Quick thing to check, view the gene models file directly or look at it in a genome browser in that region of interest. Similarly, look at the Tophat2 coverage in this region, you should be able to see whether there is read coverage of zero or not.

        Apologies if you checked this line of thinking already...

        Perhaps subject to some debate, HTSeqCount chooses not to report counts in regions where multiple genes overlap, by design. In theory, after removing a portion of a gene, you'd still be able to detect differential changes using the remaining non-overlapping region. But sometimes gene models are completely overlapping (intended with other assumptions in mind) and so some genes potentially are completely lost during the count step. It isn't really for HTSeqCount to correct issues in gene annotations, but something to keep in mind. Easiest fix is to change the gtf file manually to remove overlapping regions (or merge them into one larger gene locus) and regenerate counts.

        Comment


        • #5
          DEXseq has a script to help with this (dexseq_prepare_annotation.py). See Section 2.4:
          The Bioconductor project aims to develop and share open source software for precise and repeatable analysis of biological data. We foster an inclusive and collaborative community of developers and data scientists.


          I've also modified htseq-count to optionally include ambiguous reads (ie those that overlap multiple entries in your GTF file) - happy to provide the source if interested.

          Comment


          • #6
            fanli, many thanks! I would be interested in the modified htseq-count just for the sake of adding another tool to the bag of tricks.

            I respect the intent of not counting those regions, and add caution that including counts could cause other cascade issues. So be aware.

            I have used that DEXseq script, I find it very useful. In fact, a good workflow may be to run that DEXseq script for the purpose of identifying any regions with overlapping genes. Then update the GTF file to resolve those regions in some way. Finally, run htseq-count using the updated GTF/GFF3 file which contains no overlapping regions.

            The GenomicFeatures R package has a function disjointExons() which accomplishes a very similar task, for what it's worth.

            All this, and we don't really know if this addresses the OP. Let us know!

            Comment


            • #7
              Attached. See the --allow_ambiguous option. And second jmw's point about being cautious when including these counts
              Attached Files

              Comment


              • #8
                Thanks for the suggestions!

                There are reads for the gene but not many and they must all be mapping to multiple genes, since there are a total of six related genes that should all be expressed in wild-type. The sequencing depth was lower because we had so many replicates and different samples - cost affecting parameters. I think this gene was just expressed at a low level and I didn't pick up any unique reads for it.

                I think I won't bother trying anything else with this gene. It's not needed for my analysis, was just curious as to why of the 6 it was the only one not expressed. Thanks for the helpful tips!

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Advancing Precision Medicine for Rare Diseases in Children
                  by seqadmin




                  Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
                  12-16-2024, 07:57 AM
                • seqadmin
                  Recent Advances in Sequencing Technologies
                  by seqadmin



                  Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                  Long-Read Sequencing
                  Long-read sequencing has seen remarkable advancements,...
                  12-02-2024, 01:49 PM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 12-17-2024, 10:28 AM
                0 responses
                32 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-13-2024, 08:24 AM
                0 responses
                48 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-12-2024, 07:41 AM
                0 responses
                34 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 12-11-2024, 07:45 AM
                0 responses
                46 views
                0 likes
                Last Post seqadmin  
                Working...
                X