Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Re-extracting refgene names after DESeq Analysis

    Hi All,

    I ran my RNAseq Illumina Hiseq 50x50 data through DESeq, no replicates, and as a first step obtained my annotations via -

    txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "refGene")
    exonRangesList <- exonsBy(txdb, "gene")

    After DESeq I have results such as -

    id baseMean baseMeanA baseMeanB foldChange log2FoldChange pval
    22410 653048 1820.9684 0 3641.9369 Inf Inf 2.183608e-96
    22411 653067 1820.9684 0 3641.9369 Inf Inf 2.183608e-96
    22412 653219 1820.9684 0 3641.9369 Inf Inf 2.183608e-96
    22413 653220 1820.9684 0 3641.9369 Inf Inf 2.183608e-96
    22638 9503 1820.9684 0 3641.9369 Inf Inf 2.183608e-96
    14024 7345 318.4127 0 636.8254 Inf Inf 8.684167e-23

    I would appreciate it if someone would explain to me (a newbie) -

    1. what exactly my "id" refers to in the output - ie. transcripts/exons/refgenes?

    2. how to obtain corresponding refgene for the output so I can look to see if there are changes in genes of interest

    Thanks a lot!

  • #2
    Hi,

    I have the same problem.

    I am using a small subset of my complete bam files. I have two samples, no replicates.

    when running the script according to the DESeq manual I get a long list of genes with no results.
    I took a sub set of this list, where at least one of the gene counts in not '0'.

    Code:
    geneCounts <- cbind(as.data.frame(mutGeneCounts), as.data.frame(wtGeneCounts))
    temp <- which(geneCounts$mutGeneCounts != 0  | geneCounts$wtGeneCounts !=0)
    geneCounts_small <- geneCounts[temp,]
    conds <- c(rep("mut", 1), rep("wt", 1)) # conditions
    
    cds <- newCountDataSet( geneCounts_small, conds )
    
    res <- nbinomTest(cds, "mut", "wt") # performing the deregulation test
    
    plot(res$baseMean, res$log2FoldChange, log="x", pch=20, cex=.1, 
         col=ifelse(res$padj<=0.7, "red", "blue"))
    
    resSig <- res[ res_small$padj < .1, ]
    When I am looking for the most DE genes, I get this table:
    Code:
                 id  baseMean baseMeanA baseMeanB foldChange log2FoldChange         pval         padj resVarA resVarB
    369 FBgn0031637 1037.5715         0 2075.1430        Inf            Inf 9.848705e-14 1.176920e-11      NA      NA
    89  FBgn0022153  903.7494         0 1807.4988        Inf            Inf 3.820621e-13 3.652513e-11      NA      NA
    4   FBgn0000286  833.5378         0 1667.0756        Inf            Inf 8.324344e-13 6.631728e-11      NA      NA
    365 FBgn0031633  439.2726         0  878.5452        Inf            Inf 3.161212e-10 1.888824e-08      NA      NA
    373 FBgn0031643  323.4535         0  646.9069        Inf            Inf 5.330071e-09 2.547774e-07      NA      NA
    371 FBgn0031639  308.4510         0  616.9020        Inf            Inf 8.562510e-09 3.720800e-07      NA      NA
    1. I understand that the last two are NA because I don't have replicates, but what about the inf values? How do I interprate these values?

    2. For this table I get inf or -inf when searching for the most strongly up- and down-regulated genes respectively.

    Can it be because I am using only a small portion of the complete data set?
    I do get a lot of genes with '0'.

    I would appreciate any suggestions

    Thanks
    Assa

    Comment


    • #3
      Originally posted by silverlining View Post
      1. what exactly my "id" refers to in the output - ie. transcripts/exons/refgenes?
      This is the row number in the count table that you used to create the CountDataSet. If you count table had row names, you would see these instead.

      2. how to obtain corresponding refgene for the output so I can look to see if there are changes in genes of interest
      Before you call 'makeCountDataSet', set the rownames to the IDs of your genes. The example in the vignette assumes that the file with your counts had the gene IDs in the first columns.

      By the way, make sure you count for genes, not for transcript. As RefSeq, IIRC, assigns different IDs to the different transcripts of a gene, your annotation might not be suitable.

      Simon

      Comment


      • #4
        Originally posted by frymor View Post
        1. I understand that the last two are NA because I don't have replicates, but what about the inf values? How do I interprate these values?
        Have a look at the raw counts for these genes. It seems you have several thousand counts in sample B and zero counts in sample A. Hence, you have an extreme upregulation. (It is not really 'infinite', of course, but this is what you get from a division by zero.)

        It seems suspicious to me that you have so many zeroes. Maybe you made a mistake subsampling your SAM file. (Why did you subsample at all?)

        S

        Comment


        • #5
          Originally posted by Simon Anders View Post
          Maybe you made a mistake subsampling your SAM file. (Why did you subsample at all?)

          S
          Thanks for the input. Here is the txt file with counts that I read in -

          head(ch3counts)
          PBS x5.nm x50.nm x500.nm
          100038246 0 2 0 2
          10006 1783 2909 2376 1456
          100126326 0 0 0 0
          100126327 0 0 0 0
          100127889 8 13 17 4
          100128054 10 22 7 6

          the row IDs are the UCSC IDs - the data is not subsampled - it is for all chromosomes. I am not sure why -

          1. why these IDs disappear after DESeq analysis?
          2. why I get the Inf values - not all are 0 for sure but some are

          Again, to remind you, while I await replicates this analysis was done without any replicates.

          Thanks!

          Comment


          • #6
            Originally posted by Simon Anders View Post
            As RefSeq, IIRC, assigns different IDs to the different transcripts of a gene, your annotation might not be suitable.

            Simon
            What do you mean by that? To be honest, I am unsure as to whether I have gene or transcript IDs in my analysis right now but am working on finding out. The basic script I used for counts was -

            txdb <- makeTranscriptDbFromUCSC(genome = "hg19", tablename = "refGene")
            exonRangesList <- exonsBy(txdb, "gene")

            Thanks!

            Comment


            • #7
              Finally, this is how a larger set of my data looks - not all are Inf. There are 256 supposedly differentially expressed and the majority are not Inf.

              id baseMean baseMeanA baseMeanB foldChange log2FoldChange
              22410 653048 1820.96844 0.000000 3641.93688 Inf Inf
              22411 653067 1820.96844 0.000000 3641.93688 Inf Inf
              22412 653219 1820.96844 0.000000 3641.93688 Inf Inf
              22413 653220 1820.96844 0.000000 3641.93688 Inf Inf
              22638 9503 1820.96844 0.000000 3641.93688 Inf Inf
              22554 8277 1587.34635 2.304498 3172.38821 1376.6068354 10.4269009
              1434 283120 2777.58744 118.681666 5436.49321 45.8073550 5.5175074
              21708 100132399 1687.38186 21.892735 3352.87098 153.1499401 7.2588010
              22485 729422 1682.97984 21.892735 3344.06694 152.7477958 7.2550078
              22487 729431 1682.97984 21.892735 3344.06694 152.7477958 7.2550078
              22398 645073 1407.01651 17.283738 2796.74927 161.8139148 7.3381919
              22489 729447 1192.57795 8.065744 2377.09015 294.7142954 8.2031732
              21977 2574 1097.04333 9.217993 2184.86866 237.0221529 7.8888781
              21996 26749 1086.03828 9.217993 2162.85857 234.6344213 7.8742709
              21676 100101629 1069.16388 9.217993 2129.10976 230.9732329 7.8515819
              22488 729442 1092.58725 14.979239 2170.19527 144.8802042 7.1787167
              2873 3855 2276.15738 185.512119 4366.80264 23.5391772 4.5569920
              22484 729408 853.78004 6.913495 1700.64659 245.9894114 7.9424524
              21995 26748 827.05284 9.217993 1644.88769 178.4431386 7.4793206
              22486 729428 805.61887 10.370243 1600.86750 154.3712671 7.2702604
              21674 100008586 689.12292 9.217993 1369.02785 148.5169030 7.2144833
              21981 2578 588.92525 6.913495 1170.93700 169.3697587 7.4040325
              22147 4103 435.64232 1.152249 870.13238 755.1599014 9.5606384
              21969 2543 480.49966 5.761246 955.23808 165.8040795 7.3733357
              21805 114824 494.28184 6.913495 981.65019 141.9904368 7.1496500
              14024 7345 318.41269 0.000000 636.82539 Inf Inf

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Current Approaches to Protein Sequencing
                by seqadmin


                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                04-04-2024, 04:25 PM
              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 04-11-2024, 12:08 PM
              0 responses
              25 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 10:19 PM
              0 responses
              28 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-10-2024, 09:21 AM
              0 responses
              24 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 04-04-2024, 09:00 AM
              0 responses
              52 views
              0 likes
              Last Post seqadmin  
              Working...
              X