
  • Contingency tests in R, Error with large numbers

    I have been struggling to figure out how to fix this error, and I thought why not try the seqanswers community. I am fairly new to R though, so please forgive me if this is a fairly easy solution.

    I am trying to perform multiple Fisher's Exact tests or Pearson's Chi-squared contingency tests from a data matrix in which each row holds the data for an independent test.

    My data is formatted as such:

    AAA 75533 4756922556 88210 6715122129
    BBB 14869 4756983220 16384 6715193955
    CCC 7230 4756990859 8559 6715201780
    DDD 18332 4756979757 23336 6715187003
    EEE 14733 4756983356 16826 6715193513
    FFF 2918 4756995171 3433 6715206906
    GGG 3726 4756994363 4038 6715206301
    HHH 6196 4756991893 7011 6715203328
    III 7925 4756990164 9130 6715201209
    JJJ 1434 4756996655 1602 6715208737
    Where the 1st column is the identifier, the 2nd column = observations 1, the 3rd column = background counts 1, the 4th column = observations 2 and the 5th column = background counts 2.

    I am loading my data like this:

    > data=read.table("My.File", header=FALSE)
    And I am looping through each row to perform a test like this:

    > pvalues=c("pvalue")
    > for(i in 1:10){
    + datamatrix=matrix(c(as.integer(data[i,2:5])),nrow=2)
    + fisherresult=fisher.test(datamatrix)
    + pvalues=cbind(pvalues,fisherresult[1])
    + }
    Here is the Error I am Getting:

    Error in fisher.test(datamatrix) :
    all entries of 'x' must be nonnegative and finite
    In addition: Warning messages:
    1: In matrix(c(as.integer(data[i, 2:5])), nrow = 2) :
    NAs introduced by coercion
    2: In matrix(c(as.integer(data[i, 2:5])), nrow = 2) :
    NAs introduced by coercion
    When I replace the large number in the 3rd and 5th column with smaller numbers, the statistical calculation works fine.

    Any ideas? Any help would be GREATLY appreciated!

  • #2
    Contingency tests in R, Error with large numbers

    Unless your columns are in the wrong order, in the data sample you've shown,
    the background counts are way higher than the observed counts.

    You could also try posting this on the R/Bioconductor mailing list:



    • #3
      Hi mastal,
      Yes, that is correct. The background frequencies are much larger than the observed frequencies.

      Thanks for your suggestion. I will try posting to the R/Bioconductor mailing list.



      • #4
        This is because your background counts are larger than the largest value R's integer type can hold, 2^31 - 1 (about 2.1*10^9). If you look at help(as.integer), you'll find that it doesn't support numbers beyond roughly +/-2*10^9.
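The limit can be checked directly in an R session. The snippet below is a minimal sketch that uses one of the background counts from the first post; the variable names are made up for illustration.

```r
# Largest value R's 32-bit integer type can represent (2^31 - 1).
int_max <- .Machine$integer.max

# A background count taken from the example data in the first post.
big <- 4756922556

# Coercing to integer overflows and yields NA (normally with a warning).
as_int <- suppressWarnings(as.integer(big))

# Doubles hold whole numbers exactly up to 2^53, so the count survives as numeric.
as_num <- as.numeric(big)

print(int_max)
print(as_int)
print(format(as_num, scientific = FALSE))
```

This is the same coercion that produces the "NAs introduced by coercion" warnings in the original loop, since that loop calls as.integer() on each row.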



        • #5
          Thanks dpryan,
          I just ran into this answer myself from the following post:



          Hmmm...is there a way to change this, I wonder?



          • #6
            Not to my knowledge, though I'm looking further into this in case I ever run into it (I follow the Bioconductor email list too, so hopefully someone will reply with a good solution). There's some limited support in the int64 package, but I think you would otherwise have to recompile R and change the default size of int with a compiler switch (you can run into similar problems if you try to do svd on large datasets, since matrix indexing is still 32-bit on some levels).



            • #7
              I appreciate your insight.

              This is pretty frustrating. Unfortunately, I am really at a standstill until I can figure out how to generate these p-values. I have about a hundred million tests to perform and was going to break down the jobs into batches of about 100,00 tests.

              Outside of R, do you happen to know of any other solutions? For example, I ran a couple tests in JMP, which worked fine (and thus apparently has a larger integer limit).
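One possible way to get p-values inside R despite the integer limit, assuming Pearson's chi-squared test (mentioned in the first post) is acceptable in place of Fisher's exact test: keep the counts as doubles rather than coercing them with as.integer(). The sketch below types in two rows of the example data so it is self-contained; the column names are invented for illustration. Note that it does not fix fisher.test itself, which converts its input to integers internally.

```r
# Two rows of the example data from the first post, entered by hand.
# Column names are hypothetical; the original file has no header.
counts <- data.frame(
  id   = c("AAA", "BBB"),
  obs1 = c(75533, 14869),
  bg1  = c(4756922556, 4756983220),
  obs2 = c(88210, 16384),
  bg2  = c(6715122129, 6715193955),
  stringsAsFactors = FALSE
)

pvalues <- numeric(nrow(counts))
for (i in seq_len(nrow(counts))) {
  # unlist() + as.numeric() keeps the counts as doubles, so nothing becomes NA.
  m <- matrix(as.numeric(unlist(counts[i, 2:5])), nrow = 2)
  # chisq.test() applies Yates' continuity correction to 2x2 tables by default.
  pvalues[i] <- chisq.test(m)$p.value
}
names(pvalues) <- counts$id
print(pvalues)
```

With counts this large the chi-squared approximation should be reasonable, but the p-values can be small enough to underflow toward zero, so a couple of spot-checks against another package are still worthwhile.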



              • #8
                Not if you want anything close to user-friendly, at least. There are open-source algorithms (available from netlib, which is what R actually uses) that you can more easily recompile to make int a 64-bit integer by default. You can then write a "simple" wrapper program to parse your dataset and run through the statistics. I've had to do this with other functions in R that are limited by the 32-bit issue (in my case it wasn't how big the numbers were, but that I was dealing with matrices that were too big to be used in the underlying BLAS algorithms). You would probably have to compile BLAS in the same fashion, depending on how the algorithm works (there's a link to the algorithm if you type help(fisher.test) in R). If you're familiar with programming and compilation, this is pretty doable, but I expect it can become really daunting if not. You would also need to do a couple of spot-checks just to make sure that nothing is getting screwed up in the process (since JMP seems to work, I guess you could use that).

                If no one else comes up with something better and your programming knowledge isn't sufficient for this method, you can shoot me a PM and I can (hopefully) walk you through how to go about this via email (I assume that this would be off-topic for this forum).

                I hope that R will transition to 64-bit integers at some point, but that won't be a quick process.
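Short of recompiling anything, and for the roughly hundred million tests mentioned earlier in the thread, a vectorised calculation on doubles might be worth trying. The sketch below assumes that Pearson's chi-squared statistic without continuity correction is an acceptable substitute for Fisher's exact test; chisq_2x2 and the column names are hypothetical helpers, not part of R.

```r
# Pearson chi-squared statistic for many 2x2 tables at once
# (closed form, no Yates correction). All arguments are numeric vectors.
# a, b: observations and background counts for sample 1
# c, d: observations and background counts for sample 2
chisq_2x2 <- function(a, b, c, d) {
  n <- a + b + c + d
  n * (a * d - b * c)^2 / ((a + b) * (c + d) * (a + c) * (b + d))
}

# Three rows of the example data from the first post, entered by hand.
counts <- data.frame(
  id   = c("AAA", "BBB", "CCC"),
  obs1 = c(75533, 14869, 7230),
  bg1  = c(4756922556, 4756983220, 4756990859),
  obs2 = c(88210, 16384, 8559),
  bg2  = c(6715122129, 6715193955, 6715201780),
  stringsAsFactors = FALSE
)

# One statistic per row, computed without an explicit loop.
stat <- with(counts, chisq_2x2(obs1, bg1, obs2, bg2))

# Natural-log p-values: these stay finite even when the p-value itself
# would be too small to represent as a double.
log_p <- pchisq(stat, df = 1, lower.tail = FALSE, log.p = TRUE)

result <- data.frame(id = counts$id, statistic = stat, log_p = log_p)
print(result)
```

For a single row the statistic should agree with chisq.test(m, correct = FALSE), which makes a convenient spot-check before running a full batch.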



                • #9
                  Thanks again for your help. Unfortunately, I am not much of a programmer. What I think I'll do is recruit a collaborator from my institution to see if they can help come up with a solution.

                  I will update this thread once we've come up with a solution.

                  Many thanks again.
