Sunday, June 17, 2012

Excel: Storing the Results of Repeated Calculations

I had a spreadsheet containing a random-number formula. Each time it would calculate, it would generate a different random number, and this would result in a different output value. I wanted to run that same calculation numerous times, so as to see what values tended to result; and I wanted to save those values for later analysis. This post describes how I addressed that problem.

With guidance from Stephen Bullen, I set up my spreadsheet to look like this:

Before describing how it worked, I will explain each of those boldfaced headings:
  • Random Guide. This was a random number that I chose. Basically, I had 3,141 items to select from, representing the rates of poverty in each of 3,141 counties in the United States. I wanted to select only 30 of those 3,141 items each time; but I wanted to do many such selections (of 30 items each time), to see what kind of results I would get, on average -- to see, in other words, how well those 30 items would represent the total of 3,141. (I was interested in this because of a separate statistics problem.) So this "Random Guide" value is the percentage of data points that I would not select each time. In my main table, I set up a Selected column, using a RAND (random number) calculation on each of the 3,141 rows in the main table, to decide whether that particular row would be selected in that particular go-round.
  • N. This was the number of items that actually got selected in this particular instance. Over the long haul, if I had my numbers right, the random number generator would average out at 30 items per iteration; but there could be quite a bit of variation in the individual case.
  • Mean. Similar to N. This was the mean (average) of the data values that got selected this time around. This was what I was really after, for statistical purposes: I wanted a list of the sample means that would result from many selections of approximately 30 random items from the data table.
  • Zero to Stop. If I put a zero in this cell, calculations would stop. See below.
  • Current Iteration and Next Iteration. If the spreadsheet was now calculating the 12th random selection from the table, this would read 12. At the point when I took the snapshot shown above, the calculation of my desired set of 1,000 sample means was nearly completed.
  • Mean Results. This is the mean of the values shown in the results table. This would show error values, first time through, because there would be some zero values in the Results table. When I took the snapshot shown here, I was on my second pass through the Results table, so the means appeared properly at this point.
  • Results:  Iteration. I set up the Results table to show the results of 1,000 calculations of Mean and N, based on 1,000 random selections of about 30 items from my table of 3,141 items.
  • Results:   Mean. On the first iteration -- that is, the first time I ran the calculation to select a sample of about 30 items -- I got, from those 30 items, a mean poverty rate of about 13.6. So that particular sample was a bit on the high side, when compared to the overall 13.3% rate that I would get from either (a) direct calculation of the mean of all poverty rates shown in the main table or (b) the means of 1,000 samples, as shown on the Mean Results line (above).
  • Results:   N. On the first iteration, the random number generator resulted in the selection of 36 items from the main table. There didn't appear to be a way  to restrict this to produce exactly 30 items each time, without screwing up the desired randomness of the selection process.
So now that I have explained what my spreadsheet was doing, I will explain how it worked. Here is an image of the formulas driving those cells, as displayed here using Tools > Formula Auditing > Formula Auditing Mode (shortcut: Ctrl-Backtick, i.e., Ctrl-`):

The explanations of those formulas are as follows:
  • Random Guide.  See above.
  • N.  This one counted the number of cells that were marked for selection (in the Selected column in the main table -- see Random Guide, above) in a given iteration's random selection.
  • Mean.  The average of the values displayed in the Selected column in the main table.
  • Zero to Stop.  If I put a zero in this cell, the spreadsheet would not enter values into the Results table.  After putting zero here, any other value would restart the count at Iteration 1 upon pressing F9. I would have had to refine the spreadsheet to let me restart at some other location.
  • Current Iteration and Next Iteration.  These cells circularly referred to each other.  That was permissible only after selecting Tools > Options > Calculation tab > Iteration.  I had to set the maximum number of iterations to just 1:  any larger value would cause the spreadsheet to generate lots of duplicative values that would distort my results.  Holding down the F9 key would sometimes insert zero values; I had to hit F9 each time to recalculate.  The differences between these two cells were that (a) Current Iteration checked Zero to Stop before proceeding, and (b) Next Iteration incremented the Current Iteration by one.
  • Mean Results.  Explained above.
  • Results:  Mean and  Results:  N.  These formulas looked at the Current Iteration value.  If it was not playing their tune, they would (circularly) keep whatever value they already had:  zero, at the beginning, or whatever had already been calculated for them otherwise.  But if their number was up, they would copy new values from cells I2 and I3.
The approach described here worked.  It was very tedious to have to hit F9 each time I wanted to calculate a cell, given that I was doing it over 3,000 times.  It was not clear why I was not getting the kind of automatic calculation I saw in Bullen's gs_demo.xls spreadsheet.  I was still working on figuring that out.