Assignment 5 Statistical Significance
Due Wednesday 7 November.
This NASA post, "NASA Data Link Pollution to Rainy Summer Days in the Southeast," claims that it rains more in the southeastern US Tuesday through Thursday than it does Saturday through Monday. The presence of a seven-day cycle in the weather is "eerie" evidence that human activity influences the weather.
Your mission in this assignment is to see if you can validate their claim using data from the instruments at RDU airport. Of course, the NASA researchers had access to much richer data, so we are not really equipped to confirm or refute their claim. But we're in the southeast and have some data, so let's see what we can do.
I have collected 10 years of data into a single file, krdu-rain-2001-2010.csv, in a format suitable for use with np.loadtxt(). The data have 4 columns: year, month, day, and rainfall in inches.
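As a minimal sketch of the reading step, the snippet below loads a few made-up rows in the four-column layout described above. The comma delimiter and the sample values are assumptions, not taken from the real file; with the real data you would pass the filename instead of the in-memory buffer.

```python
import io
import numpy as np

# Hypothetical rows standing in for krdu-rain-2001-2010.csv.
# The comma delimiter is an assumption about the file's layout.
sample = io.StringIO(
    "2001,1,1,0.00\n"
    "2001,1,2,0.15\n"
    "2001,1,3,0.02\n"
)

# np.loadtxt returns a 2-D float array, one row per day.
data = np.loadtxt(sample, delimiter=",")

# Split out the columns: date parts as integers, rainfall in inches as floats.
year = data[:, 0].astype(int)
month = data[:, 1].astype(int)
day = data[:, 2].astype(int)
rain = data[:, 3]
```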
There is no template for this assignment. You'll need to
- Read the data.
- Determine the days of the week from the dates.
- Write a function to get the average daily rainfall during midweek (Tuesday through Thursday) and weekend (Saturday through Monday).
- Report the average rainfall for midweek and weekend, and their difference (delta).
- Determine and report the p value by simulation. The p value is the probability of seeing a difference at least this large by chance alone, if the day of the week actually had no effect. First, compute delta, which in our case is the difference between the means for midweek and weekend. Then run the function many times, each time permuting the day labels, and count the number of times that the difference between the new means is greater than delta. Divide that count by the number of trials you ran; that is the p value. If count is 0, you didn't find even a single permutation that produced a greater difference in means, so the effect is very likely real. On the other hand, if count is large, many permutations produced greater differences, so the observed difference is likely just random.
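The procedure in the last step can be sketched roughly as follows. This is one possible shape for it, run on synthetic data rather than the RDU file. The function names, the seed, and the use of a NumPy `Generator` (whose `shuffle` method is used here in place of `np.random.shuffle` so that runs are repeatable) are illustrative assumptions. Weekday numbers follow the `datetime` convention, where Monday is 0.

```python
import numpy as np

MIDWEEK = (1, 2, 3)  # Tuesday, Wednesday, Thursday (Monday is 0)
WEEKEND = (5, 6, 0)  # Saturday, Sunday, Monday

def mean_difference(rain, weekday):
    """Mean midweek rainfall minus mean weekend rainfall."""
    midweek = rain[np.isin(weekday, MIDWEEK)].mean()
    weekend = rain[np.isin(weekday, WEEKEND)].mean()
    return midweek - weekend

def permutation_p_value(rain, weekday, trials=1000, seed=0):
    """Return (delta, p) where p is the fraction of shuffles that beat delta."""
    rng = np.random.default_rng(seed)
    delta = mean_difference(rain, weekday)
    labels = weekday.copy()  # shuffle a copy so the real labels survive
    count = 0
    for _ in range(trials):
        rng.shuffle(labels)  # permute the day labels in place
        if mean_difference(rain, labels) > delta:
            count += 1
    return delta, count / trials

# Synthetic stand-in data: 100 weeks of random "rainfall" with no real weekly cycle.
rng = np.random.default_rng(1)
weekday = np.arange(700) % 7
rain = rng.random(700)

delta, p = permutation_p_value(rain, weekday)
print(f"delta = {delta:.3f}, p = {p:.3f}")
```

Shuffling the labels rather than the rainfall keeps the rainfall values fixed and only breaks their link to the day of the week, which is what the "no effect" assumption amounts to.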
Hints:
- You'll find these posts by Allen Downey very illuminating: "There is only one test!" and "More hypotheses, less trivia" (especially the paragraph Permutation under Difference in means). You shouldn't use his code; write your own.
- The weekday method of the date class from the datetime module will be useful for getting the day of the week from the date. You can use it like this.
    from datetime import date

    # later in your code when you want to determine the day of the week
    # we create a date object
    do = date(year, month, day)
    # and use its weekday method
    day = do.weekday()

    # or do it all in one step
    day = date(year, month, day).weekday()
The date class knows nothing about numpy, so you'll need to use a loop to process all the data.
- The np.random.shuffle function will be useful for permuting the data.
- You'll need to use a loop to run your simulation many times. I found that 1000 trials gives a fairly stable result (0.04 to 0.06) and doesn't take long to run (good for debugging). On my laptop 100,000 trials took only a couple of minutes.
- I confirmed my intuition about how this should work by tweaking the data. For example, if I add 0.1 inch of rain to every Tuesday, the p value drops to zero, but if I replace the rainfall with random numbers, it rises to near 0.5.
- I get about 0.02 inches of difference in rainfall between midweek and weekend.
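The weekday hint above can be sketched as a loop over the date columns. The three dates here are hypothetical examples, not rows from the data file, and the `int()` conversions are there because `date` expects plain Python integers rather than numpy values.

```python
from datetime import date
import numpy as np

# Hypothetical date columns, shaped like the first three columns of the data.
year = np.array([2001, 2001, 2010])
month = np.array([1, 1, 12])
day = np.array([1, 2, 31])

# date() works on one day at a time, so loop over the rows and collect
# the weekday numbers (Monday is 0, Sunday is 6) into a numpy array.
weekday = np.array([
    date(int(y), int(m), int(d)).weekday()
    for y, m, d in zip(year, month, day)
])

print(weekday)
```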