This take-home is cumulative, covering all material presented in class and on homework. All work must be submitted electronically. Use of computational tools (e.g., R) is encouraged; when you use them, code inputs and outputs must be shown in-line (not as an appendix) and accompanied by plain English briefly explaining what the code is doing. Extra credit, augmenting your score by at most 10%, is available for (neatly formatted) solutions authored in Rmarkdown and submitted as a working .Rmd file.
Students must limit themselves to methods and libraries discussed in class. For full credit, all steps must be shown.
Most importantly, all work must be your own. In this exam setting, communicating with others about the problems or solutions is not allowed, and doing so will be considered a breach of the honor code. All questions of clarification, etc., must be directed to Prof. Gramacy.
The problems below are deliberately open-ended. In some cases there may be several “correct” answers. The questions are similar in scope to homework questions, but there is no prompting about what to do or in what order. Explaining why you are doing something is as important as what you are doing; you are being tested on your instincts, on your ability to be thorough (without being pedantic), and on your execution (ability to get the job done). Include the necessary plots, tests, diagnostics, and model probabilities to illustrate and support your conclusions. Presentation matters, and longer is not better.
Be careful: any “data mining” you do may be computationally intensive. Don’t leave these problems to the last minute: allow your computer some time to work on your behalf while you are doing something else.
The file elec.csv contains data on the rate, measured in megawatts (MW), of electricity delivered to Gulf Energy customers in Alabama. Also provided are the average daily temperature readings (temp in degrees Fahrenheit) in that market for each of the 364 days of the study, which started on January 1, a Sunday. To operate effectively, power companies must be able to predict daily peak demand for electricity.
Your task is to provide a fitted model for forecasting daily electricity demand. In addition to describing your modeling enterprise, you might consider the following, along with anything else you deem relevant.
- Comment on the accuracy of your forecaster with particular focus on peak demand.
- What does your fitted model forecast for the last day of the year, i.e., for the day following the last day in the data? Be sure to include uncertainty estimates.
- Consider extending your method to provide forecasts (and uncertainties) for each day of the first full week of the following year.
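As a purely illustrative starting point (not a prescribed solution), one might regress demand on temperature and a day-of-week factor and use R's prediction intervals to quantify forecast uncertainty. The code below runs on simulated stand-in data because the real file is not reproduced here; the column names `demand`, `temp`, and `day` are assumptions about how you might organize `elec.csv`.

```r
## Simulate a year of daily data mimicking the described structure of
## elec.csv: demand responds quadratically to temperature (heating and
## cooling both raise load), plus a weekday effect.  Day 1 is a Sunday.
set.seed(1)
n <- 364
temp <- 55 + 25*sin(2*pi*(1:n)/365 - pi/2) + rnorm(n, sd=5)
day <- factor(rep(c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"), length.out=n),
              levels=c("Sun","Mon","Tue","Wed","Thu","Fri","Sat"))
demand <- 200 + 0.1*(temp - 55)^2 +
          15*(day %in% c("Mon","Tue","Wed","Thu","Fri")) + rnorm(n, sd=10)
elec <- data.frame(demand, temp, day)

## A simple working model: quadratic in temperature plus day-of-week.
## With the real data you would read.csv("elec.csv") instead of simulating.
fit <- lm(demand ~ temp + I(temp^2) + day, data=elec)

## Forecast day 365 (another Sunday, since 364 = 52 full weeks) with a 95%
## prediction interval; the temperature value is a placeholder you would
## need to supply or forecast separately.
newday <- data.frame(temp=40, day=factor("Sun", levels=levels(day)))
predict(fit, newdata=newday, interval="prediction", level=0.95)
```

Extending to the first full week of the next year amounts to supplying a seven-row `newdata` frame, though the growing uncertainty in future temperatures deserves comment.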
This question investigates whether or not there is a systematic racial bias in who is stopped by Washington State Police (WSP) officers. The data (in wspts.csv) consists of the number of both officer-initiated traffic stops (e.g., without a radar trap or a crash) and radar-initiated traffic stops recorded for each of six racial groups between November 1, 2005 and September 30, 2006 for 34 autonomous patrol areas (APAs). So, for example, the first observation in the data file tells us that in APA 2, 11445 white people were stopped at the discretion of a WSP officer and 2531 white people were stopped due to indications from radar. You may find it helpful to note that APAs are roughly ordered by distance from Seattle.
Your task is to build and fit an appropriate model (or models) in order to provide evidence for or against racial bias by relating officer-initiated traffic stops to radar-initiated traffic stops. Use radar-initiated traffic stops as a benchmark for the population that is at risk of being stopped by the WSP. These drivers are selected from passing motorists based upon driving characteristics, and there is very little chance of racial bias. If members of a particular race are actively stopped (at the discretion of a WSP officer) at a different rate than predicted by this benchmark, we have evidence of racial bias.
Note: Many researchers suggest that a difference between the racial distribution of persons stopped by police and the racial distribution of the population at risk of being stopped would constitute evidence of racial profiling. This implicit definition reveals the key empirical problem in testing for racial profiling: measuring the risk set, or the benchmark racial distribution, against which to compare the racial distribution of traffic stops by officers.
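One natural way to operationalize the benchmark idea, sketched here for illustration only, is a count regression in which officer-initiated stops are modeled with the radar-initiated count as an exposure via an offset; a significant race effect then measures departure from the benchmark rate. The column names (`officer`, `radar`, `race`, `apa`) are assumptions about the layout of `wspts.csv`, and the data below are simulated stand-ins.

```r
## Simulated stand-in data: two races across five APAs, with radar counts
## serving as the at-risk benchmark and officer counts generated at a
## slightly different rate for one group.
set.seed(2)
dat <- data.frame(
  race  = factor(rep(c("black","white"), each=5)),
  apa   = factor(rep(1:5, times=2)),
  radar = rpois(10, 2000)
)
dat$officer <- rpois(10, dat$radar * ifelse(dat$race == "black", 5.5, 5))

## Poisson regression with log(radar) as an offset: exp() of a race
## coefficient is that group's officer-stop rate relative to the baseline
## group, per radar-benchmarked driver.
fit <- glm(officer ~ race + apa, offset=log(radar),
           family=poisson, data=dat)
summary(fit)$coefficients
```

Other reasonable choices (e.g., binomial models of the officer/radar split, or interactions with APA to exploit the distance-from-Seattle ordering) could be compared on the same footing.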
The data for this question considers attributes of e-mails collected at HP Labs in Palo Alto, CA, in the late 1990s. The file spam.csv comprises 57 attributes of 4601 e-mails, together with a human-assigned label (spam: 1 if the e-mail is spam, 0 if not). The attributes are:
- 48 of them are word frequencies (named w_word, where word is a particular word). Some of the “words” are numbers, including common telephone prefixes.
- 6 of them are character frequencies (named c_char, where char is one of “;”, “(”, “[”, “!”, “$”, or “#”).
- The final three (using the prefix caps_) give the average, longest, and total lengths of strings of capital letters in the e-mail.
Your task is to build a spam filter, i.e., to build a predictor for detecting which new e-mails are spam and which are not. In addition to describing your modeling enterprise, and commenting on out-of-sample accuracies and anything else you deem relevant, you might consider the following.
- What is the accuracy of your spam filter out-of-sample?
- Try (linear) models both with and without interaction terms. Remember the response is binary.
- Try nonlinear methods. Again, remember the response is binary.
- For more info see the UCI page. In particular, note that “false positives (marking good mail as spam)” may be undesirable.
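To make the out-of-sample evaluation concrete, here is a minimal sketch of a logistic-regression baseline with a random train/test split. It runs on simulated data shaped like the description of spam.csv (binary `spam` response, numeric predictors); the predictor names below are invented for illustration, and with the real data `read.csv("spam.csv")` would replace the simulation.

```r
## Simulated stand-in for spam.csv: a binary response driven by a few
## nonnegative frequency-like predictors (names here are placeholders).
set.seed(3)
n <- 1000
X <- data.frame(w_free = rexp(n), c_excl = rexp(n), caps_long = rexp(n, 1/10))
p <- plogis(-2 + 1.5*X$w_free + 1*X$c_excl + 0.05*X$caps_long)
spam <- data.frame(spam = rbinom(n, 1, p), X)

## Hold out a random quarter of the data for an out-of-sample estimate.
test <- sample(n, n/4)
fit  <- glm(spam ~ ., family=binomial, data=spam[-test, ])
phat <- predict(fit, newdata=spam[test, ], type="response")

## Confusion matrix and misclassification-based accuracy; raising the 0.5
## threshold trades false positives (good mail flagged as spam) for false
## negatives, which matters given the note above.
tab <- table(truth=spam$spam[test], pred=as.numeric(phat > 0.5))
sum(diag(tab)) / sum(tab)
```

Interaction terms (`spam ~ .^2`) and nonlinear methods would be compared against this baseline on the same held-out set.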