## Part 1

Introduce yourself by posting to the Class Introductions forum on D2L. Include a bit of information about yourself, including some of the following:

• a. The college you are in and the degree you are pursuing
• b. Your work background, especially as it relates to data analysis. What brings you to this class, and what is your interest in multivariate analysis?
• c. What kind of data are you interested in pursuing? (This is useful for forming groups in the first two weeks.)

The following three problems are due by the second lecture. Be prepared to ask questions about the math, if you have them, at the beginning of the second lecture.

## Part 2

Perform, by hand, the following calculations from linear algebra for the following matrices and vectors. Submit a scanned copy of your answers (no cell phone photos).

## Part 3

In R, write a script to compute each of the parts in problem 1 to check your answers. Submit both the .r file and the output. Then, create a dataset with `x = <5, -3, 2, 4>` and `y = <2, 1, -1, 3>` and run a regression analysis on the data. Compare your value from part j of the last problem with the coefficients calculated by R’s `lm` function.

A few commands in R will help:

• a. `as.matrix(vector or data.frame)` to convert data to matrices
• b. `M = matrix(c(entries by column), nrow = [#rows], ncol = [#col])`
• c. `v = c(entries)`
• d. `t(M)` for transpose, `det(M)` for determinant
• e. `ginv(M)` for inverse (note that you will need the MASS package loaded for this)
• f. `%*%` for matrix multiplication
• g. `fit = lm(y ~ x, data = dataset)`
• h. `summary(fit)`
• i. for the dot product, see the lecture on how you can do it with matrix multiplication
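
As a rough illustration of how the commands above fit together, here is a minimal sketch on made-up values. The matrix `M`, the vector `v`, and the small `dataset` are hypothetical and are not the assignment's data; the dot product is written as a matrix product, which is one common way to do it and may or may not match the form shown in lecture.

```r
library(MASS)  # provides ginv()

# Hypothetical toy values, not the assignment's matrices
M <- matrix(c(2, 1, 1, 3), nrow = 2, ncol = 2)  # entries are filled by column
v <- c(1, 2)

M_t   <- t(M)       # transpose
M_det <- det(M)     # determinant
M_inv <- ginv(M)    # generalized inverse; matches the ordinary inverse when M is invertible
check <- M %*% M_inv  # should be (numerically) the 2 x 2 identity

# Dot product of v with itself via matrix multiplication (a 1 x 1 matrix)
dot_vv <- t(v) %*% v

# Simple regression on a hypothetical dataset
dataset <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 5, 8))
fit <- lm(y ~ x, data = dataset)
summary(fit)
```

For an invertible square matrix, base R's `solve(M)` returns the same inverse without loading MASS.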

## Part 4

Use the dataset `mtcars`, which is built into R. You can see the structure of the data with the command `head(mtcars)`. Perform the following operations:

• a. Create a copy of the dataset called A with only the columns {cyl, disp, hp, wt, carb}. Use the column selection mechanism we covered in class to select these columns from the dataset.
• b. Add a column of ones to A called “count”.
• c. Use the `as.matrix` function to convert A to a matrix and assign the result back to the variable A (so you are overwriting the data.frame with a matrix).
• d. Compute the following multiple regression by manually computing the matrix operations.
• e. Compute the regression with R’s `lm` function and compare with your results from (d). Note any differences.
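
The steps above can be sketched roughly as follows. Two things here are assumptions rather than part of the assignment: the response variable (`mpg` is used purely as a placeholder, since the regression to compute is specified separately) and the column selection style (selection by name, which may differ from the mechanism covered in class). The manual fit uses the standard least-squares normal equations.

```r
# Select columns by name (one of several possible selection mechanisms)
A <- mtcars[, c("cyl", "disp", "hp", "wt", "carb")]

# Add a column of ones, which plays the role of the intercept
A$count <- 1

# Overwrite the data.frame with its matrix form
A <- as.matrix(A)

# ASSUMPTION: mpg is a placeholder response, not necessarily the assigned one
y <- mtcars$mpg

# Normal equations: solve (A'A) beta = A'y for beta
beta_hat <- solve(t(A) %*% A, t(A) %*% y)

# Same model via lm(); its "(Intercept)" corresponds to the "count" column
fit <- lm(mpg ~ cyl + disp + hp + wt + carb, data = mtcars)
lm_coef <- coef(fit)[c("cyl", "disp", "hp", "wt", "carb", "(Intercept)")]

# Side-by-side comparison of the two sets of coefficients
cbind(manual = beta_hat[, 1], lm = lm_coef)
```

Any small differences between the two columns would come from floating-point rounding, since `lm` uses a QR decomposition rather than forming the cross-product matrix directly.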

The following problems may be completed with any statistical software you wish. Make sure that the output is clearly indicated and explained in your answer to the problem.

## Part 5

Every four years, many of the world’s greatest athletes gather to participate in the Summer Olympics. In addition to individual (or team) prowess, the Olympics is also a highly-watched pageant of national pride and competition. The data set (Olympics.xls under the course documents for homework 1) for this problem concerns the performance of various countries in the 2012 London Summer Olympics. For each included country, the data contains medal counts, number of athletes (by gender), national population figures, and national GDP (gross domestic product).

It is your job to distill an interesting story or insight from this data, suitable for presentation to the general public. You must choose the message you would like to communicate. It will take some investigation for you to find that message. Is there an important trend or lesson that you would like the public to understand? For example, are there ways to evaluate a country’s “performance” beyond raw medal counts, and if so, do any surprises emerge? Is there any relationship between success in the Olympic Games and the wealth of a country’s people? How well or poorly does a country perform compared to its peers?

You may try different multiple regressions and plots, and you can compare these results to automatic variable selection methods. Be very thorough. In your write-up, be sure to include the graph(s) and analyses you are using to see the relationships, and clearly indicate the intended message of your graphs and analyses.

## Part 6

In a study of genetic variation in sugar maple, seeds were collected from native trees in the eastern United States and Canada and planted in a nursery in Wooster, Ohio. The time of leafing out of these seedlings can be related to the latitude and mean July temperature of the place of origin of the seed. The variables are X1 = latitude, X2 = July mean temperature, and Y = weighted mean index of leafing out time. (Y is a measure of the degree to which the leafing out process has occurred. A high value is indicative that the leafing out process is well advanced.) The data is in the file maple.txt on the course web page under the documents for week 2.

• a. Find the regression of LeafIndex on Latitude. Is latitude a useful predictor of leaf index?
• b. Repeat part (a) for the regression of LeafIndex on JulyTemp.
• c. Find the regression of LeafIndex on Latitude and JulyTemp. Compare the results of this analysis with your results from (a) and (b). How different are the slope coefficients in each case? What best explains the differences in their values?
• d. What statistical measure(s) can you use to detect and quantify this issue? What are the value(s) of these measures for this regression analysis?

## Part 7

The data in the file chicinsur.txt are collected from 47 zip-code areas in the Illinois area. There are 8 columns in the data file, but not all are relevant here. The response variable of interest is the number of new home insurance policies minus canceled policies (NEWPOL) per 100 housing units. The predictor variables are the percent minority population living in the area (PCTMINOR), the number of fires per 1000 housing units (FIRES), the number of thefts per 1000 in population (THEFTS), the percent of housing units built before 1940 (PCTOLD), and the median income (INCOME). We are interested in which of these variables are significant predictors of insurance policies issued.

• a. Before running any regressions, predict what sign the coefficient of each predictor should have. Obtain the correlation matrix for the variables PCTMINOR, FIRES, THEFTS, PCTOLD, INCOME, and NEWPOL. Do the simple correlations support your predictions about the signs?
• b. Run a multiple regression of NEWPOL on the variables listed above.
• i. Comment on the overall significance of the regression fit.
• ii. Which predictors have coefficients that are significantly different from zero at the .05 level?
• iii. Do any of the predictors have signs that are different than suggested by their simple correlations? If so, explain what may be happening. If not, explain how such a thing can happen.
• iv. Examine a plot of residuals versus predicted values. Do you see any problems?
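
The mechanics of these steps (correlation matrix, multiple regression, sign comparison, residual plot) might look like the sketch below. Because the layout of chicinsur.txt is not shown here, the built-in `mtcars` data stands in for it; the variable choices are placeholders and carry no meaning for the insurance problem.

```r
# Stand-in data: mtcars is used only to illustrate the workflow
vars <- mtcars[, c("mpg", "wt", "hp", "qsec")]

# Simple (pairwise) correlations among response and predictors
cor_mat <- cor(vars)
round(cor_mat, 2)

# Multiple regression; summary() reports the overall F-test
# and a t-test for each coefficient
fit <- lm(mpg ~ wt + hp + qsec, data = vars)
summary(fit)

# Compare the sign of each coefficient with its simple correlation
predictors <- c("wt", "hp", "qsec")
sign_cor   <- sign(cor_mat[predictors, "mpg"])
sign_coef  <- sign(coef(fit)[predictors])
data.frame(sign_cor, sign_coef)

# Residuals versus predicted values
plot(fitted(fit), resid(fit), xlab = "Predicted", ylab = "Residual")
abline(h = 0, lty = 2)
```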

## Part 8

The Housing dataset (under the course documents for week 3) contains housing values in the suburbs of Boston. The detailed explanation of the input and output variables can be found in the UCI Machine Learning Repository. Note that in R, you can load this file with simply `read.table("housing.dat")`. If you try to specify a separator, R will get confused by the multiple spaces between fields.

1. CRIM: per capita crime rate by town
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX: nitric oxides concentration (parts per 10 million)
6. RM: average number of rooms per dwelling
7. AGE: proportion of owner-occupied units built prior to 1940
8. DIS: weighted distances to five Boston employment centers
9. RAD: index of accessibility to radial highways
10. TAX: full-value property-tax rate per \$10,000
11. PTRATIO: pupil-teacher ratio by town
12. B: 1000(Bk - 0.63)^2 where Bk is the proportion of African Americans by town
13. LSTAT: % lower status of the population
14. MEDV: Median value of owner-occupied homes in \$1000’s (output variable)

• a. Fit a linear regression model of CRIM based on the other variables and report goodness of fit, the utility of the model, the estimated coefficients, their standard errors, and statistical significance.
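
The earlier note about loading the file without a separator can be illustrated with a small sketch. The two-line temporary file below is made up for the demonstration and only mimics the space-padded layout; it is not the housing data. The commented lines at the end show one possible way to continue with the real file, assuming it has the 14 columns in the order listed.

```r
# Hypothetical two-row file whose fields are padded with runs of spaces
tmp <- tempfile(fileext = ".dat")
writeLines(c(" 0.10   18.0   2.31",
             " 0.02   11.0   7.07"), tmp)

# Default sep = "" treats any run of whitespace as a single delimiter
ok <- read.table(tmp)
dim(ok)

# Forcing sep = " " makes every single space a field boundary,
# so extra, empty columns appear between the real values
bad <- read.table(tmp, sep = " ")
ncol(bad)

# With the real file, one possible continuation (assumed column order):
# housing <- read.table("housing.dat")
# names(housing) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
#                     "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
# fit <- lm(CRIM ~ ., data = housing)
# summary(fit)
```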