## Part 1

Note: If your compiled PDF does not look right, try adjusting the number of spaces in the indent1 variable.

A bitmap image can be read in via the following command:

Plot the image in grayscale:
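
The reading command and the image file are not reproduced here, so the following is only a sketch of the plotting step. It assumes img is a numeric matrix of pixel intensities and substitutes a synthetic matrix for it; the 33-level gray palette and the flip before plotting are illustrative choices, not necessarily the ones used in the course.

```r
# Stand-in for the bitmap: a synthetic 64 x 48 intensity matrix (assumption;
# the real img comes from the course's bitmap file).
img <- outer(1:64, 1:48, function(i, j) sin(i / 8) + cos(j / 6))

# image() draws the rows of its input along the x-axis, so transpose and
# reverse the columns to show the matrix in its usual top-to-bottom layout.
img.plot <- t(img)[, nrow(img):1]

pdf(file = NULL)   # null device so the sketch runs non-interactively;
                   # remove this line and dev.off() to view the plot
image(img.plot, col = gray((0:32) / 32), xaxt = "n", yaxt = "n")
invisible(dev.off())
```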

• (a) Using the syntax ??[keyword], the help system can be searched for any pages that contain the keyword. Also, if the same function name exists in multiple packages, a package can be specified with ?[packagename]::[functionname]. Using this search method, find out what the arguments xaxt and yaxt do in the above image() function by looking up the appropriate help page.
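
As a sketch of the two lookup styles, the function forms below are equivalent to the ?? and ? shortcuts. The results are stored in variables (hypothetical names) so that nothing opens when the code is run as a script; printing either object, or typing the shortcut at the console, displays the result.

```r
# Full-text search of the installed help pages (same as typing ??xaxt).
hits <- help.search("xaxt")

# Help page for a function in a named package
# (same as typing ?graphics::image).
page <- help("image", package = "graphics")

# Printing the objects shows the search hits or opens the help page:
# print(hits)
# print(page)
```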

• (b) Compute the principal components using prcomp and list the objects in the function output; the str function is useful here.
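
A minimal sketch of inspecting the prcomp output, assuming img is a numeric matrix. A small random matrix stands in for the image here, so the dimensions differ from the real data.

```r
set.seed(1)
# Synthetic stand-in for the image matrix (assumption).
img <- matrix(rnorm(50 * 20), nrow = 50, ncol = 20)

# prcomp centers the columns by default; scaling is off unless requested.
pca.img <- prcomp(img)

names(pca.img)   # names of the components in the returned list
str(pca.img)     # compact view of each component's type and dimensions
```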

• (c) Recall that principal components are linear combinations of the data columns. Verify this by multiplying the data matrix (the original bitmap image img, a.k.a. X) by the loadings (the pca.img\$rotation object, a.k.a. the matrix of loading coefficients indexed by ij) and comparing the result to the computed principal components.
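
One point that is easy to miss: prcomp centers each column by default, so the product only matches the stored scores if the data matrix is centered the same way. The sketch below illustrates this on a synthetic stand-in for img (an assumption); scores and X.centered are hypothetical names.

```r
set.seed(1)
img <- matrix(rnorm(50 * 20), nrow = 50, ncol = 20)   # synthetic stand-in
pca.img <- prcomp(img)

# Subtract the same column means that prcomp used.
X.centered <- scale(img, center = pca.img$center, scale = FALSE)

# Principal components (scores) = centered data times loadings.
scores <- X.centered %*% pca.img$rotation

# Largest absolute difference from the scores stored by prcomp;
# it should be zero up to floating-point error.
max(abs(scores - pca.img$x))
```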

• (d) Check that the rotation element of the prcomp output is indeed a rotation matrix, say Q, by verifying a crucial property of orthonormal rotation matrices.
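
A sketch of one such check, assuming the property in question is that the transpose of Q times Q equals the identity matrix. Synthetic data again stands in for img.

```r
set.seed(1)
img <- matrix(rnorm(50 * 20), nrow = 50, ncol = 20)   # synthetic stand-in
Q <- prcomp(img)$rotation

# crossprod(Q) computes t(Q) %*% Q; for orthonormal columns this is the
# identity, so the largest deviation should be zero up to rounding error.
max(abs(crossprod(Q) - diag(ncol(Q))))
```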

• (e) This means we can approximately reconstruct the original data using any number of principal components we choose. Using this fact, reconstruct the image from 10 and from 100 principal components and plot the reconstructed images.
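
The reconstruction can be sketched as the first k score columns times the transpose of the first k loading columns, with the column means added back. The helper reconstruct() is hypothetical, and because the synthetic stand-in has only 20 columns, the sketch uses 5, 10 and 20 components rather than 10 and 100.

```r
set.seed(1)
img <- matrix(rnorm(50 * 20), nrow = 50, ncol = 20)   # synthetic stand-in
pca.img <- prcomp(img)

# Rebuild the data from the first k components (hypothetical helper).
reconstruct <- function(pca, k) {
  approx <- pca$x[, 1:k, drop = FALSE] %*% t(pca$rotation[, 1:k, drop = FALSE])
  sweep(approx, 2, pca$center, "+")   # add the column means back
}

# Frobenius norm of the reconstruction error for several values of k.
err <- sapply(c(5, 10, 20), function(k) {
  norm(reconstruct(pca.img, k) - img, type = "F")
})
err   # the error shrinks as k grows; using all 20 components recovers img
```

The matrix returned by reconstruct() can be passed to image() in the same way as the original img.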

• (f) Plot the proportion of variance explained (PVE) as a function of the number of principal components, and also the cumulative proportion of variance explained. The summary function returns helpful objects, including the PVE. Using this information, find out how many principal components are needed to explain 90% of the variance.
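
A sketch of pulling the PVE out of summary() and locating the 90% threshold. It runs on a synthetic stand-in for img, so the resulting count says nothing about the real image; pve, cum.pve and n.pc are hypothetical names.

```r
set.seed(1)
img <- matrix(rnorm(50 * 20), nrow = 50, ncol = 20)   # synthetic stand-in
pca.img <- prcomp(img)

# summary()$importance is a matrix with one column per component and rows
# for the standard deviation, the PVE, and the cumulative PVE.
imp <- summary(pca.img)$importance
pve <- imp["Proportion of Variance", ]
cum.pve <- imp["Cumulative Proportion", ]

# Smallest number of components whose cumulative PVE reaches 90%.
n.pc <- which(cum.pve >= 0.9)[1]

pdf(file = NULL)   # null device; remove to view the plots
par(mfrow = c(1, 2))
plot(pve, type = "b", xlab = "Principal component", ylab = "PVE")
plot(cum.pve, type = "b", xlab = "Principal component",
     ylab = "Cumulative PVE")
abline(h = 0.9, lty = 2)   # 90% reference line
invisible(dev.off())
```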

## Part 2

Discuss whether or not each of the following activities is a data mining task.

• (a) Looking up customers of a company according to their profitability.
• (b) Computing the total sales of a company.
• (c) Predicting the future stock price of a company using historical records.
• (d) Sorting a student database based on student identification numbers.
• (e) Predicting the outcomes of tossing a (fair) pair of dice.

## Part 3

Consider the Boston housing data from the UCI Machine Learning Repository.

• (a) Describe this data set: How many observations are there? How many variables (or attributes)? What is the unit of analysis?
• (b) Load the data into R and call it Boston.Housing. Note that the columns of the data set are separated by varying amounts of white space. Also, to keep the analysis standard, give the columns descriptive names (part (f), for example, refers to a variable named Crime.Rate).
• (c) Produce a histogram of the median value of owner-occupied homes with the title Histogram of median home value based on Boston Housing Data. Using the appropriate argument of the histogram function you choose, gradually increase the number of bins (create four different histograms). What happens to the histogram?
• (d) Show all four histograms in one chart using par() or grid.arrange() from the gridExtra package.
• (e) Using R, compute mean, median, standard deviation and interquartile range of the median home value. What is a good measure of center and spread of your data? Explain why. Note that you are asked to compute median of the median home value. Does this make sense? Explain.
• (f) Create 5 equally sized ranks of the Crime.Rate variable. Then use a boxplot to analyze whether the median home value differs significantly across the ranks of crime rate by town. Hint: Use the quantile() function.
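
The plotting and ranking steps in (c) through (f) can be sketched as follows. The data frame is simulated rather than loaded from the repository, Median.Value and Crime.Rank are hypothetical column names, and the bin counts are arbitrary, so the output only illustrates the mechanics.

```r
set.seed(1)
# Simulated stand-in for Boston.Housing (assumption); only Crime.Rate is a
# column name taken from the assignment text.
Boston.Housing <- data.frame(
  Crime.Rate   = rexp(500, rate = 0.3),
  Median.Value = pmin(50, rgamma(500, shape = 6, rate = 0.27))
)
medv <- Boston.Housing$Median.Value

pdf(file = NULL)       # null device; remove to view the plots
par(mfrow = c(2, 2))   # 2 x 2 grid holding the four histograms
for (b in c(5, 10, 20, 40)) {
  # breaks is only a suggestion; hist() may adjust it to round cut points
  hist(medv, breaks = b, xlab = "Median home value",
       main = "Histogram of median home value\nbased on Boston Housing Data")
}
par(mfrow = c(1, 1))

# Measures of center and spread.
c(mean = mean(medv), median = median(medv), sd = sd(medv), IQR = IQR(medv))

# Five equally sized ranks: cut Crime.Rate at its quintiles.
cuts <- quantile(Boston.Housing$Crime.Rate, probs = seq(0, 1, by = 0.2))
Boston.Housing$Crime.Rank <- cut(Boston.Housing$Crime.Rate, breaks = cuts,
                                 include.lowest = TRUE, labels = 1:5)

boxplot(Median.Value ~ Crime.Rank, data = Boston.Housing,
        xlab = "Crime rate quintile", ylab = "Median home value")
invisible(dev.off())
```

For the real file, read.table() with its default sep = "" treats any run of white space as a single separator, which handles unevenly spaced columns.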

## Part 4

In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set, which is part of the ISLR package. You will also need to install the class package for part (d).

• (a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Make sure that you make mpg01 a factor variable. Also, use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.
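
A sketch of the construction, using the built-in mtcars data as a stand-in for Auto so that it runs without the ISLR package (an assumption made only for illustration). The names cars and cars01 are hypothetical.

```r
# mtcars ships with R and also has an mpg column; it replaces Auto here.
cars <- mtcars

# 1 if mpg is above its median, 0 otherwise (values equal to the median
# are coded 0), stored as a factor.
mpg01 <- factor(as.integer(cars$mpg > median(cars$mpg)), levels = c(0, 1))

# Single data frame holding mpg01 alongside the other variables.
cars01 <- data.frame(mpg01 = mpg01, cars)
table(cars01$mpg01)
```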

• (b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots, boxplots, and other graphical devices discussed in section may be useful tools to answer this question (include at least 3 different graphs). Describe your findings.

• (c) Split the data into a training set (75%) and a test set (25%). Call them train.set and test.set, respectively. The sample() command may be useful for answering this question.
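
A minimal sketch of a 75/25 random split with sample(). It uses mtcars as stand-in data (an assumption), and the seed value is arbitrary.

```r
set.seed(1)     # make the random split reproducible
dat <- mtcars   # stand-in for the Auto data
n <- nrow(dat)

# Draw 75% of the row indices without replacement for training.
train.idx <- sample(n, size = floor(0.75 * n))
train.set <- dat[train.idx, ]
test.set  <- dat[-train.idx, ]   # the remaining rows form the test set
```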

• (d) Using train.set and test.set, perform k-NN on the training data, with several values of k, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b), and justify your choice. What test errors do you obtain? Which value of k seems to perform the best on this data set?
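
The k-NN step can be sketched with knn() from the class package. Everything specific below is an assumption for illustration: mtcars stands in for Auto, the three predictors are picked arbitrarily rather than from the exploration in (b), and the predictors are standardized with training-set statistics, which the assignment does not require.

```r
library(class)   # provides knn()

set.seed(1)
dat <- mtcars    # stand-in for the Auto data
dat$mpg01 <- factor(as.integer(dat$mpg > median(dat$mpg)))

vars <- c("wt", "hp", "disp")   # predictors chosen only for illustration
train.idx <- sample(nrow(dat), size = floor(0.75 * nrow(dat)))

# Standardize with training-set means and standard deviations so that no
# single variable dominates the Euclidean distance.
train.X <- scale(dat[train.idx, vars])
test.X  <- scale(dat[-train.idx, vars],
                 center = attr(train.X, "scaled:center"),
                 scale  = attr(train.X, "scaled:scale"))

# Test error (misclassification rate) for several values of k.
ks <- c(1, 3, 5, 7)
test.err <- sapply(ks, function(k) {
  pred <- knn(train.X, test.X, cl = dat$mpg01[train.idx], k = k)
  mean(pred != dat$mpg01[-train.idx])
})
names(test.err) <- paste0("k=", ks)
test.err
```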