Machine Learning Assignment: COMP135 kNN

Requirement

Use the kNN (k-nearest neighbors) algorithm to classify a dataset; the programming language is up to you. kNN is arguably the simplest introductory algorithm in machine learning.

Introduction

In this assignment you will experiment with the k nearest neighbors algorithm (kNN) and the decision tree learning algorithm, and evaluate kNN’s sensitivity to the value of k and the relevance/number of features.
You will use the Weka system for decision trees (J48) and will write your own code for kNN.

Data

The assignment uses 4 datasets available from the UCI repository. To simplify your experiments we have prepared the data in a convenient form. Each dataset is split into a training portion and test portion for use in experiments. The data is already normalized and features do not require further preprocessing. The processed data files are accessible through the course web page.

Your Tasks

Evaluating Decision Trees

Run the default version of the J48 algorithm in Weka on all datasets and record the test set accuracy. Please note that in all experiments in this assignment we train and test on different portions of the data as prepared in advance. In Weka you can use the options -t and -T to specify the separate train and test files respectively. You might want to write a short script to perform this task. We will compare the performance of J48 to kNN below.
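Weka's J48 can be driven from the command line with the -t and -T options mentioned above. Below is a minimal scripting sketch in Python; the dataset names, file paths, and the weka.jar location are placeholders for your own setup.

```python
# Sketch: drive Weka's J48 from Python for each train/test pair and print
# the output, which includes the test-set accuracy. Dataset names, file
# paths, and the weka.jar location are placeholders.
import subprocess

DATASETS = ["dataset1", "dataset2", "dataset3", "dataset4"]  # hypothetical names

for name in DATASETS:
    cmd = ["java", "-cp", "weka.jar", "weka.classifiers.trees.J48",
           "-t", f"{name}-train.arff",   # training file
           "-T", f"{name}-test.arff"]    # separate test file
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(name)
    print(result.stdout)  # look for the "Error on test data" section
```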

Reading Data Files

Write code to read data in ARFF format, as used in the Weka system. To simplify your coding you may assume that all features are numerical. The class, which is the last column in the data rows, is discrete, but it may have more than two values.
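A minimal reader along these lines might look as follows; it assumes a dense ARFF file with numeric features and the class label in the last column, and skips everything before the @data marker.

```python
import numpy as np

def load_arff(path):
    """Minimal ARFF reader for this assignment: all features numeric,
    class label in the last column (kept as a string)."""
    X, y = [], []
    with open(path) as f:
        in_data = False
        for line in f:
            line = line.strip()
            if not line or line.startswith("%"):
                continue            # skip blank lines and comments
            if line.lower() == "@data":
                in_data = True      # header ends, data rows begin
                continue
            if not in_data:
                continue            # skip @relation / @attribute lines
            parts = line.split(",")
            X.append([float(v) for v in parts[:-1]])
            y.append(parts[-1])     # class may have more than two values
    return np.array(X), np.array(y)
```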
With a view to implementing kNN and feature selection, you should build in a facility to calculate the weighted Euclidean distance between examples: d(x, z) = sqrt( Σ_j w_j (x_j − z_j)² ). With this scheme, setting w_j = 1 for all j gives the standard Euclidean distance, but we can also ignore some features by assigning w_j = 0. Note that the distance calculation should not use the class label.
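A one-function sketch of this facility, assuming NumPy arrays as produced by the reader above:

```python
import numpy as np

def weighted_euclidean(x, z, w):
    """d(x, z) = sqrt(sum_j w_j * (x_j - z_j)^2). The class label is
    excluded because x and z contain features only. Setting w_j = 1 for
    all j gives the standard Euclidean distance; w_j = 0 drops feature j."""
    return np.sqrt(np.sum(w * (x - z) ** 2))
```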

Implementing kNN

Implement your own version of k nearest neighbors. Your procedure should take three parameters corresponding to the train set, the test set, and the value of k. The search for neighbors can be done with a simple linear scan, i.e., you need not worry about the computational improvements discussed in class.
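A linear-scan sketch, building on the weighted distance above; the function name and the optional weight parameter are our own choices, and ties in the majority vote are broken arbitrarily.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k, w=None):
    """Plain linear-scan kNN, as the assignment allows. `w` is the
    feature-weight vector for the weighted distance (all ones by default)."""
    if w is None:
        w = np.ones(X_train.shape[1])
    preds = []
    for x in X_test:
        # squared weighted distances to every training example
        d2 = np.sum(w * (X_train - x) ** 2, axis=1)
        nearest = np.argsort(d2)[:k]          # indices of the k closest
        votes = Counter(y_train[nearest])     # majority vote over labels
        preds.append(votes.most_common(1)[0][0])
    return np.array(preds)
```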

Evaluating kNN with respect to k

Run kNN on all datasets with values of k from 1 to 25 and record the test set accuracy. For each dataset, plot the accuracy of kNN as a function of k. On the same plot add the performance of J48 (as a flat line, since it does not depend on k).
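One way to produce such a plot, assuming the knn_predict sketch above and a J48 accuracy recorded from the Weka run (here a hypothetical variable j48_accuracy):

```python
import numpy as np
import matplotlib.pyplot as plt

ks = range(1, 26)
accs = [np.mean(knn_predict(X_train, y_train, X_test, k) == y_test)
        for k in ks]

plt.plot(ks, accs, marker="o", label="kNN")
plt.axhline(j48_accuracy, linestyle="--", label="J48")  # flat line: no dependence on k
plt.xlabel("k")
plt.ylabel("test set accuracy")
plt.legend()
plt.show()
```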
In your report include these plots and a short discussion. How do the two algorithms compare, and how does the performance of kNN vary with k?

Feature Selection for kNN

kNN’s performance may degrade when the data has many irrelevant features. In this part you will implement and test a simple strategy to get around this. The idea in this assignment is to sort features by their information gain and select the top n features according to this ranking. Since our features are numerical we need to define an appropriate notion of information gain.
Please use the following scheme. First, divide the feature range into 5 parts, each of which includes a fifth of the examples. You can do this by sorting the examples according to this feature’s value and picking the right boundaries. Then replace the numerical values with 5 corresponding categorical values. This gives a discrete feature which has 5 values. With this in mind you can evaluate the information gain for the feature.
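A sketch of this scheme using equal-frequency (quantile) binning; note that ties in the feature values can make the bins only approximately equal in size.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(feature_values, labels, bins=5):
    """Information gain of one numeric feature after equal-frequency
    discretization into `bins` categorical values."""
    # interior boundaries chosen so each bin holds ~1/bins of the examples
    edges = np.quantile(feature_values, np.linspace(0, 1, bins + 1)[1:-1])
    binned = np.digitize(feature_values, edges)   # categorical value 0..bins-1
    gain = entropy(labels)
    for b in np.unique(binned):
        mask = binned == b
        gain -= mask.mean() * entropy(labels[mask])
    return gain
```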
Write code to perform this calculation and then rank the features by information gain. Once this is done, evaluate kNN with k = 5 using the top n features, for n from 1 to the total number of features, and record the test set accuracy. Note that it should be easy to do this evaluation using the facility for weighted distance, by assigning a weight of zero to features that are not being used in a particular run. For each dataset, plot the accuracy of kNN as a function of n. On the same plot add the performance of J48 (as a flat line, since we only train it once using all the features).
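Using the sketches above, the ranking and top-n evaluation might look like this; the zero/one weight vector implements the feature masking just described.

```python
import numpy as np

# Rank features by information gain, then evaluate kNN (k = 5) on the
# top-n features by zeroing out the weights of all other features.
n_features = X_train.shape[1]
gains = np.array([info_gain(X_train[:, j], y_train) for j in range(n_features)])
order = np.argsort(gains)[::-1]          # feature indices, best first

accs = []
for n in range(1, n_features + 1):
    w = np.zeros(n_features)
    w[order[:n]] = 1.0                   # keep only the top-n features
    preds = knn_predict(X_train, y_train, X_test, 5, w)
    accs.append(np.mean(preds == y_test))
```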
In your report include these plots and a short discussion. How does the performance of kNN vary with n, and how does this affect the relationship between the two algorithms (for k = 5)? Does it look feasible to automatically choose appropriate values of k and n for each dataset?

For Extra Credit

This part is not required; if you wish, do it for extra fun and some extra credit. The work above looks at trends on the test set for the effect of k and n. To pick such values, we need an independent method that does not use the test set. Then we can evaluate kNN with the selected k and n only once on the test set.
In this part you can evaluate the following scheme, or another scheme of your choice. First, split the train set into two parts: call them train and validation. Then redo the work of Section 3.5 to select n (using k = 1 or k = 5), but evaluating on the validation set. Once n is selected, redo the work of Section 3.4 to select k. Once this is done, “train” kNN on the entire train/validation set using the selected values of k and n and test on the test set. Compare this value with the accuracy of J48. Write a short report on your method and findings.
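A sketch of this selection scheme, reusing the earlier functions; the 80/20 split ratio, the random seed, and the variable names (X_full, y_full, X_test, y_test) are assumptions, not part of the assignment.

```python
import numpy as np

# Assumed names: X_full/y_full is the original training data; the 80/20
# split ratio and fixed seed are arbitrary choices for the sketch.
rng = np.random.default_rng(0)
idx = rng.permutation(len(y_full))
cut = int(0.8 * len(y_full))
Xt, yt = X_full[idx[:cut]], y_full[idx[:cut]]     # inner train
Xv, yv = X_full[idx[cut:]], y_full[idx[cut:]]     # validation

# rank features on the inner train set only
gains = np.array([info_gain(Xt[:, j], yt) for j in range(Xt.shape[1])])
order = np.argsort(gains)[::-1]

def acc_for(n, k):
    """Validation accuracy using the top-n ranked features and k neighbors."""
    w = np.zeros(Xt.shape[1])
    w[order[:n]] = 1.0
    return np.mean(knn_predict(Xt, yt, Xv, k, w) == yv)

best_n = max(range(1, Xt.shape[1] + 1), key=lambda n: acc_for(n, 5))
best_k = max(range(1, 26), key=lambda k: acc_for(best_n, k))

# final run: "train" on the entire train/validation data with the selected
# n and k, and evaluate once on the held-out test set
w = np.zeros(X_full.shape[1])
w[order[:best_n]] = 1.0
test_acc = np.mean(knn_predict(X_full, y_full, X_test, best_k, w) == y_test)
```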