This assignment requires you to implement a sentiment classifier using the k-nearest neighbour (kNN) algorithm in the Python programming language.
Note that no credit will be given for implementing any other type of classification algorithm, or for using an existing library implementation of kNN instead of implementing it yourself. You must provide a README file describing how to run your code to produce the results. Programs that do not run will receive a mark of zero!
In binary sentiment classification, the goal is to classify a given user review about a product as to whether the review expresses a positive or a negative sentiment about the product. We encounter such reviews in numerous online shopping sites such as Amazon or eBay. If we can automatically predict the sentiment of a review, then we can group reviews into positive and negative ones and read only a subset of all the reviews.
Download and extract the data file. Inside, you will find four files: train.positive, train.negative, test.positive, and test.negative. These files contain the positive and negative train/test reviews we will be using in this assignment. Each line in each file represents one review as a set of features. We will be using both unigram and bigram features to represent a review, where a bigram is written as its two words joined by two underscores. A review is represented using a bag-of-features. Moreover, each feature is counted only once, giving a boolean-valued feature representation (i.e. a set of features for each review).
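As a rough illustration of this set-of-features representation, the sketch below parses one review line into a Python frozenset. It assumes that the features on a line are separated by whitespace, which the description above does not state; check the actual files before relying on it. The names `parse_review` and `load_reviews` and the sample line are hypothetical.

```python
def parse_review(line):
    """Turn one line of a review file into a set of features.

    Assumption: features are separated by whitespace. Repeated features
    collapse automatically because a set keeps each element only once.
    """
    return frozenset(line.split())


def load_reviews(path):
    """Read every non-empty line of a file as one review (a frozenset)."""
    with open(path, encoding="utf-8") as handle:
        return [parse_review(line) for line in handle if line.strip()]


# Hypothetical line: unigrams (one of them repeated) and two bigrams.
sample = "good battery good life good__battery battery__life"
features = parse_review(sample)
print(sorted(features))
```

Storing each review as a set makes the boolean-valued representation explicit: a feature is either present or absent, and its count is ignored.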
(1) Write a program to load the train/test instances (positive/negative) from the train/test files.
(2) Implement a kNN classifier and measure the classification accuracy on the test instances. Classification accuracy is defined as the number of correctly classified test instances, expressed as a percentage of the total number of test instances.
(3) Vary the value of k and evaluate the performance of your kNN classifier. Plot your results in a graph where the x-axis corresponds to the value of k and the y-axis corresponds to the classification accuracy. What trends can be observed from the graph? Briefly report your findings.
(4) To measure similarity when computing the neighbourhood in your kNN classifier, try different similarity/distance measures such as (a) cosine similarity, (b) Euclidean distance, and (c) Manhattan distance. Compare the performance of the kNN classifier with these three measures. (You may use additional similarity/distance measures beyond the (a), (b), and (c) listed above.) Briefly report your findings. (10 marks)
(5) Using different sub-samples of positive vs. negative training instances, evaluate the robustness of the kNN classifier under unbalanced training datasets. Briefly report your findings.
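The tasks above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not a reference solution: reviews are assumed to be stored as Python sets of features, the function names and the toy reviews are hypothetical, and ties in the vote are broken in favour of the label seen first among the nearest neighbours, which is only one of several reasonable choices. Cosine similarity is turned into a distance (1 minus the similarity) so that all three measures can be ranked the same way.

```python
import math
import random
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity. For 0/1 vectors the dot product is |a & b|."""
    if not a or not b:
        return 1.0
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))


def manhattan_distance(a, b):
    """For 0/1 vectors: the number of features present in exactly one set."""
    return len(a ^ b)


def euclidean_distance(a, b):
    """For 0/1 vectors: the square root of the Manhattan distance."""
    return math.sqrt(len(a ^ b))


def knn_predict(train, query, k, distance):
    """Majority vote among the k training instances nearest to `query`.

    `train` is a list of (feature_set, label) pairs.
    """
    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k, distance):
    """Percentage of test instances whose predicted label is correct."""
    correct = sum(knn_predict(train, x, k, distance) == y for x, y in test)
    return 100.0 * correct / len(test)


def subsample(instances, n, seed=0):
    """Draw n instances without replacement, reproducibly (fixed seed)."""
    return random.Random(seed).sample(instances, n)


# Hypothetical toy data standing in for the real train/test files.
train = [
    (frozenset({"good", "great"}), "pos"),
    (frozenset({"good", "love"}), "pos"),
    (frozenset({"bad", "awful"}), "neg"),
    (frozenset({"bad", "poor"}), "neg"),
]
test = [
    (frozenset({"good", "nice"}), "pos"),
    (frozenset({"bad", "broken"}), "neg"),
]

measures = [cosine_distance, euclidean_distance, manhattan_distance]
for distance in measures:
    for k in (1, 3):
        print(distance.__name__, k, accuracy(train, test, k, distance))

# An unbalanced training set: all negatives but only one sampled positive.
positives = [item for item in train if item[1] == "pos"]
negatives = [item for item in train if item[1] == "neg"]
unbalanced = subsample(positives, 1) + negatives
print(len(unbalanced))
```

The (k, accuracy) pairs produced by the loop are the values to plot for task (3). Sorting the whole training set for every query is the simplest approach but is slow on large data; it is used here only to keep the sketch short.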
You must submit the following files:
- (a) the source code for all your programs,
- (b) a README file (plain text) describing how to compile/run your code to produce the various results required by the assignment, and
- (c) a PDF file providing the answers and graphs for the questions , , and .
Compress all of the above files into a single gzip-compressed tarball (.tgz) and name it studentid.tgz. It is extremely important that you provide all the files described above and not just the source code!
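One way to build the archive from Python is sketched below using the standard-library `tarfile` module. The function name `make_submission` and the file names in the usage note are hypothetical; any tool that produces a gzip-compressed tar file would serve equally well.

```python
import tarfile
from pathlib import Path


def make_submission(archive_name, paths):
    """Write the given files into a gzip-compressed tar archive."""
    with tarfile.open(archive_name, "w:gz") as archive:
        for path in paths:
            # arcname stores only the file name, not its directory path.
            archive.add(path, arcname=Path(path).name)
```

For example, `make_submission("studentid.tgz", ["knn.py", "README", "report.pdf"])` would bundle three such files, assuming they exist in the current directory.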