Python代写:CS412 Classification Framework

代写一个分类器框架,实现分类器模型,并对数据进行处理。

Before Dive in

  • Programming assignments take more time, and it counts more in your final grade, so please start early.
  • This is an individual assignment. You can discuss this assignment in Piazza, including the performance your implementation can achieve on the provided dataset, but please do not work together or share codes/parameters.
  • Related libraries or programs can be found on-line, but you are prohibited to use these resources directly. The purpose of this programming assignment is to learn how to build a classification framework step by step.
  • You will need to make decisions about parameter values when implementing these algorithms (e.g., depth of decision tree, number of trees in random forest, etc.). Please do not ask or share parameter settings on Piazza. You need to tune the parameters and find the best setting for each dataset yourself. You need to briefly describe why and how you find these parameter settings in your final report.
  • You can use C/C++ or Java or Python2/3 as your programming language (No, you cannot use Matlab or C#. And no, Scala is not Java). Your programs should be able to compile correctly in EWS Linux environment. This assignment will be graded automatically using an auto-grader so please make sure you follow all the input and output definitions strictly. The auto-grader will try to compile your source code and check the output of your programs. The auto-grader also has plagiarism detection feature, so please do not copy other people’s programs or use modified version of existing resources. A more detailed project organization guidance can be found at the end of the assignment.

  • A section titled Frequently Asked Questions can be found at the end of the assignment, and we will keep updating it when more questions emerge on Piazza.

The Big Picture

In this assignment, we will build a general purpose classification framework from scratch. This framework contains two classification methods: a basic method and an ensemble method. More specifically, decision tree and random forest. Given certain training dataset following a specific data format, your classification framework should be able to generate a classifier, and use this classifier to assign labels to unseen test data instances.

We will then test our classification framework with several datasets using both the basic method and the ensemble method, and evaluate the quality of the classifiers with different metrics.

In order to build this classification framework, we need to finish four steps as follows.

Step 1

Read in training dataset and test dataset, and store them in memory.

Step 2

Implement a basic classification method, which includes both training process and test process. Given a training dataset, the classification method should be able to construct a classifier correctly and efficiently. Given a test dataset, the classification method should be able to predict the label of the unseen test instances with the aforementioned classifier.

Step 3

Implement an ensemble method using the basic classification method in Step 2. The ensemble classification method should also be able to train a classifier and predict labels for unseen data instances.

Step 4

Test both the basic classification method and the ensemble method with different datasets. Evaluate their performances and write a report.

We introduced many classification methods in the lecture. However in this assignment, we pick the following basic classifier and ensemble classifier.

  • Decision Tree (gini-index) as basic method, Random Forest as the ensemble version of the classification method.

Note 1

Other classifiers can also fit into this basic-ensemble framework. For example, Naive Bayes as basic method, and AdaBoost as ensemble method. We do not cover them in this assignment.

Note 2

Different Random Forest methods can be found online (we introduced two in the lecture notes). You can choose either one but Forest-RI might be easier to implement.

Note 3

Many off-the-shelf ML tools can be found online for our classification tasks, e.g., sk-learn. Since the purpose of this assignment is to go through the basic classifier and ensemble classifier step by step, it is not allowed to use any third-party library in your code.

Note 4

The classification tasks we are handling here can be either binary classification or multi-class classification. Please make your code robust enough to handle both cases.

Step 1: Data I/O and Data Format

The classification framework should be able to parse training and test files, load these datasets into memory. We use the popular LIBSVM format i n this assignment.
The format of

<label> <index1>:<value1> <index2>:<value2> ...
......
......

Each line contains an instance and is ended by a ‘\n’ character. [label] is an integer indicating the class label. The pair [index]:[value] gives a feature (attribute) value: [index] is an integer starting from 1 a non-negative integer and [value] is a number (we only consider categorical attributes in this assignment). Note that one attribute may have more than 2 possible values, meaning it is a multi-value categorical attribute.

You need to be careful about not using label information as an attribute when you build the classifier, in which case you can get 100% precision easily, but you will get 0 pts for Step 2 and Step 3.

Step 2: Implement Basic Classification Method

In this step, you will implement the basic classification method, i.e., decision tree with gini-index as the attribute selection measure (note: the gini-index score is computed by considering every possible categorical value of the feature as a branch). Your classification method should contain two steps, training (build a classifier) and test (assign labels to unseen data instances). Many existing classification or prediction packages have the feature to save the learned classifier as a file, so that users can load the classifier into the framework later and apply the same classifier on different test datasets asynchronously. However, in this assignment, we are going to skip this save-classifiers-as-files feature, and assume that the training and test processes happen in the same run. Before you begin with your implementation, please first read the Project Organization and Submission section at the end of the assignment to get a general idea on how to organize your project and how to name your programs so the auto-grading system can compile and execute your programs correctly.

The classification method you implemented should be able to take both training file and test file as input parameters from the command line, use training dataset to build a classifier and test dataset with labeling information to evaluate the quality of the classifier. For example, if you are implementing Decision Tree using C/C++, your program should be able to finish the entire classification process when you type this command:

./DecisionTree train\_file test\_file

If you use Java:

java classification.DecisionTree train\_file test\_file

If you use Python:

python DecisionTree.py train\_file test\_file

Your program should output (to stdout) k*k numbers (k number per line, separated by space, k lines total) to represent the confusion matrix of the classifier on testing data, where k is the count of class labels. These lines are sorted by the label id. In i-th line, the j-th numbers are the number of data points in test where the actual label is i and predicted label is j. And you will be needing these values in Step 4.
An example output can be found as follows (when k=4):

50 30 16 23
12 43 21 9
0 21 71 12
1 23 11 67

Note that if k=2, the output is a 2x2 matrix, which is similar to the table where the numbers represent true/false positive/negative. For more information about confusion matrix and how to evaluate performance based on it, please read https://en.wikipedia.org/wiki/Confusion_matrix or watch the video.

Step 3: Implement Ensemble Classification Method

In this step, you will implement an ensemble classification method using the basic classification method in Step 2, i.e., random forest. Very similarly, your ensemble classification method should take train_file and test_file as input, and output k*k values in k lines as the confusion matrix so that you can use these numbers to evaluate classifier quality in Step 4. For example, you are implementing random forest using C/C++, and the command line of running your program should look like this:

./RandomForest train\_file test\_file

If you use Java:

java classification.RandomForest train\_file test\_file

If you use Python:

python RandomForest.py train\_file test\_file

Step 4: Model Evaluation and Report

In this step, you will apply both methods you implemented in Step 2 and Step 3 on three real-world datasets and a synthetic dataset. Note that we preprocessed these datasets and partitioned them into training and test for you (click on the numbers in the table to download), so please do not use the original datasets directly.

If you are curious about the the semantic meaning of each attribute or labels, please read the source links associated with the first three datasets. The fourth dataset is generated via distributed representation learning on a synthetic social network (a.k.a, network embedding), where binning is further performed to generate categorical attributes. You do not need to know these details in order to finish this assignment.

The main difference between the fourth dataset and the other three is that the former has a significantly larger number of attributes.

You need to apply both your basic classification method and your ensemble classification method on each dataset, and calculate the following model evaluation metrics using the output of your classification methods for both training and test datasets. Please do this manually or write a separate program (you do not need to turn this in), by taking the k*k numbers as input. Do not output these metrics from your classifiers directly since the auto-grader won’t be able to recognize them.

For overall performance, output:

Accuracy (overall accuracy = sum of the diagonal of your output matrix / sum of all entries in your output matrix)

For each class, output:

Accuracy, Specificity, Precision, Recall, F-1 Score, F \beta score (\beta = 0.5 and 2)

Note 1

To evaluate the performance on each class, we regard the target class as positive and all others as negative.

Note 2

The detailed equations of the above measures based on Confusion Matrix can be found in https://en.wikipedia.org/wiki/Confusion_matrix

Now you are ready to write your report. You should include the following sections in your report:

  • brief introduction of the classification methods in your classification framework;
  • all model evaluation measures you calculated above ((1 overall metric + # class * 7 per-class metrics) * (1 on training set + 1 on test set) * 2 methods * 4 datasets);
  • parameters you chose during implementation and why you chose these parameters; and
  • your conclusion on whether the ensemble method improves the performance of the basic classification method you chose, why or why not.

Project Organization and Submission

The auto-grading process will be performed on a Linux server, so make sure your source code can be compiled in the EWS Linux environment (or in major distro like Ubuntu, Mint, Fedora or Arch with a relatively new version of gcc/g++ or JDK).

If you use C/C++

You must attach a Makefile so that the grader can compile your programs correctly. If you are a Windows user and use Visual Studio as your IDE, please add a Makefile after you finish implementation and make sure your code can be compiled properly using gcc/g++. Your project structure should look like this:

yournetid_assign4/
    |----------------report.pdf
    |----------------code/
        |----------------Makefile
        |----------------*.c/c++

After compilation, your Makefile should be able to generate binary target files. Your binary files should be DecisionTree and RandomForest. Also, as we discussed in Step 2 and 3, the input arguments for your binary should look like this:

./DecisionTree train_file test_file

If you use Java

Very similarly, please make sure your code can be compiled in Linux environment. Your entrance class should be placed under classification package, and your project structure should look like this:

yournetid_assign4/
    |----------------report.pdf
    |----------------code/
    |----------------other_packages/
        |----------------*.java
    |----------------classification/
        |----------------*.java

The grader will simply execute javac classification/*.java to compile your programs, and your program should be able to generate binary files with corresponding classification method names. The input arguments are defined as follows.

java classification.DecisionTree train_file test_file

If you use Python

Very similarly, please make sure your code can be executed in Linux environment. Your project structure should look like this:

yournetid_assign4/
    |----------------report.pdf
    |----------------code/
        |----------------*.py

The grader will run the following command to test your programs:

python DecisionTree.py train_file test_file

After your finish your implementation and report, please compress your project into zip format (please make sure after unzip the submission, everything will be in a folder with “yournetid_assign4” as the name rather than layers of nested folders), and rename your zip file to yournetid_assign4.zip.

Parameter settings for each dataset

Please hardcode your choice of parameters (depth of decision tree, number of trees in random forest, etc.) for each dataset into your program. In other words, your program should choose the corresponding parameters according to the input datasets. Specifically, if train_file (the path to the training set) contains the string “balance.scale” the parameters tuned for the provided balance.scale dataset are used in this round of execution. The same goes for nursery, led, and synthetic.social.

Besides the four provided pairs of training-test files, we also hold some unseen datasets for the auto-grader to evaluate your implementation. The unseen datasets differ from the provided datasets in that the training-test files are resampled, while the raw data used to generate the unseen datasets are the same as those used to generate the provided datasets. The filename of any unseen datasets would also contain balance.scale, n ursery, led, or synthetic.social. As a result, the parameters you selected for the provided datasets will also work fine with the unseen datasets.

Double check before you submit

Your programs should be able to handle both absolute paths and relative paths to different training and test files. Do not assume that the files are located in the same folder of your programs.

The unseen datasets are roughly the same scale as the 4 datasets that we provided. Make sure that your programs can finish within 3 mins for each dataset. The auto-grader will shut down your program after 3 mins if no results are detected (and you will receive 0 points for the corresponding step).

In the past, when the same auto-grader was used for programming assignment, around 20% of the submissions would not pass the auto-grader, mostly because these submissions did not strictly follow the project organization rules we stated above. In this case, TAs will manually grade these assignments but 6 to 14 points will be taken off from your assignment. Some common mistakes include using Java without package structure, naming the program by decisiontree instead of DecisionTree, generating incorrect zip file name or folder name after unzip, etc.

Now you can submit your assignment through Compass 2g!!

Congratulations, you just finished the 2nd (and last) programming assignment of CS412!!

Frequently Asked Questions

How to deal with unseen values or unseen features in test dataset?

In Decision Tree, this is hard to deal with during testing. You can ignore a feature if you never saw it in training however you need to bypass a unseen value in test data for a seen feature. (Dec. 6 update: in this homework, you can assume no feature/index unseen in the training set will appear in the test set; when we say ignoring a feature here, we intend to provide a solution to the more general application scenario.)

You can do 1) randomly choose a label, 2) randomly choose a path to finish traversing the tree, or 3) choose a path which most data flow if you kept this information in your decision tree.

There are more elegant ways to handle such cases but since we did not cover them in the lecture notes, you can go with one of the naive ways we mentioned above.

Should random forest use overlapped attributes or overlapped subsets?

Yes.

How to handle duplicate tuples in training?

You should count each tuple as one when you are building classifiers no matter whether they are duplicate tuples or not.

How many branches should a feature have during tree construction?

Since assignment 4 only deals with categorical features, we don’t do value grouping nor select splitting point for the feature. We simply regard every possible value as one branch. Thus the # of branches should be the same as # of possible values of the feature.

Should parameters be tuned on a validation set?

In principle, model parameters are supposed to be tuned according to the model performance on the validation set, which is distinct from the training test and the test set. However, in this homework, we do not hold a validation set for simplicity, and you would tune the parameters on the test set.