## Question 1

Consider the specification of a Markov Decision Process according to the following figure. Code your own implementation of Value Iteration and compute the optimal policy as well as the optimum utilities for this challenge.

Indicate the original utilities you used in order to start the process. Provide at least 5 intermediate results (in terms of optimum utilities and policies) depending on the number of iterations needed for convergence as well as the final results. Describe your implementation and your convergence criterion. Report computation time and number of iterations.

## Question 2

Consider the criteria for accepting graduate students at the hypothetical Univ. of Excellence. Each candidate is evaluated according to four attributes:

- the grade point average (GPA)
- the quality of the undergraduate degree
- the publication record
- the strength of the recommendation letters

To simplify our example, let us discretize and limit the possible values of each attribute: Possible GPA scores are 4.0, 3.6, and 3.3; universities are categorized as rank_1, rank_2, and rank_3; publication record is a binary attribute - either the applicant has published previously or not; and recommendation letters are similarly binary, they are either good or normal. Finally, the candidates are classified into two classes: accepted, or P (for ‘positive’) and rejected, or N (for ‘negative’). Figure 2 provides an example of one possible decision tree determining acceptance.

An applicant doesn’t know this decision tree, but does have the data regarding twelve of last year’s applicants as in Table 1.

- a) Does the provided tree correctly categorize the provided examples?
- b) The applicant uses the decision tree algorithm shown in class (with the information gain computations for selecting split variables) to induce the decision tree employed by U. of E. officials. What tree will the algorithm come up with? Show the computations involved, in addition to the decision tree itself. [Hint: The information content of the examples before choosing any split variable is.

You have to find the attribute that has the highest information gain, where attribute A divides the examples into subsets, and p and n represent the number of positive and negative examples in subset. - c) Is the tree that you got in part b) equivalent to the tree provided here (i.e., do the two trees classify every application in the same way)? If the answer is yes, explain whether this is a coincidence or not. If the answer is no, give an example of a data case that will be classified differently by the two trees.

## Question 3: Consider building an SVM for the following two-class training data:

Positive class:

```
[1 3]^T , [0 2]^T , [0 1]^T , [0 0]^T
```

Negative class:

```
[1 5]^T , [1 6]^T , [3 3]^T
```

- (a) Plot the training points and, by inspection, draw a linear classifier that separates the data with maximum margin.
- (b) This linear SVM is parameterized by
`h(x) = w^tx + b`

. Write the parameters w and b. - (c) Suppose you observe an additional set of points, all from the positive class:

More positive points:

```
[-2 0]^T , [-2 1]^T , [-2 3]^T , [-1 0]^T , [-1 1]^T , [0 0]^T
```

What is the linear SVM (in terms of w and b) now?

## Question 4

Consider the chart below:

- (a) Notice that for the above graph there is a perfect linear separator for the two input classes.

Therefore, a single perceptron should be able to learn this classification task perfectly. Your task is to replicate the learning process, starting with a random perceptron with weights, where the weight 0 corresponds to the constant offset. For the inputs, just estimate their coordinates from the chart.

Start by adding the perceptron’s initial line of separation to the chart. Compute then how many samples are misclassified? Then, select an arbitrary misclassified sample and describe the computation of the weight update (you can choose = 1 or any other value; if you like you can experiment a bit to find a value that leads to efficient learning).

Illustrate the perceptron’s new line of division in the same chart or a different one, and give the number of misclassified samples. Repeat this process four more times so that you have a total of six lines (or fewer if your perceptron achieves perfect classification earlier).

You can generate the computations and/or graphs either by hand or by writing a simple computer program. If you write a program, please attach a printout, and let the program run until the perceptron achieves perfect classification (after how many steps?). - (b) If your perceptron did not reach perfect classification, determine a set of weights that would achieve perfect classification, and draw the separating line for those weights.
- (c) Now assume that less information were available about the samples. For instance, consider we only know the value for 1 for each sample, which means that our perceptron has only two weights to classify the input as best as possible, i.e., it has weights 0 and 1 , where 0 is once again the weight for the constant offset 0 = 1. Draw a diagram that visualizes this one-dimensional classification task, and determine weights for a perceptron that does the task as best as possible (minimum error, i.e., minimum proportion of misclassified samples). Where does it separate the input space, and what is its error?

## Question 5

Consider the more difficult classification problem shown in the chart below:

- (a) As you certainly noticed, a single perceptron cannot do this classification task perfectly. Determine the minimum error that a single perceptron can reach, and show the dividing line in the input space for such a perceptron.
- (b) Clearly, we need a multi-layer perceptron to do this task perfectly. Develop such a system of connected perceptrons and write it down, together with the required weights for each unit. Illustrate the dividing lines for these perceptrons in a copy of the chart above.