## Question 1

Design a dataset for training a machine learning model such that the model is 100% accurate on the training data but 0% accurate on the test data. Show an example of the data set.
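
One way to approach this, sketched here under the assumption that the model is a 1-nearest-neighbour classifier (the question does not fix a model), is to place every test point closest to a training point of the opposite class. The data values and the helper names `predict_1nn` and `accuracy` are hypothetical.

```python
# Hypothetical 1-D dataset; the 1-nearest-neighbour model is an assumption.
train = [(0.0, "A"), (1.0, "A"), (10.0, "B"), (11.0, "B")]
# Each test point sits next to a training point carrying the opposite label.
test = [(0.4, "B"), (0.6, "B"), (10.4, "A"), (10.6, "A")]


def predict_1nn(x, data):
    """Return the label of the training point closest to x."""
    return min(data, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(samples, data):
    """Fraction of samples whose 1-NN prediction matches their label."""
    hits = sum(predict_1nn(x, data) == y for x, y in samples)
    return hits / len(samples)


train_acc = accuracy(train, train)  # every training point is its own neighbour
test_acc = accuracy(test, train)    # every test point is pulled to the wrong class
print(train_acc, test_acc)
```

The same idea carries over to other models that memorise the training set: the test distribution is constructed so that the learned decision rule is systematically wrong on it.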

## Question 2

Write an algorithm that uses deep learning to do co-training. Describe the algorithm and design an example for illustration.

### Note

Co-training has not been introduced in the lectures; you may first need to search for relevant material and study it yourself before writing the algorithm.
For deep learning, you may treat it as a “black box” function Y = D(x) for an input vector x.
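
As a rough illustration of the black-box view, the sketch below runs a shared-pool variant of co-training over two feature views. It is not a reference answer: `fit_black_box` is a hypothetical stand-in for training the deep model D (a nearest-centroid rule is used only so the sketch runs without a deep learning library), and the confidence score, `rounds`, and `per_round` are assumptions.

```python
def fit_black_box(xs, ys):
    """Stand-in for training Y = D(x): returns D(x) -> (label, confidence)."""
    centroids = {}
    for label in set(ys):
        members = [x for x, y in zip(xs, ys) if y == label]
        centroids[label] = sum(members) / len(members)

    def D(x):
        dists = {label: abs(x - c) for label, c in centroids.items()}
        label = min(dists, key=dists.get)
        total = sum(dists.values())
        # Confidence: how much closer x is to the winning centroid.
        conf = 1.0 - dists[label] / total if total > 0 else 1.0
        return label, conf

    return D


def co_train(labeled, unlabeled, rounds=3, per_round=1):
    """labeled: list of ((view1, view2), y); unlabeled: list of (view1, view2)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):
            if not unlabeled:
                break
            # Train one model per view on the current labeled pool.
            D = fit_black_box([x[view] for x, _ in labeled],
                              [y for _, y in labeled])
            # Pick the unlabeled examples this view is most confident about.
            ranked = sorted(unlabeled, key=lambda x: D(x[view])[1], reverse=True)
            for x in ranked[:per_round]:
                labeled.append((x, D(x[view])[0]))  # pseudo-label joins the pool
                unlabeled.remove(x)
    ys = [y for _, y in labeled]
    d1 = fit_black_box([x[0] for x, _ in labeled], ys)
    d2 = fit_black_box([x[1] for x, _ in labeled], ys)
    return d1, d2, labeled


# Toy example: two labeled points, four unlabeled points, two scalar views each.
seed_labeled = [((0.0, 0.1), 0), ((1.0, 0.9), 1)]
pool = [(0.1, 0.2), (0.9, 0.8), (0.2, 0.0), (0.8, 1.0)]
d1, d2, final_labeled = co_train(seed_labeled, pool)
```

In a real solution each call to `fit_black_box` would train a network on one view (for example, page text versus link text), and the pseudo-labels chosen by one view become training data for the other.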

## Question 3

### Question 3.1

Design a distance function to evaluate the similarity between two customers in the domain of online purchases, e.g. Amazon.com. Assume the database records the following attributes:

``````
Customer_id
User_name (composed of at most 10 characters)
Purchased_items (the set of items the customer bought last month)
Payment_methods (a nominal attribute of 3 values: visa, paypal, on_delivery)
Amount_spend (average amount spent per purchase in dollars and cents; it has a mean of 200.00 and a standard deviation of 50; the minimum is 0.02 and the maximum is 980)
Age groups (an ordinal attribute of 4 values: <=17; 18 – 29; 30 – 49; >=50)
Purchase_reviews (the set of customer reviews submitted)
``````

Note: you can choose which attributes to include in the distance function.
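
As one possible starting point (not a reference answer), the sketch below combines per-attribute distances, each scaled to [0, 1], into a weighted sum. The choice of attributes, the weights, the dictionary keys, and the ASCII age-group labels are all assumptions made for illustration; only the minimum 0.02 and maximum 980 come from the attribute description.

```python
# Ordinal order of the age groups; ASCII labels are an assumption.
AGE_GROUPS = ["<=17", "18-29", "30-49", ">=50"]
AMOUNT_MIN, AMOUNT_MAX = 0.02, 980.0  # range stated for Amount_spend


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; two empty sets are treated as identical."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def customer_distance(c1, c2, weights=(0.4, 0.2, 0.2, 0.2)):
    """Weighted sum of per-attribute distances, each in [0, 1]."""
    # Set-valued attribute: Jaccard distance.
    d_items = jaccard_distance(c1["Purchased_items"], c2["Purchased_items"])
    # Nominal attribute: 0 if equal, 1 otherwise.
    d_pay = 0.0 if c1["Payment_methods"] == c2["Payment_methods"] else 1.0
    # Numeric attribute: absolute difference scaled by the stated range.
    d_amount = abs(c1["Amount_spend"] - c2["Amount_spend"]) / (AMOUNT_MAX - AMOUNT_MIN)
    # Ordinal attribute: rank difference scaled by the number of steps.
    r1 = AGE_GROUPS.index(c1["Age_group"])
    r2 = AGE_GROUPS.index(c2["Age_group"])
    d_age = abs(r1 - r2) / (len(AGE_GROUPS) - 1)
    parts = (d_items, d_pay, d_amount, d_age)
    return sum(w * d for w, d in zip(weights, parts))


alice = {"Purchased_items": {"book", "pen"}, "Payment_methods": "visa",
         "Amount_spend": 200.0, "Age_group": "18-29"}
bob = {"Purchased_items": {"book", "lamp"}, "Payment_methods": "paypal",
       "Amount_spend": 250.0, "Age_group": "30-49"}
d_ab = customer_distance(alice, bob)
```

Customer_id and User_name are left out here because they identify a customer rather than describe purchasing behaviour; a z-score using the stated mean and standard deviation would be an alternative way to scale Amount_spend.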

### Question 3.2

In this question, you are asked to compare the similarities estimated with MinHashing against the exact Jaccard similarities. Use the BBCSport dataset from http://mlg.ucd.ie/datasets/bbc.html. You can download and use the pre-processed dataset.

1. Compute the exact Jaccard similarities for all pairs of articles. List the pairs of documents with similarity at least 0.5.
2. Using MinHashing, generate the signature matrix with 50 hash functions. From the signature matrix, compute a similarity matrix S in which every S(i,j) is the estimated similarity of articles i and j. List the pairs of documents with similarity at least 0.5.
   You may use the following settings:
   • Number of hash functions: 50
   • Hash function: `h_i(r) = (a_i * r + b_i) % c`, where a_i and b_i are randomly chosen integers less than the maximum value of r, and c is a prime number slightly bigger than the maximum value of r.
   List the values of a_i, b_i and c used.
3. Compare the results of steps (1) and (2) and evaluate the approximate Jaccard similarity obtained in step (2). Report the number of false positives and false negatives.
4. Repeat steps (2) and (3) using 100 hash functions.

Note: You may use any tool or programming language.
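
The MinHash step described above can be sketched as follows. This is a minimal illustration under assumptions: each article is represented as a non-empty set of row (term) indices, loading the dataset is omitted, and the helper names, the fixed `seed`, and the toy documents are hypothetical. The multiplier a_i is drawn from 1 upwards so that no hash function is constant.

```python
import random


def next_prime(n):
    """Smallest prime strictly greater than n (trial division)."""
    def is_prime(k):
        if k < 2:
            return False
        i = 2
        while i * i <= k:
            if k % i == 0:
                return False
            i += 1
        return True

    k = n + 1
    while not is_prime(k):
        k += 1
    return k


def minhash_signatures(docs, num_hashes, max_row, seed=0):
    """docs: list of non-empty sets of row indices in 0..max_row."""
    rng = random.Random(seed)
    c = next_prime(max_row)  # prime slightly bigger than the maximum r
    params = [(rng.randrange(1, max_row), rng.randrange(0, max_row))
              for _ in range(num_hashes)]
    # Signature entry i of a document: minimum of h_i(r) over its rows.
    sigs = [[min((a * r + b) % c for r in doc) for a, b in params]
            for doc in docs]
    return sigs, params, c


def estimated_similarity(sig1, sig2):
    """Fraction of hash functions on which the two signatures agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


def exact_jaccard(a, b):
    return len(a & b) / len(a | b)


# Toy documents: two identical articles and one sharing no terms with them.
docs = [{0, 1, 2, 3}, {0, 1, 2, 3}, {5, 6, 7, 8}]
sigs, params, c = minhash_signatures(docs, num_hashes=50, max_row=8)
same = estimated_similarity(sigs[0], sigs[1])
disjoint = estimated_similarity(sigs[0], sigs[2])
```

For steps (3) and (4), a pair counts as a false positive when the estimated similarity is at least 0.5 but the exact Jaccard similarity is below 0.5, and as a false negative in the opposite case; rerunning with `num_hashes=100` should tighten the estimates.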

## Question 4

Consider data integration for the music industry. Identify at least three sources of music information, including:

• One providing music metadata (e.g. about musical artists, music albums, labels, and genres)
• One providing music streaming services
• One providing music related information, e.g. popularity rankings, music reviews, or information of concerts

Based on the topics introduced in this course, discuss, with examples, how the following aspects of data integration of the above sources of music information can be performed. Also, for each aspect, describe with examples whether there are any challenges in the “V” dimensions (variety, volume, veracity and velocity) and what they are.

## Question 5

Suppose the university would like to release student records as “open data” to help researchers and public communities investigate and discover useful information about education. What are the privacy concerns in publishing and releasing the data? What techniques or approaches are required to preserve privacy? Explain with examples.
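
As one illustration of a privacy-preserving technique, the sketch below checks k-anonymity before and after generalising a quasi-identifier. It assumes k-anonymity is among the approaches being considered (the question does not name any); the records, field names, and the 10-year age bands are made up for the example.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalise_age(record):
    """Replace the exact age with a 10-year band, e.g. 23 -> '20-29'."""
    low = record["age"] // 10 * 10
    return {**record, "age": f"{low}-{low + 9}"}


# Hypothetical student records; names and IDs are assumed already removed.
students = [
    {"age": 21, "programme": "CS", "gpa": 3.1},
    {"age": 23, "programme": "CS", "gpa": 3.7},
    {"age": 34, "programme": "Math", "gpa": 2.9},
    {"age": 38, "programme": "Math", "gpa": 3.3},
]

raw_ok = is_k_anonymous(students, ["age", "programme"], k=2)
released = [generalise_age(s) for s in students]
released_ok = is_k_anonymous(released, ["age", "programme"], k=2)
print(raw_ok, released_ok)
```

With exact ages, every student is unique on (age, programme) and could be re-identified by someone who knows those two facts; after generalisation each combination covers two students.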