Machine Learning Assignment Help: QHF30004 QMUL

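Load and analyse a dataset of handwritten digits and complete a series of tasks. The assignment covers dataset description, image generation, feature extraction, digit prediction and transcription; the marking criteria include code correctness, code quality and clarity of explanation. Students must submit their code, the generated images and the CSV files, and make sure the code can be re-run. The assignment focuses on data processing, visualisation and basic machine learning skills.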

Machine Learning

Instructions

Please read these instructions carefully. It is your responsibility to read and understand them all. If you have any questions, please email the module organiser ASAP; he will not have time to help you if you email at the last minute!

What you will do

You will load a dataset, answer questions about it, write code to analyse it, and display results.

Evaluation

The assignment is worth 60% of your grade. It is marked out of 120 points total. The weight of each question and sub-question is indicated.

Your submission will be evaluated based on:

  1. The correctness of the results: some questions ask you to obtain certain values and save them as variables. Your script will be automatically checked to see if these variables exist and if their values are correct.
  2. The correctness of the code: does it run without error, and does it perform the assigned task? This will be evaluated partly by the grader reading your code and partly by your code being re-run on test cases.
  3. The quality of the code: all of your code should be readable. Be sure to use appropriate variable names; all functions you define should have docstrings; code should achieve the task without excessively complicated steps.
  4. Your explanations: do your responses explain things clearly, in a logical order? Do you explain them succinctly, avoiding irrelevant detail?

Writing

In QHF3004 and QHF3005, you have learned how to describe and explain graphs, figures and processes in clear terms. Draw on these skills as you complete this assignment!

Submission

When you have finished editing your .ipynb file, create a zipfile that contains your code and all the supplementary files generated by your code, and then upload it to the QMplus submission point.

Remember to:

  • Include your name and QMUL student ID in the header above;
  • Double check that your notebook runs smoothly from a fresh kernel;
    • Tip: use the menu option Kernel > Restart Kernel and Run All Cells...
  • Sign the declaration at the bottom of the notebook by ticking the checkboxes.

Tip: After uploading your zipfile, download it and unzip it to make sure that the submission is complete and correct. This is the best way to double check that you didn’t upload the wrong file by accident.

Automatic script checking

Some of the questions will ask you to define functions or variables by name. Be sure to use these names exactly: they will be used to automatically check that certain parts of your code exist and work as required.

For example, a question could say:

  1. Define the variable half_my_id, equal to your student ID number divided by 2, rounded down.

This solution would get full marks:

half_my_id = int(my_id) // 2

The following solutions would be penalised:

HALF_MY_ID = int(my_id) // 2

q1_answer = int(my_id) // 2

print("Half my ID would be: ", int(my_id) // 2)
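To see why the exact name matters, here is a minimal sketch of how an automatic check might behave. The function check_variable, the example ID and the dictionary standing in for the notebook namespace are all hypothetical; the real marking script is not shown in this brief.

```python
# Hypothetical stand-in for the ID defined in the notebook header
my_id = "210123456"

# The submission defines the variable under the exact required name
half_my_id = int(my_id) // 2


def check_variable(namespace, name, expected):
    """Return True if `name` exists in `namespace` and equals `expected`."""
    return name in namespace and namespace[name] == expected


# A dict stands in for the notebook's global namespace (assumption)
submission = {"half_my_id": half_my_id}
expected = int(my_id) // 2

print(check_variable(submission, "half_my_id", expected))  # exact name: found
print(check_variable(submission, "HALF_MY_ID", expected))  # wrong case: not found
print(check_variable(submission, "q1_answer", expected))   # wrong name: not found
```

A value that is only printed, as in the last penalised solution, never enters the namespace at all, so a check of this kind cannot find it.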

Digits dataset

Load dataset, define train/test split (0 marks)

Run the code below to import a dataset of digits as digit_data.

Read the documentation to make sure you understand it:

  • Link to documentation
  • In case the documentation website is not accessible, a copy is preserved in digit_dataset_documentation.pdf.

The documentation explains that this set of 1797 images is the ‘test set’ of a larger set of images. For this assignment, please disregard that context. Instead, consider that we have:

  • a training set of 1500 examples
  • a test set of 297 examples
  • together, these make a full set of 1797 examples

The code below will create two new objects, train_data (which contains 1500 of the 1797 examples, about 83.5% of the data) and test_data (which contains the other 297 examples, about 16.5%). In later questions, you will use the properties of the training data to predict the labels for the test data.

Note: your ID is used to make the test/train split unique to you.

IMPORTANT: DON’T EDIT THIS CODE

# Package imports
import sklearn.datasets
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Data import
digit_data = sklearn.datasets.load_digits()
n_samples = len(digit_data["target"])

# Define test/train split based on your ID, which should end in 3 digits
seed_number = (int(my_id[-3:]) % 598) + 1
assert 0 < seed_number < 599
test_indices = sorted(set(seed_number*i % n_samples for i in range(1,298)))
train_indices = sorted(set(range(n_samples)).difference(test_indices))

# Create train_data and test_data
field_names = ["data", "target", "images"]
test_data = {}
train_data = {}
for name in field_names:
    train_data[name] = digit_data[name][train_indices]
    test_data[name] = digit_data[name][test_indices]

# Check that the splits sum up to the full dataset
assert digit_data["images"].sum() == 561718
assert train_data["images"].sum() + test_data["images"].sum() == 561718

print("Set-up worked. Ready to go!")

IMPORTANT: If any lines in this cell trigger an error, please contact me!
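As a sanity check on the arithmetic, the index split can be reproduced without loading any images. The sketch below uses a made-up ID ("210123456" is an assumption, not a real student ID) and confirms the set sizes stated above. Because 1797 = 3 × 599 with 599 prime, and seed_number is always below 599, the 297 products seed_number*i are all distinct modulo 1797, so the test set always has exactly 297 examples.

```python
n_samples = 1797          # size of the full digits dataset
my_id = "210123456"       # hypothetical ID, for illustration only

# Same formulas as in the set-up cell
seed_number = (int(my_id[-3:]) % 598) + 1
test_indices = sorted(set(seed_number * i % n_samples for i in range(1, 298)))
train_indices = sorted(set(range(n_samples)).difference(test_indices))

print(seed_number)          # seed derived from the last three digits
print(len(test_indices))    # number of test examples
print(len(train_indices))   # number of training examples

# The two index sets never overlap and together cover the full dataset
assert set(test_indices).isdisjoint(train_indices)
assert len(test_indices) + len(train_indices) == n_samples
```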

Question 1: Describe the dataset

The number of marks that each question is worth is given in brackets after the question.

  1. Briefly describe, in your own words, what the full dataset contains.
  2. The documentation says that the data have been “normalized”. What does that mean in this case? Justify your answer briefly with your observations of the data.
  3. Write a function count_digits_in_dataset that takes a dataset as input and returns a length-10 numpy array giving the number of times each digit occurs.
    • The function should work whether the input is digit_data, train_data or test_data.
    • count_digits_in_dataset({"target": [1,1,9,9,9]}) should return [0,2,0,0,0,0,0,0,0,3].
  4. Use count_digits_in_dataset to count the digits in each dataset, and combine them into a pandas dataframe named digit_frequencies.
    • The dataframe should have three columns: full, train and test.
    • The row index should indicate the digit: e.g., digit_frequencies.loc[6]["train"] should return the number of times the digit “6” appears in the training dataset.
    • Save the dataframe as digit_frequencies.csv.
  5. Create a bar chart (use plt.bar) displaying the number of times each digit appears in the training dataset.
    • Save the plot as digit_frequency_train.png
  6. Load the file new_digit.png as a numpy array. Follow the steps described in the documentation to convert it into an 8x8 array.
    • Save the 8x8 array as the variable new_digit.

The cells below provide some hints about the steps to complete. For later questions, this kind of hint will not be provided.

Question 2: Generate an ID image

  1. Write a function make_number_image that takes a string of digits as input (e.g., "0123456789") and returns an image of that number using samples from the training dataset.
    • Hint: If the input is an $N$-digit integer, then the output should be an $8 \times 8N$ numpy array.
    • Extra requirement: the same image should not appear twice in the output. I.e., if the input is 55, then two different fives from digit_data should be used.
  2. Use make_number_image to generate an image of your student ID. Use plt.imsave to save the image as my_id.png.
  3. Use make_number_image to generate an image of the number 00000111112222233333. Save the image as number.png with black writing on a white background.
  4. Use make_number_image to generate an image of the number 567. Then, manipulate the image so that 5 is red, 6 is green, and 7 is blue. Save the image as number-rgb.png.
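The shape hint in item 1 and the colouring in item 4 both come down to array layout. The sketch below uses random placeholder tiles rather than real digits (an assumption, so that the example is self-contained): np.hstack joins N tiles of shape 8x8 into one 8x8N array, and an RGB image is an array of shape (rows, cols, 3) in which each tile can be written into a single colour channel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three placeholder 8x8 "digit" tiles with values in 0..16 (not real digits)
tiles = [rng.integers(0, 17, size=(8, 8)) for _ in range(3)]

# Side-by-side concatenation: N tiles of 8x8 give an 8 x 8N array
strip = np.hstack(tiles)
print(strip.shape)  # (8, 24)

# RGB version: write each tile into its own channel, scaled to the range 0..1
rgb = np.zeros((8, 24, 3))
for channel, tile in enumerate(tiles):
    rgb[:, 8 * channel:8 * (channel + 1), channel] = tile / 16.0
print(rgb.shape)  # (8, 24, 3)
```

With this layout the first tile appears in red, the second in green and the third in blue, all on a black background. For a greyscale array, passing cmap="gray_r" to plt.imsave maps high values to black and low values to white.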

Question 3: Describe pixels

  1. What are the possible values of the pixels in the dataset? Define a list or array named possible_values that contains all of them.
  2. Define a function image_stats that returns the mean, standard deviation, and maximum value in an input array.
  3. Define a function pixel_sum that returns the sum of all the pixel values in an input array.
  4. Which image in the full dataset has the greatest pixel sum?
    • Use pixel_sum to find the right image.
    • Define the variable brightest_image_index to be the index into digit_data["images"] of the image.
  5. There are 64 pixels in each digit image. Find the (row,col) coordinates of these two interesting ones:
    • The pixel that is non-zero in the greatest number of images in the full dataset. Define brightest_pixel as the (row,col) tuple of this pixel.
    • The pixel with the greatest variance throughout the dataset. Define least_predictable_pixel as the (row,col) of this pixel.
  6. Define another pixel of interest for the image, and locate it. Set the coordinates as interesting_pixel.
  7. Consider the subset of training examples that have the digit “1”. Sort this subset by pixel_sum, from least to greatest. Select the first, middle, and last images in this subset and arrange them into an 8x24 array. Save this array as 111.png.
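Several of these items reduce a stack of images along one axis or another. The sketch below works on a tiny made-up stack of three 2x2 images (not the digits data) to show the pattern: reducing over axis 0 gives one value per pixel, reducing over the last two axes gives one value per image, and np.unravel_index converts a flat argmax back into a (row, col) pair.

```python
import numpy as np

# Toy stack of three 2x2 "images" (made-up values)
stack = np.array([
    [[0, 5], [0, 1]],
    [[0, 5], [9, 1]],
    [[0, 0], [0, 1]],
])

# Per-pixel statistics: reduce over the image axis (axis 0)
nonzero_counts = (stack != 0).sum(axis=0)   # how many images light each pixel
pixel_variance = stack.var(axis=0)          # spread of each pixel across images


def argmax_2d(grid):
    """Return the (row, col) of the first maximum in a 2D array."""
    row, col = np.unravel_index(np.argmax(grid), grid.shape)
    return int(row), int(col)


most_often_on = argmax_2d(nonzero_counts)
most_variable = argmax_2d(pixel_variance)

# Per-image statistics: reduce over the two pixel axes
pixel_sums = stack.sum(axis=(1, 2))
brightest = int(np.argmax(pixel_sums))
order = np.argsort(pixel_sums)  # image indices from least to greatest sum

print(most_often_on, most_variable, brightest, order)
```

In this toy stack the pixel that is on most often and the pixel that varies most are different: a pixel that is always on with the same value has zero variance.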

Question 4: Calculate features of images

  1. Create a function tb_diff to calculate the difference between the sum of the top and bottom halves of an input image.
  2. Create a function lr_diff to calculate the difference between the sum of the left and right halves of an input image.
  3. Create a function max_col to calculate the maximum column sum in an image, and its index.
    • In case of a tie, use the earlier column.
  4. Create a function max_row to do the same for rows.
    • Don’t solve from scratch. Instead, it should call max_col.
    • Hint for steps 1–4: we provide a set of assertions that should work.
  5. Create a function image_features to run all of the above functions (including image_stats and pixel_sum from Question 3) on an input image and combine the outputs into a length-10 vector.
  6. Run image_features on all of the images in the train and test sets, and compile the results into two dataframes, train_features and test_features. Save each dataframe to a CSV.
    • CSV filenames should be: train_features.csv and test_features.csv.
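The requirement that max_row call max_col works because the row sums of an image are the column sums of its transpose. The sketch below illustrates this on a made-up 4x4 array. The sign convention of tb_diff (top minus bottom) and the return order (value, index) are assumptions; the assertions provided with the assignment define the conventions actually required.

```python
import numpy as np


def tb_diff(image):
    """Sum of the top half minus sum of the bottom half (sign is an assumption)."""
    half = image.shape[0] // 2
    return image[:half].sum() - image[half:].sum()


def max_col(image):
    """Return (largest column sum, index of that column)."""
    col_sums = image.sum(axis=0)
    # np.argmax returns the first maximum, so ties go to the earlier column
    best = int(np.argmax(col_sums))
    return col_sums[best], best


def max_row(image):
    """Same as max_col, but for rows: transpose, then reuse max_col."""
    return max_col(image.T)


image = np.array([[1, 2, 0, 0],
                  [0, 3, 3, 0],
                  [0, 0, 0, 1],
                  [4, 0, 2, 0]])

print(tb_diff(image))   # top half sums to 9, bottom half to 7
print(max_col(image))   # columns 0, 1 and 2 tie at 5; the earliest wins
print(max_row(image))   # rows 1 and 3 tie at 6; the earliest wins
```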

Question 5: Compare images and digits

Parts of this question and others depend on you having created the dataframes train_features and test_features.

If you are unable to solve the previous questions and define these dataframes, you can still get partial credit for the next questions by loading the contents of fake_train_features.csv and fake_test_features.csv.

ONLY DO THIS if you have not managed to build the dataframes! The data is “fake”: it has been invented and resembles real data for the purposes of the next tasks, but it is not real or accurate.

# Code to load the dummy data
train_features = pd.read_csv("fake_train_features.csv", index_col=0)
test_features = pd.read_csv("fake_test_features.csv", index_col=0)

  1. Summarise the values in train_features. What is the range (min, max) and average (mean) of each feature?
    • Store the values in train_feat_summary
  2. Define a function distance_euclidean that takes two feature vectors as input and returns the ‘Euclidean distance’ between them.
    • Each ‘feature vector’ should be a length-10 vector with the features calculated in Questions 3 and 4.
    • You should check that the distance of a vector with itself is 0: distance_euclidean(vec1,vec1) == 0
  3. Define another function distance_2 that takes two feature vectors as input and returns a ‘distance’ between them.
    • It is up to you to define your approach, but it should be a distance score, with a minimum value 0 when the inputs are identical.
    • It should not be Euclidean distance again, of course!
    • Explain your approach in the docstring for the function.
  4. Define a function compare_digits(i,j,feat1,feat2) that generates a scatter plot based on 4 inputs: two digits i and j, and two feature names feat1 and feat2.
    • Each point (x,y) in the scatter plot should indicate the (feat1,feat2) values of a digit.
    • There should be one point for every image in the training set that is an i or j.
    • The points that correspond to digit i should be one colour, and the points that correspond to digit j should be a second colour.
    • The plot should have a title (use plt.title) indicating what digits are being compared.
    • The plot axes should have labels (use plt.xlabel and plt.ylabel) indicating what features are being compared.
    • Generate a plot using compare_digits(0,1,"tb","lr") and save the result as comparison_0_1_tb_lr.png.
  5. Use compare_digits with different digits and features, and find two examples: one where the two digits are very well separated, and one where the digits are poorly separated.
    • Save both figures using the same format as above: comparison_i_j_feat1_feat2.png.
    • Briefly discuss your observations of the digits and features you tested. Do you think any features are useful for distinguishing between digits?
    • Save the names of the two features that separated the two digits well as a tuple, best_feats.
  6. Create a scatter plot of all of the points in the training dataset using the same two features from the previous step, best_feats.
    • The plot should use 10 colours, one per digit.
    • The plot should have a title and axis labels.
    • Set the “alpha” parameter as required to make the plot more legible.
    • Based on the plot, discuss whether you think that unknown digits could be predicted based on these two features.
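For items 2 and 3, the Euclidean distance between vectors $a$ and $b$ is $\sqrt{\sum_k (a_k - b_k)^2}$. The sketch below implements it and, as one possible alternative among many, the Manhattan (city-block) distance $\sum_k |a_k - b_k|$. The choice of Manhattan distance and the name distance_manhattan are illustrative assumptions, and the toy vectors have length 2 rather than 10 to keep the arithmetic easy to follow.

```python
import numpy as np


def distance_euclidean(vec1, vec2):
    """Straight-line distance: square root of the summed squared differences."""
    diff = np.asarray(vec1, dtype=float) - np.asarray(vec2, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))


def distance_manhattan(vec1, vec2):
    """City-block distance: sum of absolute differences between components."""
    diff = np.asarray(vec1, dtype=float) - np.asarray(vec2, dtype=float)
    return float(np.sum(np.abs(diff)))


a = [0, 3]
b = [4, 0]

print(distance_euclidean(a, b))  # 5.0 (a 3-4-5 right triangle)
print(distance_manhattan(a, b))  # 7.0 (4 across plus 3 up)
print(distance_euclidean(a, a))  # 0.0 for identical inputs
```

Both measures are dominated by whichever features have the largest numeric range, so it may be worth checking the feature ranges from item 1 before relying on either.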

Question 6: Predict digits

  1. Create an array digit_averages such that digit_averages[i] is an 8x8 image that is the average of images for digit i in the training data.
    • The array should have shape (10, 8, 8).
  2. Create a function image_bitmap_distances(image, digit_averages) that compares an input 8x8 image to all of the 10 templates in digit_averages and returns a length-10 vector of distances.
    • The docstring for image_bitmap_distances should explain what distance metric is used.
    • You are free to use any distance metric, but you should briefly justify your choice in the docstring.
  3. Run the function image_bitmap_distances on all the images in the test set and use it to predict the digits.
    • Calculate the accuracy of the predictions and save this as bitmap_distance_accuracy. This should be a value between 0 and 1.
    • Create a confusion matrix out of the results using sklearn.metrics.confusion_matrix. Save the result as bitmap_distance_confusion.
    • Note: you should try to make your system work well, but you are not graded based on whether your system is successful or not!
  4. Write a function image_feature_distances(image_ind, train_features) to calculate the distance of one image to all of the other images in the training set.
    • The parameter image_ind should indicate a row in the dataframe train_features.
    • The output should have one value for every row in train_features.
    • The choice of distance metric is up to you; as always, briefly justify and explain your choice in the docstring.
  5. Run image_feature_distances on all the images in the test set and use it to predict the digits.
    • Calculate the accuracy and save this as feature_distance_accuracy.
    • Calculate the confusion matrix and save as feature_distance_confusion.
  6. Discuss the results of the two prediction methods you tried above.
    • This discussion can go as deep as you like, but should not exceed 300 words.
    • Your observations should be supported with calculations or pyplot figures where relevant.
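The template approach in items 1 to 3 can be sketched end to end on toy data. The example below uses made-up 2x2 images and two classes instead of 8x8 digits and ten classes, and it picks Euclidean distance purely as an assumption. The confusion matrix is built with NumPy so that the example is self-contained; it follows the same convention as sklearn.metrics.confusion_matrix, with rows for true labels and columns for predicted labels.

```python
import numpy as np

# Toy training data: class 0 lights the top-left pixel, class 1 the bottom-right
train_images = np.array([[[4, 0], [0, 0]],
                         [[2, 0], [0, 0]],
                         [[0, 0], [0, 4]],
                         [[0, 0], [0, 2]]], dtype=float)
train_labels = np.array([0, 0, 1, 1])

# One average image ("template") per class, stacked into shape (2, 2, 2)
templates = np.stack([train_images[train_labels == d].mean(axis=0)
                      for d in range(2)])

test_images = np.array([[[5, 0], [0, 1]],
                        [[1, 0], [0, 3]],
                        [[0, 0], [0, 0]]], dtype=float)
test_labels = np.array([0, 1, 1])

# Euclidean distance from every test image to every template, shape (3, 2)
diffs = test_images[:, None, :, :] - templates[None, :, :, :]
distances = np.sqrt((diffs ** 2).sum(axis=(2, 3)))

# Predict the class of the nearest template (ties go to the lower class index)
predictions = distances.argmin(axis=1)
accuracy = float((predictions == test_labels).mean())

# Confusion matrix: rows are true labels, columns are predicted labels
confusion = np.zeros((2, 2), dtype=int)
np.add.at(confusion, (test_labels, predictions), 1)

print(predictions, accuracy)
print(confusion)
```

The blank third test image is equally far from both templates, so the tie-breaking rule decides its label; cases like this are worth looking for when discussing the errors in item 6.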

Question 7: Transcribe digits

  1. Write a function transcribe_digits(filepath) that accepts a filepath as input and returns a prediction of the string of digits it contains.
    • You can assume that the filepath points to an image with dimensions 32xN, and that the digits are written in black on white.
    • Four different “handwriting samples” are provided as examples for you to prototype with. The intended output of transcribe_digits("handwriting_2.png") would be "2586734910".
    • You are free to re-use any of the functions you defined earlier in the assignment, or any other methods you think are appropriate.
    • The docstring should explain the approach you take.
    • Do not import or use a ready-made optical character recognition system to solve this.
    • Note: your function does not need to have perfect accuracy in order to get full marks!
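A 32xN input can be brought to the same scale as the dataset images with the block-counting step described in the dataset documentation: split the bitmap into non-overlapping 4x4 blocks and count the "on" pixels in each, giving values from 0 to 16. The sketch below shows that step with a reshape on a synthetic bitmap. It assumes the image has already been converted so that ink is 1 and background is 0, and that each digit occupies exactly 32 columns; real handwriting samples may need a different segmentation. The helper name block_count is hypothetical.

```python
import numpy as np


def block_count(bitmap, block=4):
    """Sum non-overlapping block x block cells of a 2D array."""
    rows, cols = bitmap.shape
    # Split each axis into (number of blocks, block size), then sum within blocks
    cells = bitmap.reshape(rows // block, block, cols // block, block)
    return cells.sum(axis=(1, 3))


# Synthetic 32x64 bitmap: room for two 32-pixel-wide digits (ink = 1)
bitmap = np.zeros((32, 64), dtype=int)
bitmap[0:4, 0:4] = 1   # one fully inked 4x4 block
bitmap[4:6, 4:8] = 1   # half of the block at block position (1, 1)

reduced = block_count(bitmap)
print(reduced.shape)                 # (8, 16)
print(reduced[0, 0], reduced[1, 1])  # 16 8

# Cut the reduced strip into 8x8 tiles, one per digit (equal widths assumed)
digit_tiles = np.split(reduced, reduced.shape[1] // 8, axis=1)
print(len(digit_tiles), digit_tiles[0].shape)  # 2 (8, 8)
```

Each resulting 8x8 tile is in the same 0 to 16 range as the dataset images, so it could then be passed to a classifier built in Question 6.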