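Load and analyse a dataset of handwritten digits and complete a series of tasks. The assignment covers describing the data, generating images, extracting features, predicting digits, and transcribing digit strings; marking covers code correctness, code quality, and clarity of explanation. Students must submit their code, the generated images, and the CSV files, and must make sure the code can be re-run. The assignment focuses on basic skills in data processing, visualisation, and machine learning.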

Instructions
Please read these instructions carefully. It is your responsibility to read and understand them all. If you have any questions, please email the module organiser ASAP; he will not have time to help you if you email at the last minute!
What you will do
You will load a dataset, answer questions about it, write code to analyse it, and display results.
Evaluation
The assignment is worth 60% of your grade. It is marked out of 120 points total. The weight of each question and sub-question is indicated.
Your submission will be evaluated based on:
- The correctness of the results: some questions ask you to obtain certain values and save them as variables. Your script will be automatically checked to see if these variables exist and if their values are correct.
- The correctness of the code: does it run without error, and does it perform the assigned task? This will be evaluated partly by the grader reading your code and partly by your code being re-run on test cases.
- The quality of the code: all of your code should be readable. Be sure to use appropriate variable names; all functions you define should have docstrings; code should achieve the task without excessively complicated steps.
- Your explanations: do your responses explain things clearly, in a logical order? Do you explain them succinctly, avoiding irrelevant detail?
Writing
In QHF3004 and QHF3005, you have learned how to describe and explain graphs, figures and processes in clear terms. Draw on these skills as you complete this assignment!
Submission
When you have finished editing your .ipynb file, create a zipfile that contains your code and all the supplementary files generated by your code, and then upload it to the QMplus submission point.
Remember to:
- Include your name and QMUL student ID in the header above;
- Double check that your notebook runs smoothly from a fresh kernel;
  - Tip: use the menu option Kernel > Restart Kernel and Run All Cells...
- Sign the declaration at the bottom of the notebook by ticking the checkboxes.
Tip: After uploading your zipfile, download it and unzip it to make sure that the submission is complete and correct. This is the best way to double check that you didn’t upload the wrong file by accident.
Automatic script checking
Some of the questions will ask you to define functions or variables by name. Be sure to use these names exactly: they will be used to automatically check that certain parts of your code exist and work as required.
For example, a question could say:
- Define the variable half_my_id, equal to your student ID number divided by 2, rounded down.
This solution would get full marks:
    half_my_id = int(my_id) // 2
The following solutions would be penalised:
    HALF_MY_ID = int(my_id) // 2
Digits dataset
Load dataset, define train/test split (0 marks)
Run the code below to import a dataset of digits as digit_data.
Read the documentation to make sure you understand it:
- Link to documentation
- In case the documentation website is not accessible, a copy is preserved in digit_dataset_documentation.pdf.
The documentation explains that this set of 1797 images is the ‘test set’ of a larger set of images. For this assignment, please disregard that context. Instead, consider that we have:
- a training set of 1500 examples
- a test set of 297 examples
- together, these make a full set of 1797 examples
The code below will create two new objects, train_data (which contains around 84% of the data) and test_data (which contains the other 16%). In later questions, you will use the properties of the training data to predict the labels for the test data.
Note: your ID is used to make the test/train split unique to you.
IMPORTANT: DON’T EDIT THIS CODE
    # Package imports
IMPORTANT: If any lines in this cell trigger an error, please contact me!
Question 1: Describe the dataset
The number of marks that each question is worth is given in brackets after the question.
- Briefly describe, in your own words, what the full dataset contains.
- The documentation says that the data have been “normalized”. What does that mean in this case? Justify your answer briefly with your observations of the data.
- Write a function count_digits_in_dataset that takes a dataset as input and returns a length-10 numpy array giving the number of times each digit occurs.
  - The function should work whether the input is digit_data, train_data or test_data.
  - count_digits_in_dataset({"target": [1,1,9,9,9]}) should return [0,2,0,0,0,0,0,0,0,3].
- Use count_digits_in_dataset to count the digits in each dataset, and combine them into a pandas dataframe named digit_frequencies.
  - The dataframe should have three columns: full, train and test.
  - The row index should indicate the digit: e.g., digit_frequencies.loc[6]["train"] should return the number of times the digit “6” appears in the training dataset.
  - Save the dataframe as digit_frequencies.csv.
- Create a bar chart (use plt.bar) displaying the number of times each digit appears in the training dataset.
  - Save the plot as digit_frequency_train.png
- Load the file new_digit.png as a numpy array. Follow the steps described in the documentation to convert it into an 8x8 array.
  - Save the 8x8 array as the variable new_digit.
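The counting step can be sketched as follows. This is only a minimal illustration, assuming the dataset can be indexed like a dictionary with a "target" entry holding integer labels from 0 to 9, as in the example call above. Using np.bincount is one possible choice, not a required approach.

```python
import numpy as np


def count_digits_in_dataset(dataset):
    """Return a length-10 array with the number of occurrences of each digit 0-9."""
    # Assumption: dataset["target"] holds integer labels in the range 0..9.
    labels = np.asarray(dataset["target"], dtype=int)
    # minlength=10 keeps a slot for digits that never occur in the input.
    return np.bincount(labels, minlength=10)


counts = count_digits_in_dataset({"target": [1, 1, 9, 9, 9]})
print(counts.tolist())  # [0, 2, 0, 0, 0, 0, 0, 0, 0, 3]
```

The same call should work on any object that exposes "target" in this way, which is why the function does not refer to a particular dataset by name.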
The cells below provide some hints about the steps to complete. For later questions, this kind of hint will not be provided.
Question 2: Generate an ID image
- Write a function make_number_image that takes a string of digits as input (e.g., "0123456789") and returns an image of that number using samples from the training dataset.
  - Hint: If the input is an $N$-digit integer, then the output should be an $8 \times 8N$ numpy array.
  - Extra requirement: the same image should not appear twice in the output. I.e., if the input is 55, then two different fives from digit_data should be used.
- Use make_number_image to generate an image of your student ID. Use plt.imsave to save the image as my_id.png.
- Use make_number_image to generate an image of the number 00000111112222233333. Save the image as number.png with black writing on a white background.
- Use make_number_image to generate an image of the number 567. Then, manipulate the image so that 5 is red, 6 is green, and 7 is blue. Save the image as number-rgb.png.
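The $8 \times 8N$ layout and the "no repeated image" requirement can be illustrated with a small sketch. It is not the expected solution: the second parameter `dataset` and the toy data are assumptions made so that the example is self-contained, whereas the question asks for a function that takes only the digit string.

```python
import numpy as np


def make_number_image(digit_string, dataset):
    """Build an 8 x 8N image of digit_string from distinct samples in dataset."""
    # Assumption: dataset["images"] has shape (n, 8, 8) and dataset["target"] has shape (n,).
    targets = np.asarray(dataset["target"])
    used = set()  # indices already placed in the output
    tiles = []
    for char in digit_string:
        candidates = [i for i in np.flatnonzero(targets == int(char)) if i not in used]
        if not candidates:
            raise ValueError(f"not enough unused samples of digit {char}")
        used.add(candidates[0])
        tiles.append(dataset["images"][candidates[0]])
    # Placing the 8x8 tiles side by side gives an 8 x 8N array.
    return np.hstack(tiles)


# Hypothetical toy dataset: six random 8x8 images with made-up labels.
rng = np.random.default_rng(0)
toy = {"images": rng.integers(0, 17, size=(6, 8, 8)), "target": np.array([5, 5, 6, 7, 5, 6])}
image = make_number_image("556", toy)
print(image.shape)  # (8, 24)
```

Tracking used indices in a set is one simple way to guarantee that repeated digits draw on different samples; sampling without replacement per digit would be another.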
Question 3: Describe pixels
- What are the possible values of the pixels in the dataset? Define a list or array named possible_values that contains all of them.
- Define a function image_stats that returns the mean, standard deviation, and maximum value in an input array.
- Define a function pixel_sum that returns the sum of all the pixel values in an input array.
- Which image in the full dataset has the greatest pixel sum?
  - Use pixel_sum to find the right image.
  - Define the variable brightest_image_index to be the index into digit_data["images"] of the image.
- There are 64 pixels in each digit image. Find the (row,col) coordinates of these two interesting ones:
  - The pixel that is non-zero in the greatest number of images in the full dataset. Define brightest_pixel as the (row,col) tuple of this pixel.
  - The pixel with the greatest variance throughout the dataset. Define least_predictable_pixel as the (row,col) of this pixel.
- Define another pixel of interest for the image, and locate it. Set the coordinates as interesting_pixel.
- Consider the subset of training examples that have the digit “1”. Sort this subset by pixel_sum, from least to greatest. Select the first, middle, and last images in this subset and arrange them into an 8x24 array. Save this array as 111.png.
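The per-image and per-pixel quantities above differ only in which axes are reduced. The sketch below shows this on a hypothetical three-image stack; the array `images` is invented for illustration and stands in for an (n_images, 8, 8) array.

```python
import numpy as np

# Toy stand-in for the image stack; assumed shape (n_images, 8, 8).
images = np.zeros((3, 8, 8))
images[0, 2, 3] = 16
images[1, 2, 3] = 4
images[2, 5, 1] = 8

# Summing over the two pixel axes gives one total per image.
pixel_sums = images.sum(axis=(1, 2))
brightest_image_index = int(np.argmax(pixel_sums))

# Summing over the image axis gives one count per pixel position.
nonzero_counts = (images != 0).sum(axis=0)
flat_index = np.argmax(nonzero_counts)
brightest_pixel = tuple(int(v) for v in np.unravel_index(flat_index, nonzero_counts.shape))

# Variance of each pixel position across all images.
pixel_variance = images.var(axis=0)
flat_index = np.argmax(pixel_variance)
least_predictable_pixel = tuple(int(v) for v in np.unravel_index(flat_index, pixel_variance.shape))

print(brightest_image_index, brightest_pixel, least_predictable_pixel)  # 0 (2, 3) (2, 3)
```

np.argmax works on the flattened array, so np.unravel_index is used here to turn the flat position back into (row, col) coordinates.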
Question 4: Calculate features of images
- Create a function tb_diff to calculate the difference between the sum of the top and bottom halves of an input image.
- Create a function lr_diff to calculate the difference between the sum of the left and right halves of an input image.
- Create a function max_col to calculate the maximum column sum in an image, and its index.
  - In case of a tie, use the earlier column.
- Create a function max_row to do the same for rows.
  - Don’t solve from scratch. Instead, it should call max_col.
  - Hint for steps 1–4: we provide a set of assertions that should work.
- Create a function image_features to run all of the above functions (including image_stats and pixel_sum from Question 3) on an input image and combine the outputs into a length-10 vector.
- Run image_features on all of the images in the train and test sets, and compile the results into two dataframes, train_features and test_features. Save each dataframe to a CSV.
  - CSV filenames should be: train_features.csv and test_features.csv.
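The idea of reusing max_col for rows can be sketched by transposing the image, since the rows of an image are the columns of its transpose. The return order (sum, index) and the sign convention of tb_diff (top minus bottom) are assumptions here; the provided assertions are the authority on both.

```python
import numpy as np


def max_col(image):
    """Return (largest column sum, index of that column); ties go to the earlier column."""
    col_sums = np.asarray(image).sum(axis=0)
    index = int(np.argmax(col_sums))  # np.argmax returns the first maximum
    return col_sums[index], index


def max_row(image):
    """Return (largest row sum, index of that row) by calling max_col on the transpose."""
    return max_col(np.asarray(image).T)


def tb_diff(image):
    """Sum of the top half minus sum of the bottom half (assumed sign convention)."""
    image = np.asarray(image)
    half = image.shape[0] // 2
    return image[:half].sum() - image[half:].sum()


small = np.array([[1, 2],
                  [3, 4]])
print(max_col(small), max_row(small), tb_diff(small))
```

One reading of the length-10 requirement is three values from image_stats, one from pixel_sum, one each from tb_diff and lr_diff, and two each from max_col and max_row.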
Question 5: Compare images and digits
Parts of this question and others depend on you having created the dataframes train_features and test_features.
If you are unable to solve the previous questions and define these dataframes, you can still get partial credit for the next questions by loading the contents of fake_train_features.csv and fake_test_features.csv.
ONLY DO THIS if you have not managed to build the dataframes! The data is “fake”: it has been invented and resembles real data for the purposes of the next tasks, but it is not real or accurate.
    # Code to load the dummy data
- Summarise the values in train_features. What is the range (min, max) and average (mean) of each feature?
  - Store the values in train_feat_summary
- Define a function distance_euclidean that takes two feature vectors as input and returns the ‘Euclidean distance’ between them.
  - Each ‘feature vector’ should be a length-10 vector with the features calculated in Questions 3 and 4.
  - You should check that the distance of a vector with itself is 0: distance_euclidean(vec1,vec1) == 0
- Define another function distance_2 that takes two feature vectors as input and returns a ‘distance’ between them.
  - It is up to you to define your approach, but it should be a distance score, with a minimum value 0 when the inputs are identical.
  - It should not be Euclidean distance again, of course!
  - Explain your approach in the docstring for the function.
- Define a function compare_digits(i,j,feat1,feat2) that generates a scatter plot based on 4 inputs: two digits i and j, and two feature names feat1 and feat2.
  - Each point (x,y) in the scatter plot should indicate the (feat1,feat2) values of a digit.
  - There should be one point for every image in the training set that is an i or j.
  - The set of points that correspond to digit i should be one colour, and the points that correspond to digit j should be a second colour.
  - The plot should have a title (use plt.title) indicating what digits are being compared.
  - The plot axes should have labels (use plt.xlabel and plt.ylabel) indicating what features are being compared.
  - Generate a plot using compare_digits(0,1,"tb","lr") and save the result as comparison_0_1_tb_lr.png.
- Use compare_digits with different digits and features, and find two examples: one where the two digits are very well separated, and one where the digits are poorly separated.
  - Save both figures using the same format as above: comparison_i_j_feat1_feat2.png.
  - Briefly discuss your observations of the digits and features you tested. Do you think any features are useful for distinguishing between digits?
  - Save the names of the two features that separated the two digits well as a tuple, best_feats.
- Create a scatter plot of all of the points in the training dataset using the same two features from the previous step, best_feats.
  - The plot should use 10 colours, one per digit.
  - The plot should have a title and axis labels.
  - Set the “alpha” parameter as required to make the plot more legible.
  - Based on the plot, discuss whether you think that unknown digits could be predicted based on these two features.
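For reference, the Euclidean distance between two feature vectors $u$ and $v$ is $d(u, v) = \sqrt{\sum_k (u_k - v_k)^2}$. A minimal sketch of that formula follows, assuming both inputs are equal-length numeric sequences; the example vectors are invented.

```python
import numpy as np


def distance_euclidean(vec1, vec2):
    """Return the Euclidean (L2) distance between two equal-length feature vectors."""
    diff = np.asarray(vec1, dtype=float) - np.asarray(vec2, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))


vec1 = [0.0, 0.0]
vec2 = [3.0, 4.0]
print(distance_euclidean(vec1, vec2))  # 5.0
print(distance_euclidean(vec1, vec1))  # 0.0
```

Because the features have very different ranges, a feature with large values will tend to dominate this distance; that observation may be a useful starting point when designing distance_2.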
Question 6: Predict digits
- Create an array digit_averages such that digit_averages[i] is an 8x8 image that is the average of images for digit i in the training data.
  - The array should have shape (10, 8, 8).
- Create a function image_bitmap_distances(image, digit_averages) that compares an input 8x8 image to all of the 10 templates in digit_averages and returns a length-10 vector of distances.
  - The docstring for image_bitmap_distances should explain what distance metric is used.
  - You are free to use any distance metric, but you should briefly justify your choice in the docstring.
- Run the function image_bitmap_distances on all the images in the test set and use it to predict the digits.
  - Calculate the accuracy of the predictions and save this as bitmap_distance_accuracy. This should be a value between 0 and 1.
  - Create a confusion matrix out of the results using sklearn.metrics.confusion_matrix. Save the result as bitmap_distance_confusion.
  - Note: you should try to make your system work well, but you are not graded based on whether your system is successful or not!
- Write a function image_feature_distances(image_ind, train_features) to calculate the distance of one image to all of the other images in the training set.
  - The parameter image_ind should indicate a row in the dataframe train_features.
  - The output should have one value for every row in train_features.
  - The choice of distance metric is up to you; as always, briefly justify and explain your choice in the docstring.
- Run image_feature_distances on all the images in the test set and use it to predict the digits.
  - Calculate the accuracy and save this as feature_distance_accuracy.
  - Calculate the confusion matrix and save as feature_distance_confusion.
- Discuss the results of the two prediction methods you tried above.
  - This discussion can go as deep as you like, but should not exceed 300 words.
  - Your observations should be supported with calculations or pyplot figures where relevant.
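The template approach can be sketched on toy data as follows. Euclidean distance is used purely as a simple default, and the arrays `images` and `labels` are hypothetical stand-ins for the training images and their labels; the choice of metric and its justification remain up to you.

```python
import numpy as np


def image_bitmap_distances(image, digit_averages):
    """Return the Euclidean distance from image to each of the 10 templates.

    Euclidean distance is used here only as a simple default choice.
    """
    diffs = np.asarray(digit_averages, dtype=float) - np.asarray(image, dtype=float)
    # Reduce over the two pixel axes, leaving one distance per template.
    return np.sqrt((diffs ** 2).sum(axis=(1, 2)))


# Toy data: every image of digit d is an 8x8 array filled with the value d.
images = np.stack([np.full((8, 8), d, dtype=float) for d in range(10)] * 2)
labels = np.tile(np.arange(10), 2)

# Average the images of each digit into a (10, 8, 8) array of templates.
digit_averages = np.stack([images[labels == d].mean(axis=0) for d in range(10)])

# The predicted digit is the template with the smallest distance.
query = np.full((8, 8), 6.2)
predicted = int(np.argmin(image_bitmap_distances(query, digit_averages)))
print(digit_averages.shape, predicted)  # (10, 8, 8) 6
```

Accuracy is then the fraction of test images whose predicted digit equals the true label, which always lies between 0 and 1.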
Question 7: Transcribe digits
- Write a function transcribe_digits(filepath) that accepts a filepath as input and returns a prediction of the string of digits it contains.
  - You can assume that the filepath points to an image with dimensions 32xN, and that the digits are written in black on white.
  - Four different “handwriting samples” are provided as examples for you to prototype with. The intended output of transcribe_digits("handwriting_2.png") would be "2586734910".
  - You are free to re-use any of the functions you defined earlier in the assignment, or any other methods you think are appropriate.
  - The docstring should explain the approach you take.
  - Do not import or use a ready-made optical character recognition system to solve this.
  - Note: your function does not need to have perfect accuracy in order to get full marks!
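One possible preprocessing step is sketched below, under strong assumptions that should be checked against the documentation and the sample files: that each digit occupies exactly 32 columns, that the image has already been converted to a binary array with ink as 1 and background as 0, and that 8x8 images are obtained by counting ink pixels in non-overlapping 4x4 blocks. Both helper names are hypothetical.

```python
import numpy as np


def downsample_blocks(bitmap, block=4):
    """Sum non-overlapping block x block cells of a 2-D array."""
    rows, cols = bitmap.shape
    return bitmap.reshape(rows // block, block, cols // block, block).sum(axis=(1, 3))


def split_and_downsample(page):
    """Cut a 32 x N binary image into 32-wide tiles and reduce each tile to 8 x 8."""
    # Assumption: ink pixels are 1, background is 0, every digit is 32 columns wide.
    n_tiles = page.shape[1] // 32
    return [downsample_blocks(page[:, 32 * k: 32 * (k + 1)]) for k in range(n_tiles)]


# Toy page with two tiles: one with a single inked 4x4 block, one fully inked.
page = np.zeros((32, 64), dtype=int)
page[0:4, 0:4] = 1
page[:, 32:] = 1
tiles = split_and_downsample(page)
print(len(tiles), tiles[0].shape, tiles[0][0, 0])  # 2 (8, 8) 16
```

Each resulting 8x8 tile could then be passed to whichever prediction method from Question 6 worked best. Real handwriting samples may not be evenly spaced, in which case the fixed-width split would need to be replaced by a segmentation step.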