Python Assignment: CS6035 Machine Learning on CLAMP

Practice your understanding of the pandas and scikit-learn libraries, learn foundational concepts in Data Science and Machine Learning, and apply them to the cybersecurity domain.


Learning Goals of this Project

Students will learn introductory-level concepts about Data Science and Machine Learning as they apply to the cybersecurity domain. This lab develops an understanding of the general data science process and commonly used Python libraries like pandas and scikit-learn.

Final Deliverables

Complete the Canvas Quiz.

Submit the five Python files to Gradescope. The files should be named task1.py, task2.py, task3.py, task4.py, and task5.py and should implement the functions described below (and in the starter code).

Important Reference Material

Submission

  • Gradescope (autograded)
  • Canvas Quiz

Introduction

You may be asking yourself, “What is the importance of learning about Data Science and Machine Learning in a cybersecurity class?” The short answer is that data science provides a useful set of tools for handling the massive amount of data that flows through IT systems, and it is used by many security teams, either explicitly or within the tools and programs they rely on, so it is important to get a basic understanding of how it works. This project walks through a simplified scenario where data science can be applied. If it sparks your interest, there are plenty of other ML-focused classes at Georgia Tech that you may be interested in taking, as well as a wealth of training material on YouTube, Coursera, Udacity, Udemy, DataCamp, etc. that you could use to go deeper into the field.

Scenario

You are an analyst on the security team of a midsized software company that runs a messaging app (a Slack, Google Chat, or Microsoft Teams competitor). It is Monday morning, and you see an email from your manager setting up a meeting to discuss a new security feature that needs to be implemented in the product ASAP. You join the meeting and learn that there has recently been a big uptick in malicious executable files being sent over the chat app, and it is starting to generate bad press for the company.

A few analysts on the team have already analyzed a set of files sent over the app and classified them as malicious or benign. They also used a Python library (pefile) to extract attributes of each executable file and have created a CSV with those extracted attributes and a column named class, where a 1 denotes a malicious file and a 0 denotes a benign file. They documented their preprocessing work in a readme in the git repo (urwithajit9/ClaMP: A Malware classifier dataset built with header fields’ values of Portable Executable files (github.com)) and shared the repo with the software engineers so they can get to work writing code that will generate those features for every executable file sent over the messaging app.

Your boss turns to you and says: “I would like you to help us understand a bit more about how big of a problem this is on our app, and to write a model that takes in these features and produces a propensity score from 0 to 1, where scores closer to 0 mean a low likelihood of the file being malicious and scores closer to 1 mean a higher likelihood. Also, since the team may want to reuse this type of work in the future for different types of files or with different extracted attributes, you should create functions that can be reused with minimal rework.” Once you produce a model, you will share your code and the trained model file with the software engineers, who will integrate the model into the messaging app and score all files uploaded to the app.

General Advice

  • Develop locally, then test in the autograder once you are confident your code runs without errors. You can run the Python files locally, develop in a local VS Code or Jupyter notebook, or use a hosted web notebook like Google Colab.
  • Do not use print statements in your Gradescope submissions. While print statements are useful for debugging issues locally, in an autograder context they can leak sensitive answer information. We have detections in Gradescope that will block you from viewing scores/outputs from your code if any of your submitted code uses print statements. If you try to brute-force/hack/game the autograder or extract information, we will give you a 0 for the whole assignment.
  • Read the Python library documentation. You will be using pandas, scikit-learn, and yellowbrick.
  • Do not hard code solutions (e.g. do not return constants or memorized answers instead of computing results from the input data).

Task 0

We have a Canvas quiz that is meant to check that you have read the library documentation for the packages we use in this class. It is not meant to be tricky and can be completed before you start the project or after you finish it.

  • scikit-learn Documentation

Deliverables

  • Complete Canvas Quiz

Task 1

Let's first get familiar with some pandas basics. pandas is a library that handles data frames, which you can think of as a Python class for tabular data. In this section you will write a very simple function that takes in a pandas DataFrame of the file attributes and classifications and returns some simple statistics: a count of rows, a count of columns, a count of rows where the classification is 1, a count of rows where the classification is 0, and the percentage of rows classified as 1 in the dataset (see the function skeleton in the starter code and the sketch below). In the real world you would also use plotting tools like Power BI, Tableau, Data Studio, or Matplotlib to create graphics and other visuals to better understand the dataset you are working with; this step is generally known as Exploratory Data Analysis. Since we are using an autograder for this class, we will skip the plotting for this project.
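For orientation, here is a minimal sketch of the kind of counting this task asks for. The function name comes from the deliverable below, but the exact signature, return type, and target column name are assumptions; defer to the starter code:

```python
import pandas as pd

def find_dataset_statistics(dataset: pd.DataFrame, target_col: str) -> tuple:
    """Hypothetical signature -- match the starter code's actual parameters/returns."""
    n_records = len(dataset)                             # count of rows
    n_columns = len(dataset.columns)                     # count of columns
    n_negative = int((dataset[target_col] == 0).sum())   # rows classified 0 (benign)
    n_positive = int((dataset[target_col] == 1).sum())   # rows classified 1 (malicious)
    perc_positive = 100.0 * n_positive / n_records       # percent of rows classified 1
    return n_records, n_columns, n_negative, n_positive, perc_positive
```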

  • pandas documentation pandas 1.5.3 documentation (pydata.org)
  • What is Exploratory Data Analysis? | IBM
  • Top Data Visualization Tools | KDnuggets

Deliverables

  • Complete find_dataset_statistics function in task1.py
  • Submit task1.py to gradescope

Task 2

Now that you have a basic understanding of pandas and the dataset, it is time to dive into some more complex data processing tasks. The first subtask is splitting your dataset into features and targets (columns), and splitting your dataset into training and test sets (rows). These are basic concepts in model building: at a high level, it is important to hold out a subset of your data when you train a model so you can estimate its performance on unseen samples and determine whether the resulting model is overfit (performs much better on training data than on test data). Preprocessing data is also important, since most models only take in numerical values, so categorical features need to be “encoded” into numerical values before models can use them. Numerical scaling can be more or less useful depending on the type of model, but it is especially important for linear models. These preprocessing techniques will give you options to augment your dataset and improve model performance. A rough sketch of the split-and-scale workflow follows below.
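A minimal sketch using scikit-learn (the helper name, test size, and the fit-on-train scaling pattern in the comments are illustrative assumptions, not the starter code's exact API):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_dataset(dataset: pd.DataFrame, target_col: str,
                  test_size: float = 0.2, random_state: int = 0):
    # Column split: every column except the target is a feature.
    X = dataset.drop(columns=[target_col])
    y = dataset[target_col]
    # Row split: hold out a test set to estimate performance on unseen samples.
    return train_test_split(X, y, test_size=test_size, random_state=random_state)

# Usage (df is a hypothetical DataFrame loaded from the ClaMP CSV):
# X_train, X_test, y_train, y_test = split_dataset(df, "class")
# scaler = StandardScaler().fit(X_train)        # fit scaling on training rows only...
# X_train_scaled = scaler.transform(X_train)    # ...then transform both splits, so
# X_test_scaled = scaler.transform(X_test)      # nothing leaks from the test set
```

Categorical columns would get the same fit-on-train treatment with an encoder such as sklearn.preprocessing.OneHotEncoder.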

  • Training and Test Sets | Machine Learning | Google Developers
  • Bias-variance tradeoff - Wikipedia
  • https://en.wikipedia.org/wiki/Overfitting
  • Categorical and Numerical Types of Data | 365 Data Science
  • scikit-learn: machine learning in Python scikit-learn 1.2.1 documentation

Deliverables

  • Make use of the scikit-learn (sklearn) python package in your function implementations
  • Complete train_test_split function and PreprocessDataset class in task2.py
  • Submit task2.py to Gradescope

Task 3

So far we have functions to split and preprocess the data. Now we will run a basic model on the data to cluster files (rows) with similar attributes together. We will use an unsupervised model (one with no target column), KMeans, since it is simple to use and understand. Please use scikit-learn to create the model and Yellowbrick to determine the optimal value of k for our dataset.
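For example, Yellowbrick's KElbowVisualizer wraps a scikit-learn KMeans model and applies the elbow heuristic across a range of k values (the helper name and k range below are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

def find_optimal_k(X, k_range=(2, 10), random_state: int = 0) -> int:
    # Fit KMeans once per candidate k and let the elbow heuristic pick the best.
    visualizer = KElbowVisualizer(KMeans(random_state=random_state), k=k_range)
    visualizer.fit(X)
    return visualizer.elbow_value_

# With k chosen, fit the final model and label each file (row) with its cluster:
# kmeans = KMeans(n_clusters=find_optimal_k(X), random_state=0).fit(X)
# cluster_labels = kmeans.labels_
```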

  • Unsupervised learning - Wikipedia
  • What is Clustering? | Machine Learning | Google Developers
  • ML | K-means++ Algorithm - GeeksforGeeks
  • scikit-learn: machine learning in Python scikit-learn 1.2.1 documentation
  • Yellowbrick: Machine Learning Visualization Yellowbrick v1.5 documentation (scikit-yb.org)

Deliverables

  • Make use of the scikit-learn (sklearn) and yellowbrick python packages in your function implementations
  • Complete the KmeansClustering class in task3.py
  • Submit task3.py to Gradescope

Task 4

Finally, we are ready to try a few different supervised classification models. We have chosen a few commonly used models for you to use here, but there are many options, and in the real world specific algorithms may fit a specific dataset better. You also won't be doing any hyperparameter tuning yet, so you can focus on writing the code. You will train a model using the training set, predict on the training and test sets, calculate performance metrics, and return a ModelMetrics object and the trained scikit-learn model from each model function. (Note: use RFE to determine feature importance for the logistic regression model, but do NOT use RFE for the random forest or gradient boosting models; use their built-in feature importance values instead.)
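To make the logistic regression case concrete, here is a hedged sketch (the metric set, RFE configuration, and return shape are assumptions; your real functions should return the ModelMetrics object defined in the starter code):

```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

def logistic_regression_sketch(X_train, y_train, X_test, y_test):
    # Train on the training split only.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # predict_proba yields the 0-1 propensity scores; threshold-based metrics
    # (accuracy/precision/recall) use the hard predictions instead.
    test_preds = model.predict(X_test)
    test_probs = model.predict_proba(X_test)[:, 1]
    metrics = {
        "accuracy": accuracy_score(y_test, test_preds),
        "precision": precision_score(y_test, test_preds),
        "recall": recall_score(y_test, test_preds),
        "roc_auc": roc_auc_score(y_test, test_probs),
    }

    # RFE ranks features for logistic regression only; for random forest and
    # gradient boosting, read the fitted model's feature_importances_ instead.
    rfe = RFE(LogisticRegression(max_iter=1000)).fit(X_train, y_train)
    return model, metrics, rfe.ranking_
```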

  • Supervised Learning | Machine Learning | Google Developers
  • scikit-learn: machine learning in Python scikit-learn 1.2.1 documentation
  • Classification | Machine Learning | Google Developers
  • Classification: True vs. False and Positive vs. Negative | Machine Learning | Google Developers
  • Classification: Accuracy | Machine Learning | Google Developers
  • Classification: Precision and Recall | Machine Learning | Google Developers
  • Classification: ROC Curve and AUC | Machine Learning | Google Developers

Deliverables

  • Make use of the scikit-learn (sklearn) python package in your function implementations
  • Complete the calculate_naive_metrics, calculate_logistic_regression_metrics, calculate_random_forest_metrics and calculate_gradient_boosting_metrics functions in task4.py
  • Submit task4.py to Gradescope

Task 5

Now that you have written functions for the different steps of the model-building process, you will put it all together. You will write code that trains a model with hyperparameters you determine (do any tuning locally or in a notebook, i.e. don't tune your model in Gradescope, since the autograder will likely time out). Your code will take in the ClaMP training data, train a model, predict on a test set, and output a value from 0 to 1 for each row. The autograder will compare your predictions with the correct answers; to get credit, you will need a ROC AUC score of 0.9 or higher on the test set (this should not require much hyperparameter tuning for this dataset). A sketch follows below.
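A hedged sketch of how the pieces could fit together (the gradient boosting choice, column handling, and output column name are illustrative assumptions; the scenario only fixes the class target column and the 0-to-1 score requirement, so mirror the starter code's actual signature):

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def train_model_return_scores(train_df: pd.DataFrame, test_df: pd.DataFrame) -> pd.DataFrame:
    # Separate features from the target column named in the scenario.
    X_train = train_df.drop(columns=["class"])
    y_train = train_df["class"]

    # Defaults are a reasonable starting point; tune locally, not in Gradescope.
    model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

    # One probability in [0, 1] per test row; higher means more likely malicious.
    scores = model.predict_proba(test_df[X_train.columns])[:, 1]
    return pd.DataFrame({"index": test_df.index, "malware_score": scores})
```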

Deliverables

  • Make use of any of the techniques we covered in this project in your function implementation
  • Complete the train_model_return_scores function in task5.py
  • Submit task5.py to Gradescope

FAQs

  • How many submissions do we have in Gradescope?
    • Answer: Unlimited
  • When are office hours for this project?
    • Answer: There will be a pinned Ed Discussion Post with office hour date/times as well as recordings after they take place.
  • I need an extension for this project
    • Answer: Open a private Ed Discussion post describing your situation, and we will determine whether it is an approved reason for an extension and decide how many extra days you will get. Note: the earlier you tell us, the more likely we can give you an extension (asking on the day it is due makes it very unlikely that you will get one).
  • I am overwhelmed and don’t know where to start.
    • Answer: Start by reviewing the useful links we have provided, then do the coding tasks (Tasks 1-5) in order. They build on each other and get progressively harder, so the earlier tasks are the easiest to complete.
  • I am using RFE to find the feature importance of a random forest or gradient boosting model and it is running for a long time and timing out in the autograder
    • Answer: Only use RFE for the logistic regression model; use the built-in feature importance values for the random forest and gradient boosting models.
  • Should I make my own post related to this project in Ed Discussion?
    • Answer: Please try to ask your question in one of the pinned project posts, and remove answer data (i.e. don't post your code, even snippets) or other information that should not be publicly shared. Ask in the public forum so others can benefit from your question.
  • I can’t see any scores/output in the autograder is it broken?
    • Answer: We have a protection in the autograder that prevents printing sensitive information, so if your code has print statements you won't see your score or any outputs of the autograder. Resubmit your code with the print statements removed and you should see the normal outputs.
  • Can you review my code in a private Ed Discussion Post?
    • Answer: Since we have a Gradescope autograder, we will not be reviewing students' code; we expect you to debug your code using information in public Ed Discussion posts or via Google searches / Stack Overflow.
  • I think I found a bug in the Autograder
    • Answer: Open a private Ed Discussion post and we can take a look. This is the first semester we are running this version of the project, and while we tested it internally with the TAs, there is a chance that we missed something. If so, we will update the autograder to fix it and make a pinned post letting students know the autograder was changed.
  • I have constructive feedback that can improve this project for next semester
    • Answer: Open a private Ed Discussion post and we will review your ideas; we may implement them in a future semester.