Python Assignment Help: CS116 TF-IDF

As with the previous assignment, this job involves writing four programs that practise basic Python program design.

Assignment Guidelines

  • This assignment covers material in Module 09.
  • Do not use recursion. All repetition must be performed using iteration (while and for loops only), and abstract list functions (map and filter).
  • Submission details:
    • Solutions to these questions must be placed in files a08q1.py, a08q2.py, a08q3.py, and a08q4.py, respectively, and must be completed using Python 3.
    • Download the interface file from the course Web page to ensure that all function names are spelled
      correctly and each function has the correct number and order of parameters.
    • All solutions must be submitted to MarkUs. No solutions will be accepted through email, even if you
      are having issues with MarkUs.
    • Verify using MarkUs and your basic test results that your files were properly submitted and are readable on MarkUs.
    • For full style marks, your program must follow the Python section of the CS116 Style Guide.
    • Be sure to review the Academic Integrity policy on the Assignments page.
  • Download the testing module from the course web page. Include import check in each solution file.
    • When a function produces a floating point value, you must use check.within for your testing. Unless told otherwise, you may use a tolerance of 0.00001 in your tests.
    • Test data for all questions will always meet the stated assumptions for consumed values.

freq dict

Complete the Python function freq_dict, which consumes a string called document and produces a dictionary whose keys are the tokens in document. For each token, the produced dictionary stores the number of occurrences (the frequency) of that token in document. HINT: to get the token list for a document, process the document as follows:

  • a. Break the entire document into words at space boundaries.
  • b. Remove punctuation from each word. After the removal, each word should contain only digits, letters, and the underscore, i.e. '0'-'9', 'a'-'z', 'A'-'Z', and '_'.
  • c. Convert each word to lower case.

For example,

doc1 = "This is an example. And this is another example: CS116 is a non_cs major, 1st year, amazing!!! course. #CS116"

The token list of doc1 should be:

['this', 'is', 'an', 'example', 'and', 'this', 'is', 'another', 'example', 'cs116', 'is', 'a', 'non_cs', 'major', '1st', 'year', 'amazing', 'course', 'cs116']

The dictionary produced by function freq_dict applied on doc1 should be:

freq_dict(doc1) => {'cs116': 2, 'major': 1, 'a': 1, 'an': 1, 'non_cs': 1, 'year': 1, 'example': 2, 'and': 1, 'this': 2, 'is': 3, 'another': 1, 'course': 1, 'amazing': 1, '1st': 1}
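
The three hint steps and the counting can be sketched as below. This is only one possible approach, not the official solution: the helper name tokenize and the constant ALLOWED are hypothetical, and dropping a word that consists entirely of punctuation is an assumption the handout does not address.

```python
import string

# Characters a token may keep (step b): ASCII letters, digits, underscore.
ALLOWED = string.ascii_letters + string.digits + "_"


def tokenize(document):
    """Return the list of lower-case tokens in document (hypothetical helper)."""
    tokens = []
    for word in document.split():  # step a: split at whitespace
        # step b: keep only the allowed characters
        cleaned = "".join(filter(lambda ch: ch in ALLOWED, word))
        # Assumption: a word made only of punctuation yields no token.
        if cleaned != "":
            tokens.append(cleaned.lower())  # step c: lower case
    return tokens


def freq_dict(document):
    """Map each token in document to the number of times it occurs."""
    counts = {}
    for token in tokenize(document):
        counts[token] = counts.get(token, 0) + 1
    return counts
```

Using an explicit ALLOWED string rather than str.isalnum keeps the filter restricted to the ASCII ranges listed in step b.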

TF-IDF

Complete the Python function tf_idf, which consumes a list of frequency dictionaries called doc_set (each one is the dictionary that freq_dict from the previous question produces for one document) and produces a dictionary whose keys are all the tokens in doc_set and whose associated values are the TF-IDF values of those tokens, calculated by:

term_frequency * (document_frequency / N)

where term_frequency is the sum of the token's values across all frequency dictionaries, document_frequency is the number of documents containing the token, and N is the total number of dictionaries (documents) in doc_set.
For example,

articles = [{'cs116': 3, 'the': 2, 'is': 2, 'an': 1, 'example': 2},
            {'the': 3, 'another': 1, 'example': 1},
            {'this': 2, 'is': 1, 'cs116': 2}]
tf_idf(articles) => {'another': 0.3333, 'an': 0.3333, 'this': 0.6667,
                     'cs116': 3.3333, 'example': 2.0, 'the': 3.3333, 'is': 2.0}

NOTE: round the TF-IDF value to 4 decimal places.
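
A minimal sketch of the formula as stated in this handout, term_frequency * (document_frequency / N), might look like the following. It assumes every count in a frequency dictionary is positive (as freq_dict would produce), so a key being present means the document contains the token. Note that this is the assignment's own formula, not the logarithmic TF-IDF found in textbooks.

```python
def tf_idf(doc_set):
    """Return a dictionary mapping each token in doc_set to its TF-IDF value."""
    n = len(doc_set)
    term_frequency = {}      # token -> total count over all documents
    document_frequency = {}  # token -> number of documents containing it
    for freq in doc_set:
        for token in freq:
            term_frequency[token] = term_frequency.get(token, 0) + freq[token]
            document_frequency[token] = document_frequency.get(token, 0) + 1
    result = {}
    for token in term_frequency:
        value = term_frequency[token] * document_frequency[token] / n
        result[token] = round(value, 4)  # 4 decimal places, per the NOTE
    return result
```

For 'cs116' in the articles example, term_frequency is 3 + 2 = 5 and document_frequency is 2 with N = 3, giving 5 * 2 / 3, which rounds to 3.3333.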

Questions 3 and 4 use the new type CS116_Marks:

class CS116_Marks:
    ''' Fields: student (Nat), assignment (Nat), participation (Nat), midterm (Nat), final_exam (Nat)
        where:
        student is a 5-digit student ID number,
        assignment is the rounded and calculated score for the assignments (an integer between 0 and 20),
        participation is the rounded and calculated score for participating in the course (an integer between 0 and 5),
        midterm is the rounded score of the midterm exam (an integer between 0 and 30),
        final_exam is the rounded score of the final exam (an integer between 0 and 45).
    '''
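
The handout shows only the field description; the examples below construct objects as CS116_Marks(12345, 10, 3, 18, 30), so the real class must also define a constructor. The following is a guess at what such a definition could look like, not the course's actual code: the __init__, __eq__, and __repr__ methods here are assumptions added so the examples can be run and compared.

```python
class CS116_Marks:
    '''Fields: student (Nat), assignment (Nat), participation (Nat),
       midterm (Nat), final_exam (Nat)'''

    def __init__(self, student, assignment, participation, midterm, final_exam):
        # Assumed constructor: one parameter per field, in the documented order.
        self.student = student
        self.assignment = assignment
        self.participation = participation
        self.midterm = midterm
        self.final_exam = final_exam

    def __eq__(self, other):
        # Assumed: two objects are equal when all five fields match.
        return (isinstance(other, CS116_Marks)
                and self.student == other.student
                and self.assignment == other.assignment
                and self.participation == other.participation
                and self.midterm == other.midterm
                and self.final_exam == other.final_exam)

    def __repr__(self):
        return "CS116_Marks({0}, {1}, {2}, {3}, {4})".format(
            self.student, self.assignment, self.participation,
            self.midterm, self.final_exam)
```

Defining __eq__ lets tests compare a produced object with an expected one directly instead of field by field.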

ispass

Complete the Python function isPass_cs116, which consumes score, a CS116_Marks object containing one student's marks for each part of the course, and produces True if the student passes CS116 and False otherwise. The final mark should be calculated using the grading scheme on our course website.
For example,

isPass_cs116(CS116_Marks(12345, 10, 3, 18, 30)) => True

  • the student gets 60% (18/30) on the midterm and 67% (30/45) on the final, the weighted exam average is 64% (48/75) (greater than 50%);
  • The final grade would be (10 + 3 + 18 + 30) / 100 = 61/100, a passing grade.
isPass_cs116(CS116_Marks(23451, 15, 3, 18, 17)) => False
  • the student gets 60% (18/30) on the midterm and 38% (17/45) on the final, the weighted exam average is 47% (35/75) (less than 50%)
isPass_cs116(CS116_Marks(34512, 5, 3, 18, 22)) => False
  • the student gets 60% (18/30) on the midterm and 49% (22/45) on the final, the weighted exam average is 53% (40/75) (greater than 50%)
  • the final grade would be (5 + 3 + 18 + 22) / 100 = 48/100, a failing grade.
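
The grading scheme itself lives on the course website and is not reproduced here, so the sketch below relies on a rule inferred from the three examples only: a student passes when the weighted exam average (midterm plus final, out of 75) is at least 50% and the final grade (out of 100) is at least 50. Whether the real thresholds are "at least" or "strictly greater than" 50% is an assumption. The small class definition is a stand-in so the block runs on its own.

```python
class CS116_Marks:
    # Minimal stand-in for the course-provided class (assumed constructor).
    def __init__(self, student, assignment, participation, midterm, final_exam):
        self.student = student
        self.assignment = assignment
        self.participation = participation
        self.midterm = midterm
        self.final_exam = final_exam


def isPass_cs116(score):
    """Return True if score represents a passing result, under the inferred rule."""
    exam_total = score.midterm + score.final_exam  # out of 75
    final_grade = score.assignment + score.participation + exam_total  # out of 100
    # Assumed rule: exam average >= 50% and final grade >= 50.
    # 2 * exam_total >= 75 is the same test as exam_total / 75 >= 0.5,
    # written with integers to avoid floating-point comparison.
    return 2 * exam_total >= 75 and final_grade >= 50
```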

assn

Complete the Python function generate_cs116_marks, that consumes four parameters:

  • assn - a dictionary with student ID numbers as keys and, for each student, a list of that student's scores on the 5 assignments as the associated value. NOTE: we assume there are 5 assignments in total this term (maximum 4 marks each), and we take the average of the best 4 assignment scores as the final assignment score. NOTE: to keep results consistent and avoid floating-point discrepancies, use integer division to compute the averages in this step. For example:
{12345: [4, 3, 0, 4, 2], 23451: [1, 2, 3, 4, 2]}
  • part - a dictionary with student ID numbers as keys, and a participation score for each student as the associated values, e.g. {12345: 4, 23451: 5}
  • mid - a dictionary with student ID numbers as keys, and a midterm score for each student as the associated values, e.g. {12345: 20}
  • final - a dictionary with student ID numbers as keys, and a final_exam score for each student as the associated values, e.g. {23451: 30}

The function produces a dictionary with the student IDs of all students in part as its keys, and a CS116_Marks object for that student as the associated value. NOTE: missing scores of the students in the other dictionaries are counted as 0.
For example,

generate_cs116_marks({12345: [4, 3, 0, 4, 2], 23451: [1, 2, 3, 4, 2]}, {12345: 4, 23451: 5}, {12345: 20}, {23451: 30}) =>
{12345: CS116_Marks(12345, 15, 4, 20, 0), 23451: CS116_Marks(23451, 10, 5, 0, 30)}
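
One way to read the example above is sketched below. The handout does not say how the average of the best 4 scores (0 to 4) becomes an assignment score out of 20; multiplying the integer average by 5 is an assumption that happens to reproduce both expected values (13 // 4 = 3 gives 15, and 11 // 4 = 2 gives 10). The class definition is a stand-in with an assumed constructor and equality method so the block runs on its own.

```python
class CS116_Marks:
    # Minimal stand-in for the course-provided class (assumed methods).
    def __init__(self, student, assignment, participation, midterm, final_exam):
        self.student = student
        self.assignment = assignment
        self.participation = participation
        self.midterm = midterm
        self.final_exam = final_exam

    def __eq__(self, other):
        return (isinstance(other, CS116_Marks)
                and self.__dict__ == other.__dict__)


def generate_cs116_marks(assn, part, mid, final):
    """Return a dictionary mapping each student ID in part to a CS116_Marks."""
    marks = {}
    for student in part:
        # A student missing from assn counts as five zeros.
        scores = sorted(assn.get(student, [0, 0, 0, 0, 0]))
        best_four = scores[1:]  # drop the single lowest of the 5 scores
        average = sum(best_four) // 4  # integer division, per the NOTE
        assignment = average * 5  # assumption: scale 0..4 up to 0..20
        marks[student] = CS116_Marks(student, assignment, part[student],
                                     mid.get(student, 0),
                                     final.get(student, 0))
    return marks
```

Using dict.get with a default of 0 covers the rule that scores missing from the other dictionaries count as 0.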