Python Assignment: COMP550 Tagged Corpora and POS Tagging

A data-processing assignment that uses a library called NLTK and consists of five short questions.

Introduction

To do this homework, you will need to download the dependency_treebank data for NLTK (see “Important note” in Lab 5 for how to download NLTK data).
You should ideally also install these packages: matplotlib and numpy.

If you have trouble installing these packages, you can post to Piazza for help or ask course staff. However, it is also possible to do this homework without them – they are only needed for plotting the histogram in Q2, and you can get a picture of the histogram from a classmate if needed. (If you take this route, you will also need to comment out the parts of the homework file where functions from matplotlib and numpy are imported and called.)

(matplotlib and numpy may already be installed if you are using Anaconda or some other distribution. Type import matplotlib and import numpy in a Python console to find out.)

Instructions

This homework comes with a file, hw4.py, which contains places for you to fill in, marked in lines including "STUDENT" or "XX".

Completing the questions below involves either completing functions (Q1, Q3, Q5), or completing code snippets and giving verbal answers (Q2, Q4) that print out when the file is run in Python. Only the sentences in bold in the questions below require answering by changing the Python file. (For example, do not include code in the file to compute the histogram referred to in Q2, as this will mess with our grading scripts.)

You should rename the file in the format hw4_lastname.py. This file is all that should be submitted on myCourses.

We should be able to run your script on the command line (= Command Prompt on Windows), without errors, using

python hw4_lastname.py

We should also be able to import your script (without errors) when running python in interactive mode, by typing

import hw4_lastname

after changing the directory to be in the same directory as your script.

Make sure to include a note specifying any collaborators in your submitted HW (see collaboration policy in the syllabus). This should be in comments at the top of hw4_lastname.py, or in comments with your submission on myCourses.

Preliminaries

hw4.py contains code to be referred to below, and places for you to fill in. Run hw4.py in a Python console/interactive mode (e.g. run hw4.py in IPython) before proceeding.

Part-of-speech (POS) tagging is one of the basic steps in developing many Natural Language Processing tools, and is often also a first step in starting to annotate and analyze a corpus in a new language. In this homework, we will explore POS tagging and apply a (very) simple POS tagger built from an already annotated corpus.

A POS tagset is the set of part-of-speech tags used for annotating a particular corpus. The Penn Tagset is one such tagset, widely used for English; have a look at it before proceeding.

For this homework, we consider a small part of the Penn Treebank POS-annotated data. This data consists of around 3900 sentences, where each word is annotated with its POS tag using the Penn POS tagset. To access the data, our code (in hw4.py) first imports the dependency_treebank from the nltk.corpus package using the command

from nltk.corpus import dependency_treebank

We then extract the tagged sentences using the following command:

tsents = dependency_treebank.tagged_sents()

tsents contains a list of tagged sentences. A tagged sentence is a list of pairs, where each pair consists of a word and its POS tag. A pair is just a "tuple" with two members. Once you've loaded hw4.py, tsents[0] contains the first tagged sentence, tsents[0][0] gives the first tuple in the first sentence (a (word, tag) pair), tsents[0][0][0] gives you the word from that pair, and tsents[0][0][1] gives its tag.
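For example, with the standard dependency_treebank data (whose first sentence begins with "Pierre Vinken"), you should see something like:

>>> tsents[0][0]
('Pierre', 'NNP')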

Question 1

Complete the code in the tag_distribution function. This function:

  • takes as input a list of tagged sentences in NLTK format (like tsents, or tsents[:10] for the first ten sentences)
  • returns a dictionary with POS tags as keys and the number of word tokens with that tag as values.

Some sample inputs to test your tag_distribution function with are given in comments in hw4.py.

This function can then be used to construct a frequency distribution of POS tags in the Penn Treebank data, by running:

freqDist = tag_distribution(tsents)
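If you are unsure how to structure the counting, here is a minimal sketch of one way to do it, assuming the (word, tag) pair format described in the Preliminaries (the skeleton in hw4.py may organize this differently):

from collections import defaultdict

def tag_distribution(tagged_sentences):
    # Count word tokens per POS tag across all tagged sentences.
    counts = defaultdict(int)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[tag] += 1
    return dict(counts)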

Question 2

Using the function plot_histogram, plot a histogram of the tag distribution with tags on the x-axis and their counts on the y-axis, ordered by descending frequency. Hint: To sort the items (i.e., key-value pairs) in a dictionary by their values, you can use:

sorted(mydict.items(), key=lambda x: x[1])
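For descending order, pass reverse=True. One way to prepare the sorted tags and counts for plotting might then be the following sketch (the exact signature of plot_histogram is defined in hw4.py, so the call at the end is only illustrative):

items = sorted(freqDist.items(), key=lambda x: x[1], reverse=True)
tags = [tag for tag, count in items]
counts = [count for tag, count in items]
# plot_histogram(tags, counts)  # illustrative call; check the real signature in hw4.py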

a). Describe the distribution you see in the histogram, in 1-2 sentences. (What does it say about how frequent different parts of speech are in English?)

Write code to determine the 5 most frequent POS tags using your frequency dictionary, and use it in answering this question:
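For example, reusing the sorting idiom from the hint above (variable names are illustrative):

top5 = sorted(freqDist.items(), key=lambda x: x[1], reverse=True)[:5]
print(top5)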

b). What are the 5 most frequent POS tags in this data? Do they agree with your intuition about what the most frequent parts of speech are in English? (Briefly explain.)

Question 3

Construct a conditional frequency distribution (CFD) by completing the code in the word_tag_distribution function. A CFD is a dictionary whose values are themselves distributions, keyed by context or condition. (You also constructed a CFD in Homework 3: a CFD is one kind of "dictionary of dictionaries".) In our case, the conditions (i.e., the keys) are words, and the value for each word is a frequency distribution of the tags that occur with it.

For example, for the word "book", the value of your CFD should be a frequency distribution of the POS tags that occur with "book". Once you have completed word_tag_distribution correctly, you can construct the conditional frequency distribution for words (and their POS tags) in this corpus by running:

>>> condFreqDist = word_tag_distribution(tsents)

This dictionary should give:

>>> condFreqDist['book']
{'NN': 7, 'VB': 1}

This means that the word “book” occurs 7 times as a noun and once as a verb in the Penn Treebank sentences.
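If you get stuck, here is a minimal sketch of the nested counting, assuming the same input format as Q1 (the skeleton in hw4.py may differ in its details):

from collections import defaultdict

def word_tag_distribution(tagged_sentences):
    # Map each word to a frequency distribution of its POS tags.
    cfd = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:
        for word, tag in sentence:
            cfd[word][tag] += 1
    # Convert the inner defaultdicts to plain dicts, matching the output shown above.
    return {word: dict(tags) for word, tags in cfd.items()}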

Question 4

Using your CFD, write code to compute the level of tag ambiguity in the corpus. That is, on average, how many different tags does each word (type) have? Then, write code to compute the level of tag ambiguity for just the first 2000 sentences in the corpus, the first 1000, and the first 500. (Note that this is roughly 50%, 25%, 10% of sentences, as there are 3914 sentences.)

(This involves filling in the lines of hw4.py where tagAmbiguity, etc. are defined.)
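A minimal sketch of the computation, assuming the CFD format from Q3 (the helper function name here is illustrative, not part of hw4.py):

def average_tag_ambiguity(cfd):
    # Average number of distinct POS tags per word type.
    return sum(len(tag_dist) for tag_dist in cfd.values()) / len(cfd)

tagAmbiguity = average_tag_ambiguity(word_tag_distribution(tsents))
# Similarly for tsents[:2000], tsents[:1000], and tsents[:500].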

a). What is the level of tag ambiguity in the corpus? For the first 50%, 25%, 10% of the corpus?

b). You should observe a pattern in the tag ambiguity numbers: the amount of tag ambiguity should either steadily increase or decrease as the size of the corpus is increased. Explain in 1-3 sentences why you might expect this pattern. (You may also be able to think of reasons why you might expect the opposite pattern; if so, you could list these as well.)

Question 5

The homework file contains a simple POS tagger, called a unigram tagger, in the function unigram_tagger. This function takes three arguments:

  • The first argument is a conditional frequency distribution, which can be generated using the word_tag_distribution function you completed above.
  • The second argument is the most frequent POS tag in the corpus.
  • The third argument is a sentence that needs to be tagged.

The goal of this function is to tag the sentence using probabilities from the CFD and the most frequent POS tag. The function uses a helper function called ut1, which processes a single word. If the word has been seen (i.e., it is present in the CFD), ut1 assigns the most frequent tag for that word. For unseen words (not present in the CFD), ut1 assigns the overall most frequent POS tag, passed in as the second argument. (A sketch of this logic appears after the example below.)

For example, running the tagger using “NN” as the most frequent POS tag (after defining condFreqDist as above), on a test sentence:

unigram_tagger(condFreqDist, 'NN', 'This is a test')

should give this output:

[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('test', 'NN')]
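For reference, a minimal implementation consistent with the behavior described above might look like this (the actual code in hw4.py may differ in its details):

def ut1(cfd, default_tag, word):
    # Most frequent tag for a seen word; fall back to default_tag otherwise.
    if word in cfd:
        return max(cfd[word].items(), key=lambda pair: pair[1])[0]
    return default_tag

def unigram_tagger(cfd, default_tag, sentence):
    # Tag each whitespace-separated token independently.
    return [(word, ut1(cfd, default_tag, word)) for word in sentence.split()]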

a). Why is this called a “unigram tagger”? Your answer (1-2 sentences) should make reference to the concepts of “unigrams” and “unigram probability” discussed in class.

Run the tagger to tag the sentences “the bank has money” and “you can bank on it”. Look at the POS tags.
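For example (using condFreqDist and 'NN' as before):

unigram_tagger(condFreqDist, 'NN', 'the bank has money')
unigram_tagger(condFreqDist, 'NN', 'you can bank on it')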

b). What errors are there in the POS tags? What caused the errors? Explain why an HMM tagger (discussed in class) would probably not make the same error(s) (if trained on enough data). (Your answer can be up to a paragraph.)