Python Assignment Help: CS106 XML and Text Comparison

Assignment help for a basic Python exercise: implement XML parsing and text comparison.

XML

Requirement

For all the tasks you have to write comments in your code that briefly explain what is going on (i.e. in the .py file itself you have to write small comments). In addition, you must write a report in which you describe your program in more detail and explain your choice of solutions. The scope of the report is 4-5 pages (for groups of two students 6-8 pages and for groups of three students 8-10 pages). You may attach as many attachments as you like.

  • Take a screenshot of a run of each program (i.e. run the program and take a screenshot of the result that appears).
  • Attach the screenshots as attachments (even if the program only works partially).
  • All your programs (i.e. the .py files themselves) must be submitted as attachments.

Task 1 xml

a) Write a program that is called from the terminal window with one argument (an input file). The program must load the input file, which is in TEI XML (encoded as 'iso-8859-1'), and print to the terminal: the title of the file (the text in element sourceDesc / bib / title), the author (the text in element sourceDesc / bib / author), and the number of quote elements (elements with tag q). Try the program like this: python find_quoter.py 'fair_tei.xml'. Call your program find_quoter.py.

b) Write a new program that is called from the terminal window with two arguments (an input file and an output file). The program should load the same input file as before, but it must write the following to the output file: the title (the text of the element title), and the words in the texts of the p and q elements that appear at least 3 times in the corpus. Try the program like this: python find_freq.py 'fair_tei.xml' 'outq.txt'. Call your program find_freq.py.
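The two programs in Task 1 share most of their logic, so a single sketch can cover both. The version below is only one possible approach and rests on several assumptions: it uses the standard library's xml.etree.ElementTree, it matches tags by local name so that it works whether or not the file declares a TEI namespace, it counts text nested inside p or q exactly once, and it treats a "word" as a lowercased run of letters or digits. All function names and the sample document are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Return a tag name without any '{namespace}' prefix."""
    return tag.rsplit("}", 1)[-1] if isinstance(tag, str) else ""


def first_descendant(elem, name):
    """Return the first element at or below elem with the given local name."""
    if elem is None:
        return None
    for node in elem.iter():
        if local_name(node.tag) == name:
            return node
    return None


def text_of(elem):
    """Return all text inside elem, stripped, or '' if elem is missing."""
    return "".join(elem.itertext()).strip() if elem is not None else ""


def load_tei(path):
    """Parse a file from disk, forcing the encoding named in the task."""
    parser = ET.XMLParser(encoding="iso-8859-1")
    return ET.parse(path, parser=parser).getroot()


def summarise(root):
    """Task 1a: return (title, author, number of q elements)."""
    bib = first_descendant(first_descendant(root, "sourceDesc"), "bib")
    title = text_of(first_descendant(bib, "title"))
    author = text_of(first_descendant(bib, "author"))
    quotes = sum(1 for node in root.iter() if local_name(node.tag) == "q")
    return title, author, quotes


def collect_text(elem, inside=False, out=None):
    """Gather every piece of text that lies inside a p or q element."""
    if out is None:
        out = []
    inside = inside or local_name(elem.tag) in {"p", "q"}
    if inside and elem.text:
        out.append(elem.text)
    for child in elem:
        collect_text(child, inside, out)
        # A child's tail is text that belongs to the current element.
        if inside and child.tail:
            out.append(child.tail)
    return out


def frequent_words(root, minimum=3):
    """Task 1b: return (word, count) pairs with count >= minimum."""
    words = re.findall(r"\w+", " ".join(collect_text(root)).lower())
    counts = Counter(words)
    kept = [(word, n) for word, n in counts.items() if n >= minimum]
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))


# Invented sample document, only to show the expected shape of the input.
SAMPLE = (
    "<TEI><teiHeader><fileDesc><sourceDesc><bib>"
    "<title>Example Title</title><author>Example Author</author>"
    "</bib></sourceDesc></fileDesc></teiHeader>"
    "<text><body>"
    "<p>The cat saw the dog. <q>The dog ran.</q></p>"
    "<p>A cat, a dog, the end. <q>Yes.</q></p>"
    "</body></text></TEI>"
)

root = ET.fromstring(SAMPLE)
print(summarise(root))
print(frequent_words(root))
```

In find_quoter.py the path would come from sys.argv[1] and be passed to load_tei; find_freq.py would additionally open sys.argv[2] for writing and print the title and the frequent words to that file.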

Task 2 Frequency lists and text comparison with Python

Files you need for this program: vivaldi_positiv.txt and vivaldi_negativ.txt.

Imagine that you are a communications officer at Vivaldi and want to form an overview of Vivaldi's reviews. vivaldi_positiv and vivaldi_negativ contain a sample of user reviews from Trustpilot and Tripadvisor (the reviews are randomly selected, so they do not necessarily give a fair picture of users' opinions about Vivaldi).

Write a program in Python that looks at the words used in the texts in different ways. Remember that it is important to take uppercase / lowercase letters and punctuation into account when working with texts. Use vivaldi_positiv and vivaldi_negativ as examples. Call your program lex_analysis.py.

1) Your program should make a frequency list of the words in vivaldi_positiv and a frequency list of the words in vivaldi_negativ (1-word frequency lists). The two lists must be sorted with the most frequent word at the top. You will need dictionaries in your solution. Use a stop word list to filter out the most "contentless" words (create your own stop word list or find one online). Print the two frequency lists to text_out.txt.

2) Make frequency lists of word pairs (bigrams) and 3-word combinations (trigrams) for each text. The lists must be sorted with the most frequent bigrams / trigrams at the top and printed to text_out.txt.

3) Your program should print to the same file (text_out.txt) the words that ONLY occur in vivaldi_positiv (and thus not in vivaldi_negativ) and the words that ONLY occur in vivaldi_negativ (and thus not in vivaldi_positiv).

4) Describe the trends you see in text_out.txt, i.e. try to put into words how the different frequency lists can contribute to an overview of the content in large volumes of text.

5) Provide at least one suggestion (and preferably several) on how to improve / extend the program so that it could be even more widely used in an overall analysis of the content of Vivaldi’s reviews.

Remember to print appropriate headings in your text_out.txt so that it is easy to find your way around the file (for example, which frequency list belongs to which input file, etc.).
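The core of steps 1) to 3) can be sketched in a few small functions. This is a minimal illustration rather than a reference solution: the tokenizer (lowercase, keep runs of word characters), the tiny stop word set, the tie-breaking by alphabetical order, and the two sample strings are all assumptions made for the example. The n-grams here are built from the unfiltered tokens and may span sentence boundaries, which is a design choice the report should discuss.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list for the demonstration.
STOP_WORDS = {"the", "was", "and"}


def tokenize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def ranked(counter):
    """Sort (item, count) pairs by falling count, then alphabetically."""
    return sorted(counter.items(), key=lambda pair: (-pair[1], pair[0]))


def frequency_list(tokens, stop_words=frozenset()):
    """Step 1: one-word frequency list without stop words."""
    return ranked(Counter(t for t in tokens if t not in stop_words))


def ngram_frequencies(tokens, n):
    """Step 2: frequency list of n consecutive tokens (n=2 or n=3)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return ranked(Counter(grams))


def exclusive_words(tokens_a, tokens_b):
    """Step 3: words found only in the first text, and only in the second."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    return sorted(set_a - set_b), sorted(set_b - set_a)


# Invented sample texts standing in for the two review files.
POSITIVE = "Great service, great food. The food was great and the food was cheap."
NEGATIVE = "Slow service and the food was cold."

pos_tokens = tokenize(POSITIVE)
neg_tokens = tokenize(NEGATIVE)

print("FREQUENCY LIST (positive sample)")
for word, count in frequency_list(pos_tokens, STOP_WORDS):
    print(count, word)

print("BIGRAMS (positive sample)")
for gram, count in ngram_frequencies(pos_tokens, 2):
    print(count, " ".join(gram))

only_pos, only_neg = exclusive_words(pos_tokens, neg_tokens)
print("ONLY IN POSITIVE:", only_pos)
print("ONLY IN NEGATIVE:", only_neg)
```

In lex_analysis.py the sample strings would be replaced by the contents of the two review files, and the print calls would write to text_out.txt (for example via the file argument of print), each list preceded by a heading.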

Task 3 Frequency lists with Unix commands

In this task you must use the command line in the terminal window to achieve some of the same results that you obtained with the Python program in the task above. Here, too, you should consider upper / lower case letters and punctuation. Explain the commands in detail.

1) Use Unix commands to create a frequency list of the 25 most frequent words from either vivaldi_positiv or vivaldi_negativ (1-word frequency list). Write the frequency list to a new file that you call freq_vivaldi.txt. Use a stop word list to filter out the most "contentless" words (create your own stop word list or find one online).

2) Also make frequency lists of bigrams and trigrams for either vivaldi_negativ or vivaldi_positiv. Print these lists to freq_vivaldi.txt as well.

3) Carry out at least one further relevant analysis of the texts, of your own choice.

Submit screenshots showing your commands and freq_vivaldi.txt.

Task 4 Automatic Bibliography Generation

In this assignment, you must create a program that generates a bibliography automatically from a collection of books. The books are given to you in a csv file that you have to load in Python and print as a sorted bibliography.

Write a program that loads the file litteratur.csv and prints a sorted bibliography with the books from the file. Call your program literature.py.
Your bibliography should look as described below under “Formatting of bibliography”, where an explanation is also given on how the books can be found in the csv file (“Description of csv file”).
An example of how your program might work (for example, if you created a function called csv2lit):

>>> fil = "litteratur.csv"
>>> csv2lit(fil)
Litteraturliste
S., Leon. Linear algebra with applications. Pearson. 8. (2010)
A., Turing. Solvable and unsolvable problems. Penguin Books. 1. (1954)

Here the bibliography contains only the first two books from the file, but your program must print a bibliography with all the books.

Description of csv file

The books to work with are in the file litteratur.csv. A CSV file is a text file that is formatted in a very specific way: each line in the file corresponds to (in our case) one book, and for each line / book several pieces of information are given. The pieces of information are separated by a semicolon, “;”. Thus, the first lines from litteratur.csv

First Name; Last Name; Title; Publisher; Edition; Year
Alan; Turing; Solvable and unsolvable problems; Penguin Books; 1; 1954
Steven; Leon; Linear algebra with applications; Pearson; 8; 2010

are understood as

First name  Last name  Title                             Publisher      Edition  Year
Alan        Turing     Solvable and unsolvable problems  Penguin Books  1        1954
Steven      Leon       Linear algebra with applications  Pearson        8        2010

where the first line contains headings for the different information.
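Reading such a file is straightforward with Python's csv module and its delimiter argument. The sketch below is one way to do it, under stated assumptions: the field names in FIELDS are made up for this example (the heading line of the file is simply skipped), surrounding spaces are stripped from every value, and the sample text imitates the lines shown above instead of reading the real file.

```python
import csv
import io

# Hypothetical field names chosen for this sketch, in column order.
FIELDS = ("first", "last", "title", "publisher", "edition", "year")


def read_books(handle):
    """Return one dict per book from an open semicolon-separated file."""
    reader = csv.reader(handle, delimiter=";")
    next(reader, None)  # skip the heading line
    books = []
    for row in reader:
        if not row:  # ignore blank lines
            continue
        books.append(dict(zip(FIELDS, (cell.strip() for cell in row))))
    return books


# Sample text imitating the first lines of litteratur.csv.
SAMPLE = (
    "First Name; Last Name; Title; Publisher; Edition; Year\n"
    "Alan; Turing; Solvable and unsolvable problems; Penguin Books; 1; 1954\n"
    "Steven; Leon; Linear algebra with applications; Pearson; 8; 2010\n"
)

for book in read_books(io.StringIO(SAMPLE)):
    print(book["last"], book["year"])
```

With the real file, the handle would come from open("litteratur.csv", newline="") inside a with statement; the file's text encoding is not stated in the task and may need to be passed explicitly.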

Formatting of bibliography

The books in your bibliography must be printed in the following format:

F., Surname. Title. Publisher. Edition. (Year)

where the first name is abbreviated to its initial followed by a period, the year is written in parentheses, the first name and surname are separated by a comma, and the different values (name, title, publisher, edition and year) are separated by periods. In order to obtain full marks for the assignment, the bibliography must be sorted by the author's last name.
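The formatting and sorting rules can be illustrated with a short sketch. It is an assumption-laden example, not the required design: the Book tuple and the function names are invented here, the two books are typed in directly instead of being read from the file, and sorting ignores upper / lower case in the last name.

```python
from collections import namedtuple

# Hypothetical record type for one book; field names are made up.
Book = namedtuple("Book", "first last title publisher edition year")


def format_entry(book):
    """Return one line in the form 'F., Surname. Title. Publisher. Edition. (Year)'."""
    return "{}., {}. {}. {}. {}. ({})".format(
        book.first[0], book.last, book.title,
        book.publisher, book.edition, book.year,
    )


def bibliography(books):
    """Return the formatted lines, sorted by the author's last name."""
    ordered = sorted(books, key=lambda book: book.last.casefold())
    return [format_entry(book) for book in ordered]


# The two books shown in the csv excerpt, entered by hand for the demo.
BOOKS = [
    Book("Alan", "Turing", "Solvable and unsolvable problems", "Penguin Books", "1", "1954"),
    Book("Steven", "Leon", "Linear algebra with applications", "Pearson", "8", "2010"),
]

print("Litteraturliste")
for line in bibliography(BOOKS):
    print(line)
```

A csv2lit function as in the example session could combine this with a csv reader: load the rows, build one record per row, and print the heading followed by the sorted lines.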