CS112 Homework: Analytical Problems

This Java data structures homework has two parts: analytical problems in the first half and programming problems in the second half.

Introduction

This homework has two parts: in Part A you will write up answers to analytical problems related to the lectures in the past week, both to confirm and extend your understanding of the abstract principles discussed; in Part B you will write code to implement this understanding and to practice your Java coding skills. I suggest you read this whole assignment carefully; for Part B, it is definitely worth thinking about your solution for a bit before launching Dr. Java and beginning to type. In addition to the requirements for the Java problems, you must observe the following requirements (for all homework submitted in this course):

  • All programs should be literate, i.e., easily understandable by a human (say, your grader) and follow the Java Style Guidelines for CS112 posted on the class web site;
  • All files for this homework should be submitted using WebSubmit, following the instructions on the class web site;
  • You may not use data structure libraries such as ArrayList, since we are learning to write Java from the ground up and you must learn how these libraries are built; however, you are free to use (unless specifically directed otherwise) the basic libraries String, Character, Scanner, and Math;
  • You may freely use code from the class web site, the textbook, or lecture (unless specifically directed otherwise) as long as you cite the source in your comments; but you may NEVER use code from the web or other students’ programs—this will be considered plagiarism and penalized accordingly.

Part A: Analytical Problems

In these questions we will first explore the binary heap data structure, and then consider the general problem of hashing via the two basic paradigms, using a simple hash function. In the first, involving separate chaining, we will make this realistic by storing (key, value) pairs; the hash function is applied to the key alone, but the entire pair is stored in the node. For the second, involving linear probing, we will simply store the key in the array slot.
Write your solutions in a file hw11.txt, hw11.pdf, hw11.rtf, etc. and submit by the due date and time.

  1. Insert the following sequence of keys into an initially empty 2-3 tree. You should write down all the trees as you do the transformations (this is what would be expected on an exam), but for the purposes of grading this exercise you can just draw the final tree that results.

  2. Consider the following 2-3 tree:
    Suppose we count ONLY comparisons between two keys, and NOT checks to see if the key exists, or checks to see if a pointer is null. (i) How many comparisons would be necessary to find the key 10? (ii) How about 140? (iii) How about 60? (iv) Which key(s) occurring at leaf nodes would require a minimum number of comparisons? State which and how many comparisons. (v) Which key(s) would require a maximum number of comparisons? State which and how many comparisons. (vi) What is the average number of comparisons to find the keys in this tree (count the comparisons for all keys and then divide by the size of the tree)?

  3. Insert the following keys into a binary heap, showing the structure of the tree (draw them sideways as in the homeworks, or some other way) and the contents of the array A at the indicated points. At the end, give the values of the indicated variables.

  4. Let us consider using a hash table for a symbol table storing (key, value) pairs, where the key is used as input to the hash function. Insert the following key-value pairs into a hash table implemented with the technique of separate chaining with an array H of size 7. For each insertion, use the hash function hash(x) = (3 * x % 7) on the key part and store the pair in the bucket at H[hash(key)]. Assume nodes are defined as class Node { int key; String value; Node next; }.

    • A. Show the hash table after all insertions are performed. What is the worst-case lookup time (in the number of comparisons), and what is the average-case lookup (average the number of comparisons over all keys in the table)?
    • B. Perform the following operations on the table from (a) and show the table that would result.
    • C. What is the worst-case lookup and what is the average-case lookup for keys in the table that resulted from part (b)?
    • D. Give a sequence of keys (just the keys, not pairs) that would all hash to the same location for a table of size 7, creating a worst-case hash table (i.e., start with an empty table, and create a linked list worst-case table).
    • E. If we insert M keys into a hash table with N buckets, creating a worst-case table, what is the worst case time for looking up a key? Express in terms of N and M, using Theta(…) notation.
    • F. If we insert M keys into a hash table with N buckets, suppose the hash table arranges itself in the best possible way, what is the worst-case lookup? Express in terms of N and M, using Theta(…) notation.
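As a quick sanity check for question 4, the bucket computation can be tried in a few lines of Java. The keys below are illustrative, not the assignment's key list; note that keys 7 apart make (3 * x) congruent mod 7 and therefore collide, which is exactly the kind of sequence part D asks for.

```java
// Computing separate-chaining buckets with hash(x) = (3 * x) % 7.
public class ChainingDemo {
    static int hash(int x) {
        return (3 * x) % 7;   // table size is 7
    }
    public static void main(String[] args) {
        int[] keys = {5, 12, 19};   // 15 % 7 = 1, 36 % 7 = 1, 57 % 7 = 1
        for (int k : keys)
            System.out.println("key " + k + " -> bucket " + hash(k));
    }
}
```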
  5. Suppose now we have a hash table L using the strategy of linear probing, which just stores integer keys (not key-value pairs). The size of the array is N = 11 and the hash function is hash(k) = 3*k%11. Consider the following series of insertions.

    • A. Show L after all insertions were performed.
    • B. What is the worst-case time to search for one of the four keys 4, 2, 1, or 28?
    • C. Perform the following operations on the table from (a) and show the table that would result.
    • D. What is the worst-case time to search for one of the keys 4, 2, 1, 28, 23, 63, 19, or 13 in the table from part (c)?
    • E. What is the average-case time in this table (total the number of comparisons used for searching for each key, and divide by the number of keys, i.e., 8)?
    • F. Perform the following operations on the table from part (c) and show the table that would result.
    • G. Describe the worst case of a hash table using linear probing. Use Theta(…) notation to characterize the number of comparisons in this case when searching the hash table for a given item which is present in the table.
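Similarly for question 5, a minimal linear-probing sketch can be used to check your hand simulation. The insertion sequence here is only the four keys named in part B, not the assignment's full insertion series, so treat the resulting table as illustrative.

```java
// Linear probing with hash(k) = (3 * k) % 11 and a table of size N = 11.
public class ProbingDemo {
    static final int N = 11;
    static Integer[] L = new Integer[N];   // null marks an empty slot

    static int hash(int k) { return (3 * k) % N; }

    static void insert(int k) {
        int i = hash(k);
        while (L[i] != null)       // probe forward, wrapping around
            i = (i + 1) % N;
        L[i] = k;
    }

    public static void main(String[] args) {
        // key 4 lands in slot 1, key 2 in slot 6, key 1 in slot 3, key 28 in slot 7
        for (int k : new int[]{4, 2, 1, 28})
            insert(k);
        System.out.println(java.util.Arrays.toString(L));
    }
}
```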

Part B: Programming Problems

Problem B.1 (Lab 11)

Submit the file IntStack.java from Lab 11 as problem B.1.

Problem B.2 (Lab 12)

Submit your text file Lab12.txt as part of your HW 11 submission.
In the remaining problems, you will develop a realistic program which implements a text-retrieval system similar to Google.
You will extend a basic program called Minipedia that maintains a database of articles on various subjects. We supply a folder of over 2200 articles culled from Wikipedia (first paragraphs only), two test articles Test1 and Test2, and also the code to access this collection. The basic user interaction is provided, as well as a bare-bones implementation of the database as a dumb unordered list. You must extend the functionality in several interesting ways, and also provide a more intelligent data structure using a hash table.

Problem B.3 Article Table

For this problem, you will develop a hash-table based version of the DumbList.java code.

Step One

Download the Code Distribution (ZIP) and unzip it in the folder where you will create your solution. It has the following files (you should NOT modify the files marked with an asterisk (*) in any way):

  • articles.zip* – A zipped archive of a folder called “articles” which contains 2511 Wikipedia articles in text form;
  • Article.java* – A class to hold an article, including the title and the body of the article;
  • DatabaseIterator.java* – A class which reads the text files in the articles folder and returns one article after another when next() is called;
  • DumbList.java – An unordered array of articles, which you can add, delete, and lookup.
  • Minipedia.java – The client code which provides a user interface and interacts with the other classes to provide basic functionality for adding, deleting, and looking up articles. It does NOT do search in the body of the articles, and this is the primary functionality that you will be adding.

First, you should unzip the articles folder and browse around and get a sense for what an article looks like. Then say “wow, that’s a lot of articles” (this step is optional). Then compile and run the Minipedia.java client, and experiment with the interface, looking up articles (look at the folder to see titles—you will need to get the title exactly right to look it up). Finally, look through the code and understand what the program does.

Note on running Minipedia with the articles folder:

If you get an error either compiling or running Minipedia.java in OS X, e.g.,
The problem is that OS X puts a hidden file in the articles folder when you create it. You must remove this file. Please open up the Terminal and do the following: (1) Type in ‘cd ‘ (just two letters and a space) and then drag the articles folder into the Terminal app. You should see the complete name of the article folder following the ‘cd ‘. Then (only after dragging the folder in) hit enter. Now you are in the article folder. (2) Then type ‘rm .DS_Store’ and hit enter. This will delete the file. (3) Close the terminal window.
After this, Minipedia.java should compile.

Step Two

Next you must rewrite the DumbList.java program as a program ArticleTable.java which will store articles in a hash table using the technique of separate chaining, instead of using a “dumb list.” It should have the same functionality (as specified below). You should initialize the table by simply inserting each article from the list returned by getArticleList(db) in line 65.
Your hash table should be implemented as described in lecture 11/29 and in my notes on hash tables. Please use a table size which is a prime number around 2500 (Google “prime numbers” to see a list). You should use the article titles as the keys to insert into the table.
You may use any “good” hash function for String; e.g., the two found at the link on the class web site work well.
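As one common example (not necessarily one of the two linked functions), the polynomial hash used by Java's own String.hashCode works well when reduced modulo the table size:

```java
// A standard polynomial string hash, reduced into the range [0, tableSize).
public class StringHashDemo {
    static int hash(String key, int tableSize) {
        int h = 0;
        for (int i = 0; i < key.length(); i++)
            h = 31 * h + key.charAt(i);    // h = s[0]*31^(n-1) + ... + s[n-1]
        return Math.abs(h % tableSize);    // fold the (possibly negative) int into range
    }
    public static void main(String[] args) {
        System.out.println(hash("Alan Turing", 2503));   // 2503 is a prime near 2500
    }
}
```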
Your public methods should include the following, for inserting, deleting, and retrieving an article based on its exact title:
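A plausible outline of that interface is sketched below. The method names and the stub Article class are assumptions made only so the sketch is self-contained; the assignment's own signatures take precedence.

```java
// Hypothetical outline of ArticleTable using separate chaining.
class Article {
    private String title;
    Article(String title) { this.title = title; }
    String getTitle() { return title; }
}

public class ArticleTable {
    private static final int SIZE = 2503;      // a prime near 2500
    private Node[] buckets = new Node[SIZE];

    private static class Node {
        Article data; Node next;
        Node(Article a, Node n) { data = a; next = n; }
    }

    private int hash(String key) {             // any good String hash works here
        return Math.abs(key.hashCode() % SIZE);
    }

    public void insert(Article a) {            // add at the head of the chain
        int i = hash(a.getTitle());
        buckets[i] = new Node(a, buckets[i]);
    }

    public Article lookup(String title) {      // exact-title retrieval
        for (Node p = buckets[hash(title)]; p != null; p = p.next)
            if (p.data.getTitle().equals(title))
                return p.data;
        return null;                           // not found
    }

    public void delete(String title) {         // unlink the first match, if any
        int i = hash(title);
        Node dummy = new Node(null, buckets[i]);
        for (Node p = dummy; p.next != null; p = p.next)
            if (p.next.data.getTitle().equals(title)) {
                p.next = p.next.next;
                break;
            }
        buckets[i] = dummy.next;
    }
}
```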
Hash Table Iterator

In addition, you must write your ArticleTable class to be Iterable, as in Lab 09.
The iterator will need to go through each bucket and each node in each bucket. Hint: use two cursors, one an index into the array and one a pointer to a Node in a bucket; initialize the iterator by finding the first non-empty bucket and pointing to the first Node in that bucket; when you advance the cursors, you will need to figure out how to advance to the next Node, even if it is in a different bucket.
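The two-cursor pattern can be sketched on a tiny String-keyed table so the traversal logic stands alone; the same structure applies to ArticleTable's buckets of Article nodes.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// A minimal separate-chaining table whose iterator walks bucket by bucket.
public class BucketIterDemo implements Iterable<String> {
    static class Node {
        String data; Node next;
        Node(String d, Node n) { data = d; next = n; }
    }
    Node[] buckets;
    BucketIterDemo(int size) { buckets = new Node[size]; }

    public Iterator<String> iterator() {
        return new Iterator<String>() {
            int b = 0;       // cursor 1: index into the bucket array
            Node p = null;   // cursor 2: pointer to a Node in a bucket
            { findNextBucket(); }   // start at the first non-empty bucket
            void findNextBucket() {
                p = null;
                while (b < buckets.length && buckets[b] == null)
                    b++;
                if (b < buckets.length)
                    p = buckets[b];
            }
            public boolean hasNext() { return p != null; }
            public String next() {
                if (p == null) throw new NoSuchElementException();
                String d = p.data;
                p = p.next;                               // advance within the chain
                if (p == null) { b++; findNextBucket(); } // hop to the next bucket
                return d;
            }
        };
    }
}
```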
You, of course, will write a hash table which has Article values and String keys (= titles). Write your class in a file ArticleTable.java and provide a unit test that verifies that it works properly and can add, delete, and lookup articles, as well as enumerate all articles in the table using the iterator. You MUST test all the features of your code by writing your own unit test, which will be part of the grade. I strongly suggest that you modify the Minipedia code to use this new implementation and test that it works (this is not a requirement).

Problem B.4 Cosine Similarity Calculator

In this problem you must write a program file TermFrequencyTable.java which will store the words from two Strings (i.e., documents) in such a way that you can easily calculate the “cosine” similarity of the two.

Calculating the Cosine Similarity

To calculate the similarity of the two documents, you must take the union of the two word lists A and B (this is the total vocabulary of the two documents, minus any “blacklisted words”—very common words are often not used to calculate similarity measures) and then calculate the new term frequency vector of each document with respect to this total list of words. For example, if document A is “the man with the hat ran up to the man with the dog” and document B is “a man with a hat approached a dog and a man”, and the black list is {“the”, “a”, “to”, “with”, “up”}, then we have the following word lists which record how many times each word occurred in each document, and then compile a list of the total vocabulary (minus black-listed words):

Now we must calculate the term frequency vector for each document, indicating how many times each word in the total list occurs in each of the two documents (of course, some entries will be 0):
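Working this out by hand from the example documents above (the original table is not reproduced in this text), the total vocabulary and the two term frequency vectors would be:

```
total vocabulary (blacklist removed):  man  hat  ran  dog  approached  and
document A frequencies:                 2    1    1    1       0        0
document B frequencies:                 2    1    0    1       1        1
```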

Finally, we calculate the cosine similarity using the formula:
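The formula (the original image is not reproduced here) is the standard cosine similarity of the two term frequency vectors:

```latex
\mathrm{cs}(A, B) \;=\; \frac{\sum_{i=1}^{n} A_i B_i}
{\sqrt{\sum_{i=1}^{n} A_i^2}\;\sqrt{\sum_{i=1}^{n} B_i^2}}
```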
(Note that the indices here start at 1 and end at n, instead of starting at 0 and ending at n-1; this is just a mathematician’s way of writing sequences; you should always think in terms of 0 to n-1.) More explanation of the cosine similarity may be found here.

This means the documents are very similar, but not identical (with respect to non-black-listed words). The cosine similarity is literally the cosine of the angle between two vectors in N-dimensional space (you’ll learn about this in Linear Algebra), so it is always between 0 and 1.

You should write this class using a separate-chaining table with a table size which is a relatively small prime around 100 (the articles are relatively short and you need to store only the vocabulary of two articles). Your buckets should be composed of Nodes using something similar to the following.
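The assignment's own Node snippet is not reproduced in this text; a plausible version, based on the termFreq[] field described below, is:

```java
// Each node stores one term plus its occurrence count in each of the
// two documents, indexed 0 and 1.
public class Node {
    String term;
    int[] termFreq = new int[2];   // termFreq[0] = count in doc 0, termFreq[1] = count in doc 1
    Node next;
    Node(String term) { this.term = term; }
}
```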

The basic idea here is to put all the terms of the two strings (documents) into the table and count the number of occurrences of each term in each document. The total vocabulary is just the set of all terms stored in the hash table. At the end, you can compute the cosine similarity from the table's contents: going through the entire table yields the list of all (non-duplicate) terms, and the counts stored in each node's termFreq[] array give the two term frequency vectors.

In the formula above, the sequence Ai is the sequence of integers found in p.termFreq[0] for all nodes p, and the sequence Bi is the sequence found in p.termFreq[1]. You need to calculate the product of these integers (for the numerator) and the squares of each (for the denominator). You must therefore use a method for iterating through all nodes in the table; you do not need to provide an interface to do this, but you will need to iterate through all the entries (two for loops should do it!).
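The "two for loops" computation can be sketched as below: one pass over the buckets accumulates the dot product and the two squared norms in a single traversal. The Node here follows the shape described above, with a convenience constructor added for the demo.

```java
// Cosine similarity computed directly from a separate-chaining table.
public class CosineDemo {
    static class Node {
        String term; int[] termFreq = new int[2]; Node next;
        Node(String t, int a, int b, Node n) {
            term = t; termFreq[0] = a; termFreq[1] = b; next = n;
        }
    }
    static double cosineSimilarity(Node[] buckets) {
        double dot = 0, sumA2 = 0, sumB2 = 0;
        for (int i = 0; i < buckets.length; i++)                // every bucket
            for (Node p = buckets[i]; p != null; p = p.next) {  // every node in it
                int a = p.termFreq[0], b = p.termFreq[1];
                dot   += a * b;        // numerator: sum of A_i * B_i
                sumA2 += a * a;        // denominator: sum of A_i^2 ...
                sumB2 += b * b;        // ... and sum of B_i^2
            }
        if (sumA2 == 0 || sumB2 == 0) return 0.0;   // guard against an empty document
        return dot / (Math.sqrt(sumA2) * Math.sqrt(sumB2));
    }
    public static void main(String[] args) {
        // "CS112 HW11" vs. "CS112 HW11 HW11": CS112 -> (1, 1), HW11 -> (1, 2)
        Node[] buckets = new Node[101];
        buckets[0] = new Node("CS112", 1, 1, null);
        buckets[1] = new Node("HW11", 1, 2, null);
        System.out.println(cosineSimilarity(buckets));   // about 0.9487
    }
}
```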

Write your class in a file TermFrequencyTable.java and provide a unit test that verifies that it works properly. Test three examples, one with the exact same terms in each document (e.g., “A B” and “A A B B”, the c.s. should be 1.0), one where there is no common term between the two documents (“A B” and “C D”, the c.s. should be 0.0), and also an example which produces a c.s. between 0 and 1, perhaps “CS112 HW11” and “CS112 HW11 HW11” which should be 0.9487.

Problem B.5: ArticleHeap

You must write a file ArticleHeap.java which implements a priority queue for Articles, ordered by cosine similarity (a double field in the Article class, which also has a method compareCS(…); take a look at that class). The method getMax() should return the article in the heap with the largest cosineSimilarity field. You should not need to make any changes to the Article class to get this to work.
You should be able to write this easily using the code for the MaxHeap from the web site.
The method getMax() in your class should throw a HeapUnderflowException if you call the method when the heap is empty. You MUST create a Unit Test which verifies that your heap works properly and throws this exception in the case of heap underflow. You will need to call getMax() inside a try-catch block.
Put the exception definition at the end of your file, e.g.,
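The original example is not reproduced in this text; a minimal definition of the kind described would be the following (the assignment may prescribe a particular name, message, or constructor, so treat this as a sketch):

```java
// A checked exception thrown by getMax() when the heap is empty.
public class HeapUnderflowException extends Exception {
    public HeapUnderflowException(String message) {
        super(message);
    }
}
```

A call site would then look like `try { heap.getMax(); } catch (HeapUnderflowException e) { ... }`.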
Hand in your code as ArticleHeap.java with an appropriate Unit Test.

Problem B.6: MiniGoogle

For this problem you must use Minipedia.java from the code distribution (which currently uses the DumbList.java data structure to search for titles and to insert and delete articles) as a template to write a program MiniGoogle.java, which will allow searching the text of an article using keywords, via cosine similarity (see the previous problem).
The steps in this process are as follows:
To find articles most similar to a search phrase P:

  1. Use the ArticleTable iterator to enumerate all the articles in the ArticleTable;
  2. For each article, compare the body of the article with the search phrase P (as if they are two documents), by first preprocessing the strings (described below), then using the TermFrequencyTable class to calculate the Cosine Similarity (calling it once for each article to compare with the search phrase);
  3. Insert any articles with a cosine similarity greater than 0.001 into the ArticleHeap;
  4. Call getMax() for this heap to print out the top three matching articles; if no articles are found, print out “No matching articles found!” but if only 1 or 2 are found, simply print them out.

Note that you will construct a TermFrequencyTable for each article that you compare with the search phrase.
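The four steps above can be sketched compactly. The helper types here are simplified stand-ins (a plain array plays the role of the ArticleTable iterator, a stable sort plays the ArticleHeap, and the similarity function is a fake substring test); in the real program you would use your own ArticleTable, TermFrequencyTable, and ArticleHeap classes.

```java
// A simplified phraseSearch following the four steps described above.
public class PhraseSearchSketch {
    static class Article {
        String title; double cs;
        Article(String t) { title = t; }
    }
    // stand-in for the TermFrequencyTable result: 1.0 on a substring match
    static double cosineSimilarity(Article a, String phrase) {
        return a.title.contains(phrase) ? 1.0 : 0.0;
    }
    static String[] phraseSearch(Article[] table, String phrase) {
        Article[] matches = new Article[table.length];
        int size = 0;
        for (Article a : table) {                    // step 1: enumerate the articles
            a.cs = cosineSimilarity(a, phrase);      // step 2: compare with phrase P
            if (a.cs > 0.001)                        // step 3: keep close matches
                matches[size++] = a;
        }
        if (size == 0) {
            System.out.println("No matching articles found!");
            return new String[0];
        }
        // step 4: report up to three best matches (a real heap would call getMax())
        java.util.Arrays.sort(matches, 0, size,
                (x, y) -> Double.compare(y.cs, x.cs));
        String[] top = new String[Math.min(3, size)];
        for (int i = 0; i < top.length; i++)
            top[i] = matches[i].title;
        return top;
    }
}
```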

Grading

We will be grading this using a grading client that builds an ArticleTable as in the main method of your MiniGoogle, and then calls phraseSearch(…) for various test cases.
You should develop your code as an interactive program which uses Minipedia as a basis to create the program MiniGoogle, and make sure that it performs as you expect.