Java代写:CS112 Hash Functions and Separate Chaining

代写数据结构中Hash的几种不同实现算法,并比较算法优劣。

Requirement

In this lab we will investigate how a hash function affects the structure of a separate chaining hash table. You will use a Java program that produces an GUI in which you can use various hash functions with different parameters, input text to be hashed, and watch the hash table grow, and finally read off various statistics about the table.

Exercise Zero

Download the code distribution here (ZIP) and get it to run in a folder of your chosing. Please go to the following web site to get a text version of Dicken’s A Tale of Two Cities: http://www.gutenberg.org/cache/epub/98/pg98.txt

Now you should run the StringHash.java program, and cut and paste a couple of thousand words from the novel into the Input Box in String Hash. Push Hash and watch what happens. You will see a simulation of the hash function growing in Hash Table, a histogram of the bucket lengths, and below these a printout of some statistics.

There is nothing else to do for this problem except to make sure you know what the program does and how to run it.

Exercise One

Now you will try a couple of the hash functions we have written into the program, they are as follows, where M is the size of the table (number of buckets); all are functions from strings to integers in the range [0 .. M-1]:

  • Silly: Take the ASCII value of the last character and then do % M;
  • Naive Add: Add all the ASCII values of the characters and then % M;
  • Naive Mult: Multiply all the ASCII values of the characters and then % M;
  • Add Lin Cong: Add all the ASCII values to get sum, then return (LCMultiplier*sum) % M
  • Sequential LC: Define an array of prime numbers, P[…]; multiply each of the ASCII values by a different prime, sum and then return (LCMultiplier*sum) % M
  • Folding, Java HashCode, Seq Java HashCode: These are all “industrial strength” hash functions for string which may be found on the web; all are supposed to be very high performance, and spread the keys out as randomly as possible.

Roughly, the order of functions here is from worst to best.

What to do for this problem: Cut and paste a couple of thousand words from A Tale of Two Cities into the input box, and try each of the hash functions, without changing any of the parameters (Table Size or LC Multiplier). Be sure to examine the shape of the hash table in the Hash Table box, and the histogram, and then look at the statistics. Focus on the the mean lookup cost – this is the average number of comparisons to find each key in the table (same as we have been thinking about all semester!); optimal lookup cost – this is the average number of comparisons to find a key, in an optimal table where all buckets were the same size (+/- 1); standard deviation – this is the spread of the bucket lengths around the mean – closer to 0 means the table is closer to optimal; histogram – just a picture of the distribution of the bucket lengths – optimal is very few buckets lengths, centered around the mean.

The most important thing to look at is how close the mean lookup cost comes to the optimal lookup cost, and how small the standard deviation is. The table is most balanced when the mean lookup is closest to the optimal lookup, when the standard deviation is smallest, and the histogram shows this graphicallly by clustering the bucket lengths more closely around the mean.

Now answer the following question:

  • a. Which hash functions performed better?
  • b. Did you need the “industrial-strength” hash functions near the end of the list?
  • c. Which was the simplest (i.e., earliest in the list) hash function to do reasonably well?

Exercise Two

Now set the hash function to Add Lin Cong, a method we discussed in lecture. Here is the code:

1
2
3
4
5
6
7
8
// do naiveAdd, then linear congruential
// M and LCM... will be extracted from the appropriate text boxes
static int AddLinCong(String s, int M, int LCMultiplier) {
int sum = 0;
for (int i = 0; i < s.length(); i++)
sum += (s.charAt(i) * LCMultiplier) % M;
return sum % M;
}

Now we want you to play with the table size M and the LCMultiplier and see how sensitive the input data is to these choices.

  • (i) Try M = 144 and LCMultiplier = 2.
  • (ii) Now try M = 144 and LCMultiplier = 12.
  • (iii) Finally, try two prime numbers from the list at the bottom of this lab.

For each of these, answer the following questions:

  • a. What do you observe?
  • b. Why did this happen?

Exercise Three

Now we are going to try to “break” these hash functions. Set M = 11, LCMultiplier = 3, and clear the Input Box.

  • (i) First, find a sequence of words which perform badly, i.e., create an imbalanced table (Hint: try various permutations of the same letters, e.g., “wayne” and “enyaw”). Think about what statistics would indicate that this is a worst case…..
  • (ii) Now try this same sequence of words with a different M and LCMultiplier from the list of primes, but still use Add Lin Cong.
  • (iii) Now try this same sequence of words with the other, “industrial strength” hash functions.

For each of these, answer the following questions:

  • a. What do you observe?
  • b. Why did this happen? (Just answer what? for part C.)

What to Submit

Please submit your answers to the italicized questions as a file Lab12.txt as part of your submission for HW 11.