Transcribed Text
Testing hash functions
In this assignment you are going to test various hash functions to see how good they are, in terms of how many collisions they have. Your input will be strings,
in fact, all the strings that are defined in the system's dictionary. (If you try to download it your browser may say that its a binary file, but it is in fact a text
file, with one word per line.) Note that if you try to open the system copy of the dictionary in your code on the server, you will have to open it for reading only,
otherwise opening the file will fail. This file contains just under 100,000 English words, which we're going to use to test the uniformity of various hash
functions.
Your hash functions will hash strings into 16-bit (not 32-bit) ints. This is important, because we' 're going to keep a table of the number of collisions for each
hash value. With 16-bit ints there are only 65,536 possible hash values, so this table will easily fit in memory If we used 32-bit intsten there would
be
4,294,967,296 possble hashes, a more troublesome amount. For C++, you can get a 16-bit unsigned int type by doing
#includescstdint>
uint16_t x; // x has exactly 16 bits, and is unsigned
For each of the possible hash functions your program should:
Create a vector hashes of size 65,536
Process the list of words, and for each word, compute its hash h
Increment the entry in the table for that hash: hashes at(h)++
When finished, use Pearson! X2 test to determine the probability that the resulting distribution is uniformly distributed. (See below for the deets.)
Hash functions to test
The hash (non)functions you should test are:
String length (modulo 216)
First character
Additive checksum (add all characters together), modulo 216
Remainder (use a modulo of 65413, this is the first prime that is smaller than the table size). Remember that you cannot just add up all the characters and
then take the mod of the result; you have to thread the modulo through the loop that computes the sum.
Multiplicative (using the scheme described in class/in the lecture notes). Again, remember that you can't just use the final sum; you have to incorporate
the multiplicative calculation into hashing loop.
Also remember that in C++ char may be signed or unsigned; on your computers and on the server it is signed, so character values actually range from -128 to
+127. Normally this doesn't matter, as all of the standard characters are in the range 0-127. However, a few of the words in the dictionary use extended foreign
characters in the range 128-255, which will map to negative values if char is signed, and negative values will throw off the hash functions. A hacky way to
compensate for this is to just assumethat char is signed and pass the character value c through the transformation
c < 0 ? int (255 + c) : int(c)
before using it. (This will leave standard character codes alone, but will map extended characters to positive values above 127.)
The correct (cross-platform compatible) way isto get the unsigned version of char (which will be either unsigned char or just char itself) and then convert the
value to that, using something like this:
#include stype_traits>
unsigned char char_to_unsigned(char c) {
return static_cast::type>(c)
}
(l'd also suggest making your "hash testing" function take a function pointer, so that you can just write all the above hash functions and then test them one by
one.)
Pearson's
test
The test is a statistical method for determining the probability that a given set of observed results (in this case, the number of hashes per "bin") could have
been generat ed from a given distribution. In our case, we want our distribution to be uniform, meaning that every bin should have
Numberofwords
expected =
65, 536
items in it (this will be a fractional value). To perform the test, you will take the hashes vector and from it compute the value x2:
(expected hashes[i])2
c2
c2= C
expected
We also need to know the "degrees of freedom", which is just the number of bins, minus 1: 65,535.
We'll use the Boost.Math library for the statistical functions:
#include
Create a X2 distribution, with the desired degrees of freedom:
boost: math: chi_squared c2d(65535.8);
and then use the cumulative distribution function to get the probability:
float P - boost :math: cd f c2d, c2);
p tells us the probability that c2 is not derived from a uniform distribution, so the smaller this is, the better! (You can also negate it and convert to a percentage,
for a kind of "good distribution" measure: (1 - p) 100.0.)
Implementation
This assignment doesn't have a test runner. Just rull each of the above hash functions through the full dictionary and then report on the p value you get for each
You should get p = 1 for the really bad hash functions (indicating that they are very far from uniform) but
for at least the remainder method. (I got P
-
p
1
0.251 for remainder hashing; multiplicative should give you a but it's tricky to get right, so don't worry about it too much if you getp = 1.)
p<1
These solutions may offer step-by-step problem-solving explanations or good writing examples that include modern styles of formatting and construction
of bibliographies out of text citations and references. Students may use these solutions for personal skill-building and practice.
Unethical use is strictly forbidden.
int main(int argc, char** argv) {
vector<string> strs;
read_file(strs, "words.txt");
if (strs.empty()) {
cout << "Cannot read file\n";
return 0;
}
cout << "String length (modulo 2^18)\n";
hash_testing(hash_string_length, strs);
cout << "First character\n";
hash_testing(hash_first_character, strs);
cout << "Additive checksum (add all characters together), modulo 2^16\n";
hash_testing(hash_additive_checksum, strs);
cout << "Remainder (use a modulo of 65413, this is the first prime "
<< "that is smaller than the table size)\n";
hash_testing(hash_remainder, strs);
cout << "Multiplicative\n";
hash_testing(hash_multiplicative, strs);
return 0;
}
void read_file(vector<string> & stts, string path) {
fstream in;
in.open(path.c_str(), ios::in);
stts.clear();
if (in.is_open() == false) {
return;
}
string word;
while (in >> word) {
stts.push_back(word);
}
in.close();
}...