QuestionQuestion

Transcribed TextTranscribed Text

Testing hash functions In this assignment you are going to test various hash functions to see how good they are, in terms of how many collisions they have. Your input will be strings, in fact, all the strings that are defined in the system's dictionary. (If you try to download it your browser may say that its a binary file, but it is in fact a text file, with one word per line.) Note that if you try to open the system copy of the dictionary in your code on the server, you will have to open it for reading only, otherwise opening the file will fail. This file contains just under 100,000 English words, which we're going to use to test the uniformity of various hash functions. Your hash functions will hash strings into 16-bit (not 32-bit) ints. This is important, because we' 're going to keep a table of the number of collisions for each hash value. With 16-bit ints there are only 65,536 possible hash values, so this table will easily fit in memory If we used 32-bit intsten there would be 4,294,967,296 possble hashes, a more troublesome amount. For C++, you can get a 16-bit unsigned int type by doing #includescstdint> uint16_t x; // x has exactly 16 bits, and is unsigned For each of the possible hash functions your program should: Create a vector hashes of size 65,536 Process the list of words, and for each word, compute its hash h Increment the entry in the table for that hash: hashes at(h)++ When finished, use Pearson! X2 test to determine the probability that the resulting distribution is uniformly distributed. (See below for the deets.) Hash functions to test The hash (non)functions you should test are: String length (modulo 216) First character Additive checksum (add all characters together), modulo 216 Remainder (use a modulo of 65413, this is the first prime that is smaller than the table size). Remember that you cannot just add up all the characters and then take the mod of the result; you have to thread the modulo through the loop that computes the sum. Multiplicative (using the scheme described in class/in the lecture notes). Again, remember that you can't just use the final sum; you have to incorporate the multiplicative calculation into hashing loop. Also remember that in C++ char may be signed or unsigned; on your computers and on the server it is signed, so character values actually range from -128 to +127. Normally this doesn't matter, as all of the standard characters are in the range 0-127. However, a few of the words in the dictionary use extended foreign characters in the range 128-255, which will map to negative values if char is signed, and negative values will throw off the hash functions. A hacky way to compensate for this is to just assumethat char is signed and pass the character value c through the transformation c < 0 ? int (255 + c) : int(c) before using it. (This will leave standard character codes alone, but will map extended characters to positive values above 127.) The correct (cross-platform compatible) way isto get the unsigned version of char (which will be either unsigned char or just char itself) and then convert the value to that, using something like this: #include stype_traits> unsigned char char_to_unsigned(char c) { return static_cast::type>(c) } (l'd also suggest making your "hash testing" function take a function pointer, so that you can just write all the above hash functions and then test them one by one.) Pearson's test The test is a statistical method for determining the probability that a given set of observed results (in this case, the number of hashes per "bin") could have been generat ed from a given distribution. In our case, we want our distribution to be uniform, meaning that every bin should have Numberofwords expected = 65, 536 items in it (this will be a fractional value). To perform the test, you will take the hashes vector and from it compute the value x2: (expected hashes[i])2 c2 c2= C expected We also need to know the "degrees of freedom", which is just the number of bins, minus 1: 65,535. We'll use the Boost.Math library for the statistical functions: #include Create a X2 distribution, with the desired degrees of freedom: boost: math: chi_squared c2d(65535.8); and then use the cumulative distribution function to get the probability: float P - boost :math: cd f c2d, c2); p tells us the probability that c2 is not derived from a uniform distribution, so the smaller this is, the better! (You can also negate it and convert to a percentage, for a kind of "good distribution" measure: (1 - p) 100.0.) Implementation This assignment doesn't have a test runner. Just rull each of the above hash functions through the full dictionary and then report on the p value you get for each You should get p = 1 for the really bad hash functions (indicating that they are very far from uniform) but for at least the remainder method. (I got P - p 1 0.251 for remainder hashing; multiplicative should give you a but it's tricky to get right, so don't worry about it too much if you getp = 1.) p<1

Solution PreviewSolution Preview

These solutions may offer step-by-step problem-solving explanations or good writing examples that include modern styles of formatting and construction of bibliographies out of text citations and references. Students may use these solutions for personal skill-building and practice. Unethical use is strictly forbidden.

int main(int argc, char** argv) {

    vector<string> strs;
    read_file(strs, "words.txt");

    if (strs.empty()) {
       cout << "Cannot read file\n";
       return 0;
    }

    cout << "String length (modulo 2^18)\n";
    hash_testing(hash_string_length, strs);

    cout << "First character\n";
    hash_testing(hash_first_character, strs);


    cout << "Additive checksum (add all characters together), modulo 2^16\n";
    hash_testing(hash_additive_checksum, strs);


    cout << "Remainder (use a modulo of 65413, this is the first prime "
            << "that is smaller than the table size)\n";
    hash_testing(hash_remainder, strs);

    cout << "Multiplicative\n";
    hash_testing(hash_multiplicative, strs);


    return 0;
}

void read_file(vector<string> & stts, string path) {
    fstream in;
    in.open(path.c_str(), ios::in);
    stts.clear();
    if (in.is_open() == false) {
       return;
    }
    string word;
    while (in >> word) {
       stts.push_back(word);
    }
    in.close();
}...

By purchasing this solution you'll be able to access the following files:
Solution.cpp and SolutionWords.txt.

$68.00
for this solution

or FREE if you
register a new account!

PayPal, G Pay, ApplePay, Amazon Pay, and all major credit cards accepted.

Find A Tutor

View available Data Structures and Algorithms Tutors

Get College Homework Help.

Are you sure you don't want to upload any files?

Fast tutor response requires as much info as possible.

Decision:
Upload a file
Continue without uploading

SUBMIT YOUR HOMEWORK
We couldn't find that subject.
Please select the best match from the list below.

We'll send you an email right away. If it's not in your inbox, check your spam folder.

  • 1
  • 2
  • 3
Live Chats