class: center, middle name: title # CS 10C - Hashing ![:newline] .Huge[*Prof. Ty Feng*] ![:newline] .Large[UC Riverside - WQ 2026] ![:newline 6]  .footnote[Copyright © 2026 Joël Porquet-Lupine and Ty Feng - [CC BY-NC-SA 4.0 International License](https://creativecommons.org/licenses/by-nc-sa/4.0/)] --- layout: true # Introduction --- ## Symbol table - Association of a *value* with a *key* * `table[key] = value` - Typical operations - `Insert()`, `Remove()`, `Contains()/Get()` - Other possible operations - `Min()`, `Max()`, `Rank()`, `SortedList()`, etc. ??? - Typically, a map or dictionary -- ![:newline 0.75] ## Example #1: frequency of letters - Compute the frequency of all the (alphabetical) characters in a text .lcol75.footnotesize[ > I was born in Tuckahoe, near Hillsborough, and about twelve miles from > Easton, in Talbot county, Maryland. I have no accurate knowledge of my > age, never having seen any authentic record containing it. By far the > larger part of the slaves know as little of their ages as horses know of > theirs, and it is the wish of most masters within my knowledge to keep > their slaves thus ignorant. I do not remember to have ever met a slave > who could tell of his birthday. They seldom come nearer to it than > planting-time, harvest-time, cherry-time, spring-time, or fall-time. > [...] ] .rcol25.tiny[ | Key | Value | |-----|-------| | 'a' | 37 | | 'b' | 6 | | 'c' | 10 | | 'd' | 10 | | 'e' | 54 | | 'g' | 11 | | ... | ... | ] --- ## Naive implementations ![:newline 0.5] ### Linear search - Array (or list) is unordered - Sequentially check each item until match is found .lcol66.small[ | Search | Insert | Delete | |--------|--------|------------------------| | $O(N)$ | $O(1)$ | (search time +) $O(1)$ | ] .rcol33[  ] ??? - Talk about the delete in O(1) * Replace deleted item with last item to keep array contiguous -- ![:flush 1] ### Binary search - Array is maintained in order - Can perform binary search to find item .lcol66.small[ | Search | Insert | Delete | |-------------|--------|--------| | $O(\log N)$ | $O(N)$ | $O(N)$ | ] .rcol33[  ] --- ## Advanced implementations ### Binary search tree - Organization of data items in BST, following certain order - Any node's key is greater than all keys in its left subtree - Any node's key is smaller than all keys in its right subtree - At most, explore height of tree .small[ | Search avg/worst | Insert avg/worst | Delete avg/worst | |--------------------|--------------------|--------------------| | $O(\log N)$/$O(N)$ | $O(\log N)$/$O(N)$ | $O(\log N)$/$O(N)$ | ] .lcol[  ] ??? - Referring to unbalanced BSTs here, so average is log N but worst case is N if tree has degenerated into linked list -- .rcol[ ![:newline 1] ### AVL or RB tree - Balanced BST - Height of tree is kept to $O(\log N)$ .small[ | Search | Insert | Delete | |------------|-------------|------------| | $O(\log N)$ | $O(\log N)$ | $O(\log N)$ | ] ] ??? - log is really good already - 1 billion items result in a tree of height 30, which is not too many steps --- ## Ideal data structure .lcol66[ Can we achieve constant time on all three main operations? ] .rcol33.small[ | Search | Insert | Delete | |--------|--------|--------| | $O(1)$ | $O(1)$ | $O(1)$ | ] -- .lcol75[ ### Using data as index - Keys are small integers - ASCII is 128 characters - `std::vector
fmap(128)`; - Can directly be used as index - `fmap[key] = value;` ] .rcol25[  ] -- .lcol75[ ![:newline 1] ### Limitations - Doesn't *(efficiently)* support other operations - `Min()`, `Max()`, etc. - Wasted memory for unused keys * E.g., ASCII defines about 30 control characters - What if the key is not a small integer? ] --- ## Example #2: frequency of words - Compute the frequency of all the (alphabetical) characters in a text .lcol75.footnotesize[ > I was born in Tuckahoe, near Hillsborough, and about twelve miles from > Easton, in Talbot county, Maryland. I have no accurate knowledge of my > age, never having seen any authentic record containing it. By far the > larger part of the slaves know as little of their ages as horses know of > theirs, and it is the wish of most masters within my knowledge to keep > their slaves thus ignorant. I do not remember to have ever met a slave > who could tell of his birthday. They seldom come nearer to it than > planting-time, harvest-time, cherry-time, spring-time, or fall-time. > [...] ] .rcol25.tiny[ | Key | Value | |--------|-------| | 'of' | 6 | | 'time' | 5 | | 'to' | 3 | | 'the' | 3 | | 'it' | 3 | | 'I' | 3 | | ... | ... | ] -- ![:flush 1] ### Issues - Very large number of words in English - Would need a huge table to index by word - `std::vector
fmap(171476)`; - Cannot index an array with a string - `fmap["time"]` is not a proper C++ construct - Can only index arrays with an integer ??? - 171,476 words in current use --- .lcol66[ ## Hashing - Compute an array index from any key - Translate key range `0..K-1` to index range `0..M-1` - How to provide **hashing** for any type of key? - Boolean, integers, floating-point numbers, strings, user-defined objects, etc. ] .rcol33[  ] -- ![:flush 1] .lcol66[ ## Collisions - Hash function is said to be *non-injective* - Key range is typically much larger than index range - Two keys can be reduced to the same hash - How to resolve potential **collisions** properly? ] .rcol33[  ] ??? - Non injective function means that two different inputs can map to a single output --- layout: true # Hash functions --- ## Required properties - Simple to compute, and fast - Hash of a key is deterministic - Equal keys must produce the same hash value - `assert(a == b && hash(a) == hash(b));` - Uniform distribution of the keys across the index range ![:newline 1]  - Two implicit steps: 1. Transform any potentially non-integer key into an integer hash code 2. Reduce this hash code to an index range ??? - Hash code is an integer, in the same "domain" as array index --- ## Some pitfalls ### Worse (yet functional) hash function .lcol.footnotesize[ ```c++ template
size_t GalaxyHashCode(K &key) { return 42; } ``` ] -- .rcol.small[ #### Problem - Can only insert one key - All the following keys will collide, regardless of their value #### Solution - Need to distinguish keys by value ] -- .wide[ ### Key subset selection ] .lcol[ Possible scenario: - Keys are phone numbers: e.g., (951) 827-5639 - Hash table is of size 1000 - Idea: use first three digits as a direct index? - `hash("(951) 827-5639") = 951` ] -- .rcol.small[ #### Problem - Phone numbers with same area code will collide - Very likely to happen if hash function was used on campus! #### Solution - Select another set of 3 digits that shows better "random" properties (e.g., the last three) - Or better, consider the entire phone number instead ] --- ## Hashing positive integers - Key is a simple unsigned integer in the range `0..K-1` - Array index is always an unsigned integer in the range `0..M-1` - Hashing is done via modulo operation .footnotesize[ ```C size_t HashUInt(unsigned key, size_t M) { return key % M; } ```  ] -- .lcol66.footnotesize[ ```C // Hash table std::vector
ht(100, false); // Uniform random number generator std::mt19937 mt(0); std::uniform_int_distribution
distr(0, 999); // Generate some random keys for (int i = 0; i < 25 ; i++) { unsigned key = distr(mt); size_t hash = HashUInt(key, ht.size()); std::cout << "Hash(" << key << ") => " << hash; if (ht[hash]) std::cout << " (collision!)"; ht[hash] = true; std::cout << std::endl; } ```  ] ??? - Uniform random generator * Numbers are uniformly distributed in the specified range * If generator runs a billion of times, each number of the range will get picked about the same number of times ![:newline 1] - Live quiz! -- .rcol33.scriptsize[ ``` Hash(548) => 48 Hash(592) => 92 Hash(715) => 15 Hash(844) => 44 Hash(602) => 2 Hash(857) => 57 Hash(544) => 44 (collision!) Hash(847) => 47 Hash(423) => 23 Hash(623) => 23 (collision!) Hash(645) => 45 Hash(384) => 84 Hash(437) => 37 Hash(297) => 97 Hash(891) => 91 Hash(56) => 56 Hash(963) => 63 Hash(272) => 72 Hash(383) => 83 ... ``` ] --- .lcol75[ ## Consideration about table size (1/2) - Does the size of the hash table matter? - Same scenario as previous slide - Keys are random numbers between 0 and 999 - Distribution of keys is uniform - Hashing of 25 keys - But hash table can either be of size **100** or size **97** ] -- .rcol25.tiny[ | Key | % 100 | % 97 | |-----------|----------|----------| | ... | ... | ... | | 857 | 57 | 81 | | 544 | .red[44] | 59 | | 847 | 47 | 71 | | 423 | 23 | 35 | | 623 | .red[23] | 41 | | 645 | 45 | .red[63] | | 384 | 84 | 93 | | ... | ... | ... | | 477 | 77 | 89 | | 791 | .red[91] | 15 | | 812 | 12 | .red[36] | | 528 | 28 | 43 | | ... | ... | ... | | 568 | 68 | 83 | | **Collisions** | **3** | **2** | ] -- .lcol75[ ![:newline 1] ### Conclusion - Table size is not critical - If distribution of keys is *uniform* ] ??? - Table 97 is smaller than 100, but a little less collisions --- .lcol75[ ## Consideration about table size (2/2) - Does the size of the hash table matter? - Same scenario as last slide - Keys are random numbers between 0 and 999 - Hashing of 25 keys - Hash table can either be of size 100 or size 97 - But distribution of keys is **non-uniform** ] -- .rcol25.tiny[ | Key | % 100 | % 97 | |----------------|-----------|-------| | 0 | 0 | 0 | | 25 | 25 | 25 | | 50 | 50 | 50 | | 75 | 75 | 75 | | 100 | .red[0 ] | 3 | | 125 | .red[25] | 28 | | 150 | .red[50] | 53 | | 175 | .red[75] | 78 | | 200 | .red[0 ] | 6 | | 225 | .red[25] | 31 | | 250 | .red[50] | 56 | | 275 | .red[75] | 81 | | 300 | .red[0 ] | 9 | | 325 | .red[25] | 34 | | 350 | .red[50] | 59 | | ... | .red[...] | ... | | **Collisions** | **21** | **0** | ] -- .lcol75[ ![:newline 1] ### Conclusion - Table size becomes critical - If distribution of keys is *non-uniform* - Prime numbers exhibit good properties ] ??? - Conclusion: use prime numbers for table size! - Now, let's talk about hashing other types than simple integers --- template: title ??? - Today, continue talking about hash functions - Talk about why collisions are inevitable - First implementation of hash table --- layout: false # Recap .lcol[ ## Hash tables - Data structure implementing efficient associative arrays - Key to value mappings - Average $O(1)$ for search, insert, and delete operations ] .rcol[ ## Hashing - Hash function - Compute array index from any type of key  ] ![:flush 0.5] .lcol[ ### Hashing positive integers - Hash key in range `0..K-1` into array index in range `0..M-1` - Simple modulo operation .footnotesize[ ```C size_t HashUInt(unsigned key, size_t M) { return key % M; } ```  ] ] .rcol[ ### Table size - Table size is critical if distribution of keys is *non-uniform* - Pick prime number to minimize number of collisions .center.tiny[ | Key | % 100 | % 97 | |----------------|-----------|-------| | ... | .red[...] | ... | | 100 | .red[0 ] | 3 | | ... | .red[...] | ... | | **Collisions** | **21** | **0** | ] ] --- layout: true # Hash functions --- .lcol[ ## Hashing strings ### Naive approach - Sum ASCII value of each character - Reduce to table size ] .rcol.footnotesize[ ```C size_t HashStr(std::string key, size_t M) { size_t h = 0; for (auto c : key) h += c; return h % M; } ```  ] -- ![:flush 0.5] .lcol45[ #### Observations - Average word length is 4.5 - $4.5 * 127 = 576$ - Longest word is 45 letters long - $45 * 127 = 5715$ ] -- .rcol55[ #### Issues - Assuming a hash table of size 10,000 - Indices up to 500 will be crowded - Indices above 5000 will never be used ] -- .lcol60.footnotesize[ #### Experiment ```C std::vector
dict{ // Some of the most used words "the", "and", "that", "have", "for", "not", "with", "you", // Longest word in English "pneumonoultramicroscopicsilicovolcanoconiosis" }; for (auto &d : dict) std::cout << "Hash(" << d << ") => " << HashStr(d, 10000) << std::endl; ```  ] -- .rcol40.footnotesize[ #### Result ``` Hash(the) => 321 Hash(and) => 307 Hash(that) => 433 Hash(have) => 420 Hash(for) => 327 Hash(not) => 337 Hash(with) => 444 Hash(you) => 349 Hash(pneumonoultramicroscopicsilicovolcanoconiosis) => 4880 ``` ] --- .lcol[ ## Hashing strings ### Better approach - Multiple/add ASCII value of each character - Reduce to table size ] .rcol.footnotesize[ ```C size_t HashStr(std::string key, size_t M) { size_t h = 0; for (auto c : key) h = (h * 31) + c; return h % M; } ```  ] ![:flush 0.5] #### Horner's method - Consider string of length L as a polynomial - \\(h = key[0]\*31^{L-1} + key[1]\*31^{L-2} + ... + key[L-1] * 31^0\\) -- ![:newline 0.5] .lcol60[ #### Properties - If *unsigned* hash value gets too big, overflow is controlled - `31` is an interesting *prime number* - Multiply item by 32 (by shifting `<< 5`) and subtract item - Experimentations has shown good distribution of indices overall ] ??? - The shift and subtract is interesting if the processor's multiply instruction is slow -- .rcol40.footnotesize[ #### Result ``` Hash(the) => 4801 Hash(and) => 6727 Hash(that) => 8823 Hash(have) => 5240 Hash(for) => 1577 Hash(not) => 9267 Hash(with) => 9734 Hash(you) => 9839 Hash(pneumonoultramicroscopicsilicovolcanoconiosis) => 2420 ``` ] ??? - OK, let's see some other more complex data types now --- ## Hashing compound data types .lcol40[ ### Example: phone number - Fields of same type - Same method as string, but for each field ] .rcol60.footnotesize[ ```c++ struct PhoneNumber { unsigned area, exch, ext; // (951) 827-5639 }; size_t HashPhone(PhoneNumber &p, size_t M) { return ((p.area * 31 + p.exch) * 31 + p.ext) % M; } ``` ] -- ![:flush 1] .lcol40[ ### Example: street address - Fields of different type - Combine each field using $31x+y$ rule ] .rcol60.footnotesize[ ```c++ struct Address { unsigned st_num; std::string st_name; }; size_t HashAddress(Address &a, size_t M) { size_t h = 17; // small prime as initial value // Hash street number h = h * 31 + a.st_num; // Hash street name for (auto c : a.st_name) h = h * 31 + c; return h % M; } ``` ] -- .lcol40[ ![:newline 1] - If key is too long to compute, pick a subset of the fields - E.g., only a few well-chosen characters from the street name ] ??? - Live quiz! --- layout: true # Collisions --- ## Uniform hashing - We assume that the hash function has a uniform distribution * Keys are mapped as evenly as possible over the index range -- ![:newline 1] ## Collision probability .lcol75[ - Assuming an array index in the range `0..M-1` - After how many hashed keys will there be the first collision? ] -- .rcol25.tiny[ | Hash | Key | % 97 | |---------|-----|----------| | #1 | 548 | 63 | | #2 | 592 | 10 | | #3 | 715 | 36 | | #4 | 844 | 68 | | #5 | 603 | 20 | | #6 | 857 | 81 | | #7 | 544 | 59 | | #8 | 847 | 71 | | #9 | 423 | 35 | | #10 | 623 | 41 | | **#11** | 645 | .red[63] | ] .lcol75[ ![:newline 1] ### Experiments - Example: - Output range of 97 - 11 hashes to observe the first collision - Now, what if the output range was 223?... ] -- .lcol75[ - Need to restart a simulation... - Or need for better mathematical tools! ] --- ### Birthday problem .lcol60[ - In a group of $N$ random people, what is the probability that two people have the same birthday? ] .rcol40[  ] -- .lcol60[ - 100% if group counts 367 people (*pigeonhole principle*) ] -- .lcol60[ - But 50% if group counts only 23 people - And 99.9% if group counts 70 people! ] -- ![:flush 1] ### Back to collision probability - Number of hashed keys to first collision is $\approx \sqrt{\frac{\pi}{2}M}$ -- .lcol75[ - For an output range of 97: $\sqrt{\frac{\pi}{2}97} = 12$ ] .rcol25.tiny[ | Hash | Key | % 97 | |---------|-----|----------| | ... | ... | ... | | #10 | 623 | 41 | | **#11** | 645 | .red[63] | ] -- .lcol75[ - And for an output of 223: $\sqrt{\frac{\pi}{2}223} = 18$ 😲 ] --- ## Observation - Impossible to avoid collisions * Unless table size is quadratically bigger than necessary! ![:newline 1] ## Conclusion - Need to deal with collisions gracefully - Two options: 1. **Separate chaining**: couple the hash table with another container 2. **Open addressing**: put the key at another index in case of collision --- layout: true # Separate chaining --- ## Principle - Hash table is an array of lists - Keys are still hashed to array index $i$ in the range `0..M-1` ![:newline 1] .lcol40.footnotesize.center[ | key | hash | |-----|------| | v | 3 | | e | 1 | | c | 4 | | t | 1 | | o | 1 | | r | 4 | | s | 0 | ] .rcol60[  ] ![:flush 1] - Insertion: key pushed in $i^{th}$ chain - Search: only need to search key in $i^{th}$ chain --- ## Public API .footnotesize[ ```C // Returns whether @key is found in table bool Contains(const K &key); // Inserts @key in hash table void Insert(const K &key); // Removes @key from hash table void Remove(const K &key); ```  ] -- ![:newline 1] ## Data structure implementation .lcol60.footnotesize[ ```C template
class HTChaining { public: ... private: static const size_t init_capacity = 11; std::vector
> ht{init_capacity}; size_t cur_size = 0; // Helper methods size_t Hash(const K &key); ... }; ```  ] .rcol40[ - Small prime number as initial hash table size - Hash table of chains, as a vector of linked lists ] --- ## Hash function .lcol60.footnotesize[ ```C template
size_t HTChaining
::Hash(const K &key) { static std::hash
k_hash; return k_hash(key) % ht.size(); } ```  ] .rcol40[ - `std::hash
` provided by C++ library - Knows how to hash most basic types (e.g., integers, floating-point, strings) ] -- .wide[ .footnotesize[ ### Example ```C std::hash
int_hash; std::hash
double_hash; std::hash
str_hash; std::hash
intptr_hash; std::cout << 42 << " => " << int_hash(42) << std::endl; std::cout << -42 << " => " << int_hash(-42) << std::endl; std::cout << 42.23 << " => " << double_hash(42.23) << std::endl; std::cout << "algorithm" << " => " << str_hash("algorithm") << std::endl; int a; std::cout << &a << " => " << intptr_hash(&a) << std::endl; ```  ] ] -- .lcol45.footnotesize[ ``` 42 => 42 -42 => 18446744073709551574 42.23 => 4631140161442744893 algorithm => 9064388152055783788 0x16fb82ec0 => 5748297291990795977 ``` ] .rcol55[ - Good and easy way to get a well-distributed hash code - Still need to reduce it to array index with a modulo operation ] --- template: title --- layout: false # Recap .lcol[ ## Hashing - Horner's method .footnotesize[ \\(h = key[0]\*31^{L-1} + key[1]\*31^{L-2}\\) \\(\qquad+ ... + key[L-1] * 31^0\\) ] ] .rcol[ ## Collisions - Collisions are inevitable - $\sqrt{\frac{\pi}{2}M}$ insertions before first collision in hashtable of size $M$ ] ![:flush 1] ## Separate chaining .lcol45[ - Lists of keys hashed to same array index  ] .rcol55[ ### Implementation .footnotesize[ ```C template
class HTChaining { public: ... private: static const size_t init_capacity = 11; std::vector
> ht{init_capacity}; size_t cur_size = 0; // Helper methods size_t Hash(const K &key); ... }; ```  ] ] --- layout: true # Separate chaining --- ## API implementation .footnotesize[ ```C template
bool HTChaining
::Contains(const K &key) { auto &list = ht[Hash(key)]; return std::find(list.begin(), list.end(), key) != list.end(); } ```  ] - `std::find()` returns iterator on matching item within specified bounds - Returns special `end()` iterator if not found -- ![:newline 1] .lcol.footnotesize[ ```C template
void HTChaining
::Insert(const K &key) { // Key already there if (Contains(key)) return; // Push key into right bucket auto &list = ht[Hash(key)]; list.push_back(key); // Update size cur_size++; ... } ```  ] -- .rcol.footnotesize[ ```C template
void HTChaining
::Remove(const K &key) { // Key not there if (!Contains(key)) return; // Remove key from bucket auto &list = ht[Hash(key)]; auto itr = std::find(list.begin(), list.end(), key); list.erase(itr); // Update size cur_size--; } ```  ] --- ## Performance analysis - Hashing key to array index: $O(1)$ - Searching/inserting/removing key in chain: $O($*length of chain*$)$ -- ![:newline 0.5] ### Hash function - If using poor hash function, such as `GalaxyHashCode()` - One chain can end up containing all the items - Runtime complexity of operations: $O(N)$! - If using hash function providing uniform distribution - On average, chains will be of length $N/M = L$ - $L$ is also know as the *load factor* - Runtime complexity of operations: $O(L)$ ??? - But initially we said that hashtables should be in constant time so that seems paradoxal right? O(L) sounds like it's anything but constant.. - How can I still make it constant? -- ![:newline 0.5] ### Load factor - If load factor is maintained to be constant - Then complexity also becomes constant - Number of items $N$ in hash table cannot be controlled - But number of chains $M$ can... .comment[ --- template: title --- layout: false # Recap .lcol[ ## Hashing - Horner's method .footnotesize[ \\(h = key[0]\*31^{L-1} + key[1]\*31^{L-2}\\) \\(\qquad+ ... + key[L-1] * 31^0\\) ] ] .rcol[ ## Collisions - Collisions are inevitable - $\sqrt{\frac{\pi}{2}M}$ insertions before first collision in hashtable of size $M$ ] ![:flush 1] ## Separate chaining .lcol60[ - Lists of keys hashed to same array index  ] .rcol40[ ### Load factor - Average length of chains in hashtable ($L=N/M$) - Defines complexity of most operations ($O(L)$) - Maintain constant load factor via resizing the hash table! ] ] --- layout: true # Separate chaining --- .lcol40[ ## Resizing ### Scenario  - Load factor: $7/5 = 1.4$ ] -- .rcol60[ ### Resizing - Rehashing 1. Increase the size of hash table 1. **Rehash** every item into new table ] ??? - Better to choose a new prime number -- .rcol60[ ![:newline 0.5] ### Result  - Load factor: $7/11 \approx 0.64$ ] ??? - Can show the example of 'd' which has the ascii code 100 - 100 % 5 => 0 - 100 % 11 => 1 (99 + 1) --- ## Resizing implementation .lcol40.footnotesize[ ```C template
class HTChaining { public: ... private: ... void Resize(size_t capacity); }; ```  ] -- .rcol60.footnotesize[ ```C template
void HTChaining
::Insert(const K &key) { ... // Update size cur_size++; // Resize hash table if load factor is 1 if (cur_size > ht.size()) Resize(ht.size() * 2); // XXX: should use prime! } ```  ] ??? - Did not choose a prime number to make the code simpler, but should! -- .lcol66[ ![:newline 0.4] .footnotesize[ ```C template
void HTChaining
::Resize(size_t capacity) { // Back up existing hash table std::vector
> old_ht = ht; // Resize table to new capacity, and empty it ht.resize(capacity); for (auto &list : ht) list.clear(); cur_size = 0; // Reinsert all previous items into new table for (auto &list : old_ht) for (auto &k : list) Insert(k); } ```  ] ] -- .rcol33[ ![:newline 1] ### Complexity - Resizing happens infrequently - *Amortized* time complexity: $O(1)$ ![:newline 1] - Ideally, resize to another prime number ] --- ## Conclusion ### Complexity .lcol[ - Proportional to load factor - Typically force load factor to around 0.75 * Good compromise between performance and memory usage ] .rcol.small[ | Search | Insert | Delete | |--------|--------------------|--------| | $O(1)$ | $O(1)$ *amortized* | $O(1)$ | ] ![:flush 1] ### Pros - Very effective if keys are uniformly distributed ![:newline 1] ### Cons - Rely on another data structure, such as lists, to implement the chains - Note that the chains could also be implemented with balanced trees! - Cost associated to this additional data structure - Memory management, traversal, etc. --- layout: false # Open addressing ## Principle - Keys hashed to the same array index are not chained together - Inserted in the same *flat* hash table, but at the *next* available index - Hash table must be *at least* the same size as the number of keys total -- ![:newline 0.5] .lcol66[ ### Formally - Key $x$ is inserted at the first index $h_i$ available - $h_i(x) = (hash(x) + f(i)) \mod M$ - where $f(0) = 0$ - Function $f$ is the collision resolution strategy - Linear probing, quadratic probing, etc. ] .rcol33[  ] ![:flush] ### Operations - Insert: keep probing until an empty index is found - Search: keep probing until key is found - Or until encountering an empty index (i.e., key not found) --- layout: true # Linear probing --- ## Principle - Function $f$ is a linear function of $i$ - Typically $f(i) = i$ - In case of collision, try next indices sequentially ![:newline 1] .lcol60[ ## Example - Hash table of size $M = 17$ - Insert sequence of characters ] .rcol40.footnotesize.center[ | Key | Hash | Index | |:---:|---------|:----------------------| | `I` | 5 | .green[5] | | `<` | 9 | .green[9] | | `3` | 0 | .green[0] | | `L` | 8 | .green[8] | | `U` | .red[0] | .green[1] *(= 0 + 1)* | | `V` | .red[1] | .green[2] *(= 1 + 1)* | ] -- .lcol60[ ![:newline 2] ### Observations - Apparition of *primary clusters* - Contiguous block of keys - Caused by multiple collisions - Key hashed into a cluster might require multiple attempts to be inserted ] --- ## Data structure - Based on *flat* hash table - Same public API as for separate chaining implementation .footnotesize[ ```C template
class HTLinear { public: ... private: struct Node { K key; bool taken; }; static const size_t init_capacity = 11; std::vector
ht{init_capacity}; size_t cur_size = 0; // Helper methods size_t Hash(const K &key); void Resize(size_t capacity); }; ```  ] -- .lcol40[ ![:newline 1] - Same hash function as for separate chaining implementation ] .rcol60.footnotesize[ ```C template
size_t HTLinear
::Hash(const K &key) { static std::hash
k_hash; return k_hash(key) % ht.size(); } ```  ] --- ## API implementation (1/2) .lcol.footnotesize[ ```C template
bool HTLinear
::Contains(const K &key) { size_t idx = Hash(key); while (ht[idx].taken) { if (ht[idx].key == key) return true; idx = (idx + 1) % ht.size(); } return false; } ```  ] -- .rcol[ .footnotesize[ ```C template
void HTLinear
::Insert(const K &key) { // Key already there if (Contains(key)) return; // Find empty spot size_t idx = Hash(key); while (ht[idx].taken) idx = (idx + 1) % ht.size(); // Put key there ht[idx].key = key; ht[idx].taken = true; // Update size cur_size++; // Resize hash table if 50% full if (cur_size > ht.size() / 2) Resize(ht.size() * 2); } ```  ] - Here, try to maintain load factor of maximum 0.5 ] -- .lcol.footnotesize[ ```C template
void HTLinear
::Resize(size_t capacity) { std::vector
old_ht = ht; // Resize to new capacity and empty ht.resize(capacity); for (auto &k : ht) k.taken = false; cur_size = 0; // Reinsert all previous items for (auto &k : old_ht) if (k.taken) Insert(k.key); } ```  ] --- ## Deletion principle ### Scenario .lcol66[ - Assuming following hash table - Cluster of three keys - How to remove key `U`? - Key `V` should still be reachable ] .rcol33.center.small[ | Key | Hash | Index | |:---:|:-----|:------| | `I` | 5 | 5 | | `<` | 9 | 9 | | `3` | 0 | 0 | | `L` | 8 | 8 | | `U` | 0 | 1 | | `V` | 1 | 2 | ] -- .lcol66[ ![:newline 1] ### Problems - If free index at `U`, `V` is not reachable anymore - `Hash(V)` points to available entry - Same issue for key `U` if remove key `3` ] -- .wide[ ### Solution - Free key's index upon removal - But rehash every key of cluster located directly after key - `V` would be reinserted at index `1`, and stay reachable ] ??? - Close any gaps in cluster --- ## Deletion implementation .footnotesize[ ```C template
void HTLinear
::Remove(const K &key) { // Key not there if (!Contains(key)) return; // Find key position and remove it size_t idx = Hash(key); while (ht[idx].key != key) idx = (idx + 1) % ht.size(); ht[idx].taken = false; // Rehash the next keys of the same cluster idx = (idx + 1) % ht.size(); while (ht[idx].taken) { // Temporarily remove key from its spot, and reinsert it right away auto &key_cpy = ht[idx].key; ht[idx].taken = false; cur_size--; Insert(key_cpy); idx = (idx + 1) % ht.size(); } // Update size, and halve hash table if 12.5% full cur_size--; if (cur_size > 0 && cur_size <= ht.size() / 8) Resize(ht.size() / 2); } ```  ] --- ## Conclusion ### Complexity .lcol[ - Difficult complexity analysis - Often simplified to $O(1)$ ] .rcol.scriptsize[ | Search hit | Search miss/Insert | |------------------------------------------------------------|----------------------------------------------------------------| | $O(\sim\frac{1}{2}\left(1 + \frac{1}{1 - \alpha}\right))$ | $O(\sim\frac{1}{2}\left(1 + \frac{1}{(1 - \alpha)^2}\right))$ | ] ![:flush 1] ### Pros - No overhead for memory management - Locality of reference, especially if multiple consecutive items can be in the same cache line ![:newline 1] ### Cons - Requires low load factor (0.5) compared to separate chaining - Bigger hash table - Causes primary clustering ??? - $\alpha$ is the load factor - Primary clustering: tendency for one collision to cause more nearby collisions --- layout: false # Open addressing strategies ### Linear probing - $f(i) = i$ - If index at `i` is taken, try `i+1`, then `i+2`, ..., `i+k`. -- ### Quadratic probing - $f(i) = i^2$ - If index at `i` is taken, try `i+1^2`, then `i+2^2`, ..., `i+k^2`. -- ### Double hashing - Use second hash functions to compute intervals between probes - $f(i) = i * h_2(k)$ -- ![:newline 1] ## Comparison .small.center[ | Strategy | Locality of reference | Prevents clustering | |-------------------|:----------------------|:-----------| | Linear probing | Excellent | Bad | | Quadratic probing | Good | Good | | Double hashing | Bad | Excellent | ] --- layout: true # Standard C++ containers --- The C++ Standard Library defines two main **unordered** *associative containers*, implemented as hashed data structures: - `std::unordered_set` - Contains an *unordered* set of unique objects - Objects must be comparable for equality, as they are used as keys directly - `std::unordered_map` - Contains *unordered* pairs of key-value - Keys must be comparable for equality and unique ![:newline 2] - Comparison with **ordered** *associate containers* (map, set, etc.) - Constant vs logarithm complexity - Collection is not ordered by key, by design ??? - multiset and multimap can have multiple items with same key --- ## Demo .lcol66.footnotesize[ ```C #include
#include
int main() { // Initialization std::unordered_map
m = { {36, "Alice"}, {23, "Bob"} }; // Dynamic insertion by indexing m[8] = "Lisa"; // Dynamic pair insertion m.insert({10, "Bart"}); // Alternative pair insertion // and of duplicated item m.insert(std::make_pair(10, "Bart")); // Iterate and print values for (auto n : m) std::cout << "(" << n.second << ": " << n.first << ")" << std::endl; return 0; } ```  ] .rcol33.small[ ``` (Bart: 10) (Lisa: 8) (Alice: 36) (Bob: 23) ``` ]