Hashing

# CS 10C - Hashing

![:newline]

.Huge[*Prof. Ty Feng*]

![:newline]

.Large[UC Riverside - SQ 2026]

![:newline 6]

![:scale 40%](../utils/style/ucr_cs_logo.png)

---

# Introduction

---

## Symbol table

- Association of a *value* with a *key*
    * `table[key] = value`
- Typical operations
    - `Insert()`, `Remove()`, `Contains()/Get()`
- Other possible operations
    - `Min()`, `Max()`, `Rank()`, `SortedList()`, etc.

???

- Typically, a map or dictionary

![:newline 0.75]

## Example #1: frequency of letters

- Compute the frequency of all the (alphabetical) characters in a text

.lcol75.footnotesize[
> I was born in Tuckahoe, near Hillsborough, and about twelve miles from
> Easton, in Talbot county, Maryland. I have no accurate knowledge of my
> age, never having seen any authentic record containing it. By far the
> larger part of the slaves know as little of their ages as horses know of
> theirs, and it is the wish of most masters within my knowledge to keep
> their slaves thus ignorant. I do not remember to have ever met a slave
> who could tell of his birthday. They seldom come nearer to it than
> planting-time, harvest-time, cherry-time, spring-time, or fall-time.
> [...]
]

.rcol25.tiny[

| Key | Value |
|-----|-------|
| 'a' | 37    |
| 'b' | 6     |
| 'c' | 10    |
| 'd' | 10    |
| 'e' | 54    |
| 'g' | 11    |
| ... | ...   |

]

---

## Naive implementations

![:newline 0.5]

### Linear search

- Array (or list) is unordered
- Sequentially check each item until match is found

.lcol66.small[
| Search | Insert | Delete                 |
|--------|--------|------------------------|
| $O(N)$ | $O(1)$ | (search time +) $O(1)$ |
]
.rcol33[
![:scale 100%](images/linear_search.svg)
]

???

- Talk about the delete in O(1)
    * Replace deleted item with last item to keep array contiguous

![:flush 1]

### Binary search

- Array is maintained in order
- Can perform binary search to find item

.lcol66.small[
| Search      | Insert | Delete |
|-------------|--------|--------|
| $O(\log N)$ | $O(N)$ | $O(N)$ |
]
.rcol33[
![:scale 100%](images/binary_search.svg)
]

---

## Advanced implementations

### Binary search tree

- Organization of data items in BST, following certain order
    - Any node's key is greater than all keys in its left subtree
    - Any node's key is smaller than all keys in its right subtree
- At most, explore height of tree

.small[
| Search avg/worst   | Insert avg/worst   | Delete avg/worst   |
|--------------------|--------------------|--------------------|
| $O(\log N)$/$O(N)$ | $O(\log N)$/$O(N)$ | $O(\log N)$/$O(N)$ |
]

???

- Referring to unbalanced BSTs here, so average is log N but worst case is N if
  tree has degenerated into linked list

### AVL or RB tree

- Balanced BST
- Height of tree is kept to $O(\log N)$

.small[
| Search     | Insert      | Delete     |
|------------|-------------|------------|
| $O(\log N)$ | $O(\log N)$ | $O(\log N)$ |
]
]

???

- log is really good already
    - 1 billion items result in a tree of height 30, which is not too many steps

---

## Ideal data structure

.rcol33.small[
| Search | Insert | Delete |
|--------|--------|--------|
| $O(1)$ | $O(1)$ | $O(1)$ |
]

### Using data as index

- Keys are small integers
    - ASCII is 128 characters
    - `std::vector<size_t> fmap(128)`;
- Can directly be used as index
    - `fmap[key] = value;`
]

### Limitations
- Doesn't *(efficiently)* support other operations
    - `Min()`, `Max()`, etc.
- Wasted memory for unused keys
    * E.g., ASCII defines about 30 control characters
- What if the key is not a small integer?
]

---

## Example #2: frequency of words

- Compute the frequency of all the (alphabetical) characters in a text

.rcol25.tiny[

| Key    | Value |
|--------|-------|
| 'of'   | 6     |
| 'time' | 5     |
| 'to'   | 3     |
| 'the'  | 3     |
| 'it'   | 3     |
| 'I'    | 3     |
| ...    | ...   |

]

![:flush 1]

### Issues

- Very large number of words in English
    - Would need a huge table to index by word
    - `std::vector<size_t> fmap(171476)`;
- Cannot index an array with a string
    - `fmap["time"]` is not a proper C++ construct
    - Can only index arrays with an integer

???

- 171,476 words in current use

---

- Compute an array index from any key
- Translate key range `0..K-1` to index range `0..M-1`
- How to provide **hashing** for any type of key?
    - Boolean, integers, floating-point numbers, strings, user-defined objects,
      etc.
]

![:flush 1]

- Hash function is said to be *non-injective*
- Key range is typically much larger than index range
- Two keys can be reduced to the same hash
- How to resolve potential **collisions** properly?
]

???

- Non injective function means that two different inputs can map to a single
  output

---

# Hash functions

---

## Required properties

- Simple to compute, and fast
- Hash of a key is deterministic
    - Equal keys must produce the same hash value
    - `assert(a == b && hash(a) == hash(b));`
- Uniform distribution of the keys across the index range

![:newline 1]
![:scale 100%](images/hash_function_generic.svg)

- Two implicit steps:
    1. Transform any potentially non-integer key into an integer hash code
    2. Reduce this hash code to an index range

???

- Hash code is an integer, in the same "domain" as array index

---

## Some pitfalls

### Worse (yet functional) hash function

.lcol.footnotesize[
```c++
template<typename K>
size_t GalaxyHashCode(K &key) {
    return 42;
}
```
]

.rcol.small[
#### Problem
- Can only insert one key
- All the following keys will collide, regardless of their value

#### Solution
- Need to distinguish keys by value
]

.lcol[
Possible scenario:
- Keys are phone numbers: e.g., (951) 827-5639
- Hash table is of size 1000
- Idea: use first three digits as a direct index?
    - `hash("(951) 827-5639") = 951`
]

.rcol.small[
#### Problem
- Phone numbers with same area code will collide
    - Very likely to happen if hash function was used on campus!

#### Solution
- Select another set of 3 digits that shows better "random" properties (e.g.,
  the last three)
- Or better, consider the entire phone number instead
]

---

## Hashing positive integers

- Key is a simple unsigned integer in the range `0..K-1`
- Array index is always an unsigned integer in the range `0..M-1`
- Hashing is done via modulo operation

```C
size_t HashUInt(unsigned key, size_t M) {
    return key % M;
}

```

![:filename](uint_hash_uniform.cc)

]

.lcol66.footnotesize[

```C
  // Hash table
  std::vector<bool> ht(100, false);

// Uniform random number generator
  std::mt19937 mt(0);
  std::uniform_int_distribution<unsigned> distr(0, 999);

// Generate some random keys
  for (int i = 0; i < 25 ; i++) {
    unsigned key = distr(mt);
    size_t hash = HashUInt(key, ht.size());

std::cout << "Hash(" << key << ") => " << hash;

if (ht[hash])
      std::cout << " (collision!)";
    ht[hash] = true;
    std::cout << std::endl;
  }

```

![:filename](uint_hash_uniform.cc)

]

???

- Uniform random generator
    * Numbers are uniformly distributed in the specified range
    * If generator runs a billion of times, each number of the range will get
      picked about the same number of times

![:newline 1]

- Live quiz!

.rcol33.scriptsize[
```
Hash(548) => 48
Hash(592) => 92
Hash(715) => 15
Hash(844) => 44
Hash(602) => 2
Hash(857) => 57
Hash(544) => 44 (collision!)
Hash(847) => 47
Hash(423) => 23
Hash(623) => 23 (collision!)
Hash(645) => 45
Hash(384) => 84
Hash(437) => 37
Hash(297) => 97
Hash(891) => 91
Hash(56) => 56
Hash(963) => 63
Hash(272) => 72
Hash(383) => 83
...
```
]

---

.lcol75[
## Consideration about table size (1/2)
- Does the size of the hash table matter?
- Same scenario as previous slide
    - Keys are random numbers between 0 and 999
    - Distribution of keys is uniform
    - Hashing of 25 keys
- But hash table can either be of size **100** or size **97**
]

.rcol25.tiny[
| Key       | % 100    | % 97     |
|-----------|----------|----------|
| ...       | ...      | ...      |
| 857       | 57       | 81       |
| 544       | .red[44] | 59       |
| 847       | 47       | 71       |
| 423       | 23       | 35       |
| 623       | .red[23] | 41       |
| 645       | 45       | .red[63] |
| 384       | 84       | 93       |
| ...       | ...      | ...      |
| 477       | 77       | 89       |
| 791       | .red[91] | 15       |
| 812       | 12       | .red[36] |
| 528       | 28       | 43       |
| ...       | ...      | ...      |
| 568       | 68       | 83       |
| **Collisions** | **3**   | **2**    |
]

### Conclusion
- Table size is not critical
- If distribution of keys is *uniform*
]

???

- Table 97 is smaller than 100, but a little less collisions

---

.lcol75[
## Consideration about table size (2/2)
- Does the size of the hash table matter?
- Same scenario as last slide
    - Keys are random numbers between 0 and 999
    - Hashing of 25 keys
    - Hash table can either be of size 100 or size 97
- But distribution of keys is **non-uniform**
]

.rcol25.tiny[
| Key            | % 100     | % 97  |
|----------------|-----------|-------|
| 0              | 0         | 0     |
| 25             | 25        | 25    |
| 50             | 50        | 50    |
| 75             | 75        | 75    |
| 100            | .red[0 ]  | 3     |
| 125            | .red[25]  | 28    |
| 150            | .red[50]  | 53    |
| 175            | .red[75]  | 78    |
| 200            | .red[0 ]  | 6     |
| 225            | .red[25]  | 31    |
| 250            | .red[50]  | 56    |
| 275            | .red[75]  | 81    |
| 300            | .red[0 ]  | 9     |
| 325            | .red[25]  | 34    |
| 350            | .red[50]  | 59    |
| ...            | .red[...] | ...   |
| **Collisions** | **21**    | **0** |
]

### Conclusion
- Table size becomes critical
- If distribution of keys is *non-uniform*
- Prime numbers exhibit good properties
]

???

- Conclusion: use prime numbers for table size!
- Now, let's talk about hashing other types than simple integers

---

???

- Today, continue talking about hash functions
- Talk about why collisions are inevitable
- First implementation of hash table

---

# Recap

.lcol[
## Hash tables
- Data structure implementing efficient associative arrays
- Key to value mappings
- Average $O(1)$ for search, insert, and delete operations
]
.rcol[
## Hashing
- Hash function
- Compute array index from any type of key
![:scale 100%](images/hash_function_generic.svg)
]
![:flush 0.5]
.lcol[
### Hashing positive integers
- Hash key in range `0..K-1` into array index in range `0..M-1`
- Simple modulo operation

```C
size_t HashUInt(unsigned key, size_t M) {
    return key % M;
}

```

![:filename](uint_hash_uniform.cc)

]
]
.rcol[
### Table size
- Table size is critical if distribution of keys is *non-uniform*
- Pick prime number to minimize number of collisions

.center.tiny[
| Key            | % 100     | % 97  |
|----------------|-----------|-------|
| ...            | .red[...] | ...   |
| 100            | .red[0 ]  | 3     |
| ...            | .red[...] | ...   |
| **Collisions** | **21**    | **0** |
]
]

---

# Hash functions

---

### Naive approach

- Sum ASCII value of each character
- Reduce to table size
]
.rcol.footnotesize[

```C
size_t HashStr(std::string key, size_t M) {
  size_t h = 0;
  for (auto c : key)
    h += c;
  return h % M;
}

```

![:filename](string_hash_naive.cc)

]

![:flush 0.5]

- Average word length is 4.5
    - $4.5 * 127 = 576$
- Longest word is 45 letters long
    - $45 * 127 = 5715$
]

- Assuming a hash table of size 10,000
- Indices up to 500 will be crowded
- Indices above 5000 will never be used
]

.lcol60.footnotesize[
#### Experiment

```C
  std::vector<std::string> dict{
    // Some of the most used words
    "the", "and", "that", "have",
    "for", "not", "with", "you",
    // Longest word in English
    "pneumonoultramicroscopicsilicovolcanoconiosis"
  };

for (auto &d : dict)
    std::cout << "Hash(" << d << ") => "
        << HashStr(d, 10000) << std::endl;

```

![:filename](string_hash_naive.cc)

]

.rcol40.footnotesize[
#### Result
```
Hash(the) => 321
Hash(and) => 307
Hash(that) => 433
Hash(have) => 420
Hash(for) => 327
Hash(not) => 337
Hash(with) => 444
Hash(you) => 349
Hash(pneumonoultramicroscopicsilicovolcanoconiosis) => 4880
```
]

---

### Better approach

- Multiple/add ASCII value of each character
- Reduce to table size
]
.rcol.footnotesize[

```C
size_t HashStr(std::string key, size_t M) {
  size_t h = 0;
  for (auto c : key)
    h = (h * 31) + c;
  return h % M;
}

```

![:filename](string_hash_better.cc)

]

![:flush 0.5]

#### Horner's method
- Consider string of length L as a polynomial
- \$h = key[0]\*31^{L-1} + key[1]\*31^{L-2} + ... + key[L-1] * 31^0\$

![:newline 0.5]

- If *unsigned* hash value gets too big, overflow is controlled
- `31` is an interesting *prime number*
    - Multiply item by 32 (by shifting `<< 5`) and subtract item
    - Experimentations has shown good distribution of indices overall
]

???

- The shift and subtract is interesting if the processor's multiply instruction
  is slow

--
.rcol40.footnotesize[
#### Result
```
Hash(the) => 4801
Hash(and) => 6727
Hash(that) => 8823
Hash(have) => 5240
Hash(for) => 1577
Hash(not) => 9267
Hash(with) => 9734
Hash(you) => 9839
Hash(pneumonoultramicroscopicsilicovolcanoconiosis) => 2420
```
]

???

- OK, let's see some other more complex data types now

---

## Hashing compound data types

- Fields of same type
- Same method as string, but for each field
]
.rcol60.footnotesize[
```c++
struct PhoneNumber {
  unsigned area, exch, ext; // (951) 827-5639
};

size_t HashPhone(PhoneNumber &p, size_t M) {
  return ((p.area * 31 + p.exch) * 31 + p.ext) % M;
}
```
]

![:flush 1]

- Fields of different type
- Combine each field using $31x+y$ rule
]
.rcol60.footnotesize[
```c++
struct Address {
  unsigned st_num;
  std::string st_name;
};

size_t HashAddress(Address &a, size_t M) {
  size_t h = 17; // small prime as initial value

// Hash street number
  h = h * 31 + a.st_num;

// Hash street name
  for (auto c : a.st_name)
    h = h * 31 + c;

return h % M;
}
```
]

.lcol40[
![:newline 1]
- If key is too long to compute, pick a subset of the fields
    - E.g., only a few well-chosen characters from the street name
]

???

- Live quiz!

---

# Collisions

---

## Uniform hashing

- We assume that the hash function has a uniform distribution
    * Keys are mapped as evenly as possible over the index range

![:newline 1]

## Collision probability

.lcol75[
- Assuming an array index in the range `0..M-1`
- After how many hashed keys will there be the first collision?
]

.rcol25.tiny[
| Hash    | Key | % 97     |
|---------|-----|----------|
| #1      | 548 | 63       |
| #2      | 592 | 10       |
| #3      | 715 | 36       |
| #4      | 844 | 68       |
| #5      | 603 | 20       |
| #6      | 857 | 81       |
| #7      | 544 | 59       |
| #8      | 847 | 71       |
| #9      | 423 | 35       |
| #10     | 623 | 41       |
| **#11** | 645 | .red[63] |
]

.lcol75[
![:newline 1]
### Experiments
- Example:
    - Output range of 97
    - 11 hashes to observe the first collision
- Now, what if the output range was 223?...
]

---

### Birthday problem

.lcol60[
- In a group of $N$ random people, what is the probability that two people have
  the same birthday?
]

![:flush 1]

### Back to collision probability

- Number of hashed keys to first collision is $\approx \sqrt{\frac{\pi}{2}M}$

.rcol25.tiny[
| Hash    | Key | % 97     |
|---------|-----|----------|
| ...     | ... | ...      |
| #10     | 623 | 41       |
| **#11** | 645 | .red[63] |
]

---

## Observation

- Impossible to avoid collisions
    * Unless table size is quadratically bigger than necessary!

![:newline 1]

## Conclusion
- Need to deal with collisions gracefully
- Two options:
    1. **Separate chaining**: couple the hash table with another container
    2. **Open addressing**: put the key at another index in case of collision

---

# Separate chaining

---

## Principle

- Hash table is an array of lists
- Keys are still hashed to array index $i$ in the range `0..M-1`

![:newline 1]

.lcol40.footnotesize.center[
| key | hash |
|-----|------|
| v   | 3    |
| e   | 1    |
| c   | 4    |
| t   | 1    |
| o   | 1    |
| r   | 4    |
| s   | 0    |
]
.rcol60[
![:scale 90%](images/hash_separate_chaining.svg)
]

![:flush 1]

- Insertion: key pushed in $i^{th}$ chain
- Search: only need to search key in $i^{th}$ chain

---

## Public API
.footnotesize[

```C
  // Returns whether @key is found in table
  bool Contains(const K &key);
  // Inserts @key in hash table
  void Insert(const K &key);
  // Removes @key from hash table
  void Remove(const K &key);

```

![:filename](ht_chaining.h)

]

![:newline 1]

## Data structure implementation
.lcol60.footnotesize[

```C
template <typename K>
class HTChaining {
 public:
...
 private:
  static const size_t init_capacity = 11;
  std::vector<std::list<K>> ht{init_capacity};
  size_t cur_size = 0;

// Helper methods
  size_t Hash(const K &key);
...
};

```

![:filename](ht_chaining.h)

]
.rcol40[
- Small prime number as initial hash table size
- Hash table of chains, as a vector of linked lists
]

---

## Hash function

.lcol60.footnotesize[

```C
template <typename K>
size_t HTChaining<K>::Hash(const K &key) {
  static std::hash<K> k_hash;
  return k_hash(key) % ht.size();
}

```

![:filename](ht_chaining.h)

]
.rcol40[
- `std::hash<T>` provided by C++ library
- Knows how to hash most basic types (e.g., integers, floating-point, strings)
]

```C
  std::hash<int> int_hash;
  std::hash<double> double_hash;
  std::hash<std::string> str_hash;
  std::hash<int*> intptr_hash;

std::cout << 42        << " => " << int_hash(42) << std::endl;
  std::cout << -42       << " => " << int_hash(-42) << std::endl;
  std::cout << 42.23     << " => " << double_hash(42.23) << std::endl;
  std::cout << "algorithm" << " => " << str_hash("algorithm") << std::endl;
  int a;
  std::cout << &a        << " => " << intptr_hash(&a) << std::endl;

```

![:filename](std_hash.cc)

]
]

--
.lcol45.footnotesize[
```
42 => 42
-42 => 18446744073709551574
42.23 => 4631140161442744893
algorithm => 9064388152055783788
0x16fb82ec0 => 5748297291990795977
```
]
.rcol55[
- Good and easy way to get a well-distributed hash code
- Still need to reduce it to array index with a modulo operation
]

---

---

# Recap

.lcol[
## Hashing
- Horner's method
.footnotesize[
\$h = key[0]\*31^{L-1} + key[1]\*31^{L-2}\$
\$\qquad+ ... + key[L-1] * 31^0\$
]
]
.rcol[
## Collisions
- Collisions are inevitable
- $\sqrt{\frac{\pi}{2}M}$ insertions before first collision in hashtable of size
  $M$
]
![:flush 1]

## Separate chaining
.lcol45[
- Lists of keys hashed to same array index
![:scale 80%](images/hash_separate_chaining.svg)
]
.rcol55[
### Implementation
.footnotesize[

```C
template <typename K>
class HTChaining {
 public:
...
 private:
  static const size_t init_capacity = 11;
  std::vector<std::list<K>> ht{init_capacity};
  size_t cur_size = 0;

// Helper methods
  size_t Hash(const K &key);
...
};

```

![:filename](ht_chaining.h)

]
]

---

---

## API implementation
.footnotesize[

```C
template <typename K>
bool HTChaining<K>::Contains(const K &key) {
  auto &list = ht[Hash(key)];
  return std::find(list.begin(), list.end(), key) != list.end();
}

```

![:filename](ht_chaining.h)

]
- `std::find()` returns iterator on matching item within specified bounds
- Returns special `end()` iterator if not found

![:newline 1]

.lcol.footnotesize[

```C
template <typename K>
void HTChaining<K>::Insert(const K &key) {
  // Key already there
  if (Contains(key)) return;

// Push key into right bucket
  auto &list = ht[Hash(key)];
  list.push_back(key);

// Update size
  cur_size++;
...
}

```

![:filename](ht_chaining.h)

]

.rcol.footnotesize[

```C
template <typename K>
void HTChaining<K>::Remove(const K &key) {
  // Key not there
  if (!Contains(key)) return;

// Remove key from bucket
  auto &list = ht[Hash(key)];
  auto itr = std::find(list.begin(),
					   list.end(), key);
  list.erase(itr);

// Update size
  cur_size--;
}

```

![:filename](ht_chaining.h)

]

---

## Performance analysis

- Hashing key to array index: $O(1)$
- Searching/inserting/removing key in chain: $O($*length of chain*$)$

![:newline 0.5]

### Hash function

- If using poor hash function, such as `GalaxyHashCode()`
    - One chain can end up containing all the items
    - Runtime complexity of operations: $O(N)$!
- If using hash function providing uniform distribution
    - On average, chains will be of length $N/M = L$
    - $L$ is also know as the *load factor*
    - Runtime complexity of operations: $O(L)$

???

- But initially we said that hashtables should be in constant time so that seems
  paradoxal right? O(L) sounds like it's anything but constant..
- How can I still make it constant?

![:newline 0.5]

### Load factor

- If load factor is maintained to be constant
    - Then complexity also becomes constant
- Number of items $N$ in hash table cannot be controlled
- But number of chains $M$ can...

---

---

# Recap

## Separate chaining
.lcol60[
- Lists of keys hashed to same array index
![:scale 80%](images/hash_separate_chaining.svg)
]
.rcol40[
### Load factor
- Average length of chains in hashtable ($L=N/M$)
- Defines complexity of most operations ($O(L)$)
- Maintain constant load factor via resizing the hash table!
]

]

---

# Separate chaining

---

### Scenario
![:scale 80%](images/hash_separate_chaining.svg)
- Load factor: $7/5 = 1.4$
]

.rcol60[
### Resizing - Rehashing
1. Increase the size of hash table
1. **Rehash** every item into new table
]

???

- Better to choose a new prime number

### Result
![:scale 30%](images/hash_separate_chaining_rehash.svg)
- Load factor: $7/11 \approx 0.64$
]

???

- Can show the example of 'd' which has the ascii code 100
    - 100 % 5 => 0
    - 100 % 11 => 1 (99 + 1)

---

## Resizing implementation

.lcol40.footnotesize[

```C
template <typename K>
class HTChaining {
 public:
...
 private:
...
  void Resize(size_t capacity);
};

```

![:filename](ht_chaining.h)

]

.rcol60.footnotesize[

```C
template <typename K>
void HTChaining<K>::Insert(const K &key) {
...
  // Update size
  cur_size++;

// Resize hash table if load factor is 1
  if (cur_size > ht.size())
    Resize(ht.size() * 2); // XXX: should use prime!
}

```

![:filename](ht_chaining.h)

]

???

- Did not choose a prime number to make the code simpler, but should!

```C
template <typename K>
void HTChaining<K>::Resize(size_t capacity) {
  // Back up existing hash table
  std::vector<std::list<K>> old_ht = ht;

// Resize table to new capacity, and empty it
  ht.resize(capacity);
  for (auto &list : ht)
    list.clear();
  cur_size = 0;

// Reinsert all previous items into new table
  for (auto &list : old_ht)
    for (auto &k : list)
      Insert(k);
}

```

![:filename](ht_chaining.h)

]
]

.rcol33[
![:newline 1]
### Complexity
- Resizing happens infrequently
- *Amortized* time complexity: $O(1)$

![:newline 1]
- Ideally, resize to another prime number
]

---

## Conclusion

### Complexity

.lcol[
- Proportional to load factor
- Typically force load factor to around 0.75
    * Good compromise between performance and memory usage
]

.rcol.small[
| Search | Insert             | Delete |
|--------|--------------------|--------|
| $O(1)$ | $O(1)$ *amortized* | $O(1)$ |
]

![:flush 1]

### Pros

- Very effective if keys are uniformly distributed

![:newline 1]

### Cons

- Rely on another data structure, such as lists, to implement the chains
    - Note that the chains could also be implemented with balanced trees!
- Cost associated to this additional data structure
    - Memory management, traversal, etc.

---

## Principle

- Keys hashed to the same array index are not chained together
- Inserted in the same *flat* hash table, but at the *next* available index
    - Hash table must be *at least* the same size as the number of keys total

![:newline 0.5]

- Key $x$ is inserted at the first index $h_i$ available
    - $h_i(x) = (hash(x) + f(i)) \mod M$
    - where $f(0) = 0$
- Function $f$ is the collision resolution strategy
    - Linear probing, quadratic probing, etc.
]
.rcol33[
![:scale 100%](images/open_addressing.svg)
]

![:flush]

### Operations

- Insert: keep probing until an empty index is found
- Search: keep probing until key is found
    - Or until encountering an empty index (i.e., key not found)

---

# Linear probing

---

## Principle

- Function $f$ is a linear function of $i$
    - Typically $f(i) = i$
- In case of collision, try next indices sequentially

![:newline 1]

- Hash table of size $M = 17$
- Insert sequence of characters
]

.rcol40.footnotesize.center[
| Key | Hash    | Index                 |
|:---:|---------|:----------------------|
| `I` | 5       | .green[5]             |
| `<` | 9       | .green[9]             |
| `3` | 0       | .green[0]             |
| `L` | 8       | .green[8]             |
| `U` | .red[0] | .green[1] *(= 0 + 1)* |
| `V` | .red[1] | .green[2] *(= 1 + 1)* |
]

### Observations
- Apparition of *primary clusters*
    - Contiguous block of keys
    - Caused by multiple collisions
- Key hashed into a cluster might require multiple attempts to be inserted
]

---

## Data structure

- Based on *flat* hash table
- Same public API as for separate chaining implementation

```C
template <typename K>
class HTLinear {
 public:
...
 private:
  struct Node {
    K key;
    bool taken;
  };
  static const size_t init_capacity = 11;
  std::vector<Node> ht{init_capacity};
  size_t cur_size = 0;

// Helper methods
  size_t Hash(const K &key);
  void Resize(size_t capacity);
};

```

![:filename](ht_linear.h)

]

.lcol40[
![:newline 1]
- Same hash function as for separate chaining implementation
]
.rcol60.footnotesize[

```C
template <typename K>
size_t HTLinear<K>::Hash(const K &key) {
  static std::hash<K> k_hash;
  return k_hash(key) % ht.size();
}

```

![:filename](ht_linear.h)

]

---

## API implementation (1/2)

.lcol.footnotesize[

```C
template <typename K>
bool HTLinear<K>::Contains(const K &key) {
  size_t idx = Hash(key);
  while (ht[idx].taken) {
    if (ht[idx].key == key)
      return true;
    idx = (idx + 1) % ht.size();
  }
  return false;
}

```

![:filename](ht_linear.h)

]

```C
template <typename K>
void HTLinear<K>::Insert(const K &key) {
  // Key already there
  if (Contains(key)) return;

// Find empty spot
  size_t idx = Hash(key);
  while (ht[idx].taken)
    idx = (idx + 1) % ht.size();

// Put key there
  ht[idx].key = key;
  ht[idx].taken = true;

// Update size
  cur_size++;

// Resize hash table if 50% full
  if (cur_size > ht.size() / 2)
    Resize(ht.size() * 2);
}

```

![:filename](ht_linear.h)

]
- Here, try to maintain load factor of maximum 0.5
]

.lcol.footnotesize[

```C
template <typename K>
void HTLinear<K>::Resize(size_t capacity) {
  std::vector<Node> old_ht = ht;

// Resize to new capacity and empty
  ht.resize(capacity);
  for (auto &k : ht)
    k.taken = false;
  cur_size = 0;

// Reinsert all previous items
  for (auto &k : old_ht)
    if (k.taken)
      Insert(k.key);
}

```

![:filename](ht_linear.h)

]

---

## Deletion principle

### Scenario

.lcol66[
- Assuming following hash table
    - Cluster of three keys
- How to remove key `U`?
    - Key `V` should still be reachable
]

.rcol33.center.small[
| Key | Hash | Index |
|:---:|:-----|:------|
| `I` | 5    | 5     |
| `<` | 9    | 9     |
| `3` | 0    | 0     |
| `L` | 8    | 8     |
| `U` | 0    | 1     |
| `V` | 1    | 2     |
]

### Problems
- If free index at `U`, `V` is not reachable anymore
    - `Hash(V)` points to available entry
- Same issue for key `U` if remove key `3`
]

.wide[
### Solution
- Free key's index upon removal
- But rehash every key of cluster located directly after key
    - `V` would be reinserted at index `1`, and stay reachable
]

???

- Close any gaps in cluster

---

## Deletion implementation

```C
template <typename K>
void HTLinear<K>::Remove(const K &key) {
  // Key not there
  if (!Contains(key)) return;

// Find key position and remove it
  size_t idx = Hash(key);
  while (ht[idx].key != key)
    idx = (idx + 1) % ht.size();
  ht[idx].taken = false;

// Rehash the next keys of the same cluster
  idx = (idx + 1) % ht.size();
  while (ht[idx].taken) {
    // Temporarily remove key from its spot, and reinsert it right away
    auto &key_cpy = ht[idx].key;
    ht[idx].taken = false;
    cur_size--;
    Insert(key_cpy);

idx = (idx + 1) % ht.size();
  }

// Update size, and halve hash table if 12.5% full
  cur_size--;
  if (cur_size > 0 && cur_size <= ht.size() / 8)
    Resize(ht.size() / 2);
}

```

![:filename](ht_linear.h)

]

---

## Conclusion

### Complexity

.rcol.scriptsize[
| Search hit                                                 | Search miss/Insert                                             |
|------------------------------------------------------------|----------------------------------------------------------------|
| $O(\sim\frac{1}{2}\left(1 + \frac{1}{1 - \alpha}\right))$ | $O(\sim\frac{1}{2}\left(1 + \frac{1}{(1 - \alpha)^2}\right))$ |
]

![:flush 1]

### Pros
- No overhead for memory management
- Locality of reference, especially if multiple consecutive items can be in the
  same cache line

![:newline 1]

### Cons
- Requires low load factor (0.5) compared to separate chaining
    - Bigger hash table
- Causes primary clustering

???

- $\alpha$ is the load factor
- Primary clustering: tendency for one collision to cause more nearby collisions

---

# Open addressing strategies

### Linear probing

- $f(i) = i$
- If index at `i` is taken, try `i+1`, then `i+2`, ..., `i+k`.

### Quadratic probing

- $f(i) = i^2$
- If index at `i` is taken, try `i+1^2`, then `i+2^2`, ..., `i+k^2`.

### Double hashing

- Use second hash functions to compute intervals between probes
- $f(i) = i * h_2(k)$

![:newline 1]

## Comparison

.small.center[
| Strategy          | Locality of reference | Prevents clustering |
|-------------------|:----------------------|:-----------|
| Linear probing    | Excellent             | Bad        |
| Quadratic probing | Good                  | Good       |
| Double hashing    | Bad                   | Excellent  |
]

---

# Standard C++ containers

---

The C++ Standard Library defines two main **unordered** *associative
containers*, implemented as hashed data structures:
- `std::unordered_set`
    - Contains an *unordered* set of unique objects
    - Objects must be comparable for equality, as they are used as keys directly
- `std::unordered_map`
    - Contains *unordered* pairs of key-value
    - Keys must be comparable for equality and unique

![:newline 2]
- Comparison with **ordered** *associate containers* (map, set, etc.)
    - Constant vs logarithm complexity
    - Collection is not ordered by key, by design

???

- multiset and multimap can have multiple items with same key

---

## Demo

.lcol66.footnotesize[

```C
#include <iostream>
#include <unordered_map>

int main() {
  // Initialization
  std::unordered_map<int, std::string> m = {
    {36, "Alice"}, {23, "Bob"}
  };

// Dynamic insertion by indexing
  m[8] = "Lisa";

// Dynamic pair insertion
  m.insert({10, "Bart"});

// Alternative pair insertion
  // and of duplicated item
  m.insert(std::make_pair(10, "Bart"));

// Iterate and print values
  for (auto n : m)
    std::cout << "(" << n.second
        << ": " << n.first << ")"
        << std::endl;

return 0;
}

```

![:filename](std_unordered_map_example.cc)

]
.rcol33.small[
```
(Bart: 10)
(Lisa: 8)
(Alice: 36)
(Bob: 23)
```
]