2. Problem: Search
• We are given a list of records.
• Each record has an associated key.
• Give an efficient algorithm for searching for a
record containing a particular key.
• Efficiency is quantified in terms of average
time analysis (number of comparisons) to
retrieve an item.
3. Search
[Figure: an array of records at locations [0] through [700]. Each record holds a Number key, for example 506643548, 233667136, 281942902, 155778322, 580625685, and 701466868.]
Each record in list has an associated key.
In this example, the keys are ID numbers.
Given a particular key, how can we
efficiently retrieve the record from the list?
4. Serial Search
• Step through array of records, one at a time.
• Look for record with matching key.
• Search stops when
– record with matching key is found
– or when search has examined all records
without success.
5. Pseudocode for Serial Search
// Search for a desired item in the n array elements
// starting at a[first].
// Returns pointer to desired record if found.
// Otherwise, return NULL
…
for(i = 0; i < n; ++i )
    if(a[first+i] is desired item)
        return &a[first+i];
// if we drop through loop, then desired item was not found
return NULL;
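The pseudocode above can be sketched in C roughly as follows. The `Record` type, its `key` and `data` fields, and the name `serial_search` are assumptions made for illustration; the slides do not define them.

```c
#include <stddef.h>

/* Hypothetical record type: the slides only say each record has a key. */
typedef struct {
    long key;
    const char *data;
} Record;

/* Search the n elements starting at a[first] for a record whose key
 * equals target. Returns a pointer to the record, or NULL if absent. */
const Record *serial_search(const Record *a, size_t first, size_t n, long target)
{
    for (size_t i = 0; i < n; ++i) {
        if (a[first + i].key == target)
            return &a[first + i];
    }
    /* Dropped through the loop: the desired item was not found. */
    return NULL;
}
```

Note that the loop index counts from 0 and is added to `first`, so exactly n elements are examined in the worst case.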
6. Serial Search Analysis
• What are the worst and average case
running times for serial search?
• We must determine the O-notation for the
number of operations required in search.
• Number of operations depends on n, the
number of entries in the list.
7. Worst Case Time for Serial Search
• For an array of n elements, the worst case time
for serial search requires n array accesses: O(n).
• Consider cases where we must loop over all n
records:
– desired record appears in the last position of
the array
– desired record does not appear in the array at
all
8. Average Case for Serial Search
Assumptions:
1. All keys are equally likely in a search
2. We always search for a key that is in the array
Example:
• We have an array of 10 records.
• If we search for the first record, it requires 1 array
access; for the second, 2 array accesses; and so on.
The average of all these searches is:
(1+2+3+4+5+6+7+8+9+10)/10 = 5.5
9. Average Case Time for Serial Search
Generalize for array size n.
Expression for average-case running time:
(1+2+…+n)/n = [n(n+1)/2]/n = (n+1)/2
Therefore, average case time complexity for serial
search is O(n).
10. Binary Search
• Perhaps we can do better than O(n) in the
average case?
• Assume that we are given an array of records
that is sorted. For instance:
– an array of records with integer keys sorted
from smallest to largest (e.g., ID numbers), or
– an array of records with string keys sorted in
alphabetical order (e.g., names).
11. Binary Search Pseudocode
…
if(size == 0)
found = false;
else {
middle = index of approximate midpoint of array segment;
if(target == a[middle])
target has been found!
else if(target < a[middle])
search for target in area before midpoint;
else
search for target in area after midpoint;
}
…
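One possible C rendering of this pseudocode is sketched below. It assumes an array of `int` keys sorted in ascending order; the name `binary_search` and the `location` output parameter are illustrative choices, not taken from the slides.

```c
#include <stddef.h>
#include <stdbool.h>

/* Recursive binary search over the sorted segment a[first .. first+size-1].
 * On success stores the index in *location and returns true. */
bool binary_search(const int *a, size_t first, size_t size, int target, size_t *location)
{
    if (size == 0)
        return false;                      /* empty segment: not found */

    size_t middle = first + size / 2;      /* approximate midpoint */
    if (target == a[middle]) {
        *location = middle;                /* target has been found */
        return true;
    }
    if (target < a[middle])                /* search the area before the midpoint */
        return binary_search(a, first, size / 2, target, location);
    /* search the area after the midpoint */
    return binary_search(a, middle + 1, (size - 1) / 2, target, location);
}
```

The segment before the midpoint has size/2 elements and the segment after it has (size-1)/2 elements, so each recursive call works on at most half of the current segment.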
29. Binary Search: Analysis
• Worst case complexity?
• What is the maximum depth of recursive
calls in binary search as function of n?
• At each level of the recursion, we split the
array in half (divide by two).
• Therefore the maximum recursion depth is about
log2(n): at most floor(log2(n)) + 1 array elements
are examined, so the worst case is O(log2 n).
• The average case is also O(log2 n).
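To make the halving argument concrete, here is a small C sketch that counts how many array elements a worst-case binary search examines. The function name `worst_case_probes` is made up for illustration.

```c
#include <stddef.h>

/* Count how many elements a worst-case binary search examines in a
 * sorted array of n elements: one per level until the segment is empty. */
unsigned worst_case_probes(size_t n)
{
    unsigned probes = 0;
    while (n > 0) {
        ++probes;   /* examine the midpoint of the current segment */
        n /= 2;     /* the larger remaining part has at most n/2 elements */
    }
    return probes;  /* equals floor(log2(n)) + 1 for n >= 1 */
}
```

For the 701-element array used in the hashing slides this gives 10 probes, compared with up to 701 for serial search.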
30. Can we do better than O(log2n)?
• Average and worst case of serial search = O(n)
• Average and worst case of binary search = O(log2n)
• Can we do better than this?
YES. Use a hash table!
31. What is a Hash Table ?
• The simplest kind of hash
table is an array of records.
• This example has 701
records.
[Figure: an array with locations [0], [1], [2], [3], [4], [5], ..., [700].]
32. What is a Hash Table ?
• Each record has a special
field, called its key.
• In this example, the key is a
long integer field called
Number.
[Figure: the array [0] ... [700], with the record at location [4] enlarged to show its key, Number 506643548.]
33. What is a Hash Table ?
• The number might be a
person's identification
number, and the rest of the
record has information
about the person.
[Figure: the array [0] ... [700], with the record at location [4] enlarged to show its key, Number 506643548.]
34. What is a Hash Table ?
• When a hash table is in use,
some spots contain valid
records, and other spots are
"empty".
[Figure: the array [0] ... [700]. Some locations hold records (Number 506643548, 233667136, 281942902, 155778322); the rest are empty.]
35. Open Address Hashing
• In order to insert a new
record, the key must
somehow be converted to an
array index.
• The index is called the hash
value of the key.
[Figure: the partly filled array [0] ... [700] and a new record to insert, Number 580625685.]
36. Inserting a New Record
• A typical way to create a hash
value:
(Number mod 701)
What is (580625685 % 701) ?
[Figure: the partly filled array [0] ... [700] and the new record, Number 580625685.]
37. Inserting a New Record
• A typical way to create a hash
value:
(Number mod 701)
What is (580625685 % 701) ?
3
[Figure: the partly filled array [0] ... [700] and the new record, Number 580625685.]
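The division hash used in this example can be written as a one-line C function. This is only a sketch: the macro name `CAPACITY` follows the later slides, and the function name `hash` and the use of `unsigned long` keys are assumptions.

```c
/* Table size from the slides' example (locations [0] through [700]). */
#define CAPACITY 701

/* Division hash: map a non-negative key to an index in 0..CAPACITY-1. */
unsigned long hash(unsigned long key)
{
    return key % CAPACITY;
}
```

With this function, `hash(580625685UL)` evaluates to 3, matching the answer on the slide.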
38. Inserting a New Record
• The hash value is used for
the location of the new
record.
[Figure: the new record, Number 580625685, is pointed at location [3] of the array.]
39. Inserting a New Record
• The hash value is used for
the location of the new
record.
[Figure: the array with Number 580625685 now stored at location [3].]
40. Collisions
• Here is another new record
to insert, with a hash value
of 2.
[Figure: the array and another new record, Number 701466868, which says "My hash value is [2]."]
41. Collisions
• This is called a collision,
because there is already
another valid record at [2].
[Figure: the array and the new record, Number 701466868, which cannot go in location [2].]
When a collision occurs,
move forward until you
find an empty spot.
43. Collisions
• This is called a collision,
because there is already
another valid record at [2].
[Figure: the array and the new record, Number 701466868, still moving forward past occupied locations.]
When a collision occurs,
move forward until you
find an empty spot.
44. Collisions
• This is called a collision,
because there is already
another valid record at [2].
[Figure: the array with Number 701466868 now stored in the first empty location after [2].]
The new record goes
in the empty spot.
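The insertion procedure walked through on these slides (hash the key, then move forward past occupied locations) might look like this in C. It is a sketch under stated assumptions: only keys are stored, a key of 0 marks an empty location (as the speaker notes suggest for positive ID numbers), and the names `Table` and `table_insert` are hypothetical.

```c
#include <stddef.h>

#define CAPACITY 701
#define EMPTY 0L   /* assumed marker: valid keys are positive */

typedef struct {
    long key[CAPACITY];   /* only keys are stored, to keep the sketch short */
    size_t used;          /* number of occupied locations */
} Table;

/* Insert key using linear probing. Returns the index used,
 * or -1 if the table is full. Duplicate keys are not checked. */
long table_insert(Table *t, long key)
{
    if (t->used == CAPACITY)
        return -1;
    size_t i = (size_t)(key % CAPACITY);    /* hash value */
    while (t->key[i] != EMPTY)              /* collision: move forward */
        i = (i + 1) % CAPACITY;             /* wrap around at the end */
    t->key[i] = key;
    t->used++;
    return (long)i;
}
```

Inserting the slides' keys in order reproduces the walkthrough: 580625685 lands at [3], and 701466868 (hash value 2) is pushed forward past [2], [3], and [4] to [5].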
45. Searching for a Key
• The data that's attached to a
key can be found fairly
quickly.
[Figure: the array and the key being searched for, Number 701466868.]
46. Searching for a Key
• Calculate the hash value.
• Check that location of the array
for the key.
[Figure: the sought key, Number 701466868, says "My hash value is [2]." The record at the location being checked answers "Not me."]
47. Searching for a Key
• Keep moving forward until you
find the key, or you reach an
empty spot.
[Figure: the sought key, Number 701466868, whose hash value is [2]. The record at the next location checked answers "Not me."]
48. Searching for a Key
• Keep moving forward until you
find the key, or you reach an
empty spot.
[Figure: the sought key, Number 701466868, whose hash value is [2]. The record at the next location checked answers "Not me."]
49. Searching for a Key
• Keep moving forward until you
find the key, or you reach an
empty spot.
[Figure: the sought key, Number 701466868, whose hash value is [2]. The record at the next location checked answers "Yes!"]
50. Searching for a Key
• When the item is found, the
information can be copied to
the necessary location.
[Figure: the sought key, Number 701466868, whose hash value is [2]. The matching record answers "Yes!"]
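The search steps on these slides can be sketched in C as below. The sketch assumes a bare array of keys filled by linear probing, with 0 as the assumed marker for an empty location; `table_find` is a hypothetical name.

```c
#include <stddef.h>

#define CAPACITY 701
#define EMPTY 0L   /* assumed marker for a never-used location */

/* Search an open-address table of keys that was filled by linear probing.
 * Returns the index of key, or -1 if an empty spot (or a full cycle
 * through the table) is reached first. */
long table_find(const long keys[CAPACITY], long key)
{
    size_t i = (size_t)(key % CAPACITY);        /* start at the hash value */
    for (size_t examined = 0; examined < CAPACITY; ++examined) {
        if (keys[i] == key)
            return (long)i;                     /* "Yes!" */
        if (keys[i] == EMPTY)
            return -1;                          /* empty spot: key is absent */
        i = (i + 1) % CAPACITY;                 /* "Not me": move forward */
    }
    return -1;                                  /* table full, key absent */
}
```

The loop bound guards against cycling forever when the table is completely full and the key is absent, a case the slides do not discuss.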
51. Deleting a Record
• Records may also be deleted from a hash table.
[Figure: the array; the record Number 506643548 says "Please delete me."]
52. Deleting a Record
• Records may also be deleted from a hash table.
• But the location must not be left as an ordinary
"empty spot" since that could interfere with searches.
[Figure: the array with the record Number 506643548 removed.]
53. Deleting a Record
• Records may also be deleted from a hash table.
• But the location must not be left as an ordinary
"empty spot" since that could interfere with searches.
• The location must be marked in some special way so
that a search can tell that the spot used to have
something in it.
[Figure: the array with the record Number 506643548 removed.]
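One way to realize the "special mark" is a second sentinel key, sketched here in C. The markers `EMPTY` (0) and `DELETED` (-1) and the function names are assumptions for illustration; they rely on all valid keys being positive.

```c
#include <stdbool.h>
#include <stddef.h>

#define CAPACITY 701
#define EMPTY    0L     /* assumed marker: location never used */
#define DELETED  (-1L)  /* assumed marker: location previously used */

/* Find key by linear probing; pass over DELETED spots, stop at EMPTY. */
long table_find(const long keys[CAPACITY], long key)
{
    size_t i = (size_t)(key % CAPACITY);
    for (size_t examined = 0; examined < CAPACITY; ++examined) {
        if (keys[i] == key)
            return (long)i;
        if (keys[i] == EMPTY)       /* only a true empty spot ends the search */
            return -1;
        i = (i + 1) % CAPACITY;
    }
    return -1;
}

/* Remove key by marking its location DELETED. Returns true if removed. */
bool table_remove(long keys[CAPACITY], long key)
{
    long i = table_find(keys, key);
    if (i < 0)
        return false;
    keys[i] = DELETED;
    return true;
}
```

After deleting 506643548 from [4], a search for 701466868 still starts at [2], passes over the marked location [4], and finds the record at [5]. Had [4] been reset to `EMPTY`, that search would have stopped early and wrongly reported the key as absent.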
54. Hashing
• Hash tables store a collection of records with keys.
• The location of a record depends on the hash value of
the record's key.
• Open address hashing:
– When a collision occurs, the next available location is used.
– Searching for a particular key is generally quick.
– When an item is deleted, the location must be marked in a
special way, so that searches know that the spot was
once occupied.
• See text for implementation.
55. Open Address Hashing
• To reduce collisions…
– Use table CAPACITY = prime number of form
4k+3
– Hashing functions:
• Division hash function: key % CAPACITY
• Mid-square function: (key*key) % CAPACITY
• Multiplicative hash function: the key is multiplied by a
positive constant less than one. The hash function
returns the first few digits of the fractional part of the result.
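The three hash functions listed above could be sketched in C as follows. Several details are assumptions: `CAPACITY` is set to 811 as an example of a prime of the form 4k+3 (811 = 4*202 + 3), the mid-square version follows the formula as written on the slide, the constant `A` is an arbitrary choice, and the multiplicative version scales the fractional part to a table index rather than literally taking its first few digits.

```c
#include <stdint.h>

#define CAPACITY 811   /* assumed example: a prime of the form 4k+3 */

/* Division hash function: key % CAPACITY. */
uint32_t hash_division(uint32_t key)
{
    return key % CAPACITY;
}

/* Mid-square as written on the slide: square the key, reduce mod CAPACITY.
 * 64-bit arithmetic avoids overflow of key*key. */
uint32_t hash_mid_square(uint32_t key)
{
    uint64_t sq = (uint64_t)key * key;
    return (uint32_t)(sq % CAPACITY);
}

/* Multiplicative hash: multiply by a constant 0 < A < 1 and scale the
 * fractional part of key*A to a table index. A is an assumed value. */
uint32_t hash_multiplicative(uint32_t key)
{
    const double A = 0.6180339887;          /* assumption, not from the slides */
    double product = key * A;
    double fraction = product - (double)(uint64_t)product;  /* fractional part */
    return (uint32_t)(fraction * CAPACITY);
}
```

All three return an index in the range 0 to CAPACITY-1, so any of them can be used to pick the starting location in the table.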
56. Clustering
• In the hash method described, when the insertion
encounters a collision, we move forward in the
table until a vacant spot is found. This is called
linear probing.
• Problem: when several different keys are hashed to
the same location, adjacent spots in the table will be
filled. This leads to the problem of clustering.
• As the table approaches its capacity, these clusters
tend to merge. This causes insertion to take a long
time (due to linear probing to find vacant spot).
57. Double Hashing
• One common technique to avoid clustering is called double
hashing.
• Let’s call the original hash function hash1
• Define a second hash function hash2
Double hashing algorithm:
1. When an item is inserted, use hash1(key) to determine
insertion location i in array as before.
2. If collision occurs, use hash2(key) to determine how far to
move forward in the array looking for a vacant spot:
next location = (i + hash2(key)) % CAPACITY
58. Double Hashing
• Clustering tends to be reduced, because hash2() has
different values for keys that initially map to the same initial
location via hash1().
• This is in contrast to hashing with linear probing.
• Both methods are open address hashing, because the
methods take the next open spot in the array.
• In linear probing, the step is always 1:
next location = (i + 1) % CAPACITY
• In double hashing, the step depends on the key:
next location = (i + hash2(key)) % CAPACITY
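A minimal C sketch of insertion with double hashing follows. The slides do not give a concrete `hash2`; the choice `1 + key % (CAPACITY - 2)` is an assumption that keeps the step between 1 and CAPACITY-2, so it is never zero. Because the assumed `CAPACITY` of 701 is prime, any such step eventually visits every location. The 0 marker for empty locations and the name `double_hash_insert` are also assumptions.

```c
#include <stddef.h>

#define CAPACITY 701   /* prime, as in the slides' example table */
#define EMPTY 0L       /* assumed marker: valid keys are positive */

static size_t hash1(long key) { return (size_t)(key % CAPACITY); }

/* Assumed second hash: a step size in 1..CAPACITY-2, never zero. */
static size_t hash2(long key) { return 1 + (size_t)(key % (CAPACITY - 2)); }

/* Insert key with double hashing. Returns the index used, or -1 if
 * no empty location is found after CAPACITY probes. */
long double_hash_insert(long keys[CAPACITY], long key)
{
    size_t i = hash1(key);
    size_t step = hash2(key);
    for (size_t tries = 0; tries < CAPACITY; ++tries) {
        if (keys[i] == EMPTY) {
            keys[i] = key;
            return (long)i;
        }
        i = (i + step) % CAPACITY;   /* next location = (i + hash2(key)) % CAPACITY */
    }
    return -1;
}
```

Keys 2, 703, and 1404 all have hash1 value 2, but their steps differ (703 steps by 5, 1404 by 7), so they spread out to locations 2, 7, and 9 instead of forming one contiguous cluster.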
59. Chained Hashing
• In open address hashing, a collision is
handled by probing the array for the next
vacant spot.
• When the array is full, no new items can be
added.
• We can solve this by resizing the table.
• Alternative: chained hashing.
60. Chained Hashing
• In chained hashing, each location in the hash table
contains a list of records whose keys map to that
location:
[Figure: array locations [0], [1], [2], ..., [n]. Location [0] heads a list of the records whose keys hash to 0, location [1] a list of the records whose keys hash to 1, location [3] a list of the records whose keys hash to 3, and so on.]
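Chained hashing can be sketched in C with an array of singly linked lists. The `Node` type, the function names, and the choice to insert at the front of each list are assumptions for illustration; only keys are stored to keep the example short.

```c
#include <stdlib.h>

#define CAPACITY 701

/* One node in the list stored at a table location. */
typedef struct Node {
    long key;
    struct Node *next;
} Node;

/* Insert key at the front of the list for its hash value.
 * Returns 0 on success, -1 if memory allocation fails. */
int chain_insert(Node *table[CAPACITY], long key)
{
    Node *node = malloc(sizeof *node);
    if (node == NULL)
        return -1;
    size_t i = (size_t)(key % CAPACITY);
    node->key = key;
    node->next = table[i];
    table[i] = node;
    return 0;
}

/* Return the node holding key, or NULL if the key is not in the table. */
Node *chain_find(Node *table[CAPACITY], long key)
{
    for (Node *p = table[key % CAPACITY]; p != NULL; p = p->next)
        if (p->key == key)
            return p;
    return NULL;
}
```

Because colliding records simply join the same list, the table never becomes "full", and a deletion needs no special marker: the node is unlinked from its list.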
61. Time Analysis of Hashing
• Worst case: every key gets hashed to same
array index! O(n) search!!
• Luckily, average case is more promising.
• First we define a fraction called the hash
table load factor:
α = (number of occupied table locations) / (size of table's array)
(for chained hashing, count the records stored, so α can exceed 1)
62. Average Search Times
For open addressing with linear probing, the average
number of table elements examined in a successful
search is approximately:
½ (1 + 1/(1 - α))
Double hashing: -ln(1 - α)/α
Chained hashing: 1 + α/2
63. Average number of table elements examined during successful search

Load        Open addressing,     Open addressing,     Chained
factor (α)  linear probing       double hashing       hashing
            ½ (1 + 1/(1 - α))    -ln(1 - α)/α
0.5         1.50                 1.39                 1.25
0.6         1.75                 1.53                 1.30
0.7         2.17                 1.72                 1.35
0.8         3.00                 2.01                 1.40
0.9         5.50                 2.56                 1.45
1.0         Not applicable       Not applicable       1.50
2.0         Not applicable       Not applicable       2.00
3.0         Not applicable       Not applicable       2.50
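As a check on the first row of the table, the three expressions can be evaluated at a load factor of α = 0.5. The chained-hashing expression 1 + α/2 used here is inferred from the values in the table's last column.

```latex
\begin{align*}
\text{Linear probing: } & \tfrac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)
  = \tfrac{1}{2}\left(1 + \frac{1}{0.5}\right) = 1.50 \\
\text{Double hashing: } & \frac{-\ln(1-\alpha)}{\alpha}
  = \frac{-\ln 0.5}{0.5} \approx \frac{0.693}{0.5} \approx 1.39 \\
\text{Chained hashing: } & 1 + \frac{\alpha}{2} = 1 + 0.25 = 1.25
\end{align*}
```

The open-addressing expressions grow without bound as α approaches 1, which is why those columns read "Not applicable" from α = 1.0 onward.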
64. Summary
• Serial search: average case O(n)
• Binary search: average case O(log2n)
• Hashing
– Open address hashing
• Linear probing
• Double hashing
– Chained hashing
– Average number of elements examined is function of
load factor
Editor's Notes
#31:This lecture introduces hash tables, which are an array-based method for implementing a Dictionary. You should recall that we have seen dictionaries implemented in other ways, for example with a binary search tree. The abstract properties of a dictionary remain the same: We can insert items in the dictionary, and each item has a key associated with it. When we want to retrieve an item, we specify only the key, and the retrieval process finds the associated data.
What we do now is use an array to implement the dictionary. The array is an array of records. In this example, we could store up to 701 records in the array.
#32:Each record in the array contains two parts. The first part is a number that we'll use for the key of the item. We could use something else for the keys, such as a string. But for a hash table, numbers make the most convenient keys.
#33:The numbers might be identification numbers of some sort, and the rest of the record contains information about a person. So the pattern that you see here is the same pattern that you've seen in other dictionaries: Each entry in the dictionary has a key (in this case an identifying number) and some associated data.
#34:When a hash table is being used as a dictionary, some of the array locations are in use, and other spots are "empty", waiting for a new entry to come along.
Oftentimes, the empty spots are identified by a special key. For example, if all our identification numbers are positive, then we could use 0 as the Number that indicates an empty spot.
With this drawing, locations [0], [3], [5], and maybe some others would all have Number=0.
#35:In order to insert a new entry, the key of the entry must somehow be converted to an index in the array. For our example, we must convert the key number into an index between 0 and 700. The conversion process is called hashing and the index is called the hash value of the key.
#36:There are many ways to create hash values. Here is a typical approach.
a. Take the key mod 701 (which could be anywhere from 0 to 700).
So, quick, what is (580,625,685 mod 701) ?
#38:So, this new item will be placed at location [3] of the array.
#39:The hash value is always used to find the location for the record.
#40:Sometimes, two different records might end up with the same hash value.
#41:This is called a collision.
When a collision occurs, the insertion process will move forward through the array until an empty spot is found. Sometimes you will have a second collision...
#43:But if there are any empty spots, eventually you will reach an empty spot, and the new item is inserted here.
#44:The new record is always placed in the first available empty spot, after the hash value.
#45:It is fairly easy to search for a particular item based on its key.
#46:Start by computing the hash value, which is 2 in this case. Then check location 2. If location 2 has a different key than the one you are looking for, then move forward...
#47:...if the next location is not the one we are looking for, then keep moving forward...
#48:Keep moving forward until you find the sought-after key...
#50:The data from location [5] can then be copied to provide the result of the search function.
What happens if a search reaches an empty spot? In that case, it can
halt and indicate that the key was not in the hash table.
#52:But the spot of the deleted record cannot be left as an ordinary empty spot, since that would interfere with searches. (Remember that a search can stop when it reaches an empty spot.)
#53:Instead we must somehow mark the location as "a location that used to have something here, but no longer does."
We might do this by using some other special value for the Number field of the record.
In any case, a search cannot stop when it reaches "a location that used to have something here". A search can only stop when it reaches a true empty spot.