11. Hashing - Data Structures using C++ by Varsha Patil

Oxford University Press © 2012 Data Structures Using C++ by Dr Varsha Patil
1

2
 Use of hashing techniques that support very fast retrieval
via a key
 Factors that affect the performance of hashing
 Collision resolution strategies

3
 Hashing is the process of computing, from a key, the address
at which a record is stored and later located
 More precisely, hashing directly computes the address of
the record from its key by applying a suitable
mathematical function called the hash function
 A hash table is an array-based structure used to store <key,
information> pairs

4
[Figure: key → Hash(key) → Address]
Fig 11.1: Hashing Concept

5
 The resulting address is used as the basis for storing and
retrieving records; this address is called the home
address of the record
 To store a record in a hash table, the hash function is
applied to the key of the record being stored, returning an
index within the range of the hash table
 The item is then stored in the table at that index position

6
 1. With hashing, the address generated appears to be
random: there is no immediately obvious connection
between the key and the location of the corresponding
record, even though the key is used to determine the
location of the record. For this reason, hashing is
sometimes referred to as randomizing
 2. With hashing, two different keys may be
transformed to the same address, so two records may be
sent to the same place in the file. When this occurs, it is
called a collision, and some means must be found to deal
with it. Two or more records that result in the
same home address are called synonyms

7
 A problem arises, however, when the hash function
returns the same value for two different keys
 To handle the situation where two records need to be
hashed to the same address, we can implement a table
structure that has room for two or more members at
the same index position

8
 A hash function maps a key into the range [0 to Max − 1];
the result is used as an index (or address) into the hash
table for storing and retrieving records
 The address generated by the hashing function is called the
home address
 All home addresses refer to a particular area of memory,
and that area is called the prime area

9
 A bucket is an index position in a hash table that can store
more than one record
 When two keys map to the same index, both
records are stored in the same bucket

10
 The result of two keys hashing into the same address is
called collision

11
 Keys that hash to the same address are called synonyms

12
 When further keys hash to the same address and
there is no room left in the bucket, overflow is said
to have occurred
 Collision and overflow are synonymous when the bucket is
of size 1

13
 When we allow records to be stored in potentially
unlimited space, it is called open or external
hashing

14
 When we use a fixed space for storage, which eventually limits
the number of records that can be stored, it is called closed or
internal hashing

15
 Hash function is an arithmetic function that transforms a
key into an address and the address is used for storing and
retrieving a record

16
 A hash function that transforms different keys
into different addresses is called a perfect hash
function
 The worth of a hash function depends on how well it
avoids collisions

17
 The maximum storage capacity, that is, the maximum number
of records that can be accommodated, is called the loading
density

18
 A full table is one in which all locations are occupied
 Owing to the characteristics of hash functions, there
are always empty locations

19
 Load factor is the number of records stored in the table
divided by the maximum capacity of the table, expressed as
a percentage
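
The load factor definition can be written as a one-line computation. This is a minimal sketch; the function name `loadFactorPercent` and the choice to return 0 for a zero-capacity table are assumptions made for illustration.

```cpp
#include <cstddef>

// Load factor as a percentage: records stored / table capacity * 100.
// Returning 0 for a zero-capacity table is a choice made for this sketch
// so that the function never divides by zero.
double loadFactorPercent(std::size_t recordsStored, std::size_t capacity) {
    if (capacity == 0) return 0.0;
    return 100.0 * static_cast<double>(recordsStored) /
           static_cast<double>(capacity);
}
```

For instance, 5 records in a table of capacity 10 give a load factor of 50 percent.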

20
 Rehashing applies to closed hashing. When we try
to store the record with Key1 at bucket position
Hash(Key1) and find that it already holds a
record, a collision has occurred
 To handle the collision, we choose a sequence
of alternative locations Hash1(Key1), Hash2(Key1), …
within the bucket table in which to place the record with Key1
 This is called rehashing

21
 Features of a Good Hash Function :
 Division Method
 Multiplication Method
 Extraction Method
 Mid-Square Hashing
 Folding Technique
 Rotation
 Universal Hashing

22
 The average performance of hashing depends on how the
hash function distributes the set of keys among the slots
 The assumption is that any given record is equally likely to
hash into any of the slots, independently of whether any
other record has already been hashed to it or not
 This assumption is called simple uniform hashing
 A good hash function is one that satisfies the
assumption of simple uniform hashing

24
 Addresses generated from the key are uniformly and
randomly distributed
 Small variations in the value of the key cause large
variations in the record addresses, so that records
(with similar keys) are distributed evenly
 The hashing function must minimize collisions

25
 One of the required features of the hash function is that
the resultant index must be within the table index range
 One simple choice for a hash function is to use
modulus division, indicated as MOD (the operator % in
C/C++)
 The function returns an integer
 If any parameter is NULL, the result is NULL
 Hash(Key) = Key % M, where M is the table size
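
The division method above can be sketched in C++ as follows. The function name `divisionHash` and the parameter types are assumptions made for illustration; the key is taken to be a non-negative integer and `M` the table size.

```cpp
// Division-method hash: maps a non-negative integer key to an index
// in the range [0, M - 1]. M is the table size and must be greater than zero.
unsigned divisionHash(unsigned long key, unsigned M) {
    return static_cast<unsigned>(key % M);
}
```

With a table of size 10, `divisionHash(4371, 10)` yields index 1.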

26
 The multiplication method works as follows:
 1. Multiply the key ‘Key’ by a constant A in the range 0
< A < 1 and extract the fractional part of Key × A
 2. Then multiply this value by M and take the floor of
the result
 Hash(Key) = floor(M × (Key × A MOD 1)),
 where Key × A MOD 1 means the fractional part of
Key × A,
 that is,
 Key × A − floor(Key × A), and A = (sqrt(5) − 1)/2 =
0.6180339887
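
The two steps of the multiplication method translate directly into code. This is a minimal sketch using double-precision arithmetic, which is adequate for moderately sized keys; the function name `multiplicationHash` is an assumption made for illustration.

```cpp
#include <cmath>

// Multiplication-method hash:
//   step 1: take the fractional part of key * A
//   step 2: multiply by the table size M and take the floor
// A = (sqrt(5) - 1) / 2, about 0.6180339887, as given in the text.
unsigned multiplicationHash(unsigned long key, unsigned M) {
    const double A = (std::sqrt(5.0) - 1.0) / 2.0;
    const double product = static_cast<double>(key) * A;
    const double fraction = product - std::floor(product);  // key * A MOD 1
    return static_cast<unsigned>(std::floor(M * fraction));
}
```

Because the fractional part is always in [0, 1), the result is always in the range [0, M − 1].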

27
 When a portion of the key is used for the address
calculation, the technique is called the extraction
method
 In digit extraction, a few digits are selected and extracted
from the key and used as the address

28
Key Hashed Address
345678 357
234137 243
952671 927
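
The addresses in the table above are formed from the first, third, and fifth digits of each key. A minimal sketch of that particular choice follows; the function name `extractDigits` and the return value of -1 for keys that are too short are assumptions made for illustration, and other digit positions could equally be chosen.

```cpp
#include <string>

// Digit extraction: build the address from the 1st, 3rd and 5th digits
// of the key (the positions that reproduce the table above).
// Returns -1 if the key has fewer than five digits.
int extractDigits(unsigned long key) {
    const std::string digits = std::to_string(key);
    if (digits.size() < 5) return -1;
    std::string address;
    address += digits[0];
    address += digits[2];
    address += digits[4];
    return std::stoi(address);
}
```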

29
 Mid-square hashing squares the key
and extracts the middle digits of the squared key as the address
 The difficulty arises when the key is large. As the entire key
participates in the address calculation, the square of a large
key is difficult to store, because the
square must not exceed the storage limit
 So mid-square is used when the key size is less than or
equal to 4 digits

30
Key | Square | Hashed Address
2341 | 5480281 | 802
1671 | 2792241 | 922
The difficulty of storing the square of a large number can be
overcome by squaring only a few digits of the key instead of the
whole key
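
Mid-square hashing on the whole key can be sketched as follows. The function name `midSquare` is an assumption made for illustration. For a square with an even number of digits there is no single middle, so this sketch picks the three digits starting just left of centre; that tie-breaking rule is also an assumption.

```cpp
#include <cstddef>
#include <string>

// Mid-square hashing: square the key and return the middle three digits
// of the square. For a 7-digit square this is digits 3 to 5.
int midSquare(unsigned long key) {
    const unsigned long long square =
        static_cast<unsigned long long>(key) * key;
    const std::string digits = std::to_string(square);
    if (digits.size() < 3) return static_cast<int>(square);
    const std::size_t start = (digits.size() - 3) / 2;
    return std::stoi(digits.substr(start, 3));
}
```

For the keys in the table, `midSquare(2341)` gives 802 and `midSquare(1671)` gives 922.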

31
We can select a portion of the key if the key is large and then
square that portion
Keys and addresses obtained by extracting a few digits, squaring
them, and again extracting the middle digits
Key | Square | Hashed Address
234137 | 234 x 234 = 054756 | 475
567187 | 567 x 567 = 321489 | 148

32
 In the folding technique, the key is subdivided into subparts
that are folded and then combined to form
the address
 For a key with digits, we can subdivide the digits into three
parts, add them up, and use the result as an address.
 Here each subpart of the key can be the same size as the
address

33
 There are two types of folding methods:
 Fold shift: the key value is divided into several parts, each
the size of the address. The left, right, and middle parts are
added
 Fold boundary: the key value is divided into parts, each
the size of the address
 The left and right parts are folded on the fixed boundaries
between them and the centre part

34
 For example, if the key is 987654321, it is understood as
Left 987 Centre 654 Right 321
 For fold shift, addition is
 987 + 654 + 321 = 1962
 Now discard digit 1 and the address is 962
 For fold boundary, the left and right parts are reversed before the addition:
 789 + 654 + 123 = 1566
 Discard digit 1 and the address is 566
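
The fold shift calculation in the example can be sketched as follows. The function name `foldShift` and its parameters are assumptions made for illustration; the key is split from the right into groups of `addressDigits` decimal digits, so a key whose length is not a multiple of the group size gets a shorter leftmost group.

```cpp
// Fold shift: split a decimal key into groups of `addressDigits` digits,
// add the groups, and discard any carry beyond `addressDigits` digits.
unsigned long foldShift(unsigned long long key, unsigned addressDigits) {
    unsigned long long modulus = 1;
    for (unsigned i = 0; i < addressDigits; ++i) modulus *= 10;

    unsigned long long sum = 0;
    while (key > 0) {
        sum += key % modulus;  // take the rightmost group
        key /= modulus;        // shift the remaining groups right
    }
    return static_cast<unsigned long>(sum % modulus);  // drop the carry
}
```

For the key 987654321 with three-digit parts, the groups 321, 654, and 987 sum to 1962, and `foldShift(987654321, 3)` returns 962.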

35
 When keys are serial, they vary only in the last digit, and this
leads to the creation of synonyms
 Rotating the key minimizes this problem. This method
is used along with other methods
 Here, the key is rotated right by one digit, and then
folding is applied to avoid synonyms
 For example,
 let the key be 120605; when it is rotated we get 512060
 The address is then calculated using any other
hash function
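
Rotating a key right by one digit can be sketched with a string conversion. The function name `rotateRight` is an assumption made for illustration. Note that if the last digit is 0, the rotated value loses its leading zero when converted back to a number.

```cpp
#include <string>

// Rotate the decimal digits of the key right by one position:
// the last digit moves to the front.
unsigned long rotateRight(unsigned long key) {
    const std::string digits = std::to_string(key);
    if (digits.size() < 2) return key;  // nothing to rotate
    const std::string rotated =
        digits.back() + digits.substr(0, digits.size() - 1);
    return std::stoul(rotated);
}
```

For the key in the example, `rotateRight(120605)` returns 512060, which would then be passed to another hash function such as folding.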

36
 The main idea behind universal hashing is to select the
hash function at random at run time from a carefully
designed set of functions
 Because of randomization, the algorithm can behave
differently on each execution, even for the same input
 This approach guarantees good average case performance,
no matter what keys are provided as input
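
The slides do not name a particular set of functions. One commonly used family, shown here purely as an assumption, is h(k) = ((a × k + b) mod p) mod m, where p is a prime larger than any key and a, b are drawn at random when the table is created. The names `UniversalHash` and `pickUniversalHash`, and the choice p = 2^31 − 1, are hypothetical.

```cpp
#include <cstdint>
#include <random>

// One member of the family h(k) = ((a*k + b) mod p) mod m.
// a and b are fixed once chosen, so the same key always maps to the
// same slot for the lifetime of the table.
struct UniversalHash {
    std::uint64_t a, b, p, m;
    std::uint64_t operator()(std::uint64_t key) const {
        return ((a * (key % p) + b) % p) % m;
    }
};

// Select a hash function at random at run time.
// Assumes keys are smaller than the prime p = 2^31 - 1.
UniversalHash pickUniversalHash(std::uint64_t m, std::mt19937_64& rng) {
    const std::uint64_t p = 2147483647ULL;
    std::uniform_int_distribution<std::uint64_t> pickA(1, p - 1);
    std::uniform_int_distribution<std::uint64_t> pickB(0, p - 1);
    return UniversalHash{pickA(rng), pickB(rng), p, m};
}
```

Seeding the generator differently on each run gives a different function each time, which is what makes the behaviour vary between executions for the same input.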

37
 No hash function is perfect.
 If Hash(Key1) = Hash(Key2), then Key1 and Key2 are
synonyms, and if the bucket size is 1, we say that a collision has
occurred
 As a consequence, we have to store the record with Key2 at some
other location
 A search is made for a bucket in which to store the record
containing Key2, using one of several collision
resolution strategies

38
 Open addressing
   Linear probing
   Quadratic probing
   Double hashing, and
   Key offset
 Separate chaining (or linked list)
 Bucket hashing (defers collision but does not prevent it)

39
 In open addressing, when a collision occurs, it is resolved by
finding an available empty location other than the home
address
 If Hash(Key) is not empty, the positions are probed in the
following sequence until an empty location is found:
 N(Hash(Key) + C(1)), N(Hash(Key) + C(2)), …, N(Hash(Key) + C(i)), …
 When we reach the end of the table, the search wraps
around to the start and continues until it returns to the
location where the collision occurred
 The most important factors in avoiding
collisions are the table size and the choice of hash function

40
 Linear probing resolves a collision by putting the
item in the next empty place following the occupied
place
 This strategy looks for the next free location until one is
found
 The function that we can use for probing linearly from the
next location is as follows:
 (Hash(x) + p(i)) MOD Max
 As p(i) = i for linear probing, the function becomes
 (Hash(x) + i) MOD Max
 Initially i = 1; if the location is not empty, then i
becomes 2, 3, 4, …, and so on until an empty location is found.

43
 With replacement
 Without replacement

44
 With replacement:
 If the slot is already occupied by a key, there are two
possibilities: either the slot is that key's home address (a collision)
or it is not
 If the occupying key's home address is different, then the new key,
whose home address is that slot, is placed at that position,
and the displaced key is placed in the next empty
position

45
 Without replacement:
 When some data is to be stored in the hash table and the
slot is already occupied by a key, another empty
location is searched for the new record
 There are two possibilities when the location is occupied: it is
the occupying key's home address or it is not.
 In both cases, the without replacement strategy searches for an
empty position for the key that is to be stored

46
 In quadratic probing, the offset added is the square of the
collision probe number
 The empty location is searched for using the
following formula
 (Hash(Key) + i²) MOD Max, where i lies between 1 and (Max − 1)/2
 Quadratic probing works much better than linear probing, but
to make full use of the hash table, there are constraints on the
values of i and Max so that the address lies within the table boundaries

47
 Double hashing uses two hash functions, one for accessing
the home address of a Key and the other for resolving the
conflict. The probe sequence is generated as
follows:
 Hash1(Key), (Hash1(Key) + i × Hash2(Key)), …, for i = 1, 2, 3, 4,
…
 The resultant address is taken modulo Max

48
Example :
Given the input {4371, 1323, 6173, 4199, 4344, 9699, 1889} and
hash function as Key % 10, show the results for the following:
1. Open addressing using linear probing
2. Open addressing using quadratic probing
3. Open addressing using double hashing h2 (x) = 7−(x MOD 7)

49
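Index | Initially | Insert 4371 | Insert 1323 | Insert 6173 | Insert 4199 | Insert 4344 | Insert 9699 | Insert 1889
0 | - | - | - | - | - | - | 9699 | 9699
1 | - | 4371 | 4371 | 4371 | 4371 | 4371 | 4371 | 4371
2 | - | - | - | - | - | - | - | 1889
3 | - | - | 1323 | 1323 | 1323 | 1323 | 1323 | 1323
4 | - | - | - | 6173 | 6173 | 6173 | 6173 | 6173
5 | - | - | - | - | - | 4344 | 4344 | 4344
6 | - | - | - | - | - | - | - | -
7 | - | - | - | - | - | - | - | -
8 | - | - | - | - | - | - | - | -
9 | - | - | - | - | 4199 | 4199 | 4199 | 4199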
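
The linear probing result for this example can be reproduced with a short insertion routine. This is a minimal sketch: the name `insertLinear`, the `EMPTY` sentinel of -1, and the restriction to non-negative integer keys are assumptions made for illustration.

```cpp
#include <vector>

constexpr int EMPTY = -1;  // marks an unused slot (keys assumed non-negative)

// Insert a key using linear probing: try (Hash(key) + i) MOD Max for
// i = 0, 1, 2, ... until an empty slot is found.
// Returns the index used, or -1 if the table is full.
int insertLinear(std::vector<int>& table, int key) {
    const int max = static_cast<int>(table.size());
    const int home = key % max;  // home address
    for (int i = 0; i < max; ++i) {
        const int pos = (home + i) % max;  // wraps around at the end
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}
```

Inserting 4371, 1323, 6173, 4199, 4344, 9699, and 1889 in that order into a table of size 10 places 1889 at index 2 after probing indices 9, 0, and 1.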

50
 Let us insert these keys using quadratic probing
 For 6173, the hashed address 6173 % 10 gives 3, which is not empty;
hence, using quadratic probing, we get the address as follows:
 Hash(6173) = (6173 + 1²) % 10 = 4, and as it is empty, the
key 6173 is stored there
 Now while inserting 4344, location 4 is not empty, and hence
quadratic probing generates the address (4344 + 1²) % 10 = 5;
as it is empty, 4344 is stored there
 For key 9699, the home address 9 is not empty; the address
(9699 + 1²) % 10 = 0 is empty, so 9699 is stored there.
While inserting 1889, the address (1889 + 1²) % 10 = 0 is not
empty, so probe again
 The address (1889 + 2²) % 10 = 3 is not empty, so probe again.
 The address (1889 + 3²) % 10 = 8 is empty, so store 1889 at location
8
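
The quadratic probing steps above can be checked with a small routine. This is a minimal sketch; the name `insertQuadratic` and the `EMPTY` sentinel are assumptions made for illustration, and the loop simply tries i = 0 to Max − 1 rather than enforcing the bound on i stated earlier.

```cpp
#include <vector>

constexpr int EMPTY = -1;  // marks an unused slot (keys assumed non-negative)

// Insert a key using quadratic probing: try (Hash(key) + i*i) MOD Max
// for i = 0, 1, 2, ... Returns the index used, or -1 if no empty slot
// is found within Max probes (which can happen even if the table is not full).
int insertQuadratic(std::vector<int>& table, int key) {
    const int max = static_cast<int>(table.size());
    const int home = key % max;  // home address
    for (int i = 0; i < max; ++i) {
        const int pos = (home + i * i) % max;
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}
```

With the same seven keys and a table of size 10, this places 6173 at 4, 4344 at 5, 9699 at 0, and 1889 at 8.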

51
 While inserting 6173, the address is Hash1(6173) = 6173 % 10
= 3 and 3 is not empty
 Let us use double hashing. Hence the address is as
follows:
 Hash(6173) = [Hash1(6173) + Hash2(6173)] % 10 =
[3 + (R − 6173 % R)] % 10 (let R be 7) = [3 + (7 − 6)] % 10 = 4
 Since 4 is empty, we store 6173 at location 4
 Now let us store 4344. The address is 4344 % 10 = 4, and as
location 4 is not empty, we use double hashing and get
Hash(4344) = 7
 Now for 9699, double hashing generates address 2, and as it
is empty, we store it there.
 For key 1889, double hashing generates address 0, and as it
is empty, we store 1889 at location 0
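
The double hashing steps above can be checked in the same way. This is a minimal sketch using Hash1(x) = x % Max and Hash2(x) = R − (x % R) with R = 7 as in the example; the name `insertDoubleHash` and the `EMPTY` sentinel are assumptions made for illustration.

```cpp
#include <vector>

constexpr int EMPTY = -1;  // marks an unused slot (keys assumed non-negative)
constexpr int R = 7;       // constant used by the second hash function

// Insert a key using double hashing: try
// (Hash1(key) + i * Hash2(key)) MOD Max for i = 0, 1, 2, ...
// Returns the index used, or -1 if no empty slot is found within Max probes.
int insertDoubleHash(std::vector<int>& table, int key) {
    const int max = static_cast<int>(table.size());
    const int home = key % max;      // Hash1: home address
    const int step = R - (key % R);  // Hash2: always between 1 and R
    for (int i = 0; i < max; ++i) {
        const int pos = (home + i * step) % max;
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}
```

With the same seven keys and a table of size 10, this places 6173 at 4, 4344 at 7, 9699 at 2, and 1889 at 0.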

52
 If the table gets full, insertion using open addressing with
quadratic probing might fail or might take too much time
 The solution is to build another table that is about twice
as big, scan down the entire original hash table, compute the
new hash value for each record, and insert the records into
the new table
 For example, if the table is of size 7 (Table 11.13) and the hash
function is key % 7, then

53
Insert 7, 15, 13, 74, 73
0 7
1 15
2
3 73
4 74
5
6 13
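
Rebuilding the table can be sketched as follows, using the keys 7, 15, 13, 74, 73 and the hash function key % size from the example. The names `insertKey` and `rehash`, the `EMPTY` sentinel, the use of linear probing for collisions, and the new size of 17 (a prime a little over twice 7) are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

constexpr int EMPTY = -1;  // marks an unused slot (keys assumed non-negative)

// Insert with hash function key % table size; collisions are resolved by
// linear probing to keep the sketch short. Returns false if the table is full.
bool insertKey(std::vector<int>& table, int key) {
    const int max = static_cast<int>(table.size());
    for (int i = 0; i < max; ++i) {
        const int pos = (key % max + i) % max;
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return true;
        }
    }
    return false;
}

// Build a larger table and re-insert every record from the old one,
// recomputing each hash value against the new size.
std::vector<int> rehash(const std::vector<int>& oldTable, std::size_t newSize) {
    std::vector<int> newTable(newSize, EMPTY);
    for (int key : oldTable) {
        if (key != EMPTY) insertKey(newTable, key);
    }
    return newTable;
}
```

Every key is hashed again because its address depends on the table size: 74 sits at index 4 in the table of size 7 but moves to index 6 in a table of size 17.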

54
 Chaining is a technique for handling synonyms that
chains together all the records that hash to the same
address. Instead of relocating synonyms, a linked list of
synonyms is created whose head is the home address of the
synonyms
 However, we need to handle pointers to form a chain of
synonyms

 Extra memory is needed for storing the pointers
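
A chained hash table can be sketched with one linked list per slot. This is a minimal sketch, not the book's implementation: the class name `ChainedHashTable`, the use of `std::list`, the hash function key % size, and the restriction to non-negative integer keys are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <vector>

// Separate chaining: each slot heads a linked list of the keys
// (synonyms) that hash to it.
class ChainedHashTable {
public:
    explicit ChainedHashTable(std::size_t size) : buckets_(size) {}

    // Add the key to the front of the chain at its home address.
    void insert(int key) { buckets_[index(key)].push_front(key); }

    // Sequentially search the chain at the key's home address.
    bool contains(int key) const {
        const std::list<int>& chain = buckets_[index(key)];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }

    std::size_t chainLength(std::size_t slot) const {
        return buckets_[slot].size();
    }

private:
    std::size_t index(int key) const {
        return static_cast<std::size_t>(key) % buckets_.size();
    }
    std::vector<std::list<int>> buckets_;
};
```

With a table of size 10, the keys 322 and 262 both hash to slot 2 and end up on the same chain, so neither needs to be relocated.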

55
[Figure: hash table with slots 0, 1, 2, …, max-1; one slot heads a chain holding the keys 322 and 262]
Fig 11.2: An Example of Chaining

56
 Chaining
   An unlimited number of synonyms can be handled in
chaining
   The additional cost to be paid is the overhead of multiple linked
lists
   Sequential search through a chain takes more time
 Rehashing
   A limited but still good number of synonyms are taken
care of
   The table size is doubled, but no additional link field
needs to be maintained
   Searching is faster when compared to chaining

57
 An overflow is said to occur when a new identifier is
mapped or hashed into a full bucket
 When the bucket size is one, collision and overflow occur
simultaneously

58
 When a new identifier is hashed into a full bucket, we
need to find another bucket for this identifier
 The simplest solution is to find the closest unfilled
bucket.
 This is called linear probing or linear open addressing

59
 Since the sizes of these lists are not known in advance, the
best way to maintain them is as linked chains
 In each slot, additional space is required for a link
 Each chain has a head node.
 The head node, however, usually is much smaller than the
other nodes, since it has to retain only a link
 As the list is accessed at random, the head nodes should
be sequential

60
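[Figure: hash table with slots 0 to 25 holding the identifiers A, A2, A1, D, A3, A4, GA, G, ZA, E, L in slots 0 to 10, …, and Z in slot 25]
Fig 11.3: Chaining
[Figure: head nodes 0 to 25 with chains of identifiers: A4, A3, A2, A1; D; E; G, GA; L; ZA, Z; empty slots hold a null link (0)]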

61
 If linear probing or separate chaining is used for collision
handling, then in case of a collision, several blocks must
be examined to search for a key, and when the table is
full, an expensive rehash is needed
 For fast searching and fewer disk accesses, extendible hashing
is used.
 It is a type of hash system that treats a hash as a bit
string and uses a trie for bucket lookup

62
 Many applications need a dynamic set that supports only the operations
Insert, Member (Search), and Delete. A keyed table is an effective data
structure for implementing them.
 Hashing is an excellent technique for implementing keyed tables. A hash
table is an array-based structure used to store <key, information> pairs.
 Hash tables are used to implement the insert and find in constant average
time. To store an item in a hash table, a hash function is applied to the key
of the item being stored, returning an index within the range of the hash
table.
 Hashing is a technique for storing and retrieving information
associated with a key; it makes use of the individual characters or digits in
the key itself.
 A problem, called a collision, arises when the hash function returns the same
value for two different keys. However, there are
various collision resolution techniques to overcome this problem.
