The Multistage Algorithm in Data Analytics

Last Updated : 21 Jun, 2022

In this article, we are going to discuss the multistage algorithm in data analytics in detail. We will also cover the working of multistage algorithms.

The Multistage Algorithm: The Multistage algorithm is an improved version of the PCY algorithm that uses several successive hash tables to further reduce the number of candidate pairs. The difference between the two algorithms is that Multistage takes more than two passes to find the frequent pairs.

Working of the multistage algorithm:

First Pass: The first pass of Multistage is identical to the first pass of PCY. After that pass, the frequent buckets are identified and summarized by a bitmap, again the same as in PCY. Unlike PCY, however, the second pass of Multistage does not count the candidate pairs. Instead, it uses the available main memory for another hash table, with a different hash function. Since the bitmap from the first hash table takes up only 1/32 of the available main memory, the second hash table has almost as many buckets as the first.
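The first pass can be sketched in Python as follows. This is only a minimal illustration, not a reference implementation: the function names `hash1` and `first_pass`, the integer item IDs, the toy hash function, and the tiny bucket count are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def hash1(i, j, num_buckets):
    # Hypothetical first hash function for a pair of integer item IDs.
    return (i * 31 + j) % num_buckets


def first_pass(baskets, support, num_buckets):
    """Count single items and hash every pair to a bucket, as in PCY pass 1."""
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        for i, j in combinations(items, 2):
            bucket_counts[hash1(i, j, num_buckets)] += 1
    frequent_items = {item for item, c in item_counts.items() if c >= support}
    # Summarize the hash table as a bitmap: True marks a frequent bucket.
    bitmap1 = [c >= support for c in bucket_counts]
    return frequent_items, bitmap1


baskets = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 4]]
frequent_items, bitmap1 = first_pass(baskets, support=3, num_buckets=5)
```

In this toy data set, bucket 0 becomes frequent only because the pairs {2, 3} and {1, 4} collide in it, even though neither pair is frequent on its own. The later passes exist to filter out pairs like these.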
Second Pass: On the second pass of Multistage, we again go through the file of baskets. There is no need to count the items again, since we already have those counts from the first pass. The Multistage algorithm uses the additional hash table to reduce the number of candidate pairs. However, we must retain the information about which items are frequent, since we need it on both the second and third passes. During the second pass, we hash certain pairs of items to buckets of the second hash table. A pair is hashed only if it meets the two criteria for being counted in the second pass of PCY. That is, we hash {i, j} if and only if i and j are both frequent, and the pair hashed to a frequent bucket on the first pass. As a result, the sum of the counts in the second hash table should be significantly less than the sum for the first pass. The outcome is that, even though the second hash table has only 31/32 of the number of buckets that the first table has, we expect there to be many fewer frequent buckets in the second hash table than in the first.
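A minimal sketch of this second pass is shown below. The names `hash1`, `hash2`, and `second_pass` and both toy hash functions are assumptions for illustration; the frequent items and the first bitmap are supplied directly, standing in for the output of the first pass.

```python
from itertools import combinations


def hash1(i, j, num_buckets):
    # Hypothetical hash function assumed to have been used on the first pass.
    return (i * 31 + j) % num_buckets


def hash2(i, j, num_buckets):
    # Hypothetical second hash function, independent of hash1.
    return (3 * i + 2 * j) % num_buckets


def second_pass(baskets, support, frequent_items, bitmap1, num_buckets2):
    """Hash only the pairs that PCY would count into a second hash table."""
    bucket_counts2 = [0] * num_buckets2
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)  # both items must be frequent
        for i, j in combinations(items, 2):
            # The pair must also have hashed to a frequent bucket on pass 1.
            if bitmap1[hash1(i, j, len(bitmap1))]:
                bucket_counts2[hash2(i, j, num_buckets2)] += 1
    # Summarize the second hash table as a bitmap.
    return [c >= support for c in bucket_counts2]


baskets = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 4]]
frequent_items = {1, 2, 3}                      # assumed result of pass 1
bitmap1 = [True, False, False, True, False]     # assumed result of pass 1
bitmap2 = second_pass(baskets, 3, frequent_items, bitmap1, num_buckets2=4)
```

Only the pairs {1, 2} and {2, 3} pass the filter here, so the second table receives 5 increments in total, compared with 9 for a first-pass table over the same baskets.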
Final Pass: After the second pass, the second hash table is also summarized as a bitmap, and that bitmap is stored in main memory. The two bitmaps together take up slightly less than 1/16 of the available main memory, so there is still plenty of space to count the candidate pairs on the third pass. A pair {i, j} is in C2 if and only if:
1. Both i and j are frequent items.
2. Using the first hash function, the pair {i, j} hashes to a frequent bucket of the first hash table.
3. Using the second hash function, the pair {i, j} hashes to a frequent bucket of the second hash table.
The third condition is the distinction between Multistage and PCY. It should be clear that any number of passes can be inserted between the first and last passes of the Multistage algorithm. The limiting factor is that each pass must store the bitmaps from all of the previous passes. Eventually, there is not enough space left in main memory to do the counts. No matter how many passes we use, the truly frequent pairs will always hash to a frequent bucket, so there is no way to avoid counting them.
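The three conditions for membership in C2 and the final count can be sketched as below. As before, this is an illustrative sketch under stated assumptions: `hash1`, `hash2`, and `third_pass` are hypothetical names, the hash functions are toy choices, and the frequent items and both bitmaps are given as inputs in place of the results of the earlier passes.

```python
from collections import Counter
from itertools import combinations


def hash1(i, j, num_buckets):
    # Hypothetical hash function assumed for the first hash table.
    return (i * 31 + j) % num_buckets


def hash2(i, j, num_buckets):
    # Hypothetical hash function assumed for the second hash table.
    return (3 * i + 2 * j) % num_buckets


def third_pass(baskets, support, frequent_items, bitmap1, bitmap2):
    """Count only the candidate pairs in C2 and return the frequent ones."""
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket) & frequent_items)          # condition 1
        for i, j in combinations(items, 2):
            if (bitmap1[hash1(i, j, len(bitmap1))]            # condition 2
                    and bitmap2[hash2(i, j, len(bitmap2))]):  # condition 3
                pair_counts[(i, j)] += 1
    return {pair: c for pair, c in pair_counts.items() if c >= support}


baskets = [[1, 2, 3], [1, 2], [1, 3], [2, 3], [1, 2, 4]]
frequent_pairs = third_pass(
    baskets,
    support=3,
    frequent_items={1, 2, 3},
    bitmap1=[True, False, False, True, False],
    bitmap2=[False, False, False, True],
)
# frequent_pairs == {(1, 2): 3}
```

With these inputs, the pair {2, 3} satisfies the first two conditions and would therefore be counted by PCY, but it fails the third condition and is never counted here. The truly frequent pair {1, 2} passes all three checks, as it must.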


goelaparna1520
