Huffman Coding
The idea behind Huffman coding is to find a way to compress
the storage of data using variable length codes. Our standard
model of storing data uses fixed length codes. For example,
each character in a text file is stored using 8 bits. There are
certain advantages to this system. When reading a file, we
know to ALWAYS read 8 bits at a time to read a single
character. But as you might imagine, this coding scheme is
inefficient, because some characters are used more frequently
than others. Let's say that the character 'e' is used 10 times
more frequently than the character 'q'. It would then be
advantageous to use, say, a 7-bit code for 'e' and a 9-bit code
for 'q', because that could shorten our overall message length.
Huffman coding finds the optimal way to take advantage of
varying character frequencies in a particular file. On average,
using Huffman coding on standard files can shrink them
anywhere from 10% to 30%, depending on the character
distribution. (The more skewed the distribution, the better
Huffman coding will do.)
The idea behind the coding is to give less frequent characters
and groups of characters longer codes. Also, the code is
constructed so that no codeword is a prefix of any other
codeword. This prefix-free property is what makes the code
easy to decipher unambiguously.
Building a Huffman Tree
The easiest way to see how this algorithm works is to work
through an example. Let's assume that after scanning a file we
find the following character frequencies:
Character Frequency
'a' 12
'b' 2
'c' 7
'd' 13
'e' 14
'f' 85
Now, create a binary tree for each character that also stores
the frequency with which it occurs.
The algorithm is as follows: find the two binary trees in the
list whose roots store the minimum frequencies. Join these two
trees under a newly created common node that stores NO
character but stores the sum of the frequencies of all the
nodes below it. So our picture looks as follows:
      9       12 'a'    13 'd'    14 'e'    85 'f'
     / \
  2 'b' 7 'c'
Now, repeat this process until only one tree is left:
         21
        /  \
      9   12 'a'    13 'd'    14 'e'    85 'f'
     / \
  2 'b' 7 'c'

         21             27
        /  \           /   \
      9   12 'a'    13 'd' 14 'e'    85 'f'
     / \
  2 'b' 7 'c'

               48
             /    \
         21         27
        /  \       /   \
      9   12 'a' 13 'd' 14 'e'    85 'f'
     / \
  2 'b' 7 'c'

                   133
                  /   \
               48      85 'f'
             /    \
         21         27
        /  \       /   \
      9   12 'a' 13 'd' 14 'e'
     / \
  2 'b' 7 'c'
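This repeated "merge the two smallest trees" step is exactly what a
min-heap (priority queue) provides. Below is a minimal Python sketch
of the construction; the names Node and build_huffman_tree are our
own illustrative choices, not from any particular library:

    import heapq

    class Node:
        """A Huffman tree node: a leaf stores a character, an internal node stores None."""
        def __init__(self, freq, char=None, left=None, right=None):
            self.freq = freq
            self.char = char
            self.left = left
            self.right = right

    def build_huffman_tree(freqs):
        """freqs maps character -> frequency; returns the root of the finished tree."""
        heap = []   # entries are (frequency, tie-breaker, tree)
        count = 0   # tie-breaker so heapq never has to compare Node objects
        for ch, f in freqs.items():
            heapq.heappush(heap, (f, count, Node(f, ch)))
            count += 1
        while len(heap) > 1:
            # Pop the two trees whose roots store the minimum frequencies...
            f1, _, t1 = heapq.heappop(heap)
            f2, _, t2 = heapq.heappop(heap)
            # ...and join them under a new node that stores NO character,
            # only the sum of the frequencies below it.
            heapq.heappush(heap, (f1 + f2, count, Node(f1 + f2, left=t1, right=t2)))
            count += 1
        return heap[0][2]

    root = build_huffman_tree({'a': 12, 'b': 2, 'c': 7, 'd': 13, 'e': 14, 'f': 85})
    print(root.freq)   # 133, the total number of characters in the file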
Once the tree is built, each leaf node corresponds to a letter
with a code. To determine the code for a particular letter, walk
the path from the root down to its leaf. For each step to the
left, append a 0 to the code; for each step to the right, append
a 1. Thus, for the tree above we get the following codes:
Letter Code
'a' 001
'b' 0000
'c' 0001
'd' 010
'e' 011
'f' 1
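The same walk can be written as a short recursion, and the prefix-free
property shows up directly in how easy decoding is. A sketch continuing
from the build_huffman_tree code above (note that which child ends up
on the left depends on tie-breaking, so a different implementation may
produce a different but equally valid set of codes):

    def assign_codes(node, prefix="", codes=None):
        """Record the root-to-leaf path (0 = left, 1 = right) as each leaf's code."""
        if codes is None:
            codes = {}
        if node.char is not None:       # leaf: the path walked so far is the code
            codes[node.char] = prefix
        else:                           # internal node: recurse down both sides
            assign_codes(node.left, prefix + "0", codes)
            assign_codes(node.right, prefix + "1", codes)
        return codes

    def decode(bits, root):
        """Prefix-freeness lets us decode greedily: walk down from the root on
        each bit, emit a character whenever a leaf is reached, then restart."""
        out, node = [], root
        for b in bits:
            node = node.left if b == "0" else node.right
            if node.char is not None:
                out.append(node.char)
                node = root
        return "".join(out)

    codes = assign_codes(root)
    print(codes)                                                # e.g. 'f' -> '1', 'b' -> '0000'
    print(decode(codes['f'] + codes['a'] + codes['d'], root))   # fad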
Why are we guaranteed that one code is NOT the prefix of
another?
Find a set of valid Huffman codes for a file with the given
character frequencies:
Character Frequency
'a' 15
'b' 7
'c' 5
'd' 23
'e' 17
'f' 19
Calculating Bits Saved
All we need to do for this calculation is figure out how many
bits are originally used to store the data and subtract from that
how many bits are used to store the data using the Huffman
code.
In the first example given, since we have six characters, let's
assume each is stored with a three-bit code. Since there are 133
characters in total, the number of bits used is 3*133 = 399.
Now, using the Huffman coding frequencies we can calculate
the new total number of bits used:
Letter   Code   Frequency   Total Bits
'a'      001       12           36
'b'      0000       2            8
'c'      0001       7           28
'd'      010       13           39
'e'      011       14           42
'f'      1         85           85
                   Total        238
Thus, we saved 399 - 238 = 161 bits, or roughly 40% of the
storage space. Of course, there is a small detail we haven't
taken into account here. What is that?
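As a quick sanity check, the same arithmetic can be reproduced from
the codes dict produced by the earlier sketch (again, a hedged snippet
reusing those illustrative names):

    freqs = {'a': 12, 'b': 2, 'c': 7, 'd': 13, 'e': 14, 'f': 85}
    fixed = 3 * sum(freqs.values())                              # 3 * 133 = 399 bits
    huff = sum(len(codes[ch]) * f for ch, f in freqs.items())    # 238 bits
    print(fixed - huff, f"({(fixed - huff) / fixed:.0%} saved)") # 161 (40% saved)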
Huffman Coding is an Optimal Prefix Code
Of all prefix codes for a file, Huffman coding produces an
optimal one. In all of our examples from class on Monday, we
found that Huffman coding saved us a fair percentage of
storage space. But, we can show that no other prefix code can
do better than Huffman coding.
First, we will show the following:
Let x and y be two characters with the least frequencies in a
file. Then there exists an optimal prefix code for the file in
which the codewords for x and y have the same length and differ
only in the last bit.
Here is how we will prove this:
Assume that a tree T stores an optimal prefix code, and let a
and b be the characters stored at a pair of sibling nodes at the
maximum depth of the tree. We will show that we can create a
tree T' with x and y as siblings at the maximum depth such that
the number of bits used for the coding with T' is no greater
than with T. (Let f(a) denote the frequency of the character a.
Without loss of generality, assume f(x) ≤ f(y) and f(a) ≤ f(b).
It also follows that f(x) ≤ f(a) and f(y) ≤ f(b). Let h be the
height of the tree T. Let x have a depth of dx in T and y have a
depth of dy in T.)
Create T' as follows: swap the nodes storing a and x, and then
swap the nodes storing b and y. Now, we have that the depth of
x and y in T' is h, the depth of a is dx and the depth of b is dy in
T'.
Now, let's calculate the change in the number of bits used for
the coding with tree T' compared with the coding with tree T.
(Note: since all other codes remain unchanged, we only need to
analyze the total number of bits it takes to code a, b, x, and y.)
# bits for tree T (for a, b, x, and y) = h·f(a) + h·f(b) + dx·f(x) + dy·f(y)
# bits for tree T' (for a, b, x, and y) = dx·f(a) + dy·f(b) + h·f(x) + h·f(y)

Difference =
h·f(a) + h·f(b) + dx·f(x) + dy·f(y) - (dx·f(a) + dy·f(b) + h·f(x) + h·f(y))
= h·f(a) + h·f(b) + dx·f(x) + dy·f(y) - dx·f(a) - dy·f(b) - h·f(x) - h·f(y)
= h(f(a) - f(x)) + h(f(b) - f(y)) + dx(f(x) - f(a)) + dy(f(y) - f(b))
= h(f(a) - f(x)) + h(f(b) - f(y)) - dx(f(a) - f(x)) - dy(f(b) - f(y))
= (h - dx)(f(a) - f(x)) + (h - dy)(f(b) - f(y))
Notice that both terms above must be non-negative, since we
know that h ≥ dx, h ≥ dy, f(a) ≥ f(x), and f(b) ≥ f(y). So the
difference is ≥ 0, which means T' uses no more bits than T. But
T was assumed to be optimal, so T' cannot use strictly fewer
bits either; the difference must be exactly 0. Thus, there is an
optimal coding tree in which x and y (the two characters with
the lowest frequencies) are siblings at the maximum depth.
In layman's terms: give me what you think is an optimal coding
tree, and I can create a new one from it, using no more bits,
with the two lowest-frequency characters as siblings at the
bottom of the tree.
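To make the algebra concrete, here is a small numeric check of the
exchange argument. The numbers are made up purely to exercise the
identity; any values satisfying the four inequalities would do:

    # Made-up values: a, b sit at depth h; x, y sit at depths dx, dy.
    h, dx, dy = 5, 3, 4
    f_a, f_b, f_x, f_y = 7, 9, 2, 4

    bits_T      = h*f_a + h*f_b + dx*f_x + dy*f_y   # before the swaps
    bits_Tprime = dx*f_a + dy*f_b + h*f_x + h*f_y   # after swapping a<->x, b<->y

    diff = (h - dx)*(f_a - f_x) + (h - dy)*(f_b - f_y)
    assert bits_T - bits_Tprime == diff and diff >= 0
    print(bits_T, bits_Tprime, diff)                # 102 87 15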
To complete the proof, you'll notice that by construction,
Huffman coding ALWAYS makes sure that the nodes with the
lowest frequencies are at the bottom of the coding tree, all the
way through the construction. (You can't find any pair of
nodes for which this isn't true.) Technically, to carry out the
proof, you'd use induction, but we'll skip that for now...