Feb 9. Huffman Encoding

Prefix Free Encoding

Suppose we wish to encode a set of letters $\mathcal{A}=\{a_1, \cdots, a_N\}$ in binary such that  we can unambiguously decode a binary stream into  $\mathcal{A}$. One possibility is to use a prefix free encoding. An encoding is prefix free if the code word for any $a_i$ is not the prefix of the code word for any other $a_j$.

Huffman Encoding

Suppose we are given the set $\mathcal{A}$ together with their frequencies $\{f_1, \cdots, f_N\}$. We wish to find a prefix free code such that total length of the code when used with these frequencies is minimized . Such a code is called a Huffman encoding of the set $\mathcal{A}$.

Suppose we assign a code word $c_i$ to every $s_i$ and let $b_i = |c_i|$. Then we wish to minimize $\min \sum_{i=1}^N f_i b_i$.

Observation: A prefix code maps to a full binary tree. One can see this by considering the tree where the left child of a node corresponds to a $0$ and the right child corresponds to a $1$. The $a_i$  corresponds to a leaf in this tree and the code-word $c_i$ for $a_i$ corresponds to the labels on the unique path from the root to $s_i$, while $b_i$ is the depth of $s_i$ in this tree. We will call such a tree $T$ a prefix tree. Note that the cost $c(T)$ of $T$ is given by $c(T) = \sum_{i=1}^N f_i*depth_T(a_i)$.

Greedy Algorithm for Finding a Huffman Encoding

Consider the following greedy algorithm for computing a Huffman encoding:

• Sort the letters according to their frequency so that $f_1 \ge f_2 \ge \cdots \ge f_N$.
• Pick $a_{N-1}$ and $a_N$ to have a common parent $a_{N-1}a_N$ with frequency $f_{N-1} + f_N$.
• Iterate till only one element is left

Lemma: There exists an optimal tree that combines $a_{N-1}$ and $a_N$ and they have maximum depth in this tree.

Proof: Let $T$ be an optimal prefix tree. Since $T$ is a full tree it has at least two leaves of maximum depth.  Let $a_i$ and $a_j$ be two leaves. If $a_N$ is not one of them then swap $a_i$ with $a_N$ (if $a_i \neq a_N$). Similarly swap $a_j$ with $a_{N-1}$ if $a_{N-1}$ is not in $\{a_i, a_j\}$. Note that the cost of this tree $T'$ can only be lower than $T$ since $a_{N-1}, a_N$ are the least frequent items.

Lemma: Let $T$ be an optimal tree for $\mathcal{A} = \{a_1,\cdots, a_N\}$with frequencies $\{f_1,\cdots, f_N\}$ and let $T'$ be an optimal tree for $\mathcal{A}' = \{a_1, \cdots, a_{N-1}a_N\}$  with frequencies $\{f_1,\cdots, f_{N-2}, f_{N-1} + f_N\}$.  Then $T'$ is a subtree of $T$.

Proof: Let $T'$ be the tree where we delete the leaves $a_{N-1}$ and $a_N$ and assign $a_{N-1}a_N$ to the parent of $a_{N-1}$ and $a_N$ in $T$. Note that $c(T') = c(T) - f_{N-1} - f_N$.  We claim that $T'$ must be an optimal tree for $\mathcal{A}'$ . Suppose not, and let $\tilde{T}$ be an optimal tree for $\mathcal{A}'$. Then add two leaves $a_{N-1}$ and $a_N$ to the node corresponding to $a_{N-1}a_N$ in $\tilde{T}$ to get the tree $\tilde{T}'$. Note that $c(\tilde{T}') = c(\tilde{T}) + f_{N-1} + f_N < c(T') + f_{N-1} +f_N = c(T)$. This contradicts the optimality of $T$ for $\mathcal{A}$.