Huffman Coding

Introduction:

Huffman Coding is a lossless data compression algorithm that assigns variable-length, prefix-free codes to symbols according to their frequency in the input data: frequent symbols get short codes and rare symbols get long ones. David A. Huffman developed the algorithm in 1951 while he was an Sc.D. student at MIT, and published it in 1952. The algorithm is widely used in data compression and is a fundamental concept in information theory.

Implementation:

(* Huffman Coding *)

type 'a huffman_tree =
  | Leaf of 'a
  | Node of 'a huffman_tree * 'a huffman_tree
 
module HSet = Set.Make
  (struct
     type t = int * char huffman_tree (* pair of frequency and the tree *)
     let compare = compare
       (* We can use the built-in compare function to order this: it will order
          first by the first element (frequency) and then by the second (the tree),
          the latter of which we don't care about but which helps prevent elements
          from being equal, since Set does not allow duplicate elements *)
   end);;

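(* Build the Huffman tree from a list of (character, frequency) pairs.
   Raises Not_found if charFreqs is empty. *)
let build_tree charFreqs =
  (* The set acts as a priority queue ordered by frequency *)
  let leaves = HSet.of_list (List.map (fun (c,f) -> (f, Leaf c)) charFreqs) in
  let rec aux trees =
    (* Take out the lowest-frequency tree *)
    let f1, a = HSet.min_elt trees in
    let trees' = HSet.remove (f1,a) trees in
    if HSet.is_empty trees' then
      (* Only one tree was left: it is the finished Huffman tree *)
      a
    else
      (* Take out the next lowest and merge the two under a new node *)
      let f2, b = HSet.min_elt trees' in
      let trees'' = HSet.remove (f2,b) trees' in
      let trees''' = HSet.add (f1 + f2, Node (a, b)) trees'' in
      aux trees'''
  in
  aux leaves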

let rec print_tree code = function
  | Leaf c ->
      (* code is accumulated in reverse, so flip it before printing *)
      Printf.printf "%c\t%s\n" c (String.concat "" (List.rev code))
  | Node (l, r) ->
      print_tree ("0"::code) l;
      print_tree ("1"::code) r
 
let () =
  let str = "this is an example for huffman encoding" in
  let charFreqs = Hashtbl.create 42 in
  String.iter (fun c ->
      let old =
        try Hashtbl.find charFreqs c
        with Not_found -> 0 in
      Hashtbl.replace charFreqs c (old+1)
    ) str;

  let charFreqs = Hashtbl.fold (fun c f acc -> (c,f)::acc) charFreqs [] in
  let tree = build_tree charFreqs in
  print_string "Symbol\tHuffman code\n";
  print_tree [] tree

The implementation is written in the OCaml programming language. It takes a string as input, counts the frequency of each character in it, and builds a binary tree from those frequencies. The path from the root of this tree to a leaf gives the binary code for that leaf's character. Characters with higher frequency are represented by shorter codes, and characters with lower frequency by longer codes.
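The listing above only prints the codes. As a minimal sketch of how they might be used, the following turns a tree into a lookup table and encodes a string with it. The helpers code_table and encode and the hand-built example_tree are assumptions made for illustration; they are not part of the original program. The tree type is repeated so the sketch runs on its own.

```ocaml
(* Same tree type as in the listing, repeated so this sketch is standalone *)
type 'a huffman_tree =
  | Leaf of 'a
  | Node of 'a huffman_tree * 'a huffman_tree

(* Hypothetical helper: collect (symbol, code) pairs, appending "0" for a
   left branch and "1" for a right branch *)
let code_table tree =
  let rec walk prefix acc = function
    | Leaf c -> (c, prefix) :: acc
    | Node (l, r) -> walk (prefix ^ "1") (walk (prefix ^ "0") acc l) r
  in
  List.rev (walk "" [] tree)

(* Hypothetical helper: encode a string as a string of '0'/'1' characters.
   Raises Not_found if a character has no entry in the table. *)
let encode table s =
  let buf = Buffer.create (String.length s) in
  String.iter (fun c -> Buffer.add_string buf (List.assoc c table)) s;
  Buffer.contents buf

(* Hand-built tree for the assumed frequencies a:5, b:2, c:1 *)
let example_tree = Node (Node (Leaf 'c', Leaf 'b'), Leaf 'a')

let () =
  let table = code_table example_tree in
  List.iter (fun (c, code) -> Printf.printf "%c\t%s\n" c code) table;
  (* 'a' is the most frequent symbol and gets the one-bit code "1" *)
  print_endline (encode table "abca")
```

With this tree, "abca" is encoded as 101001: six bits for four characters. A real encoder would pack the bits into bytes rather than keep them as text; the string form is used here only to keep the sketch readable.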

Step-by-step Explanation:

1. The input string is provided to the program.
2. A hash table is built that maps each character to the number of times it occurs in the input.
3. The hash table is converted into a list of tuples, each containing a character and its frequency.
4. The list of tuples is converted into a set of (frequency, leaf) pairs, one leaf per distinct character. The set is ordered by frequency, so it serves as a priority queue.
5. The two lowest-frequency trees are removed from the set and combined into a new node whose frequency is the sum of the two, and that node is added back to the set. This step is repeated until only one tree remains.
6. The code for each character is generated by traversing the binary tree: following a left child appends a ‘0’ and following a right child appends a ‘1’. On reaching a leaf, the path taken from the root is the code for that character.
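The same left-is-0, right-is-1 traversal can be run in the other direction to decode. The sketch below is an illustration under stated assumptions: decode and example_tree are hypothetical names that do not appear in the original program, and the bits are represented as a string of '0' and '1' characters for readability.

```ocaml
(* Same tree type as in the listing, repeated so this sketch is standalone *)
type 'a huffman_tree =
  | Leaf of 'a
  | Node of 'a huffman_tree * 'a huffman_tree

(* Hypothetical helper: walk down from the root one bit at a time; on
   reaching a leaf, emit its character and restart at the root *)
let decode tree bits =
  let buf = Buffer.create 16 in
  let node = ref tree in
  String.iter
    (fun bit ->
      let next =
        match !node, bit with
        | Node (l, _), '0' -> l
        | Node (_, r), '1' -> r
        | _ -> invalid_arg "decode: bad bit or single-leaf tree"
      in
      match next with
      | Leaf c -> Buffer.add_char buf c; node := tree
      | Node _ -> node := next)
    bits;
  (* Finishing anywhere but the root means the last code was cut short *)
  if !node != tree then invalid_arg "decode: truncated input";
  Buffer.contents buf

(* Hand-built tree for the assumed frequencies a:5, b:2, c:1 *)
let example_tree = Node (Node (Leaf 'c', Leaf 'b'), Leaf 'a')

let () = print_endline (decode example_tree "101001")
```

No separator is needed between codes: because every character sits at a leaf, no code is a prefix of another, so the walk always knows where one code ends and the next begins.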

Complexity Analysis:

Let n be the length of the input string and k the number of distinct characters in it. Constructing the frequency hash table takes O(n) expected time. Converting the hash table into a list of tuples takes O(k). Creating the set of leaves takes O(k log k), and building the binary tree also takes O(k log k): it performs k - 1 merges, each involving a constant number of set operations that cost O(log k) when each comparison is treated as constant time. Generating the code for one character takes time proportional to the depth of its leaf, which is O(log k) when the tree is balanced but can reach k - 1 when the frequencies are highly skewed.

In summary, the overall time complexity of the Huffman Coding algorithm is O(n + k log k), where n is the length of the input string and k is the number of unique characters in it: O(n) to count the characters and O(k log k) to build the tree. The algorithm is relatively efficient, and savings of roughly 20% to 90% are typical for text files, depending mainly on how skewed the character distribution is.
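To make the size saving concrete, the sketch below compares the number of code bits with a fixed 8 bits per character. The helper encoded_bits, the example tree, and the frequencies are assumptions chosen for illustration, and the figure ignores the space needed to store the tree or code table alongside the data.

```ocaml
(* Same tree type as in the listing, repeated so this sketch is standalone *)
type 'a huffman_tree =
  | Leaf of 'a
  | Node of 'a huffman_tree * 'a huffman_tree

(* Hypothetical helper: total code bits for a message whose symbol counts
   are given in freqs. Each symbol costs its count times its leaf depth. *)
let encoded_bits tree freqs =
  let rec go depth = function
    | Leaf c -> depth * List.assoc c freqs
    | Node (l, r) -> go (depth + 1) l + go (depth + 1) r
  in
  go 0 tree

let () =
  (* Assumed example: frequencies a:5, b:2, c:1 and a matching tree *)
  let tree = Node (Node (Leaf 'c', Leaf 'b'), Leaf 'a') in
  let freqs = [('a', 5); ('b', 2); ('c', 1)] in
  let huffman = encoded_bits tree freqs in
  let fixed = 8 * List.fold_left (fun acc (_, f) -> acc + f) 0 freqs in
  Printf.printf "Huffman: %d bits, fixed 8-bit: %d bits, saving: %.1f%%\n"
    huffman fixed
    (100. *. (1. -. float_of_int huffman /. float_of_int fixed))
```

For this tiny, heavily skewed example the message needs 11 bits instead of 64. Real text has a larger alphabet and a flatter distribution, so the saving is usually smaller.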