Shannon entropy is a concept in information theory that measures the uncertainty or randomness associated with a set of data. It was introduced by Claude Shannon, a mathematician and electrical engineer, in his 1948 paper “A Mathematical Theory of Communication”.
Shannon entropy is widely used in cryptography, data compression, and other fields where information needs to be encoded and transmitted efficiently and securely.
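Concretely, for a source whose symbols occur with probabilities p_i, Shannon defined the entropy (in bits) as:

```latex
H = -\sum_i p_i \log_2 p_i
```

For the string "1223334444" (length 10), the character frequencies are 0.1, 0.2, 0.3 and 0.4, giving H = -(0.1 log2 0.1 + 0.2 log2 0.2 + 0.3 log2 0.3 + 0.4 log2 0.4) ≈ 1.8464 bits per character.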
(* Shannon Entropy *)
module CharMap = Map.Make (Char)

let entropy s =
  (* count: bump the tally for character c in the map *)
  let count map c =
    CharMap.update c (function Some n -> Some (n +. 1.) | None -> Some 1.) map
  (* calc: accumulate n * log2 n over the frequency map *)
  and calc _ n sum = sum +. n *. Float.log2 n in
  let sum = CharMap.fold calc (String.fold_left count CharMap.empty s) 0.
  and len = float (String.length s) in
  (* log2 len - sum / len is algebraically equal to -sum(p * log2 p) *)
  Float.log2 len -. sum /. len

let () =
  entropy "1223334444" |> string_of_float |> print_endline
  (* prints 1.84643934467 *)
The implementation of the Shannon entropy algorithm is written in OCaml. It takes a string as input and returns the entropy of the string as a floating-point number.
1. Define a module CharMap as a map keyed by characters, using the standard library's Map.Make functor.
2. Define the entropy function, which takes a string s as an argument.
3. Define the count function, which takes a CharMap map and a character c as arguments. It increments the occurrence count for c in the map; if c is not yet in the map, it is added with a count of 1.
4. Define the calc function, which takes a character, its frequency in the string, and a running sum as arguments. It adds the product of the frequency and the base-2 logarithm of the frequency to the sum.
5. Fold count over the input string s with String.fold_left, starting from an empty CharMap, to get a mapping of each character to its frequency.
6. Fold calc over the CharMap obtained in step 5 with CharMap.fold to get the sum needed for the Shannon entropy of the input string.
7. Calculate the length of the input string as a float using String.length and perform the final calculation, log2 len - sum / len, to get the entropy as a floating-point number.
8. The entropy value is returned from the entropy function.
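As a cross-check on the steps above, here is a direct transcription of the entropy formula for the example string "1223334444" (a sketch with hard-coded counts, not part of the original program):

```ocaml
(* Character counts in "1223334444": '1' once, '2' twice, '3' three
   times, '4' four times; the string has length 10. *)
let h =
  let counts = [ 1.; 2.; 3.; 4. ] and len = 10. in
  List.fold_left
    (fun acc n ->
      let p = n /. len in
      (* accumulate -p * log2 p for each distinct character *)
      acc -. (p *. Float.log2 p))
    0. counts

let () = Printf.printf "%.4f\n" h (* prints 1.8464 *)
```

This agrees with the value printed by the entropy function, since log2 len - sum / len is just the formula above with the division by len factored out.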
The Shannon entropy algorithm calculates the frequency of each character in a given string and then applies the Shannon entropy formula using that frequency distribution.
Let n be the length of the input string. Then:
- The String.fold_left call in step 5 traverses the input string and performs one map update per character. Each update on a balanced-tree map takes O(log k) time, where k is the number of distinct characters in the string, so this pass takes O(n log k) time.
- The CharMap.fold call in step 6 visits each of the k entries once, doing constant work per entry, so it takes O(k) time. Note that k is small: OCaml's char type is a single byte, so k can never exceed 256 (a figure like 143,000 would apply to Unicode code points, which this implementation does not handle).
- The final calculation in step 7 takes constant time.
Therefore, the overall time complexity of the algorithm is O(n log k + k). Since k is bounded by 256, this is effectively O(n): in practice the single pass over the input string dominates for all but very short inputs.
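Because OCaml's char is one byte, the bound on k can be checked directly; a minimal sketch (the distinct helper below is ours, not part of the original program):

```ocaml
module CharSet = Set.Make (Char)

(* distinct: number of distinct bytes occurring in s *)
let distinct s =
  CharSet.cardinal
    (String.fold_left (fun set c -> CharSet.add c set) CharSet.empty s)

let () =
  (* the example string has only 4 distinct characters *)
  Printf.printf "%d\n" (distinct "1223334444");
  (* even a string cycling through every byte value caps out at 256 *)
  Printf.printf "%d\n"
    (distinct (String.init 1024 (fun i -> Char.chr (i mod 256))))
```

However long the input grows, the frequency map never holds more than 256 entries, which is why the n-dependent term dominates.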