In 0-error data compression, we proved that the infimum of the expected codeword length equals the entropy. We now go further and explore entropy in terms of the asymptotic behavior of i.i.d. sequences, known as the Asymptotic Equipartition Property (AEP).
We consider a discrete random sequence with each variable $X_k, k\geq 1$ taking values in a finite alphabet $\mathcal{X}$ with probability distribution $p_X$. All of the $n$ variables are i.i.d., i.e.,
\[\begin{align} X^n&=(X_1, X_2,\cdots, X_n)\\ p_{X^n}(x^n)&=\prod_{i=1}^{n}p_{X_i}(x_i)\\ p_{X_i}(x_i)&=p_X(x_i), \forall i=1,2,\cdots,n, \end{align}\]
where $p_X: \mathcal{X}\to [0,1]$ denotes the distribution of $X$, and the last equality states that all the $X_i$ share the same distribution $p_{X_i}=p_X$.* Note: In the following relations, we use $p_X$ to denote the distribution of any single $X_i$ (rather than of $X^n$), without caring whose distribution it is.
In this context, logarithms are taken to base 2 by default.
There are several senses in which two RVs can be considered equivalent.
Definition 1:
(Equivalence of RVs):
Let $X,Y$ be two RVs. We say $X$ and $Y$ are:
1. equal in distribution, denoted $X\stackrel{d}{=}Y$, iff $F_X(x)=F_Y(x),\forall x$;
2. equal almost surely, denoted $X\stackrel{a.s.}{=}Y$, iff $Pr\{X=Y\}=1$.
#
Lemma 1: Let $X,Y$ be RVs.
1. If $M_X(t)=M_Y(t),\forall t\in \mathbb{R}$, where $M$ denotes the moment generating function, then $X\stackrel{d}{=}Y$.
2. $X\stackrel{a.s.}{=}Y\iff d_{\infty}(X,Y)\stackrel{\Delta}{=} \operatorname{ess}\sup_{\omega}|X(\omega)-Y(\omega)|=0$.
#
The convergence of RVs describes how a sequence of essentially random or unpredictable events (characterized by $\{X_n\}_{n\in\mathbb{N}_+}$) can settle into patterns that are intuitively unchanging.
Definition 2:
(Convergence in Distribution):
A sequence $X_1,X_2,\cdots$ of RVs is said to converge in distribution, or converge weakly, to a RV $X$, denoted $X_n\stackrel{d}{\to}X$ or $X_n \Rightarrow X$, iff $\lim\limits_{n\to\infty} F_{X_n}(x)=F_X(x)$ holds for all $x\in\mathbb{R}$ at which $F_X$ is continuous.
#
The restriction to continuity points is essential. Here is a counterexample:
Example 1: Consider $\{X_n\}_{n\in\mathbb{N}_+}$ with $X_n\sim U([0,1/n])$; then the CDFs converge to that of the degenerate RV $X=0$ at every $x\neq 0$. The apparent contradiction is that $F_X(0)=1$ while $F_{X_n}(0)=0$ for all $n\in\mathbb{N}_+$: there is no convergence at the discontinuity point $x=0$. #
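A quick numerical sketch of this example (hypothetical script, assuming `numpy` is available; not part of the source): we evaluate the CDF of $U([0,1/n])$ at a few points and watch it converge to the degenerate CDF everywhere except at $x=0$.

```python
import numpy as np

def F_Xn(x, n):
    """CDF of U([0, 1/n]): 0 below 0, linear on [0, 1/n], 1 above."""
    return float(np.clip(n * x, 0.0, 1.0))

def F_X(x):
    """CDF of the degenerate RV X = 0 (a right-continuous step at 0)."""
    return 1.0 if x >= 0 else 0.0

for x in (-0.1, 0.0, 0.01, 0.1):
    vals = [F_Xn(x, n) for n in (1, 10, 100, 1000)]
    print(f"x={x:5}: F_Xn(x) -> {vals},  F_X(x) = {F_X(x)}")
# At every continuity point x != 0 the limit matches F_X(x);
# at the discontinuity x = 0 we get F_Xn(0) = 0 != 1 = F_X(0) for all n.
```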
Weak convergence says nothing about independence or correlation. Here are some related lemmas:
Lemma 2: (Scheffé's lemma): Let $f_n$ be a sequence of integrable functions on a measure space $(X,\Sigma,\mu)$ that converges a.e. to another integrable function $f$. Then:
\[\begin{align} \int |f_n-f|d\mu\to 0\iff \int |f_n|d\mu \to\int |f|d\mu \end{align}\]So a.e. pointwise convergence of the PDFs implies convergence in distribution. #
Lemma 3:
(Lévy's continuity theorem):
The sequence $\{X_n\}$ converges in distribution to $X$ iff the sequence of corresponding characteristic functions $\{\phi_n\}$ converges pointwise to the characteristic function $\phi$ of $X$.
#
Definition 3:
(Convergence in Probability):
A sequence $X_1,X_2,\cdots$ of RVs is said to converge in probability to a RV $X$, denoted $X_n\stackrel{p}{\to}X$, iff
\[\begin{align} \forall \epsilon>0,\quad \lim\limits_{n\to\infty}Pr\{|X_n-X|>\epsilon\}=0. \end{align}\]#
Lemma 4: The following 3 formulations of convergence in probability are equivalent:
1. $\forall\delta>0,\quad \lim\limits_{n\to\infty}Pr\{|X_n-X|>\delta\}=0$.
2. $\forall\epsilon>0,\forall\delta>0,\exists N\in\mathbb{N}_+,\forall n\geq N,\quad Pr\{|X_n-X|>\delta\}<\epsilon$.
3. $\forall\epsilon>0,\exists N\in\mathbb{N}_+,\forall n\geq N,\quad Pr\{|X_n-X|>\epsilon\}<\epsilon$.
#
Proof: $1\iff 2$ by the definition of the limit, and $2\to 3$ by letting $\delta=\epsilon$. We prove $3\to 2$. To show that 2 holds $\forall\delta>0$, we distinguish 3 cases:
1. $\delta=\epsilon$: this is exactly statement 3.
2. $\delta<\epsilon$: as $\delta>0$, applying 3 to $\delta$ gives $\exists N\in\mathbb{N}_+,\forall n\geq N, Pr\{|X_n-X|>\delta\}<\delta<\epsilon$.
3. $\delta>\epsilon$: then $\exists N\in\mathbb{N}_+,\forall n\geq N, Pr\{|X_n-X|>\delta\}\leq Pr\{|X_n-X|>\epsilon\}<\epsilon$.
So $2\iff 3$. #
Definition 4:
(Convergence Almost Everywhere):
A sequence $X_1,X_2,\cdots$ of RVs is said to converge almost everywhere, or with probability 1, to a RV $X$, denoted $X_n\stackrel{a.s.}{\to}X$, iff
\[\begin{align} P\left(\{\omega\in\Omega:\lim\limits_{n\to\infty}X_n(\omega)=X(\omega)\}\right)=1, \end{align}\]or equivalently,
\[\begin{align} \forall \epsilon>0,\quad P\left( \limsup_{n\to\infty}\{\omega\in\Omega:|X_n(\omega)-X(\omega)|>\epsilon \} \right)=0. \end{align}\]#
Definition 5:
(Convergence Everywhere):
A sequence $X_1,X_2,\cdots$ of RVs is said to converge everywhere, or converge pointwise, to a RV $X$, denoted $X_n\to X$, iff
\[\begin{align} \forall \omega\in\Omega,\quad \lim\limits_{n\to\infty}X_n(\omega)=X(\omega), \end{align}\]or equivalently,
\[\begin{align} \{\omega\in\Omega:\lim\limits_{n\to\infty}X_n(\omega)=X(\omega)\}=\Omega. \end{align}\]#
Definition 6:
(Convergence in Mean):
Let $r\geq 1$. A sequence $X_1,X_2,\cdots$ of RVs is said to converge in the $r$-th mean, or in the $L^r$-norm, to a RV $X$, denoted $X_n\stackrel{L^r}{\to} X$, iff the $r$-th absolute moments $E[|X_n|^r]$ and $E[|X|^r]$ exist, and
\[\begin{align} \lim\limits_{n\to\infty}E[|X_n-X|^r]=0. \end{align}\]
#
Lemma 5:
(Implications between RV convergences):
The relations between the above kinds of convergence are as follows: convergence everywhere implies almost-sure convergence; almost-sure convergence implies convergence in probability; convergence in $L^r$ implies convergence in probability (and $L^s$ convergence implies $L^r$ convergence for $s\geq r\geq 1$); and convergence in probability implies convergence in distribution.
#
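The gap between convergence in probability and almost-sure convergence can be illustrated by the classic "typewriter" sequence of indicator RVs on $\Omega=[0,1)$, which sweeps intervals of shrinking width: $Pr\{X_n=1\}\to 0$, yet every $\omega$ is hit in every block, so $X_n(\omega)$ converges for no $\omega$. A minimal sketch (a standard counterexample, not from the source; plain Python with `numpy`):

```python
import numpy as np

def typewriter_interval(n):
    """n-th interval of the typewriter sequence: block m holds 2^m intervals of width 2^-m."""
    m = int(np.log2(n + 1))      # block index
    k = n + 1 - 2**m             # position of the interval inside block m
    return k / 2**m, (k + 1) / 2**m

omega = 0.37                     # a fixed sample point in [0, 1)
hits = [n for n in range(63)
        if typewriter_interval(n)[0] <= omega < typewriter_interval(n)[1]]
print("X_n(0.37) = 1 at n =", hits)     # one hit per block, forever: no a.s. convergence
widths = [np.diff(typewriter_interval(n))[0] for n in range(8)]
print("Pr{X_n = 1} =", widths, "-> 0")  # but convergence in probability to 0 holds
```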
We review the LLN: it describes the outcome of performing the same experiment a large number of times, namely that the sample average tends to the expected value (when it exists).* Note: The average of trials may not converge in some cases because of heavy tails, such as the Cauchy distribution (which has no expectation) and the Pareto distribution with $\alpha<1$ (which has infinite expectation). We mainly consider an i.i.d. Lebesgue integrable RV sequence $\{X_k\}_{k\in\mathbb{N}}$ generated by $X$, where $E[X]=\mu$ and $Var(X)=\sigma^2$.
Theorem 1: (Khinchin’s weak LLN): The sample average converges in probability towards the expected value
\[\begin{align} \bar{X}_n\stackrel{p}{\to}\mu,\quad n\to\infty \end{align}\]#
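A minimal simulation of the weak LLN (hypothetical sketch, assuming `numpy`): we estimate $Pr\{|\bar X_n-\mu|>\epsilon\}$ for a Bernoulli source and watch it shrink with $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, eps, trials = 0.3, 0.05, 2000                # Bernoulli(0.3) source, deviation threshold
for n in (10, 100, 1000, 10000):
    xbar = rng.binomial(n, mu, size=trials) / n  # 'trials' independent sample means
    dev = np.mean(np.abs(xbar - mu) > eps)       # empirical Pr{|Xbar_n - mu| > eps}
    print(f"n={n:6d}  Pr(|Xbar - mu| > {eps}) ~ {dev:.4f}")
# The estimated deviation probability tends to 0, as the weak LLN asserts.
```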
Actually, finite variance is not necessary, though it simplifies the related proofs. Large or infinite variance may slow the convergence, but the LLN still holds. The convergence of $\bar{X}_n$ to a degenerate RV requires $Var(\bar{X}_n)=\sigma^2/n$ to converge to 0, which may hold even when the individual variances grow without bound. In Chebyshev's weak LLN, we relax the restrictions of identical distribution and finite common variance:
Theorem 2: (Chebyshev's weak LLN): For an independent Lebesgue integrable RV sequence $\{X_k\}_{k\in\mathbb{N}}$ with equal expected value $\mu$, if $\lim\limits_{n\to\infty}Var(\bar{X}_n)=0$, then:
\[\begin{align} \bar{X}_n\stackrel{p}{\to}\mu,\quad n\to\infty \end{align}\]# Remark: the weak LLN may hold even when the expectation does not exist.
Example 2: Let $\{X_k\}_{k\in\mathbb{N}}$ be an independent zero-mean Gaussian RV sequence with $Var(X_k)=\frac{2k}{\log (k+1)}$, which is not bounded.
\[\begin{align} Var(\bar{X}_n)=Var\left(\frac{1}{n}\sum_{i=1}^nX_i\right)=\frac{1}{n^2}\sum_{i=1}^nVar(X_i)=\frac{1}{n^2}\sum_{i=1}^n\frac{2i}{\log(i+1)}\sim\frac{1}{\log n}\to 0, \end{align}\]which satisfies the condition of the weak LLN. #
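The variance computation above can be checked numerically (a sketch under the corrected $Var(X_k)=2k/\log(k+1)$ with natural logs; `numpy` assumed):

```python
import numpy as np

for n in (10**2, 10**4, 10**6):
    i = np.arange(1, n + 1, dtype=float)
    var_xbar = np.sum(2 * i / np.log(i + 1)) / n**2   # Var(Xbar_n) for independent X_i
    print(f"n={n:8d}  Var(Xbar_n) = {var_xbar:.4f}   1/log(n) = {1/np.log(n):.4f}")
# Var(Xbar_n) ~ 1/log(n) -> 0, so Chebyshev's weak LLN applies
# even though the individual variances Var(X_k) are unbounded.
```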
Theorem 3: (Kolmogorov's strong LLN): The sample average converges almost surely towards the expected value:
\[\begin{align} \bar{X}_n\stackrel{a.s.}{\to}\mu,\quad n\to\infty \end{align}\]#
The differences between the weak law and the strong law are as follows:
1. The weak law states that for a specified large $n$, the average $\bar{X}_{n}$ is likely to be near $\mu$. Thus, it leaves open the possibility that $|\bar{X}_{n}-\mu|>\epsilon$ happens an infinite number of times, although at infrequent intervals (it gives no guarantee that $|\bar{X}_{n}-\mu|$ stays small for all $n$ simultaneously).
2. The strong law shows that this almost surely will not occur. In particular, it implies that with probability 1, for any $\epsilon>0$ the inequality $|\bar{X}_{n}-\mu|<\epsilon$ holds for all large enough $n$.
The strong law does not hold in the following cases, but the weak law does.
Example 3: Take i.i.d. symmetric RVs with tail probability $Pr\{|X_k|>x\}=\frac{e}{x\ln x}$ for $x\geq e$ (a standard instance);
then the sequence has no expected value, but the weak law is true, with $\bar{X}_n\stackrel{p}{\to}0$.
# For a sequence of i.i.d. functions of a RV, $\{f(X_k,\theta)\}_{k\in \mathbb{N}}$ with parameter $\theta\in\Theta$, the strong LLN tells us that for fixed $\theta$ the sample mean converges to $E[f(X,\theta)]$. This is known as pointwise convergence in $\theta$. To make the convergence happen uniformly in $\theta$, we have the uniform LLN.
Theorem 4: (Uniform LLN): For i.i.d. functions of a RV, $\{f(X_k,\theta)\}_{k\in \mathbb{N}}$ with parameter $\theta\in\Theta$ where $E[f(X,\theta)]$ exists for all $\theta$, if:
1. $\Theta$ is compact;
2. $f(x,\theta)$ is continuous in $\theta$ for almost all $x$;
3. $\|f(x,\theta)\|\leq d(x)$ for all $\theta\in\Theta$ with $E[d(X)]<\infty$ (dominance);
then $E[f(X, \theta)]$ is continuous in $\theta$, and
\[\begin{align} \sup _{\theta \in \Theta}\left\|\frac{1}{n} \sum_{i=1}^{n} f\left(X_{i}, \theta\right)-\mathrm{E}[f(X, \theta)]\right\| \stackrel{\mathrm{a.s.}}{\longrightarrow} 0 \end{align}\]# This result is useful to derive consistency of a large class of estimators, known as extremum estimators.
Moreover, an intuitive notion of probability as a long-run relative frequency leads us to the Borel LLN:
Theorem 5: (Borel LLN): If an experiment $(\Omega,\mathcal{B},\mathbf{P})$ is repeated a large number of times independently and under identical conditions, then the relative frequency of any event converges to its probability:
\[\begin{align} \frac{N_n(A)}{n}\stackrel{a.s.}{\to}P(A),\quad n\to\infty,\quad \forall A\in\mathcal{B}, \end{align}\]where $N_n(A)$ is the number of occurrences of $A$ in the first $n$ repetitions. #
Lemma 6: (Chebyshev's Inequality):
\[\begin{align} Pr\{|X-E[X]|\geq k\sigma\}\leq \frac{1}{k^2},\quad \forall k>0,\quad \sigma^2=Var(X). \end{align}\]#
Lemma 7: (Markov’s Inequality): Let $X $be a non-negative RV, then
\[\begin{align} Pr\{X\geq a\}\leq \frac{E[X]}{a} \end{align}\]#
An intuition for Markov's inequality: $E[X]=Pr\{X<a\}E[X | X<a]+Pr\{X\geq a\}E[X | X\geq a]$. Since $X\geq 0$, we have $E[X | X<a]\geq 0$ for any possible distribution $p_X$, while $E[X | X\geq a]\geq a$. Thus $E[X]\geq Pr\{X\geq a\}E[X | X\geq a]\geq a\,Pr\{X\geq a\}$.
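A quick empirical check of Markov's inequality (hypothetical sketch, `numpy` assumed): for an exponential RV with $E[X]=1$ we compare $Pr\{X\geq a\}$ (theoretically $e^{-a}$) against the bound $E[X]/a$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=10**6)   # X >= 0 with E[X] = 1
for a in (1, 2, 4, 8):
    empirical = np.mean(x >= a)              # Pr{X >= a}, theoretically e^{-a}
    markov = x.mean() / a                    # Markov's bound E[X]/a
    print(f"a={a}:  Pr(X >= a) ~ {empirical:.4f}  <=  E[X]/a ~ {markov:.4f}")
# The bound always holds but is loose: it uses only the mean of X.
```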
Theorem 6: (Weak AEP I):
\[\begin{align} -\frac{1}{n}\log p(X^n)\stackrel{p}{\to} H(X),\quad n\to\infty, \end{align}\]or equivalently, let $\mathcal{W}_{[X]_{\epsilon}}^n\stackrel{\Delta}{=}\{x^n\in \mathcal{X}^n: |-\frac{1}{n}\log p(x^n)-H(X)|\leq \epsilon\}$; then
\[\begin{align} \forall \epsilon>0,\exists N\in\mathbb{N}_+,\forall n>N,\quad (1-\epsilon)2^{n(H(X)-\epsilon)}\leq |\mathcal{W}_{[X]_\epsilon}^n|\leq 2^{n(H(X)+\epsilon)}. \end{align}\]We call the set $\mathcal{W}_{[X]_\epsilon}^n$ the weakly typical set, $x^n\in \mathcal{W}_{[X]_\epsilon}^n$ a weakly typical sequence, and $-\frac{1}{n}\log p(x^n)$ the empirical entropy of the sequence $x^n$. #
Proof: By the weak law of large numbers applied to the i.i.d. source, as $n\to\infty$, $-\frac{1}{n}\log p(X^n)=-\frac{1}{n}\sum_{i=1}^{n}\log p(X_i)\stackrel{p}{\to}-E[\log p(X)]=H(X)$. #
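A minimal simulation of the weak AEP (hypothetical sketch, `numpy` assumed, logs base 2): the empirical entropy $-\frac{1}{n}\log p(X^n)$ of an i.i.d. source concentrates around $H(X)$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
p = np.array([0.5, 0.25, 0.25])                  # source distribution p_X
H = -np.sum(p * np.log2(p))                      # H(X) = 1.5 bits
for n in (10, 100, 1000, 10000):
    xn = rng.choice(len(p), size=(500, n), p=p)  # 500 independent sequences of length n
    emp_H = -np.log2(p[xn]).sum(axis=1) / n      # empirical entropy of each sequence
    print(f"n={n:6d}  mean = {emp_H.mean():.4f}  std = {emp_H.std():.4f}  (H(X) = {H})")
# The spread around H(X) shrinks: -(1/n) log p(X^n) -> H(X) in probability.
```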
The weak AEP shows that a small set ($\mathcal{W}_{[X]_\epsilon}^n$ is exponentially small compared to $\mathcal{X}^n$) carries almost all the probability of $\mathcal{X}^n$, with a nearly uniform distribution on it. So a sequence randomly drawn from $\mathcal{X}^n$ has, with high probability, probability near $\frac{1}{|\mathcal{W}_{[X]_\epsilon}^n|}$.* Note: A weakly typical sequence is different from a most likely sequence. It is not necessary for a weakly typical set to include the most likely sequences in order to carry probability close to 1.
Theorem 7:
(Cramér's theorem):
Let $\{X_k\}_{k\in\mathbb{N}}$ be a sequence of i.i.d. RVs with finite logarithmic moment generating function $\Lambda(t)=\log E[e^{tX}]<\infty$. Then the Legendre transform $\Lambda^*(x)=\sup_{t\in\mathbb{R}}[tx-\Lambda(t)]$ of $\Lambda$ satisfies:
\[\begin{align} \lim_{n\to\infty}\frac{1}{n}\log Pr\{\bar{X}_n\geq x\}=-\Lambda^*(x),\quad \forall x>E[X]. \end{align}\]#
Theorem 8:
(SMB Theorem):
Let $\{X_n\}$ be a stationary ergodic process defined on a probability space $(\Omega,\mathcal{B}, \mathbf{P})$. Then the weak AEP holds for $\{X_n\}$:
\[\begin{align} -\frac{1}{n}\log p(X^n)\stackrel{p}{\to}H,\quad n\to\infty, \end{align}\]where $H$ is the entropy rate. #
The assumptions of stationarity, ergodicity, or identically distributed RVs are not necessary for the AEP to hold. An intuitive idea is that any form in which the LLN holds can be transferred to a weak AEP.
Theorem 9: For an independent source $\{X_k\}_{k\in\mathbb{N}}$ with bounded $Var(\log p(X_i))$, the weak AEP holds:
\[\begin{align} -\frac{1}{n}\log p(X^n)-\bar{H}_n(X)\stackrel{p}{\to}0,\quad n\to\infty,\quad\text{where } \bar{H}_n(X)\stackrel{\Delta}{=}\frac{1}{n}H(X_1,\cdots, X_n). \end{align}\]#
We introduce a new kind of AEP, related to Borel's LLN. The following development is called the method of types: the term type defines an equivalence relation, and the probability of an equivalence class leads to the definition of the typical set and the corresponding strong AEP:
Definition 7:
(Empirical Distribution):
For a given sequence $x^n\in \mathcal{X}^n$, the number of appearances of $a\in\mathcal{X}$ in the sequence is
\[\begin{align} N(a|x^n)=\sum_{i=1}^{n}\delta_{a,x_i}, \end{align}\]where $\delta_{a,b}$ is the Kronecker symbol, which equals $1$ only if $a=b$. We denote
\[\begin{align} Q(a)=P_{x^n}(a)=\frac{1}{n}N(a|x^n) \end{align}\]as the empirical distribution of the given $x^n$; by this relation, the empirical distribution $Q$ can also be seen as the frequency $P_{x^n}$. #
For an i.i.d. sequence $X^n$ with distribution $p_X$ and empirical distribution $P_{x^n}$ of a specific $x^n\in\mathcal{X}^n$, we calculate $Pr(X^n=x^n)$:
\[\begin{align} &\frac{1}{n}\log Pr(X^n=x^n)\\ =&\frac{1}{n}\log\prod_{i=1}^{n}p_X(x_i)\\ =&\frac{1}{n}\log\prod_{a\in\mathcal{X}}p_X(a)^{N(a|x^n)}\\ =&\sum_{a\in\mathcal{X}}\frac{N(a|x^n)}{n}\log p_X(a)\\ =&\sum_{a\in\mathcal{X}}Q(a)\log p_X(a). \end{align}\]This indicates that the probability of a random sequence depends only on the distribution of each random variable $X$ and on the symbol frequencies of $x^n$. Like the definition of empirical entropy in the weak AEP, we define the cross entropy:
Definition 8:
(Cross Entropy):
We define the cross entropy in an i.i.d. trial with $n$ realizations of the RV $X$ as:
\[\begin{align} H(Q,p_X)\stackrel{\Delta}{=}-\sum_{a\in\mathcal{X}}Q(a)\log p_X(a). \end{align}\]#
From this definition,
\[\begin{align} Pr(X^n=x^n)=q^{-nH(Q,p_X)}, \end{align}\]where $q$ is the base of the logarithm. It is easy to see that if the frequencies of each $a\in\mathcal{X}$ are the same for two different sequences $x^n$, the probabilities of $X^n$ at those points are the same. This allows us to partition the sample space $\mathcal{X}^n$ into parts such that each part has the same empirical distribution, and within each part, sequences are automatically uniformly distributed. The partition is determined only by $Q:\mathcal{X}\to \{0,\frac{1}{n},\cdots,1\}$. For a formal definition,
Definition 9:
(Type):
For an i.i.d. trial of $n$ realizations of the RV $X$ over a finite alphabet $\mathcal{X}$, the type w.r.t. an empirical distribution $Q$ is defined by:
\[\begin{align} \mathcal{T}_Q^n\stackrel{\Delta}{=}\{x^n\in\mathcal{X}^n:P_{x^n}=Q\}, \end{align}\]so any type $\mathcal{T}_Q^n \subseteq \mathcal{X}^n$ is a set of sequences whose probabilities are numerically the same. # By combinatorics, the number of types is the number of possible empirical distributions:
\[\begin{align} \binom{n+|\mathcal{X}|-1}{n}\leq (n+1)^{|\mathcal{X}|}, \end{align}\]which is polynomial in $n$.* Note: Counting the types is equivalent to counting the ways of choosing $n$ unordered balls, with replacement, from an urn containing $|\mathcal{X}|$ kinds of balls.
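A small sketch of the method of types (hypothetical, plain Python, logs taken to base $q=2$): we enumerate all binary sequences of length $n$, group them by empirical distribution, and check that $Pr(X^n=x^n)=2^{-nH(Q,p_X)}$ within each type.

```python
import itertools, math
from collections import Counter, defaultdict

n, p = 8, {0: 0.75, 1: 0.25}                       # alphabet {0, 1}, source p_X
types = defaultdict(list)
for xn in itertools.product((0, 1), repeat=n):
    cnt = Counter(xn)
    Q = tuple(cnt.get(a, 0) / n for a in (0, 1))   # empirical distribution of x^n
    types[Q].append(xn)

for Q, seqs in sorted(types.items()):
    prob = math.prod(p[a] for a in seqs[0])        # Pr(X^n = x^n): identical on the whole type
    HQp = -sum(q * math.log2(p[a]) for a, q in enumerate(Q))   # cross entropy H(Q, p_X)
    print(f"Q={Q}  |T_Q| = {len(seqs):3d}  Pr = {prob:.6f}  2^(-n H(Q,p)) = {2**(-n*HQp):.6f}")
# There are n + 1 = 9 types, polynomial in n, against |X|^n = 256 sequences.
```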
Definition 10:
(Rate of a Quantity):
For any quantity $M$ which is exponentially large or exponentially small, we call $\frac{1}{n}\log M$ the rate of $M$, where the $\log$ is taken to a convenient integer base $q$.
#
Lemma 8:
(Size of a Type):
For $n$ i.i.d. realizations of $X\in\mathcal{X}$, the size of any type satisfies:
\[\begin{align} \frac{1}{n}\log|\mathcal{T}_Q^n|\sim H(Q),\quad n\to\infty. \end{align}\]As the size of a type depends only on the empirical distribution $Q=P_{x^n}$, this lemma introduces a natural constant for any given distribution $Q$:
\[\begin{align} H(Q)=H(Q,Q)=-\sum_{a\in\mathcal{X}}Q(a)\log Q(a), \end{align}\]which can serve as an alternative definition of entropy. #
Proof: The size of any type is given combinatorially:
\[\begin{align} |\mathcal{T}_Q^n|=\frac{n!}{\prod_{a\in\mathcal{X}}[nQ(a)]!}. \end{align}\]Noticing the Stirling approximation
\[\begin{align} n^n\sim n!\,e^n\implies \frac{1}{n}\log n!\sim \log n-1,\quad n\to \infty, \end{align}\]we have:
\[\begin{align} &\frac{1}{n}\log|\mathcal{T}_Q^n|\\ \sim &(\log n-1)-\sum_{a\in \mathcal{X}}Q(a)(\log[nQ(a)]-1)\\ =&(\log n-1)-(\log n-1)\sum_{a\in\mathcal{X}}Q(a)-\sum_{a\in\mathcal{X}}Q(a)\log Q(a)\\ =&-\sum_{a\in\mathcal{X}}Q(a)\log Q(a)=H(Q,Q), \quad n\to \infty. \end{align}\]#
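The lemma can be checked against the exact multinomial count (a sketch with base-2 logs; Python's big integers keep the factorials exact):

```python
import math

def type_size(n, counts):
    """|T_Q^n| = n! / prod_a (n Q(a))!  for integer counts n*Q(a)."""
    size = math.factorial(n)
    for c in counts:
        size //= math.factorial(c)
    return size

Q = (0.5, 0.25, 0.25)
H_Q = -sum(x * math.log2(x) for x in Q)            # H(Q) = 1.5 bits
for n in (8, 40, 200, 1000):
    counts = [round(n * x) for x in Q]             # n Q(a), integral for these n
    rate = math.log2(type_size(n, counts)) / n     # (1/n) log |T_Q^n|
    print(f"n={n:5d}  (1/n) log|T_Q^n| = {rate:.4f}  -> H(Q) = {H_Q}")
# The rate approaches H(Q) from below; the gap is the polynomial correction.
```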
It is worth noticing that $\frac{1}{n}\log|\mathcal{T}_Q^n|\sim H(Q)$ is an exponential approximation, which tolerates a deviation of a polynomial factor in the size.* Note: This is because for any polynomial $P(x)\in \mathbb{R}[x]$, $\lim\limits_{x\to \infty}\dfrac{P(x)}{e^x}=0$. Specifically, for $\mathcal{T}_Q^n$ with $Q$ happening to be the distribution of $X$, i.e., $Q=p_X$,
\[\begin{align} \frac{1}{\binom{n+|\mathcal{X}|-1}{n}} \leq \sum_{x^n\in\mathcal{T}_Q^n}\prod_{i=1}^{n}p_{X}(x_i)=Q^n(\mathcal{T}_Q^n)\leq 1, \end{align}\]and by the cross-entropy relation above, $\forall x^n\in \mathcal{T}_Q^n, P_{X^n}(x^n)=Q^n(x^n)=q^{-nH(Q)}$. So we get two-sided bounds on the size of a type:* Note: In fact we can choose any $p_X$, as the number of types is independent of $p_X$. We choose $p_X=Q$ to get a proper lower bound, as the probability of the type is maximized when $p_X=Q$.
\[\begin{align} \frac{q^{nH(Q)}}{\binom{n+|\mathcal{X}|-1}{n}} \leq |\mathcal{T}_Q^n|\leq q^{nH(Q)}. \end{align}\]Lemma 9: (Probability of a Type): For $n$ i.i.d. realizations of the RV $X\sim p$, as $n\to\infty$,
\[\begin{align} \frac{1}{n}\log p_{X^n}(\mathcal{T}_Q^n)\sim -D(Q||p). \end{align}\]As the probability of a type depends on the empirical distribution $Q=P_{x^n}$ and the true distribution $p$, this lemma introduces a natural constant for any given pair $Q$ and $p$:
\[\begin{align} D(Q||p)=H(Q,p)-H(Q)=\sum_{a\in\mathcal{X}}Q(a)\log\frac{Q(a)}{p(a)}\geq 0, \end{align}\]which can serve as an alternative definition of the Kullback-Leibler divergence (or relative entropy). #
Proof: We can calculate the probability of any given type $\mathcal{T}_Q^n$ as the product of the probability of any point in the type (as they are uniformly distributed) and the size of the type, both given above:
\[\begin{align} &\frac{1}{n}\log P_{X^n}(\mathcal{T}_Q^n)\\ =&\frac{1}{n}\log[|\mathcal{T}_Q^n|P_{X^n}(x^n:x^n\in \mathcal{T}_Q^n)]\\ \sim &\frac{1}{n}\log|\mathcal{T}_Q^n|+\frac{1}{n}\log P_{X^n}(x^n:P_{x^n}=Q)\\ =&H(Q)-H(Q,p). \end{align}\]
To put it more specifically, combining the two-sided bounds on the size of a type with the per-sequence probability,
\[\begin{align} \frac{q^{-nD(Q||p)}}{\binom{n+|\mathcal{X}|-1}{n}}\leq P_{X^n}(\mathcal{T}_Q^n)=|\mathcal{T}_Q^n|P_{X^n}(x^n:x^n\in \mathcal{T}_Q^n)\leq q^{-nD(Q||p)}. \end{align}\]#
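Similarly, the probability of a whole type can be compared against $q^{-nD(Q\|p)}$ (a sketch with $q=2$, reusing the `type_size` helper from the previous sketch):

```python
import math

def type_size(n, counts):
    size = math.factorial(n)
    for c in counts:
        size //= math.factorial(c)
    return size

p = (0.5, 0.25, 0.25)                              # true source distribution
Q = (0.25, 0.5, 0.25)                              # a different empirical distribution
D = sum(q * math.log2(q / pa) for q, pa in zip(Q, p))                # D(Q||p) = 0.25
for n in (8, 40, 200, 1000):
    counts = [round(n * q) for q in Q]
    log_pt = sum(c * math.log2(pa) for c, pa in zip(counts, p))      # log2 Pr(x^n), x^n in T_Q
    rate = (math.log2(type_size(n, counts)) + log_pt) / n            # (1/n) log2 Pr(T_Q^n)
    print(f"n={n:5d}  (1/n) log Pr(T_Q^n) = {rate:.4f}  -> -D(Q||p) = {-D:.4f}")
```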
This shows that the difference between $H(Q)$ and $H(Q,p_X)$ (i.e., the discrepancy between the empirical distribution and the true distribution) determines the probability of a certain type. (Compare the size of a type, which is determined by the empirical distribution alone.) We can intuitively consider $D(Q||p_X)$ as a "distance" between the two distributions: the larger the distance between $Q$ and $p_X$, the smaller $P_{X^n}(\mathcal{T}_Q^n)$ will be. In the specific case $Q=p_X$, $D(Q||p_X)=0$ and the probability is maximal.
Although $\frac{1}{n}\log P_{X^n}(\mathcal{T}_Q^n)\sim 0$ when $Q=p_X$, it does not follow that $P_{X^n}(\mathcal{T}_Q^n)\sim 1$, as the former approximation is only in the exponential sense. In fact, as $n\to \infty$, the number of types "near" $p_X$, say with $|Q(a)-p_X(a)|\leq \delta$ for some $a\in \mathcal{X}$, is no less than $[2n\delta]$, each with nearly the same probability (the difference will be bounded in the following statement), so the probability of the single type satisfies $P_{X^n}(\mathcal{T}_{p_X}^n)\lesssim 1/(2n\delta)\to 0$ as $n\to \infty$. So we should not care about the probability of any *certain type* but of a *range of types*. In this section we show that the probability concentrates exactly around the maximum type $\mathcal{T}_{p_X}^n$.
* Note: If only the count at a single $a\in \mathcal{X}$ changes, then from $Q(a)=\dfrac{N(a | x^n)}{n}$ we get $N(a|x^n)=nQ(a)\in [ n(p_X(a)-\delta), n(p_X(a)+\delta) ]$, so there are no less than $[2n\delta]$ different values of $N(a | x^n)$, and thus $[2n\delta]$ different types.
Definition 11: (Strongly Typical Set): Let $X\sim p$ be a RV on a finite alphabet $\mathcal{X}$ with support $\mathcal{S}_X$. We define the union of all "near maximum" types as the strongly typical set:
\[\begin{align} \mathcal{T}_{[X]_\delta}^n=&\{x^n\in\mathcal{X}^n:|Q(a)-p(a)|\leq\delta,\forall a\in \mathcal{S}_X\wedge Q(a)=0,\forall a\notin \mathcal{S}_X\}\\ =&\bigcup_{Q:|Q(a)-p(a)|\leq\delta,\forall a\in \mathcal{S}_X\wedge Q(a)=0,\forall a\notin \mathcal{S}_X}\mathcal{T}_Q^n, \end{align}\]or alternatively,
\[\begin{align} T_{[X]_{\delta^\prime}}^n=&\{x^n\in\mathcal{X}^n: \forall a\notin \mathcal{S}_X, Q(a)=0\wedge \sum_{x\in\mathcal{X}}|Q(x)-p(x)|\leq\delta^\prime\}, \end{align}\]by letting $\delta^\prime=\sqrt{|\mathcal{X}|}\delta$. The complement of the strongly typical set is called the strongly atypical set, denoted by $\mathcal{T}_{[X]_\delta}^{nC}=\mathcal{X}^n-\mathcal{T}_{[X]_\delta}^n$ (resp. $T_{[X]_{\delta^\prime}}^{nC}=\mathcal{X}^n-T_{[X]_{\delta^\prime}}^n$). For $x^n\in\mathcal{X}^n$, if $\exists x_i\in \mathcal{X}, |Q(x_i)-p(x_i)|\geq\delta$, then $x^n\in\mathcal{T}_{[X]_\delta}^{nC}$. #
The strong AEP is stated in three parts, analogous to the weak AEP:
Theorem 10: (Strong AEP I: Size of the Strongly Typical Set):
\[\begin{align} \frac{1}{n}\log|\mathcal{T}_{[X]_\delta}^n|\sim H(p),\quad n\to\infty \end{align}\]#
Proof: First we give a lemma based on the continuity of $H(\cdot)$:
Lemma 10: By the uniform continuity of the entropy function
\[\begin{align} H:\{(q_1,\cdots,q_{|\mathcal{X}|}): q_i\in [0,1],\sum_{i=1}^{|\mathcal{X}|}q_i=1\}\to\mathbb{R}_+,\quad H(Q)=-\sum_{a\in\mathcal{X}}Q(a)\log Q(a), \end{align}\]for every $\epsilon>0$ we can find $\delta_{\epsilon}>0$ such that $|Q(a)-p(a)|\leq\delta_\epsilon,\forall a\in\mathcal{X}$ implies $|H(Q)-H(p)| \leq \epsilon$.
#
We pick* Note: Notice that we pick $\delta$ from $\epsilon$, not $\epsilon$ from $\delta$. This $\delta$ is linear in $\epsilon$, so they share the same convergence behavior as $n\to\infty$.
\[\begin{align} \delta_{\epsilon}=\min\left\{\frac{\epsilon}{-\sum_{a\in \mathcal{X}}\log p(a)},\frac{\epsilon}{-\sum_{a\in \mathcal{X}}\log Q(a)}\right\}, \end{align}\]and $H(Q)$ is bounded by:
\[\begin{align} H(Q)\leq &H(Q,p)=-\sum_{a\in\mathcal{X}}Q(a)\log p(a)\\ \leq &-\sum_{a\in\mathcal{X}}(p(a)+\delta_\epsilon)\log p(a)\\ = &-\delta_\epsilon\sum_{a\in\mathcal{X}}\log p(a)-\sum_{a\in\mathcal{X}}p(a)\log p(a)\\ \leq&\epsilon+H(p), \end{align}\] \[\begin{align} H(Q)=&-\sum_{a\in\mathcal{X}}Q(a)\log Q(a)\\ \geq &-\sum_{a\in\mathcal{X}}(p(a)-\delta_\epsilon)\log Q(a)\\ \geq & H(p)-\epsilon\frac{\sum_{a\in\mathcal{X}}\log Q(a)}{\sum_{a\in\mathcal{X}}\log Q(a)}\\ =&H(p)-\epsilon. \end{align}\]Together,
\[\begin{align} |Q(a)-p(a)|\leq \delta_{\epsilon},\forall a\in \mathcal{X}\implies |H(Q)-H(p)|\leq \epsilon, \end{align}\]which ends the proof of the lemma. We then bound the size of the typical set from above:
\[\begin{align} &\frac{1}{n}\log|\mathcal{T}_{[X]_{\delta_\epsilon}}^n|\\ =&\frac{1}{n}\log\sum_{Q:|Q(a)-p(a)|\leq\delta_\epsilon,\forall a\in \mathcal{X}}|\mathcal{T}_Q^n|\\ \leq & \frac{1}{n}\log\binom{n+|\mathcal{X}|-1}{n}+\max_{Q:|Q(a)-p(a)|\leq\delta_\epsilon,\forall a\in \mathcal{X}}\frac{1}{n}\log|\mathcal{T}_{Q}^n|\\ \sim &H(Q)\leq H(p)+\epsilon, \quad n\to\infty, \end{align}\]and the lower bound is obtained the same way:
\[\begin{align} &\frac{1}{n}\log|\mathcal{T}_{[X]_{\delta_\epsilon}}^n|\\ =&\frac{1}{n}\log\sum_{Q:|Q(a)-p(a)|\leq\delta_\epsilon,\forall a\in \mathcal{X}}|\mathcal{T}_Q^n|\\ \geq &\min_{Q:|Q(a)-p(a)|\leq\delta_\epsilon,\forall a\in \mathcal{X}}\frac{1}{n}\log|\mathcal{T}_{Q}^n|\\ \sim &H(Q)\geq H(p)-\epsilon, \quad n\to\infty. \end{align}\]#
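Strong AEP I can be verified by exact enumeration over types rather than sequences (a sketch with base-2 logs; the stars-and-bars generator below is hypothetical helper code):

```python
import itertools, math

def compositions(n, k):
    """All k-tuples of nonnegative ints summing to n: the possible type counts."""
    for bars in itertools.combinations(range(n + k - 1), k - 1):
        prev, out = -1, []
        for b in bars:
            out.append(b - prev - 1)
            prev = b
        out.append(n + k - 2 - prev)
        yield tuple(out)

p, delta = (0.5, 0.25, 0.25), 0.05
H = -sum(x * math.log2(x) for x in p)              # H(p) = 1.5 bits
for n in (20, 60, 120):
    total = 0
    for counts in compositions(n, len(p)):         # sum |T_Q| over Q with |Q(a)-p(a)| <= delta
        if all(abs(c / n - pa) <= delta for c, pa in zip(counts, p)):
            total += math.factorial(n) // math.prod(math.factorial(c) for c in counts)
    print(f"n={n:4d}  (1/n) log|T_[X]delta^n| = {math.log2(total)/n:.4f}  (H(p) = {H})")
# The rate climbs toward H(p) as n grows, with delta held fixed.
```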
Theorem 11: (Strong AEP II: Probability of the Strongly Typical Set):
\[\begin{align} \frac{1}{n}\log Pr\{X^n\in \mathcal{T}^{nC}_{[X]_{\delta_\epsilon}}\}\leq -\delta_\epsilon. \end{align}\]As $n\delta_\epsilon\to\infty$ under the choice of $\epsilon$ below, the probability of the strongly typical set tends to 1. #
Proof: For every type $\mathcal{T}_Q^n\subseteq \mathcal{T}_{[X]_{\delta_{\epsilon}}}^{nC}$, $\exists a \in \mathcal{X}, |Q(a)-p(a)|\geq \delta_{\epsilon}$. Due to the continuity and monotonicity of $f(p)=\log p$, we have:
\[\begin{align} D(Q||p)\geq \delta_\epsilon. \end{align}\]
Note: This step uses the definition of the derivative and is not exact. In fact, $\log p(a)-\log Q(a)= \frac{d(\log p)}{dp}\Big|_{p=Q(a)}[p(a)-Q(a)]+o((p(a)-Q(a))^2)$ is the Taylor formula with Peano remainder, and the result should be $\delta_\epsilon + Q(a)o(\delta^2)$. Here we omit the $o(\delta^2)$, and we will configure $\delta$ below to keep $\lim\limits_{n\to \infty}nQ(a)o(\delta^2)$ finite.
Then we bound the probability of the atypical set from above:
\[\begin{align} &\frac{1}{n}\log Pr(X^n\in\mathcal{T}_{[X]_{\delta_\epsilon}}^{nC})\\ =&\frac{1}{n}\log\sum_{Q:\exists a\in \mathcal{X}, |Q(a)-p(a)|\geq\delta_\epsilon}P_{X^n}(\mathcal{T}_Q^n)\\ \leq & \frac{1}{n}\log\binom{n+|\mathcal{X}|-1}{n}+\max_{Q:\exists a\in \mathcal{X}, |Q(a)-p(a)|\geq\delta_\epsilon}\frac{1}{n}\log Pr(X^n\in\mathcal{T}_{Q}^{n})\\ \sim&-D(Q||p)\leq -\delta_\epsilon. \end{align}\]#
Theorem 12: (Strong AEP III: Probability of a Strongly Typical Sequence):
\[\begin{align} \forall \delta>0,\exists \eta>0 \text{ s.t. } \forall x^n\in\mathcal{T}_{[X]_\delta}^n,\quad -H(X)-\eta\leq\frac{1}{n}\log p(x^n)\leq -H(X)+\eta \end{align}\]#
Proof: From AEP I and AEP II, we have
\[\begin{align} -H(p)-\epsilon-\delta\leq \frac{1}{n}\log p(x^n)=-H(Q,p)=-H(Q)-D(Q||p)\leq -H(p)+\epsilon, \end{align}\]and we let $\eta=\delta+\epsilon$. #
One thing to notice is that we must choose $\epsilon$ so that both $\epsilon$ itself and $n\delta_\epsilon^2$ are sufficiently small while $n\delta_{\epsilon}$ is sufficiently large. For example,
\[\begin{align} \epsilon=n^p , \quad p\in(-1,-\dfrac{1}{2}], \end{align}\]is a proper setting.
Definition 12: (Substantial Set): A set $\mathcal{A}\subseteq \mathcal{X}^n$ is called a substantial set iff
\[\begin{align} Pr(X^n\in\mathcal{A})=\mu>0. \end{align}\]#
Intuitively, positive probability can only live inside the typical set as $n\to \infty$, since Strong AEP II shows that
\[\begin{align} Pr(X^n\in\mathcal{T}_{[X]_{\delta_\epsilon}}^{nC})\to 0,\quad n\to \infty. \end{align}\]To show this, we first calculate the probability of the intersection of $\mathcal{A}$ and the typical set:
\[\begin{align} &Pr(X^n\in \mathcal{A}\cap \mathcal{T}_{[X]_{\delta_\epsilon}}^n)\\ =&1-Pr(X^n\in\overline{\mathcal{A}}\cup \mathcal{T}_{[X]_{\delta_\epsilon}}^{nC})\\ \geq &1-Pr(X^n\in \overline{\mathcal{A}})-Pr(X^n\in \mathcal{T}_{[X]_{\delta_\epsilon}}^{nC})\\ \geq &1-(1-\mu)-q^{-n\delta_\epsilon}\to \mu,\quad n\to \infty, \end{align}\]so the probability is almost entirely concentrated in the typical set. We can lower-bound the size of $\mathcal{A}$ using the property that the distribution within the typical set is nearly uniform:
\[\begin{align} \frac{1}{n}\log|\mathcal{A}|\geq &\frac{1}{n}\log|\mathcal{A}\cap \mathcal{T}_{[X]_{\delta_\epsilon}}^n|\\ \geq &\frac{1}{n}\log(\mu-q^{-n\delta_\epsilon})+H(X)-\epsilon\\ \to & H(X)-\epsilon,\quad n\to \infty. \end{align}\]So any non-negligible part of the strongly typical set has the same exponential size as the strongly typical set itself.
Strong typicality is stronger in the sense that it implies weak typicality.
Theorem 13: $\forall \delta>0, \exists \eta>0$, s.t. $\mathcal{T}_{[X]_\delta}^{n}\subseteq \mathcal{W}_{[X]_\eta}^{n}$. # The converse is not true. A counterexample: for $X\sim p=(1/2,1/4,1/4)$, a weakly typical sequence $x^n$ only needs:
\[\begin{align} -\frac{1}{n}\log p(x^n)=-q(0)\log \frac{1}{2}-q(1)\log \frac{1}{4} -q(2)\log \frac{1}{4}\approx H(X). \end{align}\]By letting $q(0)=q(1)=0.5, q(2)=0$, the sequence (containing no occurrence of the third symbol) is weakly typical, but apparently not strongly typical.
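A tiny check of this counterexample (hypothetical sketch): a sequence with empirical distribution $q=(0.5,0.5,0)$ has empirical entropy exactly $H(X)=1.5$ bits, hence is weakly typical for every $\epsilon$, yet its empirical distribution deviates from $p$ by $0.25$.

```python
import math

p = (0.5, 0.25, 0.25)
H = -sum(x * math.log2(x) for x in p)                 # H(X) = 1.5 bits
n = 100
xn = [0] * 50 + [1] * 50                              # the third symbol never occurs
emp_entropy = -sum(math.log2(p[a]) for a in xn) / n   # -(1/n) log p(x^n)
q = tuple(xn.count(a) / n for a in (0, 1, 2))         # empirical distribution (0.5, 0.5, 0)
print(f"empirical entropy = {emp_entropy}  vs  H(X) = {H}")                  # equal: weakly typical
print(f"max_a |q(a) - p(a)| = {max(abs(qa - pa) for qa, pa in zip(q, p))}")  # 0.25: not strongly typical
```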
Strong as it is, strong typicality can only be used for RVs over finite alphabets, while the weak version corresponds directly to the weak LLN. Unless specified, we always use the term "typicality" for strong typicality.
Now we come to the bivariate scenario, where two generic RVs $X\in\mathcal{X},Y\in\mathcal{Y}$ are taken into consideration to produce the i.i.d. information source $\{X_k,Y_k\}_{k\in\mathbb{N}}$.
Definition 13: (Jointly Typical Set): Let $\{X_k,Y_k\}_{k\in\mathbb{N}}$ be an i.i.d. information source with two generic RVs $X\in\mathcal{X},Y\in \mathcal{Y}$ over finite alphabets. The jointly typical set $\mathcal{T}_{[XY]_\delta}^n$ is
\[\begin{align} \mathcal{T}_{[XY]_\delta}^n=\{(x^n,y^n)&\in \mathcal{X}^n\times\mathcal{Y}^n: N(x,y;x^n,y^n)=0, \forall (x,y)\notin \mathcal{S}_{X,Y}\wedge\\ &\left|\frac{N(x,y;x^n,y^n)}{n}-p(x,y)\right|\leq \delta,\forall (x,y)\in\mathcal{X}\times\mathcal{Y}\}\\ T_{[XY]_\delta}^n=\{(x^n,y^n)&\in \mathcal{X}^n\times\mathcal{Y}^n: N(x,y;x^n,y^n)=0, \forall (x,y)\notin \mathcal{S}_{X,Y}\wedge\\ &\sum_{x\in\mathcal{X}}\sum_{y\in\mathcal{Y}}\left|\frac{N(x,y;x^n,y^n)}{n}-p(x,y)\right|\leq \delta\}. \end{align}\]Any $(x^n,y^n)\in\mathcal{T}_{[XY]_\delta}^n$ is called jointly typical. #
Theorem 14: (Properties of the Jointly Typical Set):
* (Consistency): Joint typicality implies typicality of each component:
\[\begin{align} \forall \delta>0, \exists \delta_1,\delta_2>0 \text{ with } \delta_1,\delta_2\to 0 \text{ as } \delta\to 0, \text{ s.t.}\quad \mathcal{T}_{[XY]_\delta}^n\subseteq\mathcal{T}_{[X]_{\delta_1}}^n\times\mathcal{T}_{[Y]_{\delta_2}}^n \end{align}\]* (Preservation): If $Y=f(X)$, then $x^n\in\mathcal{T}_{[X]_\delta}^n$ implies $f(x^n)\stackrel{\Delta}{=}(f(x_1),\cdots,f(x_n))\in\mathcal{T}_{[Y]_{\delta^\prime}}^n$ for some $\delta^\prime\to 0$ as $\delta\to 0$.
#
Proof: The consistency of the jointly typical set follows from $N(x;x^n)=\sum_{y\in\mathcal{Y}}N(x,y;x^n,y^n)$:
\[\begin{align} \left|\frac{N(x;x^n)}{n}-p(x)\right|=\left|\sum_{y\in\mathcal{Y}}\left(\frac{N(x,y;x^n,y^n)}{n}-p(x,y)\right)\right|\leq |\mathcal{Y}|\delta=\delta_1. \end{align}\]So $\forall (x^n,y^n)\in\mathcal{T}_{[XY]_\delta}^n$, we have $x^n\in\mathcal{T}_{[X]_{\delta_1}}^n$; similarly, $(x^n,y^n)\in\mathcal{T}_{[XY]_\delta}^n \implies y^n\in\mathcal{T}_{[Y]_{\delta_2}}^n$.
For preservation, we notice that $N(y; (f(x_1),\cdots,f(x_n)))=\sum_{x\in f^{-1}[y]}N(x;x^n)$:
\[\begin{align} \left|\frac{N(y;y^n)}{n}-p(y)\right|=&\left|\sum_{x\in f^{-1}[y]}\left(\frac{N(x;x^n)}{n}-p(x)\right)\right|\\ \leq& \sum_{x\in f^{-1}[y]}\left|\frac{N(x;x^n)}{n}-p(x)\right|\leq |f^{-1}[y]|\delta=\delta^\prime. \end{align}\]Therefore, $y^n=(f(x_1),\cdots,f(x_n))\in\mathcal{T}_{[Y]_{\delta^\prime}}^n$. #
Theorem 15: (JAEP): Let $(X^n,Y^n)=((X_1,Y_1),\cdots,(X_n,Y_n))$ be an i.i.d. sequence with generic pair $(X,Y)$. Then there exist $\epsilon,\eta>0$ with $\epsilon,\eta\to 0$ as $\delta\to 0$, such that
\[\begin{align} \forall (x^n,y^n)\in\mathcal{T}_{[XY]_\delta}^n,\quad -H(X,Y)-\epsilon\leq& \frac{1}{n}\log p(x^n,y^n)\leq -H(X,Y)+\epsilon\\ &\frac{1}{n}\log Pr\{(X^n,Y^n)\in\mathcal{T}_{[XY]_\delta}^{nC}\}\leq -\delta\\ H(X,Y)-\eta\leq &\frac{1}{n}\log |\mathcal{T}_{[XY]_\delta}^n|\leq H(X,Y)+\eta \end{align}\]#
Proof: The proof is similar to the proof of strong AEP, omitted. #
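A minimal simulation of the JAEP (hypothetical sketch, `numpy` assumed): draw i.i.d. pairs from a joint pmf and watch $-\frac{1}{n}\log p(x^n,y^n)$ concentrate around $H(X,Y)$.

```python
import numpy as np

rng = np.random.default_rng(3)
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])                        # joint pmf of (X, Y)
H_XY = -np.sum(pxy * np.log2(pxy))                  # H(X,Y) ~ 1.7219 bits
flat = pxy.ravel()
for n in (10, 100, 1000, 10000):
    idx = rng.choice(4, size=(500, n), p=flat)      # 500 i.i.d. pair-sequences of length n
    emp = -np.log2(flat[idx]).sum(axis=1) / n       # joint empirical entropy per sequence
    print(f"n={n:6d}  mean = {emp.mean():.4f}  std = {emp.std():.4f}  (H(X,Y) = {H_XY:.4f})")
# The concentration drives the probability onto ~2^{n H(X,Y)} jointly typical pairs.
```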
From the JAEP, we know that as $n\to\infty$, $\frac{1}{n}\log |\mathcal{T}_{[XY]_\delta}^n| \to H(X,Y)$ and $\frac{1}{n}\log |\mathcal{T}_{[X]_\delta}^n| \to H(X)$, and that $(X^n,Y^n)$ (resp. $X^n$) is nearly uniformly distributed over $\mathcal{T}_{[XY]_\delta}^n$ (resp. $\mathcal{T}_{[X]_\delta}^n$). Then for each $x^n\in\mathcal{T}_{[X]_\delta}^n$, the number of sequences $y^n$ making $(x^n,y^n)$ jointly typical is exponentially the same, with rate $\frac{1}{n}\log \frac{|\mathcal{T}_{[XY]_\delta}^n|}{|\mathcal{T}_{[X]_\delta}^n|}\to H(Y | X)$. Thus we can define the typicality of $y^n$ conditioned on a given $x^n\in \mathcal{T}_{[X]_\delta}^n$.
Definition 14:
(Conditional Typical Set):
For any $x^n\in\mathcal{T}_{[X]_\delta}^n$, the sequences $y^n\in \mathcal{T}_{[Y]_\delta}^n$ which make $(x^n,y^n)$ jointly typical are called the typical sequences conditioned on $x^n$, i.e.,
\[\begin{align} \mathcal{T}_{[Y|X]_\delta}^n(x^n)\stackrel{\Delta}{=}\{y^n\in\mathcal{Y}^n:(x^n,y^n)\in\mathcal{T}_{[XY]_\delta}^n\}. \end{align}\]The set $\mathcal{T}_{[Y|X]_\delta}^n(x^n)$ is called the typical set of $Y$ conditioned on $x^n$. #
Theorem 16: (Conditional AEP):
\[\begin{align} \forall x^n\in\mathcal{T}_{[X]_\delta}^n&\text{ s.t. }|\mathcal{T}_{[Y|X]_\delta}^n(x^n)|\geq 1, \exists \eta>0, \eta\to 0 \text{ as } \delta\to 0,\text{ we have: }\\ &\quad H(Y|X)-\eta\leq \frac{1}{n}\log|\mathcal{T}_{[Y|X]_\delta}^n(x^n)|\leq H(Y|X)+\eta \end{align}\]#
Proof: $\forall x^n\in\mathcal{T}_{[X]_\delta}^n$, we have
\[\begin{align} p(x^n)=&\sum_{y^n\in \mathcal{Y}^n}p(x^n,y^n)\\ \geq& \sum_{y^n\in\mathcal{T}_{[Y|X]_\delta}^n(x^n)}p(x^n,y^n)=|\mathcal{T}_{[Y|X]_\delta}^n(x^n)|\,p(x^n,y^n), \quad\forall y^n\in\mathcal{T}_{[Y|X]_\delta}^n(x^n), \end{align}\]so the upper bound on the size follows from the bounds in the JAEP and the AEP:
\[\begin{align} \frac{1}{n}\log|\mathcal{T}_{[Y|X]_\delta}^n(x^n)|\leq& \frac{1}{n}\log p(x^n)-\frac{1}{n}\log p(x^n,y^n)\\ \leq& -H(X)+\eta_1-(-H(X,Y)-\eta_2)=H(Y|X)+\eta ,\quad \eta=\eta_1+\eta_2. \end{align}\]For the lower bound: $\forall (x^n,y^n)\in\mathcal{T}_{[XY]_\delta}^n$ with empirical joint distribution $K$, the size of the corresponding type is:
\[\begin{align} |\mathcal{T}_{K}^n|=\frac{n!}{\prod_{(x,y)\in \mathcal{X}\times\mathcal{Y}}(nK(x,y))!}=\frac{n!}{\prod_{(x,y)\in \mathcal{X}\times\mathcal{Y}}N(x,y;x^n,y^n)!}. \end{align}\]For fixed $x^n\in \mathcal{T}_{[X]_\delta}^n$, the count of compatible $y^n$ reduces to:
\[\begin{align} \prod_{x\in\mathcal{X}}\frac{N(x;x^n)!}{\prod_{y\in\mathcal{Y}}N(x,y;x^n,y^n)!}, \end{align}\]which is a lower bound on $|\mathcal{T}_{[Y|X]_\delta}^n(x^n)|$, i.e.,
where (a) is Stirling's approximation:
\[\begin{align} n\ln n-n< \ln n!<(n+1)\ln(n+1)-n, \end{align}\]and (b), (c) follow from the AEP and JAEP, with $\lim\limits_{n\to\infty}\eta(n,\delta)=0$. Thus the bounds are proved. #
The CAEP shows that the rate of the size of $\mathcal{T}_{[Y|X]_\delta}^n(x^n)$ approximates $H(Y | X)$ regardless of which $x^n\in\mathcal{T}_{[X]_\delta}^n$ is chosen. As a corollary, we show that the number of typical $x^n$ for which $\frac{1}{n}\log |\mathcal{T}_{[Y|X]_\delta}^n(x^n)| \to H(Y | X)$ grows with $n$ at almost the same rate as the number of typical $x^n\in\mathcal{T}_{[X]_\delta}^n$.
Lemma 11: Let
\[\begin{align} \mathcal{S}_{[X]_\delta}^n=\{x^n\in\mathcal{T}_{[X]_\delta}^n: \mathcal{T}_{[Y|X]_\delta}^n(x^n)\neq\emptyset\} \end{align}\]then
\[\begin{align} &\frac{1}{n}\log|\mathcal{S}_{[X]_\delta}^n|\to H(X),\quad n\to\infty\\ &\exists \gamma(n)\to 0,\quad\frac{1}{n}\log Pr\{X^n\in\mathcal{S}_{[X]_\delta}^{nC}\}<-\gamma(n) \end{align}\]#
Typicality gives us an asymptotic perspective on the meaning of Shannon's measures. Correspondingly, it is closely related to the basic inequalities.
Example 4: Consider three RVs $X,Y,Z$:
\[\begin{align} &(x^n,y^n,z^n)\in\mathcal{T}_{[XYZ]_\delta}^n\\ \stackrel{\text{Consistency}}{\Longrightarrow}&(x^n,z^n)\in\mathcal{T}_{[XZ]_\delta}^n,\quad (y^n,z^n)\in\mathcal{T}_{[YZ]_\delta}^n\\ \stackrel{\text{CTypical}}{\Longrightarrow}&x^n\in\mathcal{T}_{[X|Z]_\delta}^n(z^n),\quad y^n\in\mathcal{T}_{[Y|Z]_\delta}^n(z^n)\\ \Longrightarrow & \mathcal{T}_{[XY|Z]_\delta}^n(z^n)\subseteq \mathcal{T}_{[X|Z]_\delta}^n(z^n)\times\mathcal{T}_{[Y|Z]_\delta}^n(z^n)\\ \Longrightarrow & |\mathcal{T}_{[XY|Z]_\delta}^n(z^n)|\leq |\mathcal{T}_{[X|Z]_\delta}^n(z^n)||\mathcal{T}_{[Y|Z]_\delta}^n(z^n)|\\ \Longrightarrow & \frac{1}{n}\log |\mathcal{T}_{[XY|Z]_\delta}^n(z^n)|\leq \frac{1}{n}\log|\mathcal{T}_{[X|Z]_\delta}^n(z^n)|+\frac{1}{n}\log|\mathcal{T}_{[Y|Z]_\delta}^n(z^n)|\\ \Longrightarrow & H(X,Y|Z)-\epsilon_1\leq H(X|Z)+\epsilon_2+ H(Y|Z)+\epsilon_3\\ \stackrel{n\to\infty, \epsilon_1,\epsilon_2,\epsilon_3\to 0}{\Longrightarrow}&H(X,Y|Z)\leq H(X|Z)+ H(Y|Z) \end{align}\]# This idea is summarized as follows: