Alec's c.d.f-sidestepping sampling method
- See generating samples from a distribution for and overview of methods used
Contents
Algorithm
Let [ilmath]f(x)[/ilmath] be a p.d.f for which computing [ilmath]F^{-1}(x)[/ilmath] is either difficult, impractical, impossible or otherwise unavailable, but where [ilmath]f(x)[/ilmath] itself is easy to compute.
Suppose that either:
- [ilmath]\exists a,b\in\mathbb{R}\big[\forall x\in[a,b]\subseteq\mathbb{R}[f(x)>0]\big][/ilmath], there exists a closed interval, [ilmath][a,b][/ilmath] for which [ilmath]f(x)>0[/ilmath] on over, or:
- [ilmath]\forall x\in\mathbb{R}[f(x)>0][/ilmath] - [ilmath]f(x)>0[/ilmath] everywhere
then we will sample [ilmath]f(x)[/ilmath] over [ilmath][a,b][/ilmath] or [ilmath]\mathbb{R} [/ilmath]
- herein we will use [ilmath]A:\eq[a,b][/ilmath] or [ilmath]A:\eq\mathbb{R} [/ilmath] for whichever case is chosen
Define the following function carefully:
- Define [ilmath]h:A\rightarrow\mathbb{R} [/ilmath] such that [ilmath]\forall x\in A[h(x)>f(x)][/ilmath]
Then the algorithm proceeds as follows:
Steps
To generate [ilmath]X[/ilmath], a sample from the distribution [ilmath]f[/ilmath]:
- Define [ilmath]g:A\rightarrow\mathbb{R} [/ilmath] by [math]g(x):\eq\frac{h(x)}{\int_Ah(t)\mathrm{d}t} [/math] - note that this makes [ilmath]g[/ilmath] itself a p.d.f
- Generate [ilmath]Y[/ilmath] on [ilmath]A[/ilmath] from [ilmath]g[/ilmath]
- if [ilmath]G:A\rightarrow [0,1]\subseteq\mathbb{R} [/ilmath], or [ilmath]G(x)[/ilmath], is the c.d.f of [ilmath]g[/ilmath] then [ilmath]G^{-1}:[0,1]\rightarrow A[/ilmath] is how we usually sample from the distribution, [ilmath]g[/ilmath], we generate a random number, say [ilmath]y[/ilmath] uniformly from [ilmath][0,1][/ilmath], then [ilmath]Y\eq G^{-1}(y)[/ilmath] is a random sample from [ilmath]g[/ilmath]
- Generate [ilmath]Z[/ilmath] as an (independent of [ilmath]Y[/ilmath]) uniformly on [ilmath][0,1][/ilmath]
- If [ilmath]Z h(Y) \le f(Y) [/ilmath] then:
- Define [ilmath]X:\eq Y[/ilmath]
- Else
- Goto step [ilmath]2[/ilmath]
Note that other methods may be used to generate either [ilmath]Y[/ilmath] or [ilmath]Z[/ilmath], it is only important that we generate them somehow.
It is possible (although extremely unlikely for high numbers) for this to loop a lot of times before generating a sample, that is the trade-off it makes compared with generating samples from the inverse of a cumulative distribution function. The better the [ilmath]h[/ilmath] approximation is to [ilmath]f[/ilmath] the higher the throughput of generated samples.
Also unlike generating samples from the inverse of a cumulative distribution function it requires at least 2 generated values unlike the inverse c.d.f generating method which requires only on generated uniform value on [ilmath][0,1][/ilmath] always.
Proof
The message provided is: