In directional statistics, the von Mises–Fisher distribution (named after Richard von Mises and Ronald Fisher) is a probability distribution on the (p − 1)-sphere in ℝ^p. If p = 2, the distribution reduces to the von Mises distribution on the circle.
The probability density function of the von Mises–Fisher distribution for the random p-dimensional unit vector x is given by:
{\displaystyle f_{p}(\mathbf {x} ;{\boldsymbol {\mu }},\kappa )=C_{p}(\kappa )\exp \left({\kappa {\boldsymbol {\mu }}^{\mathsf {T}}\mathbf {x} }\right),}

where κ ≥ 0, ‖μ‖ = 1, and the normalization constant C_p(κ) is equal to

{\displaystyle C_{p}(\kappa )={\frac {\kappa ^{p/2-1}}{(2\pi )^{p/2}I_{p/2-1}(\kappa )}},}

where I_v denotes the modified Bessel function of the first kind at order v. If p = 3, the normalization constant reduces to

{\displaystyle C_{3}(\kappa )={\frac {\kappa }{4\pi \sinh \kappa }}={\frac {\kappa }{2\pi (e^{\kappa }-e^{-\kappa })}}.}

The parameters μ and κ are called the mean direction and concentration parameter, respectively. The greater the value of κ, the higher the concentration of the distribution around the mean direction μ. The distribution is unimodal for κ > 0, and is uniform on the sphere for κ = 0.

The von Mises–Fisher distribution for p = 3 is also called the Fisher distribution. It was first used to model the interaction of electric dipoles in an electric field. Other applications are found in geology, bioinformatics, and text mining.
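The density and its normalization constant are straightforward to evaluate numerically. The following Python sketch computes the log-density; the function name and the use of SciPy's exponentially scaled Bessel function `ive` for numerical stability are choices of this illustration, not part of any reference implementation:

```python
import numpy as np
from scipy.special import ive  # exponentially scaled Bessel: ive(v, k) = iv(v, k) * exp(-k)

def log_vmf_density(x, mu, kappa):
    """Log-density of VMF(mu, kappa) at unit vector x; assumes kappa > 0."""
    p = len(mu)
    nu = p / 2 - 1
    # log C_p(kappa), using log I_nu(kappa) = log ive(nu, kappa) + kappa
    log_c = nu * np.log(kappa) - (p / 2) * np.log(2 * np.pi) \
            - (np.log(ive(nu, kappa)) + kappa)
    return log_c + kappa * np.dot(mu, x)
```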
In the textbook Directional Statistics by Mardia and Jupp, the normalization constant given for the von Mises–Fisher probability density is apparently different from the one given here: C_p(κ). In that book, the normalization constant is specified as:
{\displaystyle C_{p}^{*}(\kappa )={\frac {({\frac {\kappa }{2}})^{p/2-1}}{\Gamma (p/2)I_{p/2-1}(\kappa )}}}

where Γ is the gamma function. This is resolved by noting that Mardia and Jupp give the density "with respect to the uniform distribution", while the density here is specified in the usual way, with respect to Lebesgue measure. The density (w.r.t. Lebesgue measure) of the uniform distribution is the reciprocal of the surface area of the (p − 1)-sphere, so that the uniform density function is given by the constant:

{\displaystyle C_{p}(0)={\frac {\Gamma (p/2)}{2\pi ^{p/2}}}}

It then follows that:

{\displaystyle C_{p}^{*}(\kappa )={\frac {C_{p}(\kappa )}{C_{p}(0)}}}

While the value for C_p(0) was derived above via the surface area, the same result may be obtained by setting κ = 0 in the above formula for C_p(κ). This can be done by noting that the series expansion for I_{p/2−1}(κ) divided by κ^{p/2−1} has but one non-zero term at κ = 0. (To evaluate that term, one needs to use the definition 0⁰ = 1.)
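The relation C*_p(κ) = C_p(κ)/C_p(0) can be confirmed numerically; a minimal sketch, with illustrative function names:

```python
import numpy as np
from scipy.special import gamma, iv

def C(kappa, p):       # normalization w.r.t. Lebesgue measure on the sphere
    return kappa**(p/2 - 1) / ((2*np.pi)**(p/2) * iv(p/2 - 1, kappa))

def C_star(kappa, p):  # Mardia & Jupp's constant, w.r.t. the uniform distribution
    return (kappa/2)**(p/2 - 1) / (gamma(p/2) * iv(p/2 - 1, kappa))

p, kappa = 5, 2.7
C0 = gamma(p/2) / (2 * np.pi**(p/2))          # uniform density C_p(0)
assert np.isclose(C_star(kappa, p), C(kappa, p) / C0)
```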
The support of the von Mises–Fisher distribution is the hypersphere, or more specifically, the (p − 1)-sphere, denoted as

{\displaystyle S^{p-1}=\left\{\mathbf {x} \in \mathbb {R} ^{p}:\left\|\mathbf {x} \right\|=1\right\}}

This is a (p − 1)-dimensional manifold embedded in p-dimensional Euclidean space, ℝ^p.

Starting from a normal distribution with isotropic covariance κ⁻¹I and mean μ of length r > 0, whose density function is:
{\displaystyle G_{p}(\mathbf {x} ;{\boldsymbol {\mu }},\kappa )=\left({\sqrt {\frac {\kappa }{2\pi }}}\right)^{p}\exp \left(-\kappa {\frac {(\mathbf {x} -{\boldsymbol {\mu }})'(\mathbf {x} -{\boldsymbol {\mu }})}{2}}\right),}

the von Mises–Fisher distribution is obtained by conditioning on ‖x‖ = 1. By expanding

{\displaystyle (\mathbf {x} -{\boldsymbol {\mu }})'(\mathbf {x} -{\boldsymbol {\mu }})=\mathbf {x} '\mathbf {x} +{\boldsymbol {\mu }}'{\boldsymbol {\mu }}-2{\boldsymbol {\mu }}'\mathbf {x} ,}

and using the fact that the first two right-hand-side terms are fixed, the von Mises–Fisher density f_p(x; r⁻¹μ, rκ) is recovered by recomputing the normalization constant by integrating x over the unit sphere. If r = 0, we get the uniform distribution, with density f_p(x; 0, 0).

More succinctly, the restriction of any isotropic multivariate normal density to the unit hypersphere gives a von Mises–Fisher density, up to normalization.
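This restriction property can be checked numerically: on the unit sphere, the log of the isotropic Gaussian density and κμ′x differ only by a constant. A small sketch (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
p, kappa, r = 4, 3.0, 2.0
mu = r * np.eye(p)[0]                  # Gaussian mean of length r

def log_G(x):                          # isotropic normal, covariance (1/kappa) I
    return (p/2) * np.log(kappa / (2*np.pi)) - kappa/2 * np.sum((x - mu)**2)

X = rng.standard_normal((5, p))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # points on the unit sphere
diffs = [log_G(x) - (r*kappa) * np.dot(mu/r, x) for x in X]
print(np.ptp(diffs))  # ~0: restriction equals VMF(mu/r, r*kappa) up to a constant
```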
This construction can be generalized by starting with a normal distribution with a general covariance matrix, in which case conditioning on ‖x‖ = 1 gives the Fisher–Bingham distribution.
A series of N independent unit vectors x_i is drawn from a von Mises–Fisher distribution. The maximum likelihood estimate of the mean direction μ is simply the normalized arithmetic mean, a sufficient statistic:
{\displaystyle \mu ={\bar {x}}/{\bar {R}},{\text{ where }}{\bar {x}}={\frac {1}{N}}\sum _{i}^{N}x_{i},{\text{ and }}{\bar {R}}=\|{\bar {x}}\|.}

Use the modified Bessel function of the first kind to define

{\displaystyle A_{p}(\kappa )={\frac {I_{p/2}(\kappa )}{I_{p/2-1}(\kappa )}}.}

Then:

{\displaystyle \kappa =A_{p}^{-1}({\bar {R}}).}

Thus κ is the solution to

{\displaystyle A_{p}(\kappa )={\frac {\left\|\sum _{i}^{N}x_{i}\right\|}{N}}={\bar {R}}.}

A simple approximation to κ (Sra, 2011) is
{\displaystyle {\hat {\kappa }}={\frac {{\bar {R}}(p-{\bar {R}}^{2})}{1-{\bar {R}}^{2}}},}

and a more accurate inversion can be obtained by iterating the Newton method a few times:

{\displaystyle {\hat {\kappa }}_{1}={\hat {\kappa }}-{\frac {A_{p}({\hat {\kappa }})-{\bar {R}}}{1-A_{p}({\hat {\kappa }})^{2}-{\frac {p-1}{\hat {\kappa }}}A_{p}({\hat {\kappa }})}},\qquad {\hat {\kappa }}_{2}={\hat {\kappa }}_{1}-{\frac {A_{p}({\hat {\kappa }}_{1})-{\bar {R}}}{1-A_{p}({\hat {\kappa }}_{1})^{2}-{\frac {p-1}{{\hat {\kappa }}_{1}}}A_{p}({\hat {\kappa }}_{1})}}.}
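Putting the estimators together, a minimal sketch in Python (the function names, the use of SciPy's scaled Bessel function `ive`, and the fixed two Newton steps are choices of this sketch):

```python
import numpy as np
from scipy.special import ive

def A(kappa, p):
    # I_{p/2}(k) / I_{p/2-1}(k); the exp(-k) scaling of ive cancels in the ratio
    return ive(p/2, kappa) / ive(p/2 - 1, kappa)

def fit_vmf(X):
    """ML estimates (mu, kappa) from an N-by-p array of unit row vectors."""
    N, p = X.shape
    xbar = X.mean(axis=0)
    Rbar = np.linalg.norm(xbar)
    mu = xbar / Rbar
    kappa = Rbar * (p - Rbar**2) / (1 - Rbar**2)   # Sra's approximation
    for _ in range(2):                              # Newton refinement
        a = A(kappa, p)
        kappa -= (a - Rbar) / (1 - a**2 - (p - 1) / kappa * a)
    return mu, kappa
```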
For N ≥ 25, the estimated spherical standard error of the sample mean direction can be computed as:

{\displaystyle {\hat {\sigma }}=\left({\frac {d}{N{\bar {R}}^{2}}}\right)^{1/2}}

where

{\displaystyle d=1-{\frac {1}{N}}\sum _{i}^{N}\left(\mu ^{T}x_{i}\right)^{2}}

It is then possible to approximate a 100(1 − α)% confidence interval (a spherical confidence cone) about μ with semi-vertical angle:
{\displaystyle q=\arcsin \left(e_{\alpha }^{1/2}{\hat {\sigma }}\right),}

where e_α = −ln(α). For example, for a 95% confidence cone, α = 0.05, e_α = −ln(0.05) = 2.996, and thus q = arcsin(1.731 σ̂).
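A sketch of the confidence-cone computation, assuming the rows of X are the observed unit vectors and mu is the estimated mean direction (names are illustrative):

```python
import numpy as np

def confidence_cone(X, mu, alpha=0.05):
    """Semi-vertical angle q (radians) of the 100(1-alpha)% cone; assumes N >= 25."""
    N = X.shape[0]
    Rbar = np.linalg.norm(X.mean(axis=0))
    d = 1 - np.mean((X @ mu)**2)
    sigma = np.sqrt(d / (N * Rbar**2))
    return np.arcsin(np.sqrt(-np.log(alpha)) * sigma)
```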
The expected value of the von Mises–Fisher distribution is not on the unit hypersphere, but instead has a length of less than one. This length is given by A_p(κ) as defined above. For a von Mises–Fisher distribution with mean direction μ and concentration κ > 0, the expected value is A_p(κ)μ.

For κ = 0, the expected value is at the origin. For finite κ > 0, the length of the expected value is strictly between zero and one and is a monotonic rising function of κ.

The empirical mean (arithmetic average) of a collection of points on the unit hypersphere behaves in a similar manner, being close to the origin for widely spread data and close to the sphere for concentrated data. Indeed, for the von Mises–Fisher distribution, the expected value of the maximum-likelihood estimate based on a collection of points is equal to the empirical mean of those points.
The expected value can be used to compute differential entropy and KL divergence.
The differential entropy of VMF(μ, κ) is:

{\displaystyle {\bigl \langle }-\log f_{p}(\mathbf {x} ;{\boldsymbol {\mu }},\kappa ){\bigr \rangle }_{\mathbf {x} \sim {\text{VMF}}({\boldsymbol {\mu }},\kappa )}=-\log f_{p}(A_{p}(\kappa ){\boldsymbol {\mu }};{\boldsymbol {\mu }},\kappa )=-\log C_{p}(\kappa )-\kappa A_{p}(\kappa )}

where the angle brackets denote expectation. Notice that the entropy is a function of κ only.

The KL divergence between VMF(μ₀, κ₀) and VMF(μ₁, κ₁) is:

{\displaystyle {\Bigl \langle }\log {\frac {f_{p}(\mathbf {x} ;{\boldsymbol {\mu _{0}}},\kappa _{0})}{f_{p}(\mathbf {x} ;{\boldsymbol {\mu _{1}}},\kappa _{1})}}{\Bigr \rangle }_{\mathbf {x} \sim {\text{VMF}}({\boldsymbol {\mu _{0}}},\kappa _{0})}=\log {\frac {f_{p}(A_{p}(\kappa _{0}){\boldsymbol {\mu _{0}}};{\boldsymbol {\mu _{0}}},\kappa _{0})}{f_{p}(A_{p}(\kappa _{0}){\boldsymbol {\mu _{0}}};{\boldsymbol {\mu _{1}}},\kappa _{1})}}}
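Both quantities depend on the distributions only through log C_p and A_p, as a sketch makes explicit (illustrative helper names; assumes κ > 0):

```python
import numpy as np
from scipy.special import ive

def log_C(kappa, p):
    nu = p/2 - 1
    return nu*np.log(kappa) - (p/2)*np.log(2*np.pi) - (np.log(ive(nu, kappa)) + kappa)

def A(kappa, p):
    return ive(p/2, kappa) / ive(p/2 - 1, kappa)

def vmf_entropy(kappa, p):
    return -log_C(kappa, p) - kappa * A(kappa, p)

def vmf_kl(mu0, kappa0, mu1, kappa1):
    p = len(mu0)
    # log C0 - log C1 + (kappa0*mu0 - kappa1*mu1).E[x], with E[x] = A(kappa0, p)*mu0
    return (log_C(kappa0, p) - log_C(kappa1, p)
            + A(kappa0, p) * (kappa0 - kappa1 * np.dot(mu1, mu0)))
```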
Von Mises–Fisher (VMF) distributions are closed under orthogonal linear transforms. Let U be a p-by-p orthogonal matrix. Let x ∼ VMF(μ, κ) and apply the invertible linear transform: y = Ux. The inverse transform is x = U′y, because the inverse of an orthogonal matrix is its transpose: U⁻¹ = U′. The Jacobian of the transform is U, for which the absolute value of its determinant is 1, also because of the orthogonality. Using these facts and the form of the VMF density, it follows that:

{\displaystyle \mathbf {y} \sim {\text{VMF}}(\mathbf {U} {\boldsymbol {\mu }},\kappa ).}

One may verify that since μ and x are unit vectors, then by the orthogonality, so are Uμ and y.
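The closure property follows from the invariance of the exponent μ′x under a joint rotation, which a quick numerical check illustrates (scipy.stats.ortho_group supplies a random orthogonal matrix):

```python
import numpy as np
from scipy.stats import ortho_group

rng = np.random.default_rng(1)
p, kappa = 4, 5.0
mu = np.eye(p)[0]
U = ortho_group.rvs(p, random_state=rng)   # random orthogonal matrix
x = rng.standard_normal(p)
x /= np.linalg.norm(x)
y = U @ x
# The exponent is invariant: mu'x = (U mu)'(U x), so densities match pointwise.
assert np.isclose(kappa * mu @ x, kappa * (U @ mu) @ y)
```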
An algorithm for drawing pseudo-random samples from the von Mises–Fisher (VMF) distribution was given by Ulrich and later corrected by Wood. An implementation in R is given by Hornik and Grün, and a fast Python implementation is described by Pinzón and Jung.

To simulate from a VMF distribution on the (p − 1)-dimensional unit sphere, S^{p−1}, with mean direction μ ∈ S^{p−1}, these algorithms use the following radial-tangential decomposition for a point x ∈ S^{p−1} ⊂ ℝ^p:

{\displaystyle \mathbf {x} =t{\boldsymbol {\mu }}+{\sqrt {1-t^{2}}}\mathbf {v} }

where v ∈ ℝ^p lives in the tangential (p − 2)-dimensional unit subsphere that is centered at and perpendicular to μ, while t ∈ [−1, 1]. To draw a sample x from a VMF with parameters μ and κ, v must be drawn from the uniform distribution on the tangential subsphere, and the radial component t must be drawn independently from the distribution with density:

{\displaystyle f_{\text{radial}}(t;\kappa ,p)={\frac {(\kappa /2)^{\nu }}{\Gamma ({\frac {1}{2}})\Gamma (\nu +{\frac {1}{2}})I_{\nu }(\kappa )}}e^{t\kappa }(1-t^{2})^{\nu -{\frac {1}{2}}}}

where ν = p/2 − 1. The normalization constant for this density may be verified by using:

{\displaystyle I_{\nu }(\kappa )={\frac {(\kappa /2)^{\nu }}{\Gamma ({\frac {1}{2}})\Gamma (\nu +{\frac {1}{2}})}}\int _{-1}^{1}e^{t\kappa }(1-t^{2})^{\nu -{\frac {1}{2}}}\,dt}

as given in Appendix 1 (A.3) in Directional Statistics. Drawing the t samples from this density by using a rejection sampling algorithm is explained in the above references. To draw the uniform v samples perpendicular to μ, see the algorithm in the references above; otherwise, a Householder transform can be used, as explained in Algorithm 1 there.
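A compact sketch of such a sampler is given below. The envelope constants follow Wood's rejection scheme as commonly implemented, and a Householder reflection rotates samples from the north pole to μ; this is an illustration under those assumptions, not a reference implementation:

```python
import numpy as np

def sample_vmf(mu, kappa, n, rng=None):
    """Draw n samples from VMF(mu, kappa) on S^{p-1} (Ulrich/Wood-style sketch)."""
    rng = np.random.default_rng(rng)
    p = len(mu)
    # Envelope parameters for the radial component t (Wood's construction)
    b = (-2*kappa + np.sqrt(4*kappa**2 + (p - 1)**2)) / (p - 1)
    x0 = (1 - b) / (1 + b)
    c = kappa*x0 + (p - 1)*np.log(1 - x0**2)
    samples = np.empty((n, p))
    for i in range(n):
        while True:  # rejection sampling for t = mu'x
            z = rng.beta((p - 1)/2, (p - 1)/2)
            t = (1 - (1 + b)*z) / (1 - (1 - b)*z)
            if kappa*t + (p - 1)*np.log(1 - x0*t) - c >= np.log(rng.uniform()):
                break
        # v: uniform on the (p-2)-subsphere orthogonal to the north pole e_p
        v = rng.standard_normal(p - 1)
        v /= np.linalg.norm(v)
        samples[i] = np.append(np.sqrt(1 - t**2)*v, t)  # sample around e_p
    # Householder reflection mapping e_p to mu rotates all samples into place
    e = np.zeros(p); e[-1] = 1.0
    u = e - mu
    norm_u = np.linalg.norm(u)
    if norm_u > 1e-12:
        u /= norm_u
        samples = samples - 2*np.outer(samples @ u, u)
    return samples
```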
To generate a von Mises–Fisher distributed pseudo-random spherical 3-D unit vector X_s on the sphere S² for a given μ and κ, define

{\displaystyle \mathbf {X} _{s}=(r\sin \theta \cos \phi ,\,r\sin \theta \sin \phi ,\,r\cos \theta )^{\mathsf {T}}}

where θ is the polar angle, ϕ the azimuthal angle, and r = 1 the distance to the center of the sphere. For μ = (0, 0, 1)^T, the pseudo-random triplet is then given by

{\displaystyle \mathbf {X} _{s}=({\sqrt {1-W^{2}}}\cos V,\,{\sqrt {1-W^{2}}}\sin V,\,W)^{\mathsf {T}}}
where V is sampled from the continuous uniform distribution U(a, b) with lower bound a and upper bound b,

{\displaystyle V\sim U(0,2\pi )}

and

{\displaystyle W=\cos \theta =1+{\frac {1}{\kappa }}(\ln \xi +\ln(1-{\frac {\xi -1}{\xi }}e^{-2\kappa }))}

where ξ is sampled from the standard continuous uniform distribution U(0, 1),

{\displaystyle \xi \sim U(0,1)}

Here, W should be set to W = 1 when ξ = 0, and X_s rotated to match any other desired μ.
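The recipe translates directly into vectorized Python; a sketch assuming mean direction (0, 0, 1) (sampling ξ on (0, 1] sidesteps the ξ = 0 corner case):

```python
import numpy as np

def sample_vmf3(kappa, n, rng=None):
    """n pseudo-random vMF samples on S^2 with mean direction (0, 0, 1)."""
    rng = np.random.default_rng(rng)
    V = rng.uniform(0.0, 2*np.pi, n)                # azimuthal angle
    xi = 1.0 - rng.uniform(0.0, 1.0, n)             # in (0, 1]: avoids xi = 0
    W = 1 + (np.log(xi) + np.log1p(-(xi - 1)/xi * np.exp(-2*kappa))) / kappa
    s = np.sqrt(np.clip(1 - W**2, 0, None))
    return np.column_stack((s*np.cos(V), s*np.sin(V), W))
```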
For p = 3, the angle θ between x and μ satisfies cos θ = μ^T x. It has the distribution

{\displaystyle p(\theta )=\int d^{2}x\,f(x;{\boldsymbol {\mu }},\kappa )\,\delta \left(\theta -\arccos({\boldsymbol {\mu }}^{\mathsf {T}}\mathbf {x} )\right),}

which can be easily evaluated as

{\displaystyle p(\theta )=2\pi C_{3}(\kappa )\,\sin \theta \,e^{\kappa \cos \theta }.}

For the general case, p ≥ 2, the distribution of the cosine of this angle, cos θ = t = μ^T x, is given by f_radial(t; κ, p), as explained above.
When κ = 0, the von Mises–Fisher distribution VMF(μ, κ) on S^{p−1} simplifies to the uniform distribution on S^{p−1} ⊂ ℝ^p. The density is constant with value C_p(0). Pseudo-random samples can be generated by generating samples in ℝ^p from the standard multivariate normal distribution, followed by normalization to unit norm.
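A minimal sketch of this sampler:

```python
import numpy as np

def sample_uniform_sphere(n, p, rng=None):
    rng = np.random.default_rng(rng)
    X = rng.standard_normal((n, p))                      # isotropic Gaussian
    return X / np.linalg.norm(X, axis=1, keepdims=True)  # project onto S^{p-1}
```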
For 1 ≤ i ≤ p, let x_i be any component of x ∈ S^{p−1}. The marginal distribution for x_i has the density:

{\displaystyle f_{i}(x_{i};p)=f_{\text{radial}}(x_{i};\kappa =0,p)={\frac {(1-x_{i}^{2})^{{\frac {p-1}{2}}-1}}{B{\bigl (}{\frac {1}{2}},{\frac {p-1}{2}}{\bigr )}}}}

where B(α, β) is the beta function. This distribution may be better understood by highlighting its relation to the beta distribution:

{\displaystyle {\begin{aligned}x_{i}^{2}&\sim {\text{Beta}}{\bigl (}{\frac {1}{2}},{\frac {p-1}{2}}{\bigr )}&&{\text{and}}&{\frac {x_{i}+1}{2}}&\sim {\text{Beta}}{\bigl (}{\frac {p-1}{2}},{\frac {p-1}{2}}{\bigr )}\end{aligned}}}

where the Legendre duplication formula is useful to understand the relationships between the normalization constants of the various densities above.
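The beta relation can be checked against simulated uniform samples, for example with a Kolmogorov–Smirnov test (a sketch; the component index is arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
p, n = 8, 100_000
X = rng.standard_normal((n, p))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # uniform samples on S^{p-1}
x1 = X[:, 0]                                    # any fixed component
# (x_i + 1)/2 should follow Beta((p-1)/2, (p-1)/2); a large p-value is consistent
print(stats.kstest((x1 + 1) / 2, stats.beta((p - 1) / 2, (p - 1) / 2).cdf))
```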
Note that the components of x ∈ S^{p−1} are not independent, so that the uniform density is not the product of the marginal densities; x cannot be assembled by independent sampling of the components.

In machine learning, especially in image classification, to-be-classified inputs (e.g. images) are often compared using cosine similarity, which is the dot product between intermediate representations in the form of unit vectors (termed embeddings). The dimensionality is typically high, with p at least several hundred. The deep neural networks that extract embeddings for classification should learn to spread the classes as far apart as possible, and ideally this should give classes that are uniformly distributed on S^{p−1}. For a better statistical understanding of across-class cosine similarity, the distribution of dot products between unit vectors independently sampled from the uniform distribution may be helpful.
Let x, y ∈ S^{p−1} be unit vectors in ℝ^p, independently sampled from the uniform distribution. Define:

{\displaystyle t=\mathbf {x} ^{\mathsf {T}}\mathbf {y} ,\qquad r={\frac {1+t}{2}},\qquad s=\log {\frac {r}{1-r}}}

where t is the dot product and r, s are transformed versions of it. Then the distribution for t is the same as the marginal component distribution given above; the distribution for r is symmetric beta, and the distribution for s is symmetric logistic-beta:

{\displaystyle {\begin{aligned}r&\sim {\text{Beta}}{\bigl (}{\frac {p-1}{2}},{\frac {p-1}{2}}{\bigr )},&s&\sim B_{\sigma }{\bigl (}{\frac {p-1}{2}},{\frac {p-1}{2}}{\bigr )}\end{aligned}}}

The means and variances are:
{\displaystyle {\begin{aligned}E[t]&=0,&E[r]&={\frac {1}{2}},&E[s]&=0,\end{aligned}}}

and

{\displaystyle {\begin{aligned}{\text{var}}[t]&={\frac {1}{p}},&{\text{var}}[r]&={\frac {1}{4p}},&{\text{var}}[s]&=2\psi '{\bigl (}{\frac {p-1}{2}}{\bigr )}\approx {\frac {4}{p-1}}\end{aligned}}}

where ψ′ = ψ^{(1)} is the first polygamma function. The variances decrease, the distributions of all three variables become more Gaussian, and the final approximation gets better as the dimensionality p is increased.
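A simulation sketch comparing the empirical variances of t and s with the formulas above (dimension and sample size are arbitrary choices):

```python
import numpy as np
from scipy.special import polygamma

rng = np.random.default_rng(0)
p, n = 512, 100_000
X = rng.standard_normal((n, p)); X /= np.linalg.norm(X, axis=1, keepdims=True)
Y = rng.standard_normal((n, p)); Y /= np.linalg.norm(Y, axis=1, keepdims=True)
t = np.sum(X * Y, axis=1)                  # dot products of independent uniform pairs
s = np.log((1 + t) / (1 - t))              # = logit((1 + t)/2)
print(t.var(), 1 / p)                               # both ~ 1/p
print(s.var(), 2 * polygamma(1, (p - 1) / 2))       # both ~ 4/(p - 1)
```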
The matrix von Mises–Fisher distribution (also known as the matrix Langevin distribution) has the density

{\displaystyle f_{n,p}(\mathbf {X} ;\mathbf {F} )\propto \exp(\operatorname {tr} (\mathbf {F} ^{\mathsf {T}}\mathbf {X} ))}

supported on the Stiefel manifold of n × p orthonormal p-frames X, where F is an arbitrary n × p real matrix.
Ulrich, in designing an algorithm for sampling from the VMF distribution, makes use of a family of distributions named after and explored by John G. Saw. A Saw distribution is a distribution on the (p − 1)-sphere, S^{p−1}, with modal vector μ ∈ S^{p−1} and concentration κ ≥ 0, whose density function has the form:

{\displaystyle f_{\text{Saw}}(\mathbf {x} ;{\boldsymbol {\mu }},\kappa )={\frac {g(\kappa \mathbf {x} '{\boldsymbol {\mu }})}{K_{p}(\kappa )}}}

where g is a non-negative, increasing function, and where K_p(κ) is the normalization constant. The above-mentioned radial-tangential decomposition generalizes to the Saw family, and the radial component t = x′μ has the density:

{\displaystyle f_{\text{Saw-radial}}(t;\kappa )={\frac {2\pi ^{p/2}}{\Gamma (p/2)}}{\frac {g(\kappa t)(1-t^{2})^{(p-3)/2}}{B{\bigl (}{\frac {1}{2}},{\frac {p-1}{2}}{\bigr )}K_{p}(\kappa )}}.}

where B is the beta function. Also notice that the left-hand factor of the radial density is the surface area of S^{p−1}. By setting g(κx′μ) = e^{κx′μ}, one recovers the VMF distribution.
The definition of the von Mises–Fisher distribution can be extended to include also the case where p = 1, so that the support is the 0-dimensional hypersphere, which when embedded into 1-dimensional Euclidean space is the discrete set {−1, 1}. The mean direction is μ ∈ {−1, 1} and the concentration is κ ≥ 0. The probability mass function, for x ∈ {−1, 1}, is:

{\displaystyle f_{1}(x\mid \mu ,\kappa )={\frac {e^{\kappa \mu x}}{e^{-\kappa }+e^{\kappa }}}=\sigma (2\kappa \mu x)}

where σ(z) = 1/(1 + e^{−z}) is the logistic sigmoid. In the uniform case, at κ = 0, this simplifies to the Rademacher distribution.
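As a final sketch, the p = 1 probability mass function (the function name is illustrative):

```python
import numpy as np

def vmf1_pmf(x, mu, kappa):
    """PMF of the p = 1 case on x in {-1, +1}: sigma(2*kappa*mu*x)."""
    return 1.0 / (1.0 + np.exp(-2.0 * kappa * mu * x))

# The two outcomes sum to one, as required of a probability mass function.
assert np.isclose(vmf1_pmf(+1, 1, 0.7) + vmf1_pmf(-1, 1, 0.7), 1.0)
```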