Density estimation and sample analysis

The inference.pdf module provides tools for analysing sample data, including density estimation and highest-density interval calculation. Example code for GaussianKDE and UnimodalPdf can be found in the density estimation Jupyter notebook demo.

GaussianKDE

class inference.pdf.GaussianKDE(sample: ndarray, bandwidth: float = None, cross_validation: bool = False, max_cv_samples=5000)

Construct a GaussianKDE object, which can be called as a function to return the estimated PDF of the given sample.

GaussianKDE uses Gaussian kernel-density estimation to estimate the PDF associated with a given sample.

Parameters:

sample – 1D array of samples from which to estimate the probability distribution.
bandwidth (float) – Width of the Gaussian kernels used for the estimate. If not specified, an appropriate width is estimated from the sample data.
cross_validation (bool) – Whether to estimate the bandwidth by cross-validation instead of the simple ‘rule of thumb’ estimate which is used by default.
max_cv_samples (int) – The maximum number of samples used when estimating the bandwidth via cross-validation. The computational cost scales roughly quadratically with the number of samples used, and can become prohibitive for samples of tens of thousands of points or more. If the sample size is greater than max_cv_samples, the cross-validation is performed on a sub-sample of this size instead.
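The two bandwidth options and the effect of max_cv_samples can be illustrated with a small NumPy sketch. This is not the library's implementation: the rule of thumb shown is Silverman's rule, the cross-validation criterion is a leave-one-out log-likelihood searched over a fixed grid, and the function names (rule_of_thumb_bandwidth, loo_log_likelihood, cross_validated_bandwidth) are hypothetical. All of these are assumptions chosen for illustration.

```python
import numpy as np


def rule_of_thumb_bandwidth(sample):
    # Silverman's rule of thumb: an assumed stand-in for the library's own rule.
    n = sample.size
    sigma = sample.std(ddof=1)
    q25, q75 = np.percentile(sample, [25, 75])
    return 0.9 * min(sigma, (q75 - q25) / 1.34) * n ** (-0.2)


def loo_log_likelihood(sample, bandwidth):
    # The n-by-n matrix of pairwise kernel values is the source of the
    # roughly quadratic cost mentioned in the parameter description.
    n = sample.size
    z = (sample[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    np.fill_diagonal(kernels, 0.0)  # leave each point out of its own estimate
    density = kernels.sum(axis=1) / (n - 1)
    return float(np.sum(np.log(density + np.finfo(float).tiny)))


def cross_validated_bandwidth(sample, max_cv_samples=5000, seed=0):
    sample = np.asarray(sample, dtype=float)
    if sample.size > max_cv_samples:
        # Large samples are reduced to a random sub-sample before the search.
        rng = np.random.default_rng(seed)
        sample = rng.choice(sample, size=max_cv_samples, replace=False)
    # Search a grid of candidate widths around the rule-of-thumb value.
    candidates = rule_of_thumb_bandwidth(sample) * np.geomspace(0.25, 4.0, 25)
    scores = [loo_log_likelihood(sample, h) for h in candidates]
    return float(candidates[int(np.argmax(scores))])


rng = np.random.default_rng(1)
sample = rng.normal(size=3000)
h_rot = rule_of_thumb_bandwidth(sample)
h_cv = cross_validated_bandwidth(sample, max_cv_samples=500)
```

Capping the sub-sample at 500 points keeps the pairwise matrix at 500 x 500 regardless of how large the full sample is, which is the trade-off max_cv_samples controls.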

__call__(x: ndarray) → ndarray

Evaluate the estimate of the probability density function (PDF) at the given parameter values.

Parameters: x – axis location(s) at which to evaluate the estimate.
Returns: values of the PDF estimate at the specified locations.
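As a rough picture of what evaluating a Gaussian kernel-density estimate involves, the sketch below averages one Gaussian kernel per sample point at each requested location. The function name gaussian_kde_pdf and the fixed bandwidth of 0.4 are assumptions for illustration, and the library's internals may be organised differently.

```python
import numpy as np


def gaussian_kde_pdf(x, sample, bandwidth):
    # One Gaussian kernel is centred on every sample point; the estimate at
    # each location in x is the mean of the kernel values there.
    x = np.atleast_1d(np.asarray(x, dtype=float))
    sample = np.asarray(sample, dtype=float)
    z = (x[:, None] - sample[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=1)


rng = np.random.default_rng(0)
sample = rng.normal(loc=1.0, scale=2.0, size=2000)
axis = np.linspace(-11.0, 13.0, 1201)
pdf = gaussian_kde_pdf(axis, sample, bandwidth=0.4)

# Trapezoidal integral of the estimate over the axis; it should be close to 1.
area = float(np.sum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(axis)))
```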

cdf(x: ndarray) → ndarray

Evaluate the estimate of the cumulative distribution function (CDF) at the given parameter values.

Parameters: x – axis location(s) at which to evaluate the estimate.
Returns: values of the CDF estimate at the specified locations.
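For a Gaussian kernel-density estimate the CDF has a closed form: it is the average of the normal CDFs of the individual kernels. The sketch below shows this using only NumPy and the standard library; gaussian_kde_cdf is a hypothetical name and the bandwidth is chosen by hand, so this illustrates the idea and does not reproduce the library's code.

```python
import math

import numpy as np


def gaussian_kde_cdf(x, sample, bandwidth):
    # Each kernel contributes a normal CDF centred on its sample point;
    # the CDF of the estimate is the mean of those contributions.
    x = np.atleast_1d(np.asarray(x, dtype=float))
    sample = np.asarray(sample, dtype=float)
    z = (x[:, None] - sample[None, :]) / (bandwidth * math.sqrt(2.0))
    erf = np.vectorize(math.erf)  # element-wise error function without SciPy
    return (0.5 * (1.0 + erf(z))).mean(axis=1)


rng = np.random.default_rng(0)
sample = rng.normal(size=1000)
x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
cdf = gaussian_kde_cdf(x, sample, bandwidth=0.3)
```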

interval(fraction: float) → tuple[float, float]

Calculates the ‘highest-density interval’, the shortest single interval which contains a chosen fraction of the total probability.

Parameters: fraction – Fraction of the total probability contained by the interval. The given value must be between 0 and 1.
Returns: A tuple of the lower and upper limits of the highest-density interval in the form (lower_limit, upper_limit).
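The definition above (the shortest single interval holding a given fraction of the probability) can be made concrete with a grid-based sketch. It assumes the density is available on a fine axis and is strictly positive there, builds a numerical CDF, and scans candidate lower limits for the narrowest interval. The helper grid_hdi is hypothetical and is not how the library necessarily computes the interval.

```python
import numpy as np


def grid_hdi(axis, pdf, fraction):
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be between 0 and 1")
    # Numerical CDF via a cumulative trapezoidal integral, normalised to 1.
    steps = 0.5 * (pdf[1:] + pdf[:-1]) * np.diff(axis)
    cdf = np.concatenate([[0.0], np.cumsum(steps)])
    cdf /= cdf[-1]
    # Every interval [Q(p), Q(p + fraction)] contains the requested
    # probability, where Q is the inverse CDF; keep the narrowest one.
    lower_p = np.linspace(0.0, 1.0 - fraction, 501)
    lower = np.interp(lower_p, cdf, axis)
    upper = np.interp(lower_p + fraction, cdf, axis)
    i = int(np.argmin(upper - lower))
    return float(lower[i]), float(upper[i])


# Standard normal density: the 95% interval should be close to (-1.96, 1.96).
axis = np.linspace(-6.0, 6.0, 2001)
pdf = np.exp(-0.5 * axis**2) / np.sqrt(2.0 * np.pi)
lower, upper = grid_hdi(axis, pdf, 0.95)
```

For a skewed density the interval is not centred on the median: it shifts toward the mode, which is what distinguishes a highest-density interval from an equal-tailed one.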

plot_summary(filename=None, show=True, label=None)

Plot the estimated PDF along with summary statistics.

Parameters:

filename (str) – Filename to which the plot will be saved. If unspecified, the plot will not be saved.
show (bool) – Boolean value indicating whether the plot should be displayed in a window. (Default is True)
label (str) – Label for the x-axis of the plot.

DiffusionKDE

class inference.pdf.DiffusionKDE(sample: ndarray, limits: tuple[float, float] = None)

Construct a DiffusionKDE object, which can be called as a function to return the estimated PDF of the given sample.

DiffusionKDE uses the diffusion-based kernel density estimation method of Botev et al. (2010) to estimate the PDF associated with a given sample.

Parameters:

sample – 1D array of samples from which to estimate the probability distribution.
limits – Lower and upper bounds of the interval on which the density estimate is constructed, given as a tuple (lower, upper). If not specified, the interval is set to (min - range/2, max + range/2).
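The default interval described for limits extends the sample range by half its width on each side. A minimal sketch of that arithmetic, with default_limits as a hypothetical helper name:

```python
import numpy as np


def default_limits(sample):
    # (min - range/2, max + range/2), as stated in the parameter description.
    sample = np.asarray(sample, dtype=float)
    lo, hi = sample.min(), sample.max()
    half_range = 0.5 * (hi - lo)
    return float(lo - half_range), float(hi + half_range)


limits = default_limits([2.0, 3.0, 6.0])
print(limits)  # (0.0, 8.0)
```

Here the sample spans 2 to 6, so the range is 4 and the interval grows by 2 on each side.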

__call__(x: ndarray) → ndarray

Evaluate the estimate of the probability density function (PDF) at the given parameter values.

Parameters: x – Axis location(s) at which to evaluate the estimate.
Returns: Values of the PDF estimate at the specified locations.

cdf(x: ndarray) → ndarray

Evaluate the estimate of the cumulative distribution function (CDF) at the given parameter values.

Parameters: x – Axis location(s) at which to evaluate the estimate.
Returns: Values of the CDF estimate at the specified locations.

interval(fraction: float) → tuple[float, float]

Calculates the ‘highest-density interval’, the shortest single interval which contains a chosen fraction of the total probability.

Parameters: fraction – Fraction of the total probability contained by the interval. The given value must be between 0 and 1.
Returns: A tuple of the lower and upper limits of the highest-density interval in the form (lower_limit, upper_limit).

plot_summary(filename=None, show=True, label=None)

Plot the estimated PDF along with summary statistics.

Parameters:

filename (str) – Filename to which the plot will be saved. If unspecified, the plot will not be saved.
show (bool) – Boolean value indicating whether the plot should be displayed in a window. (Default is True)
label (str) – Label for the x-axis of the plot.

UnimodalPdf

class inference.pdf.UnimodalPdf(sample: ndarray)

Construct a UnimodalPdf object, which can be called as a function to return the estimated PDF of the given sample.

The UnimodalPdf class is designed to robustly estimate univariate, unimodal probability distributions from a sample drawn from that distribution. This is a parametric method based on a heavily modified Student-t distribution, which is flexible enough to describe a wide range of unimodal shapes.

Parameters: sample – 1D array of samples from which to estimate the probability distribution.

__call__(x: ndarray) → ndarray

Evaluate the PDF estimate at a set of given axis positions.

Parameters: x – axis location(s) at which to evaluate the estimate.
Returns: values of the PDF estimate at the specified locations.

interval(fraction: float) → tuple[float, float]

Calculates the ‘highest-density interval’, the shortest single interval which contains a chosen fraction of the total probability.

Parameters: fraction – Fraction of the total probability contained by the interval. The given value must be between 0 and 1.
Returns: A tuple of the lower and upper limits of the highest-density interval in the form (lower_limit, upper_limit).

plot_summary(filename=None, show=True, label=None)

Plot the estimated PDF along with summary statistics.

Parameters:

filename (str) – Filename to which the plot will be saved. If unspecified, the plot will not be saved.
show (bool) – Boolean value indicating whether the plot should be displayed in a window. (Default is True)
label (str) – Label for the x-axis of the plot.

sample_hdi

inference.pdf.sample_hdi(sample: ndarray, fraction: float) → ndarray

Estimate the highest-density interval(s) for a given sample.

This function computes the shortest possible interval which contains a chosen fraction of the elements in the given sample.

Parameters:

sample – A sample for which the interval will be determined. If the sample is given as a 2D numpy array, the interval is calculated independently for each column, i.e. given a sample array of shape (m, n) the highest-density intervals are returned as an array of shape (2, n).
fraction (float) – The fraction of the total probability to be contained by the interval.

Returns:

The lower and upper bounds of the highest-density interval(s) as a numpy array.
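The idea behind a sample-based highest-density interval can be sketched in a few lines: sort the sample, slide a window containing the required number of points across it, and keep the narrowest window. The functions below are hypothetical illustrations written under that assumption. How the library rounds the point count, interpolates between points or breaks ties is not stated here and may differ.

```python
import numpy as np


def shortest_interval(sample, fraction):
    s = np.sort(np.asarray(sample, dtype=float))
    n = s.size
    # Number of sorted points each candidate interval must span.
    k = min(max(int(np.ceil(fraction * n)), 2), n)
    # Width of every window of k consecutive sorted points.
    widths = s[k - 1:] - s[: n - k + 1]
    i = int(np.argmin(widths))
    return s[i], s[i + k - 1]


def shortest_intervals(sample, fraction):
    sample = np.asarray(sample, dtype=float)
    if sample.ndim == 1:
        return np.array(shortest_interval(sample, fraction))
    # For shape (m, n) input, treat each of the n columns as its own sample
    # and stack the results into an array of shape (2, n).
    return np.array([shortest_interval(col, fraction) for col in sample.T]).T


small = shortest_intervals([0.0, 1.0, 1.1, 1.2, 5.0, 10.0], 0.5)

rng = np.random.default_rng(0)
columns = rng.normal(size=(4000, 3))
bounds = shortest_intervals(columns, 0.95)  # shape (2, 3)
```

In the small example the window must hold three points, and the tightest group of three is 1.0, 1.1 and 1.2, so the interval excludes both the isolated low value and the long upper tail.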