Density estimation and sample analysis

The inference.pdf module provides tools for analysing sample data, including density estimation and highest-density interval calculation. Example code for GaussianKDE and UnimodalPdf can be found in the density estimation Jupyter notebook demo.

GaussianKDE

class inference.pdf.GaussianKDE(sample, bandwidth=None, cross_validation=False, max_cv_samples=5000)

Construct a GaussianKDE object, which can be called as a function to return the estimated PDF of the given sample.

GaussianKDE uses Gaussian kernel-density estimation to estimate the PDF associated with a given sample.

Parameters

sample – 1D array of samples from which to estimate the probability distribution
bandwidth (float) – Width of the Gaussian kernels used for the estimate. If not specified, an appropriate width is estimated based on sample data.
cross_validation (bool) – If True, the bandwidth is estimated by cross-validation instead of the simple ‘rule of thumb’ estimate used by default.
max_cv_samples (int) – The maximum number of samples to be used when estimating the bandwidth via cross-validation. The computational cost scales roughly quadratically with the number of samples used, and can become prohibitive for samples of size in the tens of thousands and up. To limit this cost, if the sample size is greater than max_cv_samples, the cross-validation is performed on a sub-sample of size max_cv_samples.
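
The role of max_cv_samples can be illustrated with a minimal, standard-library sketch of bandwidth selection by leave-one-out likelihood cross-validation. This is not the library's implementation: the function names loo_log_likelihood and cv_bandwidth, the candidate-list search and the seed argument are all assumptions made for illustration. The nested loop over pairs of sample points is what makes the cost roughly quadratic.

```python
import math
import random


def loo_log_likelihood(sample, bandwidth):
    """Leave-one-out log-likelihood of a Gaussian KDE; cost is O(n^2)."""
    n = len(sample)
    norm = 1.0 / ((n - 1) * bandwidth * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i, xi in enumerate(sample):
        # Sum the kernels centred on every other point, excluding point i itself
        s = sum(
            math.exp(-0.5 * ((xi - xj) / bandwidth) ** 2)
            for j, xj in enumerate(sample)
            if j != i
        )
        # Floor the density to avoid log(0) for isolated points
        total += math.log(max(s * norm, 1e-300))
    return total


def cv_bandwidth(sample, candidates, max_cv_samples=5000, seed=0):
    """Pick the candidate bandwidth with the highest leave-one-out score."""
    if len(sample) > max_cv_samples:
        # Sub-sample to bound the quadratic cost (hypothetical seeding scheme)
        sample = random.Random(seed).sample(list(sample), max_cv_samples)
    return max(candidates, key=lambda h: loo_log_likelihood(sample, h))


# Usage: 300 standard-normal draws, cross-validated on a sub-sample of 200
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(300)]
candidates = [0.05, 0.2, 0.4, 0.8, 3.0]
best = cv_bandwidth(data, candidates, max_cv_samples=200)
```

A real implementation would normally optimise over a continuous bandwidth rather than a short candidate list; the list is used here only to keep the sketch brief.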

__call__(x_vals)

Evaluate the PDF estimate at a set of given axis positions.

Parameters: x_vals – axis location(s) at which to evaluate the estimate.
Returns: values of the PDF estimate at the specified locations.
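
As a rough picture of what evaluating a Gaussian kernel-density estimate involves, the following standard-library sketch averages one Gaussian kernel per sample point at each requested position. The names gaussian_kde and rule_of_thumb_bandwidth are hypothetical, and the normal-reference bandwidth rule used here is an assumption; the rule used by GaussianKDE may differ.

```python
import math
import statistics


def rule_of_thumb_bandwidth(sample):
    """Normal-reference rule of thumb (an assumption, not the library's rule)."""
    n = len(sample)
    return 1.06 * statistics.stdev(sample) * n ** (-0.2)


def gaussian_kde(sample, x_vals, bandwidth=None):
    """Evaluate a Gaussian kernel-density estimate at each position in x_vals."""
    h = rule_of_thumb_bandwidth(sample) if bandwidth is None else bandwidth
    norm = 1.0 / (len(sample) * h * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in sample)
        for x in x_vals
    ]


# Usage: three sample points, unit bandwidth, evaluated at x = 1
sample = [0.0, 1.0, 2.0]
density = gaussian_kde(sample, [1.0], bandwidth=1.0)
```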

interval(frac=0.95)

Calculate the highest-density interval(s) which contain a given fraction of total probability.

Parameters: frac (float) – Fraction of total probability contained by the desired interval(s).
Returns: A list of tuples which specify the intervals.
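
One way to see why the result is a list of tuples is the following sketch, which finds highest-density regions of a density tabulated on a uniform grid: grid points are retained in order of decreasing density until the requested fraction is reached, and runs of adjacent retained points are merged into intervals. The function name density_intervals and the grid-based approach are assumptions for illustration and may not match how interval is implemented.

```python
import math


def density_intervals(x_grid, pdf_vals, frac=0.95):
    """Return [(lower, upper), ...] covering `frac` of the gridded probability.

    Assumes x_grid is uniformly spaced and pdf_vals holds the density there.
    """
    total = sum(pdf_vals)
    # Visit grid points from highest to lowest density until `frac` is reached
    order = sorted(range(len(x_grid)), key=lambda i: pdf_vals[i], reverse=True)
    keep, mass = set(), 0.0
    for i in order:
        if mass >= frac * total:
            break
        keep.add(i)
        mass += pdf_vals[i]
    # Merge runs of adjacent retained grid points into intervals
    intervals, start = [], None
    for i in range(len(x_grid)):
        if i in keep and start is None:
            start = i
        if start is not None and (i + 1 == len(x_grid) or i + 1 not in keep):
            intervals.append((x_grid[start], x_grid[i]))
            start = None
    return intervals


def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


# Usage: a well-separated bimodal density yields two intervals
x_grid = [-6.0 + 0.01 * i for i in range(1201)]
pdf_vals = [0.5 * normal_pdf(x, -3.0, 0.5) + 0.5 * normal_pdf(x, 3.0, 0.5) for x in x_grid]
hdi = density_intervals(x_grid, pdf_vals, frac=0.9)
```

For a unimodal density the same procedure returns a single tuple.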

mode: The mode of the pdf, calculated automatically when an instance of GaussianKDE is created.

plot_summary(filename=None, show=True, label=None)

Plot the estimated PDF along with summary statistics.

Parameters

filename (str) – Filename to which the plot will be saved. If unspecified, the plot will not be saved.
show (bool) – Boolean value indicating whether the plot should be displayed in a window. (Default is True)
label (str) – Label for the x-axis of the plot.

UnimodalPdf

class inference.pdf.UnimodalPdf(sample)

Construct a UnimodalPdf object, which can be called as a function to return the estimated PDF of the given sample.

The UnimodalPdf class is designed to robustly estimate univariate, unimodal probability distributions given a sample drawn from that distribution. This is a parametric method based on a heavily modified Student-t distribution, which is extremely flexible.

Parameters: sample – 1D array of samples from which to estimate the probability distribution

__call__(x)

Evaluate the PDF estimate at a set of given axis positions.

Parameters: x – axis location(s) at which to evaluate the estimate.
Returns: values of the PDF estimate at the specified locations.

plot_summary(filename=None, show=True, label=None)

Plot the estimated PDF along with summary statistics.

Parameters

filename (str) – Filename to which the plot will be saved. If unspecified, the plot will not be saved.
show (bool) – Boolean value indicating whether the plot should be displayed in a window. (Default is True)
label (str) – Label for the x-axis of the plot.

sample_hdi

inference.pdf.sample_hdi(sample: ndarray, fraction: float, allow_double=False)

Estimate the highest-density interval(s) for a given sample.

This function computes the shortest possible interval which contains a chosen fraction of the elements in the given sample.

Parameters

sample – A sample for which the interval will be determined.
fraction (float) – The fraction of the total probability to be contained by the interval.
allow_double (bool) – When set to True, a double-interval is returned instead if one exists whose total length is meaningfully shorter than the optimal single interval.

Returns

Tuple(s) specifying the lower and upper bounds of the highest-density interval(s).
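
The single-interval case described above can be sketched with the standard library alone: sort the sample, slide a window containing the required number of points along it, and keep the narrowest window. The function name shortest_interval is hypothetical, and the details (such as how the point count is rounded, and the absence of any double-interval search) are assumptions that may differ from sample_hdi.

```python
import math


def shortest_interval(sample, fraction):
    """Return (lower, upper) of the shortest interval holding `fraction` of the sample."""
    ordered = sorted(sample)
    n = len(ordered)
    # Number of points the interval must span (at least two to define a width)
    k = min(n, max(2, math.ceil(fraction * n)))
    # Slide a window of k consecutive sorted points and keep the narrowest
    start = min(range(n - k + 1), key=lambda i: ordered[i + k - 1] - ordered[i])
    return ordered[start], ordered[start + k - 1]


# Usage: half of the points are clustered between 0.0 and 0.4
data = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 9.0, 9.1, 20.0, 30.0]
bounds = shortest_interval(data, fraction=0.5)
```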