Density estimation and sample analysis

The inference.pdf module provides tools for analysing sample data, including density estimation and the calculation of highest-density intervals. Example code for GaussianKDE and UnimodalPdf can be found in the density estimation Jupyter notebook demo.

GaussianKDE

class inference.pdf.GaussianKDE(sample, bandwidth=None, cross_validation=False, max_cv_samples=5000)

Construct a GaussianKDE object, which can be called as a function to return the estimated PDF of the given sample.

GaussianKDE uses Gaussian kernel-density estimation to estimate the PDF associated with a given sample.

Parameters
  • sample – 1D array of samples from which to estimate the probability distribution

  • bandwidth (float) – Width of the Gaussian kernels used for the estimate. If not specified, an appropriate width is estimated from the sample data.

  • cross_validation (bool) – Whether to estimate the bandwidth using cross-validation instead of the simple ‘rule of thumb’ estimate which is used by default.

  • max_cv_samples (int) – The maximum number of samples to be used when estimating the bandwidth via cross-validation. The computational cost scales roughly quadratically with the number of samples used, and can become prohibitive for samples containing tens of thousands of points or more. If the sample size is greater than max_cv_samples, the cross-validation is therefore performed on a sub-sample of this size.
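The quadratic cost mentioned for max_cv_samples can be seen in a minimal, standard-library sketch of bandwidth selection by cross-validation. The source does not state which cross-validation criterion GaussianKDE uses, so the leave-one-out log-likelihood below is an assumption, and the names loo_log_likelihood and cv_bandwidth are hypothetical helpers rather than part of inference.pdf.

```python
import math
import random


def loo_log_likelihood(sample, bandwidth):
    """Leave-one-out log-likelihood of a Gaussian KDE (assumed criterion).

    Every point is scored against all the others, so the cost is O(n^2).
    """
    n = len(sample)
    norm = 1.0 / ((n - 1) * bandwidth * math.sqrt(2.0 * math.pi))
    total = 0.0
    for i, xi in enumerate(sample):
        kernel_sum = 0.0
        for j, xj in enumerate(sample):
            if i != j:
                z = (xi - xj) / bandwidth
                kernel_sum += math.exp(-0.5 * z * z)
        # Floor the density so an isolated point cannot produce log(0)
        total += math.log(max(kernel_sum * norm, 1e-300))
    return total


def cv_bandwidth(sample, candidates, max_cv_samples=5000, seed=0):
    """Pick the candidate bandwidth with the highest leave-one-out score."""
    sample = list(sample)
    if len(sample) > max_cv_samples:
        # Sub-sample to keep the quadratic cost manageable
        sample = random.Random(seed).sample(sample, max_cv_samples)
    return max(candidates, key=lambda h: loo_log_likelihood(sample, h))


rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(400)]
candidates = [0.05, 0.1, 0.2, 0.4, 0.8, 1.6]
best = cv_bandwidth(data, candidates, max_cv_samples=200)
print("selected bandwidth:", best)
```

Here the 400-point sample exceeds max_cv_samples=200, so each candidate bandwidth is scored on a random sub-sample of 200 points only.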

__call__(x_vals)

Evaluate the PDF estimate at a set of given axis positions.

Parameters

x_vals – axis location(s) at which to evaluate the estimate.

Returns

values of the PDF estimate at the specified locations.
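As a rough illustration of what evaluating a Gaussian kernel-density estimate involves, the sketch below averages one Gaussian kernel per sample point at each requested axis position. It is not the library's implementation: the source does not say which ‘rule of thumb’ is used, so Silverman's rule is assumed here, and silverman_bandwidth and kde_evaluate are hypothetical names.

```python
import math
import random
import statistics


def silverman_bandwidth(sample):
    """Silverman's rule-of-thumb bandwidth (an assumed choice of rule)."""
    n = len(sample)
    std = statistics.stdev(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4)
    iqr_scale = (q3 - q1) / 1.34
    # Fall back to the standard deviation if the inter-quartile range is zero
    spread = min(std, iqr_scale) if iqr_scale > 0.0 else std
    return 0.9 * spread * n ** (-0.2)


def kde_evaluate(sample, x_vals, bandwidth=None):
    """Evaluate a Gaussian KDE of `sample` at each position in `x_vals`."""
    if bandwidth is None:
        bandwidth = silverman_bandwidth(sample)
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
        for x in x_vals
    ]


rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
h = silverman_bandwidth(sample)

# Evaluate on a regular grid and check that the estimate integrates to ~1
dx = 0.05
x_vals = [-6.0 + dx * i for i in range(241)]
pdf = kde_evaluate(sample, x_vals)
area = sum(pdf) * dx
print("bandwidth:", h, "area under estimate:", area)
```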

interval(frac=0.95)

Calculate the highest-density interval(s) which contain a given fraction of total probability.

Parameters

frac (float) – Fraction of total probability contained by the desired interval(s).

Returns

A list of tuples which specify the intervals.
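The reason a list of intervals is returned, rather than a single pair of bounds, is that the highest-density region of a multi-modal PDF can consist of several disjoint pieces. The following sketch shows one way such a region might be found for a PDF tabulated on a regular grid: grid cells are added in order of decreasing density until the requested fraction is enclosed, then contiguous cells are merged. The function grid_hdi is hypothetical and is not claimed to match how GaussianKDE.interval works internally.

```python
import math


def grid_hdi(x, pdf, frac=0.95):
    """Highest-density interval(s) of a PDF tabulated on a regular grid."""
    dx = x[1] - x[0]
    total = sum(pdf) * dx
    # Visit grid cells from the highest density to the lowest
    order = sorted(range(len(x)), key=lambda i: pdf[i], reverse=True)
    inside = [False] * len(x)
    mass = 0.0
    for i in order:
        inside[i] = True
        mass += pdf[i] * dx / total
        if mass >= frac:
            break
    # Merge runs of adjacent selected cells into (lower, upper) tuples
    intervals = []
    start = None
    for i, flag in enumerate(inside):
        if flag and start is None:
            start = i
        if start is not None and (not flag or i == len(x) - 1):
            end = i if flag else i - 1
            intervals.append((x[start], x[end]))
            start = None
    return intervals


def gaussian(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


x = [-6.0 + 0.01 * i for i in range(1201)]

# Unimodal case: a single interval close to (-1.96, 1.96)
unimodal = grid_hdi(x, [gaussian(v, 0.0, 1.0) for v in x], frac=0.95)

# Bimodal case: two well-separated modes give two intervals
mixture = [0.5 * gaussian(v, -3.0, 0.5) + 0.5 * gaussian(v, 3.0, 0.5) for v in x]
bimodal = grid_hdi(x, mixture, frac=0.9)

print(unimodal)
print(bimodal)
```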

mode

The mode of the PDF, calculated automatically when an instance of GaussianKDE is created.

plot_summary(filename=None, show=True, label=None)

Plot the estimated PDF along with summary statistics.

Parameters
  • filename (str) – Filename to which the plot will be saved. If unspecified, the plot will not be saved.

  • show (bool) – Whether the plot should be displayed in a window (default is True).

  • label (str) – The label to be used for the x-axis of the plot.

UnimodalPdf

class inference.pdf.UnimodalPdf(sample)

Construct a UnimodalPdf object, which can be called as a function to return the estimated PDF of the given sample.

The UnimodalPdf class is designed to robustly estimate univariate, unimodal probability distributions given a sample drawn from that distribution. This is a parametric method based on a heavily modified Student-t distribution, which is extremely flexible.

Parameters

sample – 1D array of samples from which to estimate the probability distribution

__call__(x)

Evaluate the PDF estimate at a set of given axis positions.

Parameters

x – axis location(s) at which to evaluate the estimate.

Returns

values of the PDF estimate at the specified locations.

plot_summary(filename=None, show=True, label=None)

Plot the estimated PDF along with summary statistics.

Parameters
  • filename (str) – Filename to which the plot will be saved. If unspecified, the plot will not be saved.

  • show (bool) – Whether the plot should be displayed in a window (default is True).

  • label (str) – The label to be used for the x-axis of the plot.

sample_hdi

inference.pdf.sample_hdi(sample: ndarray, fraction: float, allow_double=False)

Estimate the highest-density interval(s) for a given sample.

This function computes the shortest possible interval which contains a chosen fraction of the elements in the given sample.

Parameters
  • sample – A sample for which the interval will be determined.

  • fraction (float) – The fraction of the total probability to be contained by the interval.

  • allow_double (bool) – When set to True, a double-interval is returned instead if one exists whose total length is meaningfully shorter than the optimal single interval.

Returns

Tuple(s) specifying the lower and upper bounds of the highest-density interval(s).
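The idea of the shortest interval containing a chosen fraction of the sample can be sketched with a sliding window over the sorted values. This is an illustrative stand-in covering only the single-interval case; it does not reproduce the allow_double behaviour, and shortest_interval is a hypothetical name rather than the implementation of sample_hdi.

```python
import math


def shortest_interval(sample, fraction):
    """Shortest interval containing at least `fraction` of the sample points."""
    s = sorted(sample)
    n = len(s)
    # Number of points the interval must contain (at least 2, at most n)
    k = min(n, max(2, math.ceil(fraction * n)))
    # Slide a window of k consecutive sorted points and keep the narrowest
    best = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[best], s[best + k - 1]


values = [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 2.0, 5.0, 9.0, 10.0]
lower, upper = shortest_interval(values, 0.5)
print(lower, upper)
```

With fraction=0.5 the window must hold five of the ten points, and the narrowest such window in this sample runs from 0.2 to 0.4.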