API Reference

class fastkde.fastKDE.Timer(n=None)[source]
fastkde.fastKDE.conditional(input_vars, conditioning_vars, **kwargs)[source]

Estimates the conditional PDF of input_vars given conditioning_vars

input_vars : A vector of input values, or a list of such vectors

conditioning_vars : A vector of conditioning values, or a list of such vectors

use_xarrayIf True, returns an xarray DataArray object; otherwise

returns a tuple of numpy arrays. If None, defaults to True if xarray is installed.

var_namesA list of names for the input variables, starting with

the conditioning variables. If None, the variables are named var0, var1, etc.

do_xarray_subsetIf True, returns a subset of the xarray DataArray limited

to the range of the input variables. If False, returns the full xarray DataArray. If None, defaults to True.

**kwargsAny additional keyword arguments get passed

directly to fastKDE.fastKDE() or fastKDE.estimateConditionals(); see the docstrings of fastKDE.fastKDE() and fastKDE.estimateConditionals() for details of kwargs.

Note the following two arguments have different default values here:

positive_shift=True by default, and
peak_frac = 0.01 by default.
Returns:

(cPDF, axes) where cPDF is the PDF(input_vars | conditioning_vars), and axes is a list of axis vectors giving the points at which cPDF is defined.

If N conditioning_vars were provided, then axes[0:N-1] corresponds to the variables provided in conditioning_vars, in the order they were provided; axes[N:M-1] corresponds to the M input_vars provided, in the order provided.

Ex:

```python

import pylab as PP
from numpy import *

# ***************************
#  Generate random samples
# ***************************
#  Stochastically sample from the function underlyingFunction() (a sigmoid):
#  sample the absicissa values from a gamma distribution
#  relate the ordinate values to the sample absicissa values and add
#  noise from a normal distribution

# Set the number of samples
numSamples = int(1e6)

# Define a sigmoid function
def underlyingFunction(x,x0=305,y0=200,yrange=4):
    return (yrange/2)*tanh(x-x0) + y0

xp1,xp2,xmid = 5,2,305  # Set gamma distribution parameters
yp1,yp2 = 0,12          # Set normal distribution parameters (mean and std)

# Generate random samples of X from the gamma distribution
x = -(random.gamma(xp1,xp2,int(numSamples))-xp1*xp2) + xmid
# Generate random samples of y from x and add normally distributed noise
y = underlyingFunction(x) + random.normal(loc=yp1,scale=yp2,size=numSamples)

# ***************************
#  Calculate the conditional
# ***************************
pOfYGivenX,axes = fastKDE.conditional(y,x)

```

class fastkde.fastKDE.fastKDE(data=None, axes=None, log_axes=False, num_points_per_sigma=10, num_points=None, do_approximate_ecf=True, ecf_precision=1, do_save_transformed_kernel=False, do_fft=True, do_save_marginals=True, be_verbose=False, frac_contiguous_hyper_volumes=1, num_contiguous_hyper_volumes=None, positive_shift=True, axis_expansion_factor=1.0)[source]

Estimates the density function of a given dataset using the self-consistent method of Bernacchia and Pigolotti (2011, J. R. Statistic Soc. B.). Prior to estimating the PDF, the data are standardized to have a mean of 0 and a variance of 1.

Standardization is done so that PDFs of varying widths can be calculated on a unified grid; the original PDF can be re-obtained by scaling, offsetting, and renormalizing the calculated PDF. Assuming the PDF is reasonably narrow, then most of the information in the PDF should be contained in the returned domain. The width of the domain is set in terms of multiples of unit standard deviations of the data; the default is 20-sigma.

input:

data (array_like)the data from which to estimate the PDF. Should be 1-

or 2-dimensional. If 2-dimensional, this flags calculation of an N-dimensional PDF. The first index should refer to each variable and the second index the observations of the variables.

axesthe axis-values of the estimated PDF. They must be evenly

spaced and they should have a length that is a power of two plus one (e.g., 33).

log_axesFlags whether axes should be log spaced (i.e., the

PDF is calculated based on log(data) and then transformed back to sample space). Should be a logical value (True or False) or a list of logical values with an item for each variable (i.e, len(log_axes) == shape(data)[0]) specifying which axes should use log spacing. If only True or False is given, that value is used for all variables.

num_points_per_sigmathe number of points on the data grid per

standard deviation; this influences the total size of the axes that are automatically calculated if no other aspects of the grid are specified.

num_pointsthe number of points to use for the pdf grid. If

provided as a scalar, each axis will have the same number of points. Otherwise, it should be an iterable with a value for each axis length. Axis lengths should be a power of two plus one (e.g., 33)

deltaXif given, this specifies the spacing between domain

values.

do_approximate_ecfflags whether to approximate the ECF

using a (much faster) FFT. In tests, this is accurate to ~1e-14 over low frequencies, but is inaccurate to ~1e-2 for the highest ~5% of frequencies.

ecf_precisionsets the precision of the approximate ECF. If set

to 2, it uses double precision accuracy; 1 otherwise

do_fftflags whether to calculate phiSC and its FFT to

obtain pdf

do_save_marginalsflags whether to calculate and save the marginal

distributions

frac_contiguous_hyper_volumesthe fraction of contiguous hypervolumes of

the ECF, that are above the ECF threshold, to use in the density estimate

num_contiguous_hyper_volumeslike frac_contiguous_hyper_volumes, but

specify an integer number to use. frac_contiguous_hyper_volumes will be ignored if this is provided as an argument.

positive_shifttranslate the PDF vertically such that the estimate

is positive or 0 everywhere

axis_expansion_factorsets the amount by which the KDE grid will be expanded

relative to the original min-max spread for each variable: 1.0 means a 100% (2x) expansion in the range. Such an expansion is necessary to avoid kernel power from one end of the grid leaking into the opposite end due to the perioidicity of the Fourier transform.

applyBernacchiaFilter(doFlushArrays=True)[source]

Given an ECF, calculate the self-consistent density in fourier-space by applying the BP11 filter.

estimateConditionals(variables, data, peak_frac=0.0, reApplyFilter=False)[source]

For a multidimensional PDF, estimates the conditional P(x_i | x_j).

input:

variablesA integer or tuple of array indicies indicating the

variables on which to condition e.g., For a 2D PDF,

obj.estimateConditionals(1) estimates P(x_0 | x_1) from the joint PDF P(x_0,x_1) that is the result of the self-consistent density estimate.

For a 3D PDF:

obj.estimateConditionals( (0,2) ) estimates P( x_0, x_2 | x_1) from P(x_0,x_1,x_2)

If all possible variables are listed, the copula is returned instead.

If negative values are provided, variables are wrapped (i.e., index -1 indicates the last variable)

dataThe data original used to create the fastKDE object.

This is needed to calculated the various marginals required in the conditional computation.

peak_fracThe fractional threshold below which to truncate the

marginal PDF (to avoid divding by small numbers); this is the fraction of the height of the mode.

reapplyFilter : Flags whether to reapply the ECF filter to the conditional

output:

Returns P( x_i | x_j )

findBadDistributionInds()[source]

Find indices of the optimal distribution that are below 0.0

findGoodDistributionInds()[source]

Find indices of the optimal distribution that are above 0.0

getCopula(data=None, peak_frac=0.0)[source]

Estimates the copula of the underlying PDF

getTransformedAxes()[source]

Returns a copy of the axes. This function exists for backward compatibility

getTransformedCopula(data=None)[source]

A wrapper for getCopula; this function is deprecated.

getTransformedPDF()[source]

Returns a copy of the PDF. This function exists for backward compatibility

reApplyFilter(pdf)[source]

Reapplies the filter to a PDF estimate.

This is used, e.g., to remove high-frequency noise that results from calculting the conditionals.

fastkde.fastKDE.next_highest_power_of_two(number)[source]

Returns the nearest power of two that is greater than or equal to number

fastkde.fastKDE.pdf(*args, **kwargs)[source]

Estimate the self-consistent kernel density estimate of the input data

input:

var0 : An input variable.

var1, var2…Additional input varibles whose length corresponds

to the length of var0. As input variables are added, the dimensionality of the resulting PDF increases (e.g., supplying var0 and var1 results in a 2D PDF).

use_xarrayIf True, returns an xarray DataArray object; otherwise

returns a tuple of numpy arrays. If None, defaults to True if xarray is installed.

var_namesA list of names for the input variables. If None, the

variables are named var0, var1, etc.

do_xarray_subsetIf True, returns a subset of the xarray DataArray limited

to the range of the input variables. If False, returns the full xarray DataArray. If None, defaults to True.

**kwargsAny additional keyword arguments get passed

directly to fastKDE.fastKDE(); see the docstring of fastKDE.fastKDE() for details of kwargs.

returns:

pdf,axesThe pdf and the axes of the PDF (i.e., this is

analogous to hist,bins for a histogram).

If there are multiple input variables, the axes variable is a list of the axes, with each axis corresponding to an input variable.

Warning

The computational expense and the memory requirement of this method grows exponentially with the number of input variables. It is likely that memory errors will occur if a moderately large number of variables are used (e.g., four or more). Try modifying num_points to reduce memory usage if needed.

Note

If use_xarray is True, the number of points in the returned xarray object coordinates will not match the num_points argument; set do_xarray_subset = False if this is not desired

fastkde.fastKDE.pdf_at_points(*args, **kwargs)[source]

Estimate the self-consistent kernel density estimate of the input data at a fixed set of points.

input:

var1 : An input variable.

var2, var3…Additional input varibles whose length

corresponds to the length of var1. As input variables are added, the dimensionality of the resulting PDF increases (e.g., supplying var1 and var2 results in a 2D PDF).

list_of_pointsPoints at which the PDF should be estimated.

Points should be provided as a list of tuples, where each tuple contains a point at which the PDF should be estimated. If not provided, the input data points will be used by default.

**kwargsAny additional keyword arguments get passed

directly to fastKDE.fastKDE(); see the docstring of fastKDE.fastKDE() for details of kwargs.

returns:

pdfThe pdf evaluated at list_of_points (or at the

input data points if list_of_points was not provided)

NOTE: The computational expense and the memory requirement of this method grows exponentially with the number of input variables.

NOTE: pdf_at_points() is potentially slow relative to pdf() becuase it does not take advantage of the inverse FFT for transforming from Fourier space to data space. However, if few input points are requested, it may actually be faster.

fastkde.plot.calculate_probability_contour(pdf, axes, pvals, axis=0)[source]

Calculates the probability level below which the PDF integrates to pval.

input:

pdf : a conditional PDF from fastKDE

axes : a list of axis values returned from fastKDE

pvalsthe integrated amount of PDF that should be outside a

contour of constant PDF (scalar or array-like)

output:

an array of PDF values for which the integral of the PDF outside the PDF value should integrate to pval.

Each value in the array corresponds to a value within axes[axis] (nominally, axis should be the variable on which the PDF is conditioned).

fastkde.plot.cumulative_integral(pdf, axes, integration_axes=None, reverse_axes=None)[source]

Calculates the cumulative integral of the pdf, given the axis values

input:

pdfa PDF

(presumably from the .pdf member of a fastKDE object)

axesa list of PDF axes

(presumably from the .axes member of a fastKDE object)

integration_axesthe axes along which to integrate

(default is all). These have the same ordering as axes in the axes input variable.

reverse_axes : axes along which to reverse the direction of the cumulative calculation, if any

output:

returns an array with the same shape as PDF, but with the integral calculated cumulatively

fastkde.plot.pair_plot(var_list, conditional=False, fig_size=None, var_names=None, draw_scatter=None, axis_limits=None, log_scale=False, auto_show=True, cdf_levels=[0.1, 0.25, 0.5, 0.75, 0.9], axis_lengths=None)[source]

Generate a multivariate pair plot of the input data.

input:

var_lista list of input variables (all must be vectors of the

same length) with at least two variables

conditional : flags whether to plot the bivariate PDFs as conditionals.

var_names : a list of strings corresponding to variable names

fig_size : The size of the figure (if none, it is set to 10,10)

draw_scatterflag whether to draw scatter plots on top of the

PDFs. If None, this turns on/off automatically, depending on the number of points (1000 is the upper cutoff)

axis_limits : limits for all variables (applies to all axes)

auto_showflags whether to automatically call PP.show() (no

values are returned if this flag is on)

cdf_levelsa set of CDF levels (betwen 0 and 1 exclusive) for

which to plot PDF contours (i.e., a value of 0.5 means that 50% of the PDF falls outside the corresponding contour of constant PDF that will be plotted). If None is given, matplotlib will choose PDF levels.

log_scaleflags whether to use the log_axes argument for

fastKDE. If a single bool value, it applies to all variables; if a list, each item in the list corresponds to a variable

axis_lengthsThe length of axis variables. If axis_lengths is

None, then fastKDE automatically sets axis lengths

output:

NOTE: no values are returned if auto_show is True

fig, axs, marginal_vals, marginal_pdfs, bivariate_pdfs

fig, axs : a matplotlib figure and axis matrix object

marginal_vals : the values at which the PDFs (both marginal and bivariate) are defined

marginal_pdfs : the marginal PDFs of the input variables

bivariate_pdfs : the bivariate PDFs

class fastkde.empiricalCharacteristicFunction.ECF(input_data, tgrids, precision=2, use_fft_approximation=True, be_verbose=False)[source]

Calculates the empirical characteristic function of arbitrary sets of variables.

Uses either the direct Fourier transform or nuFFT method (described by O’Brien et al. (2014, J. Roy. Stat. Soc. C) to calculate the Fourier transform of the data to yield the ECF.

input:

input_dataThe input data.

Array like with shape = (nvariables,npoints).

tgridsThe frequency-space grids to which to transform the

data

A list of frequency arrays for each variable dimension.

use_fft_approximationFlag whether to use the nuFFT approximation

to the DFT

be_verbose : Flags whether to be verbose

output:

An ECF object. The ECF itself is stored in self.ECF

class fastkde.convergence_tests.Timer[source]

A simple timer class

Example:

```python
myTimer = Timer()
with myTimer:
    doSomething()

print myTimer.duration
```
fastkde.convergence_tests.asciiToPoints(text, convertCharacter=' ')[source]

Given ascii art text, convert the whitespace into a list of point

fastkde.convergence_tests.conditionalPDF(y, x, x0=305, y0=200, yrange=4, xmid=305, xp1=5, xp2=2, yp1=0, yp2=None)[source]
fastkde.convergence_tests.estimatePDFWrapper(arg, **kwargs)[source]

A timed wrapper for doing a PDF estimate and evaluating its error

fastkde.convergence_tests.genCovMat(varianceList, correlationDict)[source]

Generates a covariance matrix, given a list of variances for all variables and a dict of correlation coefficients for all pairs of variables.

The dict’s keys must be tuples that describe the pair of points for the correlation coefficient.

For example, the following would be valid input for for a 3 variable distribution:

```python
varianceList = [0.1, 10.0, 100.0]
correlationDict = { (0,1) :  0.9,                             (0,2) :  0.0,                             (1,2) : -0.4 }

covMat = genCovMat(varianceList,correlationDict)
```
fastkde.convergence_tests.getModeCurve(conditional, axes)[source]

Extract the mode curve from a 2D conditional (assumes conditioning on axis 0)

fastkde.convergence_tests.jointXY(y, x, x0=305, y0=200, yrange=4, xmid=305, xp1=5, xp2=2, yp1=0, yp2=None)[source]
fastkde.convergence_tests.marginalX(y, x, x0=305, y0=200, yrange=4, xmid=305, xp1=5, xp2=2, yp1=0, yp2=None)[source]
fastkde.convergence_tests.powLaw(x, a, c)[source]
fastkde.convergence_tests.stochasticSample(x0=305, y0=200, yrange=4, numSamples=1000.0, xmid=305, xp1=5, xp2=2, yp1=0, yp2=None)[source]
class fastkde.convergence_tests.testConditional(**kwargs)[source]

A test for transition PDF convergence of the conditional

pdfStandard(axes)[source]
class fastkde.convergence_tests.testDistribution(**kwargs)[source]

A generic distribution–meant to be overridden–for testing the fast KDE method

doTesting(numSamplesMax=2097152, numRepetitions=30, scKWArgs={'frac_contiguous_hyper_volumes': 1, 'num_points': 257, 'positive_shift': False}, numProcs=1)[source]
generatePlots(saveType=None, show=True)[source]

Generate plots of the error rate and the timing

pdfStandard(axes)[source]
sampleFromDistribution(numSamples=2097152)[source]
class fastkde.convergence_tests.testMixtureModel(**kwargs)[source]
pdfStandard(axes)[source]

Returns the value of the multivariate normal distribution at location axes

sampleFromDistribution(numSamples=2097152)[source]

Samples from a multivariate normal distribution

class fastkde.convergence_tests.testNormal1D(**kwargs)[source]
pdfStandard(axes)[source]

Returns the value of the normal distribution at location axes

sampleFromDistribution(numSamples=2097152)[source]

Samples from a random normal distribution

class fastkde.convergence_tests.testNormal2D(**kwargs)[source]
pdfStandard(axes)[source]

Returns the value of the bivariate normal distribution at location axes

sampleFromDistribution(numSamples=2097152)[source]

Samples from a bivariate normal distribution

class fastkde.convergence_tests.testNormalND(**kwargs)[source]
pdfStandard(axes)[source]

Returns the value of the multivariate normal distribution at location axes

sampleFromDistribution(numSamples=2097152)[source]

Samples from a multivariate normal distribution

class fastkde.convergence_tests.transitionPDF(**kwargs)[source]

A test for transition PDF convergence

pdfStandard(axes)[source]
sampleFromDistribution(numSamples=2097152)[source]
fastkde.convergence_tests.underlyingFunction(x, x0=305, y0=200, yrange=4)[source]

A sigmoid with a transition centered on x0

Given an N-dimensional array, find contiguous areas of the array satisfiying a given condition and return a list of contiguous indices for each contiguous area.

input:

input_array(array-like) an array from which to search

contiguous areas

search_thresholdThe threshold for defining fill regions

(input_array > search_threshold)

wrap_dimensionsA list of dimensions in which searching

should have a wraparound condition

output:

An unordered list, where each item corresponds to a unique contiguous area for which input_array > search_threshold, and where the contents of each item are a list of array indicies that access the elements of the array for a given contiguous area.

fastkde.floodFillSearch.sort_by_distance_from_center(inds, var_shape)

Takes sets of indicies [e.g., from flood_fill_searchC.flood_fill_search()] and sorts them by distance from the center of the array from which the indices were taken.

input:

indsa list of tuples of numpy ndarrays (of type integer and

rank 1), where each tuple item contains a vector of indices for each index of an array. Each list item should conform to the output of the numpy where() function. It is assumed that each set of indices represents a contiguous portion of an array.

var_shape : the shape of the variable from which inds originate

returns:

A sorted version of inds, where the items are sorted by the distance of the contiguous area relative to the center of the array whose shape is var_shape. The first item is the closest to the center of the array.

fastkde.nufft.calc_t_from_x(xpoints)

Calculates frequency points given a signal in real space.

fastkde.nufft.calc_x_from_t(tpoints)

Calculates real space points given a set of hermetian frequency points.

fastkde.nufft.dft_points(frequency_grids, ordinates, outputPoints, missing_freq_val=-1e+20, be_verbose=False)

Calculates the unnormalized direct inverse Fourier transform of abscissa, ordinate pairs

input:

frequency_gridsabscissa values.

A numpy array of shape (ndimensions,npoints) (assumed to be real)

ordinatesordinates values.

A numpy array of shape (npoints)

outputPointsThe real-space points at which to calcualte the DFT

A masked numpy array of shape (ndimensions,noutputpoints).

be_verboseFlags whether to print to STDOUT as the method progresses

(int 0=don’t print 1=print)

output:

The iDFT of abscissa/ordinate pairs. Calculated in one dimension as:

iDFT[x] = sum( [ a[i] * exp(-j * x * t[i] ) for i in range(N) ] )

for each of the x points in output_points (and where j = sqrt(-1)).

fastkde.nufft.idft(abscissas, ordinates, frequency_grids, missing_freq_val=-1e+20, be_verbose=False)

Calculates the unnormalized direct Fourier transform of abscissa, ordinate pairs

input:

abscissasabscissa values.

A numpy array of shape (ndimensions,npoints) (assumed to be real)

ordinatesordinates values.

A numpy array of shape (npoints)

frequency_gridsThe frequency grids on which to calcualte the DFT

A masked numpy array of shape (ndimensions,ntmax), where ntmax is the length of the longest frequency grid.

missing_freq_valA value indicating missing frequency values.

This is used to allow each dimension to have different sized frequency spaces; dimensions with smaller frequency spaces (than ntmax) should be padded at the end with missing_freq_val.

be_verboseFlags whether to print to STDOUT as the method progresses

(int 0=don’t print 1=print)

output:

The DFT of abscissa/ordinate pairs. Calculated in one dimension as:

DFT[t] = sum( [ a[i] * exp(j * x[i] * t ) for i in range(N) ] )

for each of the t points in frequency_grids (and where j = sqrt(-1)).

fastkde.nufft.nuifft(abscissas, ordinates, frequency_grids, missing_freq_val=-1e+20, precision=2, be_verbose=0)

Approximates the unnormalized direct Fourier transform of abscissa, ordinate pairs using the non-uniform FFT method.

abscissasabscissa values.

A numpy array of shape (ndimensions,npoints) (assumed to be real)

ordinatesordinates values.

A numpy array of shape (npoints)

frequency_gridsThe frequency grids on which to calcualte the DFT

A masked numpy array of shape (ndimensions,ntmax), where ntmax is the length of the longest frequency grid.

missing_freq_valA value indicating missing frequency values. This is used to

allow each dimension to have different sized frequency spaces; dimensions with smaller frequency spaces (than ntmax) should be padded at the end with missing_freq_val.

precisionSets the precision of the approximation. 1 for floating point

and 2 for double precision accuracy.

be_verboseFlags whether to print to STDOUT as the method progresses

(int 0=don’t print 1=print)

The DFT of abscissa/ordinate pairs. Calculated in one dimension as:

DFT[t] = sum( [ a[i] * exp(j * x[i] * t ) for i in range(N) ] )

for each of the t points in frequency_grids (and where j = sqrt(-1)).

fastkde.nufft.vprint(msg, be_verbose)

Prints only if be_verbose is True