causalnex.estimator.EMSingleLatentVariable

class causalnex.estimator.EMSingleLatentVariable(sm, data, lv_name, node_states, initial_params='random', seed=22, box_constraints=None, priors=None, non_missing_data_factor=1, n_jobs=1)[source]

Bases: object

This class uses Expectation-Maximization (EM) to learn the parameters of a single latent variable in a Bayesian network. The user may also constrain the optimisation; constraints help the algorithm find a local optimum closer to the point we believe the solution lies.

The setting is:

Input:

  • a StructureModel representing the whole network or any sub-graph containing the Markov Blanket of the LV

  • data as a dataframe. The LV must be in the dataframe, with missing values represented by `np.nan`s

  • constraints:
    • Box - a hard constraint; forbids the solution from lying outside certain boundaries

    • Priors - establishes Dirichlet priors for every parameter

Run:
  • using the method run, or by manually alternating E and M steps

Result:
  • CPTs involving the latent variable, learnt by EM, found in the attribute cpds

  • CPTs not involving the LV are not learned; they must be learned separately by MLE (this is faster and the result is the same)
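A minimal sketch of preparing the inputs described above. The variable names "a", "b" and "z" are invented for illustration; the key point is that the latent variable must appear as a dataframe column filled with np.nan wherever its value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical observed variables "a" and "b" plus a fully-missing
# latent variable "z": the LV column must exist and hold np.nan
# wherever its value is unknown.
data = pd.DataFrame({
    "a": [0, 1, 1, 0],
    "b": [1, 1, 0, 0],
    "z": [np.nan] * 4,  # fully latent: every value missing
})

# states of every variable in the network
node_states = {"a": [0, 1], "b": [0, 1], "z": [0, 1]}
```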

Example:

em = EMSingleLatentVariable(sm=sm, data=data, lv_name=lv_name, node_states=node_states)
em.run()  # run EM until convergence

# or run E and M steps separately
for i in range(10):  # run EM 10 times
    em.e_step()
    em.m_step()

Attributes

Methods


EMSingleLatentVariable.__init__(sm, data, …)

type sm

StructureModel


EMSingleLatentVariable._build_lookup(node, …)

Build lookup table based on an individual data record/instance

EMSingleLatentVariable._check_box_constraints(…)

Checks if the box constraints are passed in the right format and if they are valid

EMSingleLatentVariable._check_initial_params_dict()

Checks initial parameter dictionary

EMSingleLatentVariable._check_priors(priors)

Checks if the priors are passed in the right format and if they are valid

EMSingleLatentVariable._get_markov_blanket_data(df)

Keeps only features that belong to the latent variable’s Markov blanket, groups and counts identical rows, and multiplies non-missing data counts by a factor

EMSingleLatentVariable._initialise_network_cpds()

Initialise all the CPDs according to the choice made in the constructor.

EMSingleLatentVariable._initialise_node_cpd(…)

Initialise the CPD of a specified node

EMSingleLatentVariable._initialize_sufficient_stats(node)

Likelihood of node and parents, initialized with zeros (or prior values) and then increased from data.

EMSingleLatentVariable._normalise(df)

Normalises dataframe

EMSingleLatentVariable._stopping_criteria()

Maximum change, in absolute values, between parameters of last EM iteration and params of current EM iteration

EMSingleLatentVariable._update_sufficient_stats(lookup)

Update expected sufficient statistics based on a given dataframe

EMSingleLatentVariable.apply_box_constraints()

If CPDs fall outside the given box constraints, bring them back inside the constraints.

EMSingleLatentVariable.compute_total_likelihood()

Computes the log-likelihood of the whole dataset (or the MAP score, if priors are given) for the current parameter settings

EMSingleLatentVariable.e_step()

Performs the Expectation step.

EMSingleLatentVariable.get_default_box(sm, …)

Get boxes with min = 0 and max = 1 for all parameters.

EMSingleLatentVariable.get_default_priors(sm, …)

The default Dirichlet priors (zero values)

EMSingleLatentVariable.m_step()

Maximization step.

EMSingleLatentVariable.run(n_runs[, …])

Runs E and M steps until convergence (stopping_delta) or max iterations is reached (n_runs)

__init__(sm, data, lv_name, node_states, initial_params='random', seed=22, box_constraints=None, priors=None, non_missing_data_factor=1, n_jobs=1)[source]
Parameters
  • sm (StructureModel) – structure. The only requirement is that it must contain all edges in the Markov Blanket of the latent variable. Note: all variable names must be non-empty strings

  • data (DataFrame) – dataframe; must contain all variables in the Markov Blanket of the latent variable. Include one column with the latent variable's name, filled with np.nan where information about the LV is missing. If some data about the LV is present, create complete columns.

  • lv_name (str) – name of latent variable

  • node_states (Dict[str, list]) – dictionary mapping variable name and list of states

  • initial_params (Union[str, Dict[str, DataFrame]]) – way to initialise parameters. Can be:
    - "random": random values (default)
    - a dictionary of dataframes, which will be used directly as the initialisation

  • seed (int) – seed for the random generator (used if parameters are initialised randomly)

  • box_constraints (Optional[Dict[str, Tuple[DataFrame, DataFrame]]]) – minimum and maximum values for each model parameter, specified as a dictionary mapping each node to two dataframes, in order: Min(P(Node|Par(Node))) and Max(P(Node|Par(Node)))

  • priors (Optional[Dict[str, DataFrame]]) – priors, provided as a mapping Node -> dataframe with Dirichlet priors for P(Node|Par(Node))

  • non_missing_data_factor (int) – a weight applied to the non-missing data samples; the effect is as if more data had been provided. Empirically, setting the factor to 10 helps when the non-missing data is ~1% of the dataset

  • n_jobs (int) – If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
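The n_jobs parameter follows the common joblib-style convention. A minimal sketch of how such a value resolves to a worker count (effective_n_jobs is an illustrative helper, not part of the causalnex API):

```python
import os

def effective_n_jobs(n_jobs: int) -> int:
    """Resolve a joblib-style n_jobs value to an actual worker count
    (illustrative helper, not part of the causalnex API)."""
    n_cpus = os.cpu_count() or 1
    if n_jobs == 1:
        return 1                     # no parallel computing at all
    if n_jobs == -1:
        return n_cpus                # use all CPUs
    if n_jobs < -1:
        return n_cpus + 1 + n_jobs   # e.g. -2 -> all CPUs but one
    return n_jobs
```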

apply_box_constraints()[source]

If CPDs fall outside the given box constraints, bring them back inside the constraints.
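A hedged sketch of what such a projection step might look like (not the library's internals): clip each parameter into its [min, max] box, then renormalise each column so every conditional distribution still sums to 1.

```python
import pandas as pd

def clip_to_box(cpd, box_min, box_max):
    """Clip CPD entries into [box_min, box_max] element-wise, then
    renormalise columns so each conditional distribution sums to 1.
    (Illustrative sketch, not the causalnex implementation.)"""
    clipped = cpd.clip(lower=box_min, upper=box_max)
    return clipped / clipped.sum(axis=0)

# toy CPD P(x | u) with one parent configuration per column
cpd = pd.DataFrame({"u=0": [0.02, 0.98]}, index=["x=0", "x=1"])
box_min = pd.DataFrame({"u=0": [0.1, 0.1]}, index=["x=0", "x=1"])
box_max = pd.DataFrame({"u=0": [0.9, 0.9]}, index=["x=0", "x=1"])
constrained = clip_to_box(cpd, box_min, box_max)
```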

compute_total_likelihood()[source]

Computes the log-likelihood of the whole dataset (or the MAP score, if priors are given) for the current parameter settings

Return type

float

Returns

Total likelihood over dataset

e_step()[source]

Performs the Expectation step. This boils down to computing the expected sufficient statistics M[X, U] for every “valid” node X, where U = Par(X)

Return type

Dict[str, DataFrame]

Returns

The expected sufficient statistics of each node X

static get_default_box(sm, node_states, lv_name)[source]

Get boxes with min = 0 and max = 1 for all parameters.

Parameters
  • sm (StructureModel) – model structure

  • node_states (Dict[str, list]) – node states

  • lv_name (str) – name of latent variable

Returns

Dictionary with a tuple of two elements, the first being the lower value constraint and the second the maximum value constraint

Return type

Dict[str, Tuple[DataFrame, DataFrame]]

static get_default_priors(sm, node_states, lv_name)[source]

The default Dirichlet priors (zero values)

Parameters
  • sm (StructureModel) – model structure

  • node_states (Dict[str, list]) – node states

  • lv_name (str) – name of latent variable

Return type

Dict[str, DataFrame]

Returns

Dictionary with pd dataframes initialized with zeros

m_step()[source]

Maximization step. It boils down to normalising the likelihood table previously created:

$$ \theta_{X \mid U} = \frac{M[X, U]}{M[U]} = \frac{M[X, U]}{\sum_X M[X, U]} $$

Return type

Dict[str, DataFrame]

Returns

New updated CPDs
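The normalisation in the formula above amounts to dividing each column of the sufficient-statistics table by its column sum. A toy sketch (values invented for illustration):

```python
import pandas as pd

# expected sufficient statistics M[X, U]: rows are states of X,
# columns are parent configurations U (toy values)
m_xu = pd.DataFrame({"u=0": [3.0, 1.0], "u=1": [2.0, 2.0]},
                    index=["x=0", "x=1"])

# theta[X | U] = M[X, U] / sum_X M[X, U]: normalise each column
theta = m_xu / m_xu.sum(axis=0)
```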

run(n_runs, stopping_delta=0.0, verbose=0)[source]

Runs E and M steps until convergence (stopping_delta) or max iterations is reached (n_runs)

Parameters
  • n_runs (int) – max number of EM alternations

  • stopping_delta (float) – if the maximum difference between the CPDs of the current and last iteration is below stopping_delta, convergence is reached

  • verbose (int) – amount of printing
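The alternation run() performs can be sketched as follows. This is a simplified reimplementation for illustration only, assuming an object exposing e_step/m_step methods and a cpds dict of dataframes; the real method also handles verbosity.

```python
def run_em(em, n_runs, stopping_delta=0.0):
    """Alternate E and M steps until the maximum absolute change in
    any CPD entry drops below stopping_delta, or n_runs is reached.
    (Illustrative sketch, not the causalnex implementation.)"""
    for _ in range(n_runs):
        previous = {node: cpd.copy() for node, cpd in em.cpds.items()}
        em.e_step()
        em.m_step()
        # maximum change, in absolute value, across all parameters
        delta = max(
            (em.cpds[node] - previous[node]).abs().values.max()
            for node in em.cpds
        )
        if delta < stopping_delta:
            break
```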