causalnex.estimator.EMSingleLatentVariable

class causalnex.estimator.EMSingleLatentVariable(sm, data, lv_name, node_states, initial_params='random', seed=22, box_constraints=None, priors=None, non_missing_data_factor=1, n_jobs=1)[source]

Bases: object

This class uses Expectation-Maximization (EM) to learn the parameters of a single latent variable in a Bayesian network. The user may also constrain the optimisation; constraints help the algorithm find a local optimum closer to the point we believe the solution lies.

The setting is:

Input:

  • a StructureModel representing the whole network or any sub-graph containing the Markov Blanket of the LV

  • data as a dataframe. The LV must be in the dataframe, with missing values represented by `np.nan`s

  • constraints:
    • Box - a hard constraint; forbids the solution from lying outside certain boundaries

    • Priors - establishes Dirichlet priors for every parameter

Run:
  • using the method run, or by manually alternating E and M steps

Result:
  • CPTs involving the latent variable, learnt by EM, found in the attribute cpds

  • CPTs not involving the LV are not learned; they must be learned separately by MLE (this is faster and the result is the same)
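A minimal sketch of preparing the inputs described above. The variable names "a", "b" and "z" are invented for illustration; the key point is that the latent variable must appear as a dataframe column filled with np.nan wherever its value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical observed variables "a" and "b" plus a fully-missing
# latent variable "z": the LV column must exist and hold np.nan
# wherever its value is unknown.
data = pd.DataFrame({
    "a": [0, 1, 1, 0],
    "b": [1, 1, 0, 0],
    "z": [np.nan] * 4,  # fully latent: every value missing
})

# states of every variable in the network
node_states = {"a": [0, 1], "b": [0, 1], "z": [0, 1]}
```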

Example:

em = EMSingleLatentVariable(sm=sm, data=data, lv_name=lv_name, node_states=node_states)
em.run()  # run EM until convergence

# or run E and M steps separately
for i in range(10):  # run EM 10 times
    em.e_step()
    em.m_step()

Attributes

Methods


EMSingleLatentVariable.__init__(sm, data, …)

type sm

StructureModel


EMSingleLatentVariable._build_lookup(node, …)

Build lookup table based on an individual data record/instance

EMSingleLatentVariable._check_box_constraints(…)

Checks if the box constraints are passed in the right format and if they are valid

EMSingleLatentVariable._check_initial_params_dict()

Checks initial parameter dictionary

EMSingleLatentVariable._check_priors(priors)

Checks if the priors are passed in the right format and if they are valid

EMSingleLatentVariable._get_markov_blanket_data(df)

Keeps only features that belong to the latent variable’s Markov blanket, groups and counts identical rows, and multiplies non-missing data counts by a factor

EMSingleLatentVariable._initialise_network_cpds()

Initialise all the CPDs according to the choice made in the constructor.

EMSingleLatentVariable._initialise_node_cpd(…)

Initialise the CPD of a specified node

EMSingleLatentVariable._initialize_sufficient_stats(node)

Likelihood of node and parents, initialized with zeros (or prior values) and then increased from data.

EMSingleLatentVariable._normalise(df)

Normalises dataframe

EMSingleLatentVariable._stopping_criteria()

Maximum change, in absolute values, between parameters of last EM iteration and params of current EM iteration

EMSingleLatentVariable._update_sufficient_stats(lookup)

Update expected sufficient statistics based on a given dataframe

EMSingleLatentVariable.apply_box_constraints()

If CPDs fall outside the given box constraints, bring them back inside the constraints.

EMSingleLatentVariable.compute_total_likelihood()

Computes the log-likelihood of the whole dataset (or the MAP score, if priors are given) for the current parameter settings

EMSingleLatentVariable.e_step()

Performs the Expectation step.

EMSingleLatentVariable.get_default_box(sm, …)

Get boxes with min = 0 and max = 1 for all parameters.

EMSingleLatentVariable.get_default_priors(sm, …)

The default Dirichlet priors (zero values)

EMSingleLatentVariable.m_step()

Maximization step.

EMSingleLatentVariable.run(n_runs[, …])

Runs E and M steps until convergence (stopping_delta) or max iterations is reached (n_runs)

__init__(sm, data, lv_name, node_states, initial_params='random', seed=22, box_constraints=None, priors=None, non_missing_data_factor=1, n_jobs=1)[source]
Parameters
  • sm (StructureModel) – structure. The only requirement is that it must contain all edges in the Markov Blanket of the latent variable. Note: all variable names must be non-empty strings

  • data (DataFrame) – dataframe; must contain all variables in the Markov Blanket of the latent variable. Include one column with the latent variable's name, filled with np.nan where information about the LV is missing. If some data about the LV is present, create complete columns.

  • lv_name (str) – name of latent variable

  • node_states (Dict[str, list]) – dictionary mapping variable name and list of states

  • initial_params (Union[str, Dict[str, DataFrame]]) – way to initialise parameters. Can be:
    - "random": random values (default)
    - a dictionary of dataframes, which will be used directly as the initialisation

  • seed (int) – seed for the random generator (used if parameters are initialised randomly)

  • box_constraints (Optional[Dict[str, Tuple[DataFrame, DataFrame]]]) – minimum and maximum values for each model parameter, specified as a dictionary mapping each node to two dataframes, in order: Min(P(Node|Par(Node))) and Max(P(Node|Par(Node)))

  • priors (Optional[Dict[str, DataFrame]]) – priors, provided as a mapping Node -> dataframe with Dirichlet priors for P(Node|Par(Node))

  • non_missing_data_factor (int) – a weight applied to the non-missing data samples; the effect is as if more data had been provided. Empirically, setting the factor to 10 helps when the non-missing data is ~1% of the dataset

  • n_jobs (int) – If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
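The n_jobs parameter follows the common joblib-style convention. A minimal sketch of how such a value resolves to a worker count (effective_n_jobs is an illustrative helper, not part of the causalnex API):

```python
import os

def effective_n_jobs(n_jobs: int) -> int:
    """Resolve a joblib-style n_jobs value to an actual worker count
    (illustrative helper, not part of the causalnex API)."""
    n_cpus = os.cpu_count() or 1
    if n_jobs == 1:
        return 1                     # no parallel computing at all
    if n_jobs == -1:
        return n_cpus                # use all CPUs
    if n_jobs < -1:
        return n_cpus + 1 + n_jobs   # e.g. -2 -> all CPUs but one
    return n_jobs
```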

apply_box_constraints()[source]

If CPDs fall outside the given box constraints, bring them back inside the constraints.
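A hedged sketch of what such a projection step might look like (not the library's internals): clip each parameter into its [min, max] box, then renormalise each column so every conditional distribution still sums to 1.

```python
import pandas as pd

def clip_to_box(cpd, box_min, box_max):
    """Clip CPD entries into [box_min, box_max] element-wise, then
    renormalise columns so each conditional distribution sums to 1.
    (Illustrative sketch, not the causalnex implementation.)"""
    clipped = cpd.clip(lower=box_min, upper=box_max)
    return clipped / clipped.sum(axis=0)

# toy CPD P(x | u) with one parent configuration per column
cpd = pd.DataFrame({"u=0": [0.02, 0.98]}, index=["x=0", "x=1"])
box_min = pd.DataFrame({"u=0": [0.1, 0.1]}, index=["x=0", "x=1"])
box_max = pd.DataFrame({"u=0": [0.9, 0.9]}, index=["x=0", "x=1"])
constrained = clip_to_box(cpd, box_min, box_max)
```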

compute_total_likelihood()[source]

Computes the log-likelihood of the whole dataset (or the MAP score, if priors are given) for the current parameter settings

Return type

float

Returns

Total likelihood over dataset

e_step()[source]

Performs the Expectation step. This boils down to computing the expected sufficient statistics M[X, U] for every “valid” node X, where U = Par(X)

Return type

Dict[str, DataFrame]

Returns

The expected sufficient statistics of each node X

static get_default_box(sm, node_states, lv_name)[source]

Get boxes with min = 0 and max = 1 for all parameters.

Parameters
  • sm (StructureModel) – model structure

  • node_states (Dict[str, list]) – node states

  • lv_name (str) – name of latent variable

Returns

Dictionary with a tuple of two elements, the first being the lower value constraint and the second the maximum value constraint

Return type

Dict[str, Tuple[DataFrame, DataFrame]]

static get_default_priors(sm, node_states, lv_name)[source]

The default Dirichlet priors (zero values)

Parameters
  • sm (StructureModel) – model structure

  • node_states (Dict[str, list]) – node states

  • lv_name (str) – name of latent variable

Return type

Dict[str, DataFrame]

Returns

Dictionary with pd dataframes initialized with zeros

m_step()[source]

Maximization step. It boils down to normalising the likelihood table previously created:

$$ \theta_{X \mid U} = \frac{M[X, U]}{M[U]} = \frac{M[X, U]}{\sum_X M[X, U]} $$

Return type

Dict[str, DataFrame]

Returns

New updated CPDs
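The normalisation in the formula above amounts to dividing each column of the sufficient-statistics table by its column sum. A toy sketch (values invented for illustration):

```python
import pandas as pd

# expected sufficient statistics M[X, U]: rows are states of X,
# columns are parent configurations U (toy values)
m_xu = pd.DataFrame({"u=0": [3.0, 1.0], "u=1": [2.0, 2.0]},
                    index=["x=0", "x=1"])

# theta[X | U] = M[X, U] / sum_X M[X, U]: normalise each column
theta = m_xu / m_xu.sum(axis=0)
```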

run(n_runs, stopping_delta=0.0, verbose=0)[source]

Runs E and M steps until convergence (stopping_delta) or max iterations is reached (n_runs)

Parameters
  • n_runs (int) – max number of EM alternations

  • stopping_delta (float) – if the maximum difference between the CPDs of the current and last iteration is below stopping_delta, convergence is reached

  • verbose (int) – amount of printing
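The alternation run() performs can be sketched as follows. This is a simplified reimplementation for illustration only, assuming an object exposing e_step/m_step methods and a cpds dict of dataframes; the real method also handles verbosity.

```python
def run_em(em, n_runs, stopping_delta=0.0):
    """Alternate E and M steps until the maximum absolute change in
    any CPD entry drops below stopping_delta, or n_runs is reached.
    (Illustrative sketch, not the causalnex implementation.)"""
    for _ in range(n_runs):
        previous = {node: cpd.copy() for node, cpd in em.cpds.items()}
        em.e_step()
        em.m_step()
        # maximum change, in absolute value, across all parameters
        delta = max(
            (em.cpds[node] - previous[node]).abs().values.max()
            for node in em.cpds
        )
        if delta < stopping_delta:
            break
```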