causalnex.estimator.EMSingleLatentVariable¶

class causalnex.estimator.EMSingleLatentVariable(sm, data, lv_name, node_states, initial_params='random', seed=22, box_constraints=None, priors=None, non_missing_data_factor=1, n_jobs=1)[source]¶

Bases: object
This class uses Expectation-Maximization (EM) to learn the parameters of a single latent variable in a Bayesian network. The user may also constrain the optimisation: constraints help the algorithm find a local optimum closer to where we expect the solution to lie.
The setting is:

- Input:
    - a StructureModel representing the whole network or any sub-graph containing the Markov Blanket of the LV
    - data as a dataframe. The LV must be in the dataframe, with missing values represented by `np.nan`s
- Constraints:
    - Box: a hard constraint; forbids the solution from lying outside certain boundaries
    - Priors: establishes Dirichlet priors for every parameter
- Run:
    - using the method `run`, or by manually alternating E and M steps
- Result:
    - CPTs involving the latent variable, learnt by EM, found in the attribute `cpds`
    - CPTs not involving the LV are not learned (they must be learned separately by MLE; this is faster and the result is the same)
- Example:

    em = EMSingleLatentVariable(sm=sm, data=data, lv_name=lv_name, node_states=node_states)
    em.run()  # run EM until convergence
    # or run E and M steps separately
    for i in range(10):  # run EM 10 times
        em.e_step()
        em.m_step()
Attributes
Methods
EMSingleLatentVariable.__delattr__(name, /)  Implement delattr(self, name).
EMSingleLatentVariable.__dir__()  Default dir() implementation.
EMSingleLatentVariable.__eq__(value, /)  Return self==value.
EMSingleLatentVariable.__format__(format_spec, /)  Default object formatter.
EMSingleLatentVariable.__ge__(value, /)  Return self>=value.
EMSingleLatentVariable.__getattribute__(name, /)  Return getattr(self, name).
EMSingleLatentVariable.__gt__(value, /)  Return self>value.
EMSingleLatentVariable.__hash__()  Return hash(self).
EMSingleLatentVariable.__init__(sm, data, …)  Initialise the estimator; see the full parameter list below.
EMSingleLatentVariable.__init_subclass__  This method is called when a class is subclassed.
EMSingleLatentVariable.__le__(value, /)  Return self<=value.
EMSingleLatentVariable.__lt__(value, /)  Return self<value.
EMSingleLatentVariable.__ne__(value, /)  Return self!=value.
EMSingleLatentVariable.__new__(**kwargs)  Create and return a new object.
EMSingleLatentVariable.__reduce__()  Helper for pickle.
EMSingleLatentVariable.__reduce_ex__(protocol, /)  Helper for pickle.
EMSingleLatentVariable.__repr__()  Return repr(self).
EMSingleLatentVariable.__setattr__(name, …)  Implement setattr(self, name, value).
EMSingleLatentVariable.__sizeof__()  Size of object in memory, in bytes.
EMSingleLatentVariable.__str__()  Return str(self).
EMSingleLatentVariable.__subclasshook__  Abstract classes can override this to customize issubclass().
EMSingleLatentVariable._build_lookup(node, …)  Build a lookup table based on an individual data record/instance.
EMSingleLatentVariable._check_box_constraints(…)  Check that the box constraints are passed in the right format and are valid.
EMSingleLatentVariable._check_initial_params_dict()  Check the initial parameter dictionary.
EMSingleLatentVariable._check_priors(priors)  Check that the priors are passed in the right format and are valid.
EMSingleLatentVariable._get_markov_blanket_data(df)  Keep only features that belong to the latent variable's Markov blanket, group and count identical rows, and multiply non-missing data counts by a factor.
EMSingleLatentVariable._initialise_network_cpds()  Initialise all the CPDs according to the choice made in the constructor.
EMSingleLatentVariable._initialise_node_cpd(…)  Initialise the CPD of a specified node.
EMSingleLatentVariable._initialize_sufficient_stats(node)  Likelihood of node and parents, initialised with zeros (or prior values) and then increased from data.
EMSingleLatentVariable._normalise(df)  Normalise a dataframe.
EMSingleLatentVariable._stopping_criteria()  Maximum change, in absolute value, between the parameters of the last EM iteration and those of the current EM iteration.
EMSingleLatentVariable._update_sufficient_stats(lookup)  Update expected sufficient statistics based on a given dataframe.
EMSingleLatentVariable.apply_box_constraints()  If CPDs fall outside the box constraints, bring them back inside the constraints.
EMSingleLatentVariable.compute_total_likelihood()  Compute the log likelihood of the whole dataset (or MAP, if priors are given) for the current parameters.
EMSingleLatentVariable.e_step()  Perform the Expectation step.
EMSingleLatentVariable.get_default_box(sm, node_states, lv_name)  Get boxes with min = 0 and max = 1 for all parameters.
EMSingleLatentVariable.get_default_priors(sm, node_states, lv_name)  The default Dirichlet priors (zero values).
EMSingleLatentVariable.m_step()  Maximization step.
EMSingleLatentVariable.run(n_runs[, …])  Run E and M steps until convergence (stopping_delta) or the maximum number of iterations (n_runs) is reached.
__init__(sm, data, lv_name, node_states, initial_params='random', seed=22, box_constraints=None, priors=None, non_missing_data_factor=1, n_jobs=1)[source]¶

- Parameters
    - sm (StructureModel) – structure. The only requirement is that it must contain all edges in the Markov Blanket of the latent variable. Note: all variable names must be non-empty strings.
    - data (DataFrame) – dataframe; must contain all variables in the Markov Blanket of the latent variable. Include one column with the latent variable name, filled with np.nan where information about the LV is missing. If some data about the LV is present, create complete columns.
    - lv_name (str) – name of the latent variable.
    - node_states (Dict[str, list]) – dictionary mapping each variable name to its list of states.
    - initial_params (Union[str, Dict[str, DataFrame]]) – way to initialise parameters. Can be "random" (random values, the default), or a dictionary of dataframes to be used as the initialisation.
    - seed (int) – seed for the random generator (used if parameters are initialised randomly).
    - box_constraints (Optional[Dict[str, Tuple[DataFrame, DataFrame]]]) – minimum and maximum values for each model parameter. Specified as a dictionary mapping each node to two dataframes, in order: Min(P(Node|Par(Node))) and Max(P(Node|Par(Node))).
    - priors (Optional[Dict[str, DataFrame]]) – priors, provided as a mapping Node -> dataframe with Dirichlet priors for P(Node|Par(Node)).
    - non_missing_data_factor (int) – a weight applied to the non-missing data samples. The effect is as if the amount of data provided were bigger. Empirically, it helps to set the factor to 10 when the non-missing data is ~1% of the dataset.
    - n_jobs (int) – if -1, all CPUs are used. If 1, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) CPUs are used; thus for n_jobs = -2, all CPUs but one are used.
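A minimal sketch of how the `data` and `node_states` arguments might be prepared. The variable names `a`, `b`, `z` and their states are hypothetical, not part of the library:

```python
import numpy as np
import pandas as pd

# Hypothetical observed variables "a" and "b" plus a latent variable "z":
# the LV column must exist in the dataframe, with np.nan marking missing values.
data = pd.DataFrame({
    "a": ["yes", "no", "yes", "no"],
    "b": ["no", "no", "yes", "yes"],
    "z": [np.nan, np.nan, np.nan, np.nan],  # fully unobserved latent variable
})

# node_states maps every variable (including the LV) to its list of states.
node_states = {
    "a": ["yes", "no"],
    "b": ["yes", "no"],
    "z": ["low", "high"],
}
```

These two objects, together with a StructureModel covering the LV's Markov Blanket, are what the constructor expects.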
apply_box_constraints()[source]¶

If CPDs fall outside the box constraints, bring them back inside the constraints.
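A box constraint can be enforced by element-wise clipping of a CPD dataframe. A minimal pandas sketch, assuming the usual CPD layout (states of the node as rows, parent configurations as columns); the CPD and bound values are made up for illustration:

```python
import pandas as pd

# Hypothetical CPD P(z | a): states of z as rows, states of parent a as columns.
cpd = pd.DataFrame({"yes": [0.95, 0.05], "no": [0.40, 0.60]}, index=["low", "high"])

# Box constraint: element-wise minimum and maximum allowed probabilities.
box_min = pd.DataFrame({"yes": [0.10, 0.00], "no": [0.00, 0.00]}, index=["low", "high"])
box_max = pd.DataFrame({"yes": [0.90, 1.00], "no": [1.00, 1.00]}, index=["low", "high"])

# Bring any value that falls outside the box back inside it.
clipped = cpd.clip(lower=box_min, upper=box_max)
```

Note that plain clipping can leave a column summing to slightly less or more than 1; how the library reconciles that with normalisation is not specified on this page.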
compute_total_likelihood()[source]¶

Compute the log likelihood of the whole dataset (or MAP, if priors are given) for the current parameters.

- Return type
    float
- Returns
    Total likelihood over the dataset
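For fully observed data this quantity is the sum of the log-probabilities of each record. A minimal single-node sketch (the latent-variable version additionally marginalises over the LV's states, which is omitted here; the distribution and records are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical marginal distribution P(a) and observed records of "a".
p_a = pd.Series({"yes": 0.7, "no": 0.3})
records = pd.Series(["yes", "yes", "no"])

# Log likelihood of the dataset: sum of log P(a = observed value) over records.
log_lik = np.log(p_a.loc[records].to_numpy()).sum()
```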
e_step()[source]¶

Perform the Expectation step. This boils down to computing the expected sufficient statistics M[X, U] for every "valid" node X, where U = Par(X).

- Return type
    Dict[str, DataFrame]
- Returns
    The expected sufficient statistics of each node X
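When the LV is missing in a record, that record contributes fractional counts to M[X, U], weighted by the posterior probability of each LV state. A toy sketch of this accumulation, with a hypothetical two-state latent variable z whose parent is a (posterior values made up):

```python
import pandas as pd

# Posterior P(z | evidence) for one record where z is missing (hypothetical values).
posterior = {"low": 0.25, "high": 0.75}

# Expected sufficient statistics M[z, a]: rows are states of z,
# columns are states of its parent a; start from zeros (or prior values).
m = pd.DataFrame(0.0, index=["low", "high"], columns=["yes", "no"])

# The record observed a = "yes": add a fractional count for each state of z.
for state, weight in posterior.items():
    m.loc[state, "yes"] += weight
```

Summing these fractional counts over all records yields the M[X, U] tables that the M-step then normalises.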
static get_default_box(sm, node_states, lv_name)[source]¶

Get boxes with min = 0 and max = 1 for all parameters.

- Parameters
    - sm (StructureModel) – model structure
    - node_states (Dict[str, list]) – node states
    - lv_name (str) – name of the latent variable
- Return type
    Dict[str, Tuple[DataFrame, DataFrame]]
- Returns
    Dictionary mapping each node to a tuple of two dataframes, the first being the lower value constraint and the second the maximum value constraint
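The default box is uninformative: every parameter is allowed to range over [0, 1]. A sketch of what such a box might look like for one hypothetical node (names and shapes are illustrative, not the library's internals):

```python
import pandas as pd

# Hypothetical CPD shape for P(z | a): 2 states of z by 2 states of parent a.
states_z, states_a = ["low", "high"], ["yes", "no"]

box_min = pd.DataFrame(0.0, index=states_z, columns=states_a)  # lower bounds
box_max = pd.DataFrame(1.0, index=states_z, columns=states_a)  # upper bounds
default_box = {"z": (box_min, box_max)}
```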
static get_default_priors(sm, node_states, lv_name)[source]¶

The default Dirichlet priors (zero values).

- Parameters
    - sm (StructureModel) – model structure
    - node_states (Dict[str, list]) – node states
    - lv_name (str) – name of the latent variable
- Return type
    Dict[str, DataFrame]
- Returns
    Dictionary with pd dataframes initialised with zeros
m_step()[source]¶

Maximization step. It boils down to normalising the likelihood table previously created:

$$ \theta_{X \mid U} = \frac{M[X, U]}{M[U]} = \frac{M[X, U]}{\sum_{X} M[X, U]} $$

- Return type
    Dict[str, DataFrame]
- Returns
    New updated CPDs
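The update above divides each column of the sufficient-statistics table by its column sum. A minimal pandas sketch with made-up counts (rows are states of X, columns are parent configurations U):

```python
import pandas as pd

# Expected sufficient statistics M[z, a] from the E-step (hypothetical values).
m = pd.DataFrame({"yes": [1.0, 3.0], "no": [2.0, 2.0]}, index=["low", "high"])

# theta_{z|a} = M[z, a] / sum_z M[z, a]: normalise each column to sum to 1.
theta = m / m.sum(axis=0)
```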
run(n_runs, stopping_delta=0.0, verbose=0)[source]¶

Run E and M steps until convergence (stopping_delta) or until the maximum number of iterations (n_runs) is reached.

- Parameters
    - n_runs (int) – maximum number of EM alternations
    - stopping_delta (float) – convergence is reached if the maximum difference between the CPDs of the current and the last iteration is smaller than stopping_delta
    - verbose (int) – amount of printing
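The stopping rule compares consecutive CPD estimates. A sketch of that convergence check (the helper name and data are hypothetical, not the class's internals):

```python
import pandas as pd

def max_param_change(old_cpds, new_cpds):
    """Maximum absolute change between two dicts of CPD dataframes."""
    return max(
        (new_cpds[node] - old_cpds[node]).abs().to_numpy().max()
        for node in new_cpds
    )

# Hypothetical CPDs for a latent variable "z" from two consecutive iterations.
old = {"z": pd.DataFrame({"yes": [0.50, 0.50]}, index=["low", "high"])}
new = {"z": pd.DataFrame({"yes": [0.48, 0.52]}, index=["low", "high"])}

stopping_delta = 0.05
converged = max_param_change(old, new) < stopping_delta
```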