DataGenerator#

class mofaflex.tl.DataGenerator(n_features, n_samples=1000, likelihoods=None, n_fully_shared_factors=2, n_partially_shared_factors=15, n_private_factors=3, factor_size_params=None, factor_size_dist='Uniform', n_active_factors=1.0, n_response=0, nmf=None)#

Bases: object

Generator class for creating synthetic multi-view data with latent factors.

This class generates synthetic data with specified properties including shared and private latent factors, different likelihoods, and optional covariates and response variables.

n_features#

List of feature counts for each view.

n_samples#

Number of samples to generate.

n_views#

Number of views in the dataset.

n_fully_shared_factors#

Number of factors shared across all views.

n_partially_shared_factors#

Number of factors shared between some views.

n_private_factors#

Number of factors unique to individual views.

n_covariates#

Number of observed covariates.

likelihoods#

List of likelihood types for each view.

factor_size_params#

Parameters for factor size distribution.

factor_size_dist#

Type of distribution for factor sizes.

n_active_factors#

Number or fraction of active factors.

nmf#

List indicating which views should use non-negative matrix factorization.

Attributes Summary

missing_y

Generated data with missing values replaced with np.nan.

n_factors

Total number of factors.

noisy_w_mask

Gene set mask describing co-expressed genes, with added noise.

w

Generated weights.

w_mask

Gene set mask describing co-expressed genes.

y

Generated data.

z

Generated latent factors.

Methods Summary

generate([rng, all_combs, overwrite])

Generate synthetic data.

generate_missingness([rng, ...])

Mark observations as missing.

get_noisy_mask([rng, noise_fraction, ...])

Generate a noisy version of w_mask, the mask describing co-expressed genes.

normalize([with_std])

Normalize data with a Gaussian likelihood to zero mean and optionally unit variance.

permute_factors(new_factor_order)

Permute factors.

permute_features(new_feature_order)

Permute features.

to_mudata([noisy])

Export the generated data as a MuData object.

Attributes Documentation

missing_y#

Generated data with missing values replaced with np.nan.

n_factors#

Total number of factors.

noisy_w_mask#

Gene set mask describing co-expressed genes, with added noise.

w#

Generated weights.

w_mask#

Gene set mask describing co-expressed genes.

y#

Generated data.

z#

Generated latent factors.

Methods Documentation

generate(rng=Generator(PCG64) at 0x70BFAC9EFD80, all_combs=False, overwrite=False)#

Generate synthetic data.

Parameters:
  • rng (Generator (default: Generator(PCG64) at 0x70BFAC9EFD80)) – The random number generator.

  • all_combs (bool (default: False)) – Whether to generate all combinations of active factors and views. If True, the model will have 1 shared factor, n_views private factors, and 2**n_views - n_views - 2 partially shared factors.

  • overwrite (bool (default: False)) – Whether to overwrite already generated data.
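
The factor counts quoted for all_combs follow from assigning one factor to every non-empty subset of views. The sketch below checks that arithmetic in plain Python. It does not call mofaflex, and the helper name count_factor_types is hypothetical.

```python
from itertools import combinations


def count_factor_types(n_views):
    """Count factor types when every non-empty subset of views gets one factor."""
    views = range(n_views)
    shared = private = partial = 0
    for size in range(1, n_views + 1):
        for subset in combinations(views, size):
            if len(subset) == n_views:
                shared += 1   # active in every view
            elif len(subset) == 1:
                private += 1  # active in exactly one view
            else:
                partial += 1  # active in some, but not all, views
    return shared, private, partial


for n_views in (2, 3, 4):
    shared, private, partial = count_factor_types(n_views)
    # Matches 1 shared, n_views private, 2**n_views - n_views - 2 partially shared.
    assert (shared, private, partial) == (1, n_views, 2**n_views - n_views - 2)
    print(n_views, shared, private, partial)
```

The check is run for two or more views only, since with a single view the shared and private categories coincide.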

generate_missingness(rng=Generator(PCG64) at 0x70BFAC888AC0, n_partial_samples=0, n_partial_features=0, missing_fraction_partial_features=0.0, random_fraction=0.0)#

Mark observations as missing.

Parameters:
  • rng (Generator (default: Generator(PCG64) at 0x70BFAC888AC0)) – The random number generator.

  • n_partial_samples (int (default: 0)) – Number of samples marked as missing in at least one random view. If the model has only one view, this has no effect.

  • n_partial_features (int (default: 0)) – Number of features marked as missing in some samples.

  • missing_fraction_partial_features (float (default: 0.0)) – Fraction of samples marked as missing due to n_partial_features.

  • random_fraction (float (default: 0.0)) – Fraction of all observations marked as missing at random.
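
To illustrate what random_fraction describes, the sketch below marks a fraction of all entries of a small matrix as missing by setting them to NaN. It is a plain-Python illustration, not the mofaflex implementation: the helper mask_at_random, its seed argument, and the rounding rule are assumptions.

```python
import math
import random


def mask_at_random(data, random_fraction, seed=0):
    """Return a copy of data (a list of rows) with a fraction of entries set to NaN."""
    rng = random.Random(seed)
    n_rows, n_cols = len(data), len(data[0])
    # Rounding down is an assumption of this sketch.
    n_missing = int(random_fraction * n_rows * n_cols)
    # Pick distinct flat positions, then map each back to (row, column).
    positions = rng.sample(range(n_rows * n_cols), n_missing)
    masked = [row[:] for row in data]
    for pos in positions:
        masked[pos // n_cols][pos % n_cols] = math.nan
    return masked


data = [[float(i * 5 + j) for j in range(5)] for i in range(4)]
masked = mask_at_random(data, random_fraction=0.25)
n_nan = sum(math.isnan(v) for row in masked for v in row)
print(n_nan, "of", len(data) * len(data[0]), "entries are missing")
```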

get_noisy_mask(rng=Generator(PCG64) at 0x70C0361598C0, noise_fraction=0.1, informed_view_indices=None)#

Generate a noisy version of w_mask, the mask describing co-expressed genes.

Noisy in this context means that some annotations are wrong, i.e. some genes active in a particular factor are marked as inactive, and some genes inactive in a factor are marked as active.

Parameters:
  • rng (Generator (default: Generator(PCG64) at 0x70C0361598C0)) – The random number generator.

  • noise_fraction (float (default: 0.1)) – Fraction of active genes per factor that will be marked as inactive. The same number of inactive genes will be marked as active.

  • informed_view_indices (Iterable[int] | None (default: None)) – Indices of views that will be used to benchmark informed models. Noisy masks will be generated only for those views. For uninformed views, the noisy masks will be filled with False.

Return type:

list[ndarray[tuple[Any, ...], dtype[bool]]]

Returns:

A list with a noisy mask for each view.
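
The noise model described above can be sketched for a single factor as follows: a fraction of active genes is switched off and the same number of inactive genes is switched on, so the number of active entries is preserved. This is an illustration in plain Python, not the library's code. The helper noisy_mask, its seed argument, and the rounding and capping rules are assumptions.

```python
import random


def noisy_mask(mask, noise_fraction=0.1, seed=0):
    """Flip a fraction of True entries to False and as many False entries to True."""
    rng = random.Random(seed)
    active = [i for i, m in enumerate(mask) if m]
    inactive = [i for i, m in enumerate(mask) if not m]
    # Rounding down and capping at the number of inactive genes are assumptions.
    n_flip = min(int(noise_fraction * len(active)), len(inactive))
    noisy = list(mask)
    for i in rng.sample(active, n_flip):
        noisy[i] = False  # active gene wrongly marked inactive
    for i in rng.sample(inactive, n_flip):
        noisy[i] = True   # inactive gene wrongly marked active
    return noisy


mask = [True] * 20 + [False] * 80  # one factor, 100 genes, 20 of them active
noisy = noisy_mask(mask, noise_fraction=0.1)
n_changed = sum(a != b for a, b in zip(mask, noisy))
print(sum(noisy), "active genes,", n_changed, "annotations changed")
```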

normalize(with_std=False)#

Normalize data with a Gaussian likelihood to zero mean and optionally unit variance.

Parameters:

with_std (bool (default: False)) – If True, also normalize to unit variance. Otherwise, only shift to zero mean.
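
The effect of with_std can be sketched on a small matrix. The example assumes that normalization is applied per feature (column) and uses the population standard deviation. Both are assumptions, as is the helper name normalize_columns. Only views with a Gaussian likelihood would be treated this way.

```python
from statistics import fmean, pstdev


def normalize_columns(data, with_std=False):
    """Center each column (feature); optionally scale it to unit variance."""
    out_columns = []
    for col in zip(*data):
        mean = fmean(col)
        centered = [v - mean for v in col]
        if with_std:
            std = pstdev(col)  # population standard deviation (an assumption)
            if std > 0:
                centered = [v / std for v in centered]
        out_columns.append(centered)
    # Transpose back to a list of rows (samples).
    return [list(row) for row in zip(*out_columns)]


data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
centered = normalize_columns(data)                # zero mean only
scaled = normalize_columns(data, with_std=True)   # zero mean and unit variance
print(centered)
print(scaled)
```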

permute_factors(new_factor_order)#

Permute factors.

Parameters:

new_factor_order (Iterable[int]) – New ordering of factors.
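
Reordering factors only relabels them: if the factor axis of the latent factors and of the weights is permuted with the same index list, their product is unchanged. The sketch below demonstrates this with plain Python lists. The shapes (samples × factors for z, factors × features for w) and the linear model z·w are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


z = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]  # 2 samples x 3 factors
w = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0]]       # 3 factors x 2 features

new_factor_order = [2, 0, 1]

# Reorder the factor axis of both matrices with the same index list.
z_perm = [[row[k] for k in new_factor_order] for row in z]
w_perm = [w[k] for k in new_factor_order]

# The reconstructed data is identical before and after the permutation.
assert matmul(z, w) == matmul(z_perm, w_perm)
print(z_perm)
```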

permute_features(new_feature_order)#

Permute features.

Parameters:

new_feature_order (Sequence[Iterable[int]]) – New ordering of features.
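
The type Sequence[Iterable[int]] suggests one index list per view. Under that assumption, a per-view feature permutation can be sketched as below. The weight matrices are hypothetical and laid out as factors × features.

```python
# Hypothetical weights for two views; rows are factors, columns are features.
w_views = [
    [[0.1, 0.2, 0.3]],  # view 0: 1 factor x 3 features
    [[1.0, 2.0]],       # view 1: 1 factor x 2 features
]

# One index list per view, each as long as that view's feature count.
new_feature_order = [[2, 0, 1], [1, 0]]

# Reorder the feature axis of each view with its own index list.
permuted = [
    [[row[j] for j in order] for row in w]
    for w, order in zip(w_views, new_feature_order)
]
print(permuted)
```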

to_mudata(noisy=False)#

Export the generated data as a MuData object.

The AnnData objects generated for each view will have their weights in .varm["w"] and the gene set mask in .varm["w_mask"]. The latent factors will be in .obsm["z"] of the MuData object, the likelihoods in .uns["likelihoods"], and the number of active factors in .uns["n_active_factors"].

Parameters:

noisy (default: False) – If True, export the noisy gene set mask. Otherwise, export the noise-free mask.

Return type:

MuData