DataGenerator#
- class mofaflex.tl.DataGenerator(n_features, n_samples=1000, likelihoods=None, n_fully_shared_factors=2, n_partially_shared_factors=15, n_private_factors=3, factor_size_params=None, factor_size_dist='Uniform', n_active_factors=1.0, n_response=0, nmf=None)#
Bases: object
Generator class for creating synthetic multi-view data with latent factors.
This class generates synthetic data with specified properties including shared and private latent factors, different likelihoods, and optional covariates and response variables.
- n_features#
List of feature counts for each view.
- n_samples#
Number of samples to generate.
- n_views#
Number of views in the dataset.
- n_fully_shared_factors#
Number of factors shared across all views.
- n_partially_shared_factors#
Number of factors shared between some views.
- n_private_factors#
Number of factors unique to individual views.
- n_covariates#
Number of observed covariates.
- likelihoods#
List of likelihood types for each view.
- factor_size_params#
Parameters for factor size distribution.
- factor_size_dist#
Type of distribution for factor sizes.
- n_active_factors#
Number or fraction of active factors.
- nmf#
List indicating which views should use non-negative matrix factorization.
Attributes Summary
missing_y – Generated data with non-missing values replaced with np.nan.
n_factors – Total number of factors.
noisy_w_mask – Gene set mask describing co-expressed genes, with added noise.
w – Generated weights.
w_mask – Gene set mask describing co-expressed genes.
y – Generated data.
z – Generated latent factors.
Methods Summary
generate([rng, all_combs, overwrite]) – Generate synthetic data.
generate_missingness([rng, ...]) – Mark observations as missing.
get_noisy_mask([rng, noise_fraction, ...]) – Generate a noisy version of w_mask, the mask describing co-expressed genes.
normalize([with_std]) – Normalize data with a Gaussian likelihood to zero mean and optionally unit variance.
permute_factors(new_factor_order) – Permute factors.
permute_features(new_feature_order) – Permute features.
to_mudata([noisy]) – Export the generated data as a MuData object.
Attributes Documentation
- missing_y#
Generated data with non-missing values replaced with np.nan.
- n_factors#
Total number of factors.
- noisy_w_mask#
Gene set mask describing co-expressed genes, with added noise.
- w#
Generated weights.
- w_mask#
Gene set mask describing co-expressed genes.
- y#
Generated data.
- z#
Generated latent factors.
Methods Documentation
- generate(rng=Generator(PCG64) at 0x70BFAC9EFD80, all_combs=False, overwrite=False)#
Generate synthetic data.
- Parameters:
rng (Generator (default: Generator(PCG64) at 0x70BFAC9EFD80)) – The random number generator.
all_combs (bool (default: False)) – Whether to generate all combinations of active factors and views. If True, the model will have 1 shared factor, n_views private factors, and 2**n_views - n_views - 2 partially shared factors.
overwrite (bool (default: False)) – Whether to overwrite already generated data.
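The factor counts quoted for all_combs follow from giving every non-empty subset of views exactly one factor. The sketch below only checks that arithmetic with the standard library. It is not the library's implementation, count_factor_types is a hypothetical helper name, and it assumes at least two views.

```python
from itertools import combinations


def count_factor_types(n_views):
    """Count factor types when each non-empty subset of views gets one factor.

    Hypothetical helper for illustration; assumes n_views >= 2.
    """
    views = range(n_views)
    # Every non-empty subset of views is the set of views one factor is active in.
    subsets = [c for size in range(1, n_views + 1) for c in combinations(views, size)]
    n_shared = sum(len(s) == n_views for s in subsets)   # active in all views
    n_private = sum(len(s) == 1 for s in subsets)        # active in one view
    n_partial = len(subsets) - n_shared - n_private      # everything in between
    return n_shared, n_private, n_partial


n_views = 4
n_shared, n_private, n_partial = count_factor_types(n_views)
# Compare the enumerated count with the closed-form expression from the docs.
formula_matches = n_partial == 2**n_views - n_views - 2
```

There are 2**n_views - 1 non-empty subsets in total; removing the single full set and the n_views singletons leaves the partially shared count.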
- generate_missingness(rng=Generator(PCG64) at 0x70BFAC888AC0, n_partial_samples=0, n_partial_features=0, missing_fraction_partial_features=0.0, random_fraction=0.0)#
Mark observations as missing.
- Parameters:
rng (Generator (default: Generator(PCG64) at 0x70BFAC888AC0)) – The random number generator.
n_partial_samples (int (default: 0)) – Number of samples marked as missing in at least one random view. If the model has only one view, this has no effect.
n_partial_features (int (default: 0)) – Number of features marked as missing in some samples.
missing_fraction_partial_features (float (default: 0.0)) – Fraction of samples marked as missing due to n_partial_features.
random_fraction (float (default: 0.0)) – Fraction of all observations marked as missing at random.
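To make the random_fraction behaviour concrete, here is a minimal NumPy sketch that marks a given fraction of entries in a single data matrix as missing by setting them to np.nan. This is an illustration under assumptions, not the library's code: mask_at_random is a hypothetical helper, and rounding the fraction to a whole number of entries is a choice made here.

```python
import numpy as np


def mask_at_random(y, random_fraction, rng):
    """Return a copy of y with a fraction of its entries set to NaN.

    Hypothetical helper for illustration only.
    """
    y_missing = y.astype(float, copy=True)
    # Assumption: round the requested fraction to a whole number of entries.
    n_missing = int(round(random_fraction * y.size))
    # Draw distinct flat indices so exactly n_missing entries are masked.
    idx = rng.choice(y.size, size=n_missing, replace=False)
    y_missing.flat[idx] = np.nan
    return y_missing


rng = np.random.default_rng(0)
y = np.ones((20, 10))
y_missing = mask_at_random(y, random_fraction=0.25, rng=rng)
n_nan = int(np.isnan(y_missing).sum())
```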
- get_noisy_mask(rng=Generator(PCG64) at 0x70C0361598C0, noise_fraction=0.1, informed_view_indices=None)#
Generate a noisy version of w_mask, the mask describing co-expressed genes.
Noisy in this context means that some annotations are wrong, i.e. some genes active in a particular factor are marked as inactive, and some genes inactive in a factor are marked as active.
- Parameters:
rng (Generator (default: Generator(PCG64) at 0x70C0361598C0)) – The random number generator.
noise_fraction (float (default: 0.1)) – Fraction of active genes per factor that will be marked as inactive. The same number of inactive genes will be marked as active.
informed_view_indices (Iterable[int] | None (default: None)) – Indices of views that will be used to benchmark informed models. Noisy masks will be generated only for those views. For uninformed views, the noisy masks will be filled with False.
- Return type:
- Returns:
A list with a noisy mask for each view.
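The noise described above can be sketched for a single view as follows: per factor, a fraction of the active genes is switched off and the same number of inactive genes is switched on, so the number of active genes per factor stays constant. The helper name add_mask_noise, the (factors x features) boolean layout, and the rounding are assumptions made for this illustration, not details taken from the library.

```python
import numpy as np


def add_mask_noise(w_mask, noise_fraction, rng):
    """Swap a fraction of active entries per factor with inactive ones.

    Hypothetical helper; assumes a boolean array of shape (n_factors, n_features).
    """
    noisy = w_mask.copy()
    for k in range(w_mask.shape[0]):
        active = np.flatnonzero(w_mask[k])
        inactive = np.flatnonzero(~w_mask[k])
        # Cannot flip more entries than exist on the inactive side.
        n_flip = min(int(round(noise_fraction * active.size)), inactive.size)
        # False negatives: active genes annotated as inactive.
        noisy[k, rng.choice(active, size=n_flip, replace=False)] = False
        # False positives: the same number of inactive genes annotated as active.
        noisy[k, rng.choice(inactive, size=n_flip, replace=False)] = True
    return noisy


rng = np.random.default_rng(0)
w_mask = np.zeros((3, 40), dtype=bool)
w_mask[0, :20] = True
w_mask[1, 10:30] = True
w_mask[2, 20:] = True
noisy = add_mask_noise(w_mask, noise_fraction=0.1, rng=rng)
```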
- normalize(with_std=False)#
Normalize data with a Gaussian likelihood to zero mean and optionally unit variance.
- Parameters:
with_std (bool (default: False)) – If True, also normalize to unit variance. Otherwise, only shift to zero mean.
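As an illustration of what with_std toggles, the sketch below centers one matrix and optionally scales it to unit variance. Normalizing per feature (per column) and ignoring NaN entries are assumptions made here, and center_and_scale is a hypothetical name, not the method itself.

```python
import numpy as np


def center_and_scale(y, with_std=False):
    """Shift each column to zero mean and optionally scale to unit variance.

    Hypothetical helper; assumes per-feature statistics that ignore NaNs
    and no constant columns when with_std is True.
    """
    out = y - np.nanmean(y, axis=0)
    if with_std:
        out = out / np.nanstd(out, axis=0)
    return out


rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
centered = center_and_scale(y)                 # zero mean, original spread
standardized = center_and_scale(y, with_std=True)  # zero mean, unit variance
```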
- permute_factors(new_factor_order)#
Permute factors.
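One way to see why permuting factors is harmless: if the data are driven by a product of latent factors and weights, reordering the factors consistently in both leaves that product unchanged. The sketch assumes a linear factor model with z of shape (samples x factors) and w of shape (factors x features); those layouts are assumptions for illustration and are not stated in this reference.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(6, 3))  # samples x factors (assumed layout)
w = rng.normal(size=(3, 5))  # factors x features (assumed layout)

new_factor_order = [2, 0, 1]
# Apply the same ordering to the factor axis of both arrays.
z_perm = z[:, new_factor_order]
w_perm = w[new_factor_order, :]

# The reconstructed signal is identical before and after the permutation.
same_signal = np.allclose(z @ w, z_perm @ w_perm)
```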
- permute_features(new_feature_order)#
Permute features.
- to_mudata(noisy=False)#
Export the generated data as a MuData object.
The AnnData objects generated for each view will have their weights in .varm["w"] and the gene set mask in .varm["w_mask"]. The latent factors will be in .obsm["z"] of the MuData object, the likelihoods in .uns["likelihoods"] and the number of active factors in .uns["n_active_factors"].
- Parameters:
noisy (default: False) – Whether to export the noisy or noise-free gene set mask.
- Return type: