| Title: | Clustering-Informed Shared-Structure VAE for Imputation |
|---|---|
| Description: | Implements the Clustering-Informed Shared-Structure Variational Autoencoder ('CISS-VAE'), a deep learning framework for missing data imputation introduced in Khadem Charvadeh et al. (2025) <doi:10.1002/sim.70335>. The model accommodates all three types of missing data mechanisms: Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR). While it is particularly well-suited to MNAR scenarios, where missingness patterns carry informative signals, 'CISS-VAE' also functions effectively under MAR assumptions. |
| Authors: | Yasin Khadem Charvadeh [aut], Kenneth Seier [aut], Katherine S. Panageas [aut], Danielle Vaithilingam [aut, cre], Mithat Gönen [aut], Yuan Chen [aut] |
| Maintainer: | Danielle Vaithilingam <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.0.1 |
| Built: | 2026-05-14 15:30:36 UTC |
| Source: | https://github.com/ciss-vae/rciss-vae |
Performs hyperparameter optimization for CISS-VAE using Optuna with support for both tunable and fixed parameters.
autotune_cissvae(
  data,
  index_col = NULL,
  val_proportion = 0.1,
  replacement_value = 0,
  cols_ignore = NULL,
  imputable_matrix = NULL,
  binary_feature_mask = NULL,
  categorical_column_map = NULL,
  clusters,
  save_model_path = NULL,
  save_search_space_path = NULL,
  n_trials = 20,
  study_name = "vae_autotune",
  device_preference = "cuda",
  show_progress = FALSE,
  optuna_dashboard_db = NULL,
  load_if_exists = TRUE,
  seed = 42,
  verbose = FALSE,
  constant_layer_size = FALSE,
  evaluate_all_orders = FALSE,
  max_exhaustive_orders = 100,
  num_hidden_layers = c(1, 4),
  hidden_dims = c(64, 512),
  latent_dim = c(10, 100),
  latent_shared = c(TRUE, FALSE),
  output_shared = c(TRUE, FALSE),
  lr = c(1e-04, 0.001),
  decay_factor = c(0.9, 0.999),
  weight_decay = 0.001,
  beta = 0.01,
  num_epochs = 500,
  batch_size = 4000,
  num_shared_encode = c(0, 1, 3),
  num_shared_decode = c(0, 1, 3),
  encoder_shared_placement = c("at_end", "at_start", "alternating", "random"),
  decoder_shared_placement = c("at_start", "at_end", "alternating", "random"),
  refit_patience = 2,
  refit_loops = 100,
  epochs_per_loop = 500,
  reset_lr_refit = c(TRUE, FALSE),
  debug = FALSE,
  columns_ignore = NULL
)
data |
Data frame or matrix containing the input data |
index_col |
String name of index column to preserve (optional) |
val_proportion |
Proportion of non-missing data to hold out for validation. |
replacement_value |
Numeric value used to replace missing entries before model input. |
cols_ignore |
Character vector of column names to exclude from imputation scoring. |
imputable_matrix |
Logical matrix indicating entries allowed to be imputed. |
binary_feature_mask |
Logical vector marking which columns are binary. |
categorical_column_map |
Named list indicating what categorical column dummy variables belong to. |
clusters |
Integer vector specifying cluster assignments for each row. |
save_model_path |
Optional path to save the best model's state_dict |
save_search_space_path |
Optional path to save search space configuration |
n_trials |
Number of Optuna trials to run |
study_name |
Name identifier for the Optuna study |
device_preference |
Preferred device ("cuda", "mps", "cpu") |
show_progress |
Whether to display Rich progress bars during training |
optuna_dashboard_db |
RDB storage URL/file for Optuna dashboard |
load_if_exists |
Whether to load existing study from storage |
seed |
Base random seed for reproducible results |
verbose |
Whether to print detailed diagnostic information |
constant_layer_size |
Whether all hidden layers use same dimension |
evaluate_all_orders |
Whether to test all possible layer arrangements |
max_exhaustive_orders |
Max arrangements to test when evaluate_all_orders = TRUE |
num_hidden_layers |
Numeric(2) vector: (min, max) for number of hidden layers |
hidden_dims |
Numeric vector: hidden layer dimensions to test |
latent_dim |
Numeric(2) vector: (min, max) for latent dimension |
latent_shared |
Logical vector: whether latent space is shared across clusters |
output_shared |
Logical vector: whether output layer is shared across clusters |
lr |
Numeric(2) vector: (min, max) learning rate range |
decay_factor |
Numeric(2) vector: (min, max) LR decay factor range |
weight_decay |
Weight decay (L2 penalty) used in Adam optimizer. |
beta |
Numeric: KL divergence weight (fixed or range) |
num_epochs |
Integer: number of initial training epochs (fixed or range) |
batch_size |
Integer: mini-batch size (fixed or range) |
num_shared_encode |
Numeric vector: numbers of shared encoder layers to test |
num_shared_decode |
Numeric vector: numbers of shared decoder layers to test |
encoder_shared_placement |
Character vector: placement strategies for encoder shared layers |
decoder_shared_placement |
Character vector: placement strategies for decoder shared layers |
refit_patience |
Integer: early stopping patience for refit loops |
refit_loops |
Integer: maximum number of refit loops |
epochs_per_loop |
Integer: epochs per refit loop |
reset_lr_refit |
Logical vector: whether to reset LR before refit |
debug |
Logical; if TRUE, additional metadata is returned for debugging. |
columns_ignore |
Alias of cols_ignore. Kept for continuity. |
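The roles of val_proportion and replacement_value can be illustrated with a small base-R sketch. This is not the package's internal procedure: it assumes a simple uniform draw over all observed cells, and the names x, val_idx, train, and model_input are hypothetical.

```r
set.seed(42)

# Toy numeric matrix with 40 of 200 entries set to NA
x <- matrix(rnorm(200), nrow = 20)
x[sample(length(x), 40)] <- NA

val_proportion <- 0.1
replacement_value <- 0

# Hold out a share of the observed (non-missing) cells for validation
observed <- which(!is.na(x))
n_val <- round(val_proportion * length(observed))
val_idx <- sample(observed, n_val)

# Validation matrix: only the held-out cells are non-NA
val_data <- matrix(NA_real_, nrow(x), ncol(x))
val_data[val_idx] <- x[val_idx]

# Training matrix: held-out cells are masked as missing
train <- x
train[val_idx] <- NA

# Missing entries are filled with replacement_value before model input
model_input <- train
model_input[is.na(model_input)] <- replacement_value
```

Here 16 of the 160 observed cells end up in val_data, and model_input contains no NA values.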
A named list with the following components:
A data frame containing the imputed values.
The fitted CISS-VAE model object
The ClusterDataset object used
The vector of cluster assignments
An optuna study object containing the trial results
A data frame of trial results
Validation dataset used
Imputed values of validation dataset
Use cluster_on_missing() or cluster_on_missing_prop() for cluster assignments.
Use GPU computation when available; call check_devices() to see available devices.
Adjust batch_size based on memory (larger is faster but uses more memory).
Set verbose = TRUE or show_progress = TRUE to monitor training.
Explore the optuna-dashboard (see vignette optunadb) for hyperparameter importance.
For binary features, set names(binary_feature_mask) <- colnames(data).
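As a minimal sketch of the last note, a named binary_feature_mask can be built in base R as follows. The helper is_binary and the toy data frame are hypothetical, and the rule "only 0/1 values among observed entries" is an assumption about what counts as binary.

```r
df <- data.frame(
  age      = c(34, 51, 29, 62),
  smoker   = c(0, 1, NA, 1),
  bmi      = c(22.1, NA, 27.5, 30.2),
  diabetic = c(1, 0, 0, NA)
)

# Hypothetical helper: a column counts as binary if its observed values are all 0 or 1
is_binary <- function(x) {
  vals <- unique(x[!is.na(x)])
  length(vals) > 0 && all(vals %in% c(0, 1))
}

binary_feature_mask <- vapply(df, is_binary, logical(1))

# Name the mask by column so each flag is tied to its feature
names(binary_feature_mask) <- colnames(df)
binary_feature_mask
```

The resulting logical vector has one entry per column of df, with smoker and diabetic flagged as binary.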
## Requires a working Python environment via reticulate
## Examples are wrapped in try() to avoid failures on CRAN check systems
try({
  reticulate::use_virtualenv("cissvae_environment", required = TRUE)

  data(df_missing)
  data(clusters)

  ## Run autotuning
  aut <- autotune_cissvae(
    data = df_missing,
    index_col = "index",
    clusters = clusters$clusters,
    n_trials = 3,
    study_name = "comprehensive_vae_autotune",
    device_preference = "cpu",
    seed = 42,

    ## Hyperparameter search space
    num_hidden_layers = c(2, 5),
    hidden_dims = c(64, 512),
    latent_dim = c(10, 100),
    latent_shared = c(TRUE, FALSE),
    output_shared = c(TRUE, FALSE),
    lr = c(0.01, 0.1),
    decay_factor = c(0.99, 1.0),
    beta = c(0.01, 0.1),
    num_epochs = c(5, 20),
    batch_size = c(1000, 4000),
    num_shared_encode = c(0, 1, 2),
    num_shared_decode = c(0, 1, 2),

    ## Placement strategies
    encoder_shared_placement = c(
      "at_end", "at_start",
      "alternating", "random"
    ),
    decoder_shared_placement = c(
      "at_start", "at_end",
      "alternating", "random"
    ),

    refit_patience = 2,
    refit_loops = 10,
    epochs_per_loop = 5,
    reset_lr_refit = c(TRUE, FALSE)
  )

  ## Visualize architecture
  plot_vae_architecture(
    aut$model,
    title = "Optimized CISSVAE Architecture"
  )
})
This function prints the available devices (cpu, cuda, mps) detected by PyTorch. If your mps/cuda device is not shown, check your PyTorch installation.
check_devices(env_path = NULL)
env_path |
Path to virtual environment containing PyTorch and ciss-vae. Defaults to NULL. |
Vector of strings for available devices.
try(check_devices())
Visualize the pattern of missing values in a dataset, arranged by cluster. Each column in the heatmap represents one observation and each row a feature. Tiles indicate whether a value is missing (yellow by default) or present (purple4 by default). Cluster labels are shown as a column annotation bar above the heatmap. The package ComplexHeatmap must be installed for this function to work.
cluster_heatmap(
  data,
  clusters,
  cols_ignore = NULL,
  show_row_names = TRUE,
  missing_color = "yellow",
  observed_color = "purple4",
  title = "Missingness Heatmap by Cluster"
)
data |
A |
clusters |
A vector of cluster labels for each observation (row) in
|
cols_ignore |
Optional character vector of column names in |
show_row_names |
Logical. If TRUE, displays feature names on plot |
missing_color |
Display color of missing values. Default "yellow". |
observed_color |
Display color of observed values. Default "purple4". |
title |
Optional plot title. Defaults to "Missingness Heatmap by Cluster" |
This function constructs a binary missingness matrix where 1 indicates a missing value and 0 a present value. Columns (observations) are ordered by their cluster labels, and the function displays a heatmap of missingness patterns using ComplexHeatmap. Cluster membership is displayed as an annotation above the heatmap.
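The matrix construction described above can be sketched in base R. This is an illustration under assumptions, not the function's internal code: the variable names miss, ord, and miss_ordered are hypothetical, and the plotting step with ComplexHeatmap is left out.

```r
df <- data.frame(
  x1 = c(1, NA, 3, NA),
  x2 = c(NA, 2, 3, 4),
  x3 = c(1, 2, NA, 4)
)
cl <- c("B", "A", "B", "A")

# Binary missingness matrix: 1 = missing, 0 = present.
# Transposed so that features are rows and observations are columns.
miss <- t(is.na(df)) * 1L

# Order the observation columns by cluster label
ord <- order(cl)
miss_ordered <- miss[, ord, drop = FALSE]
colnames(miss_ordered) <- paste0("obs", ord)

# Cluster labels in the same order, usable as a column annotation
cl_ordered <- cl[ord]

miss_ordered
```

After ordering, observations 2 and 4 (cluster "A") come first, followed by observations 1 and 3 (cluster "B").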
A list of class "ComplexHeatmap" containing the heatmap
object. This can be used for further inspection or manual redraw.
if (requireNamespace("ComplexHeatmap")) {
  # Simple example with small dataset
  df <- data.frame(
    x1 = c(1, NA, 3),
    x2 = c(NA, 2, 3),
    x3 = c(1, 2, NA)
  )
  cl <- c("A", "B", "A")
  cluster_heatmap(df, cl)

  # Example excluding a column prior to plotting
  cluster_heatmap(df, cl, cols_ignore = "x2")

  # Adding a 'Cluster' label and changing colors
  cluster_heatmap(
    df,
    clusters = paste0("Cluster ", cl),
    cols_ignore = "x2",
    missing_color = "red",
    observed_color = "blue"
  )
}
Given an R data.frame or matrix with missing values, clusters on the pattern of missingness and returns cluster labels plus silhouette score.
cluster_on_missing(
  data,
  cols_ignore = NULL,
  n_clusters = NULL,
  seed = 42,
  k_neighbors = NULL,
  leiden_resolution = 0.25,
  leiden_objective = "CPM",
  use_snn = TRUE,
  columns_ignore = NULL
)
data |
A data.frame or matrix (samples × features), may contain |
cols_ignore |
Character vector of column names to ignore when clustering. |
n_clusters |
Integer; if provided, will run KMeans with this many clusters.
If |
seed |
Integer; random seed for KMeans (or reproducibility in Leiden). |
k_neighbors |
Integer; minimum cluster size for Leiden.
If |
leiden_resolution |
Resolution for Leiden Clustering. |
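leiden_objective |
Character; Leiden optimization objective. Default "CPM". |
use_snn |
Logical; whether to use shared nearest neighbors. Default TRUE. |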
columns_ignore |
Alias for cols_ignore. Kept for continuity. |
A list with components:
clusters — integer vector of cluster labels
silhouette — numeric silhouette score, or NA if not computable
Groups samples with similar patterns of missingness across features using
either K-means clustering (when n_clusters is specified) or Leiden
(when n_clusters is NULL). This is useful for detecting cohorts with
shared missing-data behavior (e.g., site/batch effects).
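The K-means branch can be illustrated with a base-R stand-in. The package runs its clustering through Python; the sketch below swaps in stats::kmeans on a 0/1 missingness indicator matrix purely to show the idea, and the toy data and variable names are hypothetical.

```r
# Rows 1-4 are mostly missing in a/b, rows 5-8 mostly missing in c/d
df <- data.frame(
  a = c(NA, NA, NA, NA, 1, 2, 3, 4),
  b = c(NA, NA, 5, NA, 1, 2, 3, 4),
  c = c(1, 2, 3, 4, NA, NA, NA, NA),
  d = c(1, 2, 3, 4, NA, NA, 7, NA)
)

# Each sample is represented by its missingness pattern (1 = missing)
miss <- is.na(df) * 1

set.seed(42)
km <- stats::kmeans(miss, centers = 2, nstart = 10)
clusters <- unname(km$cluster)

# Samples with similar missingness patterns share a label
table(clusters)
```

Clustering on the indicator matrix rather than on the values themselves is what lets cohorts with shared missing-data behavior fall into the same group.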
cluster_on_missing_prop(
  prop_matrix,
  n_clusters = NULL,
  seed = NULL,
  k_neighbors = NULL,
  leiden_resolution = 0.25,
  use_snn = TRUE,
  leiden_objective = "CPM",
  metric = "euclidean",
  scale_features = FALSE
)
prop_matrix |
Matrix or data frame where rows are samples and
columns are features, entries are missingness proportions in |
n_clusters |
Integer; number of clusters for KMeans. If |
seed |
Integer; random seed for KMeans reproducibility (default: |
k_neighbors |
Integer; Leiden minimum cluster size. If |
leiden_resolution |
Numeric; Leiden cluster selection threshold
(default: |
use_snn |
Logical; whether to use shared nearest neighbors (optional). |
leiden_objective |
Character; Leiden optimization objective (optional). |
metric |
Character; distance metric. Options include:
|
scale_features |
Logical; whether to standardize feature columns
before clustering samples (default: |
A list with:
clusters: Integer vector of cluster assignments per sample.
silhouette_score: Numeric silhouette score, or NULL
if not computable.
set.seed(123)

dat <- data.frame(
  sample_id = paste0("s", 1:12),
  # Two features measured at 3 timepoints each -> proportions by feature
  A_1 = c(NA, rnorm(11)),
  A_2 = c(NA, rnorm(11)),
  A_3 = rnorm(12),
  B_1 = rnorm(12),
  B_2 = c(rnorm(10), NA, NA),
  B_3 = rnorm(12)
)

pm <- create_missingness_prop_matrix(
  dat,
  index_col = "sample_id",
  repeat_feature_names = c("A", "B")
)

## cluster_on_missing_prop requires a working Python environment via reticulate
## Examples are wrapped in try() to avoid failures on CRAN check systems
try({
  res <- cluster_on_missing_prop(
    pm,
    n_clusters = 2,
    metric = "cosine",
    scale_features = TRUE
  )
  table(res$clusters)
  res$silhouette_score
})
Produce a cluster-stratified summary table using gtsummary, where the
cluster assignments are supplied as a separate vector.
All additional arguments (...) are passed directly to
gtsummary::tbl_summary(), so users can specify
all_continuous() / all_categorical() selectors and custom statistics.
cluster_summary(
  data,
  clusters,
  add_options = list(add_overall = FALSE, add_n = TRUE, add_p = FALSE),
  return_as = c("gtsummary", "gt"),
  include = NULL,
  ...
)
data |
A data.frame or tibble of features to summarize. |
clusters |
A vector (factor, character, or numeric) of cluster labels
with length equal to |
add_options |
List of post-processing options:
|
return_as |
|
include |
Optional character vector of variables to include.
Defaults to all columns in |
... |
Passed to |
A gtsummary::tbl_summary (default) or gt::gt_tbl if return_as="gt".
if (requireNamespace("gtsummary")) {
  df <- data.frame(
    age = rnorm(100, 60, 10),
    bmi = rnorm(100, 28, 5),
    sex = sample(c("F", "M"), 100, TRUE)
  )
  cl <- sample(1:3, 100, TRUE)

  cluster_summary(
    data = df,
    clusters = cl,
    statistic = list(
      gtsummary::all_continuous() ~ "{mean} ({sd})",
      gtsummary::all_categorical() ~ "{n} / {N} ({p}%)"
    ),
    missing = "always"
  )
}
A tibble assigning each observation in df_missing to a cluster
determined by its missingness pattern.
clusters
A tibble with 8000 rows and 2 variables:
Integer. Row identifier imported from data_raw/clusters.csv.
Factor (or integer) giving the missingness‐based cluster for each row.
Imported from data_raw/clusters.csv, then renamed ...1 → index.
data(clusters)
table(clusters$cluster)
Finds an existing virtualenv, either by name (in the default location) or at a custom filesystem path. If none exists, the function creates it and installs CISSVAE into it.
create_cissvae_env(
  envname = "cissvae_environment",
  path = NULL,
  install_python = FALSE,
  python_version = "3.10"
)
envname |
Name of the virtual environment (when using the default env location). |
path |
Character; optional path to the directory in which to create/use the virtualenv. |
install_python |
Logical; if TRUE, install Python when no installation of at least the requested version is found on the system. |
python_version |
Python version string (major.minor), used when installing Python. |
NULL. Called for side effects.
## Requires a working Python environment via reticulate
## Examples are wrapped in try() to avoid failures on CRAN check systems
try({
  create_cissvae_env(
    envname = "cissvae_environment",
    install_python = FALSE,
    python_version = "3.10"
  )
})
Creates a matrix where each entry represents the proportion of missing values
for each sample–feature combination across multiple timepoints. Each sample will have
one proportion value per feature. Features may have repeated time points
(columns named like feature_1, feature_2, ...). This matrix can be used
with cluster_on_missing_prop() to group samples with similar missingness patterns.
create_missingness_prop_matrix(
  data,
  index_col = NULL,
  cols_ignore = NULL,
  na_values = c(NA, NaN, Inf, -Inf),
  repeat_feature_names = character(0),
  loose = FALSE
)
data |
Data frame or matrix containing the input data with potential missing values. |
index_col |
Character scalar. Name of an index column to exclude from analysis (optional). If supplied and present, it will be removed from analysis; row names are preserved as-is. |
cols_ignore |
Character vector of column names to exclude from the proportion matrix (optional). |
na_values |
Vector of values to treat as missing in addition to standard missing values.
Defaults to |
repeat_feature_names |
Character vector of "base" feature names that have repeated timepoints.
Repeat measurements must be in the form |
loose |
Logical. If TRUE, matches any column whose name starts with a feature name from repeat_feature_names. |
A numeric matrix of dimension nrow(data) by n_features, where rows are
samples and columns are features (base names). Entries are per-sample missingness proportions in [0, 1].
The returned matrix has an attribute "feature_columns_map": a named list mapping each
output feature to the source columns used to compute its proportion.
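The proportion computation can be sketched in base R as follows. This is an illustration, not the package's implementation: it assumes repeated columns follow the feature_1, feature_2, ... naming described above, treats every non-finite entry (NA, NaN, Inf, -Inf) as missing, and uses the hypothetical helper prop_for.

```r
df <- data.frame(
  id      = paste0("s", 1:4),
  CRP_1   = c(1.2, NA, 2.1, NaN),
  CRP_2   = c(NA, NA, 2.0, 1.9),
  IL6_1   = c(0.5, 0.7, Inf, 0.4),
  IL6_2   = c(0.6, -Inf, 0.8, 0.5),
  Albumin = c(3.9, 4.1, 4.0, NA)
)

# Drop the index column, then flag non-finite entries as missing
features <- df[, setdiff(names(df), "id")]
is_missing <- !sapply(features, is.finite)

# Hypothetical helper: proportion of missing timepoints per sample for one base feature
prop_for <- function(base) {
  cols <- grep(paste0("^", base, "_[0-9]+$"), colnames(is_missing), value = TRUE)
  if (length(cols) == 0) cols <- base  # feature without repeated timepoints
  rowMeans(is_missing[, cols, drop = FALSE])
}

bases <- c("CRP", "IL6", "Albumin")
prop <- sapply(bases, prop_for)
rownames(prop) <- df$id
prop
```

Each row of prop is one sample and each column one base feature, so a sample with one of two CRP measurements missing gets 0.5 in the CRP column.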
df <- data.frame(
  id = paste0("s", 1:4),
  CRP_1 = c(1.2, NA, 2.1, NaN),
  CRP_2 = c(NA, NA, 2.0, 1.9),
  IL6_1 = c(0.5, 0.7, Inf, 0.4),
  IL6_2 = c(0.6, -Inf, 0.8, 0.5),
  Albumin = c(3.9, 4.1, 4.0, NA)
)

m <- create_missingness_prop_matrix(
  data = df,
  index_col = "id",
  cols_ignore = NULL,
  repeat_feature_names = c("CRP", "IL6")
)

dim(m) # 4 x 3 (CRP, IL6, Albumin)

# per-sample proportion missing across CRP_1 and CRP_2
m[, "CRP"]

attr(m, "feature_columns_map")
A tibble of simulated biomarker measurements with missing entries.
Each row corresponds to one observation (indexed by index), and the remaining
columns are the measured biomarker values, some of which are set to NA to
demonstrate imputation workflows.
df_missing
A tibble with 8,000 rows and 30 variables:
Integer. Row identifier imported from data_raw/df_missing.csv.
Demographic columns. Omit from selection of validation set. No missingness
Simulated Biomarker columns, have missingness
Imported from data_raw/df_missing.csv, then renamed ...1 → index.
data(df_missing)
str(df_missing)
summary(df_missing)
A sample imputable_matrix (dataframe).
dni
A dataframe:
A mock imputable_matrix dataframe
Imported from data_raw/dni.csv
data(dni)
An S3 class returned by the CISS-VAE imputation function. Inherits from
list, so all standard list operations (e.g., $, [[)
work as expected. The class exists primarily to enable type-safe save/load
methods that correctly handle the embedded PyTorch model.
The following fields may be present depending on which return_*
arguments were set when the model was run:
imputed_dataset: Data frame. The full imputed dataset.
raw_data: Data frame. The original data before imputation.
model: Python object (CISSVAE). The trained PyTorch model. Present if return_model = TRUE.
cluster_dataset: Python object (ClusterDataset). Present if return_dataset = TRUE.
clusters: Integer vector. Cluster assignment per row. Present if return_clusters = TRUE.
silhouette_width: Numeric vector. Per-sample silhouette widths. Present if return_silhouettes = TRUE.
training_history: Data frame. Loss values recorded during training. Present if return_history = TRUE.
val_data: Data frame. Validation data with observed entries replaced by NA. Present if return_validation_dataset = TRUE.
val_imputed: Data frame. Model reconstructions of the validation entries. Present if return_validation_dataset = TRUE.
save_impute_result, load_impute_result,
print.impute_result, performance_by_cluster
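How an S3 class layered on top of a list behaves can be shown with a mock object. The sketch below is not the package's constructor: new_mock_result and the class name mock_impute_result are hypothetical and only mirror the pattern of a list-based class with its own print method.

```r
# Hypothetical constructor: a plain list with an extra class attribute
new_mock_result <- function(imputed_dataset, clusters = NULL) {
  structure(
    list(imputed_dataset = imputed_dataset, clusters = clusters),
    class = c("mock_impute_result", "list")
  )
}

# A class-specific print method, dispatched through S3
print.mock_impute_result <- function(x, ...) {
  cat("<mock_impute_result> with fields:", paste(names(x), collapse = ", "), "\n")
  invisible(x)
}

res <- new_mock_result(data.frame(a = 1:3), clusters = c(0L, 1L, 0L))

# Standard list operations still work
res$clusters
res[["imputed_dataset"]]
inherits(res, "list")

print(res)
```

Because the object is still a list underneath, existing code that indexes with $ or [[ keeps working, while save/load or print methods can dispatch on the extra class.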
Given a loaded model, an R data frame, and a vector of cluster labels, this builds the Python ClusterDataset and DataLoader, runs inference, and returns an imputed data frame in R.
impute_with_cissvae(
  model,
  data,
  index_col = NULL,
  cols_ignore = NULL,
  clusters,
  imputable_matrix = NULL,
  binary_feature_mask = NULL,
  categorical_column_map = NULL,
  replacement_value = 0,
  batch_size = NULL,
  seed = 42
)
model |
Python model object loaded via load_cissvae_model() |
data |
R data.frame with missing values |
index_col |
String name of index column to preserve (optional) |
cols_ignore |
Character vector of column names to exclude from imputation scoring. |
clusters |
Integer vector of cluster labels for rows of data |
imputable_matrix |
Logical matrix indicating entries allowed to be imputed. |
binary_feature_mask |
Logical vector marking which columns are binary. |
categorical_column_map |
Named list. Maps generated dummy variable column names to the original categorical column name. |
replacement_value |
Numeric value used to replace missing entries before model input. |
batch_size |
Batch size passed to Python DataLoader. If NULL, batch_size = nrow(data) |
seed |
Base random seed for reproducible results |
Imputed R data.frame
Use the same ClusterDataset parameters as in the initial model training.
Clusters must use the same labels as the clusters used for model training.
binary_feature_mask is required for correct imputation of binary columns.
## Requires a working Python environment via reticulate
## Wrapped in try() and donttest to avoid CRAN check failures
try({
  # Activate your reticulate Python environment with ciss_vae installed
  reticulate::use_virtualenv("cissvae_environment", required = TRUE)

  # Load example data and clusters (replace with your own)
  data(df_missing)
  data(clusters)

  # Load a previously saved model
  model <- try(load_cissvae_model("model.pt", python_env = "cissvae_environment"))

  # Perform imputation on new data
  imputed_df <- try(
    impute_with_cissvae(
      model = model,
      data = df_missing,
      index_col = "index",
      cols_ignore = c("Age", "Salary"),
      clusters = clusters$clusters,
      imputable_matrix = NULL,
      binary_feature_mask = NULL,
      replacement_value = 0,
      batch_size = 4000L,
      seed = 42
    )
  )
})
Loads a CISSVAE model previously saved by
save_cissvae_model. When the model was saved with
method = "state_dict" (the default), the architecture is
automatically reconstructed from the paired .config.rds file and
the weights are loaded via load_state_dict. When the model was saved
with method = "full", the entire Python object is deserialised
directly.
load_cissvae_model(
  file,
  method = c("state_dict", "full"),
  device = "cpu",
  python_env = NULL
)
file |
Character string. Path to the saved model file ( |
method |
Character string. One of |
device |
Character string. PyTorch device string passed to
|
python_env |
Optional character string. Name of a virtualenv or conda
environment to activate before loading. If |
The CISSVAE class is imported from the installed
ciss_vae.classes Python package using reticulate::import.
A CISSVAE Python object in evaluation mode
(model.eval() has been called).
save_cissvae_model, load_impute_result
try({
  reticulate::use_virtualenv("cissvae_environment", required = TRUE)

  pt_file <- tempfile(fileext = ".pt")

  # save first
  # (`dat` is assumed to be an existing result holding a trained model in dat$model)
  save_cissvae_model(dat$model, file = pt_file)

  # reload in same or new session
  model <- load_cissvae_model(file = pt_file, device = "cpu")
})
Reads an impute_result object from disk, previously saved by
save_impute_result. R-native fields are restored from the RDS
file. If a CISSVAE model was saved, the architecture is rebuilt from
the stored configuration, the state_dict weights are loaded, and the
model is placed in evaluation mode (model.eval()).
The CISSVAE class is imported from the installed
ciss_vae.classes Python package using reticulate::import.
load_impute_result(dir, name = "result", device = "cpu")
dir |
Character string. Path to the directory written by
|
name |
Character string. Base name used when saving. |
device |
Character string. PyTorch device ("cpu" or "cuda"). |
An impute_result object with model restored if present.
A sample survival dataset
mock_surv
A dataframe:
A mock survival dataset
Imported from data_raw/mock_survival.csv
data(mock_surv)
Calculates mean squared error (MSE) for continuous features and binary cross-entropy (BCE) for features explicitly marked as binary, comparing model-imputed validation values against ground-truth validation data.
performance_by_cluster(
  res,
  clusters = NULL,
  group_col = NULL,
  feature_cols = NULL,
  binary_features = character(0),
  categorical_features = list(),
  by_group = TRUE,
  by_cluster = TRUE,
  cols_ignore = NULL,
  eps = 1e-07
)
res |
A list containing CISS-VAE run outputs. Must include:
|
clusters |
Optional vector (same length as rows in |
group_col |
Optional character string naming a column in |
feature_cols |
Character vector specifying which feature columns to evaluate.
Defaults to all numeric columns except |
binary_features |
Character vector naming those columns (subset of
|
categorical_features |
List of categorical features. Key should be the category name and values should be the name of the dummy/one-hot-encoded variables. |
by_group |
Logical; if |
by_cluster |
Logical; if |
cols_ignore |
Character vector of column names to exclude from scoring (e.g., IDs). |
eps |
Numeric. Small constant used for clipping probabilities in BCE
calculation. Default is |
Validation loss is computed at the cell level and then aggregated to produce overall, per-cluster, per-group, and group-by-cluster summaries.
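For features listed in binary_features, performance is binary
cross-entropy (BCE):
$$-\left[\, y \log(p) + (1 - y) \log(1 - p) \,\right]$$
where $y$ is the observed validation value and $p$ is the predicted probability.
For other numeric features, performance is mean squared error (MSE):
$$(y - \hat{y})^2$$
where $\hat{y}$ is the imputed value.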
Losses are computed at the individual cell level using only validation
entries (non-NA in val_data), then aggregated.
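The cell-level computation described above can be sketched in base R. This is a minimal illustration, not the package's implementation: `cell_loss()` is a hypothetical helper, and the toy vectors stand in for one continuous and one binary column of `val_data` and `val_imputed`.

```r
# Hypothetical helper (not part of rCISSVAE): per-cell loss for one feature.
cell_loss <- function(truth, pred, binary = FALSE, eps = 1e-7) {
  keep <- !is.na(truth)                 # score validation cells only
  y <- truth[keep]
  p <- pred[keep]
  if (binary) {
    p <- pmin(pmax(p, eps), 1 - eps)    # clip probabilities away from 0 and 1
    -(y * log(p) + (1 - y) * log(1 - p))
  } else {
    (y - p)^2
  }
}

# Continuous feature: the NA cell is not a validation entry and is skipped
truth_cont <- c(1.0, NA, 3.0)
pred_cont  <- c(1.5, 2.0, 2.0)
mse <- mean(cell_loss(truth_cont, pred_cont))

# Binary feature: predictions are probabilities
truth_bin <- c(1, 0, NA)
pred_bin  <- c(0.9, 0.2, 0.5)
bce <- mean(cell_loss(truth_bin, pred_bin, binary = TRUE))

# Combined score, mirroring the imputation_error field
imputation_error <- mse + bce
```

Aggregating the same per-cell losses within each cluster or group, instead of over all cells, would give the per-cluster and per-group summaries.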
A named list containing:
overall: overall validation metrics (MSE, BCE, total)
per_cluster: metrics summarized by cluster (if by_cluster = TRUE)
per_group: metrics summarized by group (if by_group = TRUE)
group_by_cluster: metrics summarized by group and cluster
(if both by_group and by_cluster are TRUE)
Each summary contains:
mse: mean squared error across continuous validation cells
bce: mean binary cross-entropy across binary validation cells
imputation_error: mse + bce
set.seed(42)
data_complete <- data.frame(
  id = 1:10,
  group = sample(c("A", "B"), 10, replace = TRUE),
  x1 = rnorm(10),
  x2 = rnorm(10)
)
feature_cols <- c("x1", "x2")

# TRUE marks cells that are excluded from the validation set
missing_mask <- matrix(
  sample(c(TRUE, FALSE), 20, replace = TRUE),
  nrow = 10
)

# Apply the mask to the feature columns only, so id and group stay intact
val_data <- data_complete
val_data[, feature_cols][missing_mask] <- NA

# Mean imputation stands in for model output
val_imputed <- data.frame(
  id = data_complete$id,
  group = data_complete$group,
  x1 = mean(data_complete$x1),
  x2 = mean(data_complete$x2)
)
val_imputed[, feature_cols][missing_mask] <- NA

result <- list(
  val_data = val_data,
  val_imputed = val_imputed,
  clusters = sample(c(0, 1), 10, replace = TRUE)
)

performance_by_cluster(
  res = result,
  group_col = "group",
  binary_features = character(0),
  by_group = TRUE,
  by_cluster = TRUE,
  cols_ignore = "id"
)
Creates a horizontal schematic diagram of the CISS-VAE architecture, showing
shared and cluster-specific layers. This function wraps the Python
plot_vae_architecture function from the ciss_vae package.
plot_vae_architecture(
  model,
  title = NULL,
  color_shared = "skyblue",
  color_unshared = "lightcoral",
  color_latent = "gold",
  color_input = "lightgreen",
  color_output = "lightgreen",
  figsize = c(16, 8),
  save_path = NULL,
  dpi = 300,
  return_plot = FALSE,
  display_plot = TRUE
)
model |
A trained CISSVAE model object (Python object) |
title |
Title of the plot. If NULL, no title is displayed. Default NULL. |
color_shared |
Color for shared hidden layers. Default "skyblue". |
color_unshared |
Color for unshared (cluster-specific) hidden layers. Default "lightcoral". |
color_latent |
Color for latent layer. Default "gold". |
color_input |
Color for input layer. Default "lightgreen". |
color_output |
Color for output layer. Default "lightgreen". |
figsize |
Size of the matplotlib figure as c(width, height). Default c(16, 8). |
save_path |
Optional path to save the plot as PNG. If NULL, plot is displayed. Default NULL. |
dpi |
Resolution for saved PNG file. Default 300. |
return_plot |
Logical; if TRUE, returns the plot as an R object using reticulate. Default FALSE. |
display_plot |
Logical; if TRUE, displays the plot. Set to FALSE when only saving. Default TRUE. |
If return_plot is TRUE, returns a Python matplotlib figure object that can be further manipulated. Otherwise returns NULL invisibly.
If you get a TCL or TK error, run: reticulate::py_run_string("import matplotlib; matplotlib.use('Agg')") to change the matplotlib backend to use 'Agg' instead.
## Requires a working Python environment via reticulate
## Examples are wrapped in try() to avoid failures on CRAN check systems
try({
  # Train a model first
  result <- run_cissvae(my_data, return_model = TRUE)

  # Basic plot
  plot_vae_architecture(result$model)

  # Save plot to file
  plot_vae_architecture(
    model = result$model,
    title = "CISS-VAE Architecture",
    save_path = "vae_architecture.png",
    dpi = 300
  )

  # Return plot object for further manipulation
  fig <- plot_vae_architecture(
    model = result$model,
    return_plot = TRUE,
    display_plot = FALSE
  )
})
impute_result object
Prints a structured summary of an impute_result object, including
the dimensions of the imputed dataset, a preview of the imputed data,
cluster composition, and which optional components are present.
## S3 method for class 'impute_result'
print(x, n = 6, ...)
x |
An object of class |
n |
Integer. Number of rows of the imputed dataset to preview.
Defaults to 6. |
... |
Further arguments passed to or from other methods (currently unused). |
x, invisibly.
try({
  res <- ciss_vae(data = my_data, ...)
  print(res)          # default: 6 row preview
  print(res, n = 10)
})
This function wraps the Python run_cissvae function from the ciss_vae package,
providing a complete pipeline for missing data imputation using a Clustering-Informed
Shared-Structure Variational Autoencoder (CISS-VAE). The function handles data
preprocessing and model training, and returns imputed data along with optional
model artifacts.
The CISS-VAE architecture uses cluster information to learn both shared and cluster-specific representations, enabling more accurate imputation by leveraging patterns within and across different data subgroups.
run_cissvae(
  data,
  index_col = NULL,
  val_proportion = 0.1,
  replacement_value = 0,
  cols_ignore = NULL,
  imputable_matrix = NULL,
  binary_feature_mask = NULL,
  categorical_column_map = NULL,
  print_dataset = TRUE,
  clusters = NULL,
  n_clusters = NULL,
  seed = 42,
  missingness_proportion_matrix = NULL,
  scale_features = FALSE,
  k_neighbors = 15L,
  leiden_resolution = 0.5,
  leiden_objective = "CPM",
  hidden_dims = c(150, 120, 60),
  latent_dim = 15,
  layer_order_enc = c("unshared", "unshared", "unshared"),
  layer_order_dec = c("shared", "shared", "shared"),
  latent_shared = FALSE,
  output_shared = FALSE,
  batch_size = 4000,
  epochs = 500,
  initial_lr = 0.01,
  decay_factor = 0.999,
  weight_decay = 0.001,
  beta = 0.001,
  device = NULL,
  max_loops = 100,
  patience = 2,
  epochs_per_loop = NULL,
  initial_lr_refit = NULL,
  decay_factor_refit = NULL,
  beta_refit = NULL,
  verbose = FALSE,
  return_model = TRUE,
  return_clusters = TRUE,
  return_silhouettes = FALSE,
  return_history = FALSE,
  return_dataset = FALSE,
  return_validation_dataset = TRUE,
  debug = FALSE,
  columns_ignore = NULL
)
data |
A data.frame or matrix (samples × features) containing the data to impute.
May contain |
index_col |
Character. Name of column in |
val_proportion |
Numeric. Fraction of non-missing entries to hold out for
validation during training. Must be between 0 and 1. Default 0.1. |
replacement_value |
Numeric. Fill value for masked entries during training.
Default 0. |
cols_ignore |
Character or integer vector. Columns to exclude from validation set.
Can specify by name or index. Default NULL. |
imputable_matrix |
Logical matrix indicating entries allowed to be imputed. |
binary_feature_mask |
Logical vector marking which columns are binary. |
categorical_column_map |
Named list. Maps generated dummy variable column names to the original categorical column name. |
print_dataset |
Logical. If |
clusters |
Optional vector or single-column data.frame of precomputed cluster
labels for samples. If |
n_clusters |
Integer. Number of clusters for KMeans clustering when |
seed |
Integer. Random seed for reproducible results. Default 42. |
missingness_proportion_matrix |
Optional pre-computed missingness proportion
matrix for biomarker-based clustering. If provided, clustering will be based on
these proportions. Default NULL. |
scale_features |
Logical. Whether to scale features when using missingness
proportion matrix clustering. Default FALSE. |
k_neighbors |
Integer. Number of nearest neighbors for Leiden clustering. Defaults to 15. |
leiden_resolution |
Float. Resolution parameter for Leiden clustering. Defaults to 0.5. |
leiden_objective |
Character. Objective function for Leiden clustering. One of ("CPM", "RB", "Modularity"). Default "CPM". |
hidden_dims |
Integer vector. Sizes of hidden layers in encoder/decoder.
Length determines number of hidden layers. Default c(150, 120, 60). |
latent_dim |
Integer. Dimension of latent space representation. Default 15. |
layer_order_enc |
Character vector. Sharing pattern for encoder layers.
Each element should be "shared" or "unshared". Length must match |
layer_order_dec |
Character vector. Sharing pattern for decoder layers.
Each element should be "shared" or "unshared". Length must match |
latent_shared |
Logical. Whether latent space weights are shared across clusters.
Default FALSE. |
output_shared |
Logical. Whether output layer weights are shared across clusters.
Default FALSE. |
batch_size |
Integer. Mini-batch size for training. Larger values may improve
training stability but require more memory. Default 4000. |
epochs |
Integer. Number of epochs for initial training phase. Default 500. |
initial_lr |
Numeric. Initial learning rate for optimizer. Default 0.01. |
decay_factor |
Numeric. Exponential decay factor for learning rate scheduling.
Must be between 0 and 1. Default 0.999. |
weight_decay |
Weight decay (L2 penalty) used in Adam optimizer. |
beta |
Numeric. Weight for KL divergence term in VAE loss function.
Controls regularization strength. Default 0.001. |
device |
Character. Device specification for computation ("cpu" or "cuda").
If |
max_loops |
Integer. Maximum number of impute-refit loops to perform.
Default 100. |
patience |
Integer. Early stopping patience for refit loops. Training stops
if validation loss doesn't improve for this many consecutive loops. Default 2. |
epochs_per_loop |
Integer. Number of epochs per refit loop. If |
initial_lr_refit |
Numeric. Learning rate for refit loops. If |
decay_factor_refit |
Numeric. Decay factor for refit loops. If |
beta_refit |
Numeric. KL weight for refit loops. If |
verbose |
Logical. If |
return_model |
Logical. If |
return_clusters |
Logical. If TRUE, returns the vector of cluster assignments. |
return_silhouettes |
Logical. If |
return_history |
Logical. If |
return_dataset |
Logical. If |
return_validation_dataset |
Logical. If |
debug |
Logical; if TRUE, additional metadata is returned for debugging. |
columns_ignore |
Alias of cols_ignore, kept for backward compatibility. |
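The interplay of val_proportion, replacement_value, and cols_ignore can be sketched in base R. This is an assumption-laden illustration, not how the ciss_vae package selects entries: `make_validation_split()` is a hypothetical helper, it takes a single proportion rather than per-cluster proportions, and it samples uniformly over all eligible cells.

```r
# Hypothetical helper (not part of rCISSVAE): hold out a fraction of the
# observed cells for validation and fill masked cells for training.
make_validation_split <- function(x, val_proportion = 0.1,
                                  replacement_value = 0,
                                  cols_ignore = NULL, seed = 42) {
  x <- as.matrix(x)
  set.seed(seed)

  # Only observed cells outside the ignored columns can be held out
  eligible <- !is.na(x)
  if (!is.null(cols_ignore)) eligible[, cols_ignore] <- FALSE

  idx <- which(eligible)
  n_val <- floor(val_proportion * length(idx))
  held_out <- idx[sample.int(length(idx), n_val)]

  # Validation matrix: true values at held-out cells, NA elsewhere
  val <- matrix(NA_real_, nrow(x), ncol(x), dimnames = dimnames(x))
  val[held_out] <- x[held_out]

  # Training matrix: held-out cells are hidden, then every masked cell
  # (originally missing or held out) is filled with replacement_value
  train <- x
  train[held_out] <- NA
  mask <- !is.na(train)
  train[!mask] <- replacement_value

  list(train = train, val = val, mask = mask)
}

x <- cbind(id = 1:20, a = c(NA, 2:20), b = as.numeric(21:40))
split <- make_validation_split(x, val_proportion = 0.25, cols_ignore = "id")
```

In this toy input, 39 cells are eligible (the `id` column is ignored and one cell of `a` is missing), so 9 cells are held out and the training matrix contains 10 filled cells in total.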
The CISS-VAE method works in two main phases:
Initial Training: The model is trained on the original data with validation holdout to learn initial representations and imputation patterns.
Impute-Refit Loops: The model iteratively imputes missing values and retrains on the updated dataset until convergence or maximum loops reached.
The architecture uses both shared and cluster-specific layers to capture:
Shared patterns: Common relationships across all clusters
Specific patterns: Unique relationships within each cluster
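The stopping rule for the impute-refit phase, as described by max_loops and patience, can be sketched as follows. This is a simplified illustration under stated assumptions: `run_refit_loops()` is a hypothetical function, and `val_losses` stands in for the validation loss that a real run would measure after each loop of imputing and retraining.

```r
# Hypothetical sketch (not part of rCISSVAE): patience-based early stopping
# over a sequence of per-loop validation losses.
run_refit_loops <- function(val_losses, max_loops = 100, patience = 2) {
  best <- Inf        # lowest validation loss seen so far
  best_loop <- 0L    # loop at which it was reached
  stale <- 0L        # consecutive loops without improvement
  loops_run <- 0L

  for (i in seq_len(min(max_loops, length(val_losses)))) {
    loops_run <- i
    if (val_losses[i] < best) {
      best <- val_losses[i]
      best_loop <- i
      stale <- 0L
    } else {
      stale <- stale + 1L
      if (stale >= patience) break   # stop after `patience` stale loops
    }
  }

  list(loops_run = loops_run, best_loop = best_loop, best_loss = best)
}

out <- run_refit_loops(c(0.90, 0.70, 0.65, 0.66, 0.67, 0.50), patience = 2)
```

With patience = 2 the sketch stops after the fifth loop, because loops four and five fail to improve on loop three; the lower loss in the sixth position is never reached. A larger patience trades extra training time for a chance to escape such plateaus.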
A list containing imputed data and optional additional outputs:
data.frame of imputed data with same dimensions as input.
Missing values are filled with model predictions. If index_col was
provided, it is re-attached as the first column.
(if return_model=TRUE) Python CISSVAE model object.
Can be used for further analysis or predictions.
(if return_dataset=TRUE) Python ClusterDataset object
containing validation data, masks, normalization parameters, and cluster labels.
Can be used with performance_by_cluster() and other analysis functions.
(if return_clusters=TRUE) Returns vector of cluster assignments
(if return_silhouettes=TRUE) Numeric silhouette
score measuring cluster separation quality.
(if return_history=TRUE) data.frame containing training
history with columns for epoch, losses, and validation metrics.
(if return_validation_dataset=TRUE) data.frame containing values held
aside for validation.
(if return_validation_dataset=TRUE) data.frame containing the imputed values for the entries held
aside for validation.
This function requires the Python ciss_vae package to be installed and
accessible via reticulate.
If Leiden clustering yields too many clusters, consider increasing k_neighbors or reducing leiden_resolution.
Use GPU computation when available for faster training on large datasets. Use check_devices() to see what devices are available.
Adjust batch_size based on available memory (larger is faster but uses more memory).
Set verbose = TRUE to monitor training progress.
create_missingness_prop_matrix for creating missingness proportion matrices
performance_by_cluster for analyzing model performance using the returned dataset
## Requires a working Python environment via reticulate
## Examples are wrapped in try() to avoid failures on CRAN check systems
library(rCISSVAE)
data(df_missing)
data(clusters)

try({
  dat <- run_cissvae(
    data = df_missing,
    index_col = "index",
    val_proportion = 0.1, ## pass a vector for different proportions by cluster
    cols_ignore = c("Age", "Salary", "ZipCode10001", "ZipCode20002", "ZipCode30003"),
    clusters = clusters$clusters, ## we have precomputed cluster labels so we pass them here
    epochs = 5,
    return_silhouettes = FALSE,
    return_history = TRUE, # Get detailed training history
    verbose = FALSE,
    return_model = TRUE, ## Allows for plotting model schematic
    device = "cpu", # Explicit device selection
    layer_order_enc = c("unshared", "shared", "unshared"),
    layer_order_dec = c("shared", "unshared", "shared")
  )
})
Saves a trained CISSVAE PyTorch model to disk. The default method
(method = "state_dict") saves only the model weights alongside an
automatically captured architecture config, which is the recommended
approach for portability and long-term reproducibility. A full-object save
(method = "full") is also available but is less portable across
Python / PyTorch versions.
save_cissvae_model(
  model,
  file,
  method = c("state_dict", "full"),
  overwrite = FALSE
)
model |
A |
file |
Character string. File path for the saved model. For
For |
method |
Character string. One of
|
overwrite |
Logical. If |
NULL, invisibly. Called for its side effects.
load_cissvae_model, save_impute_result
try({
  reticulate::use_virtualenv("cissvae_environment", required = TRUE)
  data(df_missing)
  data(clusters)

  dat <- run_cissvae(
    data = df_missing,
    index_col = "index",
    val_proportion = 0.1,
    clusters = clusters$clusters,
    epochs = 5,
    return_model = TRUE,
    device = "cpu",
    layer_order_enc = c("unshared", "shared", "unshared"),
    layer_order_dec = c("shared", "unshared", "shared")
  )

  # default: saves state_dict + config
  save_cissvae_model(dat$model, file = tempfile(fileext = ".pt"))

  # alternative: full object save
  save_cissvae_model(dat$model, file = tempfile(fileext = ".pt"), method = "full")
})
impute_result object to disk
Saves an impute_result to a directory. R-native fields are written
to an RDS file. The CISSVAE PyTorch model, if present, is saved
separately as a state_dict checkpoint (.pt) so that the live
Python object is never passed to saveRDS (which cannot serialise it).
The model architecture is captured automatically from the live object and
stored inside the RDS so that load_impute_result can
reconstruct it without any additional input from the user.
save_impute_result(x, dir, name = "result", overwrite = FALSE)
x |
An object of class |
dir |
Character string. Path to the directory in which to write files. Created recursively if it does not exist. |
name |
Character string. Base name used for output files. Defaults to
|
overwrite |
Logical. If |
Invisibly, a named list with elements meta (path to the RDS)
and model (path to the .pt file, or NULL if no model
was present).
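The split between R-native fields and the live model can be illustrated with a small base R sketch. It is not the package's implementation: `save_split_sketch()` is a hypothetical function, the `.rds` file name pattern and the `has_model` field are assumptions, and the PyTorch checkpoint step is omitted entirely, so the `model` path is always NULL here.

```r
# Hypothetical illustration (not part of rCISSVAE): write the R-native
# fields of a result list to an RDS file, leaving the model object out.
save_split_sketch <- function(x, dir, name = "result", overwrite = FALSE) {
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)

  meta_path <- file.path(dir, paste0(name, ".rds"))
  if (file.exists(meta_path) && !overwrite) {
    stop("File exists; set overwrite = TRUE to replace it: ", meta_path)
  }

  # Drop the model field so that saveRDS() only sees plain R data
  r_native <- x[setdiff(names(x), "model")]
  r_native$has_model <- !is.null(x$model)
  saveRDS(r_native, meta_path)

  # A real implementation would also write the model checkpoint and
  # return its path; this sketch does not save a model
  invisible(list(meta = meta_path, model = NULL))
}

res <- list(imputed = data.frame(a = 1:3), model = NULL)
out_dir <- file.path(tempdir(), "run_001")
paths <- save_split_sketch(res, dir = out_dir)
restored <- readRDS(paths$meta)
```

Keeping the model out of the RDS file means the metadata can be read back in any R session, even one without a Python environment attached.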
load_impute_result, impute_result
## Not run:
res <- ciss_vae(data = my_data, return_model = TRUE, ...)
save_impute_result(res, dir = "run_001", name = "imputation_result")
## End(Not run)
Uses the same env discovery rules as create_cissvae_env().
update_cissvae_env(
  envname = "cissvae_environment",
  path = NULL,
  install_python = FALSE,
  python_version = "3.10",
  pkgs = "ciss-vae",
  upgrade = TRUE,
  quiet = FALSE
)
envname |
Name of the virtual environment (default location). |
path |
Optional directory containing the virtualenv. |
install_python |
Logical; if TRUE, install Python when no installation at or above the requested version is found on the system. |
python_version |
Python version string (major.minor), used when installing Python. |
pkgs |
Character vector of pip packages to upgrade in the env. Defaults to ciss-vae only. Other dependencies of ciss-vae ("numpy", "pandas", "torch", "rich", "matplotlib", "scikit-learn", "optuna", "typing") can also be upgraded by adding them to the character vector. |
upgrade |
Logical; if TRUE, pass --upgrade to pip. |
quiet |
Logical; reduce output. |
NULL, invisibly.
## Requires a working Python environment via reticulate
## Examples are wrapped in try() to avoid failures on CRAN check systems
try({
  update_cissvae_env(
    envname = "cissvae_environment",
    install_python = FALSE,
    python_version = "3.10",
    pkgs = "ciss-vae"
  )
})