utils
Utility functions for the Alberta Framework.
AggregatedResults
Bases: NamedTuple
Aggregated results across multiple seeds.
Attributes:
- config_name: Name of the configuration
- seeds: List of seeds used
- metric_arrays: Dict mapping metric name to (n_seeds, n_steps) array
- summary: Dict mapping metric name to MetricSummary (final values)
ExperimentConfig
Bases: NamedTuple
Configuration for a single experiment.
Attributes:
- name: Human-readable name for this configuration
- learner_factory: Callable that returns a fresh learner instance
- stream_factory: Callable that returns a fresh stream instance
- num_steps: Number of learning steps to run
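A minimal construction sketch, assuming ExperimentConfig is importable from alberta_framework.utils.experiments alongside the experiment helpers; the two factory functions below are hypothetical placeholders for whatever learner and stream constructors your experiment actually uses.

```python
from alberta_framework.utils.experiments import ExperimentConfig

def make_learner():
    """Hypothetical factory: return a fresh learner instance."""
    raise NotImplementedError  # e.g. return MyLearner(step_size=0.01)

def make_stream():
    """Hypothetical factory: return a fresh stream instance."""
    raise NotImplementedError  # e.g. return MyNonStationaryStream()

config = ExperimentConfig(
    name="baseline",
    learner_factory=make_learner,
    stream_factory=make_stream,
    num_steps=10_000,
)
```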
MetricSummary
Bases: NamedTuple
Summary statistics for a single metric.
Attributes:
- mean: Mean across seeds
- std: Standard deviation across seeds
- min: Minimum value across seeds
- max: Maximum value across seeds
- n_seeds: Number of seeds
- values: Raw values per seed
SingleRunResult
Bases: NamedTuple
Result from a single experiment run.
Attributes:
- config_name: Name of the configuration that was run
- seed: Random seed used for this run
- metrics_history: List of metric dictionaries from each step
- final_state: Final learner state after training
SignificanceResult
Bases: NamedTuple
Result of a statistical significance test.
Attributes:
- test_name: Name of the test performed
- statistic: Test statistic value
- p_value: P-value of the test
- significant: Whether the result is significant at the given alpha
- alpha: Significance level used
- effect_size: Effect size (e.g., Cohen's d)
- method_a: Name of first method
- method_b: Name of second method
StatisticalSummary
Bases: NamedTuple
Summary statistics for a set of values.
Attributes:
- mean: Arithmetic mean
- std: Standard deviation
- sem: Standard error of the mean
- ci_lower: Lower bound of confidence interval
- ci_upper: Upper bound of confidence interval
- median: Median value
- iqr: Interquartile range
- n_seeds: Number of samples
aggregate_metrics(results)
Aggregate results from multiple seeds into summary statistics.
Args:
- results: List of SingleRunResult from multiple seeds
Returns: AggregatedResults with aggregated metrics
Source code in src/alberta_framework/utils/experiments.py
get_final_performance(results, metric='squared_error', window=100)
Get final performance (mean, std) for each config.
Args:
- results: Dictionary of aggregated results
- metric: Metric to evaluate
- window: Number of final steps to average
Returns: Dictionary mapping config name to (mean, std) tuple
Source code in src/alberta_framework/utils/experiments.py
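A short usage sketch, assuming results is the dictionary returned by run_multi_seed_experiment (documented below):

```python
final = get_final_performance(results, metric="squared_error", window=100)
for name, (mean, std) in final.items():
    print(f"{name}: {mean:.4f} ± {std:.4f}")
```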
get_metric_timeseries(results, metric='squared_error')
Get mean and standard deviation timeseries for a metric.
Args:
- results: Aggregated results
- metric: Name of the metric
Returns: Tuple of (mean, lower_bound, upper_bound) arrays
Source code in src/alberta_framework/utils/experiments.py
run_multi_seed_experiment(configs, seeds=30, parallel=True, n_jobs=-1, show_progress=True)
Run experiments across multiple seeds with optional parallelization.
Args:
- configs: List of experiment configurations to run
- seeds: Number of seeds (generates 0..n-1) or explicit list of seeds
- parallel: Whether to use parallel execution (requires joblib)
- n_jobs: Number of parallel jobs (-1 for all CPUs)
- show_progress: Whether to show progress bar (requires tqdm)
Returns: Dictionary mapping config name to AggregatedResults
Source code in src/alberta_framework/utils/experiments.py
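A sketch of the end-to-end multi-seed workflow, assuming the helpers are importable from alberta_framework.utils.experiments as the source path suggests; make_learner, make_other_learner, and make_stream are hypothetical factories in the spirit of the ExperimentConfig example above.

```python
from alberta_framework.utils.experiments import (
    ExperimentConfig,
    run_multi_seed_experiment,
)

configs = [
    ExperimentConfig(name="baseline", learner_factory=make_learner,
                     stream_factory=make_stream, num_steps=10_000),
    ExperimentConfig(name="variant", learner_factory=make_other_learner,
                     stream_factory=make_stream, num_steps=10_000),
]

# 30 seeds (0..29); runs in parallel across all CPUs if joblib is available.
results = run_multi_seed_experiment(configs, seeds=30, parallel=True, n_jobs=-1)
# results: dict mapping config name -> AggregatedResults
```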
run_single_experiment(config, seed)
Run a single experiment with a given seed.
Args:
- config: Experiment configuration
- seed: Random seed for the stream
Returns: SingleRunResult with metrics and final state
Source code in src/alberta_framework/utils/experiments.py
export_to_csv(results, filepath, metric='squared_error', include_timeseries=False)
Export results to CSV file.
Args:
- results: Dictionary mapping config name to AggregatedResults
- filepath: Path to output CSV file
- metric: Metric to export
- include_timeseries: Whether to include full timeseries (large!)
Source code in src/alberta_framework/utils/export.py
export_to_json(results, filepath, include_timeseries=False)
Export results to JSON file.
Args:
- results: Dictionary mapping config name to AggregatedResults
- filepath: Path to output JSON file
- include_timeseries: Whether to include full timeseries (large!)
Source code in src/alberta_framework/utils/export.py
generate_latex_table(results, significance_results=None, metric='squared_error', caption='Experimental Results', label='tab:results', metric_label='Error', lower_is_better=True)
Generate a LaTeX table of results.
Args:
- results: Dictionary mapping config name to AggregatedResults
- significance_results: Optional pairwise significance test results
- metric: Metric to display
- caption: Table caption
- label: LaTeX label for the table
- metric_label: Human-readable name for the metric
- lower_is_better: Whether lower metric values are better
Returns: LaTeX table as a string
Source code in src/alberta_framework/utils/export.py
generate_markdown_table(results, significance_results=None, metric='squared_error', metric_label='Error', lower_is_better=True)
Generate a markdown table of results.
Args:
- results: Dictionary mapping config name to AggregatedResults
- significance_results: Optional pairwise significance test results
- metric: Metric to display
- metric_label: Human-readable name for the metric
- lower_is_better: Whether lower metric values are better
Returns: Markdown table as a string
Source code in src/alberta_framework/utils/export.py
save_experiment_report(results, output_dir, experiment_name, significance_results=None, metric='squared_error')
Save a complete experiment report with all artifacts.
Args:
- results: Dictionary mapping config name to AggregatedResults
- output_dir: Directory to save artifacts
- experiment_name: Name for the experiment (used in filenames)
- significance_results: Optional pairwise significance test results
- metric: Primary metric to report
Returns: Dictionary mapping artifact type to file path
Source code in src/alberta_framework/utils/export.py
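A usage sketch combining the export and statistics helpers, assuming they are importable from alberta_framework.utils.export and alberta_framework.utils.statistics, and that results comes from run_multi_seed_experiment as above:

```python
from alberta_framework.utils.export import save_experiment_report
from alberta_framework.utils.statistics import pairwise_comparisons

sig = pairwise_comparisons(results, metric="squared_error")
artifacts = save_experiment_report(
    results,
    output_dir="reports",
    experiment_name="step_size_study",
    significance_results=sig,
    metric="squared_error",
)
print(artifacts)  # dict mapping artifact type to file path
```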
compare_learners(results, metric='squared_error')
Compare multiple learners on a given metric.
Args:
- results: Dictionary mapping learner name to metrics history
- metric: Metric to compare
Returns: Dictionary with summary statistics for each learner
Source code in src/alberta_framework/utils/metrics.py
compute_cumulative_error(metrics_history, error_key='squared_error')
Compute cumulative error over time.
Args:
- metrics_history: List of metric dictionaries from learning loop
- error_key: Key to extract error values
Returns: Array of cumulative errors at each time step
Source code in src/alberta_framework/utils/metrics.py
compute_running_mean(values, window_size=100)
Compute running mean of values.
Args:
- values: Array of values
- window_size: Size of the moving average window
Returns: Array of running mean values (same length as input, padded at start)
Source code in src/alberta_framework/utils/metrics.py
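The behaviour at the start of the sequence is easiest to see in a small self-contained sketch (an expanding window until window_size values are available); this illustrates the idea rather than reproducing the library's exact padding scheme.

```python
import numpy as np

def running_mean_sketch(values, window_size=100):
    """Same-length running mean; early entries average over the values seen so far."""
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    for t in range(values.size):
        start = max(0, t - window_size + 1)
        out[t] = values[start:t + 1].mean()
    return out
```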
compute_tracking_error(metrics_history, window_size=100)
Compute tracking error (running mean of squared error).
This is the key metric for evaluating continual learners: how well can the learner track the non-stationary target?
Args:
- metrics_history: List of metric dictionaries from learning loop
- window_size: Size of the moving average window
Returns: Array of tracking errors at each time step
Source code in src/alberta_framework/utils/metrics.py
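A brief usage sketch, where run is a hypothetical SingleRunResult returned by run_single_experiment:

```python
from alberta_framework.utils.metrics import compute_tracking_error

tracking = compute_tracking_error(run.metrics_history, window_size=100)
print(f"final tracking error: {tracking[-1]:.4f}")
```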
extract_metric(metrics_history, key)
Extract a single metric from the history.
Args:
- metrics_history: List of metric dictionaries
- key: Key to extract
Returns: Array of values for that metric
Source code in src/alberta_framework/utils/metrics.py
bonferroni_correction(p_values, alpha=0.05)
Apply Bonferroni correction for multiple comparisons.
Args:
- p_values: List of p-values from multiple tests
- alpha: Family-wise significance level
Returns: Tuple of (list of significant booleans, corrected alpha)
Source code in src/alberta_framework/utils/statistics.py
bootstrap_ci(values, statistic='mean', confidence_level=0.95, n_bootstrap=10000, seed=42)
Compute bootstrap confidence interval.
Args:
- values: Array of values
- statistic: Statistic to bootstrap ("mean" or "median")
- confidence_level: Confidence level
- n_bootstrap: Number of bootstrap samples
- seed: Random seed
Returns: Tuple of (point_estimate, ci_lower, ci_upper)
Source code in src/alberta_framework/utils/statistics.py
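For intuition, a self-contained percentile-bootstrap sketch for the mean follows; the library may use a different bootstrap variant internally.

```python
import numpy as np

def bootstrap_ci_sketch(values, confidence_level=0.95, n_bootstrap=10_000, seed=42):
    """Percentile bootstrap CI for the mean (illustrative only)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and take the mean of each resample.
    resamples = rng.choice(values, size=(n_bootstrap, values.size), replace=True)
    means = resamples.mean(axis=1)
    alpha = 1.0 - confidence_level
    ci_lower, ci_upper = np.quantile(means, [alpha / 2, 1.0 - alpha / 2])
    return values.mean(), ci_lower, ci_upper
```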
cohens_d(values_a, values_b)
Compute Cohen's d effect size.
Args:
- values_a: Values for first group
- values_b: Values for second group
Returns: Cohen's d (positive means a > b)
Source code in src/alberta_framework/utils/statistics.py
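The usual pooled-standard-deviation form of Cohen's d, as a self-contained sketch (the library's implementation may differ in detail, e.g. for paired samples):

```python
import numpy as np

def cohens_d_sketch(values_a, values_b):
    """Cohen's d with a pooled standard deviation; positive means a > b."""
    a = np.asarray(values_a, dtype=float)
    b = np.asarray(values_b, dtype=float)
    pooled_var = (
        (a.size - 1) * a.var(ddof=1) + (b.size - 1) * b.var(ddof=1)
    ) / (a.size + b.size - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)
```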
compute_statistics(values, confidence_level=0.95)
Compute comprehensive statistics for a set of values.
Args:
- values: Array of values (e.g., final performance across seeds)
- confidence_level: Confidence level for CI (default 0.95)
Returns: StatisticalSummary with all statistics
Source code in src/alberta_framework/utils/statistics.py
compute_timeseries_statistics(metric_array, confidence_level=0.95)
Compute mean and confidence intervals for timeseries data.
Args:
- metric_array: Array of shape (n_seeds, n_steps)
- confidence_level: Confidence level for CI
Returns: Tuple of (mean, ci_lower, ci_upper) arrays of shape (n_steps,)
Source code in src/alberta_framework/utils/statistics.py
holm_correction(p_values, alpha=0.05)
Apply Holm-Bonferroni step-down correction.
More powerful than Bonferroni while still controlling FWER.
Args:
- p_values: List of p-values from multiple tests
- alpha: Family-wise significance level
Returns: List of significant booleans
Source code in src/alberta_framework/utils/statistics.py
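A self-contained sketch of the Holm step-down procedure (illustrative; the library's implementation may differ): sort the p-values, compare the i-th smallest to alpha/(m - i), and stop at the first failure.

```python
import numpy as np

def holm_sketch(p_values, alpha=0.05):
    """Holm-Bonferroni: reject while the i-th smallest p-value <= alpha / (m - i)."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    significant = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(np.argsort(p)):
        if p[idx] <= alpha / (m - rank):
            significant[idx] = True
        else:
            break  # all remaining (larger) p-values are non-significant
    return significant.tolist()
```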
mann_whitney_comparison(values_a, values_b, alpha=0.05, method_a='A', method_b='B')
Perform Mann-Whitney U test (non-parametric).
Args:
- values_a: Values for first method
- values_b: Values for second method
- alpha: Significance level
- method_a: Name of first method
- method_b: Name of second method
Returns: SignificanceResult with test results
Source code in src/alberta_framework/utils/statistics.py
pairwise_comparisons(results, metric='squared_error', test='ttest', correction='bonferroni', alpha=0.05, window=100)
Perform all pairwise comparisons between methods.
Args:
- results: Dictionary mapping config name to AggregatedResults
- metric: Metric to compare
- test: Test to use ("ttest", "mann_whitney", or "wilcoxon")
- correction: Multiple comparison correction ("bonferroni" or "holm")
- alpha: Significance level
- window: Number of final steps to average
Returns: Dictionary mapping (method_a, method_b) to SignificanceResult
Source code in src/alberta_framework/utils/statistics.py
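A usage sketch, assuming results maps config names to AggregatedResults (e.g. from run_multi_seed_experiment):

```python
from alberta_framework.utils.statistics import pairwise_comparisons

sig = pairwise_comparisons(
    results,
    metric="squared_error",
    test="ttest",
    correction="holm",
    alpha=0.05,
    window=100,
)
for (a, b), res in sig.items():
    flag = "*" if res.significant else ""
    print(f"{a} vs {b}: p={res.p_value:.4g}, d={res.effect_size:.2f} {flag}")
```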
ttest_comparison(values_a, values_b, paired=True, alpha=0.05, method_a='A', method_b='B')
Perform t-test comparison between two methods.
Args:
- values_a: Values for first method
- values_b: Values for second method
- paired: Whether to use paired t-test (default True for same seeds)
- alpha: Significance level
- method_a: Name of first method
- method_b: Name of second method
Returns: SignificanceResult with test results
Source code in src/alberta_framework/utils/statistics.py
wilcoxon_comparison(values_a, values_b, alpha=0.05, method_a='A', method_b='B')
Perform Wilcoxon signed-rank test (paired non-parametric).
Args:
- values_a: Values for first method
- values_b: Values for second method
- alpha: Significance level
- method_a: Name of first method
- method_b: Name of second method
Returns: SignificanceResult with test results
Source code in src/alberta_framework/utils/statistics.py
create_comparison_figure(results, significance_results=None, metric='squared_error', step_size_metric='mean_step_size')
Create a 2x2 multi-panel comparison figure.
Panels:
- Top-left: Learning curves
- Top-right: Final performance bars
- Bottom-left: Step-size evolution
- Bottom-right: Cumulative error
Args:
- results: Dictionary mapping config name to AggregatedResults
- significance_results: Optional pairwise significance test results
- metric: Error metric to use
- step_size_metric: Step-size metric to use
Returns: Figure with 4 subplots
Source code in src/alberta_framework/utils/visualization.py
plot_final_performance_bars(results, metric='squared_error', show_significance=True, significance_results=None, ax=None, colors=None, lower_is_better=True)
Plot final performance as bar chart with error bars.
Args:
- results: Dictionary mapping config name to AggregatedResults
- metric: Metric to plot
- show_significance: Whether to show significance markers
- significance_results: Pairwise significance test results
- ax: Existing axes to plot on (creates new figure if None)
- colors: Optional custom colors for each method
- lower_is_better: Whether lower values are better
Returns: Tuple of (figure, axes)
Source code in src/alberta_framework/utils/visualization.py
plot_hyperparameter_heatmap(results, param1_name, param1_values, param2_name, param2_values, metric='squared_error', name_pattern='{p1}_{p2}', ax=None, cmap='viridis_r', lower_is_better=True)
Plot hyperparameter sensitivity heatmap.
Args:
- results: Dictionary mapping config name to AggregatedResults
- param1_name: Name of first parameter (y-axis)
- param1_values: Values of first parameter
- param2_name: Name of second parameter (x-axis)
- param2_values: Values of second parameter
- metric: Metric to plot
- name_pattern: Pattern to generate config names (use {p1}, {p2})
- ax: Existing axes to plot on
- cmap: Colormap to use
- lower_is_better: Whether lower values are better
Returns: Tuple of (figure, axes)
Source code in src/alberta_framework/utils/visualization.py
plot_learning_curves(results, metric='squared_error', show_ci=True, log_scale=True, window_size=100, ax=None, colors=None, labels=None)
Plot learning curves with confidence intervals.
Args:
- results: Dictionary mapping config name to AggregatedResults
- metric: Metric to plot
- show_ci: Whether to show confidence intervals
- log_scale: Whether to use log scale for y-axis
- window_size: Window size for running mean smoothing
- ax: Existing axes to plot on (creates new figure if None)
- colors: Optional custom colors for each method
- labels: Optional custom labels for legend
Returns: Tuple of (figure, axes)
Source code in src/alberta_framework/utils/visualization.py
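A plotting sketch, assuming the plotting helpers are importable from alberta_framework.utils.visualization as the source path suggests and that results comes from run_multi_seed_experiment:

```python
import matplotlib.pyplot as plt
from alberta_framework.utils.visualization import plot_learning_curves

fig, ax = plot_learning_curves(
    results,
    metric="squared_error",
    show_ci=True,
    log_scale=True,
    window_size=100,
)
ax.set_title("Tracking error (mean and CI across seeds)")
plt.show()
```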
plot_step_size_evolution(results, metric='mean_step_size', show_ci=True, ax=None, colors=None)
Plot step-size evolution over time.
Args:
- results: Dictionary mapping config name to AggregatedResults
- metric: Step-size metric to plot
- show_ci: Whether to show confidence intervals
- ax: Existing axes to plot on
- colors: Optional custom colors
Returns: Tuple of (figure, axes)
Source code in src/alberta_framework/utils/visualization.py
save_figure(fig, filename, formats=None, dpi=300, transparent=False)
Save figure to multiple formats.
Args:
- fig: Matplotlib figure to save
- filename: Base filename (without extension)
- formats: List of formats to save (default: ["pdf", "png"])
- dpi: Resolution for raster formats
- transparent: Whether to use transparent background
Returns: List of saved file paths
Source code in src/alberta_framework/utils/visualization.py
set_publication_style(font_size=10, use_latex=False, figure_width=3.5, figure_height=None, style='seaborn-v0_8-whitegrid')
Set matplotlib style for publication-quality figures.
Args:
- font_size: Base font size
- use_latex: Whether to use LaTeX for text rendering
- figure_width: Default figure width in inches
- figure_height: Default figure height (auto if None)
- style: Matplotlib style to use
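Putting the styling, figure, and saving helpers together, again as a hedged sketch (results and sig as in the earlier examples):

```python
from alberta_framework.utils.visualization import (
    create_comparison_figure,
    save_figure,
    set_publication_style,
)

set_publication_style(font_size=10, figure_width=3.5)
fig = create_comparison_figure(results, significance_results=sig,
                               metric="squared_error")
paths = save_figure(fig, "figures/comparison", formats=["pdf", "png"], dpi=300)
print(paths)  # list of saved file paths
```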