utils

Utility functions for the Alberta Framework.

AggregatedResults

Bases: NamedTuple

Aggregated results across multiple seeds.

Attributes:
    config_name: Name of the configuration
    seeds: List of seeds used
    metric_arrays: Dict mapping metric name to (n_seeds, n_steps) array
    summary: Dict mapping metric name to MetricSummary (final values)

ExperimentConfig

Bases: NamedTuple

Configuration for a single experiment.

Attributes:
    name: Human-readable name for this configuration
    learner_factory: Callable that returns a fresh learner instance
    stream_factory: Callable that returns a fresh stream instance
    num_steps: Number of learning steps to run

MetricSummary

Bases: NamedTuple

Summary statistics for a single metric.

Attributes:
    mean: Mean across seeds
    std: Standard deviation across seeds
    min: Minimum value across seeds
    max: Maximum value across seeds
    n_seeds: Number of seeds
    values: Raw values per seed

SingleRunResult

Bases: NamedTuple

Result from a single experiment run.

Attributes:
    config_name: Name of the configuration that was run
    seed: Random seed used for this run
    metrics_history: List of metric dictionaries from each step
    final_state: Final learner state after training

SignificanceResult

Bases: NamedTuple

Result of a statistical significance test.

Attributes:
    test_name: Name of the test performed
    statistic: Test statistic value
    p_value: P-value of the test
    significant: Whether the result is significant at the given alpha
    alpha: Significance level used
    effect_size: Effect size (e.g., Cohen's d)
    method_a: Name of first method
    method_b: Name of second method

StatisticalSummary

Bases: NamedTuple

Summary statistics for a set of values.

Attributes:
    mean: Arithmetic mean
    std: Standard deviation
    sem: Standard error of the mean
    ci_lower: Lower bound of confidence interval
    ci_upper: Upper bound of confidence interval
    median: Median value
    iqr: Interquartile range
    n_seeds: Number of samples

aggregate_metrics(results)

Aggregate results from multiple seeds into summary statistics.

Args: results: List of SingleRunResult from multiple seeds

Returns: AggregatedResults with aggregated metrics

Source code in src/alberta_framework/utils/experiments.py
def aggregate_metrics(results: list[SingleRunResult]) -> AggregatedResults:
    """Aggregate results from multiple seeds into summary statistics.

    Args:
        results: List of SingleRunResult from multiple seeds

    Returns:
        AggregatedResults with aggregated metrics
    """
    if not results:
        raise ValueError("Cannot aggregate empty results list")

    config_name = results[0].config_name
    seeds = [r.seed for r in results]

    # Get all metric keys from first result
    metric_keys = list(results[0].metrics_history[0].keys())

    # Build metric arrays: (n_seeds, n_steps)
    metric_arrays: dict[str, NDArray[np.float64]] = {}
    for key in metric_keys:
        arrays = []
        for r in results:
            values = np.array([m[key] for m in r.metrics_history])
            arrays.append(values)
        metric_arrays[key] = np.stack(arrays)

    # Compute summary statistics for final values (mean of last 100 steps)
    summary: dict[str, MetricSummary] = {}
    n_seeds = len(results)
    for key in metric_keys:
        # Use mean of last 100 steps as the final value
        window = min(100, metric_arrays[key].shape[1])
        final_values = np.mean(metric_arrays[key][:, -window:], axis=1)
        summary[key] = MetricSummary(
            mean=float(np.mean(final_values)),
            std=float(np.std(final_values)),
            min=float(np.min(final_values)),
            max=float(np.max(final_values)),
            n_seeds=n_seeds,
            values=final_values,
        )

    return AggregatedResults(
        config_name=config_name,
        seeds=seeds,
        metric_arrays=metric_arrays,
        summary=summary,
    )

get_final_performance(results, metric='squared_error', window=100)

Get final performance (mean, std) for each config.

Args:
    results: Dictionary of aggregated results
    metric: Metric to evaluate
    window: Number of final steps to average

Returns: Dictionary mapping config name to (mean, std) tuple

Source code in src/alberta_framework/utils/experiments.py
def get_final_performance(
    results: dict[str, AggregatedResults],
    metric: str = "squared_error",
    window: int = 100,
) -> dict[str, tuple[float, float]]:
    """Get final performance (mean, std) for each config.

    Args:
        results: Dictionary of aggregated results
        metric: Metric to evaluate
        window: Number of final steps to average

    Returns:
        Dictionary mapping config name to (mean, std) tuple
    """
    performance: dict[str, tuple[float, float]] = {}
    for name, agg in results.items():
        arr = agg.metric_arrays[metric]
        final_window = min(window, arr.shape[1])
        final_means = np.mean(arr[:, -final_window:], axis=1)
        performance[name] = (float(np.mean(final_means)), float(np.std(final_means)))
    return performance
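
Example (a minimal sketch, assuming results is the dictionary returned by run_multi_seed_experiment):

performance = get_final_performance(results, metric="squared_error", window=100)
for name, (mean, std) in performance.items():
    print(f"{name}: {mean:.4f} ± {std:.4f}")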

get_metric_timeseries(results, metric='squared_error')

Get mean and standard deviation timeseries for a metric.

Args:
    results: Aggregated results
    metric: Name of the metric

Returns: Tuple of (mean, lower_bound, upper_bound) arrays

Source code in src/alberta_framework/utils/experiments.py
def get_metric_timeseries(
    results: AggregatedResults,
    metric: str = "squared_error",
) -> tuple[NDArray[np.float64], NDArray[np.float64], NDArray[np.float64]]:
    """Get mean and standard deviation timeseries for a metric.

    Args:
        results: Aggregated results
        metric: Name of the metric

    Returns:
        Tuple of (mean, lower_bound, upper_bound) arrays
    """
    arr = results.metric_arrays[metric]
    mean = np.mean(arr, axis=0)
    std = np.std(arr, axis=0)
    return mean, mean - std, mean + std
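
Example (a minimal sketch): plot the per-step mean with a ±1 std band for one configuration, assuming matplotlib is installed and results comes from run_multi_seed_experiment.

import matplotlib.pyplot as plt
import numpy as np

mean, lower, upper = get_metric_timeseries(results["baseline"], metric="squared_error")
steps = np.arange(len(mean))
plt.plot(steps, mean, label="baseline")
plt.fill_between(steps, lower, upper, alpha=0.2)  # ±1 std band
plt.xlabel("Time Step")
plt.ylabel("Squared Error")
plt.legend()
plt.show()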

run_multi_seed_experiment(configs, seeds=30, parallel=True, n_jobs=-1, show_progress=True)

Run experiments across multiple seeds with optional parallelization.

Args:
    configs: List of experiment configurations to run
    seeds: Number of seeds (generates 0..n-1) or explicit list of seeds
    parallel: Whether to use parallel execution (requires joblib)
    n_jobs: Number of parallel jobs (-1 for all CPUs)
    show_progress: Whether to show progress bar (requires tqdm)

Returns: Dictionary mapping config name to AggregatedResults

Source code in src/alberta_framework/utils/experiments.py
def run_multi_seed_experiment(
    configs: Sequence[ExperimentConfig],
    seeds: int | Sequence[int] = 30,
    parallel: bool = True,
    n_jobs: int = -1,
    show_progress: bool = True,
) -> dict[str, AggregatedResults]:
    """Run experiments across multiple seeds with optional parallelization.

    Args:
        configs: List of experiment configurations to run
        seeds: Number of seeds (generates 0..n-1) or explicit list of seeds
        parallel: Whether to use parallel execution (requires joblib)
        n_jobs: Number of parallel jobs (-1 for all CPUs)
        show_progress: Whether to show progress bar (requires tqdm)

    Returns:
        Dictionary mapping config name to AggregatedResults
    """
    # Convert seeds to list
    if isinstance(seeds, int):
        seed_list = list(range(seeds))
    else:
        seed_list = list(seeds)

    # Build list of (config, seed) pairs
    tasks: list[tuple[ExperimentConfig, int]] = []
    for config in configs:
        for seed in seed_list:
            tasks.append((config, seed))

    # Run experiments
    if parallel:
        try:
            from joblib import Parallel, delayed

            if show_progress:
                try:
                    from tqdm import tqdm

                    results_list: list[SingleRunResult] = Parallel(n_jobs=n_jobs)(
                        delayed(run_single_experiment)(config, seed)
                        for config, seed in tqdm(tasks, desc="Running experiments")
                    )
                except ImportError:
                    results_list = Parallel(n_jobs=n_jobs)(
                        delayed(run_single_experiment)(config, seed) for config, seed in tasks
                    )
            else:
                results_list = Parallel(n_jobs=n_jobs)(
                    delayed(run_single_experiment)(config, seed) for config, seed in tasks
                )
        except ImportError:
            # Fallback to sequential if joblib not available
            results_list = _run_sequential(tasks, show_progress)
    else:
        results_list = _run_sequential(tasks, show_progress)

    # Group results by config name
    grouped: dict[str, list[SingleRunResult]] = {}
    for result in results_list:
        if result.config_name not in grouped:
            grouped[result.config_name] = []
        grouped[result.config_name].append(result)

    # Aggregate each config
    aggregated: dict[str, AggregatedResults] = {}
    for config_name, group_results in grouped.items():
        aggregated[config_name] = aggregate_metrics(group_results)

    return aggregated
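
Example (a minimal sketch): make_learner and make_stream are placeholder factory callables standing in for real learner/stream constructors; the import path is inferred from the source file locations shown on this page.

from alberta_framework.utils.experiments import ExperimentConfig, run_multi_seed_experiment

configs = [
    ExperimentConfig(
        name="baseline",
        learner_factory=make_learner,   # placeholder: returns a fresh learner
        stream_factory=make_stream,     # placeholder: returns a fresh stream
        num_steps=10_000,
    ),
]

# 30 seeds (0..29); runs in parallel via joblib when installed.
results = run_multi_seed_experiment(configs, seeds=30, parallel=True, n_jobs=-1)
agg = results["baseline"]  # AggregatedResults for this configuration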

run_single_experiment(config, seed)

Run a single experiment with a given seed.

Args:
    config: Experiment configuration
    seed: Random seed for the stream

Returns: SingleRunResult with metrics and final state

Source code in src/alberta_framework/utils/experiments.py
def run_single_experiment(
    config: ExperimentConfig,
    seed: int,
) -> SingleRunResult:
    """Run a single experiment with a given seed.

    Args:
        config: Experiment configuration
        seed: Random seed for the stream

    Returns:
        SingleRunResult with metrics and final state
    """
    learner = config.learner_factory()
    stream = config.stream_factory()
    key = jr.key(seed)

    final_state: LearnerState | NormalizedLearnerState
    if isinstance(learner, NormalizedLinearLearner):
        norm_result = run_normalized_learning_loop(learner, stream, config.num_steps, key)
        final_state, metrics = cast(tuple[NormalizedLearnerState, Any], norm_result)
        metrics_history = metrics_to_dicts(metrics, normalized=True)
    else:
        linear_result = run_learning_loop(learner, stream, config.num_steps, key)
        final_state, metrics = cast(tuple[LearnerState, Any], linear_result)
        metrics_history = metrics_to_dicts(metrics)

    return SingleRunResult(
        config_name=config.name,
        seed=seed,
        metrics_history=metrics_history,
        final_state=final_state,
    )

export_to_csv(results, filepath, metric='squared_error', include_timeseries=False)

Export results to CSV file.

Args:
    results: Dictionary mapping config name to AggregatedResults
    filepath: Path to output CSV file
    metric: Metric to export
    include_timeseries: Whether to include full timeseries (large!)

Source code in src/alberta_framework/utils/export.py
def export_to_csv(
    results: dict[str, "AggregatedResults"],
    filepath: str | Path,
    metric: str = "squared_error",
    include_timeseries: bool = False,
) -> None:
    """Export results to CSV file.

    Args:
        results: Dictionary mapping config name to AggregatedResults
        filepath: Path to output CSV file
        metric: Metric to export
        include_timeseries: Whether to include full timeseries (large!)
    """
    filepath = Path(filepath)
    filepath.parent.mkdir(parents=True, exist_ok=True)

    if include_timeseries:
        _export_timeseries_csv(results, filepath, metric)
    else:
        _export_summary_csv(results, filepath, metric)

export_to_json(results, filepath, include_timeseries=False)

Export results to JSON file.

Args:
    results: Dictionary mapping config name to AggregatedResults
    filepath: Path to output JSON file
    include_timeseries: Whether to include full timeseries (large!)

Source code in src/alberta_framework/utils/export.py
def export_to_json(
    results: dict[str, "AggregatedResults"],
    filepath: str | Path,
    include_timeseries: bool = False,
) -> None:
    """Export results to JSON file.

    Args:
        results: Dictionary mapping config name to AggregatedResults
        filepath: Path to output JSON file
        include_timeseries: Whether to include full timeseries (large!)
    """
    filepath = Path(filepath)
    filepath.parent.mkdir(parents=True, exist_ok=True)

    from typing import Any

    data: dict[str, Any] = {}
    for name, agg in results.items():
        summary_data: dict[str, dict[str, Any]] = {}
        for metric_name, summary in agg.summary.items():
            summary_data[metric_name] = {
                "mean": summary.mean,
                "std": summary.std,
                "min": summary.min,
                "max": summary.max,
                "n_seeds": summary.n_seeds,
                "values": summary.values.tolist(),
            }

        config_data: dict[str, Any] = {
            "seeds": agg.seeds,
            "summary": summary_data,
        }

        if include_timeseries:
            config_data["timeseries"] = {
                metric: arr.tolist() for metric, arr in agg.metric_arrays.items()
            }

        data[name] = config_data

    with open(filepath, "w") as f:
        json.dump(data, f, indent=2)
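
Example (a minimal sketch, assuming results comes from run_multi_seed_experiment; parent directories are created automatically and the paths are illustrative):

export_to_json(results, "reports/summary.json")                        # summary statistics only
export_to_json(results, "reports/full.json", include_timeseries=True)  # full (n_seeds, n_steps) arrays; large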

generate_latex_table(results, significance_results=None, metric='squared_error', caption='Experimental Results', label='tab:results', metric_label='Error', lower_is_better=True)

Generate a LaTeX table of results.

Args:
    results: Dictionary mapping config name to AggregatedResults
    significance_results: Optional pairwise significance test results
    metric: Metric to display
    caption: Table caption
    label: LaTeX label for the table
    metric_label: Human-readable name for the metric
    lower_is_better: Whether lower metric values are better

Returns: LaTeX table as a string

Source code in src/alberta_framework/utils/export.py
def generate_latex_table(
    results: dict[str, "AggregatedResults"],
    significance_results: dict[tuple[str, str], "SignificanceResult"] | None = None,
    metric: str = "squared_error",
    caption: str = "Experimental Results",
    label: str = "tab:results",
    metric_label: str = "Error",
    lower_is_better: bool = True,
) -> str:
    """Generate a LaTeX table of results.

    Args:
        results: Dictionary mapping config name to AggregatedResults
        significance_results: Optional pairwise significance test results
        metric: Metric to display
        caption: Table caption
        label: LaTeX label for the table
        metric_label: Human-readable name for the metric
        lower_is_better: Whether lower metric values are better

    Returns:
        LaTeX table as a string
    """
    lines = []
    lines.append(r"\begin{table}[ht]")
    lines.append(r"\centering")
    lines.append(r"\caption{" + caption + "}")
    lines.append(r"\label{" + label + "}")
    lines.append(r"\begin{tabular}{lcc}")
    lines.append(r"\toprule")
    lines.append(r"Method & " + metric_label + r" & Seeds \\")
    lines.append(r"\midrule")

    # Find best result
    summaries = {name: agg.summary[metric] for name, agg in results.items()}
    if lower_is_better:
        best_name = min(summaries.keys(), key=lambda k: summaries[k].mean)
    else:
        best_name = max(summaries.keys(), key=lambda k: summaries[k].mean)

    for name, agg in results.items():
        summary = agg.summary[metric]
        mean_str = f"{summary.mean:.4f}"
        std_str = f"{summary.std:.4f}"

        # Bold if best
        if name == best_name:
            value_str = rf"\textbf{{{mean_str}}} $\pm$ {std_str}"
        else:
            value_str = rf"{mean_str} $\pm$ {std_str}"

        # Add significance marker if provided
        if significance_results:
            sig_marker = _get_significance_marker(name, best_name, significance_results)
            value_str += sig_marker

        # Escape underscores in method name
        escaped_name = name.replace("_", r"\_")
        lines.append(rf"{escaped_name} & {value_str} & {summary.n_seeds} \\")

    lines.append(r"\bottomrule")
    lines.append(r"\end{tabular}")

    if significance_results:
        lines.append(r"\vspace{0.5em}")
        lines.append(r"\footnotesize{$^*$ $p < 0.05$, $^{**}$ $p < 0.01$, $^{***}$ $p < 0.001$}")

    lines.append(r"\end{table}")

    return "\n".join(lines)

generate_markdown_table(results, significance_results=None, metric='squared_error', metric_label='Error', lower_is_better=True)

Generate a markdown table of results.

Args:
    results: Dictionary mapping config name to AggregatedResults
    significance_results: Optional pairwise significance test results
    metric: Metric to display
    metric_label: Human-readable name for the metric
    lower_is_better: Whether lower metric values are better

Returns: Markdown table as a string

Source code in src/alberta_framework/utils/export.py
def generate_markdown_table(
    results: dict[str, "AggregatedResults"],
    significance_results: dict[tuple[str, str], "SignificanceResult"] | None = None,
    metric: str = "squared_error",
    metric_label: str = "Error",
    lower_is_better: bool = True,
) -> str:
    """Generate a markdown table of results.

    Args:
        results: Dictionary mapping config name to AggregatedResults
        significance_results: Optional pairwise significance test results
        metric: Metric to display
        metric_label: Human-readable name for the metric
        lower_is_better: Whether lower metric values are better

    Returns:
        Markdown table as a string
    """
    lines = []
    lines.append(f"| Method | {metric_label} (mean ± std) | Seeds |")
    lines.append("|--------|-------------------------|-------|")

    # Find best result
    summaries = {name: agg.summary[metric] for name, agg in results.items()}
    if lower_is_better:
        best_name = min(summaries.keys(), key=lambda k: summaries[k].mean)
    else:
        best_name = max(summaries.keys(), key=lambda k: summaries[k].mean)

    for name, agg in results.items():
        summary = agg.summary[metric]
        mean_str = f"{summary.mean:.4f}"
        std_str = f"{summary.std:.4f}"

        # Bold if best
        if name == best_name:
            value_str = f"**{mean_str}** ± {std_str}"
        else:
            value_str = f"{mean_str} ± {std_str}"

        # Add significance marker if provided
        if significance_results:
            sig_marker = _get_md_significance_marker(name, best_name, significance_results)
            value_str += sig_marker

        lines.append(f"| {name} | {value_str} | {summary.n_seeds} |")

    if significance_results:
        lines.append("")
        lines.append("\\* p < 0.05, \\*\\* p < 0.01, \\*\\*\\* p < 0.001")

    return "\n".join(lines)
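
Example (a minimal sketch, assuming results is available and comparisons was produced by pairwise_comparisons, documented below):

table = generate_markdown_table(
    results,
    significance_results=comparisons,  # optional; adds */**/*** markers
    metric="squared_error",
    metric_label="Squared Error",
    lower_is_better=True,
)
print(table)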

save_experiment_report(results, output_dir, experiment_name, significance_results=None, metric='squared_error')

Save a complete experiment report with all artifacts.

Args:
    results: Dictionary mapping config name to AggregatedResults
    output_dir: Directory to save artifacts
    experiment_name: Name for the experiment (used in filenames)
    significance_results: Optional pairwise significance test results
    metric: Primary metric to report

Returns: Dictionary mapping artifact type to file path

Source code in src/alberta_framework/utils/export.py
def save_experiment_report(
    results: dict[str, "AggregatedResults"],
    output_dir: str | Path,
    experiment_name: str,
    significance_results: dict[tuple[str, str], "SignificanceResult"] | None = None,
    metric: str = "squared_error",
) -> dict[str, Path]:
    """Save a complete experiment report with all artifacts.

    Args:
        results: Dictionary mapping config name to AggregatedResults
        output_dir: Directory to save artifacts
        experiment_name: Name for the experiment (used in filenames)
        significance_results: Optional pairwise significance test results
        metric: Primary metric to report

    Returns:
        Dictionary mapping artifact type to file path
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    artifacts: dict[str, Path] = {}

    # Export summary CSV
    csv_path = output_dir / f"{experiment_name}_summary.csv"
    export_to_csv(results, csv_path, metric=metric)
    artifacts["summary_csv"] = csv_path

    # Export JSON
    json_path = output_dir / f"{experiment_name}_results.json"
    export_to_json(results, json_path, include_timeseries=False)
    artifacts["json"] = json_path

    # Generate LaTeX table
    latex_path = output_dir / f"{experiment_name}_table.tex"
    latex_content = generate_latex_table(
        results,
        significance_results=significance_results,
        metric=metric,
        caption=f"{experiment_name} Results",
        label=f"tab:{experiment_name}",
    )
    with open(latex_path, "w") as f:
        f.write(latex_content)
    artifacts["latex_table"] = latex_path

    # Generate markdown table
    md_path = output_dir / f"{experiment_name}_table.md"
    md_content = generate_markdown_table(
        results,
        significance_results=significance_results,
        metric=metric,
    )
    with open(md_path, "w") as f:
        f.write(md_content)
    artifacts["markdown_table"] = md_path

    # If significance results provided, save those too
    if significance_results:
        sig_latex_path = output_dir / f"{experiment_name}_significance.tex"
        sig_latex = generate_significance_table(significance_results, format="latex")
        with open(sig_latex_path, "w") as f:
            f.write(sig_latex)
        artifacts["significance_latex"] = sig_latex_path

        sig_md_path = output_dir / f"{experiment_name}_significance.md"
        sig_md = generate_significance_table(significance_results, format="markdown")
        with open(sig_md_path, "w") as f:
            f.write(sig_md)
        artifacts["significance_md"] = sig_md_path

    return artifacts
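
Example (a minimal sketch): writes the CSV, JSON, LaTeX, and markdown artifacts for one experiment; the experiment name and output directory are illustrative.

artifacts = save_experiment_report(
    results,
    output_dir="reports/step_size_study",
    experiment_name="step_size_study",
    significance_results=comparisons,  # optional; adds significance tables
    metric="squared_error",
)
print(artifacts["markdown_table"])  # reports/step_size_study/step_size_study_table.md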

compare_learners(results, metric='squared_error')

Compare multiple learners on a given metric.

Args:
    results: Dictionary mapping learner name to metrics history
    metric: Metric to compare

Returns: Dictionary with summary statistics for each learner

Source code in src/alberta_framework/utils/metrics.py
def compare_learners(
    results: dict[str, list[dict[str, float]]],
    metric: str = "squared_error",
) -> dict[str, dict[str, float]]:
    """Compare multiple learners on a given metric.

    Args:
        results: Dictionary mapping learner name to metrics history
        metric: Metric to compare

    Returns:
        Dictionary with summary statistics for each learner
    """
    summary = {}
    for name, metrics_history in results.items():
        values = extract_metric(metrics_history, metric)
        summary[name] = {
            "mean": float(np.mean(values)),
            "std": float(np.std(values)),
            "cumulative": float(np.sum(values)),
            "final_100_mean": (
                float(np.mean(values[-100:])) if len(values) >= 100 else float(np.mean(values))
            ),
        }
    return summary

compute_cumulative_error(metrics_history, error_key='squared_error')

Compute cumulative error over time.

Args:
    metrics_history: List of metric dictionaries from learning loop
    error_key: Key to extract error values

Returns: Array of cumulative errors at each time step

Source code in src/alberta_framework/utils/metrics.py
def compute_cumulative_error(
    metrics_history: list[dict[str, float]],
    error_key: str = "squared_error",
) -> NDArray[np.float64]:
    """Compute cumulative error over time.

    Args:
        metrics_history: List of metric dictionaries from learning loop
        error_key: Key to extract error values

    Returns:
        Array of cumulative errors at each time step
    """
    errors = np.array([m[error_key] for m in metrics_history])
    return np.cumsum(errors)

compute_running_mean(values, window_size=100)

Compute running mean of values.

Args:
    values: Array of values
    window_size: Size of the moving average window

Returns: Array of running mean values (same length as input, padded at start)

Source code in src/alberta_framework/utils/metrics.py
def compute_running_mean(
    values: NDArray[np.float64] | list[float],
    window_size: int = 100,
) -> NDArray[np.float64]:
    """Compute running mean of values.

    Args:
        values: Array of values
        window_size: Size of the moving average window

    Returns:
        Array of running mean values (same length as input, padded at start)
    """
    values_arr = np.asarray(values)
    cumsum = np.cumsum(np.insert(values_arr, 0, 0))
    running_mean = (cumsum[window_size:] - cumsum[:-window_size]) / window_size

    # Pad the beginning with the first computed mean
    if len(running_mean) > 0:
        padding = np.full(window_size - 1, running_mean[0])
        return np.concatenate([padding, running_mean])
    return values_arr
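
Example (a minimal sketch): smooth a synthetic noisy error trace with a 100-step window; the output has the same length as the input because the start is padded with the first computed mean.

import numpy as np

errors = np.abs(np.random.default_rng(0).normal(size=5_000))
smoothed = compute_running_mean(errors, window_size=100)
assert smoothed.shape == errors.shape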

compute_tracking_error(metrics_history, window_size=100)

Compute tracking error (running mean of squared error).

This is the key metric for evaluating continual learners: how well can the learner track the non-stationary target?

Args:
    metrics_history: List of metric dictionaries from learning loop
    window_size: Size of the moving average window

Returns: Array of tracking errors at each time step

Source code in src/alberta_framework/utils/metrics.py
def compute_tracking_error(
    metrics_history: list[dict[str, float]],
    window_size: int = 100,
) -> NDArray[np.float64]:
    """Compute tracking error (running mean of squared error).

    This is the key metric for evaluating continual learners:
    how well can the learner track the non-stationary target?

    Args:
        metrics_history: List of metric dictionaries from learning loop
        window_size: Size of the moving average window

    Returns:
        Array of tracking errors at each time step
    """
    errors = np.array([m["squared_error"] for m in metrics_history])
    return compute_running_mean(errors, window_size)

extract_metric(metrics_history, key)

Extract a single metric from the history.

Args:
    metrics_history: List of metric dictionaries
    key: Key to extract

Returns: Array of values for that metric

Source code in src/alberta_framework/utils/metrics.py
def extract_metric(
    metrics_history: list[dict[str, float]],
    key: str,
) -> NDArray[np.float64]:
    """Extract a single metric from the history.

    Args:
        metrics_history: List of metric dictionaries
        key: Key to extract

    Returns:
        Array of values for that metric
    """
    return np.array([m[key] for m in metrics_history])

bonferroni_correction(p_values, alpha=0.05)

Apply Bonferroni correction for multiple comparisons.

Args:
    p_values: List of p-values from multiple tests
    alpha: Family-wise significance level

Returns: Tuple of (list of significant booleans, corrected alpha)

Source code in src/alberta_framework/utils/statistics.py
def bonferroni_correction(
    p_values: list[float],
    alpha: float = 0.05,
) -> tuple[list[bool], float]:
    """Apply Bonferroni correction for multiple comparisons.

    Args:
        p_values: List of p-values from multiple tests
        alpha: Family-wise significance level

    Returns:
        Tuple of (list of significant booleans, corrected alpha)
    """
    n_tests = len(p_values)
    corrected_alpha = alpha / n_tests
    significant = [p < corrected_alpha for p in p_values]
    return significant, corrected_alpha
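
Example (a minimal sketch): three p-values corrected at a family-wise alpha of 0.05, so the corrected threshold is 0.05 / 3 ≈ 0.0167.

significant, corrected_alpha = bonferroni_correction([0.001, 0.020, 0.048], alpha=0.05)
print(significant)      # [True, False, False]
print(corrected_alpha)  # 0.0166...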

bootstrap_ci(values, statistic='mean', confidence_level=0.95, n_bootstrap=10000, seed=42)

Compute bootstrap confidence interval.

Args:
    values: Array of values
    statistic: Statistic to bootstrap ("mean" or "median")
    confidence_level: Confidence level
    n_bootstrap: Number of bootstrap samples
    seed: Random seed

Returns: Tuple of (point_estimate, ci_lower, ci_upper)

Source code in src/alberta_framework/utils/statistics.py
def bootstrap_ci(
    values: NDArray[np.float64] | list[float],
    statistic: str = "mean",
    confidence_level: float = 0.95,
    n_bootstrap: int = 10000,
    seed: int = 42,
) -> tuple[float, float, float]:
    """Compute bootstrap confidence interval.

    Args:
        values: Array of values
        statistic: Statistic to bootstrap ("mean" or "median")
        confidence_level: Confidence level
        n_bootstrap: Number of bootstrap samples
        seed: Random seed

    Returns:
        Tuple of (point_estimate, ci_lower, ci_upper)
    """
    arr = np.asarray(values)
    rng = np.random.default_rng(seed)

    stat_func = np.mean if statistic == "mean" else np.median
    point_estimate = float(stat_func(arr))

    # Generate bootstrap samples
    bootstrap_stats_list: list[float] = []
    for _ in range(n_bootstrap):
        sample = rng.choice(arr, size=len(arr), replace=True)
        bootstrap_stats_list.append(float(stat_func(sample)))

    bootstrap_stats = np.array(bootstrap_stats_list)

    # Percentile method
    lower_percentile = (1 - confidence_level) / 2 * 100
    upper_percentile = (1 + confidence_level) / 2 * 100
    ci_lower = float(np.percentile(bootstrap_stats, lower_percentile))
    ci_upper = float(np.percentile(bootstrap_stats, upper_percentile))

    return point_estimate, ci_lower, ci_upper
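
Example (a minimal sketch): a 95% bootstrap confidence interval for the mean of per-seed final errors; the values are illustrative.

final_errors = [0.21, 0.19, 0.25, 0.22, 0.18, 0.24]
mean, lo, hi = bootstrap_ci(final_errors, statistic="mean", confidence_level=0.95)
print(f"{mean:.3f} [{lo:.3f}, {hi:.3f}]")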

cohens_d(values_a, values_b)

Compute Cohen's d effect size.

Args:
    values_a: Values for first group
    values_b: Values for second group

Returns: Cohen's d (positive means a > b)

Source code in src/alberta_framework/utils/statistics.py
def cohens_d(
    values_a: NDArray[np.float64] | list[float],
    values_b: NDArray[np.float64] | list[float],
) -> float:
    """Compute Cohen's d effect size.

    Args:
        values_a: Values for first group
        values_b: Values for second group

    Returns:
        Cohen's d (positive means a > b)
    """
    a = np.asarray(values_a)
    b = np.asarray(values_b)

    mean_a = np.mean(a)
    mean_b = np.mean(b)

    n_a = len(a)
    n_b = len(b)

    # Pooled standard deviation
    var_a = np.var(a, ddof=1) if n_a > 1 else 0.0
    var_b = np.var(b, ddof=1) if n_b > 1 else 0.0

    pooled_std = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))

    if pooled_std == 0:
        return 0.0

    return float((mean_a - mean_b) / pooled_std)
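
Example (a minimal sketch): effect size between two methods' per-seed errors; a positive d means the first group has the higher mean. The values are illustrative.

d = cohens_d([0.30, 0.28, 0.31, 0.29], [0.22, 0.24, 0.21, 0.23])
print(round(d, 2))  # large positive effect: the first group's error is clearly higher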

compute_statistics(values, confidence_level=0.95)

Compute comprehensive statistics for a set of values.

Args:
    values: Array of values (e.g., final performance across seeds)
    confidence_level: Confidence level for CI (default 0.95)

Returns: StatisticalSummary with all statistics

Source code in src/alberta_framework/utils/statistics.py
def compute_statistics(
    values: NDArray[np.float64] | list[float],
    confidence_level: float = 0.95,
) -> StatisticalSummary:
    """Compute comprehensive statistics for a set of values.

    Args:
        values: Array of values (e.g., final performance across seeds)
        confidence_level: Confidence level for CI (default 0.95)

    Returns:
        StatisticalSummary with all statistics
    """
    arr = np.asarray(values)
    n = len(arr)

    mean = float(np.mean(arr))
    std = float(np.std(arr, ddof=1)) if n > 1 else 0.0
    sem = std / np.sqrt(n) if n > 1 else 0.0
    median = float(np.median(arr))
    q75, q25 = np.percentile(arr, [75, 25])
    iqr = float(q75 - q25)

    # Compute confidence interval
    try:
        from scipy import stats

        if n > 1:
            t_value = float(stats.t.ppf((1 + confidence_level) / 2, n - 1))
            margin = t_value * sem
            ci_lower = mean - margin
            ci_upper = mean + margin
        else:
            ci_lower = ci_upper = mean
    except ImportError:
        # Fallback without scipy: use normal approximation
        z_value = 1.96 if confidence_level == 0.95 else 2.576  # 95% or 99%
        margin = z_value * sem
        ci_lower = mean - margin
        ci_upper = mean + margin

    return StatisticalSummary(
        mean=mean,
        std=std,
        sem=sem,
        ci_lower=float(ci_lower),
        ci_upper=float(ci_upper),
        median=median,
        iqr=iqr,
        n_seeds=n,
    )
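
Example (a minimal sketch): a full summary for a handful of per-seed values; the CI uses a t-distribution when scipy is available and a normal approximation otherwise.

summary = compute_statistics([0.21, 0.19, 0.25, 0.22, 0.18], confidence_level=0.95)
print(summary.mean, summary.std, summary.ci_lower, summary.ci_upper)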

compute_timeseries_statistics(metric_array, confidence_level=0.95)

Compute mean and confidence intervals for timeseries data.

Args:
    metric_array: Array of shape (n_seeds, n_steps)
    confidence_level: Confidence level for CI

Returns: Tuple of (mean, ci_lower, ci_upper) arrays of shape (n_steps,)

Source code in src/alberta_framework/utils/statistics.py
def compute_timeseries_statistics(
    metric_array: NDArray[np.float64],
    confidence_level: float = 0.95,
) -> tuple[NDArray[np.float64], NDArray[np.float64], NDArray[np.float64]]:
    """Compute mean and confidence intervals for timeseries data.

    Args:
        metric_array: Array of shape (n_seeds, n_steps)
        confidence_level: Confidence level for CI

    Returns:
        Tuple of (mean, ci_lower, ci_upper) arrays of shape (n_steps,)
    """
    n_seeds = metric_array.shape[0]
    mean = np.mean(metric_array, axis=0)
    std = np.std(metric_array, axis=0, ddof=1)
    sem = std / np.sqrt(n_seeds)

    try:
        from scipy import stats

        t_value = stats.t.ppf((1 + confidence_level) / 2, n_seeds - 1)
    except ImportError:
        t_value = 1.96 if confidence_level == 0.95 else 2.576

    margin = t_value * sem
    ci_lower = mean - margin
    ci_upper = mean + margin

    return mean, ci_lower, ci_upper

holm_correction(p_values, alpha=0.05)

Apply Holm-Bonferroni step-down correction.

More powerful than Bonferroni while still controlling FWER.

Args:
    p_values: List of p-values from multiple tests
    alpha: Family-wise significance level

Returns: List of significant booleans

Source code in src/alberta_framework/utils/statistics.py
def holm_correction(
    p_values: list[float],
    alpha: float = 0.05,
) -> list[bool]:
    """Apply Holm-Bonferroni step-down correction.

    More powerful than Bonferroni while still controlling FWER.

    Args:
        p_values: List of p-values from multiple tests
        alpha: Family-wise significance level

    Returns:
        List of significant booleans
    """
    n_tests = len(p_values)

    # Sort p-values and track original indices
    sorted_indices = np.argsort(p_values)
    sorted_p = [p_values[i] for i in sorted_indices]

    # Apply Holm correction
    significant_sorted = []
    for i, p in enumerate(sorted_p):
        corrected_alpha = alpha / (n_tests - i)
        if p < corrected_alpha:
            significant_sorted.append(True)
        else:
            # Once we fail to reject, all subsequent are not significant
            significant_sorted.extend([False] * (n_tests - i))
            break

    # Restore original order
    significant = [False] * n_tests
    for orig_idx, sig in zip(sorted_indices, significant_sorted, strict=False):
        significant[orig_idx] = sig

    return significant

mann_whitney_comparison(values_a, values_b, alpha=0.05, method_a='A', method_b='B')

Perform Mann-Whitney U test (non-parametric).

Args:
    values_a: Values for first method
    values_b: Values for second method
    alpha: Significance level
    method_a: Name of first method
    method_b: Name of second method

Returns: SignificanceResult with test results

Source code in src/alberta_framework/utils/statistics.py
def mann_whitney_comparison(
    values_a: NDArray[np.float64] | list[float],
    values_b: NDArray[np.float64] | list[float],
    alpha: float = 0.05,
    method_a: str = "A",
    method_b: str = "B",
) -> SignificanceResult:
    """Perform Mann-Whitney U test (non-parametric).

    Args:
        values_a: Values for first method
        values_b: Values for second method
        alpha: Significance level
        method_a: Name of first method
        method_b: Name of second method

    Returns:
        SignificanceResult with test results
    """
    a = np.asarray(values_a)
    b = np.asarray(values_b)

    try:
        from scipy import stats

        result = stats.mannwhitneyu(a, b, alternative="two-sided")
        # scipy returns (statistic, pvalue) tuple
        stat_val = float(result[0])
        p_val = float(result[1])
    except ImportError:
        raise ImportError(
            "scipy is required for Mann-Whitney test. Install with: pip install scipy"
        )

    # Compute rank-biserial correlation as effect size
    n_a, n_b = len(a), len(b)
    r = 1 - (2 * stat_val) / (n_a * n_b)

    return SignificanceResult(
        test_name="Mann-Whitney U",
        statistic=stat_val,
        p_value=p_val,
        significant=p_val < alpha,
        alpha=alpha,
        effect_size=r,
        method_a=method_a,
        method_b=method_b,
    )

pairwise_comparisons(results, metric='squared_error', test='ttest', correction='bonferroni', alpha=0.05, window=100)

Perform all pairwise comparisons between methods.

Args:
    results: Dictionary mapping config name to AggregatedResults
    metric: Metric to compare
    test: Test to use ("ttest", "mann_whitney", or "wilcoxon")
    correction: Multiple comparison correction ("bonferroni" or "holm")
    alpha: Significance level
    window: Number of final steps to average

Returns: Dictionary mapping (method_a, method_b) to SignificanceResult

Source code in src/alberta_framework/utils/statistics.py
def pairwise_comparisons(
    results: "dict[str, AggregatedResults]",  # noqa: F821
    metric: str = "squared_error",
    test: str = "ttest",
    correction: str = "bonferroni",
    alpha: float = 0.05,
    window: int = 100,
) -> dict[tuple[str, str], SignificanceResult]:
    """Perform all pairwise comparisons between methods.

    Args:
        results: Dictionary mapping config name to AggregatedResults
        metric: Metric to compare
        test: Test to use ("ttest", "mann_whitney", or "wilcoxon")
        correction: Multiple comparison correction ("bonferroni" or "holm")
        alpha: Significance level
        window: Number of final steps to average

    Returns:
        Dictionary mapping (method_a, method_b) to SignificanceResult
    """
    from alberta_framework.utils.experiments import AggregatedResults

    names = list(results.keys())
    n = len(names)

    if n < 2:
        return {}

    # Extract final values for each method
    final_values: dict[str, NDArray[np.float64]] = {}
    for name, agg in results.items():
        if not isinstance(agg, AggregatedResults):
            raise TypeError(f"Expected AggregatedResults, got {type(agg)}")
        arr = agg.metric_arrays[metric]
        final_window = min(window, arr.shape[1])
        final_values[name] = np.mean(arr[:, -final_window:], axis=1)

    if test not in ("ttest", "mann_whitney", "wilcoxon"):
        raise ValueError(f"Unknown test: {test}")

    # Perform all pairwise comparisons
    comparisons: dict[tuple[str, str], SignificanceResult] = {}
    p_values: list[float] = []

    for i in range(n):
        for j in range(i + 1, n):
            name_a, name_b = names[i], names[j]
            values_a = final_values[name_a]
            values_b = final_values[name_b]

            if test == "ttest":
                result = ttest_comparison(
                    values_a,
                    values_b,
                    paired=True,
                    alpha=alpha,
                    method_a=name_a,
                    method_b=name_b,
                )
            elif test == "mann_whitney":
                result = mann_whitney_comparison(
                    values_a,
                    values_b,
                    alpha=alpha,
                    method_a=name_a,
                    method_b=name_b,
                )
            else:  # wilcoxon
                result = wilcoxon_comparison(
                    values_a,
                    values_b,
                    alpha=alpha,
                    method_a=name_a,
                    method_b=name_b,
                )

            comparisons[(name_a, name_b)] = result
            p_values.append(result.p_value)

    # Apply multiple comparison correction
    if correction == "bonferroni":
        significant_list, _ = bonferroni_correction(p_values, alpha)
    elif correction == "holm":
        significant_list = holm_correction(p_values, alpha)
    else:
        raise ValueError(f"Unknown correction: {correction}")

    # Update significance based on correction
    corrected_comparisons: dict[tuple[str, str], SignificanceResult] = {}
    for (key, result), sig in zip(comparisons.items(), significant_list, strict=False):
        corrected_comparisons[key] = SignificanceResult(
            test_name=f"{result.test_name} ({correction})",
            statistic=result.statistic,
            p_value=result.p_value,
            significant=sig,
            alpha=alpha,
            effect_size=result.effect_size,
            method_a=result.method_a,
            method_b=result.method_b,
        )

    return corrected_comparisons
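
Example (a minimal sketch): paired t-tests on the mean of the last 100 steps with Holm correction, assuming results is the dictionary returned by run_multi_seed_experiment.

comparisons = pairwise_comparisons(
    results,
    metric="squared_error",
    test="ttest",
    correction="holm",
    alpha=0.05,
    window=100,
)
for (a, b), res in comparisons.items():
    print(f"{a} vs {b}: p={res.p_value:.4f}, d={res.effect_size:.2f}, significant={res.significant}")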

ttest_comparison(values_a, values_b, paired=True, alpha=0.05, method_a='A', method_b='B')

Perform t-test comparison between two methods.

Args:
    values_a: Values for first method
    values_b: Values for second method
    paired: Whether to use paired t-test (default True for same seeds)
    alpha: Significance level
    method_a: Name of first method
    method_b: Name of second method

Returns: SignificanceResult with test results

Source code in src/alberta_framework/utils/statistics.py
def ttest_comparison(
    values_a: NDArray[np.float64] | list[float],
    values_b: NDArray[np.float64] | list[float],
    paired: bool = True,
    alpha: float = 0.05,
    method_a: str = "A",
    method_b: str = "B",
) -> SignificanceResult:
    """Perform t-test comparison between two methods.

    Args:
        values_a: Values for first method
        values_b: Values for second method
        paired: Whether to use paired t-test (default True for same seeds)
        alpha: Significance level
        method_a: Name of first method
        method_b: Name of second method

    Returns:
        SignificanceResult with test results
    """
    a = np.asarray(values_a)
    b = np.asarray(values_b)

    try:
        from scipy import stats

        if paired:
            result = stats.ttest_rel(a, b)
            test_name = "paired t-test"
        else:
            result = stats.ttest_ind(a, b)
            test_name = "independent t-test"
        # scipy returns (statistic, pvalue) tuple
        stat_val = float(result[0])
        p_val = float(result[1])
    except ImportError:
        raise ImportError("scipy is required for t-test. Install with: pip install scipy")

    effect = cohens_d(a, b)

    return SignificanceResult(
        test_name=test_name,
        statistic=stat_val,
        p_value=p_val,
        significant=p_val < alpha,
        alpha=alpha,
        effect_size=effect,
        method_a=method_a,
        method_b=method_b,
    )

wilcoxon_comparison(values_a, values_b, alpha=0.05, method_a='A', method_b='B')

Perform Wilcoxon signed-rank test (paired non-parametric).

Args:
    values_a: Values for first method
    values_b: Values for second method
    alpha: Significance level
    method_a: Name of first method
    method_b: Name of second method

Returns: SignificanceResult with test results

Source code in src/alberta_framework/utils/statistics.py
def wilcoxon_comparison(
    values_a: NDArray[np.float64] | list[float],
    values_b: NDArray[np.float64] | list[float],
    alpha: float = 0.05,
    method_a: str = "A",
    method_b: str = "B",
) -> SignificanceResult:
    """Perform Wilcoxon signed-rank test (paired non-parametric).

    Args:
        values_a: Values for first method
        values_b: Values for second method
        alpha: Significance level
        method_a: Name of first method
        method_b: Name of second method

    Returns:
        SignificanceResult with test results
    """
    a = np.asarray(values_a)
    b = np.asarray(values_b)

    try:
        from scipy import stats

        result = stats.wilcoxon(a, b, alternative="two-sided")
        # scipy returns (statistic, pvalue) tuple
        stat_val = float(result[0])
        p_val = float(result[1])
    except ImportError:
        raise ImportError("scipy is required for Wilcoxon test. Install with: pip install scipy")

    effect = cohens_d(a, b)

    return SignificanceResult(
        test_name="Wilcoxon signed-rank",
        statistic=stat_val,
        p_value=p_val,
        significant=p_val < alpha,
        alpha=alpha,
        effect_size=effect,
        method_a=method_a,
        method_b=method_b,
    )

create_comparison_figure(results, significance_results=None, metric='squared_error', step_size_metric='mean_step_size')

Create a 2x2 multi-panel comparison figure.

Panels:
    Top-left: Learning curves
    Top-right: Final performance bars
    Bottom-left: Step-size evolution
    Bottom-right: Cumulative error

Args:
    results: Dictionary mapping config name to AggregatedResults
    significance_results: Optional pairwise significance test results
    metric: Error metric to use
    step_size_metric: Step-size metric to use

Returns: Figure with 4 subplots

Source code in src/alberta_framework/utils/visualization.py
def create_comparison_figure(
    results: dict[str, "AggregatedResults"],
    significance_results: dict[tuple[str, str], "SignificanceResult"] | None = None,
    metric: str = "squared_error",
    step_size_metric: str = "mean_step_size",
) -> "Figure":
    """Create a 2x2 multi-panel comparison figure.

    Panels:
    - Top-left: Learning curves
    - Top-right: Final performance bars
    - Bottom-left: Step-size evolution
    - Bottom-right: Cumulative error

    Args:
        results: Dictionary mapping config name to AggregatedResults
        significance_results: Optional pairwise significance test results
        metric: Error metric to use
        step_size_metric: Step-size metric to use

    Returns:
        Figure with 4 subplots
    """
    try:
        import matplotlib.pyplot as plt
    except ImportError:
        raise ImportError("matplotlib is required. Install with: pip install matplotlib")

    fig, axes = plt.subplots(2, 2, figsize=(7, 5.6))

    # Top-left: Learning curves
    plot_learning_curves(results, metric=metric, ax=axes[0, 0])
    axes[0, 0].set_title("Learning Curves")

    # Top-right: Final performance bars
    plot_final_performance_bars(
        results,
        metric=metric,
        significance_results=significance_results,
        ax=axes[0, 1],
    )
    axes[0, 1].set_title("Final Performance")

    # Bottom-left: Step-size evolution (if available)
    has_step_sizes = any(step_size_metric in agg.metric_arrays for agg in results.values())
    if has_step_sizes:
        plot_step_size_evolution(results, metric=step_size_metric, ax=axes[1, 0])
        axes[1, 0].set_title("Step-Size Evolution")
    else:
        axes[1, 0].text(
            0.5,
            0.5,
            "Step-size data\nnot available",
            ha="center",
            va="center",
            transform=axes[1, 0].transAxes,
        )
        axes[1, 0].set_title("Step-Size Evolution")

    # Bottom-right: Cumulative error
    _plot_cumulative_error(results, metric=metric, ax=axes[1, 1])
    axes[1, 1].set_title("Cumulative Error")

    fig.tight_layout()
    return fig

plot_final_performance_bars(results, metric='squared_error', show_significance=True, significance_results=None, ax=None, colors=None, lower_is_better=True)

Plot final performance as bar chart with error bars.

Args:
    results: Dictionary mapping config name to AggregatedResults
    metric: Metric to plot
    show_significance: Whether to show significance markers
    significance_results: Pairwise significance test results
    ax: Existing axes to plot on (creates new figure if None)
    colors: Optional custom colors for each method
    lower_is_better: Whether lower values are better

Returns: Tuple of (figure, axes)

Source code in src/alberta_framework/utils/visualization.py
def plot_final_performance_bars(
    results: dict[str, "AggregatedResults"],
    metric: str = "squared_error",
    show_significance: bool = True,
    significance_results: dict[tuple[str, str], "SignificanceResult"] | None = None,
    ax: "Axes | None" = None,
    colors: dict[str, str] | None = None,
    lower_is_better: bool = True,
) -> tuple["Figure", "Axes"]:
    """Plot final performance as bar chart with error bars.

    Args:
        results: Dictionary mapping config name to AggregatedResults
        metric: Metric to plot
        show_significance: Whether to show significance markers
        significance_results: Pairwise significance test results
        ax: Existing axes to plot on (creates new figure if None)
        colors: Optional custom colors for each method
        lower_is_better: Whether lower values are better

    Returns:
        Tuple of (figure, axes)
    """
    try:
        import matplotlib.pyplot as plt
    except ImportError:
        raise ImportError("matplotlib is required. Install with: pip install matplotlib")

    if ax is None:
        fig, ax = plt.subplots()
    else:
        fig = cast("Figure", ax.figure)
    names = list(results.keys())
    means = [results[name].summary[metric].mean for name in names]
    stds = [results[name].summary[metric].std for name in names]

    # Default colors
    default_colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]

    x = np.arange(len(names))
    bar_colors = [
        (colors or {}).get(name, default_colors[i % len(default_colors)])
        for i, name in enumerate(names)
    ]

    bars = ax.bar(
        x, means, yerr=stds, capsize=3, color=bar_colors, edgecolor="black", linewidth=0.5
    )

    # Find best and mark it
    if lower_is_better:
        best_idx = int(np.argmin(means))
    else:
        best_idx = int(np.argmax(means))

    bars[best_idx].set_edgecolor("gold")
    bars[best_idx].set_linewidth(2)

    ax.set_xticks(x)
    ax.set_xticklabels(names, rotation=45, ha="right")
    ax.set_ylabel(_metric_to_label(metric))

    # Add significance markers if provided
    if show_significance and significance_results:
        best_name = names[best_idx]
        y_max = max(m + s for m, s in zip(means, stds, strict=False))
        y_offset = y_max * 0.05

        for i, name in enumerate(names):
            if name == best_name:
                continue

            marker = _get_significance_marker_for_plot(name, best_name, significance_results)
            if marker:
                ax.annotate(
                    marker,
                    (i, means[i] + stds[i] + y_offset),
                    ha="center",
                    fontsize=_current_style["font_size"],
                )

    fig.tight_layout()
    return fig, ax

plot_hyperparameter_heatmap(results, param1_name, param1_values, param2_name, param2_values, metric='squared_error', name_pattern='{p1}_{p2}', ax=None, cmap='viridis_r', lower_is_better=True)

Plot hyperparameter sensitivity heatmap.

Args:
    results: Dictionary mapping config name to AggregatedResults
    param1_name: Name of first parameter (y-axis)
    param1_values: Values of first parameter
    param2_name: Name of second parameter (x-axis)
    param2_values: Values of second parameter
    metric: Metric to plot
    name_pattern: Pattern to generate config names (use {p1}, {p2})
    ax: Existing axes to plot on
    cmap: Colormap to use
    lower_is_better: Whether lower values are better

Returns: Tuple of (figure, axes)

Source code in src/alberta_framework/utils/visualization.py
def plot_hyperparameter_heatmap(
    results: dict[str, "AggregatedResults"],
    param1_name: str,
    param1_values: list[Any],
    param2_name: str,
    param2_values: list[Any],
    metric: str = "squared_error",
    name_pattern: str = "{p1}_{p2}",
    ax: "Axes | None" = None,
    cmap: str = "viridis_r",
    lower_is_better: bool = True,
) -> tuple["Figure", "Axes"]:
    """Plot hyperparameter sensitivity heatmap.

    Args:
        results: Dictionary mapping config name to AggregatedResults
        param1_name: Name of first parameter (y-axis)
        param1_values: Values of first parameter
        param2_name: Name of second parameter (x-axis)
        param2_values: Values of second parameter
        metric: Metric to plot
        name_pattern: Pattern to generate config names (use {p1}, {p2})
        ax: Existing axes to plot on
        cmap: Colormap to use
        lower_is_better: Whether lower values are better

    Returns:
        Tuple of (figure, axes)
    """
    try:
        import matplotlib.pyplot as plt
    except ImportError:
        raise ImportError("matplotlib is required. Install with: pip install matplotlib")

    if ax is None:
        fig, ax = plt.subplots()
    else:
        fig = cast("Figure", ax.figure)
    # Build heatmap data
    data = np.zeros((len(param1_values), len(param2_values)))
    for i, p1 in enumerate(param1_values):
        for j, p2 in enumerate(param2_values):
            name = name_pattern.format(p1=p1, p2=p2)
            if name in results:
                data[i, j] = results[name].summary[metric].mean
            else:
                data[i, j] = np.nan

    if lower_is_better:
        cmap_to_use = cmap
    else:
        cmap_to_use = cmap.replace("_r", "") if "_r" in cmap else f"{cmap}_r"

    im = ax.imshow(data, cmap=cmap_to_use, aspect="auto")
    ax.set_xticks(np.arange(len(param2_values)))
    ax.set_yticks(np.arange(len(param1_values)))
    ax.set_xticklabels([str(v) for v in param2_values])
    ax.set_yticklabels([str(v) for v in param1_values])
    ax.set_xlabel(param2_name)
    ax.set_ylabel(param1_name)

    # Add colorbar
    cbar = fig.colorbar(im, ax=ax)
    cbar.set_label(_metric_to_label(metric))

    # Add value annotations
    for i in range(len(param1_values)):
        for j in range(len(param2_values)):
            if not np.isnan(data[i, j]):
                text_color = "white" if data[i, j] > np.nanmean(data) else "black"
                ax.annotate(
                    f"{data[i, j]:.3f}",
                    (j, i),
                    ha="center",
                    va="center",
                    color=text_color,
                    fontsize=_current_style["font_size"] - 2,
                )

    fig.tight_layout()
    return fig, ax

plot_learning_curves(results, metric='squared_error', show_ci=True, log_scale=True, window_size=100, ax=None, colors=None, labels=None)

Plot learning curves with confidence intervals.

Args:
    results: Dictionary mapping config name to AggregatedResults
    metric: Metric to plot
    show_ci: Whether to show confidence intervals
    log_scale: Whether to use log scale for y-axis
    window_size: Window size for running mean smoothing
    ax: Existing axes to plot on (creates new figure if None)
    colors: Optional custom colors for each method
    labels: Optional custom labels for legend

Returns: Tuple of (figure, axes)

Source code in src/alberta_framework/utils/visualization.py
def plot_learning_curves(
    results: dict[str, "AggregatedResults"],
    metric: str = "squared_error",
    show_ci: bool = True,
    log_scale: bool = True,
    window_size: int = 100,
    ax: "Axes | None" = None,
    colors: dict[str, str] | None = None,
    labels: dict[str, str] | None = None,
) -> tuple["Figure", "Axes"]:
    """Plot learning curves with confidence intervals.

    Args:
        results: Dictionary mapping config name to AggregatedResults
        metric: Metric to plot
        show_ci: Whether to show confidence intervals
        log_scale: Whether to use log scale for y-axis
        window_size: Window size for running mean smoothing
        ax: Existing axes to plot on (creates new figure if None)
        colors: Optional custom colors for each method
        labels: Optional custom labels for legend

    Returns:
        Tuple of (figure, axes)
    """
    try:
        import matplotlib.pyplot as plt
    except ImportError:
        raise ImportError("matplotlib is required. Install with: pip install matplotlib")

    from alberta_framework.utils.metrics import compute_running_mean
    from alberta_framework.utils.statistics import compute_timeseries_statistics

    if ax is None:
        fig, ax = plt.subplots()
    else:
        fig = cast("Figure", ax.figure)
    # Default colors
    default_colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]

    for i, (name, agg) in enumerate(results.items()):
        color = (colors or {}).get(name, default_colors[i % len(default_colors)])
        label = (labels or {}).get(name, name)

        # Compute smoothed mean and CI
        metric_array = agg.metric_arrays[metric]

        # Smooth each seed individually, then compute statistics
        smoothed = np.array(
            [
                compute_running_mean(metric_array[seed_idx], window_size)
                for seed_idx in range(metric_array.shape[0])
            ]
        )

        mean, ci_lower, ci_upper = compute_timeseries_statistics(smoothed)

        steps = np.arange(len(mean))
        ax.plot(steps, mean, color=color, label=label, linewidth=_current_style["line_width"])

        if show_ci:
            ax.fill_between(steps, ci_lower, ci_upper, color=color, alpha=0.2)

    ax.set_xlabel("Time Step")
    ax.set_ylabel(_metric_to_label(metric))
    if log_scale:
        ax.set_yscale("log")
    ax.legend(loc="best", framealpha=0.9)
    ax.grid(True, alpha=0.3)

    fig.tight_layout()
    return fig, ax
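
Example (a minimal sketch): smoothed curves with confidence bands for each configuration, saved in both PDF and PNG via save_figure (documented below); the output path is illustrative.

fig, ax = plot_learning_curves(results, metric="squared_error", window_size=100, log_scale=True)
save_figure(fig, "figures/learning_curves")  # writes figures/learning_curves.pdf and .png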

plot_step_size_evolution(results, metric='mean_step_size', show_ci=True, ax=None, colors=None)

Plot step-size evolution over time.

Args:
    results: Dictionary mapping config name to AggregatedResults
    metric: Step-size metric to plot
    show_ci: Whether to show confidence intervals
    ax: Existing axes to plot on
    colors: Optional custom colors

Returns: Tuple of (figure, axes)

Source code in src/alberta_framework/utils/visualization.py
def plot_step_size_evolution(
    results: dict[str, "AggregatedResults"],
    metric: str = "mean_step_size",
    show_ci: bool = True,
    ax: "Axes | None" = None,
    colors: dict[str, str] | None = None,
) -> tuple["Figure", "Axes"]:
    """Plot step-size evolution over time.

    Args:
        results: Dictionary mapping config name to AggregatedResults
        metric: Step-size metric to plot
        show_ci: Whether to show confidence intervals
        ax: Existing axes to plot on
        colors: Optional custom colors

    Returns:
        Tuple of (figure, axes)
    """
    try:
        import matplotlib.pyplot as plt
    except ImportError:
        raise ImportError("matplotlib is required. Install with: pip install matplotlib")

    from alberta_framework.utils.statistics import compute_timeseries_statistics

    if ax is None:
        fig, ax = plt.subplots()
    else:
        fig = cast("Figure", ax.figure)
    default_colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]

    for i, (name, agg) in enumerate(results.items()):
        if metric not in agg.metric_arrays:
            continue

        color = (colors or {}).get(name, default_colors[i % len(default_colors)])
        metric_array = agg.metric_arrays[metric]

        mean, ci_lower, ci_upper = compute_timeseries_statistics(metric_array)
        steps = np.arange(len(mean))

        ax.plot(steps, mean, color=color, label=name, linewidth=_current_style["line_width"])
        if show_ci:
            ax.fill_between(steps, ci_lower, ci_upper, color=color, alpha=0.2)

    ax.set_xlabel("Time Step")
    ax.set_ylabel("Step Size")
    ax.set_yscale("log")
    ax.legend(loc="best", framealpha=0.9)
    ax.grid(True, alpha=0.3)

    fig.tight_layout()
    return fig, ax

save_figure(fig, filename, formats=None, dpi=300, transparent=False)

Save figure to multiple formats.

Args:
    fig: Matplotlib figure to save
    filename: Base filename (without extension)
    formats: List of formats to save (default: ["pdf", "png"])
    dpi: Resolution for raster formats
    transparent: Whether to use transparent background

Returns: List of saved file paths

Source code in src/alberta_framework/utils/visualization.py
def save_figure(
    fig: "Figure",
    filename: str | Path,
    formats: list[str] | None = None,
    dpi: int = 300,
    transparent: bool = False,
) -> list[Path]:
    """Save figure to multiple formats.

    Args:
        fig: Matplotlib figure to save
        filename: Base filename (without extension)
        formats: List of formats to save (default: ["pdf", "png"])
        dpi: Resolution for raster formats
        transparent: Whether to use transparent background

    Returns:
        List of saved file paths
    """
    if formats is None:
        formats = ["pdf", "png"]

    filename = Path(filename)
    filename.parent.mkdir(parents=True, exist_ok=True)

    saved_paths = []
    for fmt in formats:
        path = filename.with_suffix(f".{fmt}")
        fig.savefig(
            path,
            format=fmt,
            dpi=dpi,
            bbox_inches="tight",
            transparent=transparent,
        )
        saved_paths.append(path)

    return saved_paths

set_publication_style(font_size=10, use_latex=False, figure_width=3.5, figure_height=None, style='seaborn-v0_8-whitegrid')

Set matplotlib style for publication-quality figures.

Args:
    font_size: Base font size
    use_latex: Whether to use LaTeX for text rendering
    figure_width: Default figure width in inches
    figure_height: Default figure height (auto if None)
    style: Matplotlib style to use

Source code in src/alberta_framework/utils/visualization.py
def set_publication_style(
    font_size: int = 10,
    use_latex: bool = False,
    figure_width: float = 3.5,
    figure_height: float | None = None,
    style: str = "seaborn-v0_8-whitegrid",
) -> None:
    """Set matplotlib style for publication-quality figures.

    Args:
        font_size: Base font size
        use_latex: Whether to use LaTeX for text rendering
        figure_width: Default figure width in inches
        figure_height: Default figure height (auto if None)
        style: Matplotlib style to use
    """
    try:
        import matplotlib.pyplot as plt
    except ImportError:
        raise ImportError("matplotlib is required. Install with: pip install matplotlib")

    # Update current style
    _current_style["font_size"] = font_size
    _current_style["figure_width"] = figure_width
    _current_style["use_latex"] = use_latex
    if figure_height is not None:
        _current_style["figure_height"] = figure_height
    else:
        _current_style["figure_height"] = figure_width * 0.8

    # Try to use the requested style, fall back to default if not available
    try:
        plt.style.use(style)
    except OSError:
        # Style not available, use defaults
        pass

    # Configure matplotlib
    plt.rcParams.update(
        {
            "font.size": font_size,
            "axes.labelsize": font_size,
            "axes.titlesize": font_size + 1,
            "xtick.labelsize": font_size - 1,
            "ytick.labelsize": font_size - 1,
            "legend.fontsize": font_size - 1,
            "figure.figsize": (_current_style["figure_width"], _current_style["figure_height"]),
            "figure.dpi": _current_style["dpi"],
            "savefig.dpi": _current_style["dpi"],
            "lines.linewidth": _current_style["line_width"],
            "lines.markersize": _current_style["marker_size"],
            "axes.linewidth": 0.8,
            "grid.linewidth": 0.5,
            "grid.alpha": 0.3,
        }
    )

    if use_latex:
        plt.rcParams.update(
            {
                "text.usetex": True,
                "font.family": "serif",
                "font.serif": ["Computer Modern Roman"],
            }
        )
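
Example (a minimal sketch): apply a single-column (3.5 in wide) publication style before plotting; later figures pick up the configured font sizes and dimensions. The output path is illustrative.

set_publication_style(font_size=9, figure_width=3.5, use_latex=False)
fig, ax = plot_learning_curves(results, metric="squared_error")
save_figure(fig, "figures/styled_curves", formats=["pdf"])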