statistics
Statistical analysis utilities for publication-quality experiments.
Provides functions for computing confidence intervals, significance tests, effect sizes, and multiple comparison corrections.
StatisticalSummary
Bases: NamedTuple
Summary statistics for a set of values.
Attributes:
    mean: Arithmetic mean
    std: Standard deviation
    sem: Standard error of the mean
    ci_lower: Lower bound of confidence interval
    ci_upper: Upper bound of confidence interval
    median: Median value
    iqr: Interquartile range
    n_seeds: Number of samples
SignificanceResult
Bases: NamedTuple
Result of a statistical significance test.
Attributes:
    test_name: Name of the test performed
    statistic: Test statistic value
    p_value: P-value of the test
    significant: Whether the result is significant at the given alpha
    alpha: Significance level used
    effect_size: Effect size (e.g., Cohen's d)
    method_a: Name of first method
    method_b: Name of second method
compute_statistics(values, confidence_level=0.95)
Compute comprehensive statistics for a set of values.
Args:
    values: Array of values (e.g., final performance across seeds)
    confidence_level: Confidence level for CI (default 0.95)
Returns: StatisticalSummary with all statistics
Source code in src/alberta_framework/utils/statistics.py
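The quantities listed in StatisticalSummary can be sketched with the standard library alone. This is an illustration, not the library's implementation: the helper name `summarize` is hypothetical, the sample standard deviation (n - 1) is an assumption, and the interval uses a normal approximation where the library may well use a Student's t critical value.

```python
import math
import statistics
from statistics import NormalDist

def summarize(values, confidence_level=0.95):
    """Hypothetical helper; not the library's compute_statistics."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample std (n - 1), an assumption
    sem = std / math.sqrt(n)
    # Normal-approximation critical value; a t critical value is wider for small n.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": mean, "std": std, "sem": sem,
        "ci_lower": mean - z * sem, "ci_upper": mean + z * sem,
        "median": statistics.median(values), "iqr": q3 - q1, "n_seeds": n,
    }

summary = summarize([0.9, 1.1, 1.0, 1.2, 0.8])
```

With only a handful of seeds the normal approximation understates the interval width, which is one reason a t-based interval is the more common choice in practice.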
compute_timeseries_statistics(metric_array, confidence_level=0.95)
Compute mean and confidence intervals for timeseries data.
Args:
    metric_array: Array of shape (n_seeds, n_steps)
    confidence_level: Confidence level for CI
Returns: Tuple of (mean, ci_lower, ci_upper) arrays of shape (n_steps,)
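To make the shapes concrete, the sketch below reduces a (n_seeds, n_steps) table to three per-step sequences. It uses plain lists instead of arrays, the helper name `timeseries_ci` is hypothetical, and the normal-approximation interval is an assumption rather than the library's documented method.

```python
import math
import statistics
from statistics import NormalDist

def timeseries_ci(metric_rows, confidence_level=0.95):
    """Hypothetical helper: rows are seeds, columns are steps."""
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2)  # normal approximation
    n_seeds = len(metric_rows)
    means, lowers, uppers = [], [], []
    for step_values in zip(*metric_rows):  # iterate over columns (steps)
        m = statistics.fmean(step_values)
        sem = statistics.stdev(step_values) / math.sqrt(n_seeds)
        means.append(m)
        lowers.append(m - z * sem)
        uppers.append(m + z * sem)
    return means, lowers, uppers

# 3 seeds, 4 steps
runs = [[1.0, 0.8, 0.6, 0.5],
        [1.2, 0.9, 0.7, 0.4],
        [0.8, 0.7, 0.5, 0.6]]
mean, ci_lower, ci_upper = timeseries_ci(runs)
```

Each returned sequence has one entry per step, matching the (n_steps,) shape described above.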
cohens_d(values_a, values_b)
Compute Cohen's d effect size.
Args:
    values_a: Values for first group
    values_b: Values for second group
Returns: Cohen's d (positive when the mean of values_a exceeds the mean of values_b)
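One common definition of Cohen's d divides the difference in means by the pooled sample standard deviation. The sketch below follows that convention. Whether the library pools variances this way is not stated here, so treat the formula and the helper name `pooled_cohens_d` as assumptions.

```python
import math
import statistics

def pooled_cohens_d(values_a, values_b):
    """Hypothetical helper using the pooled-standard-deviation convention."""
    na, nb = len(values_a), len(values_b)
    var_a = statistics.variance(values_a)  # sample variance (n - 1)
    var_b = statistics.variance(values_b)
    pooled_std = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (statistics.fmean(values_a) - statistics.fmean(values_b)) / pooled_std

d = pooled_cohens_d([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])  # means differ by one pooled std
```

Swapping the two arguments flips the sign, which matches the sign convention in the Returns description.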
ttest_comparison(values_a, values_b, paired=True, alpha=0.05, method_a='A', method_b='B')
Perform t-test comparison between two methods.
Args:
    values_a: Values for first method
    values_b: Values for second method
    paired: Whether to use paired t-test (default True for same seeds)
    alpha: Significance level
    method_a: Name of first method
    method_b: Name of second method
Returns: SignificanceResult with test results
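The reason `paired` defaults to True for runs that share seeds can be seen by comparing test statistics on the same data. The sketch below computes only the statistics, not p-values, and both helper names are hypothetical. The unpaired variant shown is Welch's form, which is an assumption about what an unpaired comparison would use.

```python
import math
import statistics

def paired_t_statistic(values_a, values_b):
    """t statistic on per-seed differences (hypothetical helper)."""
    diffs = [a - b for a, b in zip(values_a, values_b)]
    sem = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.fmean(diffs) / sem

def welch_t_statistic(values_a, values_b):
    """Unpaired t statistic with unequal variances (hypothetical helper)."""
    se = math.sqrt(statistics.variance(values_a) / len(values_a)
                   + statistics.variance(values_b) / len(values_b))
    return (statistics.fmean(values_a) - statistics.fmean(values_b)) / se

# Seeds vary a lot, but method B is consistently a little higher on each seed.
a = [1.0, 2.0, 3.0, 4.0]
b = [1.1, 2.2, 3.1, 4.2]
t_paired = paired_t_statistic(a, b)
t_unpaired = welch_t_statistic(a, b)
```

On this toy data the paired statistic is much larger in magnitude than the unpaired one, because pairing removes the seed-to-seed variance that both methods share.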
mann_whitney_comparison(values_a, values_b, alpha=0.05, method_a='A', method_b='B')
Perform Mann-Whitney U test (non-parametric).
Args:
    values_a: Values for first method
    values_b: Values for second method
    alpha: Significance level
    method_a: Name of first method
    method_b: Name of second method
Returns: SignificanceResult with test results
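As a reminder of what the U statistic measures, the sketch below counts, over all cross-group pairs, how often a value from the first group exceeds one from the second (ties count one half). It does not compute a p-value, the helper name `mann_whitney_u` is hypothetical, and which of the two U values the library reports is not specified here.

```python
def mann_whitney_u(values_a, values_b):
    """Direct pair-counting form of the U statistic (hypothetical helper)."""
    u_a = 0.0
    for a in values_a:
        for b in values_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5  # ties contribute half a win to each side
    u_b = len(values_a) * len(values_b) - u_a  # the two U values sum to na * nb
    return u_a, u_b

u_a, u_b = mann_whitney_u([1.0, 2.0, 3.0], [2.5, 3.5, 4.5])
```

Because only the ordering of values matters, the statistic is unaffected by outliers that would distort a t-test.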
wilcoxon_comparison(values_a, values_b, alpha=0.05, method_a='A', method_b='B')
Perform Wilcoxon signed-rank test (paired non-parametric).
Args:
    values_a: Values for first method
    values_b: Values for second method
    alpha: Significance level
    method_a: Name of first method
    method_b: Name of second method
Returns: SignificanceResult with test results
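The signed-rank statistic behind this test can be sketched as follows: take per-pair differences, drop zeros, rank the absolute differences (averaging ranks on ties), and sum the ranks by sign. The helper name `signed_rank_sums` is hypothetical, zero-difference handling is an assumption, and no p-value is computed.

```python
def signed_rank_sums(values_a, values_b):
    """Return (W+, W-) for paired samples (hypothetical helper)."""
    diffs = [a - b for a, b in zip(values_a, values_b) if a != b]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus

w_plus, w_minus = signed_rank_sums([10, 12, 9, 15, 11], [8, 13, 9, 11, 8])
```

The two sums always add up to n(n + 1)/2 for the n non-zero differences, which is a quick sanity check on any implementation.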
bonferroni_correction(p_values, alpha=0.05)
Apply Bonferroni correction for multiple comparisons.
Args:
    p_values: List of p-values from multiple tests
    alpha: Family-wise significance level
Returns: Tuple of (list of significant booleans, corrected alpha)
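The Bonferroni rule itself is short: divide alpha by the number of tests and compare each p-value against that threshold. The sketch below mirrors the documented return shape. The helper name `bonferroni` is hypothetical, and the strict `<` comparison is an assumption (an implementation might use `<=`).

```python
def bonferroni(p_values, alpha=0.05):
    """Hypothetical helper returning (significant flags, corrected alpha)."""
    corrected_alpha = alpha / len(p_values)  # per-test threshold
    return [p < corrected_alpha for p in p_values], corrected_alpha

significant, corrected_alpha = bonferroni([0.001, 0.02, 0.04])
```

With three tests the per-test threshold drops to roughly 0.0167, so only the smallest p-value in this example remains significant.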
holm_correction(p_values, alpha=0.05)
Apply Holm-Bonferroni step-down correction.
More powerful than Bonferroni while still controlling the family-wise error rate (FWER).
Args:
    p_values: List of p-values from multiple tests
    alpha: Family-wise significance level
Returns: List of significant booleans
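The step-down procedure can be sketched as: sort the p-values, compare the k-th smallest (0-based) against alpha / (m - k), and stop at the first failure. The helper name `holm` is hypothetical and the strict `<` comparison is an assumption. Flags are returned in the original input order.

```python
def holm(p_values, alpha=0.05):
    """Hypothetical helper implementing the Holm step-down rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    significant = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] < alpha / (m - rank):
            significant[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return significant

all_pass = holm([0.04, 0.001, 0.02])
early_stop = holm([0.03, 0.001, 0.04])
```

In the first call all three tests pass, whereas a single Bonferroni threshold of 0.05 / 3 would accept only the smallest p-value. In the second call the procedure stops at 0.03, which exceeds 0.05 / 2.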
pairwise_comparisons(results, metric='squared_error', test='ttest', correction='bonferroni', alpha=0.05, window=100)
Perform all pairwise comparisons between methods.
Args:
    results: Dictionary mapping config name to AggregatedResults
    metric: Metric to compare
    test: Test to use ("ttest", "mann_whitney", or "wilcoxon")
    correction: Multiple comparison correction ("bonferroni" or "holm")
    alpha: Significance level
    window: Number of final steps to average
Returns: Dictionary mapping (method_a, method_b) to SignificanceResult
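Two pieces of this function are easy to picture in isolation: reducing each seed's series to the mean of its final `window` steps, and enumerating every unordered pair of methods. The sketch below shows both. It uses a plain dictionary of per-seed lists as a stand-in, because the structure of AggregatedResults is not described here. The method names and the helper `final_window_means` are hypothetical.

```python
from itertools import combinations
import statistics

def final_window_means(per_seed_series, window=100):
    """Average the last `window` steps of each seed's series (hypothetical helper)."""
    return [statistics.fmean(series[-window:]) for series in per_seed_series]

# Hypothetical stand-in for the results dictionary: name -> per-seed metric series.
results = {
    "method_x": [[1.0, 0.5, 0.4], [1.1, 0.6, 0.4]],
    "method_y": [[1.0, 0.7, 0.6], [1.2, 0.8, 0.6]],
    "method_z": [[1.0, 0.9, 0.8], [1.3, 0.9, 0.9]],
}
finals = {name: final_window_means(series, window=2) for name, series in results.items()}
pairs = list(combinations(results, 2))  # k methods give k * (k - 1) / 2 comparisons
```

The number of pairs grows quadratically with the number of methods, which is why a multiple comparison correction is applied to the resulting p-values.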
bootstrap_ci(values, statistic='mean', confidence_level=0.95, n_bootstrap=10000, seed=42)
Compute bootstrap confidence interval.
Args:
    values: Array of values
    statistic: Statistic to bootstrap ("mean" or "median")
    confidence_level: Confidence level
    n_bootstrap: Number of bootstrap samples
    seed: Random seed
Returns: Tuple of (point_estimate, ci_lower, ci_upper)
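A percentile bootstrap is one straightforward way to obtain such an interval: resample with replacement, recompute the statistic, and read off the tails of the resulting distribution. The sketch below takes that approach. The helper name `percentile_bootstrap` is hypothetical, the statistic is passed as a callable rather than a string, and the percentile method itself is an assumption about how the interval is formed.

```python
import random
import statistics

def percentile_bootstrap(values, stat=statistics.fmean, confidence_level=0.95,
                         n_bootstrap=2000, seed=42):
    """Hypothetical helper returning (point_estimate, ci_lower, ci_upper)."""
    rng = random.Random(seed)  # local generator keeps results reproducible
    n = len(values)
    # Resample n values with replacement and recompute the statistic each time.
    estimates = sorted(stat(rng.choices(values, k=n)) for _ in range(n_bootstrap))
    tail = (1.0 - confidence_level) / 2
    lower = estimates[int(tail * n_bootstrap)]
    upper = estimates[min(n_bootstrap - 1, int((1.0 - tail) * n_bootstrap))]
    return stat(values), lower, upper

data = [0.9, 1.1, 1.0, 1.2, 0.8, 1.05, 0.95]
point, lo, hi = percentile_bootstrap(data)
```

Passing `stat=statistics.median` gives the median variant. Fixing the seed makes repeated calls return identical bounds, which helps when intervals are reported in a paper.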