Evaluation Guide

This guide covers evaluation metrics, procedures, and analysis for M3SGG models.

Evaluation Overview

DLHM VidSGG provides comprehensive evaluation capabilities for video scene graph generation models across different modes and datasets.

Evaluation Modes

  • PredCLS: Evaluate relationship prediction given ground truth objects

  • SGCLS: Evaluate both object classification and relationship prediction

  • SGDET: Evaluate end-to-end object detection and relationship prediction

Basic Evaluation

Simple Evaluation Command

python scripts/evaluation/test.py -m predcls -datasize large -data_path data/action_genome -model_path output/model.pth

This evaluates a trained model on the Action Genome test set.

Complete Evaluation Command

python scripts/evaluation/test.py \
  -m predcls \
  -datasize large \
  -data_path data/action_genome \
  -model_path output/sttran_predcls/checkpoint_best.tar \
  -save_results output/evaluation_results.json

Evaluation Metrics

Recall Metrics

The primary evaluation metrics for scene graph generation:

Recall@K

The percentage of ground-truth relationships recovered among the top-K predicted triplets per frame

Mean Recall@K (mRecall@K)

Recall@K computed per relationship (predicate) category and then averaged, so infrequent predicates contribute equally
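
A minimal sketch of both metrics, assuming predictions and ground truth are lists of (subject, predicate, object) tuples with predictions sorted by confidence; this illustrates the definitions and is not the exact evaluator used by the test script.

from collections import defaultdict

def recall_at_k(pred_triplets, gt_triplets, k):
    """Fraction of ground-truth triplets found among the top-K predictions."""
    if not gt_triplets:
        return None  # undefined for frames without ground-truth relationships
    top_k = set(pred_triplets[:k])
    return sum(1 for t in gt_triplets if t in top_k) / len(gt_triplets)

def mean_recall_at_k(pred_triplets, gt_triplets, k):
    """Per-predicate recall averaged over predicates, so rare classes count equally."""
    per_predicate = defaultdict(lambda: [0, 0])  # predicate -> [hits, total]
    top_k = set(pred_triplets[:k])
    for subj, pred, obj in gt_triplets:
        per_predicate[pred][1] += 1
        if (subj, pred, obj) in top_k:
            per_predicate[pred][0] += 1
    recalls = [hits / total for hits, total in per_predicate.values()]
    return sum(recalls) / len(recalls) if recalls else None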

Standard Evaluation Metrics

Metric        Description
Recall@10     Recall considering the top 10 predictions per frame
Recall@20     Recall considering the top 20 predictions per frame
Recall@50     Recall considering the top 50 predictions per frame
mRecall@10    Mean recall across relationship categories (top 10)
mRecall@20    Mean recall across relationship categories (top 20)
mRecall@50    Mean recall across relationship categories (top 50)

Zero-Shot Metrics

For unseen relationship combinations:

  • Zero-Shot Recall@K: Recall on subject-predicate-object triplet combinations not seen during training (see the sketch below)

  • Compositional Recall: Performance on new compositions of individually seen objects and predicates
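
A minimal sketch of zero-shot recall, assuming ground-truth and predicted triplets are (subject, predicate, object) tuples and that the set of training-split triplet combinations has been precomputed; train_triplets and the helper name are illustrative assumptions.

def zero_shot_recall_at_k(pred_triplets, gt_triplets, train_triplets, k):
    """Recall@K restricted to ground-truth triplets whose combination never
    appeared in the training split (train_triplets is an assumed precomputed set)."""
    unseen_gt = [t for t in gt_triplets if t not in train_triplets]
    if not unseen_gt:
        return None  # this frame contains no zero-shot triplets
    top_k = set(pred_triplets[:k])
    return sum(1 for t in unseen_gt if t in top_k) / len(unseen_gt)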

Per-Category Analysis

Detailed analysis for each relationship category:

# Example per-category results
per_category_results = {
    'holding': {'recall@20': 25.3, 'precision': 18.7},
    'sitting_on': {'recall@20': 31.2, 'precision': 22.1},
    'looking_at': {'recall@20': 15.8, 'precision': 12.4}
}

Evaluation Procedures

Standard Evaluation

# Evaluate all models on test set
for model in sttran tempura scenellm stket; do
  python scripts/evaluation/test.py \
    -m predcls \
    -model_path output/${model}_predcls/checkpoint_best.tar \
    -save_results results/${model}_predcls_results.json
done
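
To compare the per-model result files written above, a small aggregation script can help; the "recall@20" key is an assumption about the JSON layout, so adjust it to whatever the test script actually writes.

import json
from pathlib import Path

# Collect Recall@20 (assumed key name) from every result file produced above.
results = {}
for path in Path("results").glob("*_predcls_results.json"):
    with open(path) as f:
        results[path.stem] = json.load(f).get("recall@20")

for name, score in sorted(results.items()):
    print(f"{name:40s} R@20: {score}")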

Cross-Dataset Evaluation

Evaluate model generalization across datasets:

# Train on Action Genome, test on EASG
python scripts/evaluation/test.py \
  -m predcls \
  -data_path data/EASG \
  -model_path output/action_genome_model.pth \
  -save_results cross_dataset_results.json

Temporal Evaluation

Analyze performance across different temporal windows:

# Evaluate with different temporal window sizes
for window in 1 3 5 10; do
  python scripts/evaluation/test.py \
    -m predcls \
    -temporal_window $window \
    -model_path output/model.pth
done

Mode-Specific Evaluation

PredCLS Evaluation

Input: Ground truth object bounding boxes and labels
Task: Predict relationships between objects

python scripts/evaluation/test.py -m predcls -model_path output/sttran_predcls.pth

Key Metrics:

  • Relationship prediction accuracy

  • Per-category relationship recall

  • Temporal consistency

SGCLS Evaluation

Input: Ground truth object bounding boxes
Task: Predict object labels and relationships

python scripts/evaluation/test.py -m sgcls -model_path output/sttran_sgcls.pth

Key Metrics:

  • Object classification accuracy

  • Relationship prediction given predicted objects

  • Joint object-relationship accuracy

SGDET Evaluation

Input: Raw video frames
Task: Detect objects and predict relationships end-to-end

python scripts/evaluation/test.py -m sgdet -model_path output/sttran_sgdet.pth

Key Metrics:

  • Object detection mAP

  • Relationship prediction accuracy

  • End-to-end scene graph quality

Advanced Evaluation

Uncertainty Evaluation

For models with uncertainty estimation (e.g., Tempura):

# Evaluate uncertainty calibration
python evaluate_uncertainty.py \
  -model_path output/tempura_model.pth \
  -calibration_method temperature_scaling

Uncertainty Metrics:

  • Expected calibration error (ECE)

  • Reliability diagrams

  • Uncertainty-accuracy correlation
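
A minimal sketch of expected calibration error over predicate predictions, assuming you can extract per-prediction confidences and correctness flags; it is not tied to a specific Tempura output format.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: accuracy-confidence gap per confidence bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece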

Robustness Evaluation

Test model robustness to various perturbations:

# Evaluate with noise
python test_robustness.py \
  -model_path output/model.pth \
  -noise_level 0.1 \
  -noise_type gaussian

Robustness Tests:

  • Gaussian noise in input frames

  • Occlusions and crops

  • Temporal jittering

  • Lighting changes
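
If the robustness script is not available, a quick sanity check is to perturb the input frames directly; this sketch adds zero-mean Gaussian noise to a frame tensor before it is fed to the model.

import torch

def add_gaussian_noise(frames, noise_level=0.1):
    """Add zero-mean Gaussian noise to a float frame tensor in [0, 1], shape (T, C, H, W)."""
    noise = torch.randn_like(frames) * noise_level
    return (frames + noise).clamp(0.0, 1.0)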

Efficiency Evaluation

Measure computational efficiency:

# Profile model inference
python profile_model.py \
  -model_path output/model.pth \
  -batch_size 1 \
  -num_iterations 100

Efficiency Metrics:

  • Inference time per frame

  • GPU memory usage

  • FLOPs count

  • Model parameters
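
If the profiling script is unavailable, a rough latency and peak-memory measurement can be done directly in PyTorch; model and sample are placeholders for a loaded model and a preprocessed input batch already on the target device.

import time
import torch

@torch.no_grad()
def profile_inference(model, sample, num_iterations=100):
    """Return average latency in ms and peak GPU memory in MB."""
    model.eval()
    for _ in range(10):  # warm-up so lazy initialization does not skew the timing
        model(sample)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    for _ in range(num_iterations):
        model(sample)
    torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / num_iterations * 1000
    peak_mb = torch.cuda.max_memory_allocated() / 1024 ** 2
    return latency_ms, peak_mb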

Evaluation Analysis

Statistical Significance

Test statistical significance of results:

from scipy import stats

# Compare two models
model1_scores = [19.2, 18.8, 19.5, ...]
model2_scores = [20.1, 19.7, 20.3, ...]

t_stat, p_value = stats.ttest_rel(model1_scores, model2_scores)
print(f"P-value: {p_value}")

Error Analysis

Analyze common failure modes:

# Analyze prediction errors
python analyze_errors.py \
  -predictions output/predictions.json \
  -ground_truth data/test_annotations.json \
  -save_analysis error_analysis.html

Analysis Categories:

  • Frequent false positives

  • Common missed relationships

  • Object detection failures

  • Temporal inconsistencies
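
A hedged sketch of the bookkeeping such an analysis performs, assuming per-frame predictions and ground truth are triplet lists; the real analysis script may categorize errors differently.

from collections import Counter

def categorize_errors(pred_triplets, gt_triplets, k=20):
    """Count false positives and missed ground-truth relationships per predicate."""
    top_k = set(pred_triplets[:k])
    gt = set(gt_triplets)
    false_positives = Counter(pred for _, pred, _ in top_k - gt)
    missed = Counter(pred for _, pred, _ in gt - top_k)
    return false_positives, missed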

Visualization

Generate evaluation visualizations:

# Create evaluation plots
python visualize_results.py \
  -results_dir output/evaluation_results/ \
  -output_dir plots/

Visualization Types:

  • Recall curves

  • Precision-recall plots

  • Confusion matrices

  • Per-category performance bars
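
A small matplotlib sketch for the per-category performance bars, assuming a dict shaped like the per_category_results example above and an existing output directory.

import matplotlib.pyplot as plt

def plot_per_category_recall(per_category_results, out_path="plots/per_category_recall.png"):
    """Horizontal bar chart of Recall@20 per relationship category."""
    categories = sorted(per_category_results, key=lambda c: per_category_results[c]["recall@20"])
    values = [per_category_results[c]["recall@20"] for c in categories]
    plt.figure(figsize=(8, 4))
    plt.barh(categories, values)
    plt.xlabel("Recall@20 (%)")
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()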

Benchmark Comparison

Standard Benchmarks

Compare against established benchmarks:

Action Genome Benchmark Results

Model      R@10   R@20   R@50   mR@50   Year
IMP         8.9   12.1   17.8    4.2    2017
KERN        9.2   12.7   18.4    4.8    2019
STTran     14.6   19.2   26.5    7.8    2021
Tempura    15.8   21.1   28.3    8.9    2022

Leaderboard Submission

Prepare results for benchmark submission:

# Format results for submission
python format_submission.py \
  -predictions output/test_predictions.json \
  -output submission.zip

Custom Evaluation

Domain-Specific Metrics

Implement custom metrics for specific domains:

def custom_metric(predictions, ground_truth):
    # Plug in whatever scoring your domain requires; compute_domain_specific_score
    # is a placeholder name, not an existing library function.
    score = compute_domain_specific_score(predictions, ground_truth)
    return score
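
For instance, a domain might weight some predicates more heavily than others; the weights and the triplet representation below are purely illustrative.

def weighted_predicate_recall(pred_triplets, gt_triplets, weights, k=20):
    """Recall@K where each ground-truth triplet counts by its predicate's weight.

    weights: dict mapping predicate name -> importance weight (illustrative).
    """
    top_k = set(pred_triplets[:k])
    total = sum(weights.get(t[1], 1.0) for t in gt_triplets)
    if total == 0:
        return None
    hit = sum(weights.get(t[1], 1.0) for t in gt_triplets if t in top_k)
    return hit / total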

Temporal Metrics

Evaluate temporal consistency; the similarity function in this sketch is assumed to be Jaccard overlap of the predicted triplet sets in consecutive frames:

def temporal_consistency(predictions):
    # Average similarity between consecutive frames' predictions.
    def similarity(a, b):  # assumed: Jaccard overlap of triplet sets
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if (a | b) else 1.0
    if len(predictions) < 2:
        return 1.0  # a single frame is trivially consistent
    scores = [similarity(predictions[t], predictions[t - 1]) for t in range(1, len(predictions))]
    return sum(scores) / len(scores)

Quality Assessment

Assess overall scene graph quality:

def scene_graph_quality(prediction, ground_truth):
    # Graph-level similarity metrics; the three helpers are placeholders,
    # and one possible instantiation is sketched after this block.
    node_similarity = compute_node_similarity(prediction, ground_truth)
    edge_similarity = compute_edge_similarity(prediction, ground_truth)
    structure_similarity = compute_structure_similarity(prediction, ground_truth)

    return (node_similarity + edge_similarity + structure_similarity) / 3
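
One possible instantiation of the three helpers, assuming each scene graph is a dict with 'nodes' (object labels) and 'edges' ((subject, predicate, object) triplets); this is an assumption for illustration, not the library's implementation.

def jaccard(a, b):
    """Set overlap; defined as 1.0 when both sets are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def compute_node_similarity(prediction, ground_truth):
    return jaccard(prediction["nodes"], ground_truth["nodes"])      # object labels

def compute_edge_similarity(prediction, ground_truth):
    return jaccard(prediction["edges"], ground_truth["edges"])      # labeled triplets

def compute_structure_similarity(prediction, ground_truth):
    unlabeled = lambda edges: {(s, o) for s, _, o in edges}         # drop predicate labels
    return jaccard(unlabeled(prediction["edges"]), unlabeled(ground_truth["edges"]))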

Evaluation Best Practices

Reproducibility

Ensure reproducible evaluation results:

import numpy as np
import torch

# Set random seeds for consistent evaluation
torch.manual_seed(42)
np.random.seed(42)

# Use consistent evaluation protocols
eval_config = {
    'batch_size': 1,
    'num_workers': 0,  # single-process data loading for reproducibility
    'deterministic': True
}

Multiple Runs

Perform multiple evaluation runs:

# Run evaluation multiple times with different seeds
for seed in 42 123 456 789 999; do
  python scripts/evaluation/test.py \
    -m predcls \
    -model_path output/model.pth \
    -seed $seed \
    -save_results results/run_${seed}.json
done

Statistical Reporting

Report results with confidence intervals:

import numpy as np
from scipy import stats

# Calculate mean and confidence interval
scores = [19.2, 18.8, 19.5, 19.1, 19.3]
mean_score = np.mean(scores)
std_error = stats.sem(scores)
ci = stats.t.interval(0.95, len(scores)-1, loc=mean_score, scale=std_error)

print(f"Recall@20: {mean_score:.1f} ± {std_error:.1f} (95% CI: {ci[0]:.1f}-{ci[1]:.1f})")

Troubleshooting

Common Issues

Low Evaluation Scores
  • Verify ground truth data format

  • Check evaluation metric implementation

  • Compare preprocessing with training

Inconsistent Results
  • Set random seeds for reproducibility

  • Use same data splits as training

  • Verify that the model checkpoint loads correctly

Memory Issues During Evaluation
  • Reduce batch size to 1

  • Process samples sequentially

  • Clear cache between batches

Performance Debugging

Slow Evaluation
  • Profile bottlenecks in evaluation code

  • Optimize data loading pipeline

  • Use GPU for faster inference

Unexpected Results
  • Visualize predictions vs ground truth

  • Check for data leakage or preprocessing errors

  • Validate against simple baselines

Evaluation Reports

Automated Reports

Generate comprehensive evaluation reports:

# Generate evaluation report
python generate_report.py \
  -results output/evaluation_results.json \
  -template templates/evaluation_report.html \
  -output reports/model_evaluation.html

Report Contents

Standard evaluation reports include:

  • Model Information: Architecture, parameters, training details

  • Dataset Statistics: Test set size, class distribution

  • Quantitative Results: All evaluation metrics with confidence intervals

  • Qualitative Analysis: Visualization of predictions and failures

  • Comparison: Performance relative to baselines and state-of-the-art

Next Steps

  • Training Guide - Return to training with evaluation insights

  • Models - Understand model architectures and their evaluation characteristics

  • api/lib - API documentation for evaluation functions