Evaluation Guide

This guide covers evaluation metrics, procedures, and analysis for M3SGG models.

Evaluation Overview

DLHM VidSGG provides comprehensive evaluation capabilities for video scene graph generation models across different modes and datasets.

Evaluation Modes

  • PredCLS: Evaluate relationship prediction given ground truth objects

  • SGCLS: Evaluate both object classification and relationship prediction

  • SGDET: Evaluate end-to-end object detection and relationship prediction

Basic Evaluation

Simple Evaluation Command

python scripts/evaluation/test.py -m predcls -datasize large -data_path data/action_genome -model_path output/model.pth

This evaluates a trained model on the Action Genome test set.

Complete Evaluation Command

python scripts/evaluation/test.py \
  -m predcls \
  -datasize large \
  -data_path data/action_genome \
  -model_path output/sttran_predcls/checkpoint_best.tar \
  -save_results output/evaluation_results.json

Evaluation Metrics

Recall Metrics

The primary evaluation metrics for scene graph generation:

Recall@K

The percentage of ground-truth relationships recovered among the top-K predicted triplets per frame

Mean Recall@K (mRecall@K)

Recall@K computed per relationship (predicate) category and then averaged, so infrequent predicates contribute equally
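
A minimal sketch of both metrics, assuming predictions and ground truth are lists of (subject, predicate, object) tuples with predictions sorted by confidence; this illustrates the definitions and is not the exact evaluator used by the test script.

from collections import defaultdict

def recall_at_k(pred_triplets, gt_triplets, k):
    """Fraction of ground-truth triplets found among the top-K predictions."""
    if not gt_triplets:
        return None  # undefined for frames without ground-truth relationships
    top_k = set(pred_triplets[:k])
    return sum(1 for t in gt_triplets if t in top_k) / len(gt_triplets)

def mean_recall_at_k(pred_triplets, gt_triplets, k):
    """Per-predicate recall averaged over predicates, so rare classes count equally."""
    per_predicate = defaultdict(lambda: [0, 0])  # predicate -> [hits, total]
    top_k = set(pred_triplets[:k])
    for subj, pred, obj in gt_triplets:
        per_predicate[pred][1] += 1
        if (subj, pred, obj) in top_k:
            per_predicate[pred][0] += 1
    recalls = [hits / total for hits, total in per_predicate.values()]
    return sum(recalls) / len(recalls) if recalls else None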

Standard Evaluation Metrics

Metric        Description
Recall@10     Recall considering the top 10 predictions per frame
Recall@20     Recall considering the top 20 predictions per frame
Recall@50     Recall considering the top 50 predictions per frame
mRecall@10    Mean recall across relationship categories (top 10)
mRecall@20    Mean recall across relationship categories (top 20)
mRecall@50    Mean recall across relationship categories (top 50)

Zero-Shot Metrics

For unseen relationship combinations:

  • Zero-Shot Recall@K: Recall on subject-predicate-object triplet combinations not seen during training (see the sketch below)

  • Compositional Recall: Performance on new compositions of individually seen objects and predicates
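
A minimal sketch of zero-shot recall, assuming ground-truth and predicted triplets are (subject, predicate, object) tuples and that the set of training-split triplet combinations has been precomputed; train_triplets and the helper name are illustrative assumptions.

def zero_shot_recall_at_k(pred_triplets, gt_triplets, train_triplets, k):
    """Recall@K restricted to ground-truth triplets whose combination never
    appeared in the training split (train_triplets is an assumed precomputed set)."""
    unseen_gt = [t for t in gt_triplets if t not in train_triplets]
    if not unseen_gt:
        return None  # this frame contains no zero-shot triplets
    top_k = set(pred_triplets[:k])
    return sum(1 for t in unseen_gt if t in top_k) / len(unseen_gt)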

Per-Category Analysis

Detailed analysis for each relationship category:

# Example per-category results
per_category_results = {
    'holding': {'recall@20': 25.3, 'precision': 18.7},
    'sitting_on': {'recall@20': 31.2, 'precision': 22.1},
    'looking_at': {'recall@20': 15.8, 'precision': 12.4}
}

Evaluation Procedures

Standard Evaluation

# Evaluate all models on test set
for model in sttran tempura scenellm stket; do
  python scripts/evaluation/test.py \
    -m predcls \
    -model_path output/${model}_predcls/checkpoint_best.tar \
    -save_results results/${model}_predcls_results.json
done
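
To compare the per-model result files written above, a small aggregation script can help; the "recall@20" key is an assumption about the JSON layout, so adjust it to whatever the test script actually writes.

import json
from pathlib import Path

# Collect Recall@20 (assumed key name) from every result file produced above.
results = {}
for path in Path("results").glob("*_predcls_results.json"):
    with open(path) as f:
        results[path.stem] = json.load(f).get("recall@20")

for name, score in sorted(results.items()):
    print(f"{name:40s} R@20: {score}")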

Cross-Dataset Evaluation

Evaluate model generalization across datasets:

# Train on Action Genome, test on EASG
python scripts/evaluation/test.py \
  -m predcls \
  -data_path data/EASG \
  -model_path output/action_genome_model.pth \
  -save_results cross_dataset_results.json

Temporal Evaluation

Analyze performance across different temporal windows:

# Evaluate with different temporal window sizes
for window in 1 3 5 10; do
  python scripts/evaluation/test.py \
    -m predcls \
    -temporal_window $window \
    -model_path output/model.pth
done

Mode-Specific Evaluation

PredCLS Evaluation

Input: Ground truth object bounding boxes and labels
Task: Predict relationships between objects

python scripts/evaluation/test.py -m predcls -model_path output/sttran_predcls.pth

Key Metrics:

  • Relationship prediction accuracy

  • Per-category relationship recall

  • Temporal consistency

SGCLS Evaluation

Input: Ground truth object bounding boxes
Task: Predict object labels and relationships

python scripts/evaluation/test.py -m sgcls -model_path output/sttran_sgcls.pth

Key Metrics:

  • Object classification accuracy

  • Relationship prediction given predicted objects

  • Joint object-relationship accuracy

SGDET Evaluation

Input: Raw video frames
Task: Detect objects and predict relationships end-to-end

python scripts/evaluation/test.py -m sgdet -model_path output/sttran_sgdet.pth

Key Metrics:

  • Object detection mAP

  • Relationship prediction accuracy

  • End-to-end scene graph quality

Advanced Evaluation

Uncertainty Evaluation

For models with uncertainty estimation (e.g., Tempura):

# Evaluate uncertainty calibration
python evaluate_uncertainty.py \
  -model_path output/tempura_model.pth \
  -calibration_method temperature_scaling

Uncertainty Metrics:

  • Expected calibration error (ECE)

  • Reliability diagrams

  • Uncertainty-accuracy correlation
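
A minimal sketch of expected calibration error over predicate predictions, assuming you can extract per-prediction confidences and correctness flags; it is not tied to a specific Tempura output format.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: accuracy-confidence gap per confidence bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece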

Robustness Evaluation

Test model robustness to various perturbations:

# Evaluate with noise
python test_robustness.py \
  -model_path output/model.pth \
  -noise_level 0.1 \
  -noise_type gaussian

Robustness Tests:

  • Gaussian noise in input frames

  • Occlusions and crops

  • Temporal jittering

  • Lighting changes
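
If the robustness script is not available, a quick sanity check is to perturb the input frames directly; this sketch adds zero-mean Gaussian noise to a frame tensor before it is fed to the model.

import torch

def add_gaussian_noise(frames, noise_level=0.1):
    """Add zero-mean Gaussian noise to a float frame tensor in [0, 1], shape (T, C, H, W)."""
    noise = torch.randn_like(frames) * noise_level
    return (frames + noise).clamp(0.0, 1.0)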

Efficiency Evaluation

Measure computational efficiency:

# Profile model inference
python profile_model.py \
  -model_path output/model.pth \
  -batch_size 1 \
  -num_iterations 100

Efficiency Metrics:

  • Inference time per frame

  • GPU memory usage

  • FLOPs count

  • Model parameters
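
If the profiling script is unavailable, a rough latency and peak-memory measurement can be done directly in PyTorch; model and sample are placeholders for a loaded model and a preprocessed input batch already on the target device.

import time
import torch

@torch.no_grad()
def profile_inference(model, sample, num_iterations=100):
    """Return average latency in ms and peak GPU memory in MB."""
    model.eval()
    for _ in range(10):  # warm-up so lazy initialization does not skew the timing
        model(sample)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    for _ in range(num_iterations):
        model(sample)
    torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) / num_iterations * 1000
    peak_mb = torch.cuda.max_memory_allocated() / 1024 ** 2
    return latency_ms, peak_mb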

Evaluation Analysis

Statistical Significance

Test statistical significance of results:

from scipy import stats

# Compare two models
model1_scores = [19.2, 18.8, 19.5, ...]
model2_scores = [20.1, 19.7, 20.3, ...]

t_stat, p_value = stats.ttest_rel(model1_scores, model2_scores)
print(f"P-value: {p_value}")

Error Analysis

Analyze common failure modes:

# Analyze prediction errors
python analyze_errors.py \
  -predictions output/predictions.json \
  -ground_truth data/test_annotations.json \
  -save_analysis error_analysis.html

Analysis Categories:

  • Frequent false positives

  • Common missed relationships

  • Object detection failures

  • Temporal inconsistencies
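
A hedged sketch of the bookkeeping such an analysis performs, assuming per-frame predictions and ground truth are triplet lists; the real analysis script may categorize errors differently.

from collections import Counter

def categorize_errors(pred_triplets, gt_triplets, k=20):
    """Count false positives and missed ground-truth relationships per predicate."""
    top_k = set(pred_triplets[:k])
    gt = set(gt_triplets)
    false_positives = Counter(pred for _, pred, _ in top_k - gt)
    missed = Counter(pred for _, pred, _ in gt - top_k)
    return false_positives, missed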

Visualization

Generate evaluation visualizations:

# Create evaluation plots
python visualize_results.py \
  -results_dir output/evaluation_results/ \
  -output_dir plots/

Visualization Types:

  • Recall curves

  • Precision-recall plots

  • Confusion matrices

  • Per-category performance bars
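
A small matplotlib sketch for the per-category performance bars, assuming a dict shaped like the per_category_results example above and an existing output directory.

import matplotlib.pyplot as plt

def plot_per_category_recall(per_category_results, out_path="plots/per_category_recall.png"):
    """Horizontal bar chart of Recall@20 per relationship category."""
    categories = sorted(per_category_results, key=lambda c: per_category_results[c]["recall@20"])
    values = [per_category_results[c]["recall@20"] for c in categories]
    plt.figure(figsize=(8, 4))
    plt.barh(categories, values)
    plt.xlabel("Recall@20 (%)")
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()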

Benchmark Comparison

Standard Benchmarks

Compare against established benchmarks:

Action Genome Benchmark Results

Model      R@10   R@20   R@50   mR@50   Year
IMP         8.9   12.1   17.8    4.2    2017
KERN        9.2   12.7   18.4    4.8    2019
STTran     14.6   19.2   26.5    7.8    2021
Tempura    15.8   21.1   28.3    8.9    2022

Leaderboard Submission

Prepare results for benchmark submission:

# Format results for submission
python format_submission.py \
  -predictions output/test_predictions.json \
  -output submission.zip

Custom Evaluation

Domain-Specific Metrics

Implement custom metrics for specific domains:

def custom_metric(predictions, ground_truth):
    # Plug in whatever scoring your domain requires; compute_domain_specific_score
    # is a placeholder name, not an existing library function.
    score = compute_domain_specific_score(predictions, ground_truth)
    return score
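
For instance, a domain might weight some predicates more heavily than others; the weights and the triplet representation below are purely illustrative.

def weighted_predicate_recall(pred_triplets, gt_triplets, weights, k=20):
    """Recall@K where each ground-truth triplet counts by its predicate's weight.

    weights: dict mapping predicate name -> importance weight (illustrative).
    """
    top_k = set(pred_triplets[:k])
    total = sum(weights.get(t[1], 1.0) for t in gt_triplets)
    if total == 0:
        return None
    hit = sum(weights.get(t[1], 1.0) for t in gt_triplets if t in top_k)
    return hit / total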

Temporal Metrics

Evaluate temporal consistency; the similarity function in this sketch is assumed to be Jaccard overlap of the predicted triplet sets in consecutive frames:

def temporal_consistency(predictions):
    # Average similarity between consecutive frames' predictions.
    def similarity(a, b):  # assumed: Jaccard overlap of triplet sets
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if (a | b) else 1.0
    if len(predictions) < 2:
        return 1.0  # a single frame is trivially consistent
    scores = [similarity(predictions[t], predictions[t - 1]) for t in range(1, len(predictions))]
    return sum(scores) / len(scores)

Quality Assessment

Assess overall scene graph quality:

def scene_graph_quality(prediction, ground_truth):
    # Graph-level similarity metrics; the three helpers are placeholders,
    # and one possible instantiation is sketched after this block.
    node_similarity = compute_node_similarity(prediction, ground_truth)
    edge_similarity = compute_edge_similarity(prediction, ground_truth)
    structure_similarity = compute_structure_similarity(prediction, ground_truth)

    return (node_similarity + edge_similarity + structure_similarity) / 3
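
One possible instantiation of the three helpers, assuming each scene graph is a dict with 'nodes' (object labels) and 'edges' ((subject, predicate, object) triplets); this is an assumption for illustration, not the library's implementation.

def jaccard(a, b):
    """Set overlap; defined as 1.0 when both sets are empty."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def compute_node_similarity(prediction, ground_truth):
    return jaccard(prediction["nodes"], ground_truth["nodes"])      # object labels

def compute_edge_similarity(prediction, ground_truth):
    return jaccard(prediction["edges"], ground_truth["edges"])      # labeled triplets

def compute_structure_similarity(prediction, ground_truth):
    unlabeled = lambda edges: {(s, o) for s, _, o in edges}         # drop predicate labels
    return jaccard(unlabeled(prediction["edges"]), unlabeled(ground_truth["edges"]))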

Evaluation Best Practices

Reproducibility

Ensure reproducible evaluation results:

import numpy as np
import torch

# Set random seeds for consistent evaluation
torch.manual_seed(42)
np.random.seed(42)

# Use consistent evaluation protocols
eval_config = {
    'batch_size': 1,
    'num_workers': 0,  # single-process data loading for reproducibility
    'deterministic': True
}

Multiple Runs

Perform multiple evaluation runs:

# Run evaluation multiple times with different seeds
for seed in 42 123 456 789 999; do
  python scripts/evaluation/test.py \
    -m predcls \
    -model_path output/model.pth \
    -seed $seed \
    -save_results results/run_${seed}.json
done

Statistical Reporting

Report results with confidence intervals:

import numpy as np
from scipy import stats

# Calculate mean and confidence interval
scores = [19.2, 18.8, 19.5, 19.1, 19.3]
mean_score = np.mean(scores)
std_error = stats.sem(scores)
ci = stats.t.interval(0.95, len(scores)-1, loc=mean_score, scale=std_error)

print(f"Recall@20: {mean_score:.1f} ± {std_error:.1f} (95% CI: {ci[0]:.1f}-{ci[1]:.1f})")

Troubleshooting

Common Issues

Low Evaluation Scores
  • Verify ground truth data format

  • Check evaluation metric implementation

  • Compare preprocessing with training

Inconsistent Results
  • Set random seeds for reproducibility

  • Use same data splits as training

  • Verify that the model checkpoint loads correctly

Memory Issues During Evaluation
  • Reduce batch size to 1

  • Process samples sequentially

  • Clear cache between batches

Performance Debugging

Slow Evaluation
  • Profile bottlenecks in evaluation code

  • Optimize data loading pipeline

  • Use GPU for faster inference

Unexpected Results
  • Visualize predictions vs ground truth

  • Check for data leakage or preprocessing errors

  • Validate against simple baselines

Evaluation Reports

Automated Reports

Generate comprehensive evaluation reports:

# Generate evaluation report
python generate_report.py \
  -results output/evaluation_results.json \
  -template templates/evaluation_report.html \
  -output reports/model_evaluation.html

Report Contents

Standard evaluation reports include:

  • Model Information: Architecture, parameters, training details

  • Dataset Statistics: Test set size, class distribution

  • Quantitative Results: All evaluation metrics with confidence intervals

  • Qualitative Analysis: Visualization of predictions and failures

  • Comparison: Performance relative to baselines and state-of-the-art

Next Steps

  • Training Guide - Return to training with evaluation insights

  • Models - Understand model architectures and their evaluation characteristics

  • api/lib - API documentation for evaluation functions