Evaluation Guide
This guide covers evaluation metrics, procedures, and analysis for M3SGG models.
Evaluation Overview
DLHM VidSGG provides comprehensive evaluation capabilities for video scene graph generation models across different modes and datasets.
Evaluation Modes
- PredCLS: Evaluate relationship prediction given ground truth object boxes and labels
- SGCLS: Evaluate object classification and relationship prediction given ground truth boxes
- SGDET: Evaluate end-to-end object detection and relationship prediction from raw frames
Basic Evaluation
Simple Evaluation Command
python scripts/evaluation/test.py -m predcls -datasize large -data_path data/action_genome -model_path output/model.pth
This evaluates a trained model on the Action Genome test set.
Complete Evaluation Command
python scripts/evaluation/test.py \
-m predcls \
-datasize large \
-data_path data/action_genome \
-model_path output/sttran_predcls/checkpoint_best.tar \
-save_results output/evaluation_results.json
Evaluation Metrics
Recall Metrics
The primary evaluation metrics for scene graph generation:
- Recall@K: Percentage of ground truth relationships that appear in the top-K predictions
- Mean Recall@K (mRecall@K): Recall@K averaged over relationship categories, so infrequent predicates count as much as frequent ones
| Metric | Description |
|---|---|
| R@10 | Recall considering top 10 predictions per frame |
| R@20 | Recall considering top 20 predictions per frame |
| R@50 | Recall considering top 50 predictions per frame |
| mR@10 | Mean recall across relationship categories (top 10) |
| mR@20 | Mean recall across relationship categories (top 20) |
| mR@50 | Mean recall across relationship categories (top 50) |
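As a reference point, here is a minimal sketch of how Recall@K and mRecall@K can be computed from per-frame triplet predictions. The data layout (per-frame lists of ((subject, predicate, object), score) pairs and sets of ground truth triplets) is an assumption for illustration, not the repository's internal format.

from collections import defaultdict

def recall_at_k(pred_triplets, gt_triplets, k):
    # pred_triplets: [((subj, pred, obj), score), ...] for one frame
    # gt_triplets: set of (subj, pred, obj) ground truth triplets for the frame
    top_k = {t for t, _ in sorted(pred_triplets, key=lambda x: -x[1])[:k]}
    return len(top_k & gt_triplets) / len(gt_triplets) if gt_triplets else None

def mean_recall_at_k(per_frame_preds, per_frame_gts, k):
    # Compute recall per predicate category, then average over categories
    hits, totals = defaultdict(int), defaultdict(int)
    for preds, gts in zip(per_frame_preds, per_frame_gts):
        top_k = {t for t, _ in sorted(preds, key=lambda x: -x[1])[:k]}
        for triplet in gts:
            totals[triplet[1]] += 1
            hits[triplet[1]] += triplet in top_k
    recalls = [hits[p] / totals[p] for p in totals]
    return sum(recalls) / len(recalls) if recalls else 0.0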
Zero-Shot Metrics
For unseen relationship combinations:
Zero-Shot Recall@K: Performance on novel object-relationship-object triplets
Compositional Recall: Performance on new compositions of seen elements
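Zero-shot recall is typically obtained by restricting the ground truth to triplet combinations that never occur in the training annotations and then applying the usual Recall@K. A hedged sketch, reusing the hypothetical recall_at_k helper from above:

def zero_shot_recall_at_k(pred_triplets, gt_triplets, train_triplets, k):
    # Keep only ground truth triplets whose (subj, pred, obj) combination
    # was never observed in the training annotations
    unseen_gt = {t for t in gt_triplets if t not in train_triplets}
    if not unseen_gt:
        return None  # nothing zero-shot to evaluate in this frame
    return recall_at_k(pred_triplets, unseen_gt, k)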
Per-Category Analysis
Detailed analysis for each relationship category:
# Example per-category results
per_category_results = {
'holding': {'recall@20': 25.3, 'precision': 18.7},
'sitting_on': {'recall@20': 31.2, 'precision': 22.1},
'looking_at': {'recall@20': 15.8, 'precision': 12.4}
}
Evaluation Procedures
Standard Evaluation
# Evaluate all models on test set
for model in sttran tempura scenellm stket; do
python scripts/evaluation/test.py \
-m predcls \
-model_path output/${model}_predcls/checkpoint_best.tar \
-save_results results/${model}_predcls_results.json
done
Cross-Dataset Evaluation
Evaluate model generalization across datasets:
# Train on Action Genome, test on EASG
python scripts/evaluation/test.py \
-m predcls \
-data_path data/EASG \
-model_path output/action_genome_model.pth \
-save_results cross_dataset_results.json
Temporal Evaluation
Analyze performance across different temporal windows:
# Evaluate with different temporal window sizes
for window in 1 3 5 10; do
python scripts/evaluation/test.py \
-m predcls \
-temporal_window $window \
-model_path output/model.pth
done
Mode-Specific Evaluation
PredCLS Evaluation
Input: Ground truth object bounding boxes and labels
Task: Predict relationships between objects
python scripts/evaluation/test.py -m predcls -model_path output/sttran_predcls.pth
Key Metrics:
- Relationship prediction accuracy
- Per-category relationship recall
- Temporal consistency
SGCLS Evaluation
Input: Ground truth object bounding boxes
Task: Predict object labels and relationships
python scripts/evaluation/test.py -m sgcls -model_path output/sttran_sgcls.pth
Key Metrics:
- Object classification accuracy
- Relationship prediction given predicted objects
- Joint object-relationship accuracy
SGDET Evaluation
Input: Raw video frames
Task: Detect objects and predict relationships end-to-end
python scripts/evaluation/test.py -m sgdet -model_path output/sttran_sgdet.pth
Key Metrics:
- Object detection mAP (see the sketch after this list)
- Relationship prediction accuracy
- End-to-end scene graph quality
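Object detection mAP can be computed with a standard library rather than custom code. A minimal sketch using torchmetrics (an optional dependency, not necessarily what this repository uses; API shown for recent versions):

import torch
from torchmetrics.detection import MeanAveragePrecision

# preds/targets: one dict per frame with xyxy boxes, class labels, and scores
preds = [{"boxes": torch.tensor([[10.0, 10.0, 50.0, 60.0]]),
          "scores": torch.tensor([0.9]),
          "labels": torch.tensor([1])}]
targets = [{"boxes": torch.tensor([[12.0, 11.0, 48.0, 58.0]]),
            "labels": torch.tensor([1])}]

metric = MeanAveragePrecision()
metric.update(preds, targets)
print(metric.compute()["map"])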
Advanced Evaluation
Uncertainty Evaluation
For models with uncertainty estimation (e.g., Tempura):
# Evaluate uncertainty calibration
python evaluate_uncertainty.py \
-model_path output/tempura_model.pth \
-calibration_method temperature_scaling
Uncertainty Metrics:
- Calibration error (ECE), sketched below
- Reliability diagrams
- Uncertainty-accuracy correlation
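For the ECE bullet, a minimal sketch over flat arrays of per-relationship confidences and correctness flags; the 10-bin equal-width binning scheme is an assumption:

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # confidences: predicted probability of the chosen relationship class
    # correct: 1 if the prediction matched the ground truth, else 0
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return ece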
Robustness Evaluation
Test model robustness to various perturbations:
# Evaluate with noise
python test_robustness.py \
-model_path output/model.pth \
-noise_level 0.1 \
-noise_type gaussian
Robustness Tests:
- Gaussian noise in input frames (see the sketch below)
- Occlusions and crops
- Temporal jittering
- Lighting changes
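A sketch of the Gaussian-noise perturbation, assuming frames are a float tensor of shape (T, C, H, W) with values in [0, 1]:

import torch

def add_gaussian_noise(frames, noise_level=0.1):
    # Additive Gaussian noise, clipped back to the valid pixel range
    noise = torch.randn_like(frames) * noise_level
    return (frames + noise).clamp(0.0, 1.0)

# Example: re-run evaluation on perturbed frames at several noise levels
# for level in (0.05, 0.1, 0.2):
#     noisy_frames = add_gaussian_noise(frames, level)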
Efficiency Evaluation
Measure computational efficiency:
# Profile model inference
python profile_model.py \
-model_path output/model.pth \
-batch_size 1 \
-num_iterations 100
Efficiency Metrics:
- Inference time per frame (see the sketch below)
- GPU memory usage
- FLOPs count
- Model parameters
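A minimal sketch of measuring per-iteration latency and parameter count with plain PyTorch; model and dummy_input are placeholders for a loaded model and an appropriately shaped input:

import time
import torch

@torch.no_grad()
def profile_inference(model, dummy_input, num_iterations=100, warmup=10):
    model.eval()
    for _ in range(warmup):              # warm up kernels and caches
        model(dummy_input)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(num_iterations):
        model(dummy_input)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / num_iterations
    num_params = sum(p.numel() for p in model.parameters())
    return {"latency_s": latency, "parameters": num_params}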
Evaluation Analysis
Statistical Significance
Test statistical significance of results:
from scipy import stats
# Compare two models
model1_scores = [19.2, 18.8, 19.5, ...]
model2_scores = [20.1, 19.7, 20.3, ...]
t_stat, p_value = stats.ttest_rel(model1_scores, model2_scores)
print(f"P-value: {p_value}")
Error Analysis
Analyze common failure modes:
# Analyze prediction errors
python analyze_errors.py \
-predictions output/predictions.json \
-ground_truth data/test_annotations.json \
-save_analysis error_analysis.html
Analysis Categories:
- Frequent false positives
- Common missed relationships
- Object detection failures
- Temporal inconsistencies
Visualization
Generate evaluation visualizations:
# Create evaluation plots
python visualize_results.py \
-results_dir output/evaluation_results/ \
-output_dir plots/
Visualization Types:
- Recall curves
- Precision-recall plots
- Confusion matrices
- Per-category performance bars (see the sketch below)
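A sketch of one such plot, per-category recall bars with matplotlib, assuming a results dict shaped like the per-category example earlier in this guide:

import matplotlib.pyplot as plt

def plot_per_category_recall(per_category_results, out_path="plots/per_category_recall.png"):
    # per_category_results: {'holding': {'recall@20': 25.3, ...}, ...}
    categories = sorted(per_category_results, key=lambda c: -per_category_results[c]["recall@20"])
    values = [per_category_results[c]["recall@20"] for c in categories]
    plt.figure(figsize=(8, 4))
    plt.bar(categories, values)
    plt.ylabel("Recall@20 (%)")
    plt.xticks(rotation=45, ha="right")
    plt.tight_layout()
    plt.savefig(out_path)
    plt.close()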
Benchmark Comparison
Standard Benchmarks
Compare against established benchmarks:
| Model | R@10 | R@20 | R@50 | mRecall | Year |
|---|---|---|---|---|---|
| IMP | 8.9 | 12.1 | 17.8 | 4.2 | 2017 |
| KERN | 9.2 | 12.7 | 18.4 | 4.8 | 2019 |
| STTran | 14.6 | 19.2 | 26.5 | 7.8 | 2021 |
| Tempura | 15.8 | 21.1 | 28.3 | 8.9 | 2022 |
Leaderboard Submission
Prepare results for benchmark submission:
# Format results for submission
python format_submission.py \
-predictions output/test_predictions.json \
-output submission.zip
Custom Evaluation
Domain-Specific Metrics
Implement custom metrics for specific domains:
def custom_metric(predictions, ground_truth):
    # Custom evaluation logic; compute_domain_specific_score is a placeholder
    # for whatever scoring routine your domain requires
    score = compute_domain_specific_score(predictions, ground_truth)
    return score
Temporal Metrics
Evaluate temporal consistency:
def temporal_consistency(predictions):
    # predictions: one set of (subject, predicate, object) triplets per frame
    # Consistency here is the mean Jaccard overlap between consecutive frames
    # (any other frame-level similarity can be swapped in)
    consistency_score = 0.0
    for t in range(1, len(predictions)):
        prev, curr = set(predictions[t - 1]), set(predictions[t])
        union = prev | curr
        consistency_score += len(prev & curr) / len(union) if union else 1.0
    return consistency_score / (len(predictions) - 1)
Quality Assessment
Assess overall scene graph quality:
def scene_graph_quality(prediction, ground_truth):
    # Graph-level similarity; the compute_* helpers are placeholders for
    # node-, edge-, and structure-matching routines of your choice
    node_similarity = compute_node_similarity(prediction, ground_truth)
    edge_similarity = compute_edge_similarity(prediction, ground_truth)
    structure_similarity = compute_structure_similarity(prediction, ground_truth)
    return (node_similarity + edge_similarity + structure_similarity) / 3
Evaluation Best Practices
Reproducibility
Ensure reproducible evaluation results:
import numpy as np
import torch

# Set random seeds for consistent evaluation
torch.manual_seed(42)
np.random.seed(42)
# Use consistent evaluation protocols
eval_config = {
'batch_size': 1,
'num_workers': 0, # For reproducibility
'deterministic': True
}
Multiple Runs
Perform multiple evaluation runs:
# Run evaluation multiple times with different seeds
for seed in 42 123 456 789 999; do
python scripts/evaluation/test.py \
-m predcls \
-model_path output/model.pth \
-seed $seed \
-save_results results/run_${seed}.json
done
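The per-run result files can then be aggregated before statistical reporting. A sketch that assumes each JSON stores its headline score under a "recall@20" key (the actual key depends on how -save_results writes results):

import glob
import json

# Collect the headline metric from every run
scores = []
for path in sorted(glob.glob("results/run_*.json")):
    with open(path) as f:
        scores.append(json.load(f)["recall@20"])
print(f"{len(scores)} runs, mean Recall@20 = {sum(scores) / len(scores):.1f}")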
Statistical Reporting
Report results with confidence intervals:
import numpy as np
from scipy import stats
# Calculate mean and confidence interval
scores = [19.2, 18.8, 19.5, 19.1, 19.3]
mean_score = np.mean(scores)
std_error = stats.sem(scores)
ci = stats.t.interval(0.95, len(scores)-1, loc=mean_score, scale=std_error)
print(f"Recall@20: {mean_score:.1f} ± {std_error:.1f} (95% CI: {ci[0]:.1f}-{ci[1]:.1f})")
Troubleshooting
Common Issues
- Low Evaluation Scores
  - Verify the ground truth data format
  - Check the evaluation metric implementation
  - Make sure preprocessing matches the training setup
- Inconsistent Results
  - Set random seeds for reproducibility
  - Use the same data splits as during training
  - Verify that the model checkpoint loads correctly
- Memory Issues During Evaluation (see the sketch after this list)
  - Reduce the batch size to 1
  - Process samples sequentially
  - Clear the GPU cache between batches
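A hedged sketch of the last two points, processing one sample at a time under torch.no_grad() and clearing the CUDA cache between batches; model, dataloader, and the evaluator object are placeholders, not repository APIs:

import torch

@torch.no_grad()
def evaluate_low_memory(model, dataloader, evaluator):
    model.eval()
    for batch in dataloader:              # batch_size=1: one sample at a time
        outputs = model(batch)
        evaluator.update(outputs, batch)  # accumulate metrics incrementally
        del outputs
        if torch.cuda.is_available():
            torch.cuda.empty_cache()      # release cached GPU memory
    return evaluator.summarize()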
Performance Debugging
- Slow Evaluation
  - Profile bottlenecks in the evaluation code
  - Optimize the data loading pipeline
  - Use a GPU for faster inference
- Unexpected Results
  - Visualize predictions against the ground truth
  - Check for data leakage or preprocessing errors
  - Validate against simple baselines
Evaluation Reports
Automated Reports
Generate comprehensive evaluation reports:
# Generate evaluation report
python generate_report.py \
-results output/evaluation_results.json \
-template templates/evaluation_report.html \
-output reports/model_evaluation.html
Report Contents
Standard evaluation reports include:
Model Information: Architecture, parameters, training details
Dataset Statistics: Test set size, class distribution
Quantitative Results: All evaluation metrics with confidence intervals
Qualitative Analysis: Visualization of predictions and failures
Comparison: Performance relative to baselines and state-of-the-art
Next Steps
Training Guide - Return to training with evaluation insights
Models - Understand model architectures and their evaluation characteristics
api/lib - API documentation for evaluation functions