Evaluation Guide
================

This guide covers evaluation metrics, procedures, and analysis for M3SGG models.

Evaluation Overview
-------------------

DLHM VidSGG provides comprehensive evaluation capabilities for video scene graph generation models across different modes and datasets.

**Evaluation Modes**

* **PredCLS**: Evaluate relationship prediction given ground truth objects
* **SGCLS**: Evaluate both object classification and relationship prediction
* **SGDET**: Evaluate end-to-end object detection and relationship prediction

Basic Evaluation
----------------

Simple Evaluation Command
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   python scripts/evaluation/test.py -m predcls -datasize large -data_path data/action_genome -model_path output/model.pth

This evaluates a trained model on the Action Genome test set.

Complete Evaluation Command
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   python scripts/evaluation/test.py \
       -m predcls \
       -datasize large \
       -data_path data/action_genome \
       -model_path output/sttran_predcls/checkpoint_best.tar \
       -save_results output/evaluation_results.json

Evaluation Metrics
------------------

Recall Metrics
~~~~~~~~~~~~~~

The primary evaluation metrics for scene graph generation:

**Recall@K**
    Percentage of ground truth relationships that appear in the top-K predictions

**Mean Recall@K (mRecall@K)**
    Recall averaged across all relationship categories

.. list-table:: Standard Evaluation Metrics
   :widths: 25 75
   :header-rows: 1

   * - Metric
     - Description
   * - Recall@10
     - Recall considering the top 10 predictions per frame
   * - Recall@20
     - Recall considering the top 20 predictions per frame
   * - Recall@50
     - Recall considering the top 50 predictions per frame
   * - mRecall@10
     - Mean recall across relationship categories (top 10)
   * - mRecall@20
     - Mean recall across relationship categories (top 20)
   * - mRecall@50
     - Mean recall across relationship categories (top 50)
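To make these definitions concrete, here is a minimal sketch of Recall@K for a single frame. It assumes predictions are ``(subject, predicate, object)`` triplets ranked by a confidence score; the function name and triplet format are illustrative, not the package's actual evaluator.

.. code-block:: python

   # Illustrative Recall@K for one frame; not the library's evaluator.
   # A triplet is a (subject, predicate, object) tuple of labels.

   def recall_at_k(pred_triplets, pred_scores, gt_triplets, k):
       """Fraction of ground truth triplets found in the top-k predictions."""
       if not gt_triplets:
           return 0.0
       ranked = sorted(zip(pred_triplets, pred_scores), key=lambda ts: -ts[1])
       top_k = {triplet for triplet, _ in ranked[:k]}
       return sum(1 for gt in gt_triplets if gt in top_k) / len(gt_triplets)

   # Toy example: the single ground truth triplet is ranked first
   preds = [("person", "holding", "cup"), ("person", "looking_at", "cup")]
   scores = [0.9, 0.4]
   gt = [("person", "holding", "cup")]
   print(recall_at_k(preds, scores, gt, k=1))  # 1.0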
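Building on ``recall_at_k`` above, mRecall@K averages per-predicate recalls so that rare relationship categories weigh as much as frequent ones (again an illustrative sketch, not the shipped implementation):

.. code-block:: python

   from collections import defaultdict

   def mean_recall_at_k(pred_triplets, pred_scores, gt_triplets, k):
       """Unweighted mean of recall_at_k over predicate categories in the GT."""
       gt_by_predicate = defaultdict(list)
       for triplet in gt_triplets:
           gt_by_predicate[triplet[1]].append(triplet)  # key on the predicate
       recalls = [
           recall_at_k(pred_triplets, pred_scores, gts, k)
           for gts in gt_by_predicate.values()
       ]
       return sum(recalls) / len(recalls) if recalls else 0.0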
Zero-Shot Metrics
~~~~~~~~~~~~~~~~~

For unseen relationship combinations:

* **Zero-Shot Recall@K**: Performance on novel object-relationship-object triplets
* **Compositional Recall**: Performance on new compositions of seen elements

Per-Category Analysis
~~~~~~~~~~~~~~~~~~~~~

Detailed analysis for each relationship category:

.. code-block:: python

   # Example per-category results
   per_category_results = {
       'holding': {'recall@20': 25.3, 'precision': 18.7},
       'sitting_on': {'recall@20': 31.2, 'precision': 22.1},
       'looking_at': {'recall@20': 15.8, 'precision': 12.4},
   }

Evaluation Procedures
---------------------

Standard Evaluation
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   # Evaluate all models on the test set
   for model in sttran tempura scenellm stket; do
       python scripts/evaluation/test.py \
           -m predcls \
           -model_path output/${model}_predcls/checkpoint_best.tar \
           -save_results results/${model}_predcls_results.json
   done

Cross-Dataset Evaluation
~~~~~~~~~~~~~~~~~~~~~~~~

Evaluate model generalization across datasets:

.. code-block:: bash

   # Train on Action Genome, test on EASG
   python scripts/evaluation/test.py \
       -m predcls \
       -data_path data/EASG \
       -model_path output/action_genome_model.pth \
       -save_results cross_dataset_results.json

Temporal Evaluation
~~~~~~~~~~~~~~~~~~~

Analyze performance across different temporal windows:

.. code-block:: bash

   # Evaluate with different temporal window sizes
   for window in 1 3 5 10; do
       python scripts/evaluation/test.py \
           -m predcls \
           -temporal_window $window \
           -model_path output/model.pth
   done

Mode-Specific Evaluation
------------------------

PredCLS Evaluation
~~~~~~~~~~~~~~~~~~

**Input**: Ground truth object bounding boxes and labels

**Task**: Predict relationships between objects

.. code-block:: bash

   python scripts/evaluation/test.py -m predcls -model_path output/sttran_predcls.pth

**Key Metrics**:

* Relationship prediction accuracy
* Per-category relationship recall
* Temporal consistency

SGCLS Evaluation
~~~~~~~~~~~~~~~~

**Input**: Ground truth object bounding boxes

**Task**: Predict object labels and relationships

.. code-block:: bash

   python scripts/evaluation/test.py -m sgcls -model_path output/sttran_sgcls.pth

**Key Metrics**:

* Object classification accuracy
* Relationship prediction given predicted objects
* Joint object-relationship accuracy

SGDET Evaluation
~~~~~~~~~~~~~~~~

**Input**: Raw video frames

**Task**: Detect objects and predict relationships end-to-end

.. code-block:: bash

   python scripts/evaluation/test.py -m sgdet -model_path output/sttran_sgdet.pth

**Key Metrics**:

* Object detection mAP
* Relationship prediction accuracy
* End-to-end scene graph quality

Advanced Evaluation
-------------------

Uncertainty Evaluation
~~~~~~~~~~~~~~~~~~~~~~

For models with uncertainty estimation (e.g., Tempura):

.. code-block:: bash

   # Evaluate uncertainty calibration
   python evaluate_uncertainty.py \
       -model_path output/tempura_model.pth \
       -calibration_method temperature_scaling

**Uncertainty Metrics** (an ECE sketch follows this list):

* Calibration error (ECE)
* Reliability diagrams
* Uncertainty-accuracy correlation
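Of these, ECE has a compact definition worth seeing in code: bin predictions by confidence, then average the accuracy-confidence gap weighted by bin occupancy. The sketch below is illustrative and independent of ``evaluate_uncertainty.py``; the binning scheme and names are assumptions.

.. code-block:: python

   import numpy as np

   def expected_calibration_error(confidences, correct, n_bins=10):
       """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
       confidences = np.asarray(confidences, dtype=float)
       correct = np.asarray(correct, dtype=float)
       edges = np.linspace(0.0, 1.0, n_bins + 1)
       ece = 0.0
       for lo, hi in zip(edges[:-1], edges[1:]):
           in_bin = (confidences > lo) & (confidences <= hi)
           if in_bin.any():
               gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
               ece += in_bin.mean() * gap  # weight by fraction of samples in bin
       return ece

   # Overconfident toy predictions produce a visible calibration gap
   print(expected_calibration_error([0.95, 0.9, 0.9, 0.85], [1, 0, 1, 0]))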
Robustness Evaluation
~~~~~~~~~~~~~~~~~~~~~

Test model robustness to various perturbations:

.. code-block:: bash

   # Evaluate with noise
   python test_robustness.py \
       -model_path output/model.pth \
       -noise_level 0.1 \
       -noise_type gaussian

**Robustness Tests**:

* Gaussian noise in input frames
* Occlusions and crops
* Temporal jittering
* Lighting changes

Efficiency Evaluation
~~~~~~~~~~~~~~~~~~~~~

Measure computational efficiency:

.. code-block:: bash

   # Profile model inference
   python profile_model.py \
       -model_path output/model.pth \
       -batch_size 1 \
       -num_iterations 100

**Efficiency Metrics** (a timing sketch follows this list):

* Inference time per frame
* GPU memory usage
* FLOPs count
* Model parameters
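Independently of ``profile_model.py``, per-frame latency, peak memory, and parameter count can be spot-checked with a short PyTorch loop like the one below. The checkpoint path and input shape are placeholders; the warm-up and ``torch.cuda.synchronize()`` calls are the standard pattern for timing asynchronous CUDA work.

.. code-block:: python

   import time

   import torch

   # Placeholder path; checkpoints that store a state_dict instead of a full
   # module need the matching model class to be instantiated first.
   model = torch.load("output/model.pth")
   model.eval().cuda()

   # Dummy input; adjust the shape to the model's expected frame tensor.
   dummy = torch.randn(1, 3, 224, 224, device="cuda")

   with torch.no_grad():
       for _ in range(10):           # warm-up (CUDA context, kernel autotuning)
           model(dummy)
       torch.cuda.synchronize()      # drain queued kernels before starting the clock
       start = time.perf_counter()
       for _ in range(100):
           model(dummy)
       torch.cuda.synchronize()      # wait for the last kernel before stopping it
       elapsed = time.perf_counter() - start

   print(f"Inference time per frame: {1000 * elapsed / 100:.2f} ms")
   print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
   print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")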
Evaluation Analysis
-------------------

Statistical Significance
~~~~~~~~~~~~~~~~~~~~~~~~

Test the statistical significance of results:

.. code-block:: python

   from scipy import stats

   # Compare two models on paired per-run scores
   model1_scores = [19.2, 18.8, 19.5, ...]
   model2_scores = [20.1, 19.7, 20.3, ...]

   t_stat, p_value = stats.ttest_rel(model1_scores, model2_scores)
   print(f"P-value: {p_value}")

Error Analysis
~~~~~~~~~~~~~~

Analyze common failure modes:

.. code-block:: bash

   # Analyze prediction errors
   python analyze_errors.py \
       -predictions output/predictions.json \
       -ground_truth data/test_annotations.json \
       -save_analysis error_analysis.html

**Analysis Categories**:

* Frequent false positives
* Common missed relationships
* Object detection failures
* Temporal inconsistencies

Visualization
~~~~~~~~~~~~~

Generate evaluation visualizations:

.. code-block:: bash

   # Create evaluation plots
   python visualize_results.py \
       -results_dir output/evaluation_results/ \
       -output_dir plots/

**Visualization Types**:

* Recall curves
* Precision-recall plots
* Confusion matrices
* Per-category performance bars

Benchmark Comparison
--------------------

Standard Benchmarks
~~~~~~~~~~~~~~~~~~~

Compare against established benchmarks:

.. list-table:: Action Genome Benchmark Results
   :widths: 20 15 15 15 15 20
   :header-rows: 1

   * - Model
     - R@10
     - R@20
     - R@50
     - mR@50
     - Year
   * - IMP
     - 8.9
     - 12.1
     - 17.8
     - 4.2
     - 2017
   * - KERN
     - 9.2
     - 12.7
     - 18.4
     - 4.8
     - 2019
   * - STTran
     - 14.6
     - 19.2
     - 26.5
     - 7.8
     - 2021
   * - Tempura
     - 15.8
     - 21.1
     - 28.3
     - 8.9
     - 2022

Leaderboard Submission
~~~~~~~~~~~~~~~~~~~~~~

Prepare results for benchmark submission:

.. code-block:: bash

   # Format results for submission
   python format_submission.py \
       -predictions output/test_predictions.json \
       -output submission.zip

Custom Evaluation
-----------------

Domain-Specific Metrics
~~~~~~~~~~~~~~~~~~~~~~~

Implement custom metrics for specific domains:

.. code-block:: python

   def custom_metric(predictions, ground_truth):
       """Template for a custom metric; the scoring helper is a placeholder
       to be implemented for the target domain."""
       score = compute_domain_specific_score(predictions, ground_truth)
       return score

Temporal Metrics
~~~~~~~~~~~~~~~~

Evaluate temporal consistency:

.. code-block:: python

   def similarity(frame_a, frame_b):
       """One reasonable choice: Jaccard overlap of the two frames' triplet sets."""
       a, b = set(frame_a), set(frame_b)
       return len(a & b) / len(a | b) if a | b else 1.0

   def temporal_consistency(predictions):
       """Mean similarity between consecutive frames' predictions."""
       consistency_score = 0.0
       for t in range(1, len(predictions)):
           consistency_score += similarity(predictions[t], predictions[t - 1])
       return consistency_score / (len(predictions) - 1)

Quality Assessment
~~~~~~~~~~~~~~~~~~

Assess overall scene graph quality:

.. code-block:: python

   def scene_graph_quality(prediction, ground_truth):
       """Average of graph-level similarities; the compute_* hooks are
       placeholders to be implemented per application."""
       node_similarity = compute_node_similarity(prediction, ground_truth)
       edge_similarity = compute_edge_similarity(prediction, ground_truth)
       structure_similarity = compute_structure_similarity(prediction, ground_truth)
       return (node_similarity + edge_similarity + structure_similarity) / 3

Evaluation Best Practices
-------------------------

Reproducibility
~~~~~~~~~~~~~~~

Ensure reproducible evaluation results:

.. code-block:: python

   import numpy as np
   import torch

   # Set random seeds for consistent evaluation
   torch.manual_seed(42)
   np.random.seed(42)

   # Use consistent evaluation protocols
   eval_config = {
       'batch_size': 1,
       'num_workers': 0,  # single-process data loading for reproducibility
       'deterministic': True,
   }

Multiple Runs
~~~~~~~~~~~~~

Perform multiple evaluation runs:

.. code-block:: bash

   # Run evaluation multiple times with different seeds
   for seed in 42 123 456 789 999; do
       python scripts/evaluation/test.py \
           -m predcls \
           -model_path output/model.pth \
           -seed $seed \
           -save_results results/run_${seed}.json
   done

Statistical Reporting
~~~~~~~~~~~~~~~~~~~~~

Report results with confidence intervals:

.. code-block:: python

   import numpy as np
   from scipy import stats

   # Calculate the mean and 95% confidence interval across runs
   scores = [19.2, 18.8, 19.5, 19.1, 19.3]
   mean_score = np.mean(scores)
   std_error = stats.sem(scores)
   ci = stats.t.interval(0.95, len(scores) - 1, loc=mean_score, scale=std_error)

   print(f"Recall@20: {mean_score:.1f} ± {std_error:.1f} "
         f"(95% CI: {ci[0]:.1f}-{ci[1]:.1f})")

Troubleshooting
---------------

Common Issues
~~~~~~~~~~~~~

**Low Evaluation Scores**

* Verify the ground truth data format
* Check the evaluation metric implementation
* Compare preprocessing with the training pipeline

**Inconsistent Results**

* Set random seeds for reproducibility
* Use the same data splits as training
* Verify the model loads correctly

**Memory Issues During Evaluation**

* Reduce the batch size to 1
* Process samples sequentially
* Clear the GPU cache between batches

Performance Debugging
~~~~~~~~~~~~~~~~~~~~~

**Slow Evaluation**

* Profile bottlenecks in the evaluation code
* Optimize the data loading pipeline
* Use a GPU for faster inference

**Unexpected Results**

* Visualize predictions against the ground truth
* Check for data leakage or preprocessing errors
* Validate against simple baselines

Evaluation Reports
------------------

Automated Reports
~~~~~~~~~~~~~~~~~

Generate comprehensive evaluation reports:

.. code-block:: bash

   # Generate an evaluation report
   python generate_report.py \
       -results output/evaluation_results.json \
       -template templates/evaluation_report.html \
       -output reports/model_evaluation.html

Report Contents
~~~~~~~~~~~~~~~

Standard evaluation reports include:

* **Model Information**: Architecture, parameters, training details
* **Dataset Statistics**: Test set size, class distribution
* **Quantitative Results**: All evaluation metrics with confidence intervals
* **Qualitative Analysis**: Visualization of predictions and failures
* **Comparison**: Performance relative to baselines and the state of the art

Next Steps
----------

* :doc:`training` - Return to training with evaluation insights
* :doc:`models` - Understand model architectures and their evaluation characteristics
* :doc:`api/lib` - API documentation for evaluation functions