Models
======

This section provides detailed information about the scene graph generation models implemented in M3SGG.

Model Overview
--------------

M3SGG implements several state-of-the-art models for video scene graph generation:

* **STTran**: Spatial-Temporal Transformer baseline
* **Tempura**: Temporal relationship modeling with uncertainty
* **SceneLLM**: Large language model integration
* **STKET**: Spatial-Temporal Knowledge-Enhanced Transformer
* **DSG-DETR**: Dynamic Scene Graph DETR
* **OED**: Object-Entity Disentanglement

Each model can be trained and evaluated in three modes: PredCLS (predicate classification), SGCLS (scene graph classification), and SGDET (scene graph detection).

STTran Model
------------

The Spatial-Temporal Transformer serves as the baseline model for video scene graph generation.

**Architecture**

STTran uses a transformer-based architecture to model spatial and temporal relationships:

* **Spatial Encoder**: Processes object features within individual frames
* **Temporal Encoder**: Models relationships across time
* **Relation Decoder**: Predicts relationship classifications

**Key Features**

* Multi-head attention for spatial relationship modeling
* Temporal attention for cross-frame relationship tracking
* Hierarchical object representation learning

**Usage**

.. code-block:: bash

   # Train STTran
   python scripts/training/training.py -model sttran -mode predcls -data_path data/action_genome

**Configuration**

.. code-block:: python

   sttran_config = {
       'spatial_layer': 4,
       'temporal_layer': 2,
       'hidden_dim': 512,
       'num_heads': 8,
       'dropout': 0.1
   }

Tempura Model
-------------

Tempura focuses on temporal relationship modeling with uncertainty quantification.

**Architecture**

* **Gaussian Mixture Models**: Model predictive uncertainty
* **Temporal Attention**: Enhanced temporal relationship understanding
* **Uncertainty Heads**: Quantify prediction confidence

**Key Features**

* Uncertainty-aware relationship prediction
* Improved temporal consistency
* Robustness to noisy annotations

**Usage**

.. code-block:: bash

   # Train Tempura
   python scripts/training/training.py -model tempura -mode predcls -data_path data/action_genome

**Configuration**

.. code-block:: python

   tempura_config = {
       'num_mixtures': 3,
       'uncertainty_threshold': 0.5,
       'temporal_window': 5,
       'gmm_regularization': 0.01
   }

SceneLLM Model
--------------

SceneLLM integrates large language models for enhanced scene understanding.

**Architecture**

* **Vision Encoder**: Processes visual features
* **Language Model**: Generates textual scene descriptions
* **Multimodal Fusion**: Combines visual and textual representations
* **Scene Graph Decoder**: Produces structured scene graphs

**Key Features**

* Natural language scene understanding
* Multimodal learning capabilities
* Zero-shot relationship recognition
* Textual scene graph generation

**Usage**

.. code-block:: bash

   # Train SceneLLM
   python scripts/training/training.py -model scenellm -mode predcls -data_path data/action_genome

**Configuration**

.. code-block:: python

   scenellm_config = {
       'llm_model': 'gemma3-270M',
       'vision_backbone': 'resnet101',
       'fusion_layers': 3,
       'text_generation': True
   }

STKET Model
-----------

The Spatial-Temporal Knowledge-Enhanced Transformer incorporates external knowledge into relationship prediction.

**Architecture**

* **Knowledge Graph Integration**: Incorporates external knowledge sources
* **Enhanced Attention**: Knowledge-guided attention mechanisms (a minimal sketch follows the usage example below)
* **Multi-scale Temporal Modeling**: Captures relationships at multiple temporal scales

**Key Features**

* External knowledge integration
* Improved relationship reasoning
* Multi-scale temporal analysis

**Usage**

.. code-block:: bash

   # Train STKET
   python scripts/training/training.py -model stket -mode predcls -data_path data/action_genome

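
The knowledge-guided attention referenced above is implemented inside the library; the following is only a minimal, self-contained sketch of the idea (not the actual M3SGG/STKET code), in which object features attend over a context formed by concatenating visual features with knowledge-graph embeddings:

.. code-block:: python

   import torch
   import torch.nn as nn

   class KnowledgeGuidedAttention(nn.Module):
       """Illustrative only: object queries attend over visual + knowledge context."""

       def __init__(self, hidden_dim=512, num_heads=8):
           super().__init__()
           self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

       def forward(self, obj_feats, knowledge_embeds):
           # obj_feats:        (B, N, D) per-object visual features
           # knowledge_embeds: (B, K, D) embeddings of retrieved knowledge entries
           context = torch.cat([obj_feats, knowledge_embeds], dim=1)  # (B, N+K, D)
           attended, _ = self.attn(query=obj_feats, key=context, value=context)
           return attended

   # Toy usage with random tensors
   layer = KnowledgeGuidedAttention()
   out = layer(torch.randn(2, 10, 512), torch.randn(2, 5, 512))
   print(out.shape)  # torch.Size([2, 10, 512])
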
Model Comparison
----------------

Performance Comparison
~~~~~~~~~~~~~~~~~~~~~~

.. list-table:: Model Performance on Action Genome (PredCLS)
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - Model
     - Recall@10
     - Recall@20
     - Recall@50
     - mRecall@50
   * - STTran
     - 14.6
     - 19.2
     - 26.5
     - 7.8
   * - Tempura
     - 15.8
     - 21.1
     - 28.3
     - 8.9
   * - SceneLLM
     - 16.2
     - 22.0
     - 30.1
     - 9.5
   * - STKET
     - 15.1
     - 20.5
     - 27.8
     - 8.3

Computational Requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table:: Computational Requirements
   :widths: 20 20 20 20 20
   :header-rows: 1

   * - Model
     - GPU Memory
     - Training Time
     - Inference Speed
     - Parameters
   * - STTran
     - 8 GB
     - 12 hours
     - 30 FPS
     - 45M
   * - Tempura
     - 10 GB
     - 15 hours
     - 25 FPS
     - 52M
   * - SceneLLM
     - 16 GB
     - 24 hours
     - 15 FPS
     - 270M
   * - STKET
     - 12 GB
     - 18 hours
     - 28 FPS
     - 58M

Model Selection Guidelines
--------------------------

Choose Based on Requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**For Speed and Efficiency**

* **STTran**: Best balance of speed and accuracy
* **Recommended for**: Real-time applications, limited resources

**For Accuracy**

* **SceneLLM**: Highest accuracy with language understanding
* **Recommended for**: Research, offline analysis

**For Uncertainty Quantification**

* **Tempura**: Built-in uncertainty estimation
* **Recommended for**: Safety-critical applications, quality control

**For Knowledge Integration**

* **STKET**: External knowledge incorporation
* **Recommended for**: Domain-specific applications, expert systems

Training Considerations
-----------------------

Hyperparameter Tuning
~~~~~~~~~~~~~~~~~~~~~

**Common Hyperparameters**

.. code-block:: python

   base_config = {
       'learning_rate': 1e-4,
       'batch_size': 1,
       'weight_decay': 1e-5,
       'gradient_clip': 5.0,
       'warmup_steps': 1000,
       'scheduler': 'cosine'
   }

**Model-Specific Tuning**

.. code-block:: python

   # STTran-specific search ranges
   sttran_tuning = {
       'spatial_layer': [2, 4, 6],
       'temporal_layer': [1, 2, 3],
       'hidden_dim': [256, 512, 1024]
   }

   # Tempura-specific search ranges
   tempura_tuning = {
       'num_mixtures': [2, 3, 5],
       'uncertainty_threshold': [0.3, 0.5, 0.7],
       'gmm_regularization': [0.001, 0.01, 0.1]
   }

Training Strategies
~~~~~~~~~~~~~~~~~~~

**Progressive Training**

1. Start with a simpler model (e.g., STTran)
2. Transfer knowledge to more complex models
3. Fine-tune on the target dataset

**Curriculum Learning**

1. Train on easier samples first
2. Gradually increase sample difficulty

This ordering typically improves convergence and final performance.

Model Customization
-------------------

Custom Model Development
~~~~~~~~~~~~~~~~~~~~~~~~

To implement a custom model:

.. code-block:: python

   from lib.base_model import BaseModel

   class CustomModel(BaseModel):
       def __init__(self, config):
           super().__init__(config)
           # Define model components here

       def forward(self, inputs):
           # Implement the forward pass
           pass

       def compute_loss(self, predictions, targets):
           # Implement the loss computation
           pass

**Integration Steps**

1. Implement the model class
2. Register it with the model factory
3. Update the configuration files
4. Test it with the existing pipeline

Transfer Learning
~~~~~~~~~~~~~~~~~

**Pretrained Models**

Download pretrained models for different datasets:

.. code-block:: bash

   # Download Action Genome pretrained models (append the checkpoint URL)
   wget -P data/checkpoints/

**Fine-tuning**

.. code-block:: python

   # Load a pretrained model (`load_model` and `fine_tune` are illustrative helpers)
   model = load_model('data/checkpoints/sttran_pretrained.pth')

   # Fine-tune on a new dataset
   fine_tune(model, new_dataset, epochs=10)

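
The ``load_model`` and ``fine_tune`` calls above are illustrative placeholders rather than fixed M3SGG APIs. Under standard PyTorch assumptions (a module that maps a batch to a scalar loss, and a dataset wrapped in a ``DataLoader``), a minimal fine-tuning loop might look like this:

.. code-block:: python

   import torch
   from torch.utils.data import DataLoader

   def fine_tune(model, dataset, epochs=10, lr=1e-5, device="cuda"):
       """Minimal sketch; adapt batch handling and the loss interface to the actual model."""
       model.to(device).train()
       loader = DataLoader(dataset, batch_size=1, shuffle=True)
       optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=1e-5)
       for _ in range(epochs):
           for batch in loader:
               # (move batch tensors to `device` as required by the model)
               optimizer.zero_grad()
               loss = model(batch)  # assumed to return a scalar loss for this sketch
               loss.backward()
               # Clip gradients, matching the `gradient_clip` value used above
               torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
               optimizer.step()
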
Deployment
----------

Model Export
~~~~~~~~~~~~

Export trained models for deployment:

.. code-block:: python

   import torch

   # Export to ONNX (requires a trained `model` and a representative `sample_input`)
   torch.onnx.export(model, sample_input, 'model.onnx')

   # Export to TorchScript
   traced_model = torch.jit.trace(model, sample_input)
   traced_model.save('model.pt')

Optimization
~~~~~~~~~~~~

**Model Quantization**

.. code-block:: python

   # Post-training dynamic quantization of linear layers
   quantized_model = torch.quantization.quantize_dynamic(
       model, {torch.nn.Linear}, dtype=torch.qint8
   )

**Model Pruning**

.. code-block:: python

   # Global unstructured (L1-magnitude) pruning of linear-layer weights
   from torch.nn.utils import prune

   parameters_to_prune = [
       (module, 'weight')
       for module in model.modules()
       if isinstance(module, torch.nn.Linear)
   ]
   prune.global_unstructured(
       parameters_to_prune,
       pruning_method=prune.L1Unstructured,
       amount=0.2
   )

Troubleshooting
---------------

Common Issues
~~~~~~~~~~~~~

**Training Instability**

* Reduce the learning rate
* Add gradient clipping
* Use mixed precision training

**Poor Performance**

* Check data preprocessing
* Verify the model configuration
* Compare against baseline results

**Memory Issues**

* Reduce the batch size
* Use gradient accumulation (see the sketch at the end of this page)
* Enable gradient checkpointing

Next Steps
----------

* :doc:`training` - Detailed training procedures
* :doc:`evaluation` - Model evaluation and metrics
* :doc:`api/models` - API documentation for models

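
As noted under **Memory Issues**, gradient accumulation simulates a larger effective batch size by summing gradients over several small batches before each optimizer step, keeping per-step memory low. The following is only an illustrative sketch, again assuming the model returns a scalar loss per batch:

.. code-block:: python

   def train_with_accumulation(model, loader, optimizer, accum_steps=4):
       """Illustrative sketch: accumulate gradients over `accum_steps` mini-batches."""
       model.train()
       optimizer.zero_grad()
       for step, batch in enumerate(loader):
           loss = model(batch) / accum_steps  # scale so accumulated gradients average out
           loss.backward()
           if (step + 1) % accum_steps == 0:
               optimizer.step()
               optimizer.zero_grad()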