Models
This section documents the various scene graph generation models implemented in the project.
STTran Model
- class m3sgg.core.models.sttran.sttran.ObjectClassifier(mode='sgdet', obj_classes=None)[source]
Bases:
Module
Module for computing object contexts and edge contexts in scene graphs.
Handles object classification and contextual feature extraction for spatial-temporal transformer-based scene graph generation.
- Parameters:
mode (str, optional) – Scene graph generation mode ('sgdet', 'sgcls', or 'predcls'), defaults to 'sgdet'
obj_classes (list, optional) – List of object class names, defaults to None
- forward(entry)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.sttran.sttran.STTran(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, enc_layer_num=None, dec_layer_num=None, window_size=4, trainPrior=None, use_spatial_prior=False, use_temporal_prior=False)[source]
Bases:
Module
- __init__(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, enc_layer_num=None, dec_layer_num=None, window_size=4, trainPrior=None, use_spatial_prior=False, use_temporal_prior=False)[source]
- Parameters:
obj_classes – Object classes
rel_classes – Relationship classes; None if relationship mode is not used
mode – Classification mode ('sgcls', 'predcls', or 'sgdet')
- forward(entry)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
Computes the relationship predictions for the input entry.
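A minimal construction sketch (illustrative only; the class lists below are placeholders that would normally come from the dataset):

    from m3sgg.core.models.sttran.sttran import STTran

    model = STTran(
        mode="predcls",  # one of: 'sgdet', 'sgcls', 'predcls'
        attention_class_num=3,
        spatial_class_num=6,
        contact_class_num=17,
        obj_classes=obj_classes,  # class-name list from the dataset
        rel_classes=rel_classes,  # relationship-name list from the dataset
        enc_layer_num=1,
        dec_layer_num=3,
    )
    # Call the module itself (not .forward) so registered hooks run:
    # pred = model(entry)  # `entry` is the detector output dict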
- class m3sgg.core.detectors.easg.sttran_EASG.ObjectClassifier(mode='edgecls', obj_classes=None)[source]
Bases:
Module
Module for computing object contexts and edge contexts for EASG.
EASG-specific implementation of object classification and contextual feature extraction for efficient scene graph generation.
- Parameters:
mode (str, optional) – Classification mode, defaults to 'edgecls'
obj_classes (list, optional) – List of object class names, defaults to None
- forward(entry)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.detectors.easg.sttran_EASG.ActionClassifier(mode='edgecls', verb_classes=None)[source]
Bases:
Module
- __init__(mode='edgecls', verb_classes=None)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(entry)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.detectors.easg.sttran_EASG.STTran(mode='edgecls', obj_classes=None, verb_classes=None, edge_class_num=None, enc_layer_num=None, dec_layer_num=None, use_visual_features=False)[source]
Bases:
Module
- __init__(mode='edgecls', obj_classes=None, verb_classes=None, edge_class_num=None, enc_layer_num=None, dec_layer_num=None, use_visual_features=False)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(entry)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
SceneLLM Model
SceneLLM main module with imports from distributed components.
This module provides access to all SceneLLM components through a unified interface. The implementation has been distributed across multiple files for better organization:
vqvae.py: VQ-VAE quantizer implementation
sia.py: Spatial Information Aggregator and hierarchical graph functions
ot.py: Optimal Transport codebook updater
llm.py: SceneLLM LoRA implementation
network.py: Main SceneLLM model and SGG decoder
TODO:
- Compare different clustering methods
- Improve the prompt template for the LLM
- Add a better LLM
- Improve the GCN architecture
- Use cross-entropy instead of MSE
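Because the components are re-exported here, downstream code can import everything from this single module rather than from the individual files; a sketch:

    # Unified interface: one import site regardless of file layout.
    from m3sgg.core.models.scenellm.scenellm import (
        SceneLLM,            # main model (network.py)
        SGGDecoder,          # SGG decoder (network.py)
        VQVAEQuantizer,      # vqvae.py
        SIA,                 # sia.py
        OTCodebookUpdater,   # ot.py
        SceneLLMLoRA,        # llm.py
        build_hierarchical_graph,  # sia.py
    )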
- class m3sgg.core.models.scenellm.scenellm.SceneLLM(cfg, dataset)[source]
Bases:
Module
SceneLLM model for scene graph generation with language model integration.
Combines VQ-VAE quantization, Spatial Information Aggregator (SIA), optimal transport codebook updates, and LoRA-adapted language models for advanced scene graph generation and description.
- __init__(cfg, dataset)[source]
Initialize the SceneLLM model.
- Parameters:
cfg (Config) – Configuration object containing model parameters
dataset (object) – Dataset information for model setup
- Returns:
None
- Return type:
None
- forward(entry)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.scenellm.scenellm.VQVAEQuantizer(input_dim=2048, dim=1024, codebook_size=8192, commitment_cost=0.25)[source]
Bases:
Module
Vector Quantized Variational AutoEncoder (VQ-VAE) quantizer.
Implements discrete latent space quantization for scene representations with codebook learning and commitment loss for stable training.
- __init__(input_dim=2048, dim=1024, codebook_size=8192, commitment_cost=0.25)[source]
Initialize the VQ-VAE quantizer.
- Parameters:
input_dim (int, optional) – Dimensionality of the input ROI features, defaults to 2048
dim (int, optional) – Dimensionality of the quantized latent space, defaults to 1024
codebook_size (int, optional) – Number of codebook entries, defaults to 8192
commitment_cost (float, optional) – Weight of the commitment loss term, defaults to 0.25
- Returns:
None
- Return type:
None
- forward(roi_feats)[source]
Forward pass through VQ-VAE quantizer.
- Parameters:
roi_feats (torch.Tensor) – ROI features tensor of shape [N, input_dim]
- Returns:
Tuple containing reconstructed features, reconstruction loss, embedding loss, and commitment loss
- Return type:
tuple
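A shape-level usage sketch, assuming the four-value return order documented above:

    import torch
    from m3sgg.core.models.scenellm.scenellm import VQVAEQuantizer

    quantizer = VQVAEQuantizer(input_dim=2048, dim=1024,
                               codebook_size=8192, commitment_cost=0.25)
    roi_feats = torch.randn(32, 2048)  # N = 32 ROI feature vectors

    # Return order follows the docstring: reconstructed features plus
    # reconstruction, embedding, and commitment losses.
    recon, recon_loss, embed_loss, commit_loss = quantizer(roi_feats)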
- class m3sgg.core.models.scenellm.scenellm.SIA(dim=1024)[source]
Bases:
Module
- __init__(dim=1024)[source]
Spatial Information Aggregator: embeds the (x, y, w, h) box coordinates and fuses them with the ROI tokens for spatial reasoning.
- forward(feats, boxes)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.scenellm.scenellm.OTCodebookUpdater(base_codebook, step=512, max_iterations=10)[source]
Bases:
object
Optimal Transport (OT) based updater for the VQ-VAE codebook.
- class m3sgg.core.models.scenellm.scenellm.SceneLLMLoRA(model_name, fallback_dim=None, r=16, alpha=32, dropout=0.05)[source]
Bases:
Module
SceneLLM with LoRA (Low-Rank Adaptation) for efficient fine-tuning.
Implements LoRA adaptation on language models for scene graph generation with fallback support when transformers library is unavailable.
- __init__(model_name, fallback_dim=None, r=16, alpha=32, dropout=0.05)[source]
Initialize SceneLLM with LoRA adaptation.
- Parameters:
model_name (str) – Name of the base language model (e.g., ‘google/gemma-2-2b’)
fallback_dim (int, optional) – Dimension for fallback when transformers unavailable, defaults to None
r (int, optional) – LoRA rank parameter, defaults to 16
alpha (int, optional) – LoRA alpha parameter, defaults to 32
dropout (float, optional) – LoRA dropout rate, defaults to 0.05
- Returns:
None
- Return type:
None
- forward(token_embeds)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
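A construction sketch using the documented defaults; the model name mirrors the example in the parameter description, and fallback_dim=1024 is a hypothetical value for the no-transformers fallback:

    from m3sgg.core.models.scenellm.scenellm import SceneLLMLoRA

    lora_llm = SceneLLMLoRA(
        model_name="google/gemma-2-2b",  # requires transformers + weights
        fallback_dim=1024,               # hypothetical fallback width
        r=16,
        alpha=32,
        dropout=0.05,
    )
    # token_embeds are the scene token embeddings fed to the LLM:
    # out = lora_llm(token_embeds)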
- class m3sgg.core.models.scenellm.scenellm.SGGDecoder(hidden_dim, attn_c, spat_c, cont_c)[source]
Bases:
Module
Scene Graph Generation decoder with transformer architecture.
Decodes hidden representations into attention, spatial, and contact relation predictions using transformer encoder and linear heads.
- Parameters:
hidden_dim (int) – Hidden feature dimension
attn_c (int) – Number of attention relation classes
spat_c (int) – Number of spatial relation classes
cont_c (int) – Number of contact relation classes
- forward(seq)[source]
Forward pass through the SGG decoder.
- Parameters:
seq (torch.Tensor) – Input sequence tensor of shape [B, T, D]
- Returns:
Dictionary containing attention, spatial, and contact predictions
- Return type:
dict
- m3sgg.core.models.scenellm.scenellm.build_hierarchical_graph(boxes)[source]
Build hierarchical graph from bounding boxes using hierarchical clustering.
Creates a graph structure from spatial relationships between bounding boxes using hierarchical clustering algorithms.
- Parameters:
boxes (torch.Tensor) – Tensor of normalized bounding boxes, shape [N, 4]
- Returns:
DGL graph or simple edge list based on hierarchical clustering
- Return type:
dgl.DGLGraph or dict
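A usage sketch with illustrative normalized boxes:

    import torch
    from m3sgg.core.models.scenellm.scenellm import build_hierarchical_graph

    # Normalized boxes in [0, 1], shape [N, 4]; values are made up.
    boxes = torch.tensor([
        [0.10, 0.20, 0.30, 0.40],
        [0.50, 0.55, 0.70, 0.80],
        [0.12, 0.22, 0.28, 0.38],
    ])
    graph = build_hierarchical_graph(boxes)
    # A dgl.DGLGraph when DGL is installed, otherwise an edge-list dict
    # (per the return type above).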
Main SceneLLM network and SGG decoder implementation. Credit to the authors of the original code: https://doi.org/10.1016/j.patcog.2025.111992.
- class m3sgg.core.models.scenellm.network.SceneLLM(cfg, dataset)[source]
Bases:
Module
SceneLLM model for scene graph generation with language model integration.
Combines VQ-VAE quantization, Spatial Information Aggregator (SIA), optimal transport codebook updates, and LoRA-adapted language models for advanced scene graph generation and description.
- __init__(cfg, dataset)[source]
Initialize the SceneLLM model.
- Parameters:
cfg (Config) – Configuration object containing model parameters
dataset (object) – Dataset information for model setup
- Returns:
None
- Return type:
None
- forward(entry)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.scenellm.network.SGGDecoder(hidden_dim, attn_c, spat_c, cont_c)[source]
Bases:
Module
Scene Graph Generation decoder with transformer architecture.
Decodes hidden representations into attention, spatial, and contact relation predictions using transformer encoder and linear heads.
- Parameters:
hidden_dim (int) – Hidden feature dimension
attn_c (int) – Number of attention relation classes
spat_c (int) – Number of spatial relation classes
cont_c (int) – Number of contact relation classes
- forward(seq)[source]
Forward pass through the SGG decoder.
- Parameters:
seq (torch.Tensor) – Input sequence tensor of shape [B, T, D]
- Returns:
Dictionary containing attention, spatial, and contact predictions
- Return type:
dict
LoRA implementation for SceneLLM. Credit to the authors of the original code: https://doi.org/10.1016/j.patcog.2025.111992.
- class m3sgg.core.models.scenellm.llm.SceneLLMLoRA(model_name, fallback_dim=None, r=16, alpha=32, dropout=0.05)[source]
Bases:
Module
SceneLLM with LoRA (Low-Rank Adaptation) for efficient fine-tuning.
Implements LoRA adaptation on language models for scene graph generation with fallback support when transformers library is unavailable.
- __init__(model_name, fallback_dim=None, r=16, alpha=32, dropout=0.05)[source]
Initialize SceneLLM with LoRA adaptation.
- Parameters:
model_name (str) – Name of the base language model (e.g., ‘google/gemma-2-2b’)
fallback_dim (int, optional) – Dimension for fallback when transformers unavailable, defaults to None
r (int, optional) – LoRA rank parameter, defaults to 16
alpha (int, optional) – LoRA alpha parameter, defaults to 32
dropout (float, optional) – LoRA dropout rate, defaults to 0.05
- Returns:
None
- Return type:
None
- forward(token_embeds)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
VQ-VAE Quantizer implementation for SceneLLM. Credit to the authors of the original code: https://doi.org/10.1016/j.patcog.2025.111992.
- class m3sgg.core.models.scenellm.vqvae.VQVAEQuantizer(input_dim=2048, dim=1024, codebook_size=8192, commitment_cost=0.25)[source]
Bases:
Module
Vector Quantized Variational AutoEncoder (VQ-VAE) quantizer.
Implements discrete latent space quantization for scene representations with codebook learning and commitment loss for stable training.
- __init__(input_dim=2048, dim=1024, codebook_size=8192, commitment_cost=0.25)[source]
Initialize the VQ-VAE quantizer.
- Parameters:
input_dim (int, optional) – Dimensionality of the input ROI features, defaults to 2048
dim (int, optional) – Dimensionality of the quantized latent space, defaults to 1024
codebook_size (int, optional) – Number of codebook entries, defaults to 8192
commitment_cost (float, optional) – Weight of the commitment loss term, defaults to 0.25
- Returns:
None
- Return type:
None
- forward(roi_feats)[source]
Forward pass through VQ-VAE quantizer.
- Parameters:
roi_feats (torch.Tensor) – ROI features tensor of shape [N, input_dim]
- Returns:
Tuple containing reconstructed features, reconstruction loss, embedding loss, and commitment loss
- Return type:
tuple
Optimal Transport Codebook Updater implementation for SceneLLM. Credit to the authors of the original code: https://doi.org/10.1016/j.patcog.2025.111992.
- class m3sgg.core.models.scenellm.ot.OTCodebookUpdater(base_codebook, step=512, max_iterations=10)[source]
Bases:
object
Optimal Transport (OT) based updater for the VQ-VAE codebook.
Spatial Information Aggregator (SIA) implementation for SceneLLM. Credit to the authors of the original code: https://doi.org/10.1016/j.patcog.2025.111992.
- m3sgg.core.models.scenellm.sia.build_hierarchical_graph(boxes)[source]
Build hierarchical graph from bounding boxes using hierarchical clustering.
Creates a graph structure from spatial relationships between bounding boxes using hierarchical clustering algorithms.
- Parameters:
boxes (torch.Tensor) – Tensor of normalized bounding boxes, shape [N, 4]
- Returns:
DGL graph or simple edge list based on hierarchical clustering
- Return type:
dgl.DGLGraph or dict
- class m3sgg.core.models.scenellm.sia.SIA(dim=1024)[source]
Bases:
Module
- __init__(dim=1024)[source]
Spatial Information Aggregator: embeds the (x, y, w, h) box coordinates and fuses them with the ROI tokens for spatial reasoning.
- forward(feats, boxes)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
Tempura Model
- class m3sgg.core.models.tempura.tempura.PositionalEncoding(d_model: int, dropout: float = 0.1, max_len: int = 5000)[source]
Bases:
Module
Positional encoding for transformer-based models.
Implements sinusoidal positional encoding to provide temporal information to transformer architectures for sequence modeling.
- Parameters:
d_model (int) – Embedding dimension of the model
dropout (float, optional) – Dropout rate, defaults to 0.1
max_len (int, optional) – Maximum sequence length, defaults to 5000
- __init__(d_model: int, dropout: float = 0.1, max_len: int = 5000)[source]
Initialize positional encoding.
- forward(x: Tensor, indices=None) → Tensor[source]
Apply positional encoding to input tensor.
- Parameters:
x (torch.Tensor) – Input tensor of shape [batch_size, seq_len, embedding_dim]
indices (torch.Tensor, optional) – Optional indices for position selection, defaults to None
- Returns:
Input tensor with positional encoding added
- Return type:
torch.Tensor
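For reference, the standard sinusoidal table this kind of encoding is built from (a sketch of the usual formulation; the in-repo implementation may differ in detail):

    import math
    import torch

    # PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    d_model, max_len = 8, 5000
    pos = torch.arange(max_len).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)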
- class m3sgg.core.models.tempura.tempura.ObjectClassifier(mode='sgdet', obj_head='gmm', K=4, obj_classes=None, mem_compute=None, selection=None, selection_lambda=0.5, tracking=None)[source]
Bases:
Module
Tempura object classifier for computing object and edge contexts.
Implements the Tempura model’s approach to object classification and contextual feature extraction with memory-augmented learning and uncertainty estimation capabilities.
- Parameters:
mode (str, optional) – Classification mode ('sgdet', 'sgcls', or 'predcls'), defaults to 'sgdet'
obj_head (str, optional) – Type of object classification head, defaults to 'gmm'
K (int, optional) – Number of Gaussian mixture components, defaults to 4
obj_classes (list, optional) – List of object class names, defaults to None
mem_compute (optional) – Memory computation setting, defaults to None
selection (optional) – Memory selection strategy, defaults to None
selection_lambda (float, optional) – Selection weighting factor, defaults to 0.5
tracking (optional) – Tracking setting, defaults to None
- __init__(mode='sgdet', obj_head='gmm', K=4, obj_classes=None, mem_compute=None, selection=None, selection_lambda=0.5, tracking=None)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(entry, phase='train', unc=False)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.tempura.tempura.TEMPURA(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, enc_layer_num=None, dec_layer_num=None, obj_mem_compute=None, rel_mem_compute=None, mem_fusion=None, selection=None, selection_lambda=0.5, take_obj_mem_feat=False, obj_head='gmm', rel_head='gmm', K=None, tracking=None)[source]
Bases:
Module
- __init__(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, enc_layer_num=None, dec_layer_num=None, obj_mem_compute=None, rel_mem_compute=None, mem_fusion=None, selection=None, selection_lambda=0.5, take_obj_mem_feat=False, obj_head='gmm', rel_head='gmm', K=None, tracking=None)[source]
- Parameters:
obj_classes – Object classes
rel_classes – Relationship classes; None if relationship mode is not used
mode – Classification mode ('sgcls', 'predcls', or 'sgdet')
- forward(entry, phase='train', unc=False)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.tempura.transformer_tempura.TransformerEncoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]
Bases:
Module
- __init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(src, input_key_padding_mask)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.tempura.transformer_tempura.TransformerDecoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]
Bases:
Module
- __init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(global_input, input_key_padding_mask, position_embed)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.tempura.transformer_tempura.TransformerEncoder(encoder_layer, num_layers)[source]
Bases:
Module
- __init__(encoder_layer, num_layers)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(input, input_key_padding_mask)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.tempura.transformer_tempura.TransformerDecoder(decoder_layer, num_layers, embed_dim)[source]
Bases:
Module
- __init__(decoder_layer, num_layers, embed_dim)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(global_input, input_key_padding_mask, position_embed)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.tempura.transformer_tempura.transformer(enc_layer_num=1, dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, mode=None, mem_compute=True, mem_fusion=None, selection=None, selection_lambda=0.5)[source]
Bases:
Module
Spatial Temporal Transformer.
- Parameters:
enc_layer_num (int, optional) – Number of encoder layers, defaults to 1
dec_layer_num (int, optional) – Number of decoder layers, defaults to 3
embed_dim (int, optional) – Embedding dimension, defaults to 1936
nhead (int, optional) – Number of attention heads, defaults to 8
dim_feedforward (int, optional) – Feed-forward dimension, defaults to 2048
dropout (float, optional) – Dropout rate, defaults to 0.1
mode (str, optional) – Classification mode, defaults to None
mem_compute (bool, optional) – Whether to compute memory features, defaults to True
mem_fusion (optional) – Memory fusion strategy, defaults to None
selection (optional) – Memory selection strategy, defaults to None
selection_lambda (float, optional) – Selection weighting factor, defaults to 0.5
- __init__(enc_layer_num=1, dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, mode=None, mem_compute=True, mem_fusion=None, selection=None, selection_lambda=0.5)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(features, im_idx, memory=[])[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.tempura.gmm_heads.GMM_head(hid_dim, num_classes, rel_type=None, k=4)[source]
Bases:
Module
Gaussian Mixture Model head for uncertainty estimation in Tempura.
Implements a GMM-based classification head that models uncertainty through multiple Gaussian components with learnable means, variances, and mixture weights.
- Parameters:
hid_dim (int) – Hidden feature dimension
num_classes (int) – Number of output classes
rel_type (str, optional) – Relationship type, defaults to None
k (int, optional) – Number of Gaussian mixture components, defaults to 4
- uncertainty(conf_mu_k, conf_var_k, conf_pi_k_)[source]
Compute epistemic and aleatoric uncertainty.
- Parameters:
conf_mu_k (torch.Tensor) – Means of the K Gaussian components
conf_var_k (torch.Tensor) – Variances of the K Gaussian components
conf_pi_k_ (torch.Tensor) – Mixture weights of the K Gaussian components
- Returns:
Tuple containing prediction, aleatoric uncertainty, and epistemic uncertainty
- Return type:
tuple
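A sketch of the standard GMM uncertainty decomposition such a head typically uses (an assumption for illustration, not code from the repository): aleatoric uncertainty is the mixture-weighted average of per-component variances, and epistemic uncertainty is the mixture-weighted spread of component means around the mixture mean.

    import torch

    def gmm_uncertainty(mu_k, var_k, pi_k):
        # mu_k, var_k, pi_k: [K, C]; pi_k sums to 1 over the K components.
        pred = (pi_k * mu_k).sum(dim=0)                     # mixture mean
        aleatoric = (pi_k * var_k).sum(dim=0)               # expected variance
        epistemic = (pi_k * (mu_k - pred) ** 2).sum(dim=0)  # variance of means
        return pred, aleatoric, epistemic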
- forward(x, phase='train', unc=False)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
STKET Model
Spatio-Temporal Knowledge-Enhanced Transformer (STKET) for Scene Graph Generation.
This module implements the STKET model for video scene graph generation, combining spatial and temporal reasoning with transformer architectures.
- class m3sgg.core.models.stket.stket.ObjectClassifier(mode='sgdet', obj_classes=None)[source]
Bases:
Module
Module for computing object contexts and edge contexts in scene graphs.
Handles object classification and contextual feature extraction for spatial-temporal transformer-based scene graph generation.
- Parameters:
mode (str, optional) – Scene graph generation mode ('sgdet', 'sgcls', or 'predcls'), defaults to 'sgdet'
obj_classes (list, optional) – List of object class names, defaults to None
- class m3sgg.core.models.stket.stket.STKET(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, N_layer_num=1, enc_layer_num=None, dec_layer_num=None, pred_contact_threshold=0.5, window_size=4, trainPrior=None, use_spatial_prior=False, use_temporal_prior=False)[source]
Bases:
Module
Spatio-Temporal Knowledge-Enhanced Transformer for Scene Graph Generation.
Implements the STKET model that combines spatial and temporal reasoning with transformer architectures for video scene graph generation.
- __init__(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, N_layer_num=1, enc_layer_num=None, dec_layer_num=None, pred_contact_threshold=0.5, window_size=4, trainPrior=None, use_spatial_prior=False, use_temporal_prior=False)[source]
Initialize the STKET model.
- Parameters:
mode (str, optional) – Classification mode (‘sgdet’, ‘sgcls’, ‘predcls’), defaults to “sgdet”
attention_class_num (int, optional) – Number of attention relationship classes, defaults to None
spatial_class_num (int, optional) – Number of spatial relationship classes, defaults to None
contact_class_num (int, optional) – Number of contact relationship classes, defaults to None
obj_classes (list, optional) – List of object class names, defaults to None
rel_classes (list, optional) – List of relationship class names, defaults to None
N_layer_num (int, optional) – Number of transformer layers, defaults to 1
enc_layer_num (int, optional) – Number of encoder layers, defaults to None
dec_layer_num (int, optional) – Number of decoder layers, defaults to None
pred_contact_threshold (float, optional) – Contact prediction threshold, defaults to 0.5
window_size (int, optional) – Temporal window size, defaults to 4
trainPrior (dict, optional) – Training prior information, defaults to None
use_spatial_prior (bool, optional) – Whether to use spatial priors, defaults to False
use_temporal_prior (bool, optional) – Whether to use temporal priors, defaults to False
- Returns:
None
- Return type:
None
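A construction sketch; the relationship-class counts match the defaults used by the transformer_stket components documented below, and the class lists would come from the dataset:

    from m3sgg.core.models.stket.stket import STKET

    model = STKET(
        mode="sgdet",
        attention_class_num=3,
        spatial_class_num=6,
        contact_class_num=17,
        obj_classes=obj_classes,
        rel_classes=rel_classes,
        enc_layer_num=1,
        dec_layer_num=3,
        pred_contact_threshold=0.5,
        window_size=4,
    )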
- class m3sgg.core.models.stket.transformer_stket.TransformerEncoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]
Bases:
Module
STKET transformer encoder layer with prior knowledge integration.
Implements transformer encoder with additional prior knowledge integration for spatio-temporal knowledge-enhanced scene graph generation.
- __init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]
Initialize STKET transformer encoder layer.
- Parameters:
embed_dim (int, optional) – Embedding dimension, defaults to 1936
nhead (int, optional) – Number of attention heads, defaults to 4
dim_feedforward (int, optional) – Feed-forward dimension, defaults to 2048
dropout (float, optional) – Dropout rate, defaults to 0.1
- Returns:
None
- Return type:
None
- forward(src, prior, input_key_padding_mask)[source]
Forward pass with prior knowledge integration.
- Parameters:
src (torch.Tensor) – Source sequence tensor
prior (torch.Tensor) – Prior knowledge tensor
input_key_padding_mask (torch.Tensor) – Padding mask
- Returns:
Tuple containing output tensor and attention weights
- Return type:
tuple
- class m3sgg.core.models.stket.transformer_stket.TransformerDecoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]
Bases:
Module
- __init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(src, prior, input_key_padding_mask, position_embed)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.stket.transformer_stket.TransformerEncoder(encoder_layer, num_layers)[source]
Bases:
Module
- __init__(encoder_layer, num_layers)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(input, prior, input_key_padding_mask)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.stket.transformer_stket.TransformerDecoder(decoder_layer, num_layers, embed_dim)[source]
Bases:
Module
- __init__(decoder_layer, num_layers, embed_dim)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(input, prior, input_key_padding_mask, position_embed)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.stket.transformer_stket.spatial_encoder(enc_layer_num=1, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, trainPrior=None, use_spatial_prior=False, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]
Bases:
Module
- __init__(enc_layer_num=1, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, trainPrior=None, use_spatial_prior=False, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(features, im_idx, entry, mode)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.stket.transformer_stket.temporal_decoder(dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, pred_contact_threshold=0.5, trainPrior=None, use_temporal_prior=False, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]
Bases:
Module
- __init__(dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, pred_contact_threshold=0.5, trainPrior=None, use_temporal_prior=False, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(features, contact_distribution, im_idx, entry, mode)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.core.models.stket.transformer_stket.ensemble_decoder(dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, pred_contact_threshold=0.5, window_size=3, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]
Bases:
Module
- __init__(dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, pred_contact_threshold=0.5, window_size=3, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(spatial_features, temporal_features, contact_distribution, im_idx, entry, mode)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
OED Model
OED Multi-frame Model Implementation
This module implements the OED architecture for multi-frame dynamic scene graph generation with the Progressively Refined Module (PRM) for temporal context aggregation.
- class m3sgg.core.models.oed.oed_multi.OEDMulti(conf, dataset)[source]
Bases:
Module
OED Multi-frame model for dynamic scene graph generation.
Implements the one-stage end-to-end framework with cascaded decoders and Progressively Refined Module (PRM) for temporal context aggregation.
- Parameters:
conf (Config) – Configuration object
dataset (object) – Dataset information for model setup
- class m3sgg.core.models.oed.oed_multi.MLP(input_dim, hidden_dim, output_dim, num_layers)[source]
Bases:
Module
Multi-layer perceptron for bounding box prediction.
- Parameters:
input_dim (int) – Input dimension
hidden_dim (int) – Hidden dimension
output_dim (int) – Output dimension
num_layers (int) – Number of layers
- forward(x)[source]
Forward pass through the MLP.
- Parameters:
x (torch.Tensor) – Input tensor
- Returns:
Output tensor
- Return type:
torch.Tensor
- m3sgg.core.models.oed.oed_multi.build_oed_multi(conf, dataset)[source]
Build OED multi-frame model.
OED Single-frame Model Implementation
This module implements the OED architecture for single-frame scene graph generation as a baseline for comparison with the multi-frame variant.
- class m3sgg.core.models.oed.oed_single.OEDSingle(conf, dataset)[source]
Bases:
Module
OED Single-frame model for scene graph generation.
Implements the one-stage end-to-end framework with cascaded decoders for single-frame processing.
- Parameters:
conf (Config) – Configuration object
dataset (object) – Dataset information for model setup
- class m3sgg.core.models.oed.oed_single.MLP(input_dim, hidden_dim, output_dim, num_layers)[source]
Bases:
Module
Multi-layer perceptron for bounding box prediction.
- Parameters:
input_dim (int) – Input dimension
hidden_dim (int) – Hidden dimension
output_dim (int) – Output dimension
num_layers (int) – Number of layers
- forward(x)[source]
Forward pass through the MLP.
- Parameters:
x (torch.Tensor) – Input tensor
- Returns:
Output tensor
- Return type:
torch.Tensor
- m3sgg.core.models.oed.oed_single.build_oed_single(conf, dataset)[source]
Build OED single-frame model.
Transformer module for OED model.
This module implements the cascaded decoders and transformer architecture for the OED model.
- class m3sgg.core.models.oed.transformer.Transformer(conf)[source]
Bases:
Module
Transformer with cascaded decoders for OED.
- __init__(conf)[source]
Initialize the transformer.
- Parameters:
conf (Config) – Configuration object
- Returns:
None
- Return type:
None
- forward(src, mask, query_embed, pos_embed, embed_dict=None, targets=None, cur_idx=0)[source]
Forward pass through the transformer.
- Parameters:
src (torch.Tensor) – Source features
mask (torch.Tensor) – Attention mask
query_embed (torch.Tensor) – Query embeddings
pos_embed (torch.Tensor) – Position embeddings
embed_dict (dict, optional) – Dictionary of embedding layers, defaults to None
targets (dict, optional) – Ground truth targets, defaults to None
cur_idx (int, optional) – Current frame index, defaults to 0
- Returns:
Tuple of outputs
- Return type:
tuple
- class m3sgg.core.models.oed.transformer.TransformerEncoder(encoder_layer, num_layers, norm=None)[source]
Bases:
Module
Transformer encoder.
- __init__(encoder_layer, num_layers, norm=None)[source]
Initialize the encoder.
- Parameters:
encoder_layer (TransformerEncoderLayer) – Encoder layer
num_layers (int) – Number of layers
norm (nn.Module, optional) – Normalization layer, defaults to None
- Returns:
None
- Return type:
None
- forward(src, mask=None, src_key_padding_mask=None, pos=None)[source]
Forward pass through encoder.
- Parameters:
src (torch.Tensor) – Source features
mask (torch.Tensor, optional) – Attention mask, defaults to None
src_key_padding_mask (torch.Tensor, optional) – Key padding mask, defaults to None
pos (torch.Tensor, optional) – Position embeddings, defaults to None
- Returns:
Encoded features
- Return type:
torch.Tensor
- class m3sgg.core.models.oed.transformer.TransformerDecoder(decoder_layer, num_layers, norm=None)[source]
Bases:
Module
Transformer decoder.
- __init__(decoder_layer, num_layers, norm=None)[source]
Initialize the decoder.
- Parameters:
decoder_layer (TransformerDecoderLayer) – Decoder layer
num_layers (int) – Number of layers
norm (nn.Module, optional) – Normalization layer, defaults to None
- Returns:
None
- Return type:
None
- forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos=None, query_pos=None)[source]
Forward pass through decoder.
- Parameters:
tgt (torch.Tensor) – Target features
memory (torch.Tensor) – Memory features from encoder
tgt_mask (torch.Tensor, optional) – Target mask, defaults to None
memory_mask (torch.Tensor, optional) – Memory mask, defaults to None
tgt_key_padding_mask (torch.Tensor, optional) – Target key padding mask, defaults to None
memory_key_padding_mask (torch.Tensor, optional) – Memory key padding mask, defaults to None
pos (torch.Tensor, optional) – Position embeddings, defaults to None
query_pos (torch.Tensor, optional) – Query position embeddings, defaults to None
- Returns:
Decoded features
- Return type:
torch.Tensor
- class m3sgg.core.models.oed.transformer.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='relu', pre_norm=False)[source]
Bases:
Module
Transformer encoder layer.
- __init__(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='relu', pre_norm=False)[source]
Initialize the encoder layer.
- Parameters:
d_model (int) – Model dimension
nhead (int) – Number of attention heads
dim_feedforward (int, optional) – Feedforward dimension, defaults to 2048
dropout (float, optional) – Dropout rate, defaults to 0.1
activation (str, optional) – Activation function, defaults to “relu”
pre_norm (bool, optional) – Whether to use pre-norm, defaults to False
- Returns:
None
- Return type:
None
- forward(src, src_mask=None, src_key_padding_mask=None, pos=None)[source]
Forward pass through encoder layer.
- Parameters:
src (torch.Tensor) – Source features
src_mask (torch.Tensor, optional) – Source mask, defaults to None
src_key_padding_mask (torch.Tensor, optional) – Source key padding mask, defaults to None
pos (torch.Tensor, optional) – Position embeddings, defaults to None
- Returns:
Output features
- Return type:
torch.Tensor
- class m3sgg.core.models.oed.transformer.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='relu', pre_norm=False)[source]
Bases:
Module
Transformer decoder layer.
- __init__(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='relu', pre_norm=False)[source]
Initialize the decoder layer.
- Parameters:
d_model (int) – Model dimension
nhead (int) – Number of attention heads
dim_feedforward (int, optional) – Feedforward dimension, defaults to 2048
dropout (float, optional) – Dropout rate, defaults to 0.1
activation (str, optional) – Activation function, defaults to “relu”
pre_norm (bool, optional) – Whether to use pre-norm, defaults to False
- Returns:
None
- Return type:
None
- forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos=None, query_pos=None)[source]
Forward pass through decoder layer.
- Parameters:
tgt (torch.Tensor) – Target features
memory (torch.Tensor) – Memory features
tgt_mask (torch.Tensor, optional) – Target mask, defaults to None
memory_mask (torch.Tensor, optional) – Memory mask, defaults to None
tgt_key_padding_mask (torch.Tensor, optional) – Target key padding mask, defaults to None
memory_key_padding_mask (torch.Tensor, optional) – Memory key padding mask, defaults to None
pos (torch.Tensor, optional) – Position embeddings, defaults to None
query_pos (torch.Tensor, optional) – Query position embeddings, defaults to None
- Returns:
Output features
- Return type:
torch.Tensor
- m3sgg.core.models.oed.transformer.build_transformer(conf)[source]
Build transformer network.
- Parameters:
conf (Config) – Configuration object
- Returns:
Transformer network
- Return type:
Transformer
Criterion module for OED model.
This module implements the loss functions for training the OED model.
- class m3sgg.core.models.oed.criterion.SetCriterionOED(num_obj_classes, num_queries, matcher, weight_dict, eos_coef, losses, conf)[source]
Bases:
Module
Loss criterion for OED model.
- __init__(num_obj_classes, num_queries, matcher, weight_dict, eos_coef, losses, conf)[source]
Initialize the criterion.
- Parameters:
num_obj_classes (int) – Number of object classes
num_queries (int) – Number of object queries
matcher (nn.Module) – Module that matches predictions to targets
weight_dict (dict) – Weights for the individual loss terms
eos_coef (float) – Relative classification weight applied to the no-object class
losses (list) – Names of the losses to apply
conf (Config) – Configuration object
- Returns:
None
- Return type:
None
- loss_obj_labels(outputs, targets, indices, num_interactions, log=True)[source]
Object classification loss.
- loss_relation_labels(outputs, targets, indices, num_interactions)[source]
Relation classification loss.
- m3sgg.core.models.oed.criterion.accuracy(output, target, topk=(1,))[source]
Compute accuracy.
- Parameters:
output (torch.Tensor) – Model output
target (torch.Tensor) – Ground truth target
topk (tuple, optional) – Top-k accuracy, defaults to (1,)
- Returns:
Tuple of accuracies
- Return type:
tuple
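The usual top-k computation behind such a helper (a sketch of the standard pattern, not necessarily the exact in-repo code):

    import torch

    def topk_accuracy(output, target, topk=(1,)):
        # output: [N, num_classes] logits; target: [N] class indices.
        maxk = max(topk)
        _, pred = output.topk(maxk, dim=1)      # top-k predicted classes
        correct = pred.eq(target.view(-1, 1))   # [N, maxk] hit mask
        return tuple(
            correct[:, :k].any(dim=1).float().mean() * 100.0 for k in topk
        )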
Postprocessing module for OED model.
This module handles the post-processing of model outputs for evaluation.
- class m3sgg.core.models.oed.postprocess.PostProcessOED(conf)[source]
Bases:
Module
Post-processing for OED model outputs.
- __init__(conf)[source]
Initialize the post-processor.
- Parameters:
conf (Config) – Configuration object
- Returns:
None
- Return type:
None
- forward(outputs, target_sizes)[source]
Post-process model outputs.
- Parameters:
outputs (dict) – Model outputs
target_sizes (torch.Tensor) – Target image sizes
- Returns:
List of processed predictions
- Return type:
list
Utilities module for OED model.
This module provides utility functions for box operations and other helpers.
- m3sgg.core.models.oed.utils.box_cxcywh_to_xyxy(x)[source]
Convert boxes from center format (cx, cy, w, h) to corner format (x1, y1, x2, y2).
- Parameters:
x (torch.Tensor) – Boxes in center format
- Returns:
Boxes in corner format
- Return type:
torch.Tensor
- m3sgg.core.models.oed.utils.box_xyxy_to_cxcywh(x)[source]
Convert boxes from corner format (x1, y1, x2, y2) to center format (cx, cy, w, h).
- Parameters:
x (torch.Tensor) – Boxes in corner format
- Returns:
Boxes in center format
- Return type:
torch.Tensor
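The two conversions are inverses of each other; a sketch of the standard formulas:

    import torch

    def cxcywh_to_xyxy(b):
        cx, cy, w, h = b.unbind(-1)
        return torch.stack([cx - w / 2, cy - h / 2,
                            cx + w / 2, cy + h / 2], dim=-1)

    def xyxy_to_cxcywh(b):
        x1, y1, x2, y2 = b.unbind(-1)
        return torch.stack([(x1 + x2) / 2, (y1 + y2) / 2,
                            x2 - x1, y2 - y1], dim=-1)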
- m3sgg.core.models.oed.utils.generalized_box_iou(boxes1, boxes2)[source]
Generalized Intersection over Union between two sets of boxes.
- Parameters:
boxes1 (torch.Tensor) – First set of boxes
boxes2 (torch.Tensor) – Second set of boxes
- Returns:
Generalized IoU matrix
- Return type:
torch.Tensor
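For reference, the standard definition is GIoU(b1, b2) = IoU(b1, b2) − |C \ (b1 ∪ b2)| / |C|, where C is the smallest axis-aligned box enclosing both b1 and b2; values lie in [−1, 1].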
- m3sgg.core.models.oed.utils.box_area(boxes)[source]
Compute the area of a set of bounding boxes.
- Parameters:
boxes (torch.Tensor) – Bounding boxes in (x1, y1, x2, y2) format
- Returns:
Areas of the boxes
- Return type:
torch.Tensor
- m3sgg.core.models.oed.utils.nested_tensor_from_tensor_list(tensor_list)[source]
Convert a list of tensors to a nested tensor.
- Parameters:
tensor_list (list) – List of tensors
- Returns:
Nested tensor
- Return type:
NestedTensor
- class m3sgg.core.models.oed.utils.NestedTensor(tensors, mask)[source]
Bases:
object
Nested tensor wrapper for efficient processing.
- __init__(tensors, mask)[source]
Initialize the nested tensor.
- Parameters:
tensors (torch.Tensor) – Input tensors
mask (torch.Tensor) – Mask tensor
- Returns:
None
- Return type:
None
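A usage sketch of the padding behaviour these helpers typically implement (an assumption based on the common DETR-style pattern, not confirmed by the source): variable-size tensors are zero-padded to a shared shape, and the mask records which positions are padding.

    import torch
    from m3sgg.core.models.oed.utils import nested_tensor_from_tensor_list

    imgs = [torch.randn(3, 480, 640), torch.randn(3, 512, 600)]
    nested = nested_tensor_from_tensor_list(imgs)
    # Expected: nested.tensors has shape [2, 3, 512, 640] (zero-padded),
    # and nested.mask has shape [2, 512, 640], flagging padded positions.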
VLM Model
- class m3sgg.core.models.vlm.scene_graph_generator.VLMSceneGraphGenerator(mode='sgdet', attention_class_num=3, spatial_class_num=6, contact_class_num=17, obj_classes=None, model_name='apple/FastVLM-0.5B', device='cuda', few_shot_examples=None, use_chain_of_thought=True, use_tree_of_thought=False, confidence_threshold=0.5)[source]
Bases:
Module
- __init__(mode='sgdet', attention_class_num=3, spatial_class_num=6, contact_class_num=17, obj_classes=None, model_name='apple/FastVLM-0.5B', device='cuda', few_shot_examples=None, use_chain_of_thought=True, use_tree_of_thought=False, confidence_threshold=0.5)[source]
Initialize VLM Scene Graph Generator.
- Parameters:
mode – Scene graph generation mode (sgdet, sgcls, predcls)
attention_class_num – Number of attention relationship classes
spatial_class_num – Number of spatial relationship classes
contact_class_num – Number of contact relationship classes
obj_classes – List of object classes
model_name – HuggingFace model name for VLM
device – Device to run inference on
few_shot_examples – Few-shot examples for prompting
use_chain_of_thought – Whether to use chain-of-thought reasoning
use_tree_of_thought – Whether to use tree-of-thought reasoning
confidence_threshold – Threshold for relationship confidence
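A construction sketch using the documented defaults; running inference additionally requires the apple/FastVLM-0.5B weights to be available:

    from m3sgg.core.models.vlm.scene_graph_generator import VLMSceneGraphGenerator

    generator = VLMSceneGraphGenerator(
        mode="sgdet",
        obj_classes=obj_classes,  # class-name list from the dataset
        model_name="apple/FastVLM-0.5B",
        device="cuda",
        use_chain_of_thought=True,
        confidence_threshold=0.5,
    )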
Transformer Components
- class m3sgg.utils.transformer.TransformerEncoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]
Bases:
Module
Transformer encoder layer with multi-head attention and feed-forward network.
Implements a single layer of the transformer encoder with self-attention mechanism, layer normalization, and position-wise feed-forward network.
- __init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]
Initialize the transformer encoder layer.
- Parameters:
embed_dim (int, optional) – Embedding dimension, defaults to 1936
nhead (int, optional) – Number of attention heads, defaults to 4
dim_feedforward (int, optional) – Feed-forward dimension, defaults to 2048
dropout (float, optional) – Dropout rate, defaults to 0.1
- Returns:
None
- Return type:
None
- forward(src, input_key_padding_mask)[source]
Forward pass through the transformer encoder layer.
- Parameters:
src (torch.Tensor) – Source sequence tensor
input_key_padding_mask (torch.Tensor) – Mask for padding tokens
- Returns:
Transformed sequence and attention weights
- Return type:
tuple
- class m3sgg.utils.transformer.TransformerDecoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]
Bases:
Module
Transformer decoder layer with masked self-attention and cross-attention.
Implements a single layer of the transformer decoder with masked self-attention, encoder-decoder attention, and position-wise feed-forward network.
- __init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]
Initialize the transformer decoder layer.
- Parameters:
embed_dim (int, optional) – Embedding dimension, defaults to 1936
nhead (int, optional) – Number of attention heads, defaults to 4
dim_feedforward (int, optional) – Feed-forward dimension, defaults to 2048
dropout (float, optional) – Dropout rate, defaults to 0.1
- Returns:
None
- Return type:
None
- forward(global_input, input_key_padding_mask, position_embed)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.utils.transformer.TransformerEncoder(encoder_layer, num_layers)[source]
Bases:
Module
- __init__(encoder_layer, num_layers)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(input, input_key_padding_mask)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.utils.transformer.TransformerDecoder(decoder_layer, num_layers, embed_dim)[source]
Bases:
Module
- __init__(decoder_layer, num_layers, embed_dim)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(global_input, input_key_padding_mask, position_embed)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
- class m3sgg.utils.transformer.transformer(enc_layer_num=1, dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, mode=None)[source]
Bases:
Module
Spatial Temporal Transformer.
- Parameters:
enc_layer_num (int, optional) – Number of encoder layers, defaults to 1
dec_layer_num (int, optional) – Number of decoder layers, defaults to 3
embed_dim (int, optional) – Embedding dimension, defaults to 1936
nhead (int, optional) – Number of attention heads, defaults to 8
dim_feedforward (int, optional) – Feed-forward dimension, defaults to 2048
dropout (float, optional) – Dropout rate, defaults to 0.1
mode (str, optional) – Classification mode, defaults to None
- __init__(enc_layer_num=1, dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, mode=None)[source]
Initialize internal Module state, shared by both nn.Module and ScriptModule.
- forward(features, im_idx)[source]
Define the computation performed at every call. Should be overridden by all subclasses.
Note: although the forward-pass recipe must be defined within this function, one should call the Module instance afterwards instead of this method, since the former takes care of running the registered hooks while the latter silently ignores them.
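A construction sketch with the documented defaults; here im_idx is assumed to be the index mapping each feature to its frame, per the forward signature:

    from m3sgg.utils.transformer import transformer

    st_transformer = transformer(
        enc_layer_num=1,
        dec_layer_num=3,
        embed_dim=1936,
        nhead=8,
        dim_feedforward=2048,
        dropout=0.1,
    )
    # out = st_transformer(features, im_idx)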