Models

This section documents the various scene graph generation models implemented in the project.

STTran Model

class m3sgg.core.models.sttran.sttran.ObjectClassifier(mode='sgdet', obj_classes=None)[source]

Bases: Module

Module for computing object contexts and edge contexts in scene graphs.

Handles object classification and contextual feature extraction for spatial-temporal transformer-based scene graph generation.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(mode='sgdet', obj_classes=None)[source]

Initialize the object classifier.

Parameters:
  • mode (str, optional) – Classification mode (‘predcls’, ‘sgcls’, ‘sgdet’), defaults to “sgdet”

  • obj_classes (list, optional) – List of object class names, defaults to None

Returns:

None

Return type:

None

clean_class(entry, b, class_idx)[source]
forward(entry)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.sttran.sttran.STTran(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, enc_layer_num=None, dec_layer_num=None, window_size=4, trainPrior=None, use_spatial_prior=False, use_temporal_prior=False)[source]

Bases: Module

__init__(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, enc_layer_num=None, dec_layer_num=None, window_size=4, trainPrior=None, use_spatial_prior=False, use_temporal_prior=False)[source]
Parameters:
  • classes – Object classes

  • rel_classes – Relationship classes; None if relationship mode is not used

  • mode – Classification mode ('sgcls', 'predcls', or 'sgdet')

forward(entry)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Computes the pairwise subject-object relationship representations.
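The usage sketch below shows how STTran is typically constructed. The relationship class counts (3 attention, 6 spatial, 17 contact) follow the Action Genome defaults documented for the STKET transformer later in this section and are assumptions here; the class-name lists are placeholders, and the constructor may require additional dataset assets (e.g., word-embedding files).

# Illustrative usage sketch, not an official example. Class counts and class
# names are placeholder assumptions; the constructor may load dataset assets
# such as word embeddings.
from m3sgg.core.models.sttran.sttran import STTran

obj_classes = ["__background__", "person", "cup"]          # placeholder object names
rel_classes = ["looking_at", "in_front_of", "holding"]     # placeholder relation names

model = STTran(
    mode="sgdet",
    attention_class_num=3,
    spatial_class_num=6,
    contact_class_num=17,
    obj_classes=obj_classes,
    rel_classes=rel_classes,
    enc_layer_num=1,
    dec_layer_num=3,
    window_size=4,
)

# `entry` is the dictionary produced by the object detector (boxes, ROI
# features, frame indices, ...); the forward pass adds the relationship
# distributions to it:
# entry = model(entry)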

class m3sgg.core.detectors.easg.sttran_EASG.ObjectClassifier(mode='edgecls', obj_classes=None)[source]

Bases: Module

Module for computing object contexts and edge contexts for EASG.

EASG-specific implementation of object classification and contextual feature extraction for efficient scene graph generation.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(mode='edgecls', obj_classes=None)[source]

Initialize the EASG object classifier.

Parameters:
  • mode (str, optional) – Classification mode, defaults to “edgecls”

  • obj_classes (list, optional) – List of object class names, defaults to None

Returns:

None

Return type:

None

forward(entry)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.detectors.easg.sttran_EASG.ActionClassifier(mode='edgecls', verb_classes=None)[source]

Bases: Module

__init__(mode='edgecls', verb_classes=None)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(entry)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.detectors.easg.sttran_EASG.STTran(mode='edgecls', obj_classes=None, verb_classes=None, edge_class_num=None, enc_layer_num=None, dec_layer_num=None, use_visual_features=False)[source]

Bases: Module

__init__(mode='edgecls', obj_classes=None, verb_classes=None, edge_class_num=None, enc_layer_num=None, dec_layer_num=None, use_visual_features=False)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(entry)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

SceneLLM Model

SceneLLM main module with imports from distributed components.

This module provides access to all SceneLLM components through a unified interface. The implementation has been distributed across multiple files for better organization:

  • vqvae.py: VQ-VAE quantizer implementation

  • sia.py: Spatial Information Aggregator and hierarchical graph functions

  • ot.py: Optimal Transport codebook updater

  • llm.py: SceneLLM LoRA implementation

  • network.py: Main SceneLLM model and SGG decoder

TODO items:

  • Compare different clustering methods

  • Improve the prompt template for the LLM

  • Add a better LLM

  • Improve the GCN architecture

  • Use cross-entropy instead of MSE
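As a quick orientation, the components listed above can be imported either from their individual modules or through this unified interface; the sketch below uses the class paths documented in this section.

# Importing the distributed SceneLLM components through the unified interface
# (class paths as documented below in this section).
from m3sgg.core.models.scenellm.scenellm import (
    SceneLLM,           # main model (network.py)
    SGGDecoder,         # SGG decoder (network.py)
    VQVAEQuantizer,     # vqvae.py
    SIA,                # sia.py
    OTCodebookUpdater,  # ot.py
    SceneLLMLoRA,       # llm.py
)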

class m3sgg.core.models.scenellm.scenellm.SceneLLM(cfg, dataset)[source]

Bases: Module

SceneLLM model for scene graph generation with language model integration.

Combines VQ-VAE quantization, Spatial Information Aggregator (SIA), optimal transport codebook updates, and LoRA-adapted language models for advanced scene graph generation and description.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(cfg, dataset)[source]

Initialize the SceneLLM model.

Parameters:
  • cfg (Config) – Configuration object containing model parameters

  • dataset (object) – Dataset information for model setup

Returns:

None

Return type:

None

forward(entry)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

set_training_stage(stage)[source]

Set training stage and freeze/unfreeze components accordingly.

update_codebook_with_ot()[source]

Update codebook using Optimal Transport scheme.

class m3sgg.core.models.scenellm.scenellm.VQVAEQuantizer(input_dim=2048, dim=1024, codebook_size=8192, commitment_cost=0.25)[source]

Bases: Module

Vector Quantized Variational AutoEncoder (VQ-VAE) quantizer.

Implements discrete latent space quantization for scene representations with codebook learning and commitment loss for stable training.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(input_dim=2048, dim=1024, codebook_size=8192, commitment_cost=0.25)[source]

Initialize the VQ-VAE quantizer.

Parameters:
  • input_dim (int, optional) – Input feature dimension, defaults to 2048

  • dim (int, optional) – Latent dimension, defaults to 1024

  • codebook_size (int, optional) – Size of the discrete codebook, defaults to 8192

  • commitment_cost (float, optional) – Weight for commitment loss, defaults to 0.25

Returns:

None

Return type:

None

forward(roi_feats)[source]

Forward pass through VQ-VAE quantizer.

Parameters:

roi_feats (torch.Tensor) – ROI features tensor of shape [N, input_dim]

Returns:

Tuple containing reconstructed features, reconstruction loss, embedding loss, and commitment loss

Return type:

tuple

get_usage_histogram()[source]

Get current usage histogram for OT update.

reset_usage_count()[source]

Reset usage counter.

update_codebook(new_codebook_weights)[source]

Update codebook with new weights.
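To make the quantization step concrete, here is a generic, self-contained sketch of standard VQ-VAE quantization (nearest-codebook lookup, straight-through gradients, embedding and commitment losses). It illustrates the technique only and is not the project's exact implementation.

import torch
import torch.nn.functional as F

def vq_quantize(z, codebook, commitment_cost=0.25):
    """Generic VQ-VAE quantization step (illustrative, not m3sgg's exact code).

    z:        [N, dim] encoder outputs
    codebook: [codebook_size, dim] embedding vectors
    """
    # Distance between each latent and every codebook entry.
    dists = torch.cdist(z, codebook)              # [N, codebook_size]
    indices = dists.argmin(dim=1)                 # nearest code per latent
    z_q = codebook[indices]                       # [N, dim]

    # Codebook (embedding) loss and commitment loss.
    embedding_loss = F.mse_loss(z_q, z.detach())
    commitment_loss = commitment_cost * F.mse_loss(z, z_q.detach())

    # Straight-through estimator: gradients flow to the encoder as if
    # quantization were the identity.
    z_q = z + (z_q - z).detach()
    return z_q, indices, embedding_loss, commitment_loss

z = torch.randn(8, 1024)
codebook = torch.randn(8192, 1024)
z_q, idx, emb_loss, com_loss = vq_quantize(z, codebook)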

class m3sgg.core.models.scenellm.scenellm.SIA(dim=1024)[source]

Bases: Module

__init__(dim=1024)[source]

Spatial Information Aggregator: embeds the box coordinates (x, y, w, h) and fuses the resulting spatial encoding with ROI tokens for spatial reasoning.

forward(feats, boxes)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.scenellm.scenellm.OTCodebookUpdater(base_codebook, step=512, max_iterations=10)[source]

Bases: object

__init__(base_codebook, step=512, max_iterations=10)[source]
update(usage_hist)[source]

Update codebook using Optimal Transport scheme.

Parameters:

usage_hist (torch.Tensor) – Usage-frequency tensor of shape [codebook_size]

Returns:

new embedding weight matrix

Return type:

updated_codebook
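A hedged sketch of how the documented methods could be wired together for a codebook update. Whether OTCodebookUpdater expects the quantizer module or its raw weight tensor as base_codebook is an assumption here; SceneLLM.update_codebook_with_ot() wraps the same flow internally.

# Sketch of the OT codebook-update flow using only the methods documented
# above; the exact form of `base_codebook` is an assumption.
from m3sgg.core.models.scenellm.scenellm import VQVAEQuantizer, OTCodebookUpdater

quantizer = VQVAEQuantizer(input_dim=2048, dim=1024, codebook_size=8192)
updater = OTCodebookUpdater(base_codebook=quantizer, step=512, max_iterations=10)

# ... run forward passes so the quantizer accumulates code-usage counts ...

usage_hist = quantizer.get_usage_histogram()   # [codebook_size] usage frequencies
new_codebook = updater.update(usage_hist)      # new embedding weight matrix
quantizer.update_codebook(new_codebook)
quantizer.reset_usage_count()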

class m3sgg.core.models.scenellm.scenellm.SceneLLMLoRA(model_name, fallback_dim=None, r=16, alpha=32, dropout=0.05)[source]

Bases: Module

SceneLLM with LoRA (Low-Rank Adaptation) for efficient fine-tuning.

Implements LoRA adaptation on language models for scene graph generation with fallback support when transformers library is unavailable.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(model_name, fallback_dim=None, r=16, alpha=32, dropout=0.05)[source]

Initialize SceneLLM with LoRA adaptation.

Parameters:
  • model_name (str) – Name of the base language model (e.g., ‘google/gemma-2-2b’)

  • fallback_dim (int, optional) – Dimension for fallback when transformers unavailable, defaults to None

  • r (int, optional) – LoRA rank parameter, defaults to 16

  • alpha (int, optional) – LoRA alpha parameter, defaults to 32

  • dropout (float, optional) – LoRA dropout rate, defaults to 0.05

Returns:

None

Return type:

None

forward(token_embeds)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
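To clarify what LoRA adaptation does, the sketch below implements the standard low-rank adapter around a frozen linear layer in plain PyTorch, reusing the r, alpha, and dropout hyperparameters from the constructor above. It is a generic illustration of the technique, not SceneLLMLoRA's actual code.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA adapter around a frozen linear layer (illustrative)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32, dropout: float = 0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)  # start as a no-op adaptation
        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(self.dropout(x))) * self.scaling

layer = LoRALinear(nn.Linear(1024, 1024), r=16, alpha=32, dropout=0.05)
out = layer(torch.randn(2, 1024))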

class m3sgg.core.models.scenellm.scenellm.SGGDecoder(hidden_dim, attn_c, spat_c, cont_c)[source]

Bases: Module

Scene Graph Generation decoder with transformer architecture.

Decodes hidden representations into attention, spatial, and contact relation predictions using transformer encoder and linear heads.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(hidden_dim, attn_c, spat_c, cont_c)[source]

Initialize the SGG decoder.

Parameters:
  • hidden_dim (int) – Hidden dimension size

  • attn_c (int) – Number of attention relation classes

  • spat_c (int) – Number of spatial relation classes

  • cont_c (int) – Number of contact relation classes

Returns:

None

Return type:

None

forward(seq)[source]

Forward pass through the SGG decoder.

Parameters:

seq (torch.Tensor) – Input sequence tensor of shape [B, T, D]

Returns:

Dictionary containing attention, spatial, and contact predictions

Return type:

dict
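The decoder pattern described above (a transformer encoder over the hidden sequence followed by one linear head per relation type) can be sketched in plain PyTorch as follows; the dimensions, layer count, and output key names are illustrative assumptions.

import torch
import torch.nn as nn

class TinySGGDecoder(nn.Module):
    """Illustrative decoder: transformer encoder + per-relation linear heads."""

    def __init__(self, hidden_dim=1024, attn_c=3, spat_c=6, cont_c=17, nlayers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=nlayers)
        self.attn_head = nn.Linear(hidden_dim, attn_c)
        self.spat_head = nn.Linear(hidden_dim, spat_c)
        self.cont_head = nn.Linear(hidden_dim, cont_c)

    def forward(self, seq):                      # seq: [B, T, D]
        h = self.encoder(seq)
        return {
            "attention_distribution": self.attn_head(h),
            "spatial_distribution": self.spat_head(h),
            "contact_distribution": self.cont_head(h),
        }

dec = TinySGGDecoder()
preds = dec(torch.randn(2, 5, 1024))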

m3sgg.core.models.scenellm.scenellm.build_hierarchical_graph(boxes)[source]

Build hierarchical graph from bounding boxes using hierarchical clustering.

Creates a graph structure from spatial relationships between bounding boxes using hierarchical clustering algorithms.

Parameters:

boxes (torch.Tensor) – Tensor of normalized bounding boxes, shape [N, 4]

Returns:

DGL graph or simple edge list based on hierarchical clustering

Return type:

dgl.DGLGraph or dict
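As an illustration of the clustering step, the sketch below hierarchically clusters box centers with SciPy and connects boxes that fall into the same cluster. It assumes (x1, y1, x2, y2) box format and is not the project's exact implementation, which may instead return a DGL graph when DGL is available.

import torch
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def boxes_to_edges(boxes, num_clusters=2):
    """Edges between boxes in the same cluster (illustrative sketch)."""
    centers = torch.stack(
        [(boxes[:, 0] + boxes[:, 2]) / 2, (boxes[:, 1] + boxes[:, 3]) / 2], dim=1
    ).numpy()
    Z = linkage(pdist(centers), method="ward")            # agglomerative clustering
    labels = fcluster(Z, t=num_clusters, criterion="maxclust")
    return [
        (i, j)
        for i in range(len(labels))
        for j in range(i + 1, len(labels))
        if labels[i] == labels[j]
    ]

boxes = torch.tensor([[0.10, 0.10, 0.20, 0.20],
                      [0.15, 0.12, 0.25, 0.22],
                      [0.80, 0.80, 0.90, 0.90]])
print(boxes_to_edges(boxes))   # the two nearby boxes share an edge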

Main SceneLLM network and SGG decoder implementation. Credit to the authors of the original code: https://doi.org/10.1016/j.patcog.2025.111992.

class m3sgg.core.models.scenellm.network.SceneLLM(cfg, dataset)[source]

Bases: Module

SceneLLM model for scene graph generation with language model integration.

Combines VQ-VAE quantization, Spatial Information Aggregator (SIA), optimal transport codebook updates, and LoRA-adapted language models for advanced scene graph generation and description.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(cfg, dataset)[source]

Initialize the SceneLLM model.

Parameters:
  • cfg (Config) – Configuration object containing model parameters

  • dataset (object) – Dataset information for model setup

Returns:

None

Return type:

None

forward(entry)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

set_training_stage(stage)[source]

Set training stage and freeze/unfreeze components accordingly.

update_codebook_with_ot()[source]

Update codebook using Optimal Transport scheme.

class m3sgg.core.models.scenellm.network.SGGDecoder(hidden_dim, attn_c, spat_c, cont_c)[source]

Bases: Module

Scene Graph Generation decoder with transformer architecture.

Decodes hidden representations into attention, spatial, and contact relation predictions using transformer encoder and linear heads.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(hidden_dim, attn_c, spat_c, cont_c)[source]

Initialize the SGG decoder.

Parameters:
  • hidden_dim (int) – Hidden dimension size

  • attn_c (int) – Number of attention relation classes

  • spat_c (int) – Number of spatial relation classes

  • cont_c (int) – Number of contact relation classes

Returns:

None

Return type:

None

forward(seq)[source]

Forward pass through the SGG decoder.

Parameters:

seq (torch.Tensor) – Input sequence tensor of shape [B, T, D]

Returns:

Dictionary containing attention, spatial, and contact predictions

Return type:

dict

SceneLLM LoRA implementation for SceneLLM. Credit to the authors of the original code: https://doi.org/10.1016/j.patcog.2025.111992.

class m3sgg.core.models.scenellm.llm.SceneLLMLoRA(model_name, fallback_dim=None, r=16, alpha=32, dropout=0.05)[source]

Bases: Module

SceneLLM with LoRA (Low-Rank Adaptation) for efficient fine-tuning.

Implements LoRA adaptation on language models for scene graph generation with fallback support when transformers library is unavailable.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(model_name, fallback_dim=None, r=16, alpha=32, dropout=0.05)[source]

Initialize SceneLLM with LoRA adaptation.

Parameters:
  • model_name (str) – Name of the base language model (e.g., ‘google/gemma-2-2b’)

  • fallback_dim (int, optional) – Dimension for fallback when transformers unavailable, defaults to None

  • r (int, optional) – LoRA rank parameter, defaults to 16

  • alpha (int, optional) – LoRA alpha parameter, defaults to 32

  • dropout (float, optional) – LoRA dropout rate, defaults to 0.05

Returns:

None

Return type:

None

forward(token_embeds)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

VQ-VAE Quantizer implementation for SceneLLM. Credit to the authors of the original code: https://doi.org/10.1016/j.patcog.2025.111992.

class m3sgg.core.models.scenellm.vqvae.VQVAEQuantizer(input_dim=2048, dim=1024, codebook_size=8192, commitment_cost=0.25)[source]

Bases: Module

Vector Quantized Variational AutoEncoder (VQ-VAE) quantizer.

Implements discrete latent space quantization for scene representations with codebook learning and commitment loss for stable training.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(input_dim=2048, dim=1024, codebook_size=8192, commitment_cost=0.25)[source]

Initialize the VQ-VAE quantizer.

Parameters:
  • input_dim (int, optional) – Input feature dimension, defaults to 2048

  • dim (int, optional) – Latent dimension, defaults to 1024

  • codebook_size (int, optional) – Size of the discrete codebook, defaults to 8192

  • commitment_cost (float, optional) – Weight for commitment loss, defaults to 0.25

Returns:

None

Return type:

None

forward(roi_feats)[source]

Forward pass through VQ-VAE quantizer.

Parameters:

roi_feats (torch.Tensor) – ROI features tensor of shape [N, input_dim]

Returns:

Tuple containing reconstructed features, reconstruction loss, embedding loss, and commitment loss

Return type:

tuple

get_usage_histogram()[source]

Get current usage histogram for OT update.

reset_usage_count()[source]

Reset usage counter.

update_codebook(new_codebook_weights)[source]

Update codebook with new weights.

Optimal Transport Codebook Updater implementation for SceneLLM. Credit to the authors of the original code: https://doi.org/10.1016/j.patcog.2025.111992.

class m3sgg.core.models.scenellm.ot.OTCodebookUpdater(base_codebook, step=512, max_iterations=10)[source]

Bases: object

__init__(base_codebook, step=512, max_iterations=10)[source]
update(usage_hist)[source]

Update codebook using Optimal Transport scheme.

Parameters:

usage_hist (torch.Tensor) – Usage-frequency tensor of shape [codebook_size]

Returns:

new embedding weight matrix

Return type:

updated_codebook

Spatial Information Aggregator (SIA) implementation for SceneLLM. Credit to the authors of the original code: https://doi.org/10.1016/j.patcog.2025.111992.

m3sgg.core.models.scenellm.sia.build_hierarchical_graph(boxes)[source]

Build hierarchical graph from bounding boxes using hierarchical clustering.

Creates a graph structure from spatial relationships between bounding boxes using hierarchical clustering algorithms.

Parameters:

boxes (torch.Tensor) – Tensor of normalized bounding boxes, shape [N, 4]

Returns:

DGL graph or simple edge list based on hierarchical clustering

Return type:

dgl.DGLGraph or dict

class m3sgg.core.models.scenellm.sia.SIA(dim=1024)[source]

Bases: Module

__init__(dim=1024)[source]

Spatial Information Aggregator: embeds the box coordinates (x, y, w, h) and fuses the resulting spatial encoding with ROI tokens for spatial reasoning.

forward(feats, boxes)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Tempura Model

class m3sgg.core.models.tempura.tempura.PositionalEncoding(d_model: int, dropout: float = 0.1, max_len: int = 5000)[source]

Bases: Module

Positional encoding for transformer-based models.

Implements sinusoidal positional encoding to provide temporal information to transformer architectures for sequence modeling.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(d_model: int, dropout: float = 0.1, max_len: int = 5000)[source]

Initialize positional encoding.

Parameters:
  • d_model (int) – Model dimension

  • dropout (float, optional) – Dropout probability, defaults to 0.1

  • max_len (int, optional) – Maximum sequence length, defaults to 5000

Returns:

None

Return type:

None

forward(x: Tensor, indices=None) Tensor[source]

Apply positional encoding to input tensor.

Parameters:
  • x (torch.Tensor) – Input tensor of shape [batch_size, seq_len, embedding_dim]

  • indices (torch.Tensor, optional) – Optional indices for position selection, defaults to None

Returns:

Input tensor with positional encoding added

Return type:

torch.Tensor
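For reference, the standard sinusoidal positional encoding that this class implements can be written as the following self-contained sketch; shapes follow the [batch_size, seq_len, embedding_dim] convention documented above.

import math
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    """Standard sinusoidal positional encoding (illustrative sketch)."""

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        position = torch.arange(max_len).unsqueeze(1)                  # [max_len, 1]
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):                  # x: [batch_size, seq_len, d_model]
        x = x + self.pe[: x.size(1)].unsqueeze(0)
        return self.dropout(x)

pe = SinusoidalPE(d_model=1936)
out = pe(torch.randn(2, 10, 1936))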

class m3sgg.core.models.tempura.tempura.ObjectClassifier(mode='sgdet', obj_head='gmm', K=4, obj_classes=None, mem_compute=None, selection=None, selection_lambda=0.5, tracking=None)[source]

Bases: Module

Tempura object classifier for computing object and edge contexts.

Implements the Tempura model’s approach to object classification and contextual feature extraction with memory-augmented learning and uncertainty estimation capabilities.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(mode='sgdet', obj_head='gmm', K=4, obj_classes=None, mem_compute=None, selection=None, selection_lambda=0.5, tracking=None)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

clean_class(entry, b, class_idx)[source]
mem_selection(feat)[source]
memory_hallucinator(memory, feat)[source]
classify(entry, obj_features, phase='train', unc=False)[source]
forward(entry, phase='train', unc=False)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.tempura.tempura.TEMPURA(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, enc_layer_num=None, dec_layer_num=None, obj_mem_compute=None, rel_mem_compute=None, mem_fusion=None, selection=None, selection_lambda=0.5, take_obj_mem_feat=False, obj_head='gmm', rel_head='gmm', K=None, tracking=None)[source]

Bases: Module

__init__(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, enc_layer_num=None, dec_layer_num=None, obj_mem_compute=None, rel_mem_compute=None, mem_fusion=None, selection=None, selection_lambda=0.5, take_obj_mem_feat=False, obj_head='gmm', rel_head='gmm', K=None, tracking=None)[source]
Parameters:
  • classes – Object classes

  • rel_classes – Relationship classes; None if relationship mode is not used

  • mode – Classification mode ('sgcls', 'predcls', or 'sgdet')

forward(entry, phase='train', unc=False)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.tempura.transformer_tempura.TransformerEncoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Bases: Module

__init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(src, input_key_padding_mask)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.tempura.transformer_tempura.TransformerDecoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Bases: Module

__init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(global_input, input_key_padding_mask, position_embed)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.tempura.transformer_tempura.TransformerEncoder(encoder_layer, num_layers)[source]

Bases: Module

__init__(encoder_layer, num_layers)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input, input_key_padding_mask)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.tempura.transformer_tempura.TransformerDecoder(decoder_layer, num_layers, embed_dim)[source]

Bases: Module

__init__(decoder_layer, num_layers, embed_dim)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(global_input, input_key_padding_mask, position_embed)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.tempura.transformer_tempura.transformer(enc_layer_num=1, dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, mode=None, mem_compute=True, mem_fusion=None, selection=None, selection_lambda=0.5)[source]

Bases: Module

Spatial Temporal Transformer.

Parameters:
  • local_attention (object) – spatial encoder

  • global_attention (object) – temporal decoder

  • position_embedding (object) – frame encoding (window_size*dim)

  • mode (str) – 'both' uses the features from both frames in the window; 'latter' uses only the features from the latter frame in the window

__init__(enc_layer_num=1, dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, mode=None, mem_compute=True, mem_fusion=None, selection=None, selection_lambda=0.5)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

memory_hallucinator(memory, feat)[source]
mem_selection(feat)[source]
forward(features, im_idx, memory=[])[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.tempura.gmm_heads.GMM_head(hid_dim, num_classes, rel_type=None, k=4)[source]

Bases: Module

Gaussian Mixture Model head for uncertainty estimation in Tempura.

Implements a GMM-based classification head that models uncertainty through multiple Gaussian components with learnable means, variances, and mixture weights.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(hid_dim, num_classes, rel_type=None, k=4)[source]

Initialize the GMM head.

Parameters:
  • hid_dim (int) – Hidden dimension size

  • num_classes (int) – Number of output classes

  • rel_type (str, optional) – Type of relation (affects activation function), defaults to None

  • k (int, optional) – Number of Gaussian mixture components, defaults to 4

Returns:

None

Return type:

None

uncertainty(conf_mu_k, conf_var_k, conf_pi_k_)[source]

Compute epistemic and aleatoric uncertainty.

Parameters:
  • conf_mu_k (dict) – Mean predictions for each mixture component

  • conf_var_k (dict) – Variance predictions for each mixture component

  • conf_pi_k (list) – Mixture weights for each component

Returns:

Tuple containing prediction, aleatoric uncertainty, and epistemic uncertainty

Return type:

tuple

forward(x, phase='train', unc=False)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
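The uncertainty decomposition used by a GMM head follows the law of total variance: aleatoric uncertainty is the mixture-weighted average of the component variances, and epistemic uncertainty is the mixture-weighted spread of the component means around the overall prediction. The sketch below illustrates that decomposition generically; it is not TEMPURA's exact code.

import torch

def gmm_uncertainty(mu_k, var_k, pi_k):
    """Uncertainty decomposition for a K-component Gaussian mixture (illustrative).

    mu_k:  [K, C] component means over C classes
    var_k: [K, C] component variances
    pi_k:  [K]    mixture weights (sum to 1)
    """
    pi = pi_k.unsqueeze(1)                             # [K, 1]
    pred = (pi * mu_k).sum(dim=0)                      # mixture mean prediction [C]
    aleatoric = (pi * var_k).sum(dim=0)                # expected within-component variance
    epistemic = (pi * (mu_k - pred) ** 2).sum(dim=0)   # variance of the component means
    return pred, aleatoric, epistemic

mu_k = torch.randn(4, 17)
var_k = torch.rand(4, 17)
pi_k = torch.softmax(torch.randn(4), dim=0)
pred, alea, epis = gmm_uncertainty(mu_k, var_k, pi_k)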

STKET Model

Spatio-Temporal Knowledge-Enhanced Transformer (STKET) for Scene Graph Generation.

This module implements the STKET model for video scene graph generation, combining spatial and temporal reasoning with transformer architectures.

class m3sgg.core.models.stket.stket.ObjectClassifier(mode='sgdet', obj_classes=None)[source]

Bases: Module

Module for computing object contexts and edge contexts in scene graphs.

Handles object classification and contextual feature extraction for spatial-temporal transformer-based scene graph generation.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(mode='sgdet', obj_classes=None)[source]

Initialize the object classifier.

Parameters:
  • mode (str, optional) – Classification mode (‘predcls’, ‘sgcls’, ‘sgdet’), defaults to “sgdet”

  • obj_classes (list, optional) – List of object class names, defaults to None

Returns:

None

Return type:

None

forward(entry)[source]

Forward pass for object classification.

Parameters:

entry (dict) – Dictionary containing input data

Returns:

Updated entry dictionary with predictions

Return type:

dict

class m3sgg.core.models.stket.stket.STKET(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, N_layer_num=1, enc_layer_num=None, dec_layer_num=None, pred_contact_threshold=0.5, window_size=4, trainPrior=None, use_spatial_prior=False, use_temporal_prior=False)[source]

Bases: Module

Spatio-Temporal Knowledge-Enhanced Transformer for Scene Graph Generation.

Implements the STKET model that combines spatial and temporal reasoning with transformer architectures for video scene graph generation.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, N_layer_num=1, enc_layer_num=None, dec_layer_num=None, pred_contact_threshold=0.5, window_size=4, trainPrior=None, use_spatial_prior=False, use_temporal_prior=False)[source]

Initialize the STKET model.

Parameters:
  • mode (str, optional) – Classification mode (‘sgdet’, ‘sgcls’, ‘predcls’), defaults to “sgdet”

  • attention_class_num (int, optional) – Number of attention relationship classes, defaults to None

  • spatial_class_num (int, optional) – Number of spatial relationship classes, defaults to None

  • contact_class_num (int, optional) – Number of contact relationship classes, defaults to None

  • obj_classes (list, optional) – List of object class names, defaults to None

  • rel_classes (list, optional) – List of relationship class names, defaults to None

  • N_layer_num (int, optional) – Number of transformer layers, defaults to 1

  • enc_layer_num (int, optional) – Number of encoder layers, defaults to None

  • dec_layer_num (int, optional) – Number of decoder layers, defaults to None

  • pred_contact_threshold (float, optional) – Contact prediction threshold, defaults to 0.5

  • window_size (int, optional) – Temporal window size, defaults to 4

  • trainPrior (dict, optional) – Training prior information, defaults to None

  • use_spatial_prior (bool, optional) – Whether to use spatial priors, defaults to False

  • use_temporal_prior (bool, optional) – Whether to use temporal priors, defaults to False

Returns:

None

Return type:

None

forward(entry)[source]

Forward pass for STKET model.

Parameters:

entry (dict) – Dictionary containing input data

Returns:

Updated entry dictionary with predictions

Return type:

dict
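A hedged construction sketch for STKET follows; the class counts mirror the Action Genome defaults documented for the transformer_stket modules below (37 objects, 3 attention, 6 spatial, 17 contact relations), and the class-name lists are placeholders. Construction may require the package's data dependencies.

# Usage sketch (illustrative; class names and counts are placeholder assumptions).
from m3sgg.core.models.stket.stket import STKET

obj_classes = ["__background__", "person", "cup"]          # placeholder names
rel_classes = ["looking_at", "in_front_of", "holding"]     # placeholder names

model = STKET(
    mode="sgdet",
    attention_class_num=3,
    spatial_class_num=6,
    contact_class_num=17,
    obj_classes=obj_classes,
    rel_classes=rel_classes,
    enc_layer_num=1,
    dec_layer_num=3,
    pred_contact_threshold=0.5,
    window_size=4,
    use_spatial_prior=False,
    use_temporal_prior=False,
)
# `entry` is the detector output dictionary; forward() returns it with
# relationship predictions added:
# entry = model(entry)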

class m3sgg.core.models.stket.transformer_stket.TransformerEncoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Bases: Module

STKET transformer encoder layer with prior knowledge integration.

Implements transformer encoder with additional prior knowledge integration for spatio-temporal knowledge-enhanced scene graph generation.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Initialize STKET transformer encoder layer.

Parameters:
  • embed_dim (int, optional) – Embedding dimension, defaults to 1936

  • nhead (int, optional) – Number of attention heads, defaults to 4

  • dim_feedforward (int, optional) – Feed-forward dimension, defaults to 2048

  • dropout (float, optional) – Dropout probability, defaults to 0.1

Returns:

None

Return type:

None

forward(src, prior, input_key_padding_mask)[source]

Forward pass with prior knowledge integration.

Parameters:
  • src (torch.Tensor) – Input sequence features

  • prior (torch.Tensor) – Prior knowledge distribution

  • input_key_padding_mask (torch.Tensor) – Key padding mask

Returns:

Tuple containing output tensor and attention weights

Return type:

tuple

class m3sgg.core.models.stket.transformer_stket.TransformerDecoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Bases: Module

__init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(src, prior, input_key_padding_mask, position_embed)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.stket.transformer_stket.TransformerEncoder(encoder_layer, num_layers)[source]

Bases: Module

__init__(encoder_layer, num_layers)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input, prior, input_key_padding_mask)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.stket.transformer_stket.TransformerDecoder(decoder_layer, num_layers, embed_dim)[source]

Bases: Module

__init__(decoder_layer, num_layers, embed_dim)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input, prior, input_key_padding_mask, position_embed)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.stket.transformer_stket.spatial_encoder(enc_layer_num=1, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, trainPrior=None, use_spatial_prior=False, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]

Bases: Module

__init__(enc_layer_num=1, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, trainPrior=None, use_spatial_prior=False, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(features, im_idx, entry, mode)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.stket.transformer_stket.temporal_decoder(dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, pred_contact_threshold=0.5, trainPrior=None, use_temporal_prior=False, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]

Bases: Module

__init__(dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, pred_contact_threshold=0.5, trainPrior=None, use_temporal_prior=False, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(features, contact_distribution, im_idx, entry, mode)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.stket.transformer_stket.ensemble_decoder(dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, pred_contact_threshold=0.5, window_size=3, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]

Bases: Module

__init__(dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, pred_contact_threshold=0.5, window_size=3, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(spatial_features, temporal_features, contact_distribution, im_idx, entry, mode)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

OED Model

OED Multi-frame Model Implementation

This module implements the OED architecture for multi-frame dynamic scene graph generation with the Progressively Refined Module (PRM) for temporal context aggregation.

class m3sgg.core.models.oed.oed_multi.OEDMulti(conf, dataset)[source]

Bases: Module

OED Multi-frame model for dynamic scene graph generation.

Implements the one-stage end-to-end framework with cascaded decoders and Progressively Refined Module (PRM) for temporal context aggregation.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(conf, dataset)[source]

Initialize the OED Multi-frame model.

Parameters:
  • conf (Config) – Configuration object containing model parameters

  • dataset (object) – Dataset information for model setup

Returns:

None

Return type:

None

forward(entry, targets=None)[source]

Forward pass through the OED model.

Parameters:
  • entry (dict) – Input data containing images and features

  • targets (dict, optional) – Ground truth targets for training, defaults to None

Returns:

Model predictions and outputs

Return type:

dict

class m3sgg.core.models.oed.oed_multi.MLP(input_dim, hidden_dim, output_dim, num_layers)[source]

Bases: Module

Multi-layer perceptron for bounding box prediction.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(input_dim, hidden_dim, output_dim, num_layers)[source]

Initialize the MLP.

Parameters:
  • input_dim (int) – Input dimension

  • hidden_dim (int) – Hidden dimension

  • output_dim (int) – Output dimension

  • num_layers (int) – Number of layers

Returns:

None

Return type:

None

forward(x)[source]

Forward pass through the MLP.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

Output tensor

Return type:

torch.Tensor

m3sgg.core.models.oed.oed_multi.build_oed_multi(conf, dataset)[source]

Build OED multi-frame model.

Parameters:
  • conf (Config) – Configuration object

  • dataset (object) – Dataset object

Returns:

Tuple of (model, criterion, postprocessors)

Return type:

tuple
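A sketch of how the builder's returned triple is typically consumed; conf and dataset stand for the project's configuration and dataset objects, and entry/targets come from the data pipeline (illustrative only).

# Sketch: consuming the builder's return value. `conf` and `dataset` are the
# project's Config and dataset objects (placeholders here).
from m3sgg.core.models.oed.oed_multi import build_oed_multi

model, criterion, postprocessors = build_oed_multi(conf, dataset)

# Training-style step (entry/targets come from the data pipeline):
# outputs = model(entry, targets=targets)
# loss_dict = criterion(outputs, targets)   # per-loss values, weighted via the
#                                           # weight_dict supplied at construction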

OED Single-frame Model Implementation

This module implements the OED architecture for single-frame scene graph generation as a baseline for comparison with the multi-frame variant.

class m3sgg.core.models.oed.oed_single.OEDSingle(conf, dataset)[source]

Bases: Module

OED Single-frame model for scene graph generation.

Implements the one-stage end-to-end framework with cascaded decoders for single-frame processing.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(conf, dataset)[source]

Initialize the OED Single-frame model.

Parameters:
  • conf (Config) – Configuration object containing model parameters

  • dataset (object) – Dataset information for model setup

Returns:

None

Return type:

None

forward(entry, targets=None)[source]

Forward pass through the OED model.

Parameters:
  • entry (dict) – Input data containing images and features from object detector

  • targets (dict, optional) – Ground truth targets for training, defaults to None

Returns:

Model predictions and outputs

Return type:

dict

class m3sgg.core.models.oed.oed_single.MLP(input_dim, hidden_dim, output_dim, num_layers)[source]

Bases: Module

Multi-layer perceptron for bounding box prediction.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(input_dim, hidden_dim, output_dim, num_layers)[source]

Initialize the MLP.

Parameters:
  • input_dim (int) – Input dimension

  • hidden_dim (int) – Hidden dimension

  • output_dim (int) – Output dimension

  • num_layers (int) – Number of layers

Returns:

None

Return type:

None

forward(x)[source]

Forward pass through the MLP.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

Output tensor

Return type:

torch.Tensor

m3sgg.core.models.oed.oed_single.build_oed_single(conf, dataset)[source]

Build OED single-frame model.

Parameters:
  • conf (Config) – Configuration object

  • dataset (object) – Dataset object

Returns:

Tuple of (model, criterion, postprocessors)

Return type:

tuple

Transformer module for OED model.

This module implements the cascaded decoders and transformer architecture for the OED model.

class m3sgg.core.models.oed.transformer.Transformer(conf)[source]

Bases: Module

Transformer with cascaded decoders for OED.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(conf)[source]

Initialize the transformer.

Parameters:

conf (Config) – Configuration object

Returns:

None

Return type:

None

forward(src, mask, query_embed, pos_embed, embed_dict=None, targets=None, cur_idx=0)[source]

Forward pass through the transformer.

Parameters:
  • src (torch.Tensor) – Source features

  • mask (torch.Tensor) – Attention mask

  • query_embed (torch.Tensor) – Query embeddings

  • pos_embed (torch.Tensor) – Position embeddings

  • embed_dict (dict, optional) – Dictionary of embedding layers, defaults to None

  • targets (dict, optional) – Ground truth targets, defaults to None

  • cur_idx (int, optional) – Current frame index, defaults to 0

Returns:

Tuple of outputs

Return type:

tuple

class m3sgg.core.models.oed.transformer.TransformerEncoder(encoder_layer, num_layers, norm=None)[source]

Bases: Module

Transformer encoder.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(encoder_layer, num_layers, norm=None)[source]

Initialize the encoder.

Parameters:
  • encoder_layer (TransformerEncoderLayer) – Encoder layer

  • num_layers (int) – Number of layers

  • norm (nn.Module, optional) – Normalization layer, defaults to None

Returns:

None

Return type:

None

forward(src, mask=None, src_key_padding_mask=None, pos=None)[source]

Forward pass through encoder.

Parameters:
  • src (torch.Tensor) – Source features

  • mask (torch.Tensor, optional) – Attention mask, defaults to None

  • src_key_padding_mask (torch.Tensor, optional) – Key padding mask, defaults to None

  • pos (torch.Tensor, optional) – Position embeddings, defaults to None

Returns:

Encoded features

Return type:

torch.Tensor

class m3sgg.core.models.oed.transformer.TransformerDecoder(decoder_layer, num_layers, norm=None)[source]

Bases: Module

Transformer decoder.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(decoder_layer, num_layers, norm=None)[source]

Initialize the decoder.

Parameters:
  • decoder_layer (TransformerDecoderLayer) – Decoder layer

  • num_layers (int) – Number of layers

  • norm (nn.Module, optional) – Normalization layer, defaults to None

Returns:

None

Return type:

None

forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos=None, query_pos=None)[source]

Forward pass through decoder.

Parameters:
  • tgt (torch.Tensor) – Target features

  • memory (torch.Tensor) – Memory features from encoder

  • tgt_mask (torch.Tensor, optional) – Target mask, defaults to None

  • memory_mask (torch.Tensor, optional) – Memory mask, defaults to None

  • tgt_key_padding_mask (torch.Tensor, optional) – Target key padding mask, defaults to None

  • memory_key_padding_mask (torch.Tensor, optional) – Memory key padding mask, defaults to None

  • pos (torch.Tensor, optional) – Position embeddings, defaults to None

  • query_pos (torch.Tensor, optional) – Query position embeddings, defaults to None

Returns:

Decoded features

Return type:

torch.Tensor

class m3sgg.core.models.oed.transformer.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='relu', pre_norm=False)[source]

Bases: Module

Transformer encoder layer.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='relu', pre_norm=False)[source]

Initialize the encoder layer.

Parameters:
  • d_model (int) – Model dimension

  • nhead (int) – Number of attention heads

  • dim_feedforward (int, optional) – Feedforward dimension, defaults to 2048

  • dropout (float, optional) – Dropout rate, defaults to 0.1

  • activation (str, optional) – Activation function, defaults to “relu”

  • pre_norm (bool, optional) – Whether to use pre-norm, defaults to False

Returns:

None

Return type:

None

forward(src, src_mask=None, src_key_padding_mask=None, pos=None)[source]

Forward pass through encoder layer.

Parameters:
  • src (torch.Tensor) – Source features

  • src_mask (torch.Tensor, optional) – Source mask, defaults to None

  • src_key_padding_mask (torch.Tensor, optional) – Source key padding mask, defaults to None

  • pos (torch.Tensor, optional) – Position embeddings, defaults to None

Returns:

Output features

Return type:

torch.Tensor

class m3sgg.core.models.oed.transformer.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='relu', pre_norm=False)[source]

Bases: Module

Transformer decoder layer.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='relu', pre_norm=False)[source]

Initialize the decoder layer.

Parameters:
  • d_model (int) – Model dimension

  • nhead (int) – Number of attention heads

  • dim_feedforward (int, optional) – Feedforward dimension, defaults to 2048

  • dropout (float, optional) – Dropout rate, defaults to 0.1

  • activation (str, optional) – Activation function, defaults to “relu”

  • pre_norm (bool, optional) – Whether to use pre-norm, defaults to False

Returns:

None

Return type:

None

forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos=None, query_pos=None)[source]

Forward pass through decoder layer.

Parameters:
  • tgt (torch.Tensor) – Target features

  • memory (torch.Tensor) – Memory features

  • tgt_mask (torch.Tensor, optional) – Target mask, defaults to None

  • memory_mask (torch.Tensor, optional) – Memory mask, defaults to None

  • tgt_key_padding_mask (torch.Tensor, optional) – Target key padding mask, defaults to None

  • memory_key_padding_mask (torch.Tensor, optional) – Memory key padding mask, defaults to None

  • pos (torch.Tensor, optional) – Position embeddings, defaults to None

  • query_pos (torch.Tensor, optional) – Query position embeddings, defaults to None

Returns:

Output features

Return type:

torch.Tensor

m3sgg.core.models.oed.transformer.build_transformer(conf)[source]

Build transformer network.

Parameters:

conf (Config) – Configuration object

Returns:

Transformer network

Return type:

Transformer

Criterion module for OED model.

This module implements the loss functions for training the OED model.

class m3sgg.core.models.oed.criterion.SetCriterionOED(num_obj_classes, num_queries, matcher, weight_dict, eos_coef, losses, conf)[source]

Bases: Module

Loss criterion for OED model.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(num_obj_classes, num_queries, matcher, weight_dict, eos_coef, losses, conf)[source]

Initialize the criterion.

Parameters:
  • num_obj_classes (int) – Number of object classes

  • num_queries (int) – Number of queries

  • matcher (object) – Hungarian matcher

  • weight_dict (dict) – Dictionary of loss weights

  • eos_coef (float) – End-of-sequence coefficient

  • losses (list) – List of loss types

  • conf (Config) – Configuration object

Returns:

None

Return type:

None

loss_obj_labels(outputs, targets, indices, num_interactions, log=True)[source]

Object classification loss.

Parameters:
  • outputs (dict) – Model outputs

  • targets (list) – Ground truth targets

  • indices (list) – Matched indices

  • num_interactions (int) – Number of interactions

  • log (bool, optional) – Whether to log errors, defaults to True

Returns:

Dictionary of losses

Return type:

dict

loss_obj_cardinality(outputs, targets, indices, num_interactions)[source]

Object cardinality loss.

Parameters:
  • outputs (dict) – Model outputs

  • targets (list) – Ground truth targets

  • indices (list) – Matched indices

  • num_interactions (int) – Number of interactions

Returns:

Dictionary of losses

Return type:

dict

loss_relation_labels(outputs, targets, indices, num_interactions)[source]

Relation classification loss.

Parameters:
  • outputs (dict) – Model outputs

  • targets (list) – Ground truth targets

  • indices (list) – Matched indices

  • num_interactions (int) – Number of interactions

Returns:

Dictionary of losses

Return type:

dict

loss_sub_obj_boxes(outputs, targets, indices, num_interactions)[source]

Bounding box loss.

Parameters:
  • outputs (dict) – Model outputs

  • targets (list) – Ground truth targets

  • indices (list) – Matched indices

  • num_interactions (int) – Number of interactions

Returns:

Dictionary of losses

Return type:

dict

get_loss(loss, outputs, targets, indices, num, **kwargs)[source]

Get loss function.

Parameters:
  • loss (str) – Loss type

  • outputs (dict) – Model outputs

  • targets (list) – Ground truth targets

  • indices (list) – Matched indices

  • num (int) – Number of interactions

Returns:

Loss dictionary

Return type:

dict

forward(outputs, targets)[source]

Forward pass for loss computation.

Parameters:
  • outputs (dict) – Model outputs

  • targets (list) – Ground truth targets

Returns:

Dictionary of losses

Return type:

dict

m3sgg.core.models.oed.criterion.accuracy(output, target, topk=(1,))[source]

Compute accuracy.

Parameters:
  • output (torch.Tensor) – Model output

  • target (torch.Tensor) – Ground truth target

  • topk (tuple, optional) – Top-k accuracy, defaults to (1,)

Returns:

Tuple of accuracies

Return type:

tuple
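A generic top-k accuracy computation in the spirit of this helper, shown as a self-contained sketch (not necessarily the project's exact implementation).

import torch

def topk_accuracy(output, target, topk=(1,)):
    """Top-k accuracy in percent (generic sketch)."""
    maxk = max(topk)
    _, pred = output.topk(maxk, dim=1, largest=True, sorted=True)   # [N, maxk]
    correct = pred.eq(target.view(-1, 1))                           # [N, maxk]
    return [correct[:, :k].any(dim=1).float().mean() * 100.0 for k in topk]

logits = torch.randn(8, 37)          # e.g. 37 object classes
labels = torch.randint(0, 37, (8,))
top1, top5 = topk_accuracy(logits, labels, topk=(1, 5))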

Postprocessing module for OED model.

This module handles the post-processing of model outputs for evaluation.

class m3sgg.core.models.oed.postprocess.PostProcessOED(conf)[source]

Bases: Module

Post-processing for OED model outputs.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(conf)[source]

Initialize the post-processor.

Parameters:

conf (Config) – Configuration object

Returns:

None

Return type:

None

forward(outputs, target_sizes)[source]

Post-process model outputs.

Parameters:
  • outputs (dict) – Model outputs

  • target_sizes (torch.Tensor) – Target image sizes

Returns:

List of processed predictions

Return type:

list

Utilities module for OED model.

This module provides utility functions for box operations and other helpers.

m3sgg.core.models.oed.utils.box_cxcywh_to_xyxy(x)[source]

Convert boxes from center format (cx, cy, w, h) to corner format (x1, y1, x2, y2).

Parameters:

x (torch.Tensor) – Boxes in center format

Returns:

Boxes in corner format

Return type:

torch.Tensor

m3sgg.core.models.oed.utils.box_xyxy_to_cxcywh(x)[source]

Convert boxes from corner format (x1, y1, x2, y2) to center format (cx, cy, w, h).

Parameters:

x (torch.Tensor) – Boxes in corner format

Returns:

Boxes in center format

Return type:

torch.Tensor

m3sgg.core.models.oed.utils.generalized_box_iou(boxes1, boxes2)[source]

Generalized Intersection over Union between two sets of boxes.

Parameters:
  • boxes1 (torch.Tensor) – First set of boxes in (x1, y1, x2, y2) format

  • boxes2 (torch.Tensor) – Second set of boxes in (x1, y1, x2, y2) format

Returns:

Generalized IoU matrix

Return type:

torch.Tensor

m3sgg.core.models.oed.utils.box_area(boxes)[source]

Compute the area of a set of bounding boxes.

Parameters:

boxes (torch.Tensor) – Bounding boxes in (x1, y1, x2, y2) format

Returns:

Areas of the boxes

Return type:

torch.Tensor
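The box utilities above follow the usual center/corner conventions; the self-contained sketch below shows generic equivalents of the conversions and the area computation for illustration (the project's versions may differ in detail).

import torch

def box_cxcywh_to_xyxy(x):
    """(cx, cy, w, h) -> (x1, y1, x2, y2)."""
    cx, cy, w, h = x.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

def box_xyxy_to_cxcywh(x):
    """(x1, y1, x2, y2) -> (cx, cy, w, h)."""
    x1, y1, x2, y2 = x.unbind(-1)
    return torch.stack([(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1], dim=-1)

def box_area(boxes):
    """Area of (x1, y1, x2, y2) boxes."""
    return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

b = torch.tensor([[0.5, 0.5, 0.2, 0.4]])     # one box in center format
xyxy = box_cxcywh_to_xyxy(b)                 # tensor([[0.4, 0.3, 0.6, 0.7]])
assert torch.allclose(box_xyxy_to_cxcywh(xyxy), b)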

m3sgg.core.models.oed.utils.nested_tensor_from_tensor_list(tensor_list)[source]

Convert a list of tensors to a nested tensor.

Parameters:

tensor_list (list) – List of tensors

Returns:

Nested tensor

Return type:

NestedTensor

class m3sgg.core.models.oed.utils.NestedTensor(tensors, mask)[source]

Bases: object

Nested tensor wrapper for efficient processing.

Parameters:

object (class) – Base object class

__init__(tensors, mask)[source]

Initialize the nested tensor.

Parameters:
  • tensors (torch.Tensor) – Batched tensors

  • mask (torch.Tensor) – Padding mask indicating valid regions

Returns:

None

Return type:

None

decompose()[source]

Decompose the nested tensor.

Returns:

Tuple of (tensors, mask)

Return type:

tuple

__repr__()[source]

String representation.

Returns:

String representation

Return type:

str
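A nested tensor packages variable-size images into one padded batch plus a boolean mask marking the padded region. The sketch below illustrates that idea generically; it is not the project's implementation.

import torch

def pad_to_nested(tensor_list):
    """Pad [C, H, W] tensors to a common size and return (batch, mask) - illustrative."""
    max_h = max(t.shape[1] for t in tensor_list)
    max_w = max(t.shape[2] for t in tensor_list)
    c = tensor_list[0].shape[0]
    batch = torch.zeros(len(tensor_list), c, max_h, max_w)
    mask = torch.ones(len(tensor_list), max_h, max_w, dtype=torch.bool)  # True = padding
    for i, t in enumerate(tensor_list):
        batch[i, :, : t.shape[1], : t.shape[2]] = t
        mask[i, : t.shape[1], : t.shape[2]] = False
    return batch, mask

imgs = [torch.randn(3, 200, 300), torch.randn(3, 180, 320)]
batch, mask = pad_to_nested(imgs)   # batch: [2, 3, 200, 320], mask: [2, 200, 320]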

VLM Model

class m3sgg.core.models.vlm.scene_graph_generator.VLMSceneGraphGenerator(mode='sgdet', attention_class_num=3, spatial_class_num=6, contact_class_num=17, obj_classes=None, model_name='apple/FastVLM-0.5B', device='cuda', few_shot_examples=None, use_chain_of_thought=True, use_tree_of_thought=False, confidence_threshold=0.5)[source]

Bases: Module

__init__(mode='sgdet', attention_class_num=3, spatial_class_num=6, contact_class_num=17, obj_classes=None, model_name='apple/FastVLM-0.5B', device='cuda', few_shot_examples=None, use_chain_of_thought=True, use_tree_of_thought=False, confidence_threshold=0.5)[source]

Initialize VLM Scene Graph Generator.

Parameters:
  • mode – Scene graph generation mode (sgdet, sgcls, predcls)

  • attention_class_num – Number of attention relationship classes

  • spatial_class_num – Number of spatial relationship classes

  • contact_class_num – Number of contact relationship classes

  • obj_classes – List of object classes

  • model_name – HuggingFace model name for VLM

  • device – Device to run inference on

  • few_shot_examples – Few-shot examples for prompting

  • use_chain_of_thought – Whether to use chain-of-thought reasoning

  • use_tree_of_thought – Whether to use tree-of-thought reasoning

  • confidence_threshold – Threshold for relationship confidence

forward(entry: Dict, im_data: Tensor | None = None) Dict[source]

Forward pass through VLM Scene Graph Generator.

Parameters:

entry – Input dictionary containing image data and bounding boxes

Returns:

Dictionary with scene graph predictions
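A hedged construction sketch for the VLM generator; the default model name comes from the signature above, while the class counts and class names here are placeholder assumptions, and loading the model requires the transformers library and network access.

# Usage sketch (illustrative). Class names/counts are placeholders; downloading
# 'apple/FastVLM-0.5B' requires the transformers library and network access.
from m3sgg.core.models.vlm.scene_graph_generator import VLMSceneGraphGenerator

generator = VLMSceneGraphGenerator(
    mode="predcls",
    attention_class_num=3,
    spatial_class_num=6,
    contact_class_num=17,
    obj_classes=["__background__", "person", "cup"],   # placeholder names
    model_name="apple/FastVLM-0.5B",
    device="cuda",
    use_chain_of_thought=True,
    confidence_threshold=0.5,
)
# `entry` holds image data and bounding boxes; forward() returns a dictionary
# of scene graph predictions:
# predictions = generator(entry)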

Transformer Components

class m3sgg.utils.transformer.TransformerEncoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Bases: Module

Transformer encoder layer with multi-head attention and feed-forward network.

Implements a single layer of the transformer encoder with self-attention mechanism, layer normalization, and position-wise feed-forward network.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Initialize the transformer encoder layer.

Parameters:
  • embed_dim (int, optional) – Embedding dimension, defaults to 1936

  • nhead (int, optional) – Number of attention heads, defaults to 4

  • dim_feedforward (int, optional) – Dimension of feed-forward network, defaults to 2048

  • dropout (float, optional) – Dropout probability, defaults to 0.1

Returns:

None

Return type:

None

forward(src, input_key_padding_mask)[source]

Forward pass through the transformer encoder layer.

Parameters:
  • src (torch.Tensor) – Input sequence features

  • input_key_padding_mask (torch.Tensor) – Key padding mask

Returns:

Transformed sequence and attention weights

Return type:

tuple

class m3sgg.utils.transformer.TransformerDecoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Bases: Module

Transformer decoder layer with masked self-attention and cross-attention.

Implements a single layer of the transformer decoder with masked self-attention, encoder-decoder attention, and position-wise feed-forward network.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Initialize the transformer decoder layer.

Parameters:
  • embed_dim (int, optional) – Embedding dimension, defaults to 1936

  • nhead (int, optional) – Number of attention heads, defaults to 4

  • dim_feedforward (int, optional) – Dimension of feed-forward network, defaults to 2048

  • dropout (float, optional) – Dropout probability, defaults to 0.1

Returns:

None

Return type:

None

forward(global_input, input_key_padding_mask, position_embed)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.utils.transformer.TransformerEncoder(encoder_layer, num_layers)[source]

Bases: Module

__init__(encoder_layer, num_layers)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input, input_key_padding_mask)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.utils.transformer.TransformerDecoder(decoder_layer, num_layers, embed_dim)[source]

Bases: Module

__init__(decoder_layer, num_layers, embed_dim)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(global_input, input_key_padding_mask, position_embed)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.utils.transformer.transformer(enc_layer_num=1, dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, mode=None)[source]

Bases: Module

Spatial Temporal Transformer.

Parameters:
  • local_attention (object) – spatial encoder

  • global_attention (object) – temporal decoder

  • position_embedding (object) – frame encoding (window_size*dim)

  • mode (str) – 'both' uses the features from both frames in the window; 'latter' uses only the features from the latter frame in the window

__init__(enc_layer_num=1, dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, mode=None)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(features, im_idx)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
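A hedged sketch of the inputs this spatial-temporal transformer consumes, inferred from the forward(features, im_idx) signature and the default embed_dim of 1936: one feature row per subject-object pair and a frame index per row. The exact return values are not documented here, so the call is left commented.

# Sketch of input shapes for the spatial-temporal transformer (illustrative;
# the feature layout is an assumption based on the forward signature).
import torch
from m3sgg.utils.transformer import transformer

st_transformer = transformer(
    enc_layer_num=1,
    dec_layer_num=3,
    embed_dim=1936,
    nhead=8,
    mode="latter",
)

features = torch.randn(12, 1936)                              # one row per subject-object pair
im_idx = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3])   # frame index per pair
# output = st_transformer(features, im_idx)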