Models

This section documents the various scene graph generation models implemented in the project.

STTran Model

class m3sgg.core.models.sttran.sttran.ObjectClassifier(mode='sgdet', obj_classes=None)[source]

Bases: Module

Module for computing object contexts and edge contexts in scene graphs.

Handles object classification and contextual feature extraction for spatial-temporal transformer-based scene graph generation.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(mode='sgdet', obj_classes=None)[source]

Initialize the object classifier.

Parameters:
  • mode (str, optional) – Classification mode (‘predcls’, ‘sgcls’, ‘sgdet’), defaults to “sgdet”

  • obj_classes (list, optional) – List of object class names, defaults to None

Returns:

None

Return type:

None

clean_class(entry, b, class_idx)[source]
forward(entry)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.sttran.sttran.STTran(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, enc_layer_num=None, dec_layer_num=None, window_size=4, trainPrior=None, use_spatial_prior=False, use_temporal_prior=False)[source]

Bases: Module

__init__(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, enc_layer_num=None, dec_layer_num=None, window_size=4, trainPrior=None, use_spatial_prior=False, use_temporal_prior=False)[source]
Parameters:
  • classes – Object classes

  • rel_classes – Relationship classes; None if relationship mode is not used

  • mode – Classification mode ('sgcls', 'predcls', or 'sgdet')

forward(entry)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Computes the pairwise subject-object relationship representations.
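The usage sketch below shows how STTran is typically constructed. The relationship class counts (3 attention, 6 spatial, 17 contact) follow the Action Genome defaults documented for the STKET transformer later in this section and are assumptions here; the class-name lists are placeholders, and the constructor may require additional dataset assets (e.g., word-embedding files).

# Illustrative usage sketch, not an official example. Class counts and class
# names are placeholder assumptions; the constructor may load dataset assets
# such as word embeddings.
from m3sgg.core.models.sttran.sttran import STTran

obj_classes = ["__background__", "person", "cup"]          # placeholder object names
rel_classes = ["looking_at", "in_front_of", "holding"]     # placeholder relation names

model = STTran(
    mode="sgdet",
    attention_class_num=3,
    spatial_class_num=6,
    contact_class_num=17,
    obj_classes=obj_classes,
    rel_classes=rel_classes,
    enc_layer_num=1,
    dec_layer_num=3,
    window_size=4,
)

# `entry` is the dictionary produced by the object detector (boxes, ROI
# features, frame indices, ...); the forward pass adds the relationship
# distributions to it:
# entry = model(entry)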

class m3sgg.core.detectors.easg.sttran_EASG.ObjectClassifier(mode='edgecls', obj_classes=None)[source]

Bases: Module

Module for computing object contexts and edge contexts for EASG.

EASG-specific implementation of object classification and contextual feature extraction for efficient scene graph generation.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(mode='edgecls', obj_classes=None)[source]

Initialize the EASG object classifier.

Parameters:
  • mode (str, optional) – Classification mode, defaults to “edgecls”

  • obj_classes (list, optional) – List of object class names, defaults to None

Returns:

None

Return type:

None

forward(entry)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.detectors.easg.sttran_EASG.ActionClassifier(mode='edgecls', verb_classes=None)[source]

Bases: Module

__init__(mode='edgecls', verb_classes=None)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(entry)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.detectors.easg.sttran_EASG.STTran(mode='edgecls', obj_classes=None, verb_classes=None, edge_class_num=None, enc_layer_num=None, dec_layer_num=None, use_visual_features=False)[source]

Bases: Module

__init__(mode='edgecls', obj_classes=None, verb_classes=None, edge_class_num=None, enc_layer_num=None, dec_layer_num=None, use_visual_features=False)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(entry)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

SceneLLM Model

SceneLLM main module with imports from distributed components.

This module provides access to all SceneLLM components through a unified interface. The implementation has been distributed across multiple files for better organization:

  • vqvae.py: VQ-VAE quantizer implementation

  • sia.py: Spatial Information Aggregator and hierarchical graph functions

  • ot.py: Optimal Transport codebook updater

  • llm.py: SceneLLM LoRA implementation

  • network.py: Main SceneLLM model and SGG decoder

TODO items:

  • Compare different clustering methods

  • Improve the prompt template for the LLM

  • Add a better LLM

  • Improve the GCN architecture

  • Use cross-entropy instead of MSE
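As a quick orientation, the components listed above can be imported either from their individual modules or through this unified interface; the sketch below uses the class paths documented in this section.

# Importing the distributed SceneLLM components through the unified interface
# (class paths as documented below in this section).
from m3sgg.core.models.scenellm.scenellm import (
    SceneLLM,           # main model (network.py)
    SGGDecoder,         # SGG decoder (network.py)
    VQVAEQuantizer,     # vqvae.py
    SIA,                # sia.py
    OTCodebookUpdater,  # ot.py
    SceneLLMLoRA,       # llm.py
)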

class m3sgg.core.models.scenellm.scenellm.SceneLLM(cfg, dataset)[source]

Bases: Module

SceneLLM model for scene graph generation with language model integration.

Combines VQ-VAE quantization, Spatial Information Aggregator (SIA), optimal transport codebook updates, and LoRA-adapted language models for advanced scene graph generation and description.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(cfg, dataset)[source]

Initialize the SceneLLM model.

Parameters:
  • cfg (Config) – Configuration object containing model parameters

  • dataset (object) – Dataset information for model setup

Returns:

None

Return type:

None

forward(entry)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

set_training_stage(stage)[source]

Set training stage and freeze/unfreeze components accordingly.

update_codebook_with_ot()[source]

Update codebook using Optimal Transport scheme.

class m3sgg.core.models.scenellm.scenellm.VQVAEQuantizer(input_dim=2048, dim=1024, codebook_size=8192, commitment_cost=0.25)[source]

Bases: Module

Vector Quantized Variational AutoEncoder (VQ-VAE) quantizer.

Implements discrete latent space quantization for scene representations with codebook learning and commitment loss for stable training.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(input_dim=2048, dim=1024, codebook_size=8192, commitment_cost=0.25)[source]

Initialize the VQ-VAE quantizer.

Parameters:
  • input_dim (int, optional) – Input feature dimension, defaults to 2048

  • dim (int, optional) – Latent dimension, defaults to 1024

  • codebook_size (int, optional) – Size of the discrete codebook, defaults to 8192

  • commitment_cost (float, optional) – Weight for commitment loss, defaults to 0.25

Returns:

None

Return type:

None

forward(roi_feats)[source]

Forward pass through VQ-VAE quantizer.

Parameters:

roi_feats (torch.Tensor) – ROI features tensor of shape [N, input_dim]

Returns:

Tuple containing reconstructed features, reconstruction loss, embedding loss, and commitment loss

Return type:

tuple

get_usage_histogram()[source]

Get current usage histogram for OT update.

reset_usage_count()[source]

Reset usage counter.

update_codebook(new_codebook_weights)[source]

Update codebook with new weights.
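To make the quantization step concrete, here is a generic, self-contained sketch of standard VQ-VAE quantization (nearest-codebook lookup, straight-through gradients, embedding and commitment losses). It illustrates the technique only and is not the project's exact implementation.

import torch
import torch.nn.functional as F

def vq_quantize(z, codebook, commitment_cost=0.25):
    """Generic VQ-VAE quantization step (illustrative, not m3sgg's exact code).

    z:        [N, dim] encoder outputs
    codebook: [codebook_size, dim] embedding vectors
    """
    # Distance between each latent and every codebook entry.
    dists = torch.cdist(z, codebook)              # [N, codebook_size]
    indices = dists.argmin(dim=1)                 # nearest code per latent
    z_q = codebook[indices]                       # [N, dim]

    # Codebook (embedding) loss and commitment loss.
    embedding_loss = F.mse_loss(z_q, z.detach())
    commitment_loss = commitment_cost * F.mse_loss(z, z_q.detach())

    # Straight-through estimator: gradients flow to the encoder as if
    # quantization were the identity.
    z_q = z + (z_q - z).detach()
    return z_q, indices, embedding_loss, commitment_loss

z = torch.randn(8, 1024)
codebook = torch.randn(8192, 1024)
z_q, idx, emb_loss, com_loss = vq_quantize(z, codebook)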

class m3sgg.core.models.scenellm.scenellm.SIA(dim=1024)[source]

Bases: Module

__init__(dim=1024)[source]

Spatial Information Aggregator: embeds the box coordinates (x, y, w, h) and fuses the resulting spatial encoding with ROI tokens for spatial reasoning.

forward(feats, boxes)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.scenellm.scenellm.OTCodebookUpdater(base_codebook, step=512, max_iterations=10)[source]

Bases: object

__init__(base_codebook, step=512, max_iterations=10)[source]
update(usage_hist)[source]

Update codebook using Optimal Transport scheme.

Parameters:

usage_hist (torch.Tensor) – Usage-frequency tensor of shape [codebook_size]

Returns:

new embedding weight matrix

Return type:

updated_codebook
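A hedged sketch of how the documented methods could be wired together for a codebook update. Whether OTCodebookUpdater expects the quantizer module or its raw weight tensor as base_codebook is an assumption here; SceneLLM.update_codebook_with_ot() wraps the same flow internally.

# Sketch of the OT codebook-update flow using only the methods documented
# above; the exact form of `base_codebook` is an assumption.
from m3sgg.core.models.scenellm.scenellm import VQVAEQuantizer, OTCodebookUpdater

quantizer = VQVAEQuantizer(input_dim=2048, dim=1024, codebook_size=8192)
updater = OTCodebookUpdater(base_codebook=quantizer, step=512, max_iterations=10)

# ... run forward passes so the quantizer accumulates code-usage counts ...

usage_hist = quantizer.get_usage_histogram()   # [codebook_size] usage frequencies
new_codebook = updater.update(usage_hist)      # new embedding weight matrix
quantizer.update_codebook(new_codebook)
quantizer.reset_usage_count()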

class m3sgg.core.models.scenellm.scenellm.SceneLLMLoRA(model_name, fallback_dim=None, r=16, alpha=32, dropout=0.05)[source]

Bases: Module

SceneLLM with LoRA (Low-Rank Adaptation) for efficient fine-tuning.

Implements LoRA adaptation on language models for scene graph generation with fallback support when transformers library is unavailable.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(model_name, fallback_dim=None, r=16, alpha=32, dropout=0.05)[source]

Initialize SceneLLM with LoRA adaptation.

Parameters:
  • model_name (str) – Name of the base language model (e.g., ‘google/gemma-2-2b’)

  • fallback_dim (int, optional) – Dimension for fallback when transformers unavailable, defaults to None

  • r (int, optional) – LoRA rank parameter, defaults to 16

  • alpha (int, optional) – LoRA alpha parameter, defaults to 32

  • dropout (float, optional) – LoRA dropout rate, defaults to 0.05

Returns:

None

Return type:

None

forward(token_embeds)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
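To clarify what LoRA adaptation does, the sketch below implements the standard low-rank adapter around a frozen linear layer in plain PyTorch, reusing the r, alpha, and dropout hyperparameters from the constructor above. It is a generic illustration of the technique, not SceneLLMLoRA's actual code.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA adapter around a frozen linear layer (illustrative)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32, dropout: float = 0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_b.weight)  # start as a no-op adaptation
        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return self.base(x) + self.lora_b(self.lora_a(self.dropout(x))) * self.scaling

layer = LoRALinear(nn.Linear(1024, 1024), r=16, alpha=32, dropout=0.05)
out = layer(torch.randn(2, 1024))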

class m3sgg.core.models.scenellm.scenellm.SGGDecoder(hidden_dim, attn_c, spat_c, cont_c)[source]

Bases: Module

Scene Graph Generation decoder with transformer architecture.

Decodes hidden representations into attention, spatial, and contact relation predictions using transformer encoder and linear heads.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(hidden_dim, attn_c, spat_c, cont_c)[source]

Initialize the SGG decoder.

Parameters:
  • hidden_dim (int) – Hidden dimension size

  • attn_c (int) – Number of attention relation classes

  • spat_c (int) – Number of spatial relation classes

  • cont_c (int) – Number of contact relation classes

Returns:

None

Return type:

None

forward(seq)[source]

Forward pass through the SGG decoder.

Parameters:

seq (torch.Tensor) – Input sequence tensor of shape [B, T, D]

Returns:

Dictionary containing attention, spatial, and contact predictions

Return type:

dict
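The decoder pattern described above (a transformer encoder over the hidden sequence followed by one linear head per relation type) can be sketched in plain PyTorch as follows; the dimensions, layer count, and output key names are illustrative assumptions.

import torch
import torch.nn as nn

class TinySGGDecoder(nn.Module):
    """Illustrative decoder: transformer encoder + per-relation linear heads."""

    def __init__(self, hidden_dim=1024, attn_c=3, spat_c=6, cont_c=17, nlayers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=nlayers)
        self.attn_head = nn.Linear(hidden_dim, attn_c)
        self.spat_head = nn.Linear(hidden_dim, spat_c)
        self.cont_head = nn.Linear(hidden_dim, cont_c)

    def forward(self, seq):                      # seq: [B, T, D]
        h = self.encoder(seq)
        return {
            "attention_distribution": self.attn_head(h),
            "spatial_distribution": self.spat_head(h),
            "contact_distribution": self.cont_head(h),
        }

dec = TinySGGDecoder()
preds = dec(torch.randn(2, 5, 1024))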

m3sgg.core.models.scenellm.scenellm.build_hierarchical_graph(boxes)[source]

Build hierarchical graph from bounding boxes using hierarchical clustering.

Creates a graph structure from spatial relationships between bounding boxes using hierarchical clustering algorithms.

Parameters:

boxes (torch.Tensor) – Tensor of normalized bounding boxes, shape [N, 4]

Returns:

DGL graph or simple edge list based on hierarchical clustering

Return type:

dgl.DGLGraph or dict
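As an illustration of the clustering step, the sketch below hierarchically clusters box centers with SciPy and connects boxes that fall into the same cluster. It assumes (x1, y1, x2, y2) box format and is not the project's exact implementation, which may instead return a DGL graph when DGL is available.

import torch
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def boxes_to_edges(boxes, num_clusters=2):
    """Edges between boxes in the same cluster (illustrative sketch)."""
    centers = torch.stack(
        [(boxes[:, 0] + boxes[:, 2]) / 2, (boxes[:, 1] + boxes[:, 3]) / 2], dim=1
    ).numpy()
    Z = linkage(pdist(centers), method="ward")            # agglomerative clustering
    labels = fcluster(Z, t=num_clusters, criterion="maxclust")
    return [
        (i, j)
        for i in range(len(labels))
        for j in range(i + 1, len(labels))
        if labels[i] == labels[j]
    ]

boxes = torch.tensor([[0.10, 0.10, 0.20, 0.20],
                      [0.15, 0.12, 0.25, 0.22],
                      [0.80, 0.80, 0.90, 0.90]])
print(boxes_to_edges(boxes))   # the two nearby boxes share an edge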

Main SceneLLM network and SGG decoder implementation. Credit to the authors of the original code: https://doi.org/10.1016/j.patcog.2025.111992.

class m3sgg.core.models.scenellm.network.SceneLLM(cfg, dataset)[source]

Bases: Module

SceneLLM model for scene graph generation with language model integration.

Combines VQ-VAE quantization, Spatial Information Aggregator (SIA), optimal transport codebook updates, and LoRA-adapted language models for advanced scene graph generation and description.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(cfg, dataset)[source]

Initialize the SceneLLM model.

Parameters:
  • cfg (Config) – Configuration object containing model parameters

  • dataset (object) – Dataset information for model setup

Returns:

None

Return type:

None

forward(entry)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

set_training_stage(stage)[source]

Set training stage and freeze/unfreeze components accordingly.

update_codebook_with_ot()[source]

Update codebook using Optimal Transport scheme.

class m3sgg.core.models.scenellm.network.SGGDecoder(hidden_dim, attn_c, spat_c, cont_c)[source]

Bases: Module

Scene Graph Generation decoder with transformer architecture.

Decodes hidden representations into attention, spatial, and contact relation predictions using transformer encoder and linear heads.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(hidden_dim, attn_c, spat_c, cont_c)[source]

Initialize the SGG decoder.

Parameters:
  • hidden_dim (int) – Hidden dimension size

  • attn_c (int) – Number of attention relation classes

  • spat_c (int) – Number of spatial relation classes

  • cont_c (int) – Number of contact relation classes

Returns:

None

Return type:

None

forward(seq)[source]

Forward pass through the SGG decoder.

Parameters:

seq (torch.Tensor) – Input sequence tensor of shape [B, T, D]

Returns:

Dictionary containing attention, spatial, and contact predictions

Return type:

dict

SceneLLM LoRA implementation for SceneLLM. Credit to the authors of the original code: https://doi.org/10.1016/j.patcog.2025.111992.

class m3sgg.core.models.scenellm.llm.SceneLLMLoRA(model_name, fallback_dim=None, r=16, alpha=32, dropout=0.05)[source]

Bases: Module

SceneLLM with LoRA (Low-Rank Adaptation) for efficient fine-tuning.

Implements LoRA adaptation on language models for scene graph generation with fallback support when transformers library is unavailable.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(model_name, fallback_dim=None, r=16, alpha=32, dropout=0.05)[source]

Initialize SceneLLM with LoRA adaptation.

Parameters:
  • model_name (str) – Name of the base language model (e.g., ‘google/gemma-2-2b’)

  • fallback_dim (int, optional) – Dimension for fallback when transformers unavailable, defaults to None

  • r (int, optional) – LoRA rank parameter, defaults to 16

  • alpha (int, optional) – LoRA alpha parameter, defaults to 32

  • dropout (float, optional) – LoRA dropout rate, defaults to 0.05

Returns:

None

Return type:

None

forward(token_embeds)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

VQ-VAE Quantizer implementation for SceneLLM. Credit to the authors of the original code: https://doi.org/10.1016/j.patcog.2025.111992.

class m3sgg.core.models.scenellm.vqvae.VQVAEQuantizer(input_dim=2048, dim=1024, codebook_size=8192, commitment_cost=0.25)[source]

Bases: Module

Vector Quantized Variational AutoEncoder (VQ-VAE) quantizer.

Implements discrete latent space quantization for scene representations with codebook learning and commitment loss for stable training.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(input_dim=2048, dim=1024, codebook_size=8192, commitment_cost=0.25)[source]

Initialize the VQ-VAE quantizer.

Parameters:
  • input_dim (int, optional) – Input feature dimension, defaults to 2048

  • dim (int, optional) – Latent dimension, defaults to 1024

  • codebook_size (int, optional) – Size of the discrete codebook, defaults to 8192

  • commitment_cost (float, optional) – Weight for commitment loss, defaults to 0.25

Returns:

None

Return type:

None

forward(roi_feats)[source]

Forward pass through VQ-VAE quantizer.

Parameters:

roi_feats (torch.Tensor) – ROI features tensor of shape [N, input_dim]

Returns:

Tuple containing reconstructed features, reconstruction loss, embedding loss, and commitment loss

Return type:

tuple

get_usage_histogram()[source]

Get current usage histogram for OT update.

reset_usage_count()[source]

Reset usage counter.

update_codebook(new_codebook_weights)[source]

Update codebook with new weights.

Optimal Transport Codebook Updater implementation for SceneLLM. Credit to the authors of the original code: https://doi.org/10.1016/j.patcog.2025.111992.

class m3sgg.core.models.scenellm.ot.OTCodebookUpdater(base_codebook, step=512, max_iterations=10)[source]

Bases: object

__init__(base_codebook, step=512, max_iterations=10)[source]
update(usage_hist)[source]

Update codebook using Optimal Transport scheme.

Parameters:

usage_hist (torch.Tensor) – Usage-frequency tensor of shape [codebook_size]

Returns:

new embedding weight matrix

Return type:

updated_codebook

Spatial Information Aggregator (SIA) implementation for SceneLLM. Credit to the authors of the original code: https://doi.org/10.1016/j.patcog.2025.111992.

m3sgg.core.models.scenellm.sia.build_hierarchical_graph(boxes)[source]

Build hierarchical graph from bounding boxes using hierarchical clustering.

Creates a graph structure from spatial relationships between bounding boxes using hierarchical clustering algorithms.

Parameters:

boxes (torch.Tensor) – Tensor of normalized bounding boxes, shape [N, 4]

Returns:

DGL graph or simple edge list based on hierarchical clustering

Return type:

dgl.DGLGraph or dict

class m3sgg.core.models.scenellm.sia.SIA(dim=1024)[source]

Bases: Module

__init__(dim=1024)[source]

Spatial Information Aggregator: embeds the box coordinates (x, y, w, h) and fuses the resulting spatial encoding with ROI tokens for spatial reasoning.

forward(feats, boxes)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

Tempura Model

class m3sgg.core.models.tempura.tempura.PositionalEncoding(d_model: int, dropout: float = 0.1, max_len: int = 5000)[source]

Bases: Module

Positional encoding for transformer-based models.

Implements sinusoidal positional encoding to provide temporal information to transformer architectures for sequence modeling.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(d_model: int, dropout: float = 0.1, max_len: int = 5000)[source]

Initialize positional encoding.

Parameters:
  • d_model (int) – Model dimension

  • dropout (float, optional) – Dropout probability, defaults to 0.1

  • max_len (int, optional) – Maximum sequence length, defaults to 5000

Returns:

None

Return type:

None

forward(x: Tensor, indices=None) Tensor[source]

Apply positional encoding to input tensor.

Parameters:
  • x (torch.Tensor) – Input tensor of shape [batch_size, seq_len, embedding_dim]

  • indices (torch.Tensor, optional) – Optional indices for position selection, defaults to None

Returns:

Input tensor with positional encoding added

Return type:

torch.Tensor
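For reference, the standard sinusoidal positional encoding that this class implements can be written as the following self-contained sketch; shapes follow the [batch_size, seq_len, embedding_dim] convention documented above.

import math
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    """Standard sinusoidal positional encoding (illustrative sketch)."""

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        position = torch.arange(max_len).unsqueeze(1)                  # [max_len, 1]
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x):                  # x: [batch_size, seq_len, d_model]
        x = x + self.pe[: x.size(1)].unsqueeze(0)
        return self.dropout(x)

pe = SinusoidalPE(d_model=1936)
out = pe(torch.randn(2, 10, 1936))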

class m3sgg.core.models.tempura.tempura.ObjectClassifier(mode='sgdet', obj_head='gmm', K=4, obj_classes=None, mem_compute=None, selection=None, selection_lambda=0.5, tracking=None)[source]

Bases: Module

Tempura object classifier for computing object and edge contexts.

Implements the Tempura model’s approach to object classification and contextual feature extraction with memory-augmented learning and uncertainty estimation capabilities.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(mode='sgdet', obj_head='gmm', K=4, obj_classes=None, mem_compute=None, selection=None, selection_lambda=0.5, tracking=None)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

clean_class(entry, b, class_idx)[source]
mem_selection(feat)[source]
memory_hallucinator(memory, feat)[source]
classify(entry, obj_features, phase='train', unc=False)[source]
forward(entry, phase='train', unc=False)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.tempura.tempura.TEMPURA(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, enc_layer_num=None, dec_layer_num=None, obj_mem_compute=None, rel_mem_compute=None, mem_fusion=None, selection=None, selection_lambda=0.5, take_obj_mem_feat=False, obj_head='gmm', rel_head='gmm', K=None, tracking=None)[source]

Bases: Module

__init__(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, enc_layer_num=None, dec_layer_num=None, obj_mem_compute=None, rel_mem_compute=None, mem_fusion=None, selection=None, selection_lambda=0.5, take_obj_mem_feat=False, obj_head='gmm', rel_head='gmm', K=None, tracking=None)[source]
Parameters:
  • classes – Object classes

  • rel_classes – Relationship classes; None if relationship mode is not used

  • mode – Classification mode ('sgcls', 'predcls', or 'sgdet')

forward(entry, phase='train', unc=False)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.tempura.transformer_tempura.TransformerEncoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Bases: Module

__init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(src, input_key_padding_mask)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.tempura.transformer_tempura.TransformerDecoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Bases: Module

__init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(global_input, input_key_padding_mask, position_embed)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.tempura.transformer_tempura.TransformerEncoder(encoder_layer, num_layers)[source]

Bases: Module

__init__(encoder_layer, num_layers)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input, input_key_padding_mask)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.tempura.transformer_tempura.TransformerDecoder(decoder_layer, num_layers, embed_dim)[source]

Bases: Module

__init__(decoder_layer, num_layers, embed_dim)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(global_input, input_key_padding_mask, position_embed)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.tempura.transformer_tempura.transformer(enc_layer_num=1, dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, mode=None, mem_compute=True, mem_fusion=None, selection=None, selection_lambda=0.5)[source]

Bases: Module

Spatial Temporal Transformer.

Parameters:
  • local_attention (object) – spatial encoder

  • global_attention (object) – temporal decoder

  • position_embedding (object) – frame encoding (window_size*dim)

  • mode (str) – 'both' uses the features from both frames in the window; 'latter' uses only the features from the latter frame in the window

__init__(enc_layer_num=1, dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, mode=None, mem_compute=True, mem_fusion=None, selection=None, selection_lambda=0.5)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

memory_hallucinator(memory, feat)[source]
mem_selection(feat)[source]
forward(features, im_idx, memory=[])[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.tempura.gmm_heads.GMM_head(hid_dim, num_classes, rel_type=None, k=4)[source]

Bases: Module

Gaussian Mixture Model head for uncertainty estimation in Tempura.

Implements a GMM-based classification head that models uncertainty through multiple Gaussian components with learnable means, variances, and mixture weights.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(hid_dim, num_classes, rel_type=None, k=4)[source]

Initialize the GMM head.

Parameters:
  • hid_dim (int) – Hidden dimension size

  • num_classes (int) – Number of output classes

  • rel_type (str, optional) – Type of relation (affects activation function), defaults to None

  • k (int, optional) – Number of Gaussian mixture components, defaults to 4

Returns:

None

Return type:

None

uncertainty(conf_mu_k, conf_var_k, conf_pi_k_)[source]

Compute epistemic and aleatoric uncertainty.

Parameters:
  • conf_mu_k (dict) – Mean predictions for each mixture component

  • conf_var_k (dict) – Variance predictions for each mixture component

  • conf_pi_k (list) – Mixture weights for each component

Returns:

Tuple containing prediction, aleatoric uncertainty, and epistemic uncertainty

Return type:

tuple

forward(x, phase='train', unc=False)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
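The uncertainty decomposition used by a GMM head follows the law of total variance: aleatoric uncertainty is the mixture-weighted average of the component variances, and epistemic uncertainty is the mixture-weighted spread of the component means around the overall prediction. The sketch below illustrates that decomposition generically; it is not TEMPURA's exact code.

import torch

def gmm_uncertainty(mu_k, var_k, pi_k):
    """Uncertainty decomposition for a K-component Gaussian mixture (illustrative).

    mu_k:  [K, C] component means over C classes
    var_k: [K, C] component variances
    pi_k:  [K]    mixture weights (sum to 1)
    """
    pi = pi_k.unsqueeze(1)                             # [K, 1]
    pred = (pi * mu_k).sum(dim=0)                      # mixture mean prediction [C]
    aleatoric = (pi * var_k).sum(dim=0)                # expected within-component variance
    epistemic = (pi * (mu_k - pred) ** 2).sum(dim=0)   # variance of the component means
    return pred, aleatoric, epistemic

mu_k = torch.randn(4, 17)
var_k = torch.rand(4, 17)
pi_k = torch.softmax(torch.randn(4), dim=0)
pred, alea, epis = gmm_uncertainty(mu_k, var_k, pi_k)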

STKET Model

Spatio-Temporal Knowledge-Enhanced Transformer (STKET) for Scene Graph Generation.

This module implements the STKET model for video scene graph generation, combining spatial and temporal reasoning with transformer architectures.

class m3sgg.core.models.stket.stket.ObjectClassifier(mode='sgdet', obj_classes=None)[source]

Bases: Module

Module for computing object contexts and edge contexts in scene graphs.

Handles object classification and contextual feature extraction for spatial-temporal transformer-based scene graph generation.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(mode='sgdet', obj_classes=None)[source]

Initialize the object classifier.

Parameters:
  • mode (str, optional) – Classification mode (‘predcls’, ‘sgcls’, ‘sgdet’), defaults to “sgdet”

  • obj_classes (list, optional) – List of object class names, defaults to None

Returns:

None

Return type:

None

forward(entry)[source]

Forward pass for object classification.

Parameters:

entry (dict) – Dictionary containing input data

Returns:

Updated entry dictionary with predictions

Return type:

dict

class m3sgg.core.models.stket.stket.STKET(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, N_layer_num=1, enc_layer_num=None, dec_layer_num=None, pred_contact_threshold=0.5, window_size=4, trainPrior=None, use_spatial_prior=False, use_temporal_prior=False)[source]

Bases: Module

Spatio-Temporal Knowledge-Enhanced Transformer for Scene Graph Generation.

Implements the STKET model that combines spatial and temporal reasoning with transformer architectures for video scene graph generation.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(mode='sgdet', attention_class_num=None, spatial_class_num=None, contact_class_num=None, obj_classes=None, rel_classes=None, N_layer_num=1, enc_layer_num=None, dec_layer_num=None, pred_contact_threshold=0.5, window_size=4, trainPrior=None, use_spatial_prior=False, use_temporal_prior=False)[source]

Initialize the STKET model.

Parameters:
  • mode (str, optional) – Classification mode (‘sgdet’, ‘sgcls’, ‘predcls’), defaults to “sgdet”

  • attention_class_num (int, optional) – Number of attention relationship classes, defaults to None

  • spatial_class_num (int, optional) – Number of spatial relationship classes, defaults to None

  • contact_class_num (int, optional) – Number of contact relationship classes, defaults to None

  • obj_classes (list, optional) – List of object class names, defaults to None

  • rel_classes (list, optional) – List of relationship class names, defaults to None

  • N_layer_num (int, optional) – Number of transformer layers, defaults to 1

  • enc_layer_num (int, optional) – Number of encoder layers, defaults to None

  • dec_layer_num (int, optional) – Number of decoder layers, defaults to None

  • pred_contact_threshold (float, optional) – Contact prediction threshold, defaults to 0.5

  • window_size (int, optional) – Temporal window size, defaults to 4

  • trainPrior (dict, optional) – Training prior information, defaults to None

  • use_spatial_prior (bool, optional) – Whether to use spatial priors, defaults to False

  • use_temporal_prior (bool, optional) – Whether to use temporal priors, defaults to False

Returns:

None

Return type:

None

forward(entry)[source]

Forward pass for STKET model.

Parameters:

entry (dict) – Dictionary containing input data

Returns:

Updated entry dictionary with predictions

Return type:

dict
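A hedged construction sketch for STKET follows; the class counts mirror the Action Genome defaults documented for the transformer_stket modules below (37 objects, 3 attention, 6 spatial, 17 contact relations), and the class-name lists are placeholders. Construction may require the package's data dependencies.

# Usage sketch (illustrative; class names and counts are placeholder assumptions).
from m3sgg.core.models.stket.stket import STKET

obj_classes = ["__background__", "person", "cup"]          # placeholder names
rel_classes = ["looking_at", "in_front_of", "holding"]     # placeholder names

model = STKET(
    mode="sgdet",
    attention_class_num=3,
    spatial_class_num=6,
    contact_class_num=17,
    obj_classes=obj_classes,
    rel_classes=rel_classes,
    enc_layer_num=1,
    dec_layer_num=3,
    pred_contact_threshold=0.5,
    window_size=4,
    use_spatial_prior=False,
    use_temporal_prior=False,
)
# `entry` is the detector output dictionary; forward() returns it with
# relationship predictions added:
# entry = model(entry)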

class m3sgg.core.models.stket.transformer_stket.TransformerEncoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Bases: Module

STKET transformer encoder layer with prior knowledge integration.

Implements transformer encoder with additional prior knowledge integration for spatio-temporal knowledge-enhanced scene graph generation.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Initialize STKET transformer encoder layer.

Parameters:
  • embed_dim (int, optional) – Embedding dimension, defaults to 1936

  • nhead (int, optional) – Number of attention heads, defaults to 4

  • dim_feedforward (int, optional) – Feed-forward dimension, defaults to 2048

  • dropout (float, optional) – Dropout probability, defaults to 0.1

Returns:

None

Return type:

None

forward(src, prior, input_key_padding_mask)[source]

Forward pass with prior knowledge integration.

Parameters:
  • src (torch.Tensor) – Input sequence features

  • prior (torch.Tensor) – Prior knowledge distribution

  • input_key_padding_mask (torch.Tensor) – Key padding mask

Returns:

Tuple containing output tensor and attention weights

Return type:

tuple

class m3sgg.core.models.stket.transformer_stket.TransformerDecoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Bases: Module

__init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(src, prior, input_key_padding_mask, position_embed)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.stket.transformer_stket.TransformerEncoder(encoder_layer, num_layers)[source]

Bases: Module

__init__(encoder_layer, num_layers)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input, prior, input_key_padding_mask)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.stket.transformer_stket.TransformerDecoder(decoder_layer, num_layers, embed_dim)[source]

Bases: Module

__init__(decoder_layer, num_layers, embed_dim)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input, prior, input_key_padding_mask, position_embed)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.stket.transformer_stket.spatial_encoder(enc_layer_num=1, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, trainPrior=None, use_spatial_prior=False, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]

Bases: Module

__init__(enc_layer_num=1, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, trainPrior=None, use_spatial_prior=False, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(features, im_idx, entry, mode)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.stket.transformer_stket.temporal_decoder(dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, pred_contact_threshold=0.5, trainPrior=None, use_temporal_prior=False, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]

Bases: Module

__init__(dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, pred_contact_threshold=0.5, trainPrior=None, use_temporal_prior=False, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(features, contact_distribution, im_idx, entry, mode)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.core.models.stket.transformer_stket.ensemble_decoder(dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, pred_contact_threshold=0.5, window_size=3, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]

Bases: Module

__init__(dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, pred_contact_threshold=0.5, window_size=3, obj_class_num=37, attention_class_num=3, spatial_class_num=6, contact_class_num=17)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(spatial_features, temporal_features, contact_distribution, im_idx, entry, mode)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

OED Model

OED Multi-frame Model Implementation

This module implements the OED architecture for multi-frame dynamic scene graph generation with the Progressively Refined Module (PRM) for temporal context aggregation.

class m3sgg.core.models.oed.oed_multi.OEDMulti(conf, dataset)[source]

Bases: Module

OED Multi-frame model for dynamic scene graph generation.

Implements the one-stage end-to-end framework with cascaded decoders and Progressively Refined Module (PRM) for temporal context aggregation.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(conf, dataset)[source]

Initialize the OED Multi-frame model.

Parameters:
  • conf (Config) – Configuration object containing model parameters

  • dataset (object) – Dataset information for model setup

Returns:

None

Return type:

None

forward(entry, targets=None)[source]

Forward pass through the OED model.

Parameters:
  • entry (dict) – Input data containing images and features

  • targets (dict, optional) – Ground truth targets for training, defaults to None

Returns:

Model predictions and outputs

Return type:

dict

class m3sgg.core.models.oed.oed_multi.MLP(input_dim, hidden_dim, output_dim, num_layers)[source]

Bases: Module

Multi-layer perceptron for bounding box prediction.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(input_dim, hidden_dim, output_dim, num_layers)[source]

Initialize the MLP.

Parameters:
  • input_dim (int) – Input dimension

  • hidden_dim (int) – Hidden dimension

  • output_dim (int) – Output dimension

  • num_layers (int) – Number of layers

Returns:

None

Return type:

None

forward(x)[source]

Forward pass through the MLP.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

Output tensor

Return type:

torch.Tensor

m3sgg.core.models.oed.oed_multi.build_oed_multi(conf, dataset)[source]

Build OED multi-frame model.

Parameters:
  • conf (Config) – Configuration object

  • dataset (object) – Dataset object

Returns:

Tuple of (model, criterion, postprocessors)

Return type:

tuple
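A sketch of how the builder's returned triple is typically consumed; conf and dataset stand for the project's configuration and dataset objects, and entry/targets come from the data pipeline (illustrative only).

# Sketch: consuming the builder's return value. `conf` and `dataset` are the
# project's Config and dataset objects (placeholders here).
from m3sgg.core.models.oed.oed_multi import build_oed_multi

model, criterion, postprocessors = build_oed_multi(conf, dataset)

# Training-style step (entry/targets come from the data pipeline):
# outputs = model(entry, targets=targets)
# loss_dict = criterion(outputs, targets)   # per-loss values, weighted via the
#                                           # weight_dict supplied at construction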

OED Single-frame Model Implementation

This module implements the OED architecture for single-frame scene graph generation as a baseline for comparison with the multi-frame variant.

class m3sgg.core.models.oed.oed_single.OEDSingle(conf, dataset)[source]

Bases: Module

OED Single-frame model for scene graph generation.

Implements the one-stage end-to-end framework with cascaded decoders for single-frame processing.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(conf, dataset)[source]

Initialize the OED Single-frame model.

Parameters:
  • conf (Config) – Configuration object containing model parameters

  • dataset (object) – Dataset information for model setup

Returns:

None

Return type:

None

forward(entry, targets=None)[source]

Forward pass through the OED model.

Parameters:
  • entry (dict) – Input data containing images and features from object detector

  • targets (dict, optional) – Ground truth targets for training, defaults to None

Returns:

Model predictions and outputs

Return type:

dict

class m3sgg.core.models.oed.oed_single.MLP(input_dim, hidden_dim, output_dim, num_layers)[source]

Bases: Module

Multi-layer perceptron for bounding box prediction.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(input_dim, hidden_dim, output_dim, num_layers)[source]

Initialize the MLP.

Parameters:
  • input_dim (int) – Input dimension

  • hidden_dim (int) – Hidden dimension

  • output_dim (int) – Output dimension

  • num_layers (int) – Number of layers

Returns:

None

Return type:

None

forward(x)[source]

Forward pass through the MLP.

Parameters:

x (torch.Tensor) – Input tensor

Returns:

Output tensor

Return type:

torch.Tensor

m3sgg.core.models.oed.oed_single.build_oed_single(conf, dataset)[source]

Build OED single-frame model.

Parameters:
  • conf (Config) – Configuration object

  • dataset (object) – Dataset object

Returns:

Tuple of (model, criterion, postprocessors)

Return type:

tuple

Transformer module for OED model.

This module implements the cascaded decoders and transformer architecture for the OED model.

class m3sgg.core.models.oed.transformer.Transformer(conf)[source]

Bases: Module

Transformer with cascaded decoders for OED.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(conf)[source]

Initialize the transformer.

Parameters:

conf (Config) – Configuration object

Returns:

None

Return type:

None

forward(src, mask, query_embed, pos_embed, embed_dict=None, targets=None, cur_idx=0)[source]

Forward pass through the transformer.

Parameters:
  • src (torch.Tensor) – Source features

  • mask (torch.Tensor) – Attention mask

  • query_embed (torch.Tensor) – Query embeddings

  • pos_embed (torch.Tensor) – Position embeddings

  • embed_dict (dict, optional) – Dictionary of embedding layers, defaults to None

  • targets (dict, optional) – Ground truth targets, defaults to None

  • cur_idx (int, optional) – Current frame index, defaults to 0

Returns:

Tuple of outputs

Return type:

tuple

class m3sgg.core.models.oed.transformer.TransformerEncoder(encoder_layer, num_layers, norm=None)[source]

Bases: Module

Transformer encoder.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(encoder_layer, num_layers, norm=None)[source]

Initialize the encoder.

Parameters:
  • encoder_layer (TransformerEncoderLayer) – Encoder layer

  • num_layers (int) – Number of layers

  • norm (nn.Module, optional) – Normalization layer, defaults to None

Returns:

None

Return type:

None

forward(src, mask=None, src_key_padding_mask=None, pos=None)[source]

Forward pass through encoder.

Parameters:
  • src (torch.Tensor) – Source features

  • mask (torch.Tensor, optional) – Attention mask, defaults to None

  • src_key_padding_mask (torch.Tensor, optional) – Key padding mask, defaults to None

  • pos (torch.Tensor, optional) – Position embeddings, defaults to None

Returns:

Encoded features

Return type:

torch.Tensor

class m3sgg.core.models.oed.transformer.TransformerDecoder(decoder_layer, num_layers, norm=None)[source]

Bases: Module

Transformer decoder.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(decoder_layer, num_layers, norm=None)[source]

Initialize the decoder.

Parameters:
  • decoder_layer (TransformerDecoderLayer) – Decoder layer

  • num_layers (int) – Number of layers

  • norm (nn.Module, optional) – Normalization layer, defaults to None

Returns:

None

Return type:

None

forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos=None, query_pos=None)[source]

Forward pass through decoder.

Parameters:
  • tgt (torch.Tensor) – Target features

  • memory (torch.Tensor) – Memory features from encoder

  • tgt_mask (torch.Tensor, optional) – Target mask, defaults to None

  • memory_mask (torch.Tensor, optional) – Memory mask, defaults to None

  • tgt_key_padding_mask (torch.Tensor, optional) – Target key padding mask, defaults to None

  • memory_key_padding_mask (torch.Tensor, optional) – Memory key padding mask, defaults to None

  • pos (torch.Tensor, optional) – Position embeddings, defaults to None

  • query_pos (torch.Tensor, optional) – Query position embeddings, defaults to None

Returns:

Decoded features

Return type:

torch.Tensor

class m3sgg.core.models.oed.transformer.TransformerEncoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='relu', pre_norm=False)[source]

Bases: Module

Transformer encoder layer.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='relu', pre_norm=False)[source]

Initialize the encoder layer.

Parameters:
  • d_model (int) – Model dimension

  • nhead (int) – Number of attention heads

  • dim_feedforward (int, optional) – Feedforward dimension, defaults to 2048

  • dropout (float, optional) – Dropout rate, defaults to 0.1

  • activation (str, optional) – Activation function, defaults to “relu”

  • pre_norm (bool, optional) – Whether to use pre-norm, defaults to False

Returns:

None

Return type:

None

forward(src, src_mask=None, src_key_padding_mask=None, pos=None)[source]

Forward pass through encoder layer.

Parameters:
  • src (torch.Tensor) – Source features

  • src_mask (torch.Tensor, optional) – Source mask, defaults to None

  • src_key_padding_mask (torch.Tensor, optional) – Source key padding mask, defaults to None

  • pos (torch.Tensor, optional) – Position embeddings, defaults to None

Returns:

Output features

Return type:

torch.Tensor

class m3sgg.core.models.oed.transformer.TransformerDecoderLayer(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='relu', pre_norm=False)[source]

Bases: Module

Transformer decoder layer.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(d_model, nhead, dim_feedforward=2048, dropout=0.1, activation='relu', pre_norm=False)[source]

Initialize the decoder layer.

Parameters:
  • d_model (int) – Model dimension

  • nhead (int) – Number of attention heads

  • dim_feedforward (int, optional) – Feedforward dimension, defaults to 2048

  • dropout (float, optional) – Dropout rate, defaults to 0.1

  • activation (str, optional) – Activation function, defaults to “relu”

  • pre_norm (bool, optional) – Whether to use pre-norm, defaults to False

Returns:

None

Return type:

None

forward(tgt, memory, tgt_mask=None, memory_mask=None, tgt_key_padding_mask=None, memory_key_padding_mask=None, pos=None, query_pos=None)[source]

Forward pass through decoder layer.

Parameters:
  • tgt (torch.Tensor) – Target features

  • memory (torch.Tensor) – Memory features

  • tgt_mask (torch.Tensor, optional) – Target mask, defaults to None

  • memory_mask (torch.Tensor, optional) – Memory mask, defaults to None

  • tgt_key_padding_mask (torch.Tensor, optional) – Target key padding mask, defaults to None

  • memory_key_padding_mask (torch.Tensor, optional) – Memory key padding mask, defaults to None

  • pos (torch.Tensor, optional) – Position embeddings, defaults to None

  • query_pos (torch.Tensor, optional) – Query position embeddings, defaults to None

Returns:

Output features

Return type:

torch.Tensor

m3sgg.core.models.oed.transformer.build_transformer(conf)[source]

Build transformer network.

Parameters:

conf (Config) – Configuration object

Returns:

Transformer network

Return type:

Transformer

Criterion module for OED model.

This module implements the loss functions for training the OED model.

class m3sgg.core.models.oed.criterion.SetCriterionOED(num_obj_classes, num_queries, matcher, weight_dict, eos_coef, losses, conf)[source]

Bases: Module

Loss criterion for OED model.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(num_obj_classes, num_queries, matcher, weight_dict, eos_coef, losses, conf)[source]

Initialize the criterion.

Parameters:
  • num_obj_classes (int) – Number of object classes

  • num_queries (int) – Number of queries

  • matcher (object) – Hungarian matcher

  • weight_dict (dict) – Dictionary of loss weights

  • eos_coef (float) – End-of-sequence coefficient

  • losses (list) – List of loss types

  • conf (Config) – Configuration object

Returns:

None

Return type:

None

loss_obj_labels(outputs, targets, indices, num_interactions, log=True)[source]

Object classification loss.

Parameters:
  • outputs (dict) – Model outputs

  • targets (list) – Ground truth targets

  • indices (list) – Matched indices

  • num_interactions (int) – Number of interactions

  • log (bool, optional) – Whether to log errors, defaults to True

Returns:

Dictionary of losses

Return type:

dict

loss_obj_cardinality(outputs, targets, indices, num_interactions)[source]

Object cardinality loss.

Parameters:
  • outputs (dict) – Model outputs

  • targets (list) – Ground truth targets

  • indices (list) – Matched indices

  • num_interactions (int) – Number of interactions

Returns:

Dictionary of losses

Return type:

dict

loss_relation_labels(outputs, targets, indices, num_interactions)[source]

Relation classification loss.

Parameters:
  • outputs (dict) – Model outputs

  • targets (list) – Ground truth targets

  • indices (list) – Matched indices

  • num_interactions (int) – Number of interactions

Returns:

Dictionary of losses

Return type:

dict

loss_sub_obj_boxes(outputs, targets, indices, num_interactions)[source]

Bounding box loss.

Parameters:
  • outputs (dict) – Model outputs

  • targets (list) – Ground truth targets

  • indices (list) – Matched indices

  • num_interactions (int) – Number of interactions

Returns:

Dictionary of losses

Return type:

dict

get_loss(loss, outputs, targets, indices, num, **kwargs)[source]

Get loss function.

Parameters:
  • loss (str) – Loss type

  • outputs (dict) – Model outputs

  • targets (list) – Ground truth targets

  • indices (list) – Matched indices

  • num (int) – Number of interactions

Returns:

Loss dictionary

Return type:

dict

forward(outputs, targets)[source]

Forward pass for loss computation.

Parameters:
  • outputs (dict) – Model outputs

  • targets (list) – Ground truth targets

Returns:

Dictionary of losses

Return type:

dict

m3sgg.core.models.oed.criterion.accuracy(output, target, topk=(1,))[source]

Compute accuracy.

Parameters:
  • output (torch.Tensor) – Model output

  • target (torch.Tensor) – Ground truth target

  • topk (tuple, optional) – Top-k accuracy, defaults to (1,)

Returns:

Tuple of accuracies

Return type:

tuple
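A generic top-k accuracy computation in the spirit of this helper, shown as a self-contained sketch (not necessarily the project's exact implementation).

import torch

def topk_accuracy(output, target, topk=(1,)):
    """Top-k accuracy in percent (generic sketch)."""
    maxk = max(topk)
    _, pred = output.topk(maxk, dim=1, largest=True, sorted=True)   # [N, maxk]
    correct = pred.eq(target.view(-1, 1))                           # [N, maxk]
    return [correct[:, :k].any(dim=1).float().mean() * 100.0 for k in topk]

logits = torch.randn(8, 37)          # e.g. 37 object classes
labels = torch.randint(0, 37, (8,))
top1, top5 = topk_accuracy(logits, labels, topk=(1, 5))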

Postprocessing module for OED model.

This module handles the post-processing of model outputs for evaluation.

class m3sgg.core.models.oed.postprocess.PostProcessOED(conf)[source]

Bases: Module

Post-processing for OED model outputs.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(conf)[source]

Initialize the post-processor.

Parameters:

conf (Config) – Configuration object

Returns:

None

Return type:

None

forward(outputs, target_sizes)[source]

Post-process model outputs.

Parameters:
  • outputs (dict) – Model outputs

  • target_sizes (torch.Tensor) – Target image sizes

Returns:

List of processed predictions

Return type:

list

Utilities module for OED model.

This module provides utility functions for box operations and other helpers.

m3sgg.core.models.oed.utils.box_cxcywh_to_xyxy(x)[source]

Convert boxes from center format (cx, cy, w, h) to corner format (x1, y1, x2, y2).

Parameters:

x (torch.Tensor) – Boxes in center format

Returns:

Boxes in corner format

Return type:

torch.Tensor

m3sgg.core.models.oed.utils.box_xyxy_to_cxcywh(x)[source]

Convert boxes from corner format (x1, y1, x2, y2) to center format (cx, cy, w, h).

Parameters:

x (torch.Tensor) – Boxes in corner format

Returns:

Boxes in center format

Return type:

torch.Tensor

m3sgg.core.models.oed.utils.generalized_box_iou(boxes1, boxes2)[source]

Generalized Intersection over Union between two sets of boxes.

Parameters:
  • boxes1 (torch.Tensor) – First set of boxes in (x1, y1, x2, y2) format

  • boxes2 (torch.Tensor) – Second set of boxes in (x1, y1, x2, y2) format

Returns:

Generalized IoU matrix

Return type:

torch.Tensor

m3sgg.core.models.oed.utils.box_area(boxes)[source]

Compute the area of a set of bounding boxes.

Parameters:

boxes (torch.Tensor) – Bounding boxes in (x1, y1, x2, y2) format

Returns:

Areas of the boxes

Return type:

torch.Tensor
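The box utilities above follow the usual center/corner conventions; the self-contained sketch below shows generic equivalents of the conversions and the area computation for illustration (the project's versions may differ in detail).

import torch

def box_cxcywh_to_xyxy(x):
    """(cx, cy, w, h) -> (x1, y1, x2, y2)."""
    cx, cy, w, h = x.unbind(-1)
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

def box_xyxy_to_cxcywh(x):
    """(x1, y1, x2, y2) -> (cx, cy, w, h)."""
    x1, y1, x2, y2 = x.unbind(-1)
    return torch.stack([(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1], dim=-1)

def box_area(boxes):
    """Area of (x1, y1, x2, y2) boxes."""
    return (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])

b = torch.tensor([[0.5, 0.5, 0.2, 0.4]])     # one box in center format
xyxy = box_cxcywh_to_xyxy(b)                 # tensor([[0.4, 0.3, 0.6, 0.7]])
assert torch.allclose(box_xyxy_to_cxcywh(xyxy), b)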

m3sgg.core.models.oed.utils.nested_tensor_from_tensor_list(tensor_list)[source]

Convert a list of tensors to a nested tensor.

Parameters:

tensor_list (list) – List of tensors

Returns:

Nested tensor

Return type:

NestedTensor

class m3sgg.core.models.oed.utils.NestedTensor(tensors, mask)[source]

Bases: object

Nested tensor wrapper for efficient processing.

Parameters:

object (class) – Base object class

__init__(tensors, mask)[source]

Initialize the nested tensor.

Parameters:
  • tensors (torch.Tensor) – Batched tensors

  • mask (torch.Tensor) – Padding mask indicating valid regions

Returns:

None

Return type:

None

decompose()[source]

Decompose the nested tensor.

Returns:

Tuple of (tensors, mask)

Return type:

tuple

__repr__()[source]

String representation.

Returns:

String representation

Return type:

str
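A nested tensor packages variable-size images into one padded batch plus a boolean mask marking the padded region. The sketch below illustrates that idea generically; it is not the project's implementation.

import torch

def pad_to_nested(tensor_list):
    """Pad [C, H, W] tensors to a common size and return (batch, mask) - illustrative."""
    max_h = max(t.shape[1] for t in tensor_list)
    max_w = max(t.shape[2] for t in tensor_list)
    c = tensor_list[0].shape[0]
    batch = torch.zeros(len(tensor_list), c, max_h, max_w)
    mask = torch.ones(len(tensor_list), max_h, max_w, dtype=torch.bool)  # True = padding
    for i, t in enumerate(tensor_list):
        batch[i, :, : t.shape[1], : t.shape[2]] = t
        mask[i, : t.shape[1], : t.shape[2]] = False
    return batch, mask

imgs = [torch.randn(3, 200, 300), torch.randn(3, 180, 320)]
batch, mask = pad_to_nested(imgs)   # batch: [2, 3, 200, 320], mask: [2, 200, 320]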

VLM Model

class m3sgg.core.models.vlm.scene_graph_generator.VLMSceneGraphGenerator(mode='sgdet', attention_class_num=3, spatial_class_num=6, contact_class_num=17, obj_classes=None, model_name='apple/FastVLM-0.5B', device='cuda', few_shot_examples=None, use_chain_of_thought=True, use_tree_of_thought=False, confidence_threshold=0.5)[source]

Bases: Module

__init__(mode='sgdet', attention_class_num=3, spatial_class_num=6, contact_class_num=17, obj_classes=None, model_name='apple/FastVLM-0.5B', device='cuda', few_shot_examples=None, use_chain_of_thought=True, use_tree_of_thought=False, confidence_threshold=0.5)[source]

Initialize VLM Scene Graph Generator.

Parameters:
  • mode – Scene graph generation mode (sgdet, sgcls, predcls)

  • attention_class_num – Number of attention relationship classes

  • spatial_class_num – Number of spatial relationship classes

  • contact_class_num – Number of contact relationship classes

  • obj_classes – List of object classes

  • model_name – HuggingFace model name for VLM

  • device – Device to run inference on

  • few_shot_examples – Few-shot examples for prompting

  • use_chain_of_thought – Whether to use chain-of-thought reasoning

  • use_tree_of_thought – Whether to use tree-of-thought reasoning

  • confidence_threshold – Threshold for relationship confidence

forward(entry: Dict, im_data: Tensor | None = None) Dict[source]

Forward pass through VLM Scene Graph Generator.

Parameters:

entry – Input dictionary containing image data and bounding boxes

Returns:

Dictionary with scene graph predictions
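A hedged construction sketch for the VLM generator; the default model name comes from the signature above, while the class counts and class names here are placeholder assumptions, and loading the model requires the transformers library and network access.

# Usage sketch (illustrative). Class names/counts are placeholders; downloading
# 'apple/FastVLM-0.5B' requires the transformers library and network access.
from m3sgg.core.models.vlm.scene_graph_generator import VLMSceneGraphGenerator

generator = VLMSceneGraphGenerator(
    mode="predcls",
    attention_class_num=3,
    spatial_class_num=6,
    contact_class_num=17,
    obj_classes=["__background__", "person", "cup"],   # placeholder names
    model_name="apple/FastVLM-0.5B",
    device="cuda",
    use_chain_of_thought=True,
    confidence_threshold=0.5,
)
# `entry` holds image data and bounding boxes; forward() returns a dictionary
# of scene graph predictions:
# predictions = generator(entry)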

Transformer Components

class m3sgg.utils.transformer.TransformerEncoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Bases: Module

Transformer encoder layer with multi-head attention and feed-forward network.

Implements a single layer of the transformer encoder with self-attention mechanism, layer normalization, and position-wise feed-forward network.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Initialize the transformer encoder layer.

Parameters:
  • embed_dim (int, optional) – Embedding dimension, defaults to 1936

  • nhead (int, optional) – Number of attention heads, defaults to 4

  • dim_feedforward (int, optional) – Dimension of feed-forward network, defaults to 2048

  • dropout (float, optional) – Dropout probability, defaults to 0.1

Returns:

None

Return type:

None

forward(src, input_key_padding_mask)[source]

Forward pass through the transformer encoder layer.

Parameters:
  • src (torch.Tensor) – Input sequence features

  • input_key_padding_mask (torch.Tensor) – Key padding mask

Returns:

Transformed sequence and attention weights

Return type:

tuple

class m3sgg.utils.transformer.TransformerDecoderLayer(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Bases: Module

Transformer decoder layer with masked self-attention and cross-attention.

Implements a single layer of the transformer decoder with masked self-attention, encoder-decoder attention, and position-wise feed-forward network.

Parameters:

nn.Module (class) – Base PyTorch module class

__init__(embed_dim=1936, nhead=4, dim_feedforward=2048, dropout=0.1)[source]

Initialize the transformer decoder layer.

Parameters:
  • embed_dim (int, optional) – Embedding dimension, defaults to 1936

  • nhead (int, optional) – Number of attention heads, defaults to 4

  • dim_feedforward (int, optional) – Dimension of feed-forward network, defaults to 2048

  • dropout (float, optional) – Dropout probability, defaults to 0.1

Returns:

None

Return type:

None

forward(global_input, input_key_padding_mask, position_embed)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.utils.transformer.TransformerEncoder(encoder_layer, num_layers)[source]

Bases: Module

__init__(encoder_layer, num_layers)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(input, input_key_padding_mask)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.utils.transformer.TransformerDecoder(decoder_layer, num_layers, embed_dim)[source]

Bases: Module

__init__(decoder_layer, num_layers, embed_dim)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(global_input, input_key_padding_mask, position_embed)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

class m3sgg.utils.transformer.transformer(enc_layer_num=1, dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, mode=None)[source]

Bases: Module

Spatial Temporal Transformer.

Parameters:
  • local_attention (object) – spatial encoder

  • global_attention (object) – temporal decoder

  • position_embedding (object) – frame encoding (window_size*dim)

  • mode (str) – 'both' uses the features from both frames in the window; 'latter' uses only the features from the latter frame in the window

__init__(enc_layer_num=1, dec_layer_num=3, embed_dim=1936, nhead=8, dim_feedforward=2048, dropout=0.1, mode=None)[source]

Initialize internal Module state, shared by both nn.Module and ScriptModule.

forward(features, im_idx)[source]

Define the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
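A hedged sketch of the inputs this spatial-temporal transformer consumes, inferred from the forward(features, im_idx) signature and the default embed_dim of 1936: one feature row per subject-object pair and a frame index per row. The exact return values are not documented here, so the call is left commented.

# Sketch of input shapes for the spatial-temporal transformer (illustrative;
# the feature layout is an assumption based on the forward signature).
import torch
from m3sgg.utils.transformer import transformer

st_transformer = transformer(
    enc_layer_num=1,
    dec_layer_num=3,
    embed_dim=1936,
    nhead=8,
    mode="latter",
)

features = torch.randn(12, 1936)                              # one row per subject-object pair
im_idx = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2, 3, 3, 3, 3])   # frame index per pair
# output = st_transformer(features, im_idx)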