Datasets
========

This section provides detailed information about the datasets supported by M3SGG.

Supported Datasets
------------------

Action Genome
~~~~~~~~~~~~~

The Action Genome dataset is the primary dataset for video scene graph generation.

**Overview**

* **Type**: Video Scene Graph Dataset
* **Domain**: Human activities and object interactions
* **Size**: ~10,000 videos with dense annotations
* **Format**: MP4 videos with JSON annotations

**Dataset Structure**

.. code-block:: text

   action_genome/
   ├── annotations/     # Ground truth scene graph annotations
   │   ├── train/
   │   ├── val/
   │   └── test/
   ├── frames/         # Extracted video frames
   │   ├── video_001/
   │   ├── video_002/
   │   └── ...
   └── videos/         # Original video files
       ├── video_001.mp4
       ├── video_002.mp4
       └── ...

**Annotation Format**

Each annotation file contains:

.. code-block:: javascript

   {
     "video_id": "video_001",
     "frame_annotations": [
       {
         "frame_id": 1,
         "objects": [
           {
             "object_id": 1,
             "bbox": [100, 50, 200, 150],
             "class": "person",
             "attributes": ["adult", "standing"]
           }
         ],
         "relationships": [
           {
             "subject_id": 1,
             "object_id": 2,
             "predicate": "holding"
           }
         ]
       }
     ]
   }

**Download and Setup**

1. Visit https://www.actiongenome.org/#download
2. Download the complete dataset
3. Process using the ActionGenome Toolkit
4. Place in ``data/action_genome/`` directory

EASG Dataset
~~~~~~~~~~~~

The EASG (Enhanced Action Scene Graph) dataset provides additional annotations and features.

**Overview**

* **Type**: Enhanced Video Scene Graph Dataset
* **Domain**: Extended human activities with fine-grained annotations
* **Features**: Additional semantic features and temporal annotations

**Dataset Structure**

.. code-block:: text

   EASG/
   ├── EASG/
   │   ├── annotations/
   │   └── features/
   ├── frames/
   ├── features_verb.pt
   ├── verb_features.pt
   └── model_final.pth

**Setup Instructions**

TODO: Add detailed EASG setup instructions

Visual Genome Dataset
~~~~~~~~~~~~~~~~~~~~~

Visual Genome provides static image scene graphs that can be used for pretraining.

**Overview**

* **Type**: Static Image Scene Graph Dataset
* **Domain**: General object relationships in images
* **Size**: ~100,000 images with scene graph annotations

**Setup Instructions**

TODO: Add Visual Genome integration details

Dataset Processing
------------------

Data Preprocessing
~~~~~~~~~~~~~~~~~~

The framework includes several preprocessing utilities:

**Frame Extraction**

.. code-block:: python

   from m3sgg.datasets.action_genome import ActionGenomeDataset
   
   # Initialize dataset
   dataset = ActionGenomeDataset(
       data_path="data/action_genome",
       split="train",
       mode="predcls"
   )

**Annotation Processing**

.. code-block:: python

   # Load and process annotations
   annotations = dataset.load_annotations()
   processed_data = dataset.preprocess_annotations(annotations)

**Feature Extraction**

.. code-block:: python

   # Extract visual features
   features = dataset.extract_features(video_path)

Data Loading
~~~~~~~~~~~~

**Basic Usage**

.. code-block:: python

   from torch.utils.data import DataLoader
   from m3sgg.datasets.action_genome import ActionGenomeDataset
   
   # Create dataset
   dataset = ActionGenomeDataset(
       data_path="data/action_genome",
       split="train",
       mode="predcls"
   )
   
   # Create data loader
   dataloader = DataLoader(
       dataset,
       batch_size=1,
       shuffle=True,
       num_workers=4
   )
   
   # Iterate through data
   for batch in dataloader:
       frames, annotations, metadata = batch
       # Process batch...

**Advanced Configuration**

.. code-block:: python

   # Custom dataset configuration
   dataset = ActionGenomeDataset(
       data_path="data/action_genome",
       split="train",
       mode="predcls",
       filter_duplicate_relations=True,
       filter_multiple_preds=False,
       frame_sample_rate=1
   )

Dataset Statistics
------------------

Action Genome Statistics
~~~~~~~~~~~~~~~~~~~~~~~~

.. list-table:: Action Genome Dataset Statistics
   :widths: 25 25 25 25
   :header-rows: 1

   * - Split
     - Videos
     - Frames
     - Relationships
   * - Train
     - 7,842
     - 476,583
     - 1,752,524
   * - Validation
     - 1,960
     - 119,145
     - 438,131
   * - Test
     - 1,960
     - 119,170
     - 438,384

**Object Classes**

The dataset includes 35 object categories:

.. code-block:: text

   person, chair, table, cup, plate, food, bag, bed, book, laptop,
   phone, tv, remote, mouse, keyboard, bottle, wine_glass, fork,
   knife, spoon, bowl, banana, apple, sandwich, orange, broccoli,
   carrot, hot_dog, pizza, donut, cake, refrigerator, oven,
   microwave, toaster

**Relationship Predicates**

The dataset includes 25 relationship types:

.. code-block:: text

   looking_at, not_looking_at, unsure, above, beneath, in_front_of,
   behind, on_the_side_of, in, carrying, covered_by, drinking_from,
   eating, have_it_on_the_back, holding, leaning_on, lying_on,
   not_contacting, other_relationship, sitting_on, standing_on,
   touching, twisting, wearing, wiping

Quality Assurance
-----------------

Data Validation
~~~~~~~~~~~~~~~

The framework includes validation utilities:

.. code-block:: python

   from utils.validation import validate_dataset
   
   # Validate dataset integrity
   validation_report = validate_dataset("data/action_genome")
   print(validation_report)

**Common Validation Checks**

* File existence and accessibility
* Annotation format consistency
* Bounding box validity
* Frame-annotation alignment
* Missing or corrupted files

Performance Considerations
--------------------------

Loading Optimization
~~~~~~~~~~~~~~~~~~~~

* **Caching**: Enable feature caching for faster loading
* **Parallel Loading**: Use multiple workers for data loading
* **Memory Management**: Monitor memory usage with large datasets

.. code-block:: python

   # Optimized data loading
   dataloader = DataLoader(
       dataset,
       batch_size=4,
       shuffle=True,
       num_workers=8,
       pin_memory=True,
       persistent_workers=True
   )

Storage Requirements
~~~~~~~~~~~~~~~~~~~~

.. list-table:: Storage Requirements
   :widths: 30 35 35
   :header-rows: 1

   * - Dataset
     - Raw Size
     - Processed Size
   * - Action Genome
     - ~500GB
     - ~200GB
   * - EASG
     - ~100GB
     - ~50GB
   * - Visual Genome
     - ~15GB
     - ~10GB

Custom Datasets
---------------

Adding New Datasets
~~~~~~~~~~~~~~~~~~~

To add support for a new dataset:

1. Create a new dataloader class inheriting from base dataset
2. Implement required methods: ``__init__``, ``__len__``, ``__getitem__``
3. Add dataset-specific preprocessing functions
4. Update configuration files

.. code-block:: python

   from dataloader.base import BaseDataset
   
   class CustomDataset(BaseDataset):
       def __init__(self, data_path, split, mode):
           super().__init__(data_path, split, mode)
           # Custom initialization
       
       def __getitem__(self, idx):
           # Load and return data sample
           pass
       
       def __len__(self):
           # Return dataset size
           pass

Dataset Conversion
~~~~~~~~~~~~~~~~~~

Utilities for converting between dataset formats:

.. code-block:: bash

   # Convert from custom format to Action Genome format
   python scripts/datasets/convert_dataset.py --input custom_data --output action_genome_format

Next Steps
----------

* :doc:`training` - Learn how to train models on these datasets
* :doc:`api/dataloader` - Detailed API documentation for data loading
* :doc:`evaluation` - Understand evaluation metrics and procedures