Avala provides multiple ways to ingest data depending on your dataset size, infrastructure, and automation needs. This page covers each import method, when to use it, and how to build automated data pipelines.
## Import Methods Overview

| Method | Best For | Max Size | Automation | Setup |
|---|---|---|---|---|
| Mission Control upload | Small datasets, one-off imports | 5 GB | Manual | None |
| Presigned URL upload | Programmatic uploads from any language | 5 GB per file | Full | API key |
| Cloud storage (S3/GCS) | Large datasets, zero-copy access | Unlimited | Full | Bucket config |
| MCAP import | Multi-sensor robotics data | 10 GB per file | Full | API key |
| SDK bulk upload | Medium datasets with progress tracking | 5 GB per file | Full | SDK installed |
## Mission Control Upload
The simplest way to get data into Avala. Drag and drop files directly in the web interface.
### Steps
- Go to Mission Control > Datasets > Create Dataset
- Name your dataset and select the data type
- Drag files into the upload area or click Browse
- Wait for processing to complete
### Limitations
- Browser-based upload is limited by your connection speed and browser memory
- Not suitable for datasets with more than 1,000 files
- No resumable uploads — interrupted uploads must restart
For datasets larger than a few hundred files, use the SDK or presigned URL approach instead.
## Presigned URL Upload
Presigned URLs let you upload files directly to Avala’s storage from any HTTP client. This is the most flexible programmatic upload method and works from any language or tool that can make HTTP requests.
### How It Works
- Request a presigned upload URL from the Avala API
- Upload your file directly to the presigned URL using an HTTP PUT request
- Confirm the upload to register the item in the dataset
### Example: Upload with cURL
```bash
# Step 1: Get a presigned upload URL
curl -X POST https://api.avala.ai/api/v1/datasets/{dataset_uid}/items/upload-url/ \
  -H "X-Avala-Api-Key: your-api-key" \
  -H "Content-Type: application/json" \
  -d '{
    "filename": "frame_001.jpg",
    "content_type": "image/jpeg"
  }'

# Response:
# { "upload_url": "https://s3.amazonaws.com/...", "item_uid": "itm_abc123" }

# Step 2: Upload the file to the presigned URL
curl -X PUT "https://s3.amazonaws.com/..." \
  -H "Content-Type: image/jpeg" \
  --data-binary @frame_001.jpg

# Step 3: Confirm the upload
curl -X POST https://api.avala.ai/api/v1/datasets/{dataset_uid}/items/{item_uid}/confirm/ \
  -H "X-Avala-Api-Key: your-api-key"
```
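The same three-step flow can be scripted in Python using only the standard library. This is a hedged sketch: the endpoint paths and `X-Avala-Api-Key` header mirror the cURL example above, while the `upload_file` helper and its error handling are illustrative, not part of the official SDK.

```python
import json
import mimetypes
import os
import urllib.request

API_BASE = "https://api.avala.ai/api/v1"


def build_upload_request(filename: str) -> dict:
    """Build the JSON payload for the upload-url endpoint,
    guessing the content type from the file extension."""
    content_type, _ = mimetypes.guess_type(filename)
    return {
        "filename": os.path.basename(filename),
        "content_type": content_type or "application/octet-stream",
    }


def upload_file(dataset_uid: str, path: str, api_key: str) -> str:
    """Run the three-step presigned flow; returns the new item UID."""

    def call(url, data=None, headers=None, method="POST"):
        req = urllib.request.Request(
            url, data=data, method=method, headers=headers or {}
        )
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    auth = {"X-Avala-Api-Key": api_key, "Content-Type": "application/json"}
    payload = build_upload_request(path)

    # Step 1: request a presigned upload URL
    body = call(
        f"{API_BASE}/datasets/{dataset_uid}/items/upload-url/",
        data=json.dumps(payload).encode(),
        headers=auth,
    )
    info = json.loads(body)

    # Step 2: PUT the file bytes directly to storage
    with open(path, "rb") as f:
        call(
            info["upload_url"],
            data=f.read(),
            headers={"Content-Type": payload["content_type"]},
            method="PUT",
        )

    # Step 3: confirm the upload to register the item
    call(
        f"{API_BASE}/datasets/{dataset_uid}/items/{info['item_uid']}/confirm/",
        headers={"X-Avala-Api-Key": api_key},
    )
    return info["item_uid"]
```

For anything beyond a quick script, prefer the SDK below, which handles retries and batching for you.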
### Example: Upload with Python SDK
```python
import glob

from avala import Client

client = Client()
dataset = client.datasets.list(name="my-dataset").items[0]

# Upload a single file
client.datasets.upload_items(
    dataset_uid=dataset.uid,
    files=["path/to/image.jpg"]
)

# Upload a directory of files
files = glob.glob("data/images/*.jpg")
client.datasets.upload_items(
    dataset_uid=dataset.uid,
    files=files
)
```
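For very large directories, a single `upload_items` call can be split into fixed-size batches so you get progress feedback and smaller retry units. This is a sketch, not an SDK feature: the `batched` helper and the batch size of 100 are assumptions, and `client` is an `avala.Client` as in the example above.

```python
from typing import Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield successive fixed-size batches from a list of file paths."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def upload_in_batches(client, dataset_uid: str, files: List[str],
                      size: int = 100) -> None:
    """Upload files in batches, printing progress after each batch."""
    total = 0
    for batch in batched(files, size):
        client.datasets.upload_items(dataset_uid=dataset_uid, files=batch)
        total += len(batch)
        print(f"Uploaded {total}/{len(files)} files")
```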
## Cloud Storage Integration
For large-scale datasets, connect your own S3 or GCS bucket so Avala reads data directly from your storage — no file transfers, no copies.
### When to Use Cloud Storage

| Scenario | Use Cloud Storage? |
|---|---|
| Dataset > 10,000 items | Yes |
| Dataset > 100 GB total | Yes |
| Data must stay in your infrastructure | Yes |
| Quick prototype with < 100 items | No — direct upload is faster |
| Data is spread across multiple buckets | Yes — connect multiple storage configs |
### Setup
- Configure your bucket with the appropriate IAM policy (see Cloud Storage guide)
- Add the storage configuration in Mission Control > Settings > Storage
- Create a dataset and select your connected storage as the data source
- Reference items by their storage paths
### Example: Create Dataset from S3
```python
from avala import Client

client = Client()

# Create a dataset backed by cloud storage
dataset = client.datasets.create(
    name="driving-data-2026-02",
    data_type="image",
    storage_config_uid="stg_your_config_uid"
)

# Register items by their S3 paths
items = [
    {"path": "s3://your-bucket/captures/frame_001.jpg"},
    {"path": "s3://your-bucket/captures/frame_002.jpg"},
    {"path": "s3://your-bucket/captures/frame_003.jpg"},
]

for item in items:
    client.datasets.create_item(
        dataset_uid=dataset.uid,
        source_url=item["path"]
    )
```
Cloud storage datasets load faster in the annotation editor because images are served directly from your bucket’s region, avoiding cross-region transfers.
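For real capture runs you will rarely type paths by hand. As a hedged sketch, the helper below generates zero-padded keys for a numbered frame sequence and registers each one with `create_item` as in the example above; the bucket name, prefix, and three-digit padding scheme are illustrative assumptions about your layout.

```python
from typing import List


def frame_paths(bucket: str, prefix: str, count: int,
                start: int = 1) -> List[str]:
    """Build zero-padded S3 keys like s3://bucket/prefix/frame_001.jpg."""
    return [
        f"s3://{bucket}/{prefix}/frame_{i:03d}.jpg"
        for i in range(start, start + count)
    ]


def register_frames(client, dataset_uid: str, paths: List[str]) -> None:
    """Register each S3 path as a dataset item."""
    for path in paths:
        client.datasets.create_item(dataset_uid=dataset_uid, source_url=path)
```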
## MCAP Import
MCAP files contain synchronized multi-sensor data (cameras, LiDAR, IMU). Avala parses MCAP files to extract and align sensor streams for annotation.
### Supported Message Types

| Message Type | Description |
|---|---|
| `sensor_msgs/Image` | Camera images |
| `sensor_msgs/CompressedImage` | Compressed camera images |
| `sensor_msgs/PointCloud2` | LiDAR point clouds |
| `sensor_msgs/Imu` | IMU readings |
| `geometry_msgs/TransformStamped` | Sensor transforms (TF) |
| `sensor_msgs/NavSatFix` | GPS coordinates |
### Import Workflow
- Upload MCAP files via the SDK or presigned URLs
- Avala processes the file, extracting camera frames and point cloud scans
- Sensor streams are synchronized by timestamp
- Camera images and projected LiDAR data appear together in the annotation editor
For detailed MCAP setup, see the MCAP / ROS integration guide.
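To build intuition for the synchronization step above, here is a simplified nearest-timestamp matcher: each camera timestamp is paired with the closest LiDAR timestamp, and pairs whose gap exceeds a tolerance are dropped. This illustrates the general idea only; it is not Avala's actual implementation, and the 50 ms tolerance is an arbitrary assumption.

```python
import bisect
from typing import List, Tuple


def sync_streams(camera_ts: List[float], lidar_ts: List[float],
                 tolerance: float = 0.05) -> List[Tuple[float, float]]:
    """Pair each camera timestamp with the nearest LiDAR timestamp,
    dropping pairs whose gap exceeds the tolerance (seconds).
    Both input lists must be sorted in ascending order."""
    pairs = []
    for t in camera_ts:
        i = bisect.bisect_left(lidar_ts, t)
        # The nearest LiDAR timestamp is one of the two sorted neighbors
        candidates = [lidar_ts[j] for j in (i - 1, i) if 0 <= j < len(lidar_ts)]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda x: abs(x - t))
        if abs(nearest - t) <= tolerance:
            pairs.append((t, nearest))
    return pairs
```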
## Building Import Pipelines
For production workflows, automate data ingestion so new data flows into Avala as it is collected.
### Pipeline Architecture

```
  Data Source                         Avala
┌──────────────┐           ┌──────────────────┐
│  Collection  │           │      Dataset     │
│    System    │──upload──→│  (items created) │
│  (cameras,   │           │                  │
│   sensors)   │           │      Project     │
└──────────────┘           │ (tasks assigned) │
                           └────────┬─────────┘
                                    │
                            webhook │
                                    ▼
                           ┌──────────────────┐
                           │  Your Pipeline   │
                           │ (export, train)  │
                           └──────────────────┘
```
### Example: Automated Ingestion with Webhooks
Combine the SDK upload with webhooks to build a fully automated pipeline:
```python
# upload_pipeline.py
import glob
import os

from avala import Client

client = Client()
DATASET_UID = os.environ["AVALA_DATASET_UID"]


def ingest_new_data(data_directory: str) -> int:
    """Upload all new images from a directory to Avala."""
    files = glob.glob(os.path.join(data_directory, "*.jpg"))
    if not files:
        return 0
    client.datasets.upload_items(
        dataset_uid=DATASET_UID,
        files=files
    )
    return len(files)


if __name__ == "__main__":
    count = ingest_new_data("/data/incoming")
    print(f"Uploaded {count} items")
Schedule this script with cron, Airflow, or any task scheduler to periodically ingest new data.
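Note that the script above uploads everything matching the glob on every run. A simple manifest of already-uploaded paths avoids duplicates across runs; this is a sketch under stated assumptions, where the manifest is a JSON list at a path you choose, and the helper names are illustrative.

```python
import json
import os
from typing import List, Set


def load_manifest(path: str) -> Set[str]:
    """Return the set of file paths recorded as already uploaded."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(json.load(f))


def select_new_files(candidates: List[str], manifest: Set[str]) -> List[str]:
    """Keep only files not yet recorded in the manifest."""
    return [p for p in candidates if p not in manifest]


def save_manifest(path: str, uploaded: Set[str]) -> None:
    """Persist the updated manifest after a successful upload batch."""
    with open(path, "w") as f:
        json.dump(sorted(uploaded), f)
```

In `ingest_new_data`, filter `files` through `select_new_files` before uploading, then call `save_manifest` with the union of old and new paths once the upload succeeds.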
### Example: Watch Directory and Upload
```bash
#!/bin/bash
# watch_and_upload.sh - Upload new files as they appear

WATCH_DIR="/data/incoming"
DATASET_UID="ds_abc123"

inotifywait -m -e create "$WATCH_DIR" --format '%f' | while read -r filename; do
  if [[ "$filename" == *.jpg || "$filename" == *.png ]]; then
    avala datasets upload-items "$DATASET_UID" "$WATCH_DIR/$filename"
    echo "Uploaded: $filename"
  fi
done
```
## Choosing an Import Method

Use this decision tree to select the right approach:

| Question | If Yes | If No |
|---|---|---|
| Fewer than 100 files? | Mission Control upload | Continue |
| Data already in S3/GCS? | Cloud storage integration | Continue |
| MCAP or ROS bag files? | MCAP import | Continue |
| Need automation? | SDK bulk upload or presigned URLs | Mission Control upload |
| Using Python or TypeScript? | SDK bulk upload | Presigned URL (any language) |
## Next Steps