## Data Management

### Organize Datasets by Purpose

Structure your datasets around how they will be used, not just how the raw data is organized.

| Pattern | When to Use | Example |
|---|---|---|
| By task type | Different annotation workflows | pedestrian-detection, lane-segmentation |
| By data source | Multiple cameras or collection runs | front-camera-2026-02, lidar-top-2026-02 |
| By model version | Training successive model iterations | training-v1, training-v2, validation |
| By priority | Triage incoming data | urgent-review, standard-queue, backlog |
### Optimize File Sizes

Large files slow down uploads, viewer loading, and annotator productivity.

| Data Type | Recommended Max | Format Tips |
|---|---|---|
| Images | 20 MB | Use JPEG at 85-95% quality for photos; PNG only for diagrams or screenshots |
| Video | 2 GB | H.264 codec, 1080p resolution is sufficient for most annotation tasks |
| Point clouds | 500 MB per frame | Downsample to relevant density; remove ground points if not needed |
| MCAP bags | 5 GB | Split long recordings into shorter segments (2-5 minutes) |
| Gaussian Splats | 500 MB | Use compressed PLY format |
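As a pre-flight check before uploading, you can flag files that exceed these limits. A minimal sketch using only the standard library; the extension-to-limit mapping below is illustrative and not an Avala API:

```python
from pathlib import Path

# Illustrative per-extension size caps (bytes), based on the table above.
MAX_BYTES = {
    ".jpg": 20 * 1024**2,   # images: 20 MB
    ".png": 20 * 1024**2,
    ".mp4": 2 * 1024**3,    # video: 2 GB
    ".mcap": 5 * 1024**3,   # MCAP bags: 5 GB
    ".ply": 500 * 1024**2,  # point clouds / splats: 500 MB
}

def oversized(paths):
    """Return files that exceed the recommended max for their extension."""
    flagged = []
    for p in map(Path, paths):
        limit = MAX_BYTES.get(p.suffix.lower())
        if limit is not None and p.stat().st_size > limit:
            flagged.append(p)
    return flagged
```

Running this over a staging directory before upload catches oversized assets early, instead of after a slow failed transfer.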
### Use Cloud Storage for Large Datasets

For datasets over 10,000 items or 100 GB total, use cloud storage integration instead of direct uploads. Benefits:

- No data transfer: Avala reads directly from your S3 or GCS bucket
- Your encryption: Data stays encrypted with your KMS keys
- Your retention: Control lifecycle policies independently
- Faster onboarding: No upload step — just point Avala to your bucket
## API Usage

### Paginate Large Result Sets

Never fetch all records in a single request. Use cursor-based pagination to iterate through results efficiently.

### Respect Rate Limits

Avala enforces per-endpoint rate limits. Build retry logic into your integration from the start — don’t wait for production traffic to hit limits.

### Use Exports for Bulk Data Retrieval

Don’t loop through individual items to download annotations. Use the export API to generate a single export file containing all annotations for a dataset or project.

## Annotation Workflows
### Design Projects with Clear Instructions

Well-defined annotation guidelines reduce rework and improve consistency. Effective project setup checklist:

- Label taxonomy: Define all labels before annotating. Adding labels mid-project creates inconsistency.
- Examples: Provide 5-10 annotated examples for each label class, covering edge cases.
- Edge case rules: Document what to do with partially occluded objects, truncated objects at image boundaries, and ambiguous cases.
- Quality bar: Define what “good enough” looks like — perfect pixel-level accuracy is not always necessary.
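One lightweight way to keep a taxonomy fixed is to validate every incoming label against the agreed set before accepting it. A sketch; the label names are hypothetical examples, not a prescribed taxonomy:

```python
# Agreed label taxonomy, frozen before annotation starts (names are examples).
TAXONOMY = frozenset({"pedestrian", "vehicle", "cyclist", "traffic_sign"})

def validate_labels(annotations):
    """Reject any annotation whose label falls outside the frozen taxonomy."""
    unknown = {a["label"] for a in annotations} - TAXONOMY
    if unknown:
        raise ValueError(f"Labels outside taxonomy: {sorted(unknown)}")
    return annotations
```

Failing loudly at ingestion time is cheaper than discovering a stray label class after thousands of items have been annotated.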
### Use Multi-Stage Review Pipelines

For production annotation workflows, use a multi-stage review pipeline:

- Spot check: Randomly review 10-20% of submissions to identify systemic issues
- Targeted review: Focus reviews on annotations flagged by AutoTag or low-confidence predictions
- Full review: Reserve for high-value or safety-critical datasets
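The spot-check stage can be as simple as drawing a random sample of submitted items. A standard-library sketch; the 15% default is just a point inside the 10-20% range above:

```python
import random

def spot_check_sample(submission_ids, rate=0.15, seed=None):
    """Randomly select a fraction of submissions for manual review."""
    ids = list(submission_ids)
    if not ids:
        return []
    k = max(1, round(len(ids) * rate))  # always review at least one item
    return random.Random(seed).sample(ids, k)
```

Passing a seed makes the sample reproducible, which helps when you need to re-run the same spot check during an audit.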
### Leverage Consensus for Validation

For critical datasets, have multiple annotators label the same items independently. Consensus scoring identifies:

- Items where annotators disagree (review these first)
- Annotators who consistently deviate from the group
- Label classes that are ambiguously defined
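A minimal consensus check flags items whose annotators did not all agree, ordered so reviewers start with the most contested ones. A sketch; the data shape is an assumption, not an Avala format:

```python
from collections import Counter

def flag_disagreements(labels_by_item):
    """Given {item_id: [label per annotator]}, return item ids lacking full
    agreement, most contested first (lowest majority fraction)."""
    flagged = []
    for item_id, labels in labels_by_item.items():
        majority = Counter(labels).most_common(1)[0][1]
        agreement = majority / len(labels)
        if agreement < 1.0:
            flagged.append((item_id, agreement))
    return [item_id for item_id, _ in sorted(flagged, key=lambda x: x[1])]
```

The same per-item agreement fractions can be aggregated per annotator or per label class to surface the other two signals listed above.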
### Batch Work Effectively

Group work into batches of 100-500 items for optimal throughput:

| Batch Size | Pros | Cons |
|---|---|---|
| < 50 items | Quick turnaround | High overhead per item |
| 100-500 items | Good balance of throughput and review cycles | — |
| > 1000 items | Fewest batches to manage | Long wait for review; hard to catch errors early |
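Chunking a work queue into fixed-size batches is a one-liner; a sketch with a 250-item default from the recommended range:

```python
def make_batches(items, batch_size=250):
    """Split a list of work items into consecutive batches of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
```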
## Performance Optimization

### Optimize Upload Throughput

For large dataset uploads, parallelize your upload requests. See Performance Tuning for detailed concurrency recommendations and code examples.

### Optimize Export Performance
- Export by dataset, not by individual items
- Use COCO format for the fastest export generation
- For very large datasets (100K+ items), exports run asynchronously — poll the export status instead of waiting synchronously
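An asynchronous export can be polled with exponential backoff instead of a tight loop. In this sketch, `get_status` is a stand-in callable for your API client, not a real Avala SDK name:

```python
import time

def wait_for_export(get_status, export_id, max_wait=600.0):
    """Poll an async export with exponential backoff (1s, 2s, 4s, ... capped at 30s).

    `get_status(export_id)` is any callable returning "pending", "ready",
    or "failed" (a stand-in for your API client).
    """
    delay, waited = 1.0, 0.0
    while waited < max_wait:
        status = get_status(export_id)
        if status == "ready":
            return True
        if status == "failed":
            raise RuntimeError(f"Export {export_id} failed")
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, 30.0)  # cap the backoff interval
    raise TimeoutError(f"Export {export_id} not ready after {max_wait}s")
```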
### Monitor with the MCP Server

Use the MCP server to monitor your workflows from your IDE or AI assistant.

## Cost Management

### Right-Size Your Data

Not all data needs annotation. Filter before annotating:

- Remove duplicates: Deduplicate images/frames before uploading
- Sample strategically: For video, annotate every Nth frame instead of every frame (common: every 5th or 10th frame)
- Use active learning: Prioritize items where the model is least confident, not random sampling
- Pre-filter with models: Use Batch Auto-Labeling to auto-label easy cases and focus human annotation on hard cases
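The every-Nth-frame strategy above is plain list slicing; a sketch using the every-5th-frame example:

```python
def sample_frames(frames, every_nth=5):
    """Keep every Nth frame (frame 0, N, 2N, ...) for annotation."""
    return frames[::every_nth]
```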
### Minimize API Calls

| Instead of | Do this |
|---|---|
| Fetching items one at a time | Use list endpoints with pagination |
| Polling export status in a tight loop | Use exponential backoff (1s, 2s, 4s, 8s) |
| Re-fetching unchanged data | Cache responses with ETags or timestamps |
| Downloading all annotations | Use export API for bulk retrieval |
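The first row of the table (list endpoints with cursor pagination) can be sketched as follows; `fetch_page` and its `items`/`next_cursor` fields are assumptions standing in for the real API client, not documented Avala names:

```python
def iter_all_items(fetch_page):
    """Iterate a cursor-paginated list endpoint one page at a time.

    `fetch_page(cursor)` is a stand-in for your API client; it should return
    a dict like {"items": [...], "next_cursor": str or None}.
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if cursor is None:
            break
```

Because this is a generator, callers can stop early without fetching the remaining pages.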
## Security

### Protect Your API Keys
- Store keys in environment variables, never in source code
- Rotate keys periodically (generate new key, update integrations, delete old key)
- Use separate keys for development and production
- Keys are only displayed once at creation — store them securely
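Reading the key from the environment keeps it out of source control. A sketch; the variable name `AVALA_API_KEY` is an assumption, not a documented convention:

```python
import os

def load_api_key(var="AVALA_API_KEY"):
    """Read the API key from the environment; fail loudly if it is missing."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"Set the {var} environment variable (never hard-code keys)")
    return key
```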
### Use Cloud Storage with Least Privilege

When connecting S3 or GCS buckets, grant only the permissions Avala needs:

- Read-only for datasets: `s3:GetObject`, `s3:ListBucket`
- Read-write for exports: add `s3:PutObject`
- Never grant `s3:DeleteObject` unless absolutely necessary
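For S3, the read-only grant above corresponds to an IAM policy along these lines (the bucket name is a placeholder; verify the exact resource ARNs against your own setup):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-dataset-bucket",
        "arn:aws:s3:::your-dataset-bucket/*"
      ]
    }
  ]
}
```

Note that `s3:ListBucket` applies to the bucket ARN itself, while `s3:GetObject` applies to the object paths under it, so both resource entries are needed.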
## Next Steps
- Quickstart — Get up and running in under 60 seconds
- Examples — Code examples for common workflows
- Rate Limits — Understand API limits and retry strategies
- Cloud Storage — Connect your own S3 or GCS bucket