Usage Guide

Modified: August 29, 2025

This guide provides comprehensive instructions for running data migrations with the omeka2dsp system. Follow the best practices and monitoring guidelines below to ensure smooth and reliable data transfers.

Quick Start

Pre-flight Checklist

Before running a migration, verify:

  • DSP credentials are set in the environment (e.g. DSP_USER and DSP_PWD)
  • The target Omeka collection identifier (ITEM_SET_ID) is configured
  • Both the Omeka and DSP APIs are reachable over the network
  • Sufficient disk space is available for temporary media downloads

Basic Migration

Run a complete migration of your Omeka collection:

# Navigate to project directory
cd omeka2dsp

# Run full migration
uv run python scripts/data_2_dasch.py

# Or explicitly specify all_data mode
uv run python scripts/data_2_dasch.py -m all_data

Expected output:

2024-01-15 10:30:15 - INFO - Login successful
2024-01-15 10:30:16 - INFO - project Iri: http://rdfh.ch/projects/IbwoJlv8SEa6L13vXyCzMg
2024-01-15 10:30:17 - INFO - Got Lists from project
2024-01-15 10:30:18 - INFO - Processing item: abb13025
2024-01-15 10:30:19 - INFO - Resource created successfully
2024-01-15 10:30:20 - INFO - Uploaded media: image.jpg
...

Processing Modes

The system supports three processing modes to handle different use cases:

1. All Data Mode (Production)

Processes the entire Omeka collection.

uv run python scripts/data_2_dasch.py -m all_data

Use Cases:

  • Production migrations
  • Complete data transfers
  • Initial system setup

Considerations:

  • Can take several hours for large collections
  • Requires stable network connection
  • Monitor disk space for media downloads

2. Sample Data Mode (Testing)

Processes a random sample of items for testing purposes.

uv run python scripts/data_2_dasch.py -m sample_data

Configuration:

Edit NUMBER_RANDOM_OBJECTS in data_2_dasch.py:

NUMBER_RANDOM_OBJECTS = 10  # Process 10 random items
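
The sampling itself is conceptually just a random draw from the fetched collection. A minimal sketch (not the script's actual code), assuming the Omeka items are already loaded into a list:

import random

NUMBER_RANDOM_OBJECTS = 10

def pick_sample(items: list[dict], n: int = NUMBER_RANDOM_OBJECTS) -> list[dict]:
    # Draw n distinct items; if the collection is smaller than n, take everything.
    return random.sample(items, min(n, len(items)))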

Use Cases:

  • Testing configuration changes
  • Validating data transformations
  • Performance testing with smaller datasets

3. Test Data Mode (Development)

Processes specific predefined items for development and debugging.

uv run python scripts/data_2_dasch.py -m test_data

Configuration:

Edit TEST_DATA set in data_2_dasch.py:

TEST_DATA = {
    'abb13025',  # Specific item identifier
    'abb14375',  # Another item identifier
    'abb41033'   # Add more as needed
}
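
Conceptually, this mode filters the fetched items down to the listed identifiers. A hypothetical sketch, assuming each item exposes its identifier under a single key (the real script may read a different field):

TEST_DATA = {'abb13025', 'abb14375', 'abb41033'}

def pick_test_items(items: list[dict], id_field: str = 'identifier') -> list[dict]:
    # Keep only items whose identifier appears in TEST_DATA.
    # 'id_field' is a placeholder; the actual key used by the script may differ.
    return [item for item in items if item.get(id_field) in TEST_DATA]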

Use Cases:

  • Debugging specific items
  • Testing edge cases
  • Development workflow

Command Line Usage

Basic Syntax

uv run python scripts/data_2_dasch.py [OPTIONS]

Available Options

Option        Description                                           Default
-m, --mode    Processing mode: all_data, sample_data, test_data     all_data
-h, --help    Show help message and exit                            -
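
The interface amounts to a single mode argument. For reference, an equivalent argparse definition might look like the sketch below (illustrative only; the script's actual parser may differ in wording):

import argparse

parser = argparse.ArgumentParser(description='Migrate Omeka data to DSP')
parser.add_argument(
    '-m', '--mode',
    choices=['all_data', 'sample_data', 'test_data'],
    default='all_data',
    help='Processing mode',
)
args = parser.parse_args()
print(f'Running in {args.mode} mode')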

Examples

# Process all data (default)
uv run python scripts/data_2_dasch.py

# Process sample data with explicit mode
uv run python scripts/data_2_dasch.py --mode sample_data

# Process test data
uv run python scripts/data_2_dasch.py -m test_data

# Show help
uv run python scripts/data_2_dasch.py --help

Monitoring Migration Progress

Real-time Monitoring

The script provides real-time progress information through console output:

# Run with output to terminal and file
uv run python scripts/data_2_dasch.py 2>&1 | tee migration.log

Log Files

Monitor detailed progress in log files:

# Watch log file in real-time
tail -f data_2_dasch.log

# Search for errors
grep "ERROR" data_2_dasch.log

# Count successful creations
grep "Resource created successfully" data_2_dasch.log | wc -l

Progress Indicators

Monitor these key indicators:

Successful Operations

INFO - Resource created successfully
INFO - Resource updated successfully  
INFO - Uploaded media: filename.jpg
INFO - Media linked to parent resource

Warning Indicators

WARNING - Resource already exists, checking for updates
WARNING - Media file not found: missing.jpg
WARNING - List value not found for: unknown_value

Error Indicators

ERROR - Authentication failed
ERROR - Failed to create resource
ERROR - File upload failed
ERROR - Network timeout

Performance Metrics

Track performance metrics:

# Count total items processed
grep "Processing item:" data_2_dasch.log | wc -l

# Average processing time per item
grep "Processing time:" data_2_dasch.log | awk '{sum+=$4; count++} END {print sum/count}'

# Failed operations
grep "FAILED" data_2_dasch.log | wc -l

Data Synchronization

The system automatically handles synchronization between Omeka and DSP:

Initial Migration

For new resources:

  1. System checks if resource exists in DSP
  2. If not found, creates new resource
  3. Uploads and links associated media files
  4. Logs creation success

Incremental Updates

For existing resources:

  1. System retrieves current DSP data
  2. Compares with current Omeka data
  3. Identifies differences (additions, deletions, changes); see the sketch after this list
  4. Applies only necessary updates
  5. Logs synchronization results
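
The difference calculation in step 3 boils down to a per-property set comparison. A simplified sketch, assuming the values have already been flattened to comparable strings (the real script compares full DSP value objects):

def diff_values(omeka_values: set[str], dsp_values: set[str]) -> dict[str, set[str]]:
    # Values only in Omeka must be created in DSP; values only in DSP must be
    # deleted; values present in both are left untouched.
    return {
        'create': omeka_values - dsp_values,
        'delete': dsp_values - omeka_values,
        'keep': omeka_values & dsp_values,
    }

# Example: one value added in Omeka, one removed
print(diff_values({'a', 'b'}, {'b', 'c'}))
# {'create': {'a'}, 'delete': {'c'}, 'keep': {'b'}}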

Synchronization Process Flow

flowchart TD
    A[Process Omeka Item] --> B[Check DSP Resource Exists]
    B -->|Not Found| C[Create New Resource]
    B -->|Found| D[Compare Data]
    
    C --> E[Upload Media Files]
    D --> F{Has Changes?}
    F -->|Yes| G[Update Resource]
    F -->|No| H[Skip - No Changes]
    
    G --> I[Apply Updates]
    I --> E
    E --> J[Link Media to Resource]
    H --> K[Process Next Item]
    J --> K
    
    style C fill:#e3f2fd
    style G fill:#fff8e1
    style H fill:#e8f5e8

Conflict Resolution

The system handles common conflicts:

Value Changes

  • Omeka has new value, DSP has old: Updates DSP value
  • Omeka removes value, DSP has value: Deletes DSP value
  • Both systems have different values: Omeka value takes precedence

Array Properties

  • Omeka adds item: Creates new value in DSP
  • Omeka removes item: Deletes value from DSP
  • Complex changes: Calculates minimal set of operations

Media Files

  • File changed in Omeka: Re-uploads file to DSP
  • File removed from Omeka: Logs warning (manual review needed)
  • File corrupted: Skips file, logs error

File Handling

Media Processing Workflow

sequenceDiagram
    participant S as Script
    participant O as Omeka
    participant T as Temp Storage
    participant D as DSP
    
    S->>O: Request media file
    O->>S: Return file URL
    S->>O: Download file stream
    O->>T: Stream file data
    T->>S: Temporary file created
    
    S->>S: Determine file type
    S->>S: Check if compression needed
    
    alt File > 10MB
        S->>T: Create ZIP archive
        T->>S: Compressed file ready
    end
    
    S->>D: Upload file (multipart)
    D->>S: Return internal filename
    S->>S: Update resource with filename
    S->>T: Clean up temporary files
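
The same workflow can be approximated with the standard library and requests. The sketch below is illustrative only: the ingest URL, form field name, response key, and the 10MB threshold are assumptions taken from the diagram, not the script's exact implementation.

import os
import tempfile
import zipfile

import requests

COMPRESSION_THRESHOLD = 10 * 1024 * 1024  # 10MB, as in the diagram above

def fetch_and_prepare(media_url: str) -> str:
    # Stream the Omeka file into a temporary file.
    response = requests.get(media_url, stream=True, timeout=60)
    response.raise_for_status()
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        for chunk in response.iter_content(chunk_size=8192):
            tmp.write(chunk)
        path = tmp.name
    # Large files are wrapped in a ZIP archive before upload.
    if os.path.getsize(path) > COMPRESSION_THRESHOLD:
        zipped = path + '.zip'
        with zipfile.ZipFile(zipped, 'w', zipfile.ZIP_DEFLATED) as archive:
            archive.write(path, arcname=os.path.basename(media_url))
        os.remove(path)
        return zipped
    return path

def upload_to_dsp(path: str, ingest_url: str, token: str) -> str:
    # Multipart upload; endpoint and response key are placeholders.
    with open(path, 'rb') as handle:
        response = requests.post(
            ingest_url,
            files={'file': handle},
            headers={'Authorization': f'Bearer {token}'},
            timeout=300,
        )
    response.raise_for_status()
    return response.json().get('internalFilename', '')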

Supported File Types

Category    MIME Types                                       DSP Class
Images      image/jpeg, image/png, image/gif, image/tiff     sgb_MEDIA_IMAGE
Documents   application/pdf, text/plain, text/html           sgb_MEDIA_ARCHIV
Archives    application/zip, application/x-tar               sgb_MEDIA_ARCHIV
Other       All other types                                  sgb_MEDIA_ARCHIV (default)

File Processing Options

Configure file handling behavior:

# Edit in data_2_dasch.py
FILE_PROCESSING_CONFIG = {
    'max_file_size': 100 * 1024 * 1024,  # 100MB
    'compression_threshold': 10 * 1024 * 1024,  # 10MB
    'supported_formats': [
        'image/jpeg', 'image/png', 'application/pdf'
    ],
    'skip_large_files': False,  # Skip files exceeding max size
    'create_thumbnails': False  # Generate thumbnails (if supported)
}
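
A small helper shows how such a configuration could drive the per-file decision. This is a sketch using the FILE_PROCESSING_CONFIG dictionary above; the actual script may apply these options differently:

def decide_handling(size_bytes: int, config: dict) -> str:
    # Returns 'skip', 'compress', or 'upload' based on the thresholds above.
    if config['skip_large_files'] and size_bytes > config['max_file_size']:
        return 'skip'
    if size_bytes > config['compression_threshold']:
        return 'compress'
    return 'upload'

# Example: a 15MB file exceeds the 10MB threshold, so it gets compressed.
print(decide_handling(15 * 1024 * 1024, FILE_PROCESSING_CONFIG))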

Troubleshooting Common Issues

For problems not covered here, refer to the comprehensive Troubleshooting Guide.

Authentication Problems

Issue: Login failed or token expired

# Test authentication
uv run python -c "
from scripts.data_2_dasch import login
import os
try:
    token = login(os.getenv('DSP_USER'), os.getenv('DSP_PWD'))
    print('Authentication successful')
except Exception as e:
    print(f'Authentication failed: {e}')
"

Solutions:

  • Verify username and password
  • Check if account is locked
  • Ensure API endpoints are correct

Network Issues

Issue: Connection timeouts or failures

# Test network connectivity
curl -v https://api.dasch.swiss/health
ping api.dasch.swiss

Solutions:

  • Check internet connection
  • Verify firewall settings
  • Consider proxy configuration

Data Validation Errors

Issue: Invalid payload or data format errors

# Check specific item data
uv run python -c "
import json
from scripts.process_data_from_omeka import get_items_from_collection
items = get_items_from_collection('$ITEM_SET_ID')
problem_item = next(item for item in items if item['o:id'] == 'PROBLEM_ID')
print(json.dumps(problem_item, indent=2))
"

Solutions:

  • Verify data model compatibility
  • Check required field presence
  • Validate list value mappings

File Upload Failures

Issue: Media files fail to upload

# Check file accessibility
curl -I "https://omeka.unibe.ch/files/original/98d8559515187ec4a710347c7b9e6cda0bdd58d2.tif"

Solutions:

  • Verify file exists and is accessible
  • Check file size limits
  • Ensure stable network connection
  • Retry upload for temporary failures (a simple backoff sketch follows below)
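
For transient failures, a retry with exponential backoff is usually enough. A generic sketch, assuming an upload callable that raises requests.exceptions.RequestException on failure (the names are illustrative, not the script's own helpers):

import time

import requests

def upload_with_retry(upload_fn, *args, attempts: int = 3, base_delay: float = 2.0):
    # Call the upload function, doubling the wait after each failed attempt.
    for attempt in range(1, attempts + 1):
        try:
            return upload_fn(*args)
        except requests.exceptions.RequestException as error:
            if attempt == attempts:
                raise
            wait = base_delay * 2 ** (attempt - 1)
            print(f'Upload failed ({error}), retrying in {wait:.0f}s')
            time.sleep(wait)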

Best Practices

Pre-Migration Planning

  1. Data Assessment

    # Count items in collection
    uv run python -c "
    from scripts.process_data_from_omeka import get_items_from_collection
    items = get_items_from_collection('$ITEM_SET_ID')
    print(f'Total items: {len(items)}')
    
    # Rough estimate assuming about one item per second
    estimated_minutes = len(items) / 60
    print(f'Estimated time: {estimated_minutes:.1f} minutes')
    "
  2. Test Run

    # Always start with sample data
    uv run python scripts/data_2_dasch.py -m sample_data
  3. Backup Strategy

    # Create backup of current DSP state
    # (Use DSP export tools if available)

During Migration

  1. Monitor Progress

    # Run in screen/tmux for long migrations
    screen -S migration
    uv run python scripts/data_2_dasch.py 2>&1 | tee migration_$(date +%Y%m%d_%H%M%S).log
  2. Resource Management

    # Monitor system resources
    htop
    df -h  # Check disk space
  3. Error Handling

    # Monitor for errors in real-time
    tail -f data_2_dasch.log | grep -E "(ERROR|FAILED)"

Post-Migration

  1. Validation

    # Count migrated resources
    grep "Resource created successfully" data_2_dasch.log | wc -l
    grep "Resource updated successfully" data_2_dasch.log | wc -l
  2. Quality Assurance

    # Check for missing media
    grep "Media file not found" data_2_dasch.log
    
    # Check for validation errors
    grep "Validation failed" data_2_dasch.log
  3. Documentation

    # Create migration report
    echo "Migration completed on $(date)" > migration_report.txt
    echo "Items processed: $(grep 'Processing item:' data_2_dasch.log | wc -l)" >> migration_report.txt
    echo "Resources created: $(grep 'Resource created successfully' data_2_dasch.log | wc -l)" >> migration_report.txt
    echo "Resources updated: $(grep 'Resource updated successfully' data_2_dasch.log | wc -l)" >> migration_report.txt
    echo "Errors encountered: $(grep 'ERROR' data_2_dasch.log | wc -l)" >> migration_report.txt

Performance Optimization

  1. Batch Processing
    • Process large collections in smaller batches (see the sketch after this list)
    • Use sample_data mode with increasing sample sizes
  2. Network Optimization
    • Ensure stable, high-speed internet connection
    • Consider running on server closer to DSP infrastructure
  3. Resource Management
    • Monitor memory usage for large files
    • Clean temporary files regularly
    • Use SSD storage for better I/O performance
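
Batching (item 1) can be as simple as slicing the item list before handing it to the processing loop. A minimal sketch, assuming the items have already been fetched from Omeka:

def batches(items: list, size: int = 500):
    # Yield consecutive slices of at most `size` items.
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical usage:
# for batch in batches(all_items, size=500):
#     process(batch)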

Error Recovery

  1. Resuming Interrupted Migrations

    # The system automatically skips existing resources
    # Simply re-run the same command
    uv run python scripts/data_2_dasch.py -m all_data
  2. Selective Re-processing

    # Process specific problematic items
    # Edit TEST_DATA with failed identifiers
    uv run python scripts/data_2_dasch.py -m test_data
  3. Clean Up and Retry

    # Clean temporary files
    rm -rf /tmp/omeka2dsp_*
    
    # Clear logs for fresh start
    rm data_2_dasch.log
    
    # Retry migration
    uv run python scripts/data_2_dasch.py