Usage Guide
Complete guide for running data migrations with the omeka2dsp system. Follow the best practices and monitoring guidelines below to ensure smooth and reliable data transfers.
Quick Start
Pre-flight Checklist
Before running a migration, verify:
- DSP credentials (`DSP_USER`, `DSP_PWD`) and API endpoints are configured
- The Omeka collection (`ITEM_SET_ID`) is accessible
- The network connection to the DSP API is stable
- Sufficient disk space is available for media downloads
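A short script can fail fast if required settings are missing. This is a minimal sketch, assuming the environment variable names used elsewhere in this guide (`DSP_USER`, `DSP_PWD`, `ITEM_SET_ID`):

```python
import os
import sys

# Variable names as used in the examples in this guide; adjust to your setup.
REQUIRED_VARS = ["DSP_USER", "DSP_PWD", "ITEM_SET_ID"]

missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
    sys.exit(1)
print("All required environment variables are set")
```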
Basic Migration
Run a complete migration of your Omeka collection:
```bash
# Navigate to project directory
cd omeka2dsp

# Run full migration
uv run python scripts/data_2_dasch.py

# Or explicitly specify all_data mode
uv run python scripts/data_2_dasch.py -m all_data
```
Expected output:
```
2024-01-15 10:30:15 - INFO - Login successful
2024-01-15 10:30:16 - INFO - project Iri: http://rdfh.ch/projects/IbwoJlv8SEa6L13vXyCzMg
2024-01-15 10:30:17 - INFO - Got Lists from project
2024-01-15 10:30:18 - INFO - Processing item: abb13025
2024-01-15 10:30:19 - INFO - Resource created successfully
2024-01-15 10:30:20 - INFO - Uploaded media: image.jpg
...
```
Processing Modes
The system supports three processing modes to handle different use cases:
1. All Data Mode (Production)
Processes the entire Omeka collection.
```bash
uv run python scripts/data_2_dasch.py -m all_data
```
Use Cases:
- Production migrations
- Complete data transfers
- Initial system setup
Considerations:
- Can take several hours for large collections
- Requires stable network connection
- Monitor disk space for media downloads (a quick check is sketched below)
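As a rough pre-check for the disk-space point above, the standard library can report free space on the filesystem used for temporary downloads. A sketch; the `/tmp` path and the 10 GB threshold are illustrative assumptions, not project requirements:

```python
import shutil

# Free space on the filesystem holding temporary media downloads.
total, used, free = shutil.disk_usage("/tmp")  # adjust path to your temp dir
free_gb = free / (1024 ** 3)
print(f"Free space: {free_gb:.1f} GB")
if free_gb < 10:  # illustrative threshold; size it to your collection
    print("Warning: low disk space for media downloads")
```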
2. Sample Data Mode (Testing)
Processes a random sample of items for testing purposes.
```bash
uv run python scripts/data_2_dasch.py -m sample_data
```
Configuration:
Edit `NUMBER_RANDOM_OBJECTS` in `data_2_dasch.py`:

```python
NUMBER_RANDOM_OBJECTS = 10  # Process 10 random items
```
Use Cases:
- Testing configuration changes
- Validating data transformations
- Performance testing with smaller datasets
3. Test Data Mode (Development)
Processes specific predefined items for development and debugging.
```bash
uv run python scripts/data_2_dasch.py -m test_data
```
Configuration:
Edit the `TEST_DATA` set in `data_2_dasch.py`:

```python
TEST_DATA = {
    'abb13025',  # Specific item identifier
    'abb14375',  # Another item identifier
    'abb41033'   # Add more as needed
}
```
Use Cases:
- Debugging specific items
- Testing edge cases
- Development workflow
Command Line Usage
Basic Syntax
```bash
uv run python scripts/data_2_dasch.py [OPTIONS]
```
Available Options
| Option | Description | Default |
|---|---|---|
| `-m, --mode` | Processing mode: `all_data`, `sample_data`, `test_data` | `all_data` |
| `-h, --help` | Show help message and exit | - |
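For orientation, the options above correspond to a small `argparse` setup along these lines. This is an illustrative reconstruction, not the actual argument parsing in `data_2_dasch.py`:

```python
import argparse

# Hypothetical sketch of the CLI described in the table above.
parser = argparse.ArgumentParser(description="Migrate Omeka items to DSP")
parser.add_argument(
    "-m", "--mode",
    choices=["all_data", "sample_data", "test_data"],
    default="all_data",
    help="Processing mode",
)
args = parser.parse_args()
print(f"Running in {args.mode} mode")
```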
Examples
```bash
# Process all data (default)
uv run python scripts/data_2_dasch.py

# Process sample data with explicit mode
uv run python scripts/data_2_dasch.py --mode sample_data

# Process test data
uv run python scripts/data_2_dasch.py -m test_data

# Show help
uv run python scripts/data_2_dasch.py --help
```
Monitoring Migration Progress
Real-time Monitoring
The script provides real-time progress information through console output:
```bash
# Run with output to terminal and file
uv run python scripts/data_2_dasch.py 2>&1 | tee migration.log
```
Log Files
Monitor detailed progress in log files:
```bash
# Watch log file in real-time
tail -f data_2_dasch.log

# Search for errors
grep "ERROR" data_2_dasch.log

# Count successful creations
grep "Resource created successfully" data_2_dasch.log | wc -l
```
Progress Indicators
Monitor these key indicators:
Successful Operations
- `Resource created successfully` and `Resource updated successfully` log lines
- `Uploaded media: <filename>` entries
Warning Indicators
- `WARNING` lines such as `Media file not found` (manual review may be needed)
Error Indicators
- Lines containing `ERROR` or `FAILED`
Performance Metrics
Track key metrics from the log file:
```bash
# Count total items processed
grep "Processing item:" data_2_dasch.log | wc -l

# Average processing time per item
grep "Processing time:" data_2_dasch.log | awk '{sum+=$4; count++} END {print sum/count}'

# Failed operations
grep "FAILED" data_2_dasch.log | wc -l
```
Data Synchronization
The system automatically handles synchronization between Omeka and DSP:
Initial Migration
For new resources:
1. System checks if resource exists in DSP
2. If not found, creates new resource
3. Uploads and links associated media files
4. Logs creation success
Incremental Updates
For existing resources:
1. System retrieves current DSP data
2. Compares it with current Omeka data
3. Identifies differences (additions, deletions, changes)
4. Applies only necessary updates
5. Logs synchronization results
Synchronization Process Flow

```mermaid
flowchart TD
    A[Process Omeka Item] --> B[Check DSP Resource Exists]
    B -->|Not Found| C[Create New Resource]
    B -->|Found| D[Compare Data]
    C --> E[Upload Media Files]
    D --> F{Has Changes?}
    F -->|Yes| G[Update Resource]
    F -->|No| H[Skip - No Changes]
    G --> I[Apply Updates]
    I --> E
    E --> J[Link Media to Resource]
    H --> K[Process Next Item]
    J --> K
    style C fill:#e3f2fd
    style G fill:#fff8e1
    style H fill:#e8f5e8
```
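In code terms, the flow above boils down to a check-then-create-or-update step per item. The sketch below is illustrative only: `dsp_client` and its methods are placeholders, not the actual helpers in `data_2_dasch.py`:

```python
def diff_values(existing: dict, incoming: dict) -> dict:
    """Return the properties whose Omeka value differs from the DSP value."""
    return {
        key: value
        for key, value in incoming.items()
        if existing.get(key) != value
    }

def sync_item(item: dict, dsp_client) -> None:
    """Create the resource if missing, otherwise apply only the changes."""
    existing = dsp_client.find_resource(item["identifier"])  # placeholder call
    if existing is None:
        dsp_client.create_resource(item)  # new resource, then media upload
    elif changes := diff_values(existing, item):
        dsp_client.apply_updates(existing, changes)
    # empty diff: skip, nothing to do for this item
```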
Conflict Resolution
The system handles common conflicts:
Value Changes
- Omeka has new value, DSP has old: Updates DSP value
- Omeka removes value, DSP has value: Deletes DSP value
- Both systems have different values: Omeka value takes precedence
Array Properties
- Omeka adds item: Creates new value in DSP
- Omeka removes item: Deletes value from DSP
- Complex changes: Calculates minimal set of operations (see the sketch below)
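The "minimal set of operations" for an array property is essentially a two-way set difference. A minimal sketch, assuming values can be compared for equality:

```python
def plan_array_ops(omeka_values: list, dsp_values: list) -> tuple:
    """Compute which values to create in and delete from DSP."""
    omeka_set, dsp_set = set(omeka_values), set(dsp_values)
    to_create = sorted(omeka_set - dsp_set)  # in Omeka, missing in DSP
    to_delete = sorted(dsp_set - omeka_set)  # in DSP, removed in Omeka
    return to_create, to_delete

# One addition, one deletion; unchanged values are left untouched.
print(plan_array_ops(["a", "b", "c"], ["b", "c", "d"]))  # (['a'], ['d'])
```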
Media Files
- File changed in Omeka: Re-uploads file to DSP
- File removed from Omeka: Logs warning (manual review needed)
- File corrupted: Skips file, logs error
File Handling
Media Processing Workflow
```mermaid
sequenceDiagram
    participant S as Script
    participant O as Omeka
    participant T as Temp Storage
    participant D as DSP
    S->>O: Request media file
    O->>S: Return file URL
    S->>O: Download file stream
    O->>T: Stream file data
    T->>S: Temporary file created
    S->>S: Determine file type
    S->>S: Check if compression needed
    alt File > 10MB
        S->>T: Create ZIP archive
        T->>S: Compressed file ready
    end
    S->>D: Upload file (multipart)
    D->>S: Return internal filename
    S->>S: Update resource with filename
    S->>T: Clean up temporary files
```
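The compression branch in the diagram can be expressed in a few lines: files above the threshold are zipped before upload. A sketch assuming the 10 MB threshold shown above; the upload step itself is omitted:

```python
import os
import zipfile

COMPRESSION_THRESHOLD = 10 * 1024 * 1024  # 10MB, as in the diagram above

def prepare_for_upload(path: str) -> str:
    """Zip files above the threshold; return the path that should be uploaded."""
    if os.path.getsize(path) <= COMPRESSION_THRESHOLD:
        return path
    archive_path = path + ".zip"
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(path, arcname=os.path.basename(path))
    return archive_path
```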
Supported File Types
| Category | MIME Types | DSP Class |
|---|---|---|
| Images | `image/jpeg`, `image/png`, `image/gif`, `image/tiff` | `sgb_MEDIA_IMAGE` |
| Documents | `application/pdf`, `text/plain`, `text/html` | `sgb_MEDIA_ARCHIV` |
| Archives | `application/zip`, `application/x-tar` | `sgb_MEDIA_ARCHIV` |
| Other | All other types | `sgb_MEDIA_ARCHIV` (default) |
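Expressed as code, the table is a lookup with a fallback. A minimal sketch using the class names from the table (the dispatch function itself is illustrative):

```python
# Mapping derived from the table above; anything else gets the default class.
MIME_TO_DSP_CLASS = {
    "image/jpeg": "sgb_MEDIA_IMAGE",
    "image/png": "sgb_MEDIA_IMAGE",
    "image/gif": "sgb_MEDIA_IMAGE",
    "image/tiff": "sgb_MEDIA_IMAGE",
    "application/pdf": "sgb_MEDIA_ARCHIV",
    "text/plain": "sgb_MEDIA_ARCHIV",
    "text/html": "sgb_MEDIA_ARCHIV",
    "application/zip": "sgb_MEDIA_ARCHIV",
    "application/x-tar": "sgb_MEDIA_ARCHIV",
}

def dsp_class_for(mime_type: str) -> str:
    return MIME_TO_DSP_CLASS.get(mime_type, "sgb_MEDIA_ARCHIV")

print(dsp_class_for("image/png"))  # sgb_MEDIA_IMAGE
print(dsp_class_for("video/mp4"))  # sgb_MEDIA_ARCHIV (default)
```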
File Processing Options
Configure file handling behavior:
```python
# Edit in data_2_dasch.py
FILE_PROCESSING_CONFIG = {
    'max_file_size': 100 * 1024 * 1024,         # 100MB
    'compression_threshold': 10 * 1024 * 1024,  # 10MB
    'supported_formats': [
        'image/jpeg', 'image/png', 'application/pdf'
    ],
    'skip_large_files': False,   # Skip files exceeding max size
    'create_thumbnails': False   # Generate thumbnails (if supported)
}
```
Troubleshooting Common Issues
Refer to the comprehensive Troubleshooting Guide as well.
Authentication Problems
Issue: Login failed or token expired
```bash
# Test authentication
uv run python -c "
from scripts.data_2_dasch import login
import os
try:
    token = login(os.getenv('DSP_USER'), os.getenv('DSP_PWD'))
    print('Authentication successful')
except Exception as e:
    print(f'Authentication failed: {e}')
"
```
Solutions:
- Verify username and password
- Check if account is locked
- Ensure API endpoints are correct
Network Issues
Issue: Connection timeouts or failures
```bash
# Test network connectivity
curl -v https://api.dasch.swiss/health
ping api.dasch.swiss
```
Solutions:
- Check internet connection
- Verify firewall settings
- Consider proxy configuration
Data Validation Errors
Issue: Invalid payload or data format errors
```bash
# Check specific item data
uv run python -c "
import json
from scripts.process_data_from_omeka import get_items_from_collection
items = get_items_from_collection('$ITEM_SET_ID')
problem_item = next(item for item in items if item['o:id'] == 'PROBLEM_ID')
print(json.dumps(problem_item, indent=2))
"
```
Solutions:
- Verify data model compatibility
- Check required field presence
- Validate list value mappings
File Upload Failures
Issue: Media files fail to upload
```bash
# Check file accessibility
curl -I "https://omeka.unibe.ch/files/original/98d8559515187ec4a710347c7b9e6cda0bdd58d2.tif"
```
Solutions:
- Verify file exists and is accessible
- Check file size limits
- Ensure stable network connection
- Retry the upload for temporary failures (a retry sketch follows this list)
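For transient failures, a retry loop with exponential backoff usually suffices. A minimal sketch; `upload` stands in for whatever upload function you use, and the exception type should be narrowed to your HTTP client's errors:

```python
import time

def upload_with_retry(upload, path: str, attempts: int = 3):
    """Retry a flaky upload with exponential backoff (1s, 2s, 4s, ...)."""
    for attempt in range(attempts):
        try:
            return upload(path)
        except Exception as exc:  # narrow to your client's error type
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            wait = 2 ** attempt
            print(f"Upload failed ({exc}); retrying in {wait}s")
            time.sleep(wait)
```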
Best Practices
Pre-Migration Planning
Data Assessment
```bash
# Count items in collection
uv run python -c "
from scripts.process_data_from_omeka import get_items_from_collection
items = get_items_from_collection('$ITEM_SET_ID')
print(f'Total items: {len(items)}')

# Estimate processing time (1-2 items per second)
estimated_minutes = len(items) / 60
print(f'Estimated time: {estimated_minutes:.1f} minutes')
"
```
Test Run
```bash
# Always start with sample data
uv run python scripts/data_2_dasch.py -m sample_data
```
Backup Strategy
```bash
# Create backup of current DSP state
# (Use DSP export tools if available)
```
During Migration
Monitor Progress
```bash
# Run in screen/tmux for long migrations
screen -S migration
uv run python scripts/data_2_dasch.py 2>&1 | tee migration_$(date +%Y%m%d_%H%M%S).log
```
Resource Management
```bash
# Monitor system resources
htop
df -h  # Check disk space
```
Error Handling
```bash
# Monitor for errors in real-time
tail -f data_2_dasch.log | grep -E "(ERROR|FAILED)"
```
Post-Migration
Validation
```bash
# Count migrated resources
grep "Resource created successfully" data_2_dasch.log | wc -l
grep "Resource updated successfully" data_2_dasch.log | wc -l
```
Quality Assurance
```bash
# Check for missing media
grep "Media file not found" data_2_dasch.log

# Check for validation errors
grep "Validation failed" data_2_dasch.log
```
Documentation
```bash
# Create migration report
echo "Migration completed on $(date)" > migration_report.txt
echo "Items processed: $(grep 'Processing item:' data_2_dasch.log | wc -l)" >> migration_report.txt
echo "Resources created: $(grep 'Resource created successfully' data_2_dasch.log | wc -l)" >> migration_report.txt
echo "Resources updated: $(grep 'Resource updated successfully' data_2_dasch.log | wc -l)" >> migration_report.txt
echo "Errors encountered: $(grep 'ERROR' data_2_dasch.log | wc -l)" >> migration_report.txt
```
Performance Optimization
- Batch Processing
  - Process large collections in smaller batches
  - Use `sample_data` mode with increasing sample sizes
- Network Optimization
  - Ensure a stable, high-speed internet connection
  - Consider running on a server closer to the DSP infrastructure
- Resource Management
  - Monitor memory usage for large files
  - Clean up temporary files regularly
  - Use SSD storage for better I/O performance
Error Recovery
Resuming Interrupted Migrations
```bash
# The system automatically skips existing resources
# Simply re-run the same command
uv run python scripts/data_2_dasch.py -m all_data
```
Selective Re-processing
```bash
# Process specific problematic items:
# edit TEST_DATA with the failed identifiers, then run
uv run python scripts/data_2_dasch.py -m test_data
```
Clean Up and Retry
# Clean temporary files rm -rf /tmp/omeka2dsp_* # Clear logs for fresh start rm data_2_dasch.log # Retry migration uv run python scripts/data_2_dasch.py