graph TD
A[data_2_dasch.py<br/>Main Migration Script] --> B[process_data_from_omeka.py<br/>Omeka Data Extraction]
A --> C[api_get_project.py<br/>DSP Project Info]
A --> D[api_get_lists.py<br/>DSP Lists]
A --> E[api_get_lists_detailed.py<br/>Detailed List Data]
click A href "#data_2_dasch.py" "Jump to data_2_dasch.py"
click B href "#process_data_from_omeka.py" "Jump to process_data_from_omeka.py"
click C href "#api_get_project.py" "Jump to api_get_project.py"
click D href "#api_get_lists.py" "Jump to api_get_lists.py"
click E href "#api_get_lists_detailed.py" "Jump to api_get_lists_detailed.py"
style A fill:#86bbd8
style B fill:#ffe880
style C fill:#3a1e3e,color:#fff
style D fill:#3a1e3e,color:#fff
style E fill:#3a1e3e,color:#fff
API Reference Documentation
This document provides comprehensive documentation for all Python modules and functions in the omeka2dsp system.
Module Overview
The omeka2dsp system consists of five main Python modules:
data_2_dasch.py
The main migration script that orchestrates the entire data transfer process from Omeka to DSP.
Core Functions
main() -> None
Purpose: Entry point that orchestrates the entire migration process.
Workflow:
- Parse command-line arguments for processing mode
- Fetch and filter data based on mode (all_data, sample_data, test_data)
- Authenticate with DSP and retrieve project information
- Process each item: create new or synchronize existing resources
- Handle associated media files
Parameters: None (uses command-line arguments)
Returns: None
Example Usage:
uv run python scripts/data_2_dasch.py -m sample_data
parse_arguments() -> Namespace
Purpose: Parses command-line arguments for processing mode selection.
Parameters: None (reads from sys.argv)
Returns:
Namespace: Contains parsed arguments with a mode attribute
Available Modes:
all_data: Process entire collection
sample_data: Process random sample (configurable size)
test_data: Process predefined test dataset
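A minimal sketch of how such a mode flag could be parsed with argparse. The -m short flag comes from the usage line above; the long form --mode, the default value, and the optional argv parameter are assumptions made for illustration and may differ from the actual script.

```python
import argparse

MODES = ("all_data", "sample_data", "test_data")


def parse_arguments(argv=None):
    """Parse the processing mode; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Migrate Omeka items to DSP")
    parser.add_argument(
        "-m", "--mode",
        choices=MODES,
        default="all_data",  # assumed default, not confirmed by the source
        help="which subset of the collection to process",
    )
    return parser.parse_args(argv)
```

Passing an explicit list such as parse_arguments(["-m", "test_data"]) makes the function easy to exercise without touching the real command line.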
Example:
args = parse_arguments()
print(args.mode)  # 'sample_data'
Authentication & Project Functions
login(email: str, password: str) -> str
Purpose: Authenticates with DSP API and retrieves JWT token.
Parameters:
email: DSP user email address
password: DSP user password
Returns:
str: JWT authentication token
Raises:
requests.RequestException: On authentication failure
KeyError: If response format is unexpected
Example:
token = login("user@example.com", "password")
get_project() -> str
Purpose: Retrieves project information from DSP API using project shortcode.
Parameters: None (uses PROJECT_SHORT_CODE environment variable)
Returns:
str: Project IRI/identifier
Side Effects: Logs project information
Example:
project_iri = get_project()
# Returns: "http://rdfh.ch/projects/IbwoJlv8SEa6L13vXyCzMg"
get_lists(project_iri: str) -> list
Purpose: Retrieves all list configurations for a DSP project.
Parameters:
project_iri: The project IRI to fetch lists for
Returns:
list: Array of complete list objects with nodes and values
Process:
- Fetches list summary from /admin/lists/
- For each list, fetches detailed information from /v2/lists/{id}
- Returns complete list data for mapping operations
Example:
lists = get_lists(project_iri)
for list_obj in lists:
    print(f"List: {list_obj['listinfo']['name']}")
Resource Management Functions
get_full_resource(token: str, resource_iri: str) -> dict
Purpose: Retrieves complete resource data from DSP API.
Parameters:
token: JWT authentication token
resource_iri: URL-encoded resource IRI
Returns:
dict: Complete resource JSON object
Usage: Used for synchronization to compare existing DSP data with Omeka data.
Example:
resource_data = get_full_resource(token, urllib.parse.quote(resource_iri, safe=''))
get_resource_by_id(token: str, object_class: str, identifier: str) -> dict
Purpose: Finds a resource by its identifier using SPARQL query.
Parameters:
token: JWT authentication token
object_class: DSP resource class (e.g., “SGB:Parent”)
identifier: Unique identifier to search for
Returns:
dict: Resource data if found, empty dict if not found
SPARQL Query: Constructs and executes a SPARQL query to find resources by identifier.
Example:
resource = get_resource_by_id(token, f"{PREFIX}Parent", "abb13025")
if resource:
    print(f"Found resource: {resource['@id']}")
create_resource(payload: dict, token: str) -> None
Purpose: Creates a new resource in DSP using the provided payload.
Parameters:
payload: Complete DSP resource payload (JSON-LD format)
token: JWT authentication token
Returns: None
Side Effects:
- Creates resource in DSP
- Logs creation success/failure
Example:
payload = construct_payload(omeka_item, f"{PREFIX}Parent", project_iri, lists, "", "")
create_resource(payload, token)
Data Extraction & Transformation Functions
extract_dasch_propvalue(item: dict, prop: str) -> str
Purpose: Extracts a single property value from a DSP resource.
Parameters:
item: DSP resource data
prop: Property name (without prefix)
Returns:
str: Property value or empty string if not found
Supported Value Types: TextValue, ListValue, LinkValue, UriValue
Example:
title = extract_dasch_propvalue(dsp_resource, "title")
extract_dasch_propvalue_multiple(item: dict, prop: str) -> list
Purpose: Extracts multiple values for a property from a DSP resource.
Parameters:
item: DSP resource data
prop: Property name (without prefix)
Returns:
list: Array of property values
Usage: For properties that can have multiple values (arrays).
Example:
subjects = extract_dasch_propvalue_multiple(dsp_resource, "subject")
extract_value_from_entry(entry: dict) -> str
Purpose: Extracts the actual value from a DSP property entry based on its type.
Parameters:
entry: DSP property entry with @type and value fields
Returns:
str: Extracted value or None
Supported Types:
knora-api:TextValue: Returns knora-api:valueAsString
knora-api:ListValue: Returns node IRI from knora-api:listValueAsListNode
knora-api:LinkValue: Returns target IRI from knora-api:linkValueHasTargetIri
knora-api:UriValue: Returns URI from knora-api:uriValueAsUri
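The type dispatch listed above could look roughly like the following. This is an illustrative sketch, not the script's actual code: the nested shapes used for list, link, and URI values (an object with @id or @value) are assumptions about the JSON-LD response and should be checked against real DSP output.

```python
def extract_value_from_entry(entry):
    """Return the plain value of a DSP property entry, or None if unsupported."""
    value_type = entry.get("@type")
    if value_type == "knora-api:TextValue":
        return entry.get("knora-api:valueAsString")
    if value_type == "knora-api:ListValue":
        # Assumed shape: {"knora-api:listValueAsListNode": {"@id": "<node IRI>"}}
        return entry.get("knora-api:listValueAsListNode", {}).get("@id")
    if value_type == "knora-api:LinkValue":
        # Assumed shape: {"knora-api:linkValueHasTargetIri": {"@id": "<target IRI>"}}
        return entry.get("knora-api:linkValueHasTargetIri", {}).get("@id")
    if value_type == "knora-api:UriValue":
        # Assumed shape: {"knora-api:uriValueAsUri": {"@value": "<URI>"}}
        return entry.get("knora-api:uriValueAsUri", {}).get("@value")
    return None  # unknown or missing @type
```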
Example:
value = extract_value_from_entry({
    "@type": "knora-api:TextValue",
    "knora-api:valueAsString": "Example text"
})
# Returns: "Example text"
construct_payload(item: dict, type: str, project_iri: str, lists: list, parent_iri: str, internalMediaFilename: str) -> dict
Purpose: Converts Omeka item data into DSP-compatible JSON-LD payload.
Parameters:
item: Omeka item data
type: DSP resource type (e.g., “SGB:Parent”, “SGB:Image”, “SGB:Document”)
project_iri: Project IRI for resource association
lists: DSP lists for value mapping
parent_iri: Parent resource IRI for linking
internalMediaFilename: Internal filename for media resources
Returns:
dict: Complete DSP resource payload in JSON-LD format
Key Transformations:
| Omeka Property | DSP Property | Value Type | Notes |
|---|---|---|---|
| dcterms:title | rdfs:label | String | Resource label |
| dcterms:identifier | hasIdentifier | TextValue | Unique identifier |
| dcterms:description | hasDescription | TextValue | Item description |
| dcterms:creator | hasCreator | TextValue Array | Multi-valued creator entries |
| dcterms:date | hasDate | TextValue | EDTF date string |
| dcterms:subject | hasSubjectList | ListValue Array | Iconclass subject list |
| dcterms:type | hasTypeList | ListValue | DCMI Type vocabulary |
| dcterms:format | hasFormatList | ListValue | Internet media type list |
| dcterms:language | hasLanguageList | ListValue | ISO 639-1 codes |
| dcterms:source | hasSource | TextValue Array | Provenance/source notes |
| dcterms:relation | hasRelation | TextValue Array | Related resources |
| dcterms:rights | hasRights | TextValue | Rights statement |
| dcterms:license | hasLicenseList | ListValue | Controlled license list (CC/Rights) |
Example:
payload = construct_payload(
    item=omeka_item,
    type=f"{PREFIX}Parent",
    project_iri=project_iri,
    lists=project_lists,
    parent_iri="",
    internalMediaFilename=""
)
extract_listvalueiri_from_value(value: str, list_label: str, lists: list) -> str
Purpose: Maps an Omeka value to a DSP list node IRI.
Parameters:
value: Value to map (e.g., “image/jpeg”)
list_label: Name of the DSP list to search in
lists: Array of DSP list objects
Returns:
str: DSP list node IRI if found, empty string otherwise
Process:
- Finds the list with matching label
- Searches through list nodes for matching value
- Returns the node IRI for API operations
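The lookup steps above can be sketched as a label match followed by a recursive walk over the list nodes. The dictionary layout used here (listinfo, labels, children, id) is an assumption about the list objects and the function name find_list_node_iri is hypothetical; adjust the keys to whatever the API actually returns.

```python
def find_list_node_iri(value, list_label, lists):
    """Return the IRI of the node labelled `value` inside the list labelled `list_label`."""

    def labels_of(obj):
        # Assumed shape: {"labels": [{"value": "..."}, ...]}
        return {label.get("value") for label in obj.get("labels", [])}

    def search(nodes):
        # Depth-first walk, because list nodes may be nested.
        for node in nodes:
            if value in labels_of(node):
                return node.get("id", "")
            found = search(node.get("children", []))
            if found:
                return found
        return ""

    for list_obj in lists:
        if list_label in labels_of(list_obj.get("listinfo", {})):
            return search(list_obj.get("children", []))
    return ""  # no list with that label
```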
Example:
format_iri = extract_listvalueiri_from_value(
    "image/jpeg",
    "Internet Media Type",
    project_lists
)
# Returns: "http://rdfh.ch/lists/IbwoJlv8SEa6L13vXyCzMg/image-jpeg"
Synchronization Functions
check_values(dasch_item: dict, omeka_item: dict, lists: list) -> list
Purpose: Compares DSP and Omeka data to identify changes that need synchronization.
Parameters:
dasch_item: Current DSP resource data
omeka_item: Current Omeka item data
lists: DSP lists for value mapping
Returns:
list: Array of change operations (create, update, delete)
Change Detection:
- Compares each property between systems
- Identifies additions, deletions, and modifications
- Handles both single values and arrays
Example:
changes = check_values(dsp_resource, omeka_item, project_lists)
for change in changes:
    print(f"Action: {change['type']}, Field: {change['field']}")
sync_value(prop: str, prop_type: str, dasch_value: str, omeka_value: str) -> list
Purpose: Generates sync operations for single-value properties.
Parameters:
prop: Property name
prop_type: DSP property type (TextValue, ListValue, etc.)
dasch_value: Current value in DSP
omeka_value: Current value in Omeka
Returns:
list: Array of change operations
Logic:
- If values are different, creates update operation
- If DSP has value but Omeka doesn’t, creates delete operation
- If Omeka has value but DSP doesn’t, creates create operation
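The three rules above could be expressed as follows. This is a sketch under assumptions: the type and field keys mirror the check_values example, while the field_type and value keys, and the treatment of None as an empty value, are illustrative choices rather than the script's confirmed format.

```python
def sync_value(prop, prop_type, dasch_value, omeka_value):
    """Return the change operations needed to make DSP match Omeka for one value."""
    dasch_value = dasch_value or ""   # treat None like "no value"
    omeka_value = omeka_value or ""
    if dasch_value == omeka_value:
        return []  # already in sync, including the case where both are empty
    if not dasch_value:
        change_type, value = "create", omeka_value   # only Omeka has a value
    elif not omeka_value:
        change_type, value = "delete", dasch_value   # only DSP has a value
    else:
        change_type, value = "update", omeka_value   # both present but different
    return [{"type": change_type, "field": prop, "field_type": prop_type, "value": value}]
```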
sync_array_value(prop: str, prop_type: str, dasch_array: list, omeka_array: list) -> list
Purpose: Generates sync operations for multi-value properties.
Parameters:
prop: Property name
prop_type: DSP property type
dasch_array: Current values in DSP
omeka_array: Current values in Omeka
Returns:
list: Array of change operations
Algorithm:
- Converts arrays to sets for comparison
- Calculates additions (in Omeka but not DSP)
- Calculates deletions (in DSP but not Omeka)
- Generates corresponding create/delete operations
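A minimal sketch of the set-based algorithm above. The operation dictionaries reuse the type and field keys from the check_values example; the field_type and value keys are assumed, and the results are sorted only to keep the output deterministic.

```python
def sync_array_value(prop, prop_type, dasch_array, omeka_array):
    """Return create/delete operations that turn the DSP values into the Omeka values."""
    dasch_set, omeka_set = set(dasch_array), set(omeka_array)
    changes = []
    # Values present in Omeka but missing in DSP must be created.
    for value in sorted(omeka_set - dasch_set):
        changes.append({"type": "create", "field": prop, "field_type": prop_type, "value": value})
    # Values present in DSP but no longer in Omeka must be deleted.
    for value in sorted(dasch_set - omeka_set):
        changes.append({"type": "delete", "field": prop, "field_type": prop_type, "value": value})
    return changes
```

Because sets are used, duplicate values within one array collapse to a single entry, which matches the set conversion described in the algorithm.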
update_value(token: str, item: dict, value: str, field: str, field_type: str, type_of_change: str) -> None
Purpose: Executes a single value update operation via DSP API.
Parameters:
token: JWT authentication token
item: DSP resource data
value: New value to set
field: Property name
field_type: DSP value type
type_of_change: Operation type (“create”, “update”, “delete”)
Returns: None
Side Effects:
- Modifies DSP resource via API
- Logs operation results
File Upload Functions
upload_file_from_url(file_url: str, token: str, zip: bool = False) -> str
Purpose: Downloads a file from Omeka and uploads it to DSP storage.
Parameters:
file_url: URL of file in Omeka
token: JWT authentication token
zip: Whether to compress file before upload
Returns:
str: Internal filename assigned by DSP
Process:
- Downloads file from Omeka URL
- Saves to temporary file
- Optionally creates ZIP archive
- Uploads to DSP via multipart form
- Returns DSP internal filename
Example:
internal_filename = upload_file_from_url(
    "https://omeka.example.com/files/image.jpg",
    token,
    zip=False
)
specify_mediaclass(media_type: str) -> str
Purpose: Determines appropriate DSP media class based on MIME type.
Parameters:
media_type: MIME type string (e.g., “image/jpeg”)
Returns:
str: DSP media class name
Mapping:
image/* → SGB:Image
application/pdf, text/*, archives → SGB:Document
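One way to express this mapping is to look only at the main MIME type. The sketch below assumes that PREFIX resolves to "SGB:" and that every non-image type falls back to the Document class; the source lists only PDF, text, and archive types for that class, so the fallback is an assumption.

```python
PREFIX = "SGB:"  # assumed value; the examples above use f"{PREFIX}Parent"


def specify_mediaclass(media_type):
    """Map a MIME type string to a DSP media class name."""
    main_type = (media_type or "").split("/", 1)[0].strip().lower()
    if main_type == "image":
        return f"{PREFIX}Image"
    # Assumption: PDFs, text files, archives, and anything else become documents.
    return f"{PREFIX}Document"
```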
Example:
media_class = specify_mediaclass("image/jpeg")
# Returns: "SGB:Image"
Utility Functions
arrays_equal(array1: list, array2: list) -> bool
Purpose: Compares two arrays for equality, ignoring order.
Parameters:
array1: First array to compare
array2: Second array to compare
Returns:
bool: True if arrays contain the same elements
Usage: Used in synchronization to detect array changes.
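An order-insensitive comparison can be sketched with collections.Counter. Whether the real function counts duplicates is not stated in the source; this version assumes it does and requires hashable elements such as strings.

```python
from collections import Counter


def arrays_equal(array1, array2):
    """Return True if both arrays hold the same elements, ignoring order."""
    # Counter compares element multiplicities, so ["a"] != ["a", "a"].
    return Counter(array1) == Counter(array2)
```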
process_data_from_omeka.py
Handles data extraction and processing from the Omeka API.
Core Functions
get_items_from_collection(collection_id: str) -> list
Purpose: Retrieves all items from a specified Omeka collection with pagination handling.
Parameters:
collection_id: Omeka collection/item set ID
Returns:
list: Array of all items in the collection
Features:
- Automatic pagination handling
- Rate limiting compliance
- Error recovery for temporary failures
Example:
items = get_items_from_collection("10780")
print(f"Found {len(items)} items")
get_media(item_id: str) -> list
Purpose: Retrieves all media files associated with a specific Omeka item.
Parameters:
item_id: Omeka item ID
Returns:
list: Array of media objects with metadata and file URLs
Example:
media_files = get_media("12345")
for media in media_files:
    print(f"Media: {media.get('o:filename')}")
get_paginated_items(url: str, params: dict) -> list
Purpose: Generic function to handle paginated API requests.
Parameters:
url: Base API endpoint URL
params: Query parameters for first request
Returns:
list: Combined results from all pages
Features:
- Follows pagination links automatically
- Handles rate limiting
- Error recovery
Data Extraction Functions
extract_property(props: list, prop_id: int, as_uri: bool = False, only_label: bool = False) -> str
Purpose: Extracts a specific property value from Omeka property array.
Parameters:
props: Array of Omeka property objects
prop_id: Numerical ID of property to extract
as_uri: Return as formatted URI link (default: False)
only_label: Return only the label (default: False)
Returns:
str: Property value in requested format
Formats:
- Default: Returns @value field
- as_uri=True: Returns [label](uri) markdown format
- only_label=True: Returns o:label field only
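The three output formats could be produced as sketched below. The property_id and @id keys are assumptions about the shape of the Omeka property objects, and returning an empty string when nothing matches is an illustrative choice, not confirmed behavior.

```python
def extract_property(props, prop_id, as_uri=False, only_label=False):
    """Return the first property with the given ID in the requested format."""
    for prop in props:
        if prop.get("property_id") != prop_id:  # assumed key name
            continue
        if as_uri:
            # Markdown link built from the label and the URI (assumed "@id" key).
            return f"[{prop.get('o:label', '')}]({prop.get('@id', '')})"
        if only_label:
            return prop.get("o:label", "")
        return prop.get("@value", "")
    return ""  # no property with that ID
```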
Example:
title = extract_property(item.get("dcterms:title", []), 1)
creator_link = extract_property(item.get("dcterms:creator", []), 2, as_uri=True)
extract_combined_values(props: list) -> list
Purpose: Combines text values and URI references from properties into a single array.
Parameters:
props: Array of Omeka property objects
Returns:
list: Combined array of text values and formatted URI links
Process:
- Extracts all @value text fields
- Formats URI references as HTML links
- Escapes semicolons to prevent conflicts
- Returns combined array
Example:
subjects = extract_combined_values(item.get("dcterms:subject", []))
# Returns: ["History", "Basel", "<a href='...'>Authority Record</a>"]
Utility Functions
is_valid_url(url: str) -> bool
Purpose: Validates if a string is a properly formatted URL.
Parameters:
url: URL string to validate
Returns:
bool: True if URL is valid
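A URL check of this kind can be sketched with urllib.parse. Restricting valid URLs to the http and https schemes is an assumption; the source does not say which schemes the real function accepts.

```python
from urllib.parse import urlparse


def is_valid_url(url):
    """Return True if url parses as an http(s) URL with a host part."""
    try:
        parsed = urlparse(url)
    except (ValueError, AttributeError, TypeError):
        return False  # malformed input or a non-string argument
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```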
Example:
valid = is_valid_url("https://example.com/file.jpg")
download_file(url: str, dest_path: str) -> None
Purpose: Downloads a file from URL to local path.
Parameters:
url: Source file URL
dest_path: Destination file path
Returns: None
Features:
- Creates directories as needed
- Streaming download for large files
- Error handling and logging
api_get_project.py
Standalone script to fetch DSP project information.
get_project() -> None
Purpose: Fetches project data from DSP API and saves to file.
Environment Variables:
PROJECT_SHORT_CODE: DSP project shortcode
API_HOST: DSP API base URL
Output: Saves project data to ../data/project_data.json
Example Usage:
export PROJECT_SHORT_CODE="0123"
export API_HOST="https://api.dasch.swiss"
uv run python scripts/api_get_project.py
api_get_lists.py
Standalone script to fetch DSP list configurations.
get_lists() -> None
Purpose: Fetches list summary from DSP API and saves to file.
Configuration:
- Hardcoded project IRI (should be updated for different projects)
- Fixed API host (should be made configurable)
Output: Saves list data to ../data/data_lists.json
Example Usage:
uv run python scripts/api_get_lists.py
api_get_lists_detailed.py
Standalone script to fetch detailed DSP list information.
get_complete_list(list_id: str) -> dict
Purpose: Fetches complete list data for a single list ID.
Parameters:
list_id: DSP list IRI
Returns:
dict: Complete list object with all nodes and values
Process:
- URL-encodes the list IRI
- Requests detailed list data from /v2/lists/{id}
- Returns complete list structure
Main Script Logic:
- Loads list summary from data_lists.json
- Iterates through each list
- Fetches detailed information for each
- Saves all detailed lists to data_lists_detail.json
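The URL encoding step and the iteration over the summary can be sketched as below. The helper names build_list_url and collect_detailed_lists, the fetch_json callable, and the assumed summary shape ({"lists": [{"id": ...}]}) are illustrative; the actual script may structure this differently.

```python
import urllib.parse


def build_list_url(api_host, list_id):
    """Return the /v2/lists/{id} URL with the list IRI fully percent-encoded."""
    encoded = urllib.parse.quote(list_id, safe="")  # safe="" also encodes "/" and ":"
    return f"{api_host.rstrip('/')}/v2/lists/{encoded}"


def collect_detailed_lists(summary, api_host, fetch_json):
    """Fetch the detailed object for every list in the summary."""
    # Assumed summary shape: {"lists": [{"id": "<list IRI>"}, ...]}
    return [fetch_json(build_list_url(api_host, entry["id"])) for entry in summary["lists"]]
```

Passing the HTTP call in as fetch_json keeps the iteration logic independent of the client library and easy to test offline.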
Configuration Constants
Environment Variables
| Variable | Type | Required | Description | Default |
|---|---|---|---|---|
| ITEM_SET_ID | string | No | Omeka collection ID | “10780” |
| PROJECT_SHORT_CODE | string | Yes | DSP project shortcode | None |
| API_HOST | string | Yes | DSP API base URL | None |
| INGEST_HOST | string | Yes | DSP ingest service URL | None |
| DSP_USER | string | Yes | DSP username | None |
| DSP_PWD | string | Yes | DSP password | None |
| ONTOLOGY_NAME | string | No | Ontology name | “SGB” |
| OMEKA_API_URL | string | No | Omeka API base URL | “https://omeka.unibe.ch/api/” |
| KEY_IDENTITY | string | Yes | Omeka API key identity | None |
| KEY_CREDENTIAL | string | Yes | Omeka API key credential | None |
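A small loader can enforce the required variables and apply the defaults from the table. The load_config function and its environ parameter are hypothetical; the scripts may read os.environ directly instead.

```python
import os

REQUIRED = (
    "PROJECT_SHORT_CODE", "API_HOST", "INGEST_HOST",
    "DSP_USER", "DSP_PWD", "KEY_IDENTITY", "KEY_CREDENTIAL",
)
DEFAULTS = {
    "ITEM_SET_ID": "10780",
    "ONTOLOGY_NAME": "SGB",
    "OMEKA_API_URL": "https://omeka.unibe.ch/api/",
}


def load_config(environ=None):
    """Return all settings, failing fast if a required variable is missing."""
    environ = os.environ if environ is None else environ
    missing = [name for name in REQUIRED if not environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    config = {name: environ[name] for name in REQUIRED}
    # Optional variables fall back to the documented defaults.
    config.update({name: environ.get(name, default) for name, default in DEFAULTS.items()})
    return config
```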
Processing Constants
| Constant | Value | Description |
|---|---|---|
| NUMBER_RANDOM_OBJECTS | 2 | Number of items for sample mode |
| TEST_DATA | Set of identifiers | Specific items for test mode |
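How these constants might drive item selection in each mode is sketched below. The select_items function and the identifier_of callable are hypothetical, and the single TEST_DATA entry is a placeholder borrowed from an identifier used in an earlier example.

```python
import random

NUMBER_RANDOM_OBJECTS = 2
TEST_DATA = {"abb13025"}  # placeholder; the real set of identifiers is not shown here


def select_items(items, mode, identifier_of):
    """Filter the fetched items according to the processing mode."""
    if mode == "sample_data":
        # Never ask for more items than are available.
        return random.sample(items, min(NUMBER_RANDOM_OBJECTS, len(items)))
    if mode == "test_data":
        return [item for item in items if identifier_of(item) in TEST_DATA]
    return list(items)  # all_data
```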
Error Handling
Exception Types
The system handles several types of errors:
- Authentication Errors: Invalid credentials, expired tokens
- Network Errors: Connection timeouts, API unavailability
- Data Validation Errors: Invalid payloads, missing required fields
- Rate Limiting: API quota exceeded
- File System Errors: Permission issues, disk space
Logging Configuration
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[
        logging.StreamHandler(),  # Console output
        logging.FileHandler("data_2_dasch.log", mode='w')  # File output (overwritten on each run)
    ]
)
Error Recovery Strategies
- Retry with Exponential Backoff: For temporary network issues
- Skip and Continue: For individual item processing errors
- Fail Fast: For critical configuration or authentication errors
- Graceful Degradation: Continue with reduced functionality when possible
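Retry with exponential backoff, the first strategy above, can be sketched as a small wrapper. The function name, its parameters, and the choice of retryable exception types are assumptions for illustration; the scripts may implement retries differently or not at all.

```python
import time


def retry_with_backoff(operation, retries=3, base_delay=1.0,
                       sleep=time.sleep, retry_on=(ConnectionError, TimeoutError)):
    """Call operation(), retrying on transient errors with doubling delays."""
    for attempt in range(retries + 1):
        try:
            return operation()
        except retry_on:
            if attempt == retries:
                raise  # out of retries: let the caller skip the item or fail fast
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... with the defaults
```

A caller that follows the skip-and-continue strategy would wrap each item in its own try/except, log the failure, and move on to the next item.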
Common Error Scenarios
| Error | Cause | Recovery Strategy |
|---|---|---|
| Authentication failure | Invalid credentials | Re-authenticate or exit |
| Resource not found | Item doesn’t exist in DSP | Create new resource |
| Rate limit exceeded | Too many API requests | Wait and retry |
| Invalid payload | Data format error | Log error, skip item |
| Network timeout | Connection issues | Retry with backoff |
| File upload failure | File system or network issue | Retry or skip media |