graph TD
    A[data_2_dasch.py<br/>Main Migration Script] --> B[process_data_from_omeka.py<br/>Omeka Data Extraction]
    A --> C[api_get_project.py<br/>DSP Project Info]
    A --> D[api_get_lists.py<br/>DSP Lists]
    A --> E[api_get_lists_detailed.py<br/>Detailed List Data]
    click A href "#data_2_dasch.py" "Jump to data_2_dasch.py"
    click B href "#process_data_from_omeka.py" "Jump to process_data_from_omeka.py"
    click C href "#api_get_project.py" "Jump to api_get_project.py"
    click D href "#api_get_lists.py" "Jump to api_get_lists.py"
    click E href "#api_get_lists_detailed.py" "Jump to api_get_lists_detailed.py"
    style A fill:#e3f2fd
    style B fill:#fff3e0
    style C fill:#f3e5f5
    style D fill:#f3e5f5
    style E fill:#f3e5f5
API Reference Documentation
This document provides comprehensive documentation for all Python modules and functions in the omeka2dsp system.
Module Overview
The omeka2dsp system consists of five main Python modules:
data_2_dasch.py
The main migration script that orchestrates the entire data transfer process from Omeka to DSP.
Core Functions
main() -> None
Purpose: Entry point that orchestrates the entire migration process.
Workflow:
- Parse command-line arguments for processing mode
- Fetch and filter data based on mode (all_data, sample_data, test_data)
- Authenticate with DSP and retrieve project information
- Process each item: create new or synchronize existing resources
- Handle associated media files
Parameters: None (uses command-line arguments)
Returns: None
Example Usage:
uv run python scripts/data_2_dasch.py -m sample_data
parse_arguments() -> Namespace
Purpose: Parses command-line arguments for processing mode selection.
Parameters: None (reads from sys.argv)
Returns:
- Namespace: Contains parsed arguments with a mode attribute
Available Modes:
- all_data: Process entire collection
- sample_data: Process random sample (configurable size)
- test_data: Process predefined test dataset
Example:
args = parse_arguments()
print(args.mode)  # 'sample_data'
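A minimal sketch of how such a mode parser could be built with argparse. Only the -m flag and the three mode names come from this document; the long option --mode, the default value, and the optional argv parameter are assumptions added for illustration.

```python
import argparse

# Mode names as listed in this document.
MODES = ("all_data", "sample_data", "test_data")


def parse_arguments(argv=None):
    """Parse the processing mode; argv=None falls back to sys.argv[1:]."""
    parser = argparse.ArgumentParser(description="Migrate Omeka items to DSP")
    parser.add_argument(
        "-m",
        "--mode",  # assumed long form of the documented -m flag
        choices=MODES,
        default="all_data",  # assumed default, not confirmed by the source
        help="which subset of the collection to process",
    )
    return parser.parse_args(argv)


args = parse_arguments(["-m", "sample_data"])
print(args.mode)
```

Passing an explicit argument list keeps the parser testable without touching sys.argv.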
Authentication & Project Functions
login(email: str, password: str) -> str
Purpose: Authenticates with DSP API and retrieves JWT token.
Parameters:
- email: DSP user email address
- password: DSP user password
Returns:
- str: JWT authentication token
Raises:
- requests.RequestException: On authentication failure
- KeyError: If response format is unexpected
Example:
token = login("user@example.com", "password")
get_project() -> str
Purpose: Retrieves project information from DSP API using project shortcode.
Parameters: None (uses PROJECT_SHORT_CODE environment variable)
Returns:
- str: Project IRI/identifier
Side Effects: Logs project information
Example:
project_iri = get_project()
# Returns: "http://rdfh.ch/projects/IbwoJlv8SEa6L13vXyCzMg"
get_lists(project_iri: str) -> list
Purpose: Retrieves all list configurations for a DSP project.
Parameters:
- project_iri: The project IRI to fetch lists for
Returns:
- list: Array of complete list objects with nodes and values
Process:
- Fetches list summary from /admin/lists/
- For each list, fetches detailed information from /v2/lists/{id}
- Returns complete list data for mapping operations
Example:
lists = get_lists(project_iri)
for list_obj in lists:
    print(f"List: {list_obj['listinfo']['name']}")
Resource Management Functions
get_full_resource(token: str, resource_iri: str) -> dict
Purpose: Retrieves complete resource data from DSP API.
Parameters:
- token: JWT authentication token
- resource_iri: URL-encoded resource IRI
Returns:
- dict: Complete resource JSON object
Usage: Used for synchronization to compare existing DSP data with Omeka data.
Example:
resource_data = get_full_resource(token, urllib.parse.quote(resource_iri, safe=''))
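Because the resource IRI is itself a URL, it has to be percent-encoded before it can be used as a path segment. The sketch below shows that step in isolation. The helper name resource_url and the /v2/resources/ path are assumptions for illustration, not taken from the script.

```python
from urllib.parse import quote


def resource_url(api_host: str, resource_iri: str) -> str:
    """Build a request URL with the IRI encoded as a single path segment."""
    # safe="" also encodes "/" and ":", which quote() would otherwise keep.
    encoded_iri = quote(resource_iri, safe="")
    # The endpoint path is an assumption made for this sketch.
    return f"{api_host.rstrip('/')}/v2/resources/{encoded_iri}"


print(resource_url("https://api.example.org/", "http://rdfh.ch/0123/abc"))
```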
get_resource_by_id(token: str, object_class: str, identifier: str) -> dict
Purpose: Finds a resource by its identifier using SPARQL query.
Parameters:
- token: JWT authentication token
- object_class: DSP resource class (e.g., "sgb_OBJECT")
- identifier: Unique identifier to search for
Returns:
- dict: Resource data if found, empty dict if not found
SPARQL Query: Constructs and executes a SPARQL query to find resources by identifier.
Example:
resource = get_resource_by_id(token, f"{PREFIX}sgb_OBJECT", "abb13025")
if resource:
    print(f"Found resource: {resource['@id']}")
create_resource(payload: dict, token: str) -> None
Purpose: Creates a new resource in DSP using the provided payload.
Parameters:
- payload: Complete DSP resource payload (JSON-LD format)
- token: JWT authentication token
Returns: None
Side Effects:
- Creates resource in DSP
- Logs creation success/failure
Example:
payload = construct_payload(omeka_item, f"{PREFIX}sgb_OBJECT", project_iri, lists, "", "")
create_resource(payload, token)
Data Extraction & Transformation Functions
extract_dasch_propvalue(item: dict, prop: str) -> str
Purpose: Extracts a single property value from a DSP resource.
Parameters:
- item: DSP resource data
- prop: Property name (without prefix)
Returns:
- str: Property value or empty string if not found
Supported Value Types: TextValue, ListValue, LinkValue, UriValue
Example:
title = extract_dasch_propvalue(dsp_resource, "title")
extract_dasch_propvalue_multiple(item: dict, prop: str) -> list
Purpose: Extracts multiple values for a property from a DSP resource.
Parameters:
- item: DSP resource data
- prop: Property name (without prefix)
Returns:
- list: Array of property values
Usage: For properties that can have multiple values (arrays).
Example:
subjects = extract_dasch_propvalue_multiple(dsp_resource, "subject")
extract_value_from_entry(entry: dict) -> str
Purpose: Extracts the actual value from a DSP property entry based on its type.
Parameters:
- entry: DSP property entry with @type and value fields
Returns:
- str: Extracted value or None
Supported Types:
- knora-api:TextValue: Returns knora-api:valueAsString
- knora-api:ListValue: Returns node IRI from knora-api:listValueAsListNode
- knora-api:LinkValue: Returns target IRI from knora-api:linkValueHasTargetIri
- knora-api:UriValue: Returns URI from knora-api:uriValueAsUri
Example:
value = extract_value_from_entry({
    "@type": "knora-api:TextValue",
    "knora-api:valueAsString": "Example text"
})
# Returns: "Example text"
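The type dispatch described above can be sketched as follows. The value type names and property keys come from this document; the nested "@id" and "@value" keys are assumptions about the JSON-LD shape, and the real function may differ.

```python
def extract_value_from_entry(entry: dict):
    """Return the payload of a DSP value entry, or None for unknown types."""
    entry_type = entry.get("@type")
    if entry_type == "knora-api:TextValue":
        return entry.get("knora-api:valueAsString")
    if entry_type == "knora-api:ListValue":
        # Assumed shape: {"knora-api:listValueAsListNode": {"@id": "<node IRI>"}}
        return entry.get("knora-api:listValueAsListNode", {}).get("@id")
    if entry_type == "knora-api:LinkValue":
        # Assumed shape: {"knora-api:linkValueHasTargetIri": {"@id": "<IRI>"}}
        return entry.get("knora-api:linkValueHasTargetIri", {}).get("@id")
    if entry_type == "knora-api:UriValue":
        # Assumed shape: {"knora-api:uriValueAsUri": {"@value": "<URI>"}}
        return entry.get("knora-api:uriValueAsUri", {}).get("@value")
    return None


text_entry = {
    "@type": "knora-api:TextValue",
    "knora-api:valueAsString": "Example text",
}
print(extract_value_from_entry(text_entry))
```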
construct_payload(item: dict, type: str, project_iri: str, lists: list, parent_iri: str, internalMediaFilename: str) -> dict
Purpose: Converts Omeka item data into DSP-compatible JSON-LD payload.
Parameters:
- item: Omeka item data
- type: DSP resource type (e.g., "sgb_OBJECT", "sgb_MEDIA_IMAGE")
- project_iri: Project IRI for resource association
- lists: DSP lists for value mapping
- parent_iri: Parent resource IRI for linking
- internalMediaFilename: Internal filename for media resources
Returns:
- dict: Complete DSP resource payload in JSON-LD format
Key Transformations:
Omeka Property | DSP Property | Value Type | Notes |
---|---|---|---|
dcterms:title | rdfs:label | String | Required field |
dcterms:identifier | identifier | TextValue | Unique identifier |
dcterms:description | description | TextValue | Item description |
dcterms:creator | creator | TextValue | Creator information |
dcterms:date | date | TextValue | Date information |
dcterms:subject | subject | TextValue Array | Subject tags |
dcterms:type | type | ListValue | Mapped to DSP lists |
dcterms:format | format | ListValue | Media format mapping |
dcterms:language | language | ListValue | Language mapping |
dcterms:rights | rights | TextValue | Rights information |
dcterms:license | license | UriValue | License URL |
Example:
payload = construct_payload(
    item=omeka_item,
    type=f"{PREFIX}sgb_OBJECT",
    project_iri=project_iri,
    lists=project_lists,
    parent_iri="",
    internalMediaFilename=""
)
extract_listvalueiri_from_value(value: str, list_label: str, lists: list) -> str
Purpose: Maps an Omeka value to a DSP list node IRI.
Parameters:
- value: Value to map (e.g., "image/jpeg")
- list_label: Name of the DSP list to search in
- lists: Array of DSP list objects
Returns:
- str: DSP list node IRI if found, empty string otherwise
Process:
- Finds the list with matching label
- Searches through list nodes for matching value
- Returns the node IRI for API operations
Example:
format_iri = extract_listvalueiri_from_value(
    "image/jpeg",
    "Internet Media Type",
    project_lists
)
# Returns: "http://rdfh.ch/lists/IbwoJlv8SEa6L13vXyCzMg/image-jpeg"
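The lookup steps above can be sketched as a recursive search. The list structure used here (a "listinfo" dict with a "name", and nested "children" nodes carrying "id" and "name") is a simplified assumption for illustration; the actual DSP response shape may differ, and find_node_iri is a hypothetical helper.

```python
def find_node_iri(nodes: list, value: str) -> str:
    """Depth-first search for a node whose name equals value."""
    for node in nodes:
        if node.get("name") == value:
            return node.get("id", "")
        found = find_node_iri(node.get("children", []), value)
        if found:
            return found
    return ""


def extract_listvalueiri_from_value(value: str, list_label: str, lists: list) -> str:
    """Return the IRI of the matching node in the named list, or ''."""
    for list_obj in lists:
        if list_obj.get("listinfo", {}).get("name") == list_label:
            return find_node_iri(list_obj.get("children", []), value)
    return ""


# Hypothetical sample data in the assumed shape.
sample_lists = [
    {
        "listinfo": {"name": "Internet Media Type"},
        "children": [
            {"id": "http://rdfh.ch/lists/0000/image", "name": "image", "children": [
                {"id": "http://rdfh.ch/lists/0000/image-jpeg", "name": "image/jpeg"},
            ]},
        ],
    }
]
print(extract_listvalueiri_from_value("image/jpeg", "Internet Media Type", sample_lists))
```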
Synchronization Functions
check_values(dasch_item: dict, omeka_item: dict, lists: list) -> list
Purpose: Compares DSP and Omeka data to identify changes that need synchronization.
Parameters:
- dasch_item: Current DSP resource data
- omeka_item: Current Omeka item data
- lists: DSP lists for value mapping
Returns:
- list: Array of change operations (create, update, delete)
Change Detection:
- Compares each property between systems
- Identifies additions, deletions, and modifications
- Handles both single values and arrays
Example:
changes = check_values(dsp_resource, omeka_item, project_lists)
for change in changes:
    print(f"Action: {change['type']}, Field: {change['field']}")
sync_value(prop: str, prop_type: str, dasch_value: str, omeka_value: str) -> list
Purpose: Generates sync operations for single-value properties.
Parameters:
- prop: Property name
- prop_type: DSP property type (TextValue, ListValue, etc.)
- dasch_value: Current value in DSP
- omeka_value: Current value in Omeka
Returns:
- list: Array of change operations
Logic:
- If values are different, creates update operation
- If DSP has value but Omeka doesn’t, creates delete operation
- If Omeka has value but DSP doesn’t, creates create operation
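A minimal sketch of this decision logic follows. The "type" and "field" keys mirror the check_values example in this document; the "field_type" and "value" keys, and the treatment of None as an empty value, are assumptions.

```python
def sync_value(prop: str, prop_type: str, dasch_value, omeka_value) -> list:
    """Return zero or one change operation for a single-value property."""
    dasch = dasch_value or ""  # treat None as "no value" (assumption)
    omeka = omeka_value or ""
    if dasch == omeka:
        return []  # nothing to synchronize
    if dasch and not omeka:
        kind, value = "delete", dasch
    elif omeka and not dasch:
        kind, value = "create", omeka
    else:
        kind, value = "update", omeka
    return [{"type": kind, "field": prop, "field_type": prop_type, "value": value}]


print(sync_value("title", "TextValue", "Old title", "New title"))
```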
sync_array_value(prop: str, prop_type: str, dasch_array: list, omeka_array: list) -> list
Purpose: Generates sync operations for multi-value properties.
Parameters:
- prop: Property name
- prop_type: DSP property type
- dasch_array: Current values in DSP
- omeka_array: Current values in Omeka
Returns:
- list: Array of change operations
Algorithm:
- Converts arrays to sets for comparison
- Calculates additions (in Omeka but not DSP)
- Calculates deletions (in DSP but not Omeka)
- Generates corresponding create/delete operations
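The set-based algorithm above can be sketched as follows. The shape of a change operation and the sorted output order are assumptions chosen to make the result deterministic; the values must be hashable for the set conversion to work.

```python
def sync_array_value(prop: str, prop_type: str, dasch_array: list, omeka_array: list) -> list:
    """Return create/delete operations that turn the DSP values into the Omeka values."""
    dasch_set, omeka_set = set(dasch_array), set(omeka_array)
    additions = sorted(omeka_set - dasch_set)  # in Omeka but not in DSP
    deletions = sorted(dasch_set - omeka_set)  # in DSP but not in Omeka
    changes = [
        {"type": "create", "field": prop, "field_type": prop_type, "value": v}
        for v in additions
    ]
    changes += [
        {"type": "delete", "field": prop, "field_type": prop_type, "value": v}
        for v in deletions
    ]
    return changes


ops = sync_array_value("subject", "TextValue", ["Basel", "History"], ["Basel", "Trade"])
print(ops)
```

Values present on both sides produce no operation, so an unchanged array yields an empty list.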
update_value(token: str, item: dict, value: str, field: str, field_type: str, type_of_change: str) -> None
Purpose: Executes a single value update operation via DSP API.
Parameters:
- token: JWT authentication token
- item: DSP resource data
- value: New value to set
- field: Property name
- field_type: DSP value type
- type_of_change: Operation type ("create", "update", "delete")
Returns: None
Side Effects:
- Modifies DSP resource via API
- Logs operation results
File Upload Functions
upload_file_from_url(file_url: str, token: str, zip: bool = False) -> str
Purpose: Downloads a file from Omeka and uploads it to DSP storage.
Parameters:
- file_url: URL of file in Omeka
- token: JWT authentication token
- zip: Whether to compress file before upload
Returns:
- str: Internal filename assigned by DSP
Process:
- Downloads file from Omeka URL
- Saves to temporary file
- Optionally creates ZIP archive
- Uploads to DSP via multipart form
- Returns DSP internal filename
Example:
internal_filename = upload_file_from_url(
    "https://omeka.example.com/files/image.jpg",
    token,
    zip=False
)
specify_mediaclass(media_type: str) -> str
Purpose: Determines appropriate DSP media class based on MIME type.
Parameters:
- media_type: MIME type string (e.g., "image/jpeg")
Returns:
- str: DSP media class name
Mapping:
- image/* → sgb_MEDIA_IMAGE
- application/pdf, text/*, application/zip → sgb_MEDIA_ARCHIV
- All others → sgb_MEDIA_ARCHIV (default)
Example:
media_class = specify_mediaclass("image/jpeg")
# Returns: "StadtGeschichteBasel_v1:sgb_MEDIA_IMAGE"
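Since every non-image type in the mapping resolves to the same class, the rule reduces to a single prefix check. The sketch below assumes the PREFIX default listed under Environment Variables and is not necessarily how the script implements it.

```python
PREFIX = "StadtGeschichteBasel_v1:"  # default ontology prefix from this document


def specify_mediaclass(media_type: str) -> str:
    """Map a MIME type to a DSP media class name."""
    if media_type.startswith("image/"):
        return f"{PREFIX}sgb_MEDIA_IMAGE"
    # application/pdf, text/*, application/zip and all other types
    # fall through to the archive class.
    return f"{PREFIX}sgb_MEDIA_ARCHIV"


print(specify_mediaclass("image/jpeg"))
```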
Utility Functions
arrays_equal(array1: list, array2: list) -> bool
Purpose: Compares two arrays for equality, ignoring order.
Parameters:
- array1: First array to compare
- array2: Second array to compare
Returns:
- bool: True if arrays contain the same elements
Usage: Used in synchronization to detect array changes.
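One way to compare arrays while ignoring order is to compare element counts. This sketch assumes hashable elements and treats duplicates as significant; the document does not say whether the real function counts duplicates.

```python
from collections import Counter


def arrays_equal(array1: list, array2: list) -> bool:
    """Order-insensitive comparison that still respects duplicate counts."""
    return Counter(array1) == Counter(array2)


print(arrays_equal(["Basel", "History"], ["History", "Basel"]))
```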
process_data_from_omeka.py
Handles data extraction and processing from the Omeka API.
Core Functions
get_items_from_collection(collection_id: str) -> list
Purpose: Retrieves all items from a specified Omeka collection with pagination handling.
Parameters:
- collection_id: Omeka collection/item set ID
Returns:
- list: Array of all items in the collection
Features:
- Automatic pagination handling
- Rate limiting compliance
- Error recovery for temporary failures
Example:
items = get_items_from_collection("10780")
print(f"Found {len(items)} items")
get_media(item_id: str) -> list
Purpose: Retrieves all media files associated with a specific Omeka item.
Parameters:
- item_id: Omeka item ID
Returns:
- list: Array of media objects with metadata and file URLs
Example:
media_files = get_media("12345")
for media in media_files:
    print(f"Media: {media.get('o:filename')}")
get_paginated_items(url: str, params: dict) -> list
Purpose: Generic function to handle paginated API requests.
Parameters:
- url: Base API endpoint URL
- params: Query parameters for first request
Returns:
- list: Combined results from all pages
Features:
- Follows pagination links automatically
- Handles rate limiting
- Error recovery
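The pagination loop can be sketched independently of the HTTP layer. In this sketch the third parameter, fetch, is a hypothetical callable that performs one request and returns the page items together with the URL of the next page (or None); the documented function takes only url and params and handles the request itself.

```python
def get_paginated_items(url: str, params: dict, fetch) -> list:
    """Collect items from every page by following next-page links."""
    items = []
    next_url, next_params = url, params
    while next_url:
        page_items, next_url = fetch(next_url, next_params)
        items.extend(page_items)
        # Assumption: the next-page link already carries its own query string.
        next_params = None
    return items


# Hypothetical in-memory stand-in for the API, used only for illustration.
fake_pages = {
    "page-1": (["item-a", "item-b"], "page-2"),
    "page-2": (["item-c"], None),
}
print(get_paginated_items("page-1", {"per_page": 2}, lambda u, p: fake_pages[u]))
```

Injecting the request function keeps the loop testable without network access.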
Data Extraction Functions
extract_property(props: list, prop_id: int, as_uri: bool = False, only_label: bool = False) -> str
Purpose: Extracts a specific property value from Omeka property array.
Parameters:
- props: Array of Omeka property objects
- prop_id: Numerical ID of property to extract
- as_uri: Return as formatted URI link (default: False)
- only_label: Return only the label (default: False)
Returns:
- str: Property value in requested format
Formats:
- Default: Returns @value field
- as_uri=True: Returns [label](uri) markdown format
- only_label=True: Returns o:label field only
Example:
title = extract_property(item.get("dcterms:title", []), 1)
creator_link = extract_property(item.get("dcterms:creator", []), 2, as_uri=True)
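The three output formats can be sketched as below. The keys "property_id", "@value", "@id" and "o:label" are assumptions about the shape of an Omeka property object, and returning an empty string when nothing matches is also assumed.

```python
def extract_property(props: list, prop_id: int, as_uri: bool = False, only_label: bool = False) -> str:
    """Return the first property with the given ID in the requested format."""
    for prop in props:
        if prop.get("property_id") != prop_id:  # assumed key name
            continue
        if as_uri:
            # Markdown link: [label](uri)
            return f"[{prop.get('o:label', '')}]({prop.get('@id', '')})"
        if only_label:
            return prop.get("o:label", "")
        return prop.get("@value", "")
    return ""


# Hypothetical sample data in the assumed shape.
creator_props = [
    {"property_id": 2, "@id": "https://example.org/person/1", "o:label": "Jane Doe"},
]
print(extract_property(creator_props, 2, as_uri=True))
```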
extract_combined_values(props: list) -> list
Purpose: Combines text values and URI references from properties into a single array.
Parameters:
- props: Array of Omeka property objects
Returns:
- list: Combined array of text values and formatted URI links
Process:
- Extracts all @value text fields
- Formats URI references as HTML links
- Escapes semicolons to prevent conflicts
- Returns combined array
Example:
subjects = extract_combined_values(item.get("dcterms:subject", []))
# Returns: ["History", "Basel", "<a href='...'>Authority Record</a>"]
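A sketch of the combining steps listed above. The property keys are assumptions about the Omeka data shape, and the document does not say how semicolons are escaped, so the "&#59;" replacement is purely illustrative.

```python
def extract_combined_values(props: list) -> list:
    """Combine plain text values and URI references into one list."""
    # Step 1: plain text values.
    texts = [str(p["@value"]) for p in props if "@value" in p]
    # Step 2: URI references rendered as HTML links.
    links = [
        f"<a href='{p['@id']}'>{p.get('o:label') or p['@id']}</a>"
        for p in props
        if "@value" not in p and "@id" in p
    ]
    # Step 3: escape semicolons (the escape sequence is an assumption).
    return [entry.replace(";", "&#59;") for entry in texts + links]


# Hypothetical sample data in the assumed shape.
subject_props = [
    {"@value": "History"},
    {"@id": "https://example.org/a", "o:label": "Authority Record"},
    {"@value": "Basel; Stadt"},
]
print(extract_combined_values(subject_props))
```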
Utility Functions
is_valid_url(url: str) -> bool
Purpose: Validates if a string is a properly formatted URL.
Parameters:
- url: URL string to validate
Returns:
- bool: True if URL is valid
Example:
valid = is_valid_url("https://example.com/file.jpg")
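A common way to implement such a check uses urllib.parse. This sketch accepts only http and https URLs with a host part, which is an assumption; the real function may apply different criteria.

```python
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Return True for http(s) URLs that include a host."""
    try:
        parsed = urlparse(url)
    except (ValueError, TypeError, AttributeError):
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


print(is_valid_url("https://example.com/file.jpg"))
```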
download_file(url: str, dest_path: str) -> None
Purpose: Downloads a file from URL to local path.
Parameters:
- url: Source file URL
- dest_path: Destination file path
Returns: None
Features:
- Creates directories as needed
- Streaming download for large files
- Error handling and logging
api_get_project.py
Standalone script to fetch DSP project information.
get_project() -> None
Purpose: Fetches project data from DSP API and saves to file.
Environment Variables:
- PROJECT_SHORT_CODE: DSP project shortcode
- API_HOST: DSP API base URL
Output: Saves project data to ../data/project_data.json
Example Usage:
export PROJECT_SHORT_CODE="0123"
export API_HOST="https://api.dasch.swiss"
uv run python scripts/api_get_project.py
api_get_lists.py
Standalone script to fetch DSP list configurations.
get_lists() -> None
Purpose: Fetches list summary from DSP API and saves to file.
Configuration:
- Hardcoded project IRI (should be updated for different projects)
- Fixed API host (should be made configurable)
Output: Saves list data to ../data/data_lists.json
Example Usage:
uv run python scripts/api_get_lists.py
api_get_lists_detailed.py
Standalone script to fetch detailed DSP list information.
get_complete_list(list_id: str) -> dict
Purpose: Fetches complete list data for a single list ID.
Parameters:
- list_id: DSP list IRI
Returns:
- dict: Complete list object with all nodes and values
Process:
- URL-encodes the list IRI
- Requests detailed list data from /v2/lists/{id}
- Returns complete list structure
Main Script Logic:
- Loads list summary from data_lists.json
- Iterates through each list
- Fetches detailed information for each
- Saves all detailed lists to data_lists_detail.json
Configuration Constants
Environment Variables
Variable | Type | Required | Description | Default |
---|---|---|---|---|
ITEM_SET_ID | string | No | Omeka collection ID | "10780" |
PROJECT_SHORT_CODE | string | Yes | DSP project shortcode | None |
API_HOST | string | Yes | DSP API base URL | None |
INGEST_HOST | string | Yes | DSP ingest service URL | None |
DSP_USER | string | Yes | DSP username | None |
DSP_PWD | string | Yes | DSP password | None |
PREFIX | string | No | Ontology prefix | "StadtGeschichteBasel_v1:" |
OMEKA_API_URL | string | No | Omeka API base URL | "https://omeka.unibe.ch/api/" |
KEY_IDENTITY | string | Yes | Omeka API key identity | None |
KEY_CREDENTIAL | string | Yes | Omeka API key credential | None |
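The required and optional variables can be validated once at startup so that a missing setting fails fast. The sketch below uses the variable names and defaults listed in this section; the load_config helper and its dict return value are hypothetical and not part of the scripts.

```python
import os

REQUIRED = (
    "PROJECT_SHORT_CODE", "API_HOST", "INGEST_HOST",
    "DSP_USER", "DSP_PWD", "KEY_IDENTITY", "KEY_CREDENTIAL",
)
DEFAULTS = {
    "ITEM_SET_ID": "10780",
    "PREFIX": "StadtGeschichteBasel_v1:",
    "OMEKA_API_URL": "https://omeka.unibe.ch/api/",
}


def load_config(environ=None) -> dict:
    """Read settings from the environment; raise if a required one is missing."""
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing required environment variables: " + ", ".join(missing))
    config = {name: env[name] for name in REQUIRED}
    # Optional variables fall back to the documented defaults.
    config.update({name: env.get(name, default) for name, default in DEFAULTS.items()})
    return config
```

Passing a plain dict as environ makes the check easy to exercise in tests.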
Processing Constants
Constant | Value | Description |
---|---|---|
NUMBER_RANDOM_OBJECTS | 2 | Number of items for sample mode |
TEST_DATA | Set of identifiers | Specific items for test mode |
Error Handling
Exception Types
The system handles several types of errors:
- Authentication Errors: Invalid credentials, expired tokens
- Network Errors: Connection timeouts, API unavailability
- Data Validation Errors: Invalid payloads, missing required fields
- Rate Limiting: API quota exceeded
- File System Errors: Permission issues, disk space
Logging Configuration
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[
        logging.StreamHandler(),  # Console output
        logging.FileHandler("data_2_dasch.log", mode='w'),  # File output
    ],
)
Error Recovery Strategies
- Retry with Exponential Backoff: For temporary network issues
- Skip and Continue: For individual item processing errors
- Fail Fast: For critical configuration or authentication errors
- Graceful Degradation: Continue with reduced functionality when possible
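The first strategy, retry with exponential backoff, can be sketched as a small wrapper. The helper name, its defaults, and the set of retried exceptions are assumptions for illustration; the scripts may handle retries differently.

```python
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=1.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(); on a retryable error wait 1x, 2x, 4x... base_delay and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

The sleep function is injectable so the delays can be observed in tests without actually waiting. Errors outside retry_on propagate immediately, which matches the fail-fast strategy for configuration and authentication problems.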
Common Error Scenarios
Error | Cause | Recovery Strategy |
---|---|---|
Authentication failure | Invalid credentials | Re-authenticate or exit |
Resource not found | Item doesn’t exist in DSP | Create new resource |
Rate limit exceeded | Too many API requests | Wait and retry |
Invalid payload | Data format error | Log error, skip item |
Network timeout | Connection issues | Retry with backoff |
File upload failure | File system or network issue | Retry or skip media |