SGB Data Validator Implementation
This document describes the implementation of the Omeka S data validator for the Stadt.Geschichte.Basel project.
Overview
The validator validates all items and media in the Omeka S item set 10780 against a comprehensive data model using pydantic. It checks for required fields, controlled vocabularies, well-formed URIs, and empty values.
Architecture
System Overview
graph TB
subgraph "CLI Interface"
CLI[validate.py]
end
subgraph "Data Validation Layer"
VAL[OmekaValidator]
ITEM[Item Model]
MEDIA[Media Model]
end
subgraph "External Services"
API[Omeka S API]
end
subgraph "Data Storage"
VOCAB[vocabularies.json]
end
CLI --> VAL
VAL --> ITEM
VAL --> MEDIA
VAL --> API
VAL --> VOCAB
ITEM --> VOCAB
MEDIA --> VOCAB
Core Components
validate.py
- Main CLI script- Connects to Omeka S API using httpx
- Validates items and media using pydantic models
- Generates comprehensive error reports
- Supports optional Omeka S authentication (key_identity and key_credential) via CLI or .env file
- Loads configuration from .env file (command-line args override)
src/models.py
- Pydantic data modelsOmekaProperty
- Base model for Omeka property valuesItem
- Model for Omeka S items with Dublin Core fieldsMedia
- Model for Omeka S media with Dublin Core fields- Includes field validators for required fields and formats
src/vocabularies.py
- Vocabulary loader- Loads controlled vocabularies from JSON
- Validates Era, MIME types, Licenses, and Iconclass terms
- Uses ISO 639-1 standard for language code validation
- Provides lookup methods for validation
3a. src/iso639.py
- ISO 639-1 language code validation
- Complete set of 184 ISO 639-1 two-letter language codes
- Case-insensitive validation
- Immutable frozenset for efficient lookups
- Standalone module for language code validation
data/raw/vocabularies.json
- Controlled vocabularies- Stadt.Geschichte.Basel Epoche (7 terms)
- Internet Media Types (17 terms)
- Licenses (7 terms)
- Iconclass (570+ terms)
test/test_validation.py
- Test suite- Tests validation with sample Omeka S data
- Tests error handling for invalid data
5a. test/test_iso639.py
- ISO 639-1 test suite
- Tests valid and invalid language codes
- Tests case-insensitive validation
- Tests edge cases and immutability
Features Implemented
Validation Process Flow
sequenceDiagram
participant User
participant CLI as validate.py
participant Validator as OmekaValidator
participant API as Omeka S API
participant Models as Pydantic Models
participant Vocab as VocabularyLoader
User->>CLI: Run validation command
CLI->>Validator: Initialize with config
Validator->>Vocab: Load controlled vocabularies
Vocab-->>Validator: Vocabularies loaded
loop For each page of items
Validator->>API: GET /api/items?item_set_id=10780
API-->>Validator: Return items batch
end
loop For each item
Validator->>Models: Validate item data
Models->>Vocab: Check vocabulary values
Vocab-->>Models: Validation result
Models-->>Validator: Item validation result
Validator->>API: GET /api/media?item_id={id}
API-->>Validator: Return media list
loop For each media
Validator->>Models: Validate media data
Models->>Vocab: Check vocabulary values
Vocab-->>Models: Validation result
Models-->>Validator: Media validation result
end
end
Validator->>CLI: Return validation report
CLI->>User: Display/save report
Core Requirements
✅ Validation Rules
✅ CLI Features
Configuration
The validator can be configured using a .env
file or command-line arguments. Command-line arguments override .env
values.
Setting up .env file
cp example.env .env
Edit .env
with your configuration:
OMEKA_URL=https://omeka.unibe.ch
KEY_IDENTITY=YOUR_KEY_IDENTITY
KEY_CREDENTIAL=YOUR_KEY_CREDENTIAL
ITEM_SET_ID=10780
Omeka S Authentication
The validator supports optional authentication with Omeka S using key_identity and key_credential parameters. These are Omeka S API credentials, not a single API key. Authentication is optional for read operations on public resources but may be required for:
- Accessing private items or item sets
- Rate limiting relief
- Write operations (when using the API module)
You can provide credentials via:
.env
file (recommended for development)- Command-line arguments (useful for CI/CD or one-off runs)
- No credentials (works for public resources)
Note: The validator performs read-only operations by default and does not require authentication for public item sets like 10780.
Usage Examples
Basic validation (uses .env if present)
uv run python validate.py
Save report to file
uv run python validate.py --output validation_report.txt
Use Omeka S authentication (can also be set in .env file)
uv run python validate.py --key-identity YOUR_KEY_IDENTITY --key-credential YOUR_KEY_CREDENTIAL
Validate different item set
uv run python validate.py --item-set-id 12345
Full options (override .env values)
uv run python validate.py \
--base-url https://omeka.unibe.ch \
--item-set-id 10780 \
--key-identity YOUR_KEY_IDENTITY \
--key-credential YOUR_KEY_CREDENTIAL \
--output report.txt
URI checking and data profiling
# Check URIs for broken links (with User-Agent rotation to avoid 403 errors)
uv run python validate.py --check-uris
# Check URIs and detect redirects
uv run python validate.py --check-uris --check-redirects
# Generate data profiling reports
uv run python validate.py --profile --profile-output my_analysis/
URL Detection in Literal Fields
The validator automatically checks all literal-type fields for URLs and generates warnings if any are found. This helps prevent unintentional inclusion of links in fields that are intended to be plain text values.
What is checked:
- All
dcterms:*
fields withtype: "literal"
- Detects URLs starting with
http://
,https://
,ftp://
, orwww.
- Detects URLs embedded within text
What is NOT checked:
- URI-type fields (e.g.,
dcterms:creator
withtype: "uri"
) - Fields that are supposed to contain URLs
Example warning:
[Item 10777] dcterms:description[0]: Literal field contains URL: Visit https://example.com for more
This feature was implemented in response to issue #22.
Development
Install dependencies
pip install uv
uv sync
Run linter
uv run ruff check .
Format code
uv run ruff format .
Run tests
uv run python test/test_validation.py
Data Model
The validator implements the complete data model from issue #1:
classDiagram
class Item {
+int o_id
+bool o_is_public
+str o_title
+datetime o_created
+datetime o_modified
+str dcterms_identifier
+str dcterms_title
+list dcterms_subject
+str dcterms_description
+str dcterms_temporal
+str dcterms_language
+str dcterms_isPartOf
+validate()
}
class Media {
+int o_id
+bool o_is_public
+str o_title
+str o_media_type
+int o_size
+str o_filename
+str o_original_url
+str o_sha256
+str dcterms_identifier
+str dcterms_title
+str dcterms_creator
+str dcterms_publisher
+str dcterms_date
+str dcterms_format
+str dcterms_license
+validate()
}
class OmekaValidator {
+str base_url
+str key_identity
+str key_credential
+httpx.Client client
+fetch_items()
+fetch_media()
+validate_item()
+validate_media()
+generate_report()
}
class VocabularyLoader {
+set eras
+set mime_types
+set licenses
+set iconclass
+is_valid_era()
+is_valid_mime_type()
+is_valid_license()
+is_valid_iconclass()
}
OmekaValidator --> Item
OmekaValidator --> Media
Item --> VocabularyLoader
Media --> VocabularyLoader
Item Fields
o:id
,o:is_public
,o:title
(required)o:created
,o:modified
(datetime)dcterms:identifier
(required)dcterms:title
(required, must not be empty)dcterms:subject
(Iconclass terms)dcterms:description
dcterms:temporal
(Era vocabulary)dcterms:language
(ISO 639-1)dcterms:isPartOf
Media Fields
o:id
,o:is_public
,o:title
(required)o:ingester
,o:renderer
o:media_type
(MIME vocabulary)o:size
,o:filename
,o:original_url
o:sha256
(hash)dcterms:identifier
(required)dcterms:title
(required, must not be empty)dcterms:subject
(Iconclass terms)dcterms:description
dcterms:creator
,dcterms:publisher
(URI or text)dcterms:date
(EDTF format)dcterms:temporal
(Era vocabulary)dcterms:type
(DCMI Type URI)dcterms:format
(MIME vocabulary)dcterms:extent
dcterms:source
(URI or text)dcterms:language
(ISO 639-1)dcterms:relation
(URI or text)dcterms:rights
dcterms:license
(License URI vocabulary)o:alt_text
Future Enhancements
Potential improvements for future versions:
URI Reachability Check- ✅ Implemented with--check-uris
flag- EDTF Validation - Validate dates conform to Extended Date/Time Format
ISO 639-1 Validation- ✅ Implemented with full ISO 639-1 standard support- Batch Processing - Support validating multiple item sets in one run
- JSON Report Format - Output reports in JSON for programmatic processing
Statistics Dashboard- ✅ Implemented with--profile
flag (ydata-profiling)- Incremental Validation - Only validate items modified since last run
- Custom Vocabularies - Support loading additional custom vocabularies
Notes
- The validator uses pydantic’s
extra="allow"
to permit additional fields while still validating known ones - Network access to omeka.unibe.ch is required for API validation
- The validator continues on errors to check all items/media in the set
- Sample data validation works offline using test/test_validation.py