SGB Data Validator

Data Validation and Management Suite for Omeka S

Overview

The sgb-data-validator is a comprehensive Python toolkit for validating and managing metadata in Omeka S digital collections. Built for the Stadt.Geschichte.Basel project, it ensures cultural heritage data quality through schema validation, controlled vocabularies, and automated data transformations.

Key Features
  • ✅ Schema validation using Pydantic models with Dublin Core metadata
  • 📚 Controlled vocabularies (Era, MIME types, Licenses, ICONCLASS notation)
  • 🌍 ISO 639-1 language validation for all 184 two-letter codes
  • 🔗 URI validation with reachability checks and redirect detection
  • 📊 CSV reports for data quality review with direct edit links
  • 📈 Data profiling with interactive HTML reports (ydata-profiling)
  • 🔄 Data transformation (Unicode NFC, HTML entities, whitespace normalization)
  • 💾 Offline workflow for batch editing (download → transform → validate → upload)
  • 🔒 Privacy management (automatic private flag propagation for placeholder images)
  • 🚀 Fast and efficient with asynchronous processing and User-Agent rotation


Quick Start

Installation

This project uses uv for fast dependency management:

pip install uv
uv sync

Configuration

Create a .env file from the example:

cp example.env .env

Edit with your Omeka S credentials:

OMEKA_URL=https://omeka.unibe.ch
KEY_IDENTITY=YOUR_KEY_IDENTITY
KEY_CREDENTIAL=YOUR_KEY_CREDENTIAL
ITEM_SET_ID=10780

Tip

Command-line parameters override .env values, allowing default settings in .env with per-run customization.

Basic Usage

Validate an item set:

# Use settings from .env
uv run python validate.py

# Override specific settings
uv run python validate.py --item-set-id 12345 --output report.txt

# Enable URI checking and data profiling
uv run python validate.py --check-uris --profile

# Export validation results as CSV
uv run python validate.py --export-csv

Get help:

uv run python validate.py --help

Core Capabilities

1. Validation & Quality Assurance

Comprehensive metadata validation against Omeka S data models:

  • Required fields: Ensures presence of mandatory Dublin Core elements
  • Controlled vocabularies: Validates against Stadt.Geschichte.Basel vocabularies
  • URL detection: Warns when literal fields contain URLs (should use URI type)
  • URI validation: Checks HTTP reachability with configurable severity
  • ICONCLASS notation: Validates complex notation syntax (e.g., 25F23(+8))

# Standard validation
uv run python validate.py

# Check for broken links
uv run python validate.py --check-uris

# Treat broken links as errors
uv run python validate.py --check-uris --uri-check-severity error
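
The ICONCLASS check validates notation syntax such as 25F23(+8). As a rough illustration of the shapes involved, here is a simplified stand-alone regex; it is not the project's actual parser in src/iconclass.py:

import re

# Rough shape of an ICONCLASS notation: a numeric class, optional
# letter subdivisions, optional "(...)" text and "(+...)" key.
# Illustrative simplification only, not src/iconclass.py.
ICONCLASS_RE = re.compile(r"^\d{1,2}[A-Z]*\d*(\([^)]*\))?(\(\+[^)]+\))?$")

for notation in ["25F23", "25F23(+8)", "25F(EAGLE)", "25F23(+"]:
    print(notation, bool(ICONCLASS_RE.match(notation)))  # last one is invalid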

2. CSV Reports

Export validation results for review and batch corrections:

uv run python validate.py --export-csv --csv-output my_reports/

Generated files:

  • items_validation.csv - One row per item, one column per field
  • media_validation.csv - One row per media object, one column per field
  • validation_summary.csv - Aggregated statistics

Each row includes a direct edit link to the resource in Omeka S admin interface.
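
Because the reports are plain CSV, they can be post-processed with standard tooling. For example, listing only the rows that carry validation messages (the o:id and edit link column names here are assumptions based on the layout described above):

import csv

# Print items that have at least one non-empty validation column.
# Column names "o:id" and "edit_link" are assumed for illustration.
with open("my_reports/items_validation.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        problems = {k: v for k, v in row.items() if v and k not in ("o:id", "edit_link")}
        if problems:
            print(row.get("o:id"), problems)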

Note

📖 For detailed CSV report documentation, see Validation Reports

3. Data Profiling

Generate comprehensive statistical analysis with ydata-profiling:

# Full profiling
uv run python validate.py --profile

# Minimal mode (faster)
uv run python validate.py --profile --profile-minimal --profile-output analysis/

Produces interactive HTML reports with:

  • Dataset statistics and distributions
  • Correlation analysis
  • Missing data patterns
  • Variable type detection
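
The profiling step wraps ydata-profiling; for reference, the library's core API looks roughly like this (the DataFrame below is a stand-in for the flattened item metadata):

from pathlib import Path

import pandas as pd
from ydata_profiling import ProfileReport

# Stand-in for the flattened metadata the validator assembles
df = pd.DataFrame({
    "dcterms:language": ["de", "fr", None],
    "o:is_public": [True, True, False],
})

# minimal=True mirrors the --profile-minimal flag
Path("analysis").mkdir(exist_ok=True)
ProfileReport(df, title="Item set profile", minimal=True).to_file("analysis/profile.html")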

4. Data Transformation

Clean and normalize metadata with comprehensive transformations:

Issue #31 - Comprehensive transformations:

  • Unicode NFC normalization (diacritics: ö, ä, ü)
  • HTML entity conversion (&auml; → ä)
  • Markdown link formatting ((URL)[label] → [label](URL))
  • Abbreviation normalization (d.j. → d. J., d.ä. → d. Ä.)
  • Wikidata URL normalization (m.wikidata.org/wiki/Q123 → https://www.wikidata.org/wiki/Q123)
  • URL standardization (add www. prefix, remove trailing slashes)
  • HTTP to HTTPS upgrade (with availability checking)

Issue #28 - Whitespace normalization:

  • Remove soft hyphens (U+00AD)
  • Normalize non-breaking spaces (U+00A0, U+202F)
  • Remove zero-width characters
  • Collapse multiple spaces
  • Normalize line breaks
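
These rules amount to a handful of string replacements; an illustrative stand-alone version (not the project's own implementation in src/transformations.py) looks like this:

import re

def normalize_whitespace(text: str) -> str:
    """Illustrative re-implementation of the Issue #28 rules above."""
    text = text.replace("\u00ad", "")  # remove soft hyphens
    text = text.replace("\u00a0", " ").replace("\u202f", " ")  # non-breaking spaces
    text = re.sub(r"[\u200b\u200c\u200d\ufeff]", "", text)  # zero-width characters
    text = re.sub(r"[ \t]{2,}", " ", text)  # collapse multiple spaces
    text = re.sub(r"\r\n?", "\n", text)  # normalize line breaks
    return text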

Issue #36 - Privacy management:

  • Automatically set media with placeholder images to private
  • Propagate private flag from media children to parent items

Text transformations can also be invoked programmatically:

from src.transformations import apply_text_transformations

text = "über d.j. m.wikidata.org/wiki/Q123"
result = apply_text_transformations(text)
# Result: "über d. J. https://www.wikidata.org/wiki/Q123"
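
The Issue #36 privacy logic reduces to a check over each item's attached media; a minimal sketch (the function is hypothetical, o:is_public is the real Omeka S field):

def propagate_private_flag(item: dict, media: list[dict]) -> dict:
    # If any attached media is private (e.g. flagged as a placeholder
    # image), the parent item is made private as well. Hypothetical
    # sketch; only the o:is_public key is Omeka S's own.
    if any(not m.get("o:is_public", True) for m in media):
        item["o:is_public"] = False
    return item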

5. Offline Workflow

Complete workflow for batch editing:

# 1. Download raw data
uv run python workflow.py download --item-set-id 10780

# 2. Transform (applies all transformations by default)
uv run python workflow.py transform data/raw_itemset_10780_*/

# 3. Edit JSON files offline with any text editor
# (files: items_transformed.json, media_transformed.json, item_set_transformed.json)

# 4. Validate before upload
uv run python workflow.py validate data/transformed_itemset_10780_*/

# 5. Dry run (preview changes)
uv run python workflow.py upload data/transformed_itemset_10780_*/

# 6. Upload for real
uv run python workflow.py upload data/transformed_itemset_10780_*/ --no-dry-run

Warning

Uploads run as a dry run by default; always review the dry-run output before running with --no-dry-run against production!
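
Step 3 can be scripted as well as done by hand; the transformed files are ordinary JSON arrays, so a batch pass might look like this (the record layout follows Omeka S JSON-LD, sketched here only in outline):

import json
from pathlib import Path

# The timestamped directory name varies; glob for it.
path = next(Path("data").glob("transformed_itemset_10780_*/items_transformed.json"))
items = json.loads(path.read_text(encoding="utf-8"))

for item in items:
    # Example batch check: flag items missing a title for manual review
    if not item.get("o:title"):
        print("missing title:", item.get("o:id"))

path.write_text(json.dumps(items, ensure_ascii=False, indent=2), encoding="utf-8")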

Architecture

System Components

graph TB
    subgraph "CLI Interface"
        CLI[validate.py / workflow.py]
    end

    subgraph "Core Modules"
        MODELS[models.py<br/>Pydantic validation]
        TRANS[transformations.py<br/>Data cleaning]
        API[api.py<br/>Omeka S client]
        VOCAB[vocabularies.py<br/>Controlled terms]
    end

    subgraph "Data Layer"
        JSON[vocabularies.json<br/>Era, MIME, Licenses, ICONCLASS]
        ISO[iso639.py<br/>Language codes]
    end

    subgraph "External Services"
        OMEKA[Omeka S API]
    end

    CLI --> MODELS
    CLI --> TRANS
    CLI --> API
    MODELS --> VOCAB
    VOCAB --> JSON
    VOCAB --> ISO
    API --> OMEKA
    TRANS --> MODELS

Data Model

The validator implements comprehensive Dublin Core metadata validation:

Item fields:

  • Core: o:id, o:is_public, o:title, dcterms:identifier, dcterms:title
  • Content: dcterms:description, dcterms:subject (ICONCLASS)
  • Context: dcterms:temporal (Era vocabulary), dcterms:language (ISO 639-1)

Media fields:

  • Core: o:id, o:title, o:filename, o:original_url, dcterms:identifier
  • Technical: o:media_type (MIME), o:size, o:sha256
  • Rights: dcterms:license (License vocabulary), dcterms:rights
  • Descriptive: All Dublin Core terms with controlled vocabularies
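
As a sketch of how such fields map onto Pydantic, with aliases carrying the actual Omeka S / Dublin Core keys (the model below is a simplified stand-in for the real ones in src/models.py):

from pydantic import BaseModel, Field

class ItemSketch(BaseModel):
    # Simplified stand-in for src/models.py; aliases are the real keys.
    id: int = Field(alias="o:id")
    is_public: bool = Field(alias="o:is_public")
    title: str = Field(alias="o:title")
    identifier: str = Field(alias="dcterms:identifier")
    language: str | None = Field(default=None, alias="dcterms:language")

item = ItemSketch.model_validate({
    "o:id": 1,
    "o:is_public": True,
    "o:title": "Basel 1356",
    "dcterms:identifier": "abb10001",
})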

Note

📖 For complete data model documentation, see data/raw/vocabularies.json

Python API

Programmatic access for integration with other tools:

from src.api import OmekaAPI

# Initialize API client
with OmekaAPI(
    "https://omeka.unibe.ch",
    key_identity="YOUR_KEY",
    key_credential="YOUR_SECRET"
) as api:
    # Fetch data
    item_set = api.get_item_set(10780)
    items = api.get_items_from_set(10780)
    
    # Transform data
    result = api.transform_item_set(
        item_set_id=10780,
        output_dir="data/",
        apply_all=True
    )
    
    # Validate offline files
    validation = api.validate_offline_files("data/transformed_*/")
    
    # Upload changes (dry run)
    api.upload_transformed_data("data/transformed_*/", dry_run=True)

Note

📖 For complete API examples, see examples/api_usage.py

Testing

The project includes comprehensive test coverage with pytest:

# Run all tests (96 tests)
uv run pytest

# Run specific categories
uv run pytest -m unit          # 72 unit tests
uv run pytest -m integration   # 24 integration tests

# Verbose output
uv run pytest -v

# Run specific test file
uv run pytest test/test_issue36_private_flag.py

Test organization:

  • Unit tests: Fast, isolated tests for core functionality
  • Integration tests: Component interaction and real-world scenarios
  • Automatic categorization via conftest.py
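
The automatic categorization can be done with a pytest collection hook; a minimal sketch of such a conftest.py (the marking rule here is illustrative, not necessarily the project's exact logic):

# conftest.py (sketch)
import pytest

def pytest_collection_modifyitems(items):
    # Mark tests by looking at their node IDs; real rules may differ.
    for item in items:
        if "integration" in item.nodeid:
            item.add_marker(pytest.mark.integration)
        else:
            item.add_marker(pytest.mark.unit)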

Note

📖 For test documentation, see test/README.md and test/QUICKREF.md

Development

Code Quality

This project uses tools from Astral:

# Lint
uv run ruff check .

# Format
uv run ruff format .

# Type checking relies on the code's type annotations (e.g. with mypy)

Repository Structure

Following The Turing Way advanced structure:

sgb-data-validator/
├── src/                    # Source modules
│   ├── models.py          # Pydantic validation models
│   ├── api.py             # Omeka S API client
│   ├── transformations.py # Data transformation utilities
│   ├── vocabularies.py    # Controlled vocabulary loader
│   ├── iconclass.py       # ICONCLASS notation parser
│   └── profiling.py       # Data profiling utilities
├── test/                   # Test suite (96 tests)
├── examples/               # Usage examples and tutorials
├── data/raw/              # Controlled vocabularies
├── validate.py            # CLI validation script
├── workflow.py            # CLI offline workflow script
└── pyproject.toml         # Dependencies and configuration

Contributing

Contributions are welcome!

License

Citation

If you use this tool in your research, please cite:

@software{sgb_data_validator,
  author = {Stadt.Geschichte.Basel},
  title = {SGB Data Validator},
  year = {2024},
  url = {https://github.com/Stadt-Geschichte-Basel/sgb-data-validator}
}

See CITATION.cff for complete citation metadata.

Support


Documentation: dokumentation.stadtgeschichtebasel.ch/sgb-data-validator | Source: GitHub | Project: Stadt.Geschichte.Basel
