PDF Metadata Enhancer
A CLI tool to embed canonical metadata from DOIs into PDF files for digital preservation and accessibility. This tool fetches bibliographic metadata using DOI content negotiation and embeds it into PDF InfoDict and XMP (Dublin Core) fields.
Quick Start
# Install
uv sync
# Run
echo "pdf,doi" > mapping.csv
echo "./document.pdf,10.21255/sgb-01-406352" >> mapping.csv
uv run pdf-metadata-enhancer ingest --input mapping.csv --out-dir output/ --verbose
Features
- DOI Metadata Fetching: Automatic metadata retrieval using HTTP content negotiation (CSL-JSON)
- PDF Enhancement: Embeds metadata into PDF InfoDict and XMP (Dublin Core)
- Flexible Input: Supports CSV, TSV, and JSONL mapping files
- Provenance Tracking: Creates JSON sidecar files with SHA256 hashes and complete metadata
- Batch Processing: Process multiple PDFs in a single run
Installation
Requirements
- Python 3.11 or higher
- uv
Install from source
git clone https://github.com/Stadt-Geschichte-Basel/pdf-metadata-enhancer.git
cd pdf-metadata-enhancer
uv sync
Usage
The pdf-metadata-enhancer CLI tool allows you to embed canonical metadata from DOIs into PDF files.
Basic Usage
uv run pdf-metadata-enhancer ingest --input mapping.csv --out-dir output/
Input Format
The tool accepts CSV, TSV, or JSONL files mapping PDF files to their corresponding DOIs:
CSV format:
pdf,doi
./documents/paper1.pdf,10.21255/sgb-01-406352
./documents/paper2.pdf,10.21255/sgb-01.00-586075
JSONL format:
{"pdf": "./documents/paper1.pdf", "doi": "10.21255/sgb-01-406352"}
{"pdf": "./documents/paper2.pdf", "doi": "10.21255/sgb-01.00-586075"}Example
Example
# Create a sample mapping file
echo "pdf,doi" > mapping.csv
echo "data/raw/document.pdf,10.21255/sgb-01-406352" >> mapping.csv
# Run the enhancement
uv run pdf-metadata-enhancer ingest --input mapping.csv --out-dir data/clean/ --verbose
How It Works
Metadata Fetching: For each DOI, the tool fetches metadata using HTTP content negotiation:
curl -LH "Accept: application/vnd.citationstyles.csl+json" https://doi.org/10.21255/sgb-01-406352
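In Python, the same content negotiation can be expressed roughly as follows (a minimal sketch using requests; not necessarily how metadata_fetcher.py is implemented):

```python
import requests

def fetch_csl_json(doi: str) -> dict:
    """Resolve a DOI to CSL-JSON via content negotiation at doi.org."""
    # requests follows redirects by default, matching curl's -L flag
    response = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```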
PDF Enhancement: Opens the input PDF and embeds metadata into:
- PDF InfoDict (native PDF metadata dictionary)
- XMP packet (extensible metadata platform using Dublin Core schema)
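As an illustration of this dual embedding, here is a minimal sketch using pikepdf; the library choice and the partial field mapping are assumptions for illustration (the full field list appears under Embedded Metadata Fields below):

```python
import pikepdf

def embed_metadata(src: str, dst: str, meta: dict) -> None:
    """Embed CSL-JSON-derived fields into InfoDict and XMP (illustrative sketch)."""
    authors = [
        f"{a.get('given', '')} {a.get('family', '')}".strip()
        for a in meta.get("author", [])
    ]
    with pikepdf.open(src) as pdf:
        # Native PDF InfoDict entries
        pdf.docinfo["/Title"] = meta.get("title", "")
        pdf.docinfo["/Author"] = "; ".join(authors)
        # XMP packet with Dublin Core fields
        with pdf.open_metadata() as xmp:
            xmp["dc:title"] = meta.get("title", "")
            xmp["dc:creator"] = authors  # stored as an array in XMP
            xmp["dc:identifier"] = meta.get("DOI", "")
        pdf.save(dst)
```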
Provenance Tracking: Creates a sidecar JSON file with:
- SHA256 hashes of input and output files for integrity verification
- Complete CSL-JSON metadata for reproducibility
- ISO 8601 timestamp for processing time
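Hashing and sidecar writing can be pictured like this (a sketch with hypothetical function names; the real field layout is shown in the Sidecar Example below):

```python
import hashlib
import json
from datetime import datetime, timezone

def sha256_of(path: str) -> str:
    """Stream a file through SHA256 so large PDFs are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_sidecar(in_pdf: str, out_pdf: str, doi: str, metadata: dict, sidecar_path: str) -> None:
    """Record input/output hashes, the DOI, and the full CSL-JSON for provenance."""
    record = {
        "version": "0.1.0",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": {"path": in_pdf, "sha256": sha256_of(in_pdf)},
        "output": {"path": out_pdf, "sha256": sha256_of(out_pdf)},
        "doi": doi,
        "metadata": metadata,
    }
    with open(sidecar_path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False, indent=2)
```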
Output
The tool produces:
- Enhanced PDF files with embedded metadata in both PDF InfoDict and XMP (Dublin Core)
- Sidecar JSON files containing:
  - DOI identifier
  - Complete CSL-JSON metadata
  - SHA256 hashes of input and output files
  - Processing timestamp for provenance tracking
Embedded Metadata Fields
The tool embeds the following metadata fields from CSL-JSON into PDFs:
PDF InfoDict:
- /Title: Document title
- /Author: Authors (concatenated with semicolons)
- /Subject: Abstract/description
- /Keywords: Subject keywords
- /Producer: Tool identifier
XMP Dublin Core:
- dc:title: Document title
- dc:creator: Authors (as array)
- dc:subject: Subject keywords
- dc:publisher: Publisher name
- dc:description: Abstract/description
- dc:identifier: DOI identifier
Sidecar Example
{
  "version": "0.1.0",
  "timestamp": "2023-03-15T10:30:00.000Z",
  "input": {
    "path": "data/raw/document.pdf",
    "sha256": "ca134a6dcb00077e..."
  },
  "output": {
    "path": "data/clean/document.pdf",
    "sha256": "c396864095567edd..."
  },
  "doi": "10.21255/sgb-01-406352",
  "metadata": {
    "DOI": "10.21255/sgb-01-406352",
    "title": "Document Title",
    "author": [{ "family": "Müller", "given": "Hans" }]
  }
}
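Because the sidecar records both hashes, the enhanced file can be re-verified at any time. A minimal check, assuming the sidecar layout shown above:

```python
import hashlib
import json

def verify_sidecar(sidecar_path: str) -> bool:
    """Recompute the output file's SHA256 and compare it to the recorded hash."""
    with open(sidecar_path, encoding="utf-8") as f:
        sidecar = json.load(f)
    with open(sidecar["output"]["path"], "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return digest == sidecar["output"]["sha256"]
```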
Command-Line Options
pdf-metadata-enhancer ingest --help
Options:
- -i, --input PATH: Input CSV/TSV/JSONL file mapping PDFs to DOIs (required)
- -o, --out-dir PATH: Output directory for enhanced PDFs and sidecar files (required)
- -v, --verbose: Enable verbose output
Development
Code Quality
This project uses Ruff for linting and formatting Python code.
# Install development dependencies (including ruff)
uv sync
# Check code
ruff check src/ test/ run_tests.py
# Format code
ruff format src/ test/ run_tests.py
# Fix auto-fixable issues
ruff check --fix src/ test/ run_tests.py
Running Tests
The project includes a comprehensive test suite:
# Run all tests
uv run python3 run_tests.py
# Run individual test modules
uv run python3 test/test_input_parser.py
uv run python3 test/test_pdf_enhancer.py
uv run python3 test/test_sidecar.py
Project Structure
src/
├── pdf_metadata_enhancer/
│ ├── cli.py # Command-line interface
│ ├── metadata_fetcher.py # DOI metadata fetching
│ ├── pdf_enhancer.py # PDF metadata embedding
│ ├── input_parser.py # Input file parsing
│ └── sidecar.py # Provenance sidecar generation
└── scripts/
└── get_metadata.py # DOI extraction and metadata harvesting
test/
├── test_input_parser.py
├── test_pdf_enhancer.py
└── test_sidecar.py
sgb/
├── dois.txt # SGB DOI list (88 entries)
└── metadata.json # Pre-fetched SGB metadata
Helper Scripts
DOI Metadata Harvester (src/scripts/get_metadata.py)
This script automates the process of extracting DOIs from web pages and fetching their metadata:
Features:
- Extracts DOIs from HTML pages using pattern matching
- Fetches CSL-JSON metadata via DOI content negotiation
- Concurrent fetching with configurable concurrency
- Provenance tracking (records origin page and line numbers)
- Retry logic with exponential backoff (sketched after this list)
- Comprehensive error reporting
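The concurrent fetching and retry behavior can be pictured roughly as follows (a sketch using requests and a thread pool; the script's actual implementation may differ):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def fetch_with_retry(doi: str, retries: int = 3, backoff: float = 1.0) -> dict:
    """Fetch CSL-JSON for one DOI, retrying with exponential backoff."""
    headers = {"Accept": "application/vnd.citationstyles.csl+json"}
    for attempt in range(retries):
        try:
            resp = requests.get(f"https://doi.org/{doi}", headers=headers, timeout=30)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * 2**attempt)  # 1s, 2s, 4s, ...

def fetch_all(dois: list[str], concurrency: int = 10):
    """Fetch metadata for many DOIs concurrently; collect failures separately."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(fetch_with_retry, doi): doi for doi in dois}
        for future in as_completed(futures):
            doi = futures[future]
            try:
                results[doi] = future.result()
            except Exception as exc:
                failures[doi] = str(exc)
    return results, failures
```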
Usage:
# Fetch from default SGB catalog URLs
uv run python3 src/scripts/get_metadata.py
# Custom URLs
uv run python3 src/scripts/get_metadata.py --url https://example.com/catalog
# Multiple URLs
uv run python3 src/scripts/get_metadata.py \
--url https://example.com/page1 \
--url https://example.com/page2
# URLs from file
echo "https://example.com/page1" > urls.txt
echo "https://example.com/page2" >> urls.txt
uv run python3 src/scripts/get_metadata.py --urls-file urls.txt
# Custom output paths and concurrency
uv run python3 src/scripts/get_metadata.py \
--out-dois my_dois.txt \
--out-json my_metadata.json \
--out-fail failures.txt \
--concurrency 20
Options:
- --url: Add a URL to scan (can be used multiple times)
- --urls-file: Path to a text file with one URL per line
- --concurrency: Number of concurrent DOI fetches (default: 10)
- --out-dois: Output file for DOI list (default: dois.txt)
- --out-json: Output file for metadata (default: metadata.json)
- --out-fail: Output file for failure report (default: failed_dois_report.txt)
Output Files:
- DOI list (dois.txt): One DOI per line, lowercase, sorted
- Metadata (metadata.json): Array of CSL-JSON objects
- Failure report (failed_dois_report.txt): Errors with provenance information
Additional Documentation
- Contributing: Guidelines for contributing to this project
- Changelog: Version history and changes
- Security: Security policy and vulnerability reporting
- Code of Conduct: Community guidelines
Support
This project is maintained by @Stadt-Geschichte-Basel. Please understand that we can’t provide individual support via email. We also believe that help is much more valuable when it’s shared publicly, so more people can benefit from it.
| Type | Platforms |
|---|---|
| 🚨 Bug Reports | GitHub Issue Tracker |
| 📊 Report bad data | GitHub Issue Tracker |
| 📚 Docs Issue | GitHub Issue Tracker |
| 🎁 Feature Requests | GitHub Issue Tracker |
| 🛡 Report a security vulnerability | See SECURITY.md |
| 💬 General Questions | GitHub Discussions |
Stadt-Geschichte-Basel (SGB) Dataset
This repository includes a curated dataset of DOIs from the Stadt-Geschichte-Basel project:
- sgb/dois.txt: A list of 88 DOIs from the SGB catalog
- sgb/metadata.json: Pre-fetched CSL-JSON metadata for all SGB DOIs
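For example, the pre-fetched metadata can be loaded directly (assuming each record carries a "DOI" key, as in the sidecar example above):

```python
import json

with open("sgb/metadata.json", encoding="utf-8") as f:
    records = json.load(f)  # an array of CSL-JSON objects, one per DOI

titles = {record["DOI"]: record.get("title") for record in records}
print(f"{len(titles)} records loaded")
```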
Using the SGB Dataset
The SGB DOIs can be used as a reference for testing and validation:
# Create a mapping file using SGB DOIs
echo "pdf,doi" > mapping.csv
echo "document.pdf,10.21255/sgb-01-406352" >> mapping.csv
# Process PDFs with SGB metadata
uv run pdf-metadata-enhancer ingest --input mapping.csv --out-dir output/
Fetching Metadata for Custom DOI Lists
The repository includes a helper script to fetch metadata for custom DOI lists:
# Fetch metadata from default SGB catalog URLs
uv run python3 src/scripts/get_metadata.py
# Fetch metadata from custom URLs
uv run python3 src/scripts/get_metadata.py --url https://example.com/page1 --url https://example.com/page2
# Customize output files
uv run python3 src/scripts/get_metadata.py \
--out-dois custom_dois.txt \
--out-json custom_metadata.json \
--out-fail failed_report.txt \
--concurrency 20
The script:
- Extracts DOIs from HTML pages
- Fetches CSL-JSON metadata using content negotiation
- Creates a DOI list, metadata file, and failure report
- Supports concurrent fetching for better performance
Roadmap
No changes are currently planned.
Contributing
All contributions to this repository are welcome! If you find errors or problems with the data, or if you want to add new data or features, please open an issue or pull request. Please read CONTRIBUTING.md for details on our code of conduct and the process for submitting pull requests.
Versioning
We use SemVer for versioning. The available versions are listed in the tags on this repository.
License
The data in this repository is released under the Creative Commons Attribution 4.0 International (CC BY 4.0) License - see the LICENSE-CCBY file for details. By using this data, you agree to give appropriate credit to the original author(s) and to indicate if any modifications have been made.
The code in this repository is released under the GNU Affero General Public License v3.0 - see the LICENSE-AGPL file for details. By using this code, you agree to make any modifications available under the same license.