This project is in its early development stages, so stability is not guaranteed, and documentation is limited. We welcome your feedback and contributions as we refine and expand this project together!
This package provides Python classes and utilities for working with metadata based on the Data Documentation Initiative (DDI), an international standard for describing the data produced by surveys and other observational methods in the social, behavio◊◊al, economic, and health sciences.
There are three major flavors of DDI. This package currently supports:
-
DDI-Codebook 2.5: The lightweight version of the standard, intended primarily to document simple survey data. This specification has been widely adopted around the globe by statistical agencies, data producers, archives, research centers, and international organizations.
-
DDI-CDI 1.0 (Experimental): The new Cross Domain Integration specification that provides a unified model for describing data across different domains and methodologies.
We do not currently support DDI-Lifecycle.
- DDI-Codebook XML Processing: Load, parse, and extract structured metadata from DDI-Codebook documents
- DDI-CDI Model Classes: Work with Pydantic-based classes representing the full DDI-CDI specification
- RDF Integration: Generate RDF representations using SemPyRO (Semantic Python RDF Objects)
- Data Dictionary Extraction: Convert DDI metadata into usable data dictionaries
- Cross-Format Conversion: Transform between DDI-Codebook and DDI-CDI formats (experimental)
Once stable, this package will be officially released and distributed through PyPI. Stay tuned for updates!
In the meantime, you can install the package locally by following these steps:
-
Clone the Repository:
git clone https://github.com/DataArtifex/ddi-toolkit.git cd ddi-toolkit -
Install the Package:
From the project's home directory, run the following command to install the package:
pip install -e .
For development work with testing and documentation tools:
pip install -e .[dev]from dartfx.ddi import ddicodebook
# Load from file
my_codebook = ddicodebook.loadxml('mycodebook.xml')
# Access study metadata
study = my_codebook.studyDscr
title = study.citation.titlStmt.titl.content if study.citation else "No title"
# Access variables from data files
if my_codebook.dataDscr:
for var in my_codebook.dataDscr.var:
print(f"Variable: {var.name}, Label: {var.labl.content if var.labl else 'No label'}")
# Access file descriptions
if my_codebook.fileDscr:
for file_desc in my_codebook.fileDscr:
print(f"File: {file_desc.fileTxt.fileName}")# Get variable information for analysis
variables = []
if my_codebook.dataDscr:
for var in my_codebook.dataDscr.var:
var_info = {
'name': var.name,
'label': var.labl.content if var.labl else None,
'type': var.varFormat.type if var.varFormat else None,
'categories': []
}
# Extract categories/codes if present
if var.catgry:
for cat in var.catgry:
var_info['categories'].append({
'value': cat.catValu.content if cat.catValu else None,
'label': cat.labl.content if cat.labl else None
})
variables.append(var_info)from dartfx.ddi.ddicdi.specification import DdiCdiSpecification
# Load DDI-CDI specification
spec = DdiCdiSpecification('specifications/ddi-cdi-1.0')
# Get model information
classes = spec.get_classes()
enumerations = spec.get_enumerations()
# Search for specific resources
variable_classes = [cls for cls in classes if 'variable' in cls.name.lower()]from dartfx.ddi.ddicdi.sempyro_model import InstanceVariable, Identifier, InternationalRegistrationDataIdentifier
# Create an identifier
irdi = InternationalRegistrationDataIdentifier(
dataIdentifier="var_001",
registrationAuthorityIdentifier="example.org"
)
identifier = Identifier(ddiIdentifier=irdi)
# Create an instance variable
var = InstanceVariable(identifier=identifier)
# Serialize to RDF (if RDF functionality is available)
# rdf_output = var.to_rdf()This core package is used by other Data Artifex components leveraging DDI.
ddi-toolkit/
├── src/dartfx/ddi/
│ ├── ddicodebook.py # DDI-Codebook XML processing
│ ├── ddicdi/ # DDI-CDI support
│ │ ├── sempyro_model.py # Generated Pydantic/SemPyRO classes
│ │ ├── specification.py # DDI-CDI specification handling
│ │ └── utils.py # Utility functions
│ └── utils.py # General utilities
├── lab/ # Experimental code and Jupyter notebooks
├── specifications/ # DDI specification files
├── tests/ # Test suite
└── docs/ # Documentation
- Migrate model from Python @dataclass to Pydantic
- Integrate SeMPyRO for RDF annotation and serialization
- High level wrappers to hide DDi-CDI complexities for non-expert users
- Comprehensive test coverage
- RDF deserializer (from graph to Python)
- Complete documentation
- Add additional helper methods for common use cases
- SQL schema generators
- DCAT generator
- Enhanced DDI-Codebook to DDI-CDI conversion
- Explore integration with LLMs for metadata generation
- Fork it!
- Create your feature branch:
git checkout -b my-new-feature - Commit your changes:
git commit -am 'Add some feature' - Push to the branch:
git push origin my-new-feature - Submit a pull request :D