
Commit 49183da

Merge branch 'find-hive-files'

2 parents 50f80c0 + eb26195

File tree

4 files changed: +128 -86 lines changed

Project.toml

Lines changed: 1 addition & 1 deletion

@@ -1,7 +1,7 @@
 name = "HivePaths"
 uuid = "67cf009d-4aa4-48c9-a112-d5138c970da1"
 authors = ["okatsn <okatsn@gmail.com> and contributors"]
-version = "0.0.3"
+version = "0.0.4"
 
 [compat]
 julia = "1.10"

README.md

Lines changed: 32 additions & 60 deletions

@@ -7,76 +7,48 @@
 
 <!-- Don't have any of your custom contents above; they won't occur if there is no citation. -->
 
-## Documentation Badge is here:
-
 [![](https://img.shields.io/badge/docs-stable-blue.svg)](https://okatsn.github.io/HivePaths.jl/stable)
 [![](https://img.shields.io/badge/docs-dev-blue.svg)](https://okatsn.github.io/HivePaths.jl/dev)
 
-> See [Documenter.jl: Documentation Versions](https://documenter.juliadocs.org/dev/man/hosting/#Documentation-Versions)
-
-## Introduction
-
-This is a julia package created using `okatsn`'s preference, and this package is expected to be registered to [okatsn/OkRegistry](https://github.com/okatsn/OkRegistry) for CIs to work properly.
-
-!!! note Checklist
-
-- [ ] Create an empty repository (namely, `https://github.com/okatsn/HivePaths.jl.git`) on github, and push the local to origin. See [connecting to remote](#tips-for-connecting-to-remote).
-- [ ] Add `ACCESS_OKREGISTRY` secret in the settings of this repository on Github, or delete both `register.yml` and `TagBot.yml` in `/.github/workflows/`. See [Auto-Registration](#auto-registration).
-- [ ] To keep `Manifest.toml` being tracked, delete the lines in `.gitignore`.
-- [ ] You might like to register `v0.0.0` in order to `pkg> dev HivePaths` in your environment.
-
-
-### Go to [OkPkgTemplates](https://github.com/okatsn/OkPkgTemplates.jl) for more information
-
-- [How TagBot works and trouble shooting](https://github.com/okatsn/OkPkgTemplates.jl#tagbot)
-- [Use of Documenter](https://github.com/okatsn/OkPkgTemplates.jl#use-of-documenter)
-
-## References
-
-### For a remote of different name
-
-Example workflow
-
-- Create `YourPackage.jl` with `OkPkgTemplates`
-- Create a new Repo on GitHub, saying `Hello-World`
-- Go to local path of YourPackage.jl, `git remote set-url origin https://<git-repo>/Hello-World.git`.
-- Use find all and Replace "YourPackage.jl" with "Hello-World" **EXCEPT** those **NOT** URL such as:
-  - `@testset "YourPackage.jl"` in `/test/runtest.jl`
-  - The `sitename` field in `/docs/make.jl`
-
-### Auto-Registration
-
-- You have to add `ACCESS_OKREGISTRY` to the secret under the remote repo (e.g., https://github.com/okatsn/HivePaths.jl).
-- `ACCESS_OKREGISTRY` allows `CI.yml` to automatically register/update this package to [okatsn/OkRegistry](https://github.com/okatsn/OkRegistry).
-
-### Test
-#### How to add a new test
-
-Add `.jl` files (that has `@testset` block or `@test` inside) in `test/`; `test/runtests.jl` will automatically `include` all the `.jl` scripts there.
+HivePaths provides utilities for working with Hive-style partitioned file hierarchies, where data is organized using `key=value` directory structures.
 
-#### Test docstring
+## Purpose
 
-`doctest` is executed at the following **two** places:
+When managing datasets partitioned across multiple dimensions (e.g., `criterion=depth/partition=1/k=10/data.arrow`), HivePaths helps you:
+- **Parse** paths to extract partition metadata
+- **Build** paths with consistent hierarchical ordering
+- **Find** all files matching a specific schema
 
-1. In `CI.yml`, `jobs: test: ` that runs `test/runtests.jl`
-2. In `CI.yml`, `jobs: docs: ` that runs directly on bash.
+Each `HiveSchema` defines one target filename and the hierarchical structure of its enclosing directories.
 
-It is no harm to run both, but you can manually delete either.
-Of course, `pkg> test` will also run `doctest` since it runs also `test/runtests.jl`.
+## Example
 
-### Tips for connecting to remote
+```julia
+using HivePaths
 
-Connect to remote:
+# Define the schema
+schema = HiveSchema(
+    parsers = Dict{String, Function}(
+        "criterion" => identity,
+        "partition" => x -> parse(Int, x),
+        "k" => x -> parse(Int, x)
+    ),
+    order = ["criterion", "partition", "k"],
+    filename = "data.arrow"
+)
 
-1. Switch to the local directory of this project (HivePaths)
-2. Add an empty repo HivePaths(.jl) on github (without anything!)
-3. `git push origin main`
+# Build paths
+path = build_hive_path(schema, "results"; criterion="depth", partition=2, k=5)
+# → "results/criterion=depth/partition=2/k=5/data.arrow"
 
-- It can be quite tricky, see https://discourse.julialang.org/t/upload-new-package-to-github/56783
-More reading
-Pkg's Artifact that manage an external dataset as a package
-- https://pkgdocs.julialang.org/v1/artifacts/
-- a provider for reposit data: https://github.com/sdobber/FA_data
+# Parse paths
+parsed = parse_hive_path(schema, path; required_keys=["criterion", "partition"])
+# → (criterion="depth", partition=2, k=5)
 
+# Find all matching files
+files = find_hive_files(schema, "results"; validate_keys=["criterion"])
+# → ["results/criterion=depth/partition=1/k=3/data.arrow",
+#    "results/criterion=depth/partition=2/k=5/data.arrow", ...]
+```
 
-This package is create on 2026-01-26.
+See the docstrings for detailed API documentation.

src/HivePaths.jl

Lines changed: 80 additions & 19 deletions

@@ -1,26 +1,32 @@
 module HivePaths
 
-export HiveSchema, parse_hive_path, build_hive_path
+export HiveSchema, parse_hive_path, build_hive_path, find_hive_files
 
 """
-    HiveSchema(parsers::Dict, order::Vector)
+    HiveSchema(; parsers::Dict, order::Vector, filename::String)
 
 Defines the structure and parsing rules for a Hive file hierarchy.
+
+# Fields
+- `parsers`: Dict mapping key names to parsing functions
+- `order`: Vector defining the hierarchical order of keys in paths
+- `filename`: The target filename that appears in all Hive paths (one per schema)
 """
 struct HiveSchema
     parsers::Dict{String,Function}
    order::Vector{String}
+    filename::String
 end
 
 # Default constructor helper for cleaner syntax
-function HiveSchema(; parsers, order)
-    return HiveSchema(parsers, order)
+function HiveSchema(; parsers, order, filename)
+    return HiveSchema(parsers, order, filename)
 end
 
 """
-    parse_hive_path(schema::HiveSchema,path::AbstractString; required_keys=[]) → NamedTuple
+    parse_hive_path(schema::HiveSchema, path::AbstractString; required_keys=[]) → NamedTuple
 
-Extract criterion, partition, and k from Hive-style paths.
+Extract key-value pairs from Hive-style paths according to the schema.
 
 # Examples
 ```julia

@@ -86,11 +92,12 @@ function parse_hive_path(schema::HiveSchema, path::AbstractString; required_keys
 end
 
 """
-    build_hive_path(schema::HiveSchema,base_dir::AbstractString, file_name; kwargs...) → String
+    build_hive_path(schema::HiveSchema, base_dir::AbstractString; kwargs...) → String
 
 Construct Hive-style output path with consistent ordering.
 
-Path structure is always: `base_dir/criterion=<criterion>/partition=<partition>[/k=<k>]/file_name`
+Path structure follows schema order: `base_dir/key1=<val1>/key2=<val2>/.../filename`
+where `filename` comes from `schema.filename`.
 
 # Examples
 ```julia

@@ -100,30 +107,27 @@ const schema = HiveSchema(
         "partition" => x -> parse(Int, x),
         "k" => x -> parse(Int, x)
     ),
-    order = ["criterion", "partition", "k"]
+    order = ["criterion", "partition", "k"],
+    filename = "data.arrow"
 )
 
-build_hive_path(schema::HiveSchema,"data/binned", "data.arrow"; criterion="depth_iso", partition=1)
+build_hive_path(schema, "data/binned"; criterion="depth_iso", partition=1)
 # → "data/binned/criterion=depth_iso/partition=1/data.arrow"
 
-build_hive_path(schema::HiveSchema,"data/cluster_assignments", "data.arrow"; partition=2, criterion="depth_iso", k=10)
+build_hive_path(schema, "data/cluster_assignments"; partition=2, criterion="depth_iso", k=10)
 # → "data/cluster_assignments/criterion=depth_iso/partition=2/k=10/data.arrow"
-# Noted that the order is consistent with the previous one; the order of `kwargs` does not matter.
-
-build_hive_path(schema::HiveSchema,"plots/voronoi_maps", "criterion=depth_iso.png"; criterion="depth_iso", partition=1, k=8)
-# → "plots/voronoi_maps/criterion=depth_iso/partition=1/k=8/criterion=depth_iso.png"
+# Note that the order is consistent with the previous one; the order of `kwargs` does not matter.
 ```
 
 # Arguments
 - `base_dir`: Base directory path
-- `file_name`: File name to append at the end of the path
-- `kwargs`: labels in the path to the file as keyword arguments.
+- `kwargs`: Key-value pairs matching schema keys
 
 # Returns
 Complete path string with Hive-style structure
 """
-function build_hive_path(schema::HiveSchema, base_dir::AbstractString, file_name; kwargs...)
+function build_hive_path(schema::HiveSchema, base_dir::AbstractString; kwargs...)
     # Start with base directory
     path_parts = String[base_dir]

@@ -138,10 +142,67 @@ function build_hive_path(schema::HiveSchema, base_dir::AbstractString, file_name
         end
     end
 
-    push!(path_parts, file_name)
+    push!(path_parts, schema.filename)
 
     return joinpath(path_parts...)
 end
 
 
+# ============================================================================
+# I/O Utilities
+# ============================================================================
+
+"""
+    find_hive_files(schema::HiveSchema, root_dir::AbstractString;
+                    validate_keys=[], error_if_empty=false) -> Vector{String}
+
+Recursively find files that match the schema's filename AND structure.
+
+# Arguments
+- `validate_keys`: List of keys (e.g. `[:criterion]`) that MUST be present in the path
+  for it to be considered valid.
+- `error_if_empty`: If true, throws an error if no matching files are found.
+
+# Returns
+Sorted list of absolute paths.
+"""
+function find_hive_files(schema::HiveSchema, root_dir::AbstractString;
+                         validate_keys=Symbol[], error_if_empty=false)
+
+    # 1. Safety check: directory existence
+    if !isdir(root_dir)
+        error("Directory not found: $root_dir")
+    end
+
+    found_files = String[]
+    target = schema.filename
+
+    # 2. Walk and filter
+    for (root, dirs, files) in walkdir(root_dir)
+        if target in files
+            full_path = joinpath(root, target)
+
+            # 3. Schema-awareness: check whether this file actually fits the schema.
+            # If validate_keys is empty, this just checks that parsing succeeds,
+            # effectively acting as a loose structure check.
+            try
+                parsed = parse_hive_path(schema, full_path; required_keys=validate_keys)
+                push!(found_files, full_path)
+            catch
+                # If parsing fails (e.g. missing required keys), skip this file.
+                # It might be a backup or a loose file not part of the dataset.
+                continue
+            end
+        end
+    end
+
+    # 4. Guardrail against silent failures
+    if error_if_empty && isempty(found_files)
+        error("No valid Hive files found in $root_dir matching schema $(schema.filename)")
+    end
+
+    return sort(found_files)
+end
+
 end
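The new `find_hive_files` follows a walk-filter-validate pattern: recurse over the tree, keep only paths ending in the schema's filename, then reject any candidate whose `key=value` segments fail validation. A minimal cross-language sketch of that same pattern in Python (hypothetical names, for illustration only; the actual implementation is the Julia code above):

```python
import os

def find_hive_files(root_dir, filename, required_keys=()):
    """Sketch of the walk-filter-validate discovery pattern.

    A path qualifies when its final component equals `filename` and every
    key in `required_keys` appears as a `key=value` directory segment.
    """
    if not os.path.isdir(root_dir):
        raise FileNotFoundError(f"Directory not found: {root_dir}")

    found = []
    for root, _dirs, files in os.walk(root_dir):
        if filename not in files:
            continue
        full_path = os.path.join(root, filename)
        # Collect the keys of all key=value segments in the enclosing dirs.
        keys = {seg.split("=", 1)[0]
                for seg in root.split(os.sep) if "=" in seg}
        # Skip loose files (e.g. backups) missing a required key.
        if all(k in keys for k in required_keys):
            found.append(full_path)
    return sorted(found)
```

As in the Julia version, passing no required keys degrades into a pure filename match, while requiring keys filters out files that merely share the target name without sitting inside the expected hierarchy.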

test/hivepaths.jl

Lines changed: 15 additions & 6 deletions

@@ -7,7 +7,8 @@
         "partition" => x -> parse(Int, x), # String -> Int
         "k" => x -> parse(Int, x)          # String -> Int
     ),
-    ["criterion", "partition", "k"] # Enforced order
+    ["criterion", "partition", "k"], # Enforced order
+    "data.arrow"
 )
 
 @testset "Parsing Logic" begin

@@ -53,30 +54,38 @@
 
 @testset "Building Logic" begin
     base = "results"
-    file = "params.json"
+    TEST_SCHEMA2 = HiveSchema(
+        Dict{String,Function}(
+            "criterion" => identity,           # String -> String
+            "partition" => x -> parse(Int, x), # String -> Int
+            "k" => x -> parse(Int, x)          # String -> Int
+        ),
+        ["criterion", "partition", "k"], # Enforced order
+        "params.json"
+    )
 
     # 1. Happy Path
     # Note: input order of kwargs shouldn't matter
-    path = build_hive_path(TEST_SCHEMA, base, file; partition=1, k=5, criterion="depth")
+    path = build_hive_path(TEST_SCHEMA2, base; partition=1, k=5, criterion="depth")
 
     # Check standard path separators just in case (Windows/Unix)
     normalized = replace(path, "\\" => "/")
     @test normalized == "results/criterion=depth/partition=1/k=5/params.json"
 
     # 2. Skip Missing/Nothing Values
-    path_missing = build_hive_path(TEST_SCHEMA, base, file; criterion="depth", partition=1, k=nothing)
+    path_missing = build_hive_path(TEST_SCHEMA2, base; criterion="depth", partition=1, k=nothing)
     normalized_missing = replace(path_missing, "\\" => "/")
     @test normalized_missing == "results/criterion=depth/partition=1/params.json"
 
     # 3. Ignore Extra Kwargs (keys not in Schema)
-    path_extra = build_hive_path(TEST_SCHEMA, base, file; criterion="depth", weird_param=999)
+    path_extra = build_hive_path(TEST_SCHEMA2, base; criterion="depth", weird_param=999)
     @test !occursin("weird_param", path_extra)
     @test occursin("criterion=depth", path_extra)
 end
 
 @testset "Round Trip (Build -> Parse)" begin
     # Generate a path
-    generated_path = build_hive_path(TEST_SCHEMA, "tmp", "data.arrow";
+    generated_path = build_hive_path(TEST_SCHEMA, "tmp";
                                      criterion="manual", partition=99, k=3)
 
     # Immediately parse it back
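The behaviors these tests pin down are: keys are emitted in schema order regardless of keyword order, `nothing` values and out-of-schema keywords are skipped, and a built path parses back to the same values. A compact Python analogue of that contract (hypothetical helper names, not part of the package):

```python
import posixpath

def build_hive_path(order, filename, base_dir, **kwargs):
    """Ordering-consistent Hive path building.

    Keys are emitted in schema `order` regardless of kwargs order;
    None values and keys outside the schema are skipped.
    """
    parts = [base_dir]
    for key in order:
        value = kwargs.get(key)
        if value is not None:
            parts.append(f"{key}={value}")
    parts.append(filename)
    return posixpath.join(*parts)

def parse_hive_path(path, parsers):
    """Inverse direction: pull key=value segments out and apply parsers."""
    out = {}
    for seg in path.split("/"):
        if "=" in seg:
            key, raw = seg.split("=", 1)
            if key in parsers:
                out[key] = parsers[key](raw)
    return out
```

Iterating over the schema's `order` (rather than over the keyword arguments) is what makes the build-then-parse round trip deterministic.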

0 commit comments