119 changes: 119 additions & 0 deletions claudedocs/majority_handling_analysis.md
@@ -0,0 +1,119 @@
# CDF Majority Handling Analysis

## Summary

This document explains how CDF files encode row-major vs column-major data layout and how different implementations (cdflib, CDFpp) handle the difference.

## Background: Row-Major vs Column-Major

### Data Layout
- **Row-Major** (C-style): Elements of each row are stored contiguously in memory
  - Used by: C, C++, Python (NumPy default)
  - For array `A[i,j,k]`, index `k` varies fastest

- **Column-Major** (Fortran-style): Elements of each column are stored contiguously
  - Used by: Fortran, MATLAB, Julia, R
  - For array `A[i,j,k]`, index `i` varies fastest (see the NumPy sketch after this list)
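
A quick NumPy illustration of the two layouts (independent of any CDF library; the array contents are made up for the example):

```python
import numpy as np

a = np.arange(24).reshape(2, 3, 4)   # row-major (C order), NumPy's default
f = np.asfortranarray(a)             # same logical values, column-major (Fortran order) storage

# Logically identical...
assert (a == f).all()
# ...but the flat memory order differs: the last index varies fastest in C order,
# the first index varies fastest in Fortran order.
print(a.ravel(order="K")[:6])   # [0 1 2 3 4 5]    -> A[0,0,0], A[0,0,1], A[0,0,2], ...
print(f.ravel(order="K")[:6])   # [0 12 4 16 8 20] -> A[0,0,0], A[1,0,0], A[0,1,0], ...
```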

### CDF Specification
CDF files can be created with either:
- **Row-Major** (flag bit 0 = 1): Row-majority storage
- **Column-Major** (flag bit 0 = 0): Column-majority storage

The majority is stored in the CDR (CDF Descriptor Record) flags field:
- Bit 0: Majority (1 = row-major, 0 = column-major)
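
Decoding that bit is a one-liner. A minimal sketch, assuming a plain integer `flags` word; the helper name is hypothetical and not part of cdflib or CDFpp:

```python
# Hypothetical helper: decode the majority from a CDR flags word.
ROW_MAJOR_BIT = 0x01

def cdr_majority(flags: int) -> str:
    """Return 'row' if bit 0 is set, otherwise 'column'."""
    return "row" if flags & ROW_MAJOR_BIT else "column"

assert cdr_majority(0b0011) == "row"
assert cdr_majority(0b0010) == "column"
```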

## Python cdflib Implementation

### Dimension Reversal for Column-Major
File: `ref/cdflib/cdflib/cdfread.py:1655-1656`
```python
if self._majority == "Column_major":
    dimensions = list(reversed(dimensions))
```

### Array Transposition After Loading
File: `ref/cdflib/cdflib/cdfread.py:1749-1754`
```python
if self._majority == "Column_major":
    if dimensions is not None:
        axes = [0] + list(range(len(dimensions), 0, -1))
    else:
        axes = None
    ret = np.transpose(ret, axes=axes)
```

### Logic
1. **Before loading**: Reverse dimension sizes if column-major
2. **After loading**: Transpose the array to match Python's row-major layout
3. **Record dimension**: First dimension (axis 0) is always the record count and is NOT reversed

Example for column-major CDF with dims `[3, 4, 5]` and 10 records (verified by the sketch after this list):
1. Read dims as `[3, 4, 5]`
2. Reverse to `[5, 4, 3]` for loading
3. Shape becomes `(10, 5, 4, 3)` after adding records
4. Transpose with `axes=[0, 3, 2, 1]` → final shape `(10, 3, 4, 5)`
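
The same recipe can be checked with a small NumPy sketch; the raw buffer below is fabricated to mimic a column-major file and is not produced by cdflib itself:

```python
import numpy as np

records, dims = 10, [3, 4, 5]

# Fabricated file content: each record is stored column-major (Fortran order) on disk.
logical = np.arange(records * np.prod(dims)).reshape([records] + dims)
raw = np.stack([rec.ravel(order="F") for rec in logical])   # shape (10, 60)

# cdflib-style reconstruction:
rev = list(reversed(dims))                   # step 2: reversed dims -> [5, 4, 3]
loaded = raw.reshape([records] + rev)        # step 3: shape (10, 5, 4, 3)
axes = [0] + list(range(len(dims), 0, -1))   # step 4: axes = [0, 3, 2, 1]
restored = np.transpose(loaded, axes=axes)   # final shape (10, 3, 4, 5)

assert restored.shape == (10, 3, 4, 5)
assert (restored == logical).all()           # logical indexing is recovered
```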

## C++ CDFpp Implementation

### Majority Swap Function
File: [ref/CDFpp/include/cdfpp/cdf-io/majority-swap.hpp](../ref/CDFpp/include/cdfpp/cdf-io/majority-swap.hpp)

The implementation uses an in-place element reordering strategy:

```cpp
// Excerpt, lightly simplified: in the full implementation, locals such as
// `dimensions`, `record_shape`, `records_count`, `temporary_record`, `offset`,
// `elements_per_record` and `bytes_per_record` are set up before this point.
template <bool is_string, typename shape_t, typename data_t>
void swap(data_t& data, const shape_t& shape)
{
// Only swap if dimensions > 2 (excluding record dimension)
if ((dimensions > 2 && !is_string) or (is_string and dimensions > 3))
{
// Generate access pattern for index mapping
const auto access_patern = _private::generate_access_pattern(record_shape);

// For each record, reorder elements according to access pattern
for (auto record = 0UL; record < records_count; record++)
{
for (const auto& swap_pair : access_patern)
{
temporary_record[swap_pair.src] = data[offset + swap_pair.dest];
}
std::memcpy(data.data() + offset, temporary_record.data(), bytes_per_record);
offset += elements_per_record;
}
}
}
```

### When Swap is Applied
File: [ref/CDFpp/include/cdfpp/variable.hpp:93-96](../ref/CDFpp/include/cdfpp/variable.hpp#L93-L96)

```cpp
Variable(const std::string& name, std::size_t number, data_t&& data, shape_t&& shape,
cdf_majority majority = cdf_majority::row, ...)
{
if (this->majority() == cdf_majority::column)
{
majority::swap(_data(), p_shape); // Swap for column-major files
}
}
```

### Logic
1. **Load data as-is** from the file
2. **If column-major**: Apply in-place element reordering
3. **Result**: Data layout matches the C++ row-major convention

### Index Calculation
The key is the `flat_index` calculation:

- **Row-major index**: `i*shape[1]*shape[2] + j*shape[2] + k`
- **Column-major index**: `k*shape[1]*shape[0] + j*shape[0] + i`

The swap function creates a mapping between these two indexing schemes.
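
A Python sketch of that mapping (illustrative only; `access_pattern` below is a hypothetical helper, not CDFpp's `generate_access_pattern`):

```python
from itertools import product

def access_pattern(shape):
    """Pair each multi-index's row-major flat index with its column-major flat index."""
    pairs = []
    for idx in product(*(range(d) for d in shape)):
        row = col = 0
        for i, d in zip(idx, shape):                      # row-major: leftmost index most significant
            row = row * d + i
        for i, d in zip(reversed(idx), reversed(shape)):  # column-major: rightmost index most significant
            col = col * d + i
        pairs.append((row, col))
    return pairs

# Reorder one column-major record into row-major order using the mapping.
shape = [2, 3]
col_major_record = [0, 3, 1, 4, 2, 5]     # values 0..5 of a 2x3 array, laid out column-major
row_major_record = [None] * 6
for row, col in access_pattern(shape):
    row_major_record[row] = col_major_record[col]
assert row_major_record == [0, 1, 2, 3, 4, 5]
```

Applying this per record, using a temporary record buffer as the excerpt above does, converts the whole variable in place.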

## References

- CDF Internal Format Specification: https://cdf.gsfc.nasa.gov/
- Python cdflib: [ref/cdflib/cdflib/cdfread.py](../ref/cdflib/cdflib/cdfread.py)
- CDFpp: [ref/CDFpp/include/cdfpp/cdf-io/majority-swap.hpp](../ref/CDFpp/include/cdfpp/cdf-io/majority-swap.hpp)
5 changes: 3 additions & 2 deletions src/dataset.jl
@@ -39,14 +39,15 @@ function CDFDataset(filename)
end

is_compressed(magic_numbers::UInt32) = magic_numbers != 0x0000FFFF
majority(cdf::CDFDataset) = majority(cdf.cdr)

# Convenience accessors for the dataset with lazy loading
function Base.getproperty(cdf::CDFDataset{CT}, name::Symbol) where {CT}
@inline function Base.getproperty(cdf::CDFDataset{CT, FST}, name::Symbol) where {CT, FST}
name in fieldnames(CDFDataset) && return getfield(cdf, name)
if name === :version
return version(cdf.cdr)
elseif name === :majority
return Majority(cdf.cdr)
return majority(cdf)
elseif name === :compression
return CT
elseif name === :adr
6 changes: 3 additions & 3 deletions src/enums.jl
@@ -1,6 +1,6 @@
@enum Majority begin
Row = 0
Column = 1
@enum Majority::Bool begin
Row = false
Column = true
end

@enum RecordType::Int8 begin
24 changes: 23 additions & 1 deletion src/loading/variable.jl
@@ -18,6 +18,24 @@ else
end
end

"""
majority_swap!(data, dims_without_record)

Convert row-major data layout to column-major (Julia's native layout) in-place.
For row-major CDF files, this reverses the dimension order (except record dimension).
"""
function majority_swap!(data::AbstractArray{T, N}, dims_without_record) where {T, N}
N <= 2 && return data
perm = ntuple(i -> N - i, N - 1)
reversed_dims = reverse(dims_without_record)
temp = similar(data, dims_without_record)
for slc in eachslice(data; dims = N)
permutedims!(temp, reshape(slc, reversed_dims), perm)
copyto!(slc, temp)
end
return data
end


function variable(cdf::CDFDataset, name)
vdr = find_vdr(cdf, name)
@@ -59,6 +77,7 @@ function DiskArrays.readblock!(var::CDFVariable{T, N}, dest::AbstractArray{T}, r
end_idx = findfirst(entry -> entry.first <= last_rec <= entry.last, entries)
record_size = prod(dims_without_record)

is_row_major = majority(var) == Row
# If the variable is not compressed and the other dimension ranges are the same as the variable range
# we can directly read the data into dest
if is_no_compression && is_full_record
@@ -76,8 +95,9 @@ function DiskArrays.readblock!(var::CDFVariable{T, N}, dest::AbstractArray{T}, r
doffs += N_elems
end
@assert doffs == length(dest) + 1
is_row_major && majority_swap!(dest, dims_without_record)
else
Base.@inbounds Threads.@threads for i in eachindex(start_idx:end_idx)
Base.@inbounds Threads.@threads for i in start_idx:end_idx
entry = entries[i]
if is_full_record && entry.first >= first_rec && entry.last <= last_rec
# full entry
@@ -86,6 +106,7 @@ function DiskArrays.readblock!(var::CDFVariable{T, N}, dest::AbstractArray{T}, r
total_elems = record_size * length(entry)
decompressor = take!(decompressors())
load_cvvr_data!(dest_view, 1, buffer, entry.offset, total_elems, RecordSizeType, compression; decompressor)
is_row_major && majority_swap!(dest_view, dims_without_record)
put!(decompressors(), decompressor)
else
# partial entry
@@ -105,6 +126,7 @@ function DiskArrays.readblock!(var::CDFVariable{T, N}, dest::AbstractArray{T}, r

# chunk_data = _load_entry_chunk(var, entry, RecordSizeType, buffer, var.compression; decompressor)
chunk_array = reshape(chunk, dims_without_record..., :)
is_row_major && majority_swap!(chunk_array, dims_without_record)
src_view = view(chunk_array, other_ranges..., local_range)
dest_view .= src_view
end
2 changes: 1 addition & 1 deletion src/records/cdr.jl
@@ -18,7 +18,7 @@ struct CDR{FST} <: Record
end

version(cdr::CDR; verbose = true) = verbose ? (cdr.version, cdr.release, cdr.increment) : cdr.version
Majority(cdr::CDR) = (cdr.flags & 0x01) != 0 ? Majority(0) : Majority(1) # Row=0, Column=1
majority(cdr::CDR) = (cdr.flags & 0x01) != 0 ? Row : Column # Row=0, Column=1
is_cdf_v3(cdr::CDR) = cdr.version == 3

"""
3 changes: 1 addition & 2 deletions src/records/records.jl
@@ -74,7 +74,6 @@ Decode the CDR flags field into individual boolean flags.
function decode_cdr_flags(flags)
flags = UInt32(flags)
return (
row_majority = (flags & 0x01) != 0,
single_file_format = (flags & 0x02) != 0,
checksum_used = (flags & 0x04) != 0,
md5_checksum = (flags & 0x08) != 0,
@@ -92,7 +91,7 @@ function Base.show(io::IO, cdr::CDR)
println(io, " Version: $(cdr.version).$(cdr.release).$(cdr.increment)")
println(io, " Encoding: $(cdr.encoding)")
println(io, " Flags: 0x$(string(cdr.flags, base = 16, pad = 8))")
println(io, " - Row Majority: $(flag_info.row_majority)")
println(io, " - Majority: $(majority(cdr))")
println(io, " - Single File Format: $(flag_info.single_file_format)")
println(io, " - Checksum Used: $(flag_info.checksum_used)")
println(io, " - MD5 Checksum: $(flag_info.md5_checksum)")
2 changes: 2 additions & 0 deletions src/variable.jl
@@ -18,6 +18,8 @@ end

Base.size(var::CDFVariable) = var.dims

@inline majority(var::CDFVariable) = majority(var.parentdataset)

function dst_src_ranges(first, last, entry)
overlap_first = max(first, entry.first)
overlap_last = min(last, entry.last)
16 changes: 12 additions & 4 deletions test/comprehensive_test.jl
@@ -64,13 +64,13 @@ const EXPECTED_VARIABLES = Dict(
"var3d_counter" => (
shape = (3, 5, 10), # Shape adjusted for Julia column-major
data_type = "CDF_DOUBLE",
values = reshape(0.0:(3 * 5 * 10 - 1), 3, 5, 10),
values = permutedims(reshape(0.0:(3 * 5 * 10 - 1), 5, 3, 10), (2, 1, 3)),
attributes = Dict("attr1" => "attr1_value", "attr2" => "attr2_value"),
),
"var5d_counter" => (
shape = (5, 4, 3, 2, 6), # Shape adjusted for Julia column-major
data_type = "CDF_DOUBLE",
values = reshape(0.0:(5 * 4 * 3 * 2 * 6 - 1), 5, 4, 3, 2, 6),
values = permutedims(reshape(0.0:(5 * 4 * 3 * 2 * 6 - 1), 2, 3, 4, 5, 6), (4, 3, 2, 1, 5)),
attributes = Dict{String, Any}(),
),
"var3d" => (
@@ -113,12 +113,12 @@ const EXPECTED_VARIABLES = Dict(
shape = (2, 2, 1),
data_type = "CDF_CHAR",
attributes = Dict{String, Any}(),
values = ["value[00]"; "value[01]";; "value[10]"; "value[11]";;;],
values = ["value[00]"; "value[10]";; "value[01]"; "value[11]";;;],
),
"var4d_string" => (
shape = (3, 2, 2, 1),
data_type = "CDF_CHAR",
values = ["value[000]" "value[011]"; "value[001]" "value[100]"; "value[010]" "value[101]";;; "value[110]" "value[201]"; "value[111]" "value[210]"; "value[200]" "value[211]";;;;],
values = ["value[000]"; "value[100]"; "value[200]" ;; "value[010]"; "value[110]"; "value[210]";;; "value[001]"; "value[101]"; "value[201]";; "value[011]"; "value[111]"; "value[211]";;;;],
attributes = Dict{String, Any}(),
)
)
@@ -162,6 +162,14 @@ end
end
end

@testset "Variable Slicing" begin
@test ds["var3d_counter"][1, 1, :] == 0:15:135
@test ds["var3d_counter"][:, :, 2]' == reshape(15:29, 5, 3)

@test ds["var3d_counter"][2, :, 2] == 20:24
@test ds["var3d_counter"][:, 1, 2] == 15:5:25
end

@testset "Variable Varying" begin
@test is_record_varying(ds["var"])
@test is_record_varying(ds["var_recvary_string"])
16 changes: 8 additions & 8 deletions test/perf_test.jl
@@ -33,11 +33,11 @@ b = [b0, b1, b12, b2, b30, b3, b4, b5]
# ┌ Info: Benchmarks
# │ b =
# │ 8-element Vector{Chairmarks.Sample}:
# │ 539.550 ns (6 allocs: 528 bytes)
# │ 2.083 μs (20 allocs: 29.172 KiB)
# │ 2.000 μs (24 allocs: 29.328 KiB)
# │ 83.896 μs (3777 allocs: 162.469 KiB)
# │ 385.400 ns (7 allocs: 528 bytes)
# │ 9.586 ms (574 allocs: 31.655 MiB)
# │ 467.275 μs (100 allocs: 1.344 MiB)
# └ 20.855 μs (250 allocs: 12.484 KiB)
# │ 537.500 ns (6 allocs: 544 bytes)
# │ 1.425 μs (6 allocs: 28.219 KiB)
# │ 1.317 μs (6 allocs: 28.219 KiB)
# │ 95.062 μs (3777 allocs: 163.078 KiB)
# │ 387.500 ns (7 allocs: 560 bytes)
# │ 10.175 ms (567 allocs: 31.647 MiB, 5.34% gc time)
# │ 462.467 μs (93 allocs: 1.337 MiB)
# └ 23.209 μs (250 allocs: 12.656 KiB)