Releases: ahrefs/ocannl
Convolutional Neural Networks
This release brings many fixes to affine shape inference and affine projection inference, and, for the first time, proper padding propagation and initialization of padding margins. The `use_padding` setup (e.g. valid vs. padded convolution) is now per affine operation, configurable from the einsum notation.
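For intuition, here is a minimal sketch of the output-extent arithmetic that the per-operation `use_padding` choice controls; the function name and formulas are illustrative assumptions, not OCANNL's API:

```ocaml
(* Illustrative only: a "valid" convolution requires the kernel to fit
   entirely inside the input, while a padded ("same"-style) convolution
   adds margins so the output only shrinks by the stride. *)
let conv_output_extent ~use_padding ~input ~kernel ~stride =
  if use_padding then (input + stride - 1) / stride (* ceil (input / stride) *)
  else (input - kernel) / stride + 1 (* kernel must fit inside the input *)

let () =
  (* A 28-wide axis with a 3-wide kernel at stride 1: 26 valid vs. 28 padded. *)
  Printf.printf "valid: %d, padded: %d\n"
    (conv_output_extent ~use_padding:false ~input:28 ~kernel:3 ~stride:1)
    (conv_output_extent ~use_padding:true ~input:28 ~kernel:3 ~stride:1)
```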
The toy end-to-end test for CNNs, "counting circles", shows an interesting failure on what might be just Intel processors (I can reproduce it on one of my computers). It might be a quirk, but debugging it was left out of this release.
From the changelog, collected by Claude:
Added
- Neutral element tracking during shape inference for proper padding reset
- `use_padding` syntax in einsum notation (replacing the global flag)
- Circle counting dataset and MLP training test
- Cross-entropy loss and `one_hot_of_int_list` helper for classification tasks
- `out_channels` parameter to `conv2d` for explicit channel specification
- Projection slot detection by naming convention in the `%cd` syntax extension
- Configurable scaling for the `kaiming` and `xavier` initialization functions (see the sketch after this list)
- New documentation: `tensors_and_contexts.md`, affine indexing for convolutions
- Documentation for the `op_fun` and `param_op_fun` types, roots, embedded nodes, and params concepts
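The configurable scaling mentioned above corresponds to the standard Kaiming and Xavier bounds; a minimal sketch with hypothetical helper names (the formulas are textbook, not copied from OCANNL):

```ocaml
(* Hypothetical helpers: uniform-init bounds with a configurable [scale].
   Kaiming uniform (ReLU gain sqrt 2): bound = scale * sqrt (6 / fan_in).
   Xavier uniform: bound = scale * sqrt (6 / (fan_in + fan_out)). *)
let kaiming_bound ?(scale = 1.0) ~fan_in () =
  scale *. sqrt (6.0 /. float_of_int fan_in)

let xavier_bound ?(scale = 1.0) ~fan_in ~fan_out () =
  scale *. sqrt (6.0 /. float_of_int (fan_in + fan_out))
```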
Changed
- Padding is now reset by tracking neutral elements through shape inference (see the sketch after this list)
- Changed default random initialization to `uniform1`, which doesn't impose shape constraints
- Refactored `vbs` from a Map to a list for order-preserving let bindings in syntax extensions
- Infer the shape of inline definitions assigned a slot, for `%cd` expressions with `projections` in scope
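The neutral-element tracking above has a simple rationale: padding margins must be filled with whatever value leaves the consuming reduction unchanged. A conceptual sketch (a toy model, not OCANNL's internal representation):

```ocaml
(* Padding must be initialized to the neutral element of the reduction
   that reads it, otherwise the margins corrupt the result. *)
type reduction = Sum | Max | Min

let neutral_element = function
  | Sum -> 0.0          (* convolutions accumulate with (+): pad with 0 *)
  | Max -> neg_infinity (* max-pool / tropical max-reduce: pad with -inf *)
  | Min -> infinity
```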
Fixed
- Gracefully disable inlining for convolution patterns
- Don't propagate padding across operations, even if the same tensor participates in them
- Padding margin initialization for tensors with multiple operations
- Padding initialization bug for max-pool operations
- `uniform1` periodicity by spreading bits in `*_to_uint4x32` conversions
- `tropical` (max-reduce) backprop to use input-shaped condition tensors
- `tropicalg2` gradient by using the correct projection for kernel gradients
- Kernel extent calculation to depend on kernel size parity
- Shape inference for `Total_elems` constraints with `Strided_var` numerators
- `compute_row_product` to return `None` for unresolved variables
- Deferred dim variable guessing to Stage 5 for `Total_elems` propagation
- Padding offset application during lowering for correct buffer indexing
- Intermediate grads from `kaiming`, `xavier`
- Random seed initialization missing in transformer test
Shape inference errors "you forgot to specify hidden dimensions"; new notation `%%extend_dsls`
The `%%extend_dsls` notation generates the boilerplate needed to add new operations to the DSLs, making them readily available to the `%op` and `%cd` notations. It is used, for example, to concisely add the normal distribution to the DSLs.
The "you forgot to specify hidden dimensions" shape inference errors are raised when shape inference would otherwise have to guess the smallest fitting shape for a parameter.
From the changelog:
Added
- Normal distribution random number generation
- `%%extend_dsls` syntax extension for extending DSL modules
- `interleave` operation in DSL modules
- `Defined_by_cd_logic` shape inference specification for explicit shape logic in forward code
- Menhir-based einsum parser replacing Angstrom for better maintainability
- Name clash detection for inline definitions and variable captures in syntax extensions
- `is_param` flag in shape inference for improved parameter-related error messages
- Teacher forcing support in transformer implementation
- Heuristics for "missing hidden dimensions" error messages with row variables
- `Tree_map` persistent map utility with exposed tree structure in sexp serialization
Changed
- Migrated shape environment to use `Utils.Tree_map` for ppx_minidebug v3 full-scale debugging
- Replaced explicit non-iteration tracking with improved projection constraints derivation
- Support for offset-only affine expressions in shape inference
- Renamed optional dimension variable parameter from `label` to `name`
- Row IDs replaced with provenance tracking (`Row.id` → `Row.prov`), supporting deduplication
- Tensor labels interface improved: per-operation `op_label` string with `label` list as a trailing parameter
- Adapted to ppx_minidebug renaming (`entry_id` → `scope_id`)
- Prefixed block names in `lib/nn_blocks.ml` for better namespace management
- Tests reorganized: more einsum-related tests moved to `test/einsum/`
Fixed
- Normal distribution test determinism across different machines
- Convolution/affine indexing shape inference offset adjustment by strides
- Parameter gradients not embedded after params moved earlier in processing
- Einsum parser handling of missing convolution and single-character cases
- Shape inference for `Conv_input` additional cases
- Incremental construction of tensors in `Tensor.op`
- Attention masks now have empty output dimensions for proper broadcasting to multihead attention
- LUB (Least Upper Bound) computation in `dim_ineq`
- Axis labels distinguished from dimension units (labels) in `shape_spec_to_dims_bio`
- Shape inference for dim-1 with labels treated the same as dim>1 (only dim-1 without a label is different)
- Shape specification requiring LUB incorporation for non-terminal shapes
- Missing CUDA backend cases and NVRTC compatibility
- Premature guessing of dim variables as dim-1 when participating in `Total_elems` constraints
- Generic constraints ignored for unused tensors
- Missing propagation when `set_dim` happened before parsing the spec
- Guard `axis_keys_to_idcs` from un-inferred shapes
- More informative error messages for parameter shape errors
- Crash on repeated variable capture in syntax extensions
- Additional syntax support for binary einsum operators
Record-syntax-based inline definitions; NN building blocks
This is a rushed release; it has CUDA backend regressions. The next release, 0.6.2, should come at the 0.6 -> 0.6.1 cadence (that is, relatively soon).
Highlights:
- Record-based syntax for inline definitions: the first field is the defined identifier, with its value being the parameter initialization (prevented for the `%cd` syntax); punning means using a default. The following fields are the labeled arguments passed to the initialization expression.
- `Shape` allows injecting shape equality constraints via `set_dim` and `set_equal`, and capturing shape (dim, row) variables via the einsum variable capture syntax. `set_dim` takes a variable reference and a number. Equating a dimension and a row creates a `Total_elems` constraint (dimension vs. product of dimensions); otherwise it is strict equality, e.g. between rows. (See the sketch after this list.)
- Split the sources of `neural_nets_lib` into two directories: `tensor/` with the core implementation and `lib/` with user-land code.
- Fleshed out `Nn_blocks` with components for tensors and convolutional nets. The tensor components are already validated but not yet sufficient for a full-blown GPT-style network (e.g. missing fixed positional encodings). The convolution components are not validated and do not work yet.
- Introduced truly heterogeneous precision primitive ops; operations like `where` and `Uint4x32_to_prec_uniform` were previously buggy.
- Migrated documentation to the `docs/` directory, with pandoc rendering support.
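As a conceptual illustration of the `Total_elems` constraint described above, here is a toy model (the types and names are illustrative, not OCANNL's `Shape` module):

```ocaml
(* Equating a dimension with a row constrains the dimension to equal the
   product of the row's dimensions, rather than requiring strict equality. *)
type dim = int
type row = dim list

let total_elems (r : row) : dim = List.fold_left ( * ) 1 r
let satisfies_total_elems (d : dim) (r : row) = d = total_elems r

let () =
  assert (satisfies_total_elems 12 [ 3; 4 ]); (* 12 = 3 * 4 *)
  assert (not (satisfies_total_elems 12 [ 5; 2 ]))
```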
Many more changes; see the detailed CHANGES.md compiled by Claude Code.
Initialization on devices; shape inference and projections with strides
Report by Claude, focusing mainly on changes since the start of July:
- Major Release 0.6.0 with comprehensive new features for deep learning
  - Added support for Bfloat16 and FP8 precisions, critical for modern ML training efficiency
  - Implemented convolution support with affine indexing expressions in projections, einsum notation, and shape inference
  - Added counter-based randomness via the Threefry4x32 operation for reproducible random number generation (see the sketch after this list)
  - Introduced bidirectional precision inference (both top-down and bottom-up) for automatic type optimization
  - Enhanced `%cd` syntax with `.forward`, `.backprop`, `.zero_grads` support and automatic comment generation
- New Datasets and Examples
  - Added MNIST and CIFAR10 datasets (borrowed from Raven)
  - Created Names dataset with a bigram use-case helper for language modeling
  - Implemented Half-moons synthetic dataset for classification tasks
  - Developed comprehensive test examples including bigram language models
- Performance and Memory Improvements
  - Fixed critical memory leak in builtins.c
  - Resolved bus error on large datasets
  - Migrated from heap-local to on-stack allocation by default
  - Improved virtual nodes and inlining to work across routines
  - Enhanced shape inference with better `Total_elems` constraint handling and LUB support
- Backend Stabilization
  - Fixed numerous CUDA backend regressions and missing constructs
  - Resolved Metal backend issues with session-level bugs
  - Added Float16 emulation for systems without native `_Float16` support
  - Fixed host-device synchronization issues with proper `devices_not_lagging_host` semantics
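The counter-based randomness item deserves a note: with a counter-based generator such as Threefry4x32, every random number is a pure function of a key and a counter, so runs are reproducible and any element of the stream can be recomputed independently. The sketch below illustrates the principle with a simple 64-bit mixer; it is emphatically not the Threefry algorithm:

```ocaml
(* NOT Threefry: a splitmix-style mixer standing in for a counter-based RNG.
   The point is the interface: value = f (key, counter), with no hidden state. *)
let mix (key : int64) (counter : int64) : int64 =
  let open Int64 in
  let z = add key (mul counter 0x9E3779B97F4A7C15L) in
  let z = mul (logxor z (shift_right_logical z 30)) 0xBF58476D1CE4E5B9L in
  let z = mul (logxor z (shift_right_logical z 27)) 0x94D049BB133111EBL in
  logxor z (shift_right_logical z 31)

let () =
  (* The same (key, counter) always yields the same value. *)
  assert (mix 42L 7L = mix 42L 7L)
```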
Metal backend (macOS with GPUs including Apple Silicon)
WARNING: this release's test suite depends on a fix to PrintBox that was unreleased at the time of the release. I'm not planning on submitting this release to the opam repository.
Highlights:
- What it says on the tin: GPUs like the M1, ..., M4.
- Got rid of `Stdlib.Format`, as I found it error-prone and hard to debug.
- A simple test for logging from within kernels in the test suite.
From the changelog:
Added
- The Metal framework backend (Apple Silicon).
- Setting `debug_log_to_stream_files` to neatly keep logs from routine execution in their separate files.
- Settings `clean_up_artifacts_on_startup`, `prefer_backend_uniformity`.
- Tools directory and the `minised` tool: regexp-replacement file rewriting.
- Directory `arrayjit/bin` and executable `read_config` for extracting OCANNL configuration into txt files.
Changed
- Removed `initialize` and `is_initialized` from the backend API; instead, backends should be initialized on functor application. The functors now take `config` as an argument.
- More descriptive identifier names in C-syntax code in case of name conflicts.
- Changed the backend config name `cc` to `multicore_cc` for consistency.
- Migrated out of `Stdlib.Format` to `PPrint` for all structured formatting.
- Migrated stdout capture to thread-based (domain-based actually); done for Windows compatibility, but also much more robust for large logs.
Fixed
- Avoid conflicts with C math function names like `fma`.
- `Satur01_gate` had wrong semantics.
More primitive operations
Highlights from README:
- Supports a lot of primitive operations (including ternary ops), and ternary tensor operations.
- `%cd` and `%op` support both curried and uncurried operator application syntax.
- More flexible gradient construction via the `%cd` syntax (better projections inference).
- Works on native Windows with the C compiler backend (but the CUDA backend is still blocked by cudajit).
Details from the changelog:
Added
- Lots of new primitive ops:
  - Unary: `Satur01 | Exp | Log | Exp2 | Log2 | Sin | Cos | Sqrt | Recip | Recip_sqrt | Neg | Tanh_approx | Not`
  - Binary: `Satur01_gate | Max | Min | Mod | Cmplt | Cmpeq | Cmpne`
  - Ternary: `Where | FMA` (non-accumulating)
- Ternary tensor operations.
- A differentiable `where` operation (see the gradient sketch after this list).
- More flexible gradient construction via the `%cd` syntax (better projections inference).
- CC backend piggy-backing on OCaml's C compiler (consistent across OSes).
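The gradient rule that makes `where` differentiable is the standard one (shown here on scalars as a sketch, not OCANNL's implementation): the incoming gradient is routed to whichever branch was selected, and the condition receives no gradient.

```ocaml
(* where c a b picks a when c holds, b otherwise; its backprop routes
   the incoming gradient g to the selected branch only. *)
let where c a b = if c then a else b
let where_grad c g = (where c g 0.0 (* grad wrt a *), where c 0.0 g (* grad wrt b *))
```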
Changed
- Updated to printbox 0.12, with upstreamed graphing.
- `-pthread` -> `-lpthread` in `c_library_flags` in `dune` files.
- Removed Numpy support for easier compatibility on native Windows.
- Unary (primitive) ops and `relu` are now named functions, not operator syntax.
- Refactored `%cd` parsing of primitive ops.
- `%cd` and `%op` support both curried and uncurried operator application syntax.
- Updated to ppx_minidebug 2.2.0 with support for cross-run diffing.
Fixed
- Numbers text rendering (consistent across OSes).
- Moved closing row variables to stage 3, because stage 2 may need to process inequalities generating more LUBs.
- Don't unnecessarily prevent bytecode-only build targets.
Automatic synchronization and transfers between host and devices
From the changelog:
Added
- Automatic transfers to host from the context that most recently updated a node.
- Automatic transfers of a routine's inputs from the host to the routine's context if the host array modification was not yet transferred. (See the sketch after this list.)
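A toy model of the bookkeeping this implies (the types and functions are illustrative, not OCANNL's API): each tensor node tracks which copy is freshest, and transfers happen only when the other side is stale.

```ocaml
(* Per tensor node: version counters for the host copy and the context copy. *)
type copy_state = { mutable host_version : int; mutable device_version : int }

(* Before running a routine: copy host -> context only if the host array
   was modified after the last transfer. *)
let ensure_on_device ~copy_to_device st =
  if st.host_version > st.device_version then begin
    copy_to_device ();
    st.device_version <- st.host_version
  end

(* Before reading on the host: pull from the context that most recently
   updated the node, if it is ahead of the host copy. *)
let ensure_on_host ~copy_from_device st =
  if st.device_version > st.host_version then begin
    copy_from_device ();
    st.host_version <- st.device_version
  end
```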
Fixed
- Added `#` as an alternative to `~~` for comment lines in `ocannl_config` files, and fixed a bug in their parsing.
Stream-to-stream synchronization at the buffer level
Highlights from README:
- Support for CUDA events, and `Condition`-based events for CPU backends.
- Overhaul of the backend interfaces, both user-facing but especially internal: full code sharing.
- Automatic stream-to-stream synchronization on a per-tensor-node basis (see the sketch after this list).
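The per-tensor-node scheme can be pictured as follows (a toy model; the types stand in for real backend streams and events): the producer stream records an event after writing a node, and a consumer stream waits on just that event instead of synchronizing the whole device.

```ocaml
(* Toy events: a real backend would use CUDA events or Condition variables. *)
type event = { mutable completed : bool }
type node = { mutable last_write : event option }

(* Producer stream: record an event that completes when the write finishes. *)
let record_write node =
  let e = { completed = false } in
  node.last_write <- Some e;
  e

(* Consumer stream: wait only for the node's last writer, not the whole device. *)
let wait_for_write node =
  match node.last_write with
  | None -> () (* never written: nothing to wait for *)
  | Some e -> while not e.completed do () done (* busy-wait stands in for an event wait *)
```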
Details from the changelog:
Added
- Interface files for `Backends` and `Low_level`.
- Fixed #245: tracking of used memory. But there's room for improvement.
- Stream-to-stream synchronization functionality, with lazy per-tensor-node synchronization.
Changed
- Migrated to cudajit 0.6.1.
- Verifying that code is linked with the right contexts, by tracking `embedded_nodes` with assignments.
- Renaming: (virtual) `device` -> `stream`, `physical_device` -> `device`.
- New files: split out `backend_intf.ml`, `backend_impl.ml`, `schedulers.ml` from `backends.ml`; moved `Tnode.task` to `task.ml`; renamed `backend_utils.ml` to `c_syntax.ml`.
- Removed half-static verification of merge buffer nodes inside `device_to_device`.
- Fixed #286: cross-stream sharing incorporated into `Tnode.memory_mode`.
- Moved the multicore backend from a `device = stream` model to a single-device model.
- Got rid of `unsafe_cleanup`.
- Renamed `subordinal` to `stream_id`.
- Removed dependency on `core`, broke up dependency on `ppx_jane`.
- Huge refactoring of backend internal interfaces and API (no more repeating the same code).
- Built per-tensor-node stream-to-stream synchronization into copying functions.
- Re-introduced whole-device blocking synchronization, which now is just a slight optimization as it also cleans up event book-keeping.
- Simplifications: no more explicit compilation postponing; no more hard-coded pointers (all non-local arrays are passed by parameter).
- Fresh backends are now fresh modules to structurally prevent any potential cache leaking.
Fixed
- Validating merge nodes for the CUDA backend.
- Checking `is_released` on weak array retrieval.
Half precision, mixed precision, CUDA virtual devices
The release 0.4.1 offers: half precision, mixed precision, proper support for CUDA virtual devices, and many bug fixes.
From the CHANGELOG:
Added
- Implemented the previously-mocked support for half precision (FP16).
  - We work around the missing Ctypes coverage by not using `Ctypes.bigarray_start`.
  - We check FP16 constants for overflow (see the sketch after this list).
  - We output half-precision-specific code from the CUDA backend.
- Finally, proper support for mixed precision! Lazy precision defaults and delayed precision setting via `Tnode.update_prec`.
- A placeholder `nn_blocks.ml` hinting at an intended design pattern for model components.
- A memory model for the multiple-virtual-devices-per-physical-device setup, implemented in the CUDA backend. It fixes the CUDA backend behavior in the data parallelism benchmark.
- Slides for the Fun OCaml meetup: docs/Fun OCaml.
- New syntax: inline tensor declarations with a literal float as initial value.
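The FP16 overflow check mentioned in the list above boils down to comparing against the largest finite half-precision value, 65504 (a fact about the format; the function name below is hypothetical, not OCANNL's code):

```ocaml
(* The largest finite FP16 value; constants beyond it would become infinity. *)
let fp16_max = 65504.0

let check_fp16_constant c =
  if Float.abs c > fp16_max then
    invalid_arg (Printf.sprintf "constant %g overflows half precision" c)
```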
Changed
- Removed the `pipes_cc`, `pipes_gccjit` backends (`Pipes_multicore_backend`) -- I had fixed `Pipes_multicore_backend` by using the `poll` library instead of `Unix.select`, but it turned out to be very, very slow.
- Changed the `%cd` block comment syntax `~~` to allow detailed structuring. Rewrote `Train.grad_update` to use the `%cd` syntax.
- Made `Train.sgd_one` slightly more thrifty: `p =- learning_rate *. sgd_delta` --> `p =- learning_rate * sgd_delta ~logic:"."` without the inline tensor expression.
Fixed
- Log-levels-related de-confusion:
  - Critical bug: logging of computation traces was not properly converted to ppx_minidebug 2.0.
  - Properly restore `log_level` and inform about its setting.
  - By default, do not log from tests.
  - `debug_log_from_routines` should only happen when `log_level > 1`.
- Bugs in `Multicore_backend`: `await` was not checking queue emptiness, `worker`'s `Condition.broadcast` was non-atomically guarded (it doesn't need to be), and a possible deadloop due to the lockfree queue -- now replaced with `saturn_lockfree`.
- Reduced busy-waiting inside `c_compile_and_load`, now propagating compilation errors instead of looping infinitely on error.
- Fixed loss of significant digits for small numbers when outputting files.
- Added missing mixed-precision conversions in the `C_syntax` backend builder.
- Restored the functionality of debug logging from the CUDA backend.
- Always reinitialize global state at the beginning of `let%expect_test`, to make the tests more deterministic.
Half precision, mixed precision, CUDA virtual devices
The release 0.4.1 offers: half precision, mixed precision, proper support for CUDA virtual devices, and many bug fixes.
The non-beta release is blocked on getting cudajit 0.4.1 into the opam repository.
From the CHANGELOG:
Added
- Implemented the previously-mocked support for half precision (FP16).
  - We work around the missing Ctypes coverage by not using `Ctypes.bigarray_start`.
  - We check FP16 constants for overflow.
  - We output half-precision-specific code from the CUDA backend.
- Finally, proper support for mixed precision! Lazy precision defaults and delayed precision setting via `Tnode.update_prec`.
- A placeholder `nn_blocks.ml` hinting at an intended design pattern for model components.
- A memory model for the multiple-virtual-devices-per-physical-device setup, implemented in the CUDA backend.
  - It fixes the CUDA backend behavior in the data parallelism benchmark.
Changed
- Removed the `pipes_cc`, `pipes_gccjit` backends (`Pipes_multicore_backend`) -- I had fixed `Pipes_multicore_backend` by using the `poll` library instead of `Unix.select`, but it turned out to be very, very slow.
Fixed
- Log-levels-related de-confusion:
  - Critical bug: logging of computation traces was not properly converted to ppx_minidebug 2.0.
  - Properly restore `log_level` and inform about its setting.
  - By default, do not log from tests.
  - `debug_log_from_routines` should only happen when `log_level > 1`.
- Bugs in `Multicore_backend`: `await` was not checking queue emptiness, `worker`'s `Condition.broadcast` was non-atomically guarded (it doesn't need to be), and a possible deadloop due to the lockfree queue -- now replaced with `saturn_lockfree`.
- Reduced busy-waiting inside `c_compile_and_load`, now propagating compilation errors instead of looping infinitely on error.
- Fixed loss of significant digits for small numbers when outputting files.
- Added missing mixed-precision conversions in the `C_syntax` backend builder.
- Restored the functionality of debug logging from the `cuda` backend.