Releases: ahrefs/ocannl
Convolutional Neural Networks
This release brings many fixes to affine shape inference and affine projection inference, and, for the first time, proper padding propagation and initialization of padding margins. The `use_padding` setup (e.g. valid vs. padded convolution) is now per affine operation, configurable from the einsum notation.
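For intuition, here is a minimal sketch of the output-extent arithmetic that the per-operation `use_padding` choice controls; the function name and formulas are illustrative assumptions, not OCANNL's API:

```ocaml
(* Illustrative only: a "valid" convolution requires the kernel to fit
   entirely inside the input, while a padded ("same"-style) convolution
   adds margins so the output only shrinks by the stride. *)
let conv_output_extent ~use_padding ~input ~kernel ~stride =
  if use_padding then (input + stride - 1) / stride (* ceil (input / stride) *)
  else (input - kernel) / stride + 1 (* kernel must fit inside the input *)

let () =
  (* A 28-wide axis with a 3-wide kernel at stride 1: 26 valid vs. 28 padded. *)
  Printf.printf "valid: %d, padded: %d\n"
    (conv_output_extent ~use_padding:false ~input:28 ~kernel:3 ~stride:1)
    (conv_output_extent ~use_padding:true ~input:28 ~kernel:3 ~stride:1)
```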
The toy end-to-end test for CNNs, "counting circles", shows an interesting failure on what might be just Intel processors (I can reproduce it on one of my computers). It might be a quirk, but debugging it was left out of this release.
From the changelog, collected by Claude:
Added
- Neutral element tracking during shape inference for proper padding reset
- `use_padding` syntax in einsum notation (replacing the global flag)
- Circle counting dataset and MLP training test
- Cross-entropy loss and `one_hot_of_int_list` helper for classification tasks
- `out_channels` parameter to `conv2d` for explicit channel specification
- Projection slot detection by naming convention in the `%cd` syntax extension
- Configurable scaling for the `kaiming` and `xavier` initialization functions (see the sketch after this list)
- New documentation: `tensors_and_contexts.md`, affine indexing for convolutions
- Documentation for the `op_fun` and `param_op_fun` types, roots, embedded nodes, and params concepts
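The configurable scaling mentioned above corresponds to the standard Kaiming and Xavier bounds; a minimal sketch with hypothetical helper names (the formulas are textbook, not copied from OCANNL):

```ocaml
(* Hypothetical helpers: uniform-init bounds with a configurable [scale].
   Kaiming uniform (ReLU gain sqrt 2): bound = scale * sqrt (6 / fan_in).
   Xavier uniform: bound = scale * sqrt (6 / (fan_in + fan_out)). *)
let kaiming_bound ?(scale = 1.0) ~fan_in () =
  scale *. sqrt (6.0 /. float_of_int fan_in)

let xavier_bound ?(scale = 1.0) ~fan_in ~fan_out () =
  scale *. sqrt (6.0 /. float_of_int (fan_in + fan_out))
```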
Changed
- Padding is now reset by tracking neutral elements through shape inference (see the sketch after this list)
- Changed default random initialization to `uniform1`, which doesn't impose shape constraints
- Refactored `vbs` from a Map to a list for order-preserving let bindings in syntax extensions
- Infer the shape of inline definitions assigned a slot, for `%cd` expressions with `projections` in scope
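The neutral-element tracking above has a simple rationale: padding margins must be filled with whatever value leaves the consuming reduction unchanged. A conceptual sketch (a toy model, not OCANNL's internal representation):

```ocaml
(* Padding must be initialized to the neutral element of the reduction
   that reads it, otherwise the margins corrupt the result. *)
type reduction = Sum | Max | Min

let neutral_element = function
  | Sum -> 0.0          (* convolutions accumulate with (+): pad with 0 *)
  | Max -> neg_infinity (* max-pool / tropical max-reduce: pad with -inf *)
  | Min -> infinity
```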
Fixed
- Gracefully disable inlining for convolution patterns
- Don't propagate padding across operations, even if the same tensor participates in them
- Padding margin initialization for tensors with multiple operations
- Padding initialization bug for max-pool operations
- `uniform1` periodicity by spreading bits in `*_to_uint4x32` conversions
- `tropical` (max-reduce) backprop to use input-shaped condition tensors
- `tropicalg2` gradient by using the correct projection for kernel gradients
- Kernel extent calculation to depend on kernel size parity
- Shape inference for `Total_elems` constraints with `Strided_var` numerators
- `compute_row_product` to return `None` for unresolved variables
- Deferred dim variable guessing to Stage 5 for `Total_elems` propagation
- Padding offset application during lowering for correct buffer indexing
- Intermediate grads from `kaiming`, `xavier`
- Random seed initialization missing in transformer test
Shape inference errors "you forgot to specify hidden dimensions"; new notation `%%extend_dsls`
The `%%extend_dsls` notation generates the boilerplate needed to add new operations to the DSLs, making them readily available to the `%op` and `%cd` notations. It is used, for example, to concisely add the normal distribution to the DSLs.
The "you forgot to specify hidden dimensions" shape inference errors are raised when shape inference would otherwise have to guess the smallest fitting shape for a parameter.
From the changelog:
Added
- Normal distribution random number generation
- `%%extend_dsls` syntax extension for extending DSL modules
- `interleave` operation in DSL modules
- `Defined_by_cd_logic` shape inference specification for explicit shape logic in forward code
- Menhir-based einsum parser replacing Angstrom for better maintainability
- Name clash detection for inline definitions and variable captures in syntax extensions
- `is_param` flag in shape inference for improved parameter-related error messages
- Teacher forcing support in transformer implementation
- Heuristics for "missing hidden dimensions" error messages with row variables
- `Tree_map` persistent map utility with exposed tree structure in sexp serialization
Changed
- Migrated shape environment to use `Utils.Tree_map` for ppx_minidebug v3 full-scale debugging
- Replaced explicit non-iteration tracking with improved projection constraints derivation
- Support for offset-only affine expressions in shape inference
- Renamed optional dimension variable parameter from `label` to `name`
- Row IDs replaced with provenance tracking (`Row.id` → `Row.prov`), supporting deduplication
- Tensor labels interface improved: per-operation `op_label` string with `label` list as a trailing parameter
- Adapted to ppx_minidebug renaming (`entry_id` → `scope_id`)
- Prefixed block names in `lib/nn_blocks.ml` for better namespace management
- Tests reorganized: more einsum-related tests moved to `test/einsum/`
Fixed
- Normal distribution test determinism across different machines
- Convolution/affine indexing shape inference offset adjustment by strides
- Parameter gradients not embedded after params moved earlier in processing
- Einsum parser handling of missing convolution and single-character cases
- Shape inference for `Conv_input` additional cases
- Incremental construction of tensors in `Tensor.op`
- Attention masks now have empty output dimensions for proper broadcasting to multihead attention
- LUB (Least Upper Bound) computation in `dim_ineq`
- Axis labels distinguished from dimension units (labels) in `shape_spec_to_dims_bio`
- Shape inference for dim-1 with labels treated the same as dim>1 (only dim-1 without a label is different)
- Shape specification requiring LUB incorporation for non-terminal shapes
- Missing CUDA backend cases and NVRTC compatibility
- Premature guessing of dim variables as dim-1 when participating in `Total_elems` constraints
- Generic constraints ignored for unused tensors
- Missing propagation when `set_dim` happened before parsing the spec
- Guard `axis_keys_to_idcs` from un-inferred shapes
- More informative error messages for parameter shape errors
- Crash on repeated variable capture in syntax extensions
- Additional syntax support for binary einsum operators
Record-syntax-based inline definitions; NN building blocks
This is a rushed release; it has CUDA backend regressions. The next release, 0.6.2, should come at the 0.6 -> 0.6.1 cadence (that is, relatively soon).
Highlights:
- Record-based syntax for inline definitions: the first field is the defined identifier, with its value being the parameter initialization (prevented for the `%cd` syntax); punning means using a default. The following fields are the labeled arguments passed to the initialization expression.
- `Shape` allows injecting shape equality constraints via `set_dim` and `set_equal`, and capturing shape (dim, row) variables via the einsum variable capture syntax. `set_dim` takes a variable reference and a number. Equating a dimension and a row creates a `Total_elems` constraint (dimension vs. product of dimensions); otherwise it is strict equality, e.g. between rows. (See the sketch after this list.)
- Split the sources of `neural_nets_lib` into two directories: `tensor/` with the core implementation and `lib/` with user-land code.
- Fleshed out `Nn_blocks` with components for tensors and convolutional nets. The tensor components are already validated but not yet sufficient for a full-blown GPT-style network (e.g. missing fixed positional encodings). The convolution components are not validated and do not work yet.
- Introduced truly heterogeneous precision primitive ops; operations like `where` and `Uint4x32_to_prec_uniform` were previously buggy.
- Migrated documentation to the `docs/` directory, with pandoc rendering support.
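As a conceptual illustration of the `Total_elems` constraint described above, here is a toy model (the types and names are illustrative, not OCANNL's `Shape` module):

```ocaml
(* Equating a dimension with a row constrains the dimension to equal the
   product of the row's dimensions, rather than requiring strict equality. *)
type dim = int
type row = dim list

let total_elems (r : row) : dim = List.fold_left ( * ) 1 r
let satisfies_total_elems (d : dim) (r : row) = d = total_elems r

let () =
  assert (satisfies_total_elems 12 [ 3; 4 ]); (* 12 = 3 * 4 *)
  assert (not (satisfies_total_elems 12 [ 5; 2 ]))
```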
Many more changes; see the detailed CHANGES.md compiled by Claude Code.
Initialization on devices; shape inference and projections with strides
Report by Claude, focusing mainly on changes since the start of July:
- Major Release 0.6.0 with comprehensive new features for deep learning
  - Added support for Bfloat16 and FP8 precisions, critical for modern ML training efficiency
  - Implemented convolution support with affine indexing expressions in projections, einsum notation, and shape inference
  - Added counter-based randomness via the Threefry4x32 operation for reproducible random number generation (see the sketch after this list)
  - Introduced bidirectional precision inference (both top-down and bottom-up) for automatic type optimization
  - Enhanced `%cd` syntax with `.forward`, `.backprop`, `.zero_grads` support and automatic comment generation
- New Datasets and Examples
  - Added MNIST and CIFAR10 datasets (borrowed from Raven)
  - Created Names dataset with a bigram use-case helper for language modeling
  - Implemented Half-moons synthetic dataset for classification tasks
  - Developed comprehensive test examples including bigram language models
- Performance and Memory Improvements
  - Fixed critical memory leak in builtins.c
  - Resolved bus error on large datasets
  - Migrated from heap-local to on-stack allocation by default
  - Improved virtual nodes and inlining to work across routines
  - Enhanced shape inference with better `Total_elems` constraint handling and LUB support
- Backend Stabilization
  - Fixed numerous CUDA backend regressions and missing constructs
  - Resolved Metal backend issues with session-level bugs
  - Added Float16 emulation for systems without native `_Float16` support
  - Fixed host-device synchronization issues with proper `devices_not_lagging_host` semantics
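The counter-based randomness item deserves a note: with a counter-based generator such as Threefry4x32, every random number is a pure function of a key and a counter, so runs are reproducible and any element of the stream can be recomputed independently. The sketch below illustrates the principle with a simple 64-bit mixer; it is emphatically not the Threefry algorithm:

```ocaml
(* NOT Threefry: a splitmix-style mixer standing in for a counter-based RNG.
   The point is the interface: value = f (key, counter), with no hidden state. *)
let mix (key : int64) (counter : int64) : int64 =
  let open Int64 in
  let z = add key (mul counter 0x9E3779B97F4A7C15L) in
  let z = mul (logxor z (shift_right_logical z 30)) 0xBF58476D1CE4E5B9L in
  let z = mul (logxor z (shift_right_logical z 27)) 0x94D049BB133111EBL in
  logxor z (shift_right_logical z 31)

let () =
  (* The same (key, counter) always yields the same value. *)
  assert (mix 42L 7L = mix 42L 7L)
```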
Metal backend (macOS with GPUs including Apple Silicon)
WARNING: this release's test suite depends on a fix to PrintBox that was unreleased at the time of the release. I'm not planning on submitting this release to the opam repository.
Highlights:
- What it says on the tin: GPUs like the M1, ..., M4.
- Got rid of `Stdlib.Format`, as I found it error-prone and hard to debug.
- A simple test for logging from within kernels in the test suite.
From the changelog:
Added
- The Metal framework backend (Apple Silicon).
- Setting `debug_log_to_stream_files` to neatly keep logs from routine execution in their separate files.
- Settings `clean_up_artifacts_on_startup`, `prefer_backend_uniformity`.
- Tools directory and the `minised` tool: regexp-replacement file rewriting.
- Directory `arrayjit/bin` and executable `read_config` for extracting OCANNL configuration into txt files.
Changed
- Removed `initialize` and `is_initialized` from the backend API; instead, backends should be initialized on functor application. The functors now take `config` as an argument.
- More descriptive identifier names in C-syntax code in case of name conflicts.
- Changed the backend config name `cc` to `multicore_cc` for consistency.
- Migrated out of `Stdlib.Format` to `PPrint` for all structured formatting.
- Migrated stdout capture to thread-based (domain-based actually); done for Windows compatibility, but also much more robust for large logs.
Fixed
- Avoid conflicts with C math function names like `fma`.
- `Satur01_gate` had wrong semantics.
More primitive operations
Highlights from README:
- Supports a lot of primitive operations (including ternary ops), and ternary tensor operations.
- `%cd` and `%op` support both curried and uncurried operator application syntax.
- More flexible gradient construction via the `%cd` syntax (better projections inference).
- Works on native Windows with the C compiler backend (but the CUDA backend is still blocked by cudajit).
Details from the changelog:
Added
- Lots of new primitive ops:
  - Unary: `Satur01 | Exp | Log | Exp2 | Log2 | Sin | Cos | Sqrt | Recip | Recip_sqrt | Neg | Tanh_approx | Not`
  - Binary: `Satur01_gate | Max | Min | Mod | Cmplt | Cmpeq | Cmpne`
  - Ternary: `Where | FMA` (non-accumulating)
- Ternary tensor operations.
- A differentiable `where` operation (see the gradient sketch after this list).
- More flexible gradient construction via the `%cd` syntax (better projections inference).
- CC backend piggy-backing on OCaml's C compiler (consistent across OSes).
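The gradient rule that makes `where` differentiable is the standard one (shown here on scalars as a sketch, not OCANNL's implementation): the incoming gradient is routed to whichever branch was selected, and the condition receives no gradient.

```ocaml
(* where c a b picks a when c holds, b otherwise; its backprop routes
   the incoming gradient g to the selected branch only. *)
let where c a b = if c then a else b
let where_grad c g = (where c g 0.0 (* grad wrt a *), where c 0.0 g (* grad wrt b *))
```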
Changed
- Updated to printbox 0.12, with upstreamed graphing.
- `-pthread` -> `-lpthread` in `c_library_flags` in `dune` files.
- Removed Numpy support for easier compatibility on native Windows.
- Unary (primitive) ops and `relu` are now named functions, not operator syntax.
- Refactored `%cd` parsing of primitive ops.
- `%cd` and `%op` support both curried and uncurried operator application syntax.
- Updated to ppx_minidebug 2.2.0 with support for cross-run diffing.
Fixed
- Numbers text rendering (consistent across OSes).
- Moved closing row variables to stage 3, because stage 2 may need to process inequalities generating more LUBs.
- Don't unnecessarily prevent bytecode-only build targets.
Automatic synchronization and transfers between host and devices
From the changelog:
Added
- Automatic transfers to host from the context that most recently updated a node.
- Automatic transfers of a routine's inputs from the host to the routine's context if the host array modification was not yet transferred. (See the sketch after this list.)
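A toy model of the bookkeeping this implies (the types and functions are illustrative, not OCANNL's API): each tensor node tracks which copy is freshest, and transfers happen only when the other side is stale.

```ocaml
(* Per tensor node: version counters for the host copy and the context copy. *)
type copy_state = { mutable host_version : int; mutable device_version : int }

(* Before running a routine: copy host -> context only if the host array
   was modified after the last transfer. *)
let ensure_on_device ~copy_to_device st =
  if st.host_version > st.device_version then begin
    copy_to_device ();
    st.device_version <- st.host_version
  end

(* Before reading on the host: pull from the context that most recently
   updated the node, if it is ahead of the host copy. *)
let ensure_on_host ~copy_from_device st =
  if st.device_version > st.host_version then begin
    copy_from_device ();
    st.host_version <- st.device_version
  end
```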
Fixed
- Added `#` as an alternative to `~~` for comment lines in `ocannl_config` files, and fixed a bug in their parsing.
Stream-to-stream synchronization at the buffer level
Highlights from README:
- Support for CUDA events, and `Condition`-based events for CPU backends.
- Overhaul of the backend interfaces, both user-facing but especially internal: full code sharing.
- Automatic stream-to-stream synchronization on a per-tensor-node basis (see the sketch after this list).
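The per-tensor-node scheme can be pictured as follows (a toy model; the types stand in for real backend streams and events): the producer stream records an event after writing a node, and a consumer stream waits on just that event instead of synchronizing the whole device.

```ocaml
(* Toy events: a real backend would use CUDA events or Condition variables. *)
type event = { mutable completed : bool }
type node = { mutable last_write : event option }

(* Producer stream: record an event that completes when the write finishes. *)
let record_write node =
  let e = { completed = false } in
  node.last_write <- Some e;
  e

(* Consumer stream: wait only for the node's last writer, not the whole device. *)
let wait_for_write node =
  match node.last_write with
  | None -> () (* never written: nothing to wait for *)
  | Some e -> while not e.completed do () done (* busy-wait stands in for an event wait *)
```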
Details from the changelog:
Added
- Interface files for `Backends` and `Low_level`.
- Fixed #245: tracking of used memory. But there's room for improvement.
- Stream-to-stream synchronization functionality, with lazy per-tensor-node synchronization.
Changed
- Migrated to cudajit 0.6.1.
- Verifying that code is linked with the right contexts, by tracking `embedded_nodes` with assignments.
- Renaming: (virtual) `device` -> `stream`, `physical_device` -> `device`.
- New files: split out `backend_intf.ml`, `backend_impl.ml`, `schedulers.ml` from `backends.ml`; moved `Tnode.task` to `task.ml`; renamed `backend_utils.ml` to `c_syntax.ml`.
- Removed half-static verification of merge buffer nodes inside `device_to_device`.
- Fixed #286: cross-stream sharing incorporated into `Tnode.memory_mode`.
- Moved the multicore backend from a `device = stream` model to a single-device model.
- Got rid of `unsafe_cleanup`.
- Renamed `subordinal` to `stream_id`.
- Removed dependency on `core`, broke up dependency on `ppx_jane`.
- Huge refactoring of backend internal interfaces and API (no more repeating the same code).
- Built per-tensor-node stream-to-stream synchronization into copying functions.
- Re-introduced whole-device blocking synchronization, which now is just a slight optimization as it also cleans up event book-keeping.
- Simplifications: no more explicit compilation postponing; no more hard-coded pointers (all non-local arrays are passed by parameter).
- Fresh backends are now fresh modules to structurally prevent any potential cache leaking.
Fixed
- Validating merge nodes for the CUDA backend.
- Checking `is_released` on weak array retrieval.
Half precision, mixed precision, CUDA virtual devices
The release 0.4.1 offers: half precision, mixed precision, proper support for CUDA virtual devices, and many bug fixes.
From the CHANGELOG:
Added
- Implemented the previously-mocked support for half precision (FP16).
  - We work around the missing Ctypes coverage by not using `Ctypes.bigarray_start`.
  - We check FP16 constants for overflow (see the sketch after this list).
  - We output half-precision-specific code from the CUDA backend.
- Finally, proper support for mixed precision! Lazy precision defaults and delayed precision setting via `Tnode.update_prec`.
- A placeholder `nn_blocks.ml` hinting at an intended design pattern for model components.
- A memory model for the multiple-virtual-devices-per-physical-device setup, implemented in the CUDA backend. It fixes the CUDA backend behavior in the data parallelism benchmark.
- Slides for the Fun OCaml meetup: docs/Fun OCaml.
- New syntax: inline tensor declarations with a literal float as initial value.
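The FP16 overflow check mentioned in the list above boils down to comparing against the largest finite half-precision value, 65504 (a fact about the format; the function name below is hypothetical, not OCANNL's code):

```ocaml
(* The largest finite FP16 value; constants beyond it would become infinity. *)
let fp16_max = 65504.0

let check_fp16_constant c =
  if Float.abs c > fp16_max then
    invalid_arg (Printf.sprintf "constant %g overflows half precision" c)
```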
Changed
- Removed the `pipes_cc`, `pipes_gccjit` backends (`Pipes_multicore_backend`) -- I had fixed `Pipes_multicore_backend` by using the `poll` library instead of `Unix.select`, but it turned out to be very, very slow.
- Changed the `%cd` block comment syntax `~~` to allow detailed structuring. Rewrote `Train.grad_update` to use the `%cd` syntax.
- Made `Train.sgd_one` slightly more thrifty: `p =- learning_rate *. sgd_delta` --> `p =- learning_rate * sgd_delta ~logic:"."` without the inline tensor expression.
Fixed
- Log-levels-related de-confusion:
  - Critical bug: logging of computation traces was not properly converted to ppx_minidebug 2.0.
  - Properly restore `log_level` and inform about its setting.
  - By default, do not log from tests.
  - `debug_log_from_routines` should only happen when `log_level > 1`.
- Bugs in `Multicore_backend`: `await` was not checking queue emptiness, `worker`'s `Condition.broadcast` was non-atomically guarded (it doesn't need to be), and a possible deadloop due to the lockfree queue -- now replaced with `saturn_lockfree`.
- Reduced busy-waiting inside `c_compile_and_load`, now propagating compilation errors instead of looping infinitely on error.
- Fixed loss of significant digits for small numbers when outputting files.
- Added missing mixed-precision conversions in the `C_syntax` backend builder.
- Restored the functionality of debug logging from the CUDA backend.
- Always reinitialize global state at the beginning of `let%expect_test`, to make the tests more deterministic.
Half precision, mixed precision, CUDA virtual devices
The release 0.4.1 offers: half precision, mixed precision, proper support for CUDA virtual devices, and many bug fixes.
The non-beta release is blocked on getting cudajit 0.4.1 into the opam repository.
From the CHANGELOG:
Added
- Implemented the previously-mocked support for half precision (FP16).
  - We work around the missing Ctypes coverage by not using `Ctypes.bigarray_start`.
  - We check FP16 constants for overflow.
  - We output half-precision-specific code from the CUDA backend.
- Finally, proper support for mixed precision! Lazy precision defaults and delayed precision setting via `Tnode.update_prec`.
- A placeholder `nn_blocks.ml` hinting at an intended design pattern for model components.
- A memory model for the multiple-virtual-devices-per-physical-device setup, implemented in the CUDA backend.
  - It fixes the CUDA backend behavior in the data parallelism benchmark.
Changed
- Removed the `pipes_cc`, `pipes_gccjit` backends (`Pipes_multicore_backend`) -- I had fixed `Pipes_multicore_backend` by using the `poll` library instead of `Unix.select`, but it turned out to be very, very slow.
Fixed
- Log-levels-related de-confusion:
  - Critical bug: logging of computation traces was not properly converted to ppx_minidebug 2.0.
  - Properly restore `log_level` and inform about its setting.
  - By default, do not log from tests.
  - `debug_log_from_routines` should only happen when `log_level > 1`.
- Bugs in `Multicore_backend`: `await` was not checking queue emptiness, `worker`'s `Condition.broadcast` was non-atomically guarded (it doesn't need to be), and a possible deadloop due to the lockfree queue -- now replaced with `saturn_lockfree`.
- Reduced busy-waiting inside `c_compile_and_load`, now propagating compilation errors instead of looping infinitely on error.
- Fixed loss of significant digits for small numbers when outputting files.
- Added missing mixed-precision conversions in the `C_syntax` backend builder.
- Restored the functionality of debug logging from the `cuda` backend.