Releases: ahrefs/ocannl

Convolutional Neural Networks

19 Dec 20:01

This release brings many fixes to affine shape inference and affine projections inference, and, for the first time, proper padding propagation and initialization of padding margins. The use_padding setting (e.g. valid vs. padded convolution) is now per-affine-operation, configurable from the einsum notation.
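For reference, here is the standard relationship between padding and output extent that distinguishes valid from padded convolutions. This is a generic sketch of the arithmetic, not OCANNL's shape-inference code:

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # Standard convolution extent formula; `padding` is applied per side.
    return (input_size + 2 * padding - kernel_size) // stride + 1

# "valid" convolution: no padding, the output shrinks.
assert conv_output_size(28, 3) == 26
# padded ("same"-style) convolution with an odd kernel: extent preserved.
assert conv_output_size(28, 3, padding=1) == 28
```

Making this choice per-operation (rather than via a global flag) means two convolutions over the same tensor can use different padding regimes.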

The toy end-to-end test for CNNs, "counting circles", shows an interesting failure that may be specific to Intel processors (I can reproduce it on one of my computers). It might be a quirk, but debugging it is left out of this release.

From the changelog, collected by Claude:

Added

  • Neutral element tracking during shape inference for proper padding reset
  • use_padding syntax in einsum notation (replacing global flag)
  • Circle counting dataset and MLP training test
  • Cross-entropy loss and one_hot_of_int_list helper for classification tasks
  • out_channels parameter to conv2d for explicit channel specification
  • Projection slot detection by naming convention in %cd syntax extension
  • Configurable scaling to kaiming and xavier initialization functions
  • New documentation: tensors_and_contexts.md, affine indexing for convolutions
  • Documentation for op_fun and param_op_fun types, roots, embedded nodes, and params concepts

Changed

  • Padding is now reset by tracking neutral elements through shape inference
  • Changed default random initialization to uniform1 which doesn't impose shape constraints
  • Refactored vbs from Map to list for order-preserving let bindings in syntax extensions
  • Infer the shape of inline definitions assigned a slot for %cd expressions with projections in scope

Fixed

  • Gracefully disable inlining for convolution patterns
  • Don't propagate padding across operations, even if the same tensor participates in them
  • Padding margin initialization for tensors with multiple operations
  • Padding initialization bug for max-pool operations
  • uniform1 periodicity by spreading bits in *_to_uint4x32 conversions
  • tropical (max-reduce) backprop to use input-shaped condition tensors
  • tropical g2 gradient by using correct projection for kernel gradients
  • Kernel extent calculation to depend on kernel size parity
  • Shape inference for Total_elems constraints with Strided_var numerators
  • compute_row_product to return None for unresolved variables
  • Deferred dim variable guessing to Stage 5 for Total_elems propagation
  • Padding offset application during lowering for correct buffer indexing
  • Intermediate grads from kaiming, xavier
  • Random seed initialization missing in transformer test
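Several of the fixes above concern padding extents, including the one noting that kernel extent calculation depends on kernel size parity. A minimal sketch of why parity matters when splitting "same" padding into two margins (generic arithmetic, not OCANNL's code):

```python
def same_padding_margins(kernel_size):
    # Total padding needed so a stride-1 output matches the input extent.
    total = kernel_size - 1
    # For even kernels the two margins necessarily differ by one.
    return total // 2, total - total // 2

assert same_padding_margins(3) == (1, 1)  # odd kernel: symmetric margins
assert same_padding_margins(4) == (1, 2)  # even kernel: asymmetric margins
```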

Shape inference errors "you forgot to specify hidden dimensions"; new notation `%%extend_dsls`

28 Nov 20:49

The notation %%extend_dsls generates boilerplate for adding new operations to the DSLs, making them easily available to the %op and %cd notations. For example, it is used to add the normal distribution to the DSLs in a concise way.

Shape inference errors "you forgot to specify hidden dimensions" are generated when shape inference would otherwise need to guess the smallest fitting shape for a parameter.

From the changelog:

Added

  • Normal distribution random number generation
  • %%extend_dsls syntax extension for extending DSL modules
  • interleave operation in DSL modules
  • Defined_by_cd_logic shape inference specification for explicit shape logic in forward code
  • Menhir-based einsum parser replacing Angstrom for better maintainability
  • Name clash detection for inline definitions and variable captures in syntax extensions
  • is_param flag in shape inference for improved parameter-related error messages
  • Teacher forcing support in transformer implementation
  • Heuristics for "missing hidden dimensions" error messages with row variables
  • Tree_map persistent map utility with exposed tree structure in sexp serialization

Changed

  • Migrated shape environment to use Utils.Tree_map for ppx_minidebug v3 full-scale debugging
  • Replaced explicit non-iteration tracking with improved projection constraints derivation
  • Support for offset-only affine expressions in shape inference
  • Renamed optional dimension variable parameter from label to name
  • Row IDs replaced with provenance tracking (Row.id → Row.prov), supporting deduplication
  • Tensor labels interface improved: per-operation op_label string with label list as trailing parameter
  • Adapted to ppx_minidebug renaming (entry_id → scope_id)
  • Prefixed block names in lib/nn_blocks.ml for better namespace management
  • Tests reorganized: more einsum-related tests moved to test/einsum/

Fixed

  • Normal distribution test determinism across different machines
  • Convolution/affine indexing shape inference offset adjustment by strides
  • Parameter gradients not embedded after params moved earlier in processing
  • Einsum parser handling of missing convolution and single-character cases
  • Shape inference for Conv_input additional cases
  • Incremental construction of tensors in Tensor.op
  • Attention masks now have empty output dimensions for proper broadcasting to multihead attentions
  • LUB (Least Upper Bound) computation in dim_ineq
  • Axis labels distinguished from dimension units (labels) in shape_spec_to_dims_bio
  • Shape inference for dim-1 with labels treated same as dim>1 (only dim-1 without label is different)
  • Shape specification requiring LUB incorporation for non-terminal shapes
  • Missing CUDA backend cases and NVRTC compatibility
  • Premature guessing of dim variables as dim-1 when participating in Total_elems constraints
  • Generic constraints ignored for unused tensors
  • Missing propagation when set_dim happened before parsing the spec
  • Guard axis_keys_to_idcs from un-inferred shapes
  • More informative error messages for parameter shape errors
  • Crash on repeated variable capture in syntax extensions
  • Additional syntax support for binary einsum operators

Record-syntax-based inline definitions; NN building blocks

12 Sep 10:06

This is a rushed release; it contains CUDA backend regressions. The next release 0.6.2 should come at the 0.6 -> 0.6.1 cadence (that is, relatively soon).

Highlights:

  • Record-based syntax for inline definitions: the first field is the defined identifier, and its value is the parameter initialization (not allowed in %cd syntax; punning means a default is used). The following fields are the labeled arguments passed to the initialization expression.
  • Shape allows injecting shape equality constraints via set_dim and set_equal, and capturing shape (dim, row) variables via the einsum variable capture syntax. set_dim takes a variable reference and a number. "Equaling" a dimension and a row creates a Total_elems constraint (dimension vs. product of dimensions); otherwise it is strict equality, e.g. between rows.
  • Split the sources of neural_nets_lib into two directories, tensor/ with the core implementation and lib/ with user-land code.
  • Fleshed-out Nn_blocks with components for tensors and convolutional nets. The tensor components are already validated but not yet sufficient for full-blown GPT-style network (e.g. missing fixed positional encodings). The convolution components are not validated and do not work yet.
  • Introduced truly heterogeneous-precision primitive ops; operations like where and Uint4x32_to_prec_uniform were previously buggy.
  • Migrated documentation to docs/ directory with pandoc rendering support.
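The Total_elems constraint mentioned above relates a single dimension to the product of a row's dimensions. A toy illustration of the constraint check in Python (not OCANNL's constraint solver):

```python
from math import prod

def total_elems_holds(dim, row_dims):
    # "Equaling" one dimension with a row: the dimension must equal
    # the product of the row's dimensions.
    return dim == prod(row_dims)

assert total_elems_holds(12, [3, 4])       # 3 * 4 == 12
assert not total_elems_holds(12, [3, 5])   # 3 * 5 != 12
```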

Many more changes; see the detailed CHANGES.md compiled by Claude Code.

Initialization on devices; shape inference and projections with strides

20 Aug 15:00

Report by Claude with main focus since start of July:

  • Major Release 0.6.0 with comprehensive new features for deep learning
    • Added support for Bfloat16 and FP8 precisions, critical for modern ML training
      efficiency
    • Implemented convolution support with affine indexing expressions in projections,
      einsum notation, and shape inference
    • Added counter-based randomness via Threefry4x32 operation for reproducible random
      number generation
    • Introduced bidirectional precision inference (both top-down and bottom-up) for
      automatic type optimization
    • Enhanced %cd syntax with .forward, .backprop, .zero_grads support and automatic
      comment generation
  • New Datasets and Examples
    • Added MNIST and CIFAR10 datasets (borrowed from Raven)
    • Created Names dataset with bigram use-case helper for language modeling
    • Implemented Half-moons synthetic dataset for classification tasks
    • Developed comprehensive test examples including bigram language models
  • Performance and Memory Improvements
    • Fixed critical memory leak in builtins.c
    • Resolved bus error on large datasets
    • Migrated from heap-local to on-stack allocation by default
    • Improved virtual nodes and inlining to work across routines
    • Enhanced shape inference with better Total_elems constraint handling and LUB support
  • Backend Stabilization
    • Fixed numerous CUDA backend regressions and missing constructs
    • Resolved Metal backend issues with session-level bugs
    • Added Float16 emulation for systems without native _Float16 support
    • Fixed host-device synchronization issues with proper devices_not_lagging_host
      semantics
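Counter-based randomness, as in the Threefry4x32 operation above, makes each draw a pure function of a key and a counter, so results are reproducible and trivially parallelizable. A toy illustration that substitutes a hash for the actual Threefry4x32 block cipher:

```python
import hashlib
import struct

def counter_rng(key: int, counter: int) -> float:
    # Stateless: the draw depends only on (key, counter). Real Threefry4x32
    # uses a dedicated ARX block cipher, not SHA-256; this is a stand-in.
    digest = hashlib.sha256(struct.pack("<QQ", key, counter)).digest()
    (u,) = struct.unpack("<I", digest[:4])
    return u / 2**32  # uniform in [0, 1)

# The same (key, counter) pair always reproduces the same draw,
# regardless of execution order or device.
assert counter_rng(42, 0) == counter_rng(42, 0)
assert counter_rng(42, 0) != counter_rng(42, 1)
```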

Metal backend (macOS with GPUs including Apple Silicon)

26 May 18:31

WARNING: this release's test suite depends on an unreleased fix to PrintBox at the time of the release. I'm not planning on submitting it to the opam repository.

Highlights:

  • What it says on the tin: GPUs like the M1, ..., M4.
  • Got rid of Stdlib.Format, as I found it error-prone and hard to debug.
  • A simple test for logging from within kernels in the test suite.

From the changelog:

Added

  • The Metal framework backend (Apple Silicon).
  • Setting debug_log_to_stream_files to neatly keep logs from routine execution in their separate files.
  • Settings clean_up_artifacts_on_startup, prefer_backend_uniformity.
  • Tools directory and the minised tool: regexp replacement file rewrite.
  • Directory arrayjit/bin and executable read_config for extracting OCANNL configuration into txt files.

Changed

  • Removed initialize and is_initialized from the backend API; instead, backends should be initialized on functor application. The functors now take config as argument.
  • More descriptive identifier names in C-syntax code in case of name conflicts.
  • Changed the backend config name cc to multicore_cc for consistency.
  • Migrated out of Stdlib.Format to PPrint for all structured formatting.
  • Migrated stdout capture to thread-based (domain-based actually); for Windows compatibility but also much more robust for large logs.

Fixed

  • Avoid conflicts with C math function names like fma.
  • Satur01_gate had wrong semantics.

More primitive operations

08 Apr 09:37

Highlights from README:

  • Supports a lot of primitive operations (including ternary ops), and ternary tensor operations.
  • %cd and %op support both curried and uncurried operator application syntax.
  • More flexible gradient construction via the %cd syntax (better projections inference).
  • Works on Native Windows with the C compiler backend (but CUDA backend blocked by cudajit still).

Details from the changelog:

Added

  • Lots of new primitive ops:
    • Unary: Satur01 | Exp | Log | Exp2 | Log2 | Sin | Cos | Sqrt | Recip | Recip_sqrt | Neg | Tanh_approx | Not
    • Binary: Satur01_gate | Max | Min | Mod | Cmplt | Cmpeq | Cmpne
    • Ternary: Where | FMA (non-accumulating)
  • Ternary tensor operations.
    • A differentiable where operation.
  • More flexible gradient construction via the %cd syntax (better projections inference).
  • CC backend piggy-backing on OCaml's C compiler (consistent across OSes).
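The new ternary ops have straightforward elementwise semantics. A minimal Python sketch of those semantics (illustrative only, not the OCANNL kernels):

```python
def where(cond, a, b):
    # Elementwise ternary select: picks from `a` where cond holds, else `b`.
    return [x if c else y for c, x, y in zip(cond, a, b)]

def fma(a, b, c):
    # Non-accumulating fused multiply-add: a*b + c, elementwise.
    return [x * y + z for x, y, z in zip(a, b, c)]

assert where([True, False], [1.0, 1.0], [2.0, 2.0]) == [1.0, 2.0]
assert fma([2.0], [3.0], [1.0]) == [7.0]
```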

Changed

  • Updated to printbox 0.12, with upstreamed graphing.
  • -pthread -> -lpthread in c_library_flags in dune files.
  • Removed Numpy support for easier compatibility on native Windows.
  • Unary (primitive) ops and relu are now named, not operator syntax.
  • Refactored %cd parsing of primitive ops.
  • %cd and %op support both curried and uncurried operator application syntax.
  • Updated to ppx_minidebug 2.2.0 with support for cross-run diffing.

Fixed

  • Numbers text rendering (consistent across OSes).
  • Moved closing row variables to stage 3, because stage 2 may need to process inequalities generating more LUBs.
  • Don't unnecessarily prevent bytecode-only build targets.

Automatic synchronization and transfers between host and devices

01 Jan 21:39

From the changelog:

Added

  • Automatic transfers to host from the context that most recently updated a node.
  • Automatic transfers of routine's inputs from host to routine's context if the host array modification was not yet transferred.
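These two transfer directions imply per-node bookkeeping: whether the host copy was modified since the last transfer, and which context most recently updated the node. A toy Python model of that bookkeeping (hypothetical field names, not OCANNL's actual record):

```python
class Node:
    """Toy model of per-node transfer bookkeeping."""
    def __init__(self):
        self.host_modified = False  # host array changed since last transfer?
        self.last_updater = None    # context that most recently updated the node

    def host_write(self):
        self.host_modified = True

    def device_write(self, ctx):
        self.last_updater = ctx

    def needs_host_to_device(self):
        # Transfer a routine's input only if the host copy changed.
        return self.host_modified

    def host_read_source(self):
        # Read back from the context that most recently updated the node.
        return self.last_updater

n = Node()
n.host_write()
assert n.needs_host_to_device()
n.device_write("ctx0")
assert n.host_read_source() == "ctx0"
```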

Fixed

  • Added # as alternative to ~~ for comment lines in ocannl_config files, and fixed a bug in their parsing.

Stream-to-stream synchronization at the buffer level

20 Dec 20:30

Highlights from README:

  • Support for CUDA events, and Condition-based events for CPU backends.
  • Overhaul of the backend interfaces, both user-facing but especially internal: full code sharing.
  • Automatic stream-to-stream synchronization on a per-tensor-node basis.
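The per-tensor-node scheme sketches as: the producing stream records an event after writing a node, and a consuming stream waits on that event before reading, but only when the streams differ. A toy Python illustration (hypothetical structure, not OCANNL's implementation):

```python
# Per-tensor-node stream-to-stream synchronization, in miniature.
last_write = {}  # tensor node -> (producer stream, event)

def on_write(node, stream, event):
    last_write[node] = (stream, event)

def sync_for_read(node, stream, waits):
    entry = last_write.get(node)
    if entry is not None and entry[0] != stream:
        # Cross-stream read: wait on the producer's event (lazily, on use).
        waits.append(entry[1])

waits = []
on_write("t1", "stream0", "ev0")
sync_for_read("t1", "stream0", waits)  # same stream: no wait needed
sync_for_read("t1", "stream1", waits)  # different stream: must wait
assert waits == ["ev0"]
```

This mirrors the CUDA event model (record on producer, wait on consumer), with Condition-based events playing the same role on CPU backends.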

Details from the changelog:

Added

  • Interface files for Backends and Low_level.
  • Fixed #245: tracking of used memory. But there's room for improvement.
  • Stream-to-stream synchronization functionality, with lazy per-tensor-node synchronization.

Changed

  • Migrated to cudajit 0.6.1.
  • Verifying that code is linked with the right contexts, by tracking embedded_nodes with assignments.
  • Renaming: (virtual) device -> stream, physical_device -> device.
  • New files: split out backend_intf.ml, backend_impl.ml, schedulers.ml from backends.ml; moved Tnode.task to task.ml; renamed backend_utils.ml to c_syntax.ml.
  • Removed half-static verification of merge buffer nodes inside device_to_device.
  • Fixed #286: cross-stream-sharing incorporated into Tnode.memory_mode.
  • Moved the multicore backend from a device = stream model to a single device model.
  • Got rid of unsafe_cleanup.
  • Rename subordinal to stream_id.
  • Removed dependency on core, broke up dependency on ppx_jane.
  • Huge refactoring of backend internal interfaces and API (not repeating same code).
  • Built per-tensor-node stream-to-stream synchronization into copying functions.
  • Re-introduced whole-device blocking synchronization, which now is just a slight optimization as it also cleans up event book-keeping.
  • Simplifications: no more explicit compilation postponing; no more hard-coded pointers (all non-local arrays are passed by parameter).
  • Fresh backends are now fresh modules to structurally prevent any potential cache leaking.

Fixed

  • Validating merge nodes for the CUDA backend.
  • Checking is_released on weak array retrieval.

Half precision, mixed precision, CUDA virtual devices

17 Sep 13:07

The release 0.4.1 offers: half precision, mixed precision, proper support for CUDA virtual devices, and many bug fixes.

From the CHANGELOG:

Added

  • Implemented the previously-mocked support for half precision (FP16).
    • We work around the missing Ctypes coverage by not using Ctypes.bigarray_start.
    • We check FP16 constants for overflow.
    • We output half precision specific code from the CUDA backend.
  • Finally proper support for mixed precision! Lazy precision defaults and delayed precision setting via Tnode.update_prec.
  • A placeholder nn_blocks.ml hinting at an intended design pattern for model components.
  • A memory model for the multiple virtual devices per physical device setup, implemented in the CUDA backend. It fixes the CUDA backend behavior in the data parallelism benchmark.
  • Slides for the Fun OCaml meetup: docs/Fun OCaml.
  • New syntax: inline tensor declarations with a literal float as initial value.
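Checking FP16 constants for overflow, as noted above, amounts to comparing against the largest finite half-precision value, 65504. A minimal sketch of such a check (not OCANNL's code):

```python
FP16_MAX = 65504.0  # largest finite IEEE 754 binary16 value

def check_fp16_constant(x: float) -> float:
    # Reject constants that would overflow to infinity in half precision.
    if abs(x) > FP16_MAX:
        raise OverflowError(f"{x} does not fit in FP16")
    return x

assert check_fp16_constant(1000.0) == 1000.0
assert check_fp16_constant(-65504.0) == -65504.0
```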

Changed

  • Removed the pipes_cc, pipes_gccjit backends (Pipes_multicore_backend) -- I had fixed Pipes_multicore_backend by using the poll library instead of Unix.select, but it turns out to be very very slow.
  • Changed the %cd block comment syntax ~~ to allow detailed structuring. Rewrote Train.grad_update to use the %cd syntax.
  • Made Train.sgd_one slightly more thrifty: p =- learning_rate *. sgd_delta --> p =- learning_rate * sgd_delta ~logic:"." without the inline tensor expression.

Fixed

  • Log levels related de-confusion:
    • Critical bug: logging of computation traces was not properly converted to ppx_minidebug 2.0.
    • Properly restore log_level and inform about its setting.
    • By default do not log from tests.
    • debug_log_from_routines should only happen when log_level > 1.
  • Bugs in Multicore_backend: await was not checking queue emptiness, worker's Condition.broadcast was non-atomically guarded (doesn't need to be), possible deadloop due to the lockfree queue -- now replaced with saturn_lockfree.
  • Reduced busy-waiting inside c_compile_and_load, propagating compilation errors now instead of infinite loop on error.
  • Fixed loss of significant digits for small numbers when outputting files.
  • Added missing mixed-precision conversions in the C_syntax backend builder.
  • Restored the functionality of debug logging from the cuda backend.
  • Always reinitialize global state at the beginning of let%expect_test, to make them more deterministic.

Half precision, mixed precision, CUDA virtual devices

13 Sep 22:39

The release 0.4.1 offers: half precision, mixed precision, proper support for CUDA virtual devices, and many bug fixes.

Non-beta release blocked by getting cudajit 0.4.1 in the opam-repository.

From the CHANGELOG:

Added

  • Implemented the previously-mocked support for half precision (FP16).
    • We work around the missing Ctypes coverage by not using Ctypes.bigarray_start.
    • We check FP16 constants for overflow.
    • We output half precision specific code from the CUDA backend.
  • Finally proper support for mixed precision! Lazy precision defaults and delayed precision setting via Tnode.update_prec.
  • A placeholder nn_blocks.ml hinting at an intended design pattern for model components.
  • A memory model for the multiple virtual devices per physical device setup, implemented in the CUDA backend.
    • It fixes the CUDA backend behavior in the data parallelism benchmark.

Changed

  • Removed the pipes_cc, pipes_gccjit backends (Pipes_multicore_backend) -- I had fixed Pipes_multicore_backend by using the poll library instead of Unix.select, but it turns out to be very very slow.

Fixed

  • Log levels related de-confusion:
    • Critical bug: logging of computation traces was not properly converted to ppx_minidebug 2.0.
    • Properly restore log_level and inform about its setting.
    • By default do not log from tests.
    • debug_log_from_routines should only happen when log_level > 1.
  • Bugs in Multicore_backend: await was not checking queue emptiness, worker's Condition.broadcast was non-atomically guarded (doesn't need to be), possible deadloop due to the lockfree queue -- now replaced with saturn_lockfree.
  • Reduced busy-waiting inside c_compile_and_load, propagating compilation errors now instead of infinite loop on error.
  • Fixed loss of significant digits for small numbers when outputting files.
  • Added missing mixed-precision conversions in the C_syntax backend builder.
  • Restored the functionality of debug logging from the cuda backend.