Releases: bacalhau-project/bacalhau
v1.6.1-rc1
Major Improvements
- Partitioned Execution Support: Added support for splitting jobs across multiple executions with automatic partition management. The feature includes:
  - Partition assignment and tracking
  - Independent execution progress monitoring
  - Granular failure handling with retry of only failed partitions
  - Each execution receives its partition details through environment variables, enabling partition-aware processing when needed
- S3 Input Partitioning: Added automatic data distribution for S3 inputs across multiple executions using configurable strategies:
  - Multiple partitioning strategies: Users can choose between object-based distribution for even splitting, regex patterns for structured data, substring matching for fixed formats, or date-based partitioning for temporal data
  - Even distribution of data without requiring custom partition code
  - Support for shared data access through non-partitioned inputs
  - Automatic data subset assignment to each execution
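To make the partitioning idea concrete, here is a minimal Python sketch of how a partition-aware execution could pick its share of object keys. It is an illustration only, not Bacalhau's implementation: the environment variable names (BACALHAU_PARTITION_INDEX, BACALHAU_PARTITION_COUNT) are assumptions, since the notes above do not name them, and the helper functions and the hash-based assignment are hypothetical stand-ins for the object-based and regex strategies.

```python
import os
import re
import zlib
from typing import Iterable, List, Mapping, Optional, Tuple


def partition_of(key: str, count: int, pattern: Optional[str] = None) -> int:
    """Map an object key to a partition number in [0, count).

    Without a pattern, the whole key is hashed (object-based distribution).
    With a pattern, only the matched part is hashed, so keys that share
    that part always land in the same partition (regex-style distribution).
    """
    token = key
    if pattern is not None:
        match = re.search(pattern, key)
        if match is None:
            # Arbitrary choice for this sketch: unmatched keys go to partition 0.
            return 0
        token = match.group(1) if match.groups() else match.group(0)
    # crc32 is stable across processes, unlike Python's built-in hash().
    return zlib.crc32(token.encode("utf-8")) % count


def my_keys(keys: Iterable[str], index: int, count: int,
            pattern: Optional[str] = None) -> List[str]:
    """Return only the keys assigned to the partition with the given index."""
    return [k for k in keys if partition_of(k, count, pattern) == index]


def partition_from_env(environ: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Read (index, count) from the environment.

    The variable names are assumptions; defaults describe a single partition.
    """
    index = int(environ.get("BACALHAU_PARTITION_INDEX", "0"))
    count = int(environ.get("BACALHAU_PARTITION_COUNT", "1"))
    return index, count


if __name__ == "__main__":
    index, count = partition_from_env()
    keys = ["logs/eu/a.json", "logs/eu/b.json", "logs/us/c.json"]
    print(index, count, my_keys(keys, index, count, pattern=r"logs/(\w+)/"))
```

Because every key maps to exactly one partition, the executions together cover the full input without overlap, and a failed partition can be retried without touching the others.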
Full Changelog: v1.6.0...v1.6.1-rc1
v1.6.0
Bacalhau v1.6.0 Release Notes
We are excited to announce the release of Bacalhau v1.6.0, introducing a new communication architecture that significantly improves the reliability and resilience of distributed compute networks.
Key Features and Improvements
New Bacalhau Messaging Protocol (BMP)
At the heart of this release is the new messaging protocol, a complete redesign of node communication that brings significant improvements to network reliability:
Key Benefits
- Self-Healing Network: Compute nodes and orchestrators automatically reconnect and sync after network interruptions
- Offline-First Operation: Compute nodes can start and operate even when disconnected from the orchestrator
- Automatic State Recovery: When nodes reconnect, they automatically share all missed job execution information and results
- Zero Data Loss: Ensures no job execution data or results are lost during network disruptions
- Seamless Recovery: Network interruptions are handled transparently without requiring manual intervention
Technical Improvements
- Reliable Message Delivery: Ordered, at-least-once message delivery between nodes
- Automatic Recovery: Built-in failure detection and recovery mechanisms
- Connection Health Monitoring: Proactive health checks and connection management
- Event-Based Architecture: Decoupled event processing from message delivery
- Efficient Checkpointing: Maintains system state for reliable recovery
- Backward Compatibility: Maintains compatibility with v1.5 orchestrators
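As a rough illustration of what ordered, at-least-once delivery with checkpointing implies for a receiver, the Python sketch below applies each message exactly once by tracking the last applied sequence number. The class and function names are hypothetical and the logic is a generic pattern, not the BMP implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Receiver:
    """Applies messages that arrive in order but possibly more than once."""

    checkpoint: int = 0  # highest sequence number already applied
    applied: List[str] = field(default_factory=list)

    def handle(self, seq: int, payload: str) -> bool:
        """Apply a message once; return False for redelivered duplicates."""
        if seq <= self.checkpoint:
            return False  # already applied, safe to ignore
        if seq != self.checkpoint + 1:
            raise ValueError(f"gap: expected {self.checkpoint + 1}, got {seq}")
        self.applied.append(payload)
        self.checkpoint = seq
        return True


def resend_from(log: List[Tuple[int, str]], checkpoint: int) -> List[Tuple[int, str]]:
    """Messages a sender would replay after a reconnect, given the peer's checkpoint."""
    return [(seq, payload) for seq, payload in log if seq > checkpoint]


if __name__ == "__main__":
    receiver = Receiver()
    # Message 2 is delivered twice, as at-least-once delivery allows.
    for seq, payload in [(1, "started"), (2, "running"), (2, "running"), (3, "done")]:
        receiver.handle(seq, payload)
    print(receiver.applied, receiver.checkpoint)
```

Keeping the checkpoint on the receiving side is what lets a reconnecting node ask only for what it missed instead of replaying the whole history.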
Enhanced Web UI Experience
- Direct Result Downloads: Download job results directly from the interface
- Simplified Configuration: Automatic request routing eliminates manual IP configuration
- Improved Architecture: Streamlined backend setup while maintaining security
Operational Improvements
- Reverse Proxy Support: Added capability to run orchestrator behind a reverse proxy
- Agent Configuration: New bacalhau agent config command to inspect agent configuration
- TLS Support: Added TLS encryption support for NATS communication
- Better Logging: Implemented more human-readable logging patterns
Upgrade Notes and Backward Compatibility
Bacalhau v1.6.0 maintains backward compatibility while introducing the new BMP:
- Compute nodes maintain compatibility with v1.5 orchestrators, and vice versa
- Support for re-handshake from legacy clients
We're excited for you to experience the enhanced reliability and resilience provided by the BMP in Bacalhau v1.6.0. This release represents a significant architectural advancement in making distributed computing more robust and dependable.
v1.5.1
Major Improvements
- Enhanced Web UI Routing: Improved routing of Web UI requests without requiring backend address definition
- Faster Startup: Dramatically reduced node startup time from ~9 seconds to ~1.5 seconds by optimizing IMDS access
- Job Management: Added support for stopping jobs using short IDs
- Bug Fix: Resolved issues with default publishers functionality
Breaking Changes
- Removed exec command and job translation functionality
Additional Changes
- Added Docker compose support
- Improved API error handling
Links
- Full Changelog: v1.5.0...v1.5.1
v1.5.1-rc1
Major Improvements
- Enhanced Web UI Routing: Improved routing of Web UI requests without requiring backend address definition
- Faster Startup: Dramatically reduced node startup time from ~9 seconds to ~1.5 seconds by optimizing IMDS access
- Job Management: Added support for stopping jobs using short IDs
- Bug Fix: Resolved issues with default publishers functionality
Breaking Changes
- Removed exec command and job translation functionality
Additional Changes
- Added Docker compose support
- Improved API error handling
Links
- Full Changelog: v1.5.0...v1.5.1-rc1
v1.5.0
Bacalhau v1.5 Release Notes
We're thrilled to announce the release of Bacalhau 1.5.0, a significant update that introduces powerful new features and enhancements. Building on the momentum from our previous releases, Bacalhau 1.5 focuses on simplifying configuration, improving visibility, and enhancing overall performance.
Key Features and Improvements
Simplified Configuration Management
- New File-Based Configuration System: We've introduced a more intuitive file-based configuration system, replacing complex CLI flags. This change makes setting up and managing Bacalhau networks more straightforward and less error-prone.
- Flexible Configuration Options: Users can now provide:
  - A single config file
  - Multiple config files that are merged
  - Key-value pairs directly via the -c flag (e.g., -c key=value)
- Decoupled Configuration: Configuration is now decoupled from the repo (now called data dir), allowing for more flexible setups.
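The layering described above (several config files merged, then key=value pairs applied on top) can be sketched in Python as follows. This is an illustration under assumptions, not Bacalhau's loader: the precedence rule (later sources override earlier ones), the dotted-key syntax, and the example keys are all hypothetical, and values are kept as plain strings.

```python
import copy
from typing import Any, Dict, Iterable, List

Config = Dict[str, Any]


def deep_merge(base: Config, override: Config) -> Config:
    """Return a new config where nested keys from override win over base."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def apply_key_value(config: Config, assignment: str) -> Config:
    """Apply one 'dotted.key=value' pair (assumed syntax) to a copy of config."""
    dotted, _, value = assignment.partition("=")
    result = copy.deepcopy(config)
    node = result
    parts = dotted.split(".")
    for part in parts[:-1]:
        child = node.get(part)
        if not isinstance(child, dict):
            child = {}
            node[part] = child
        node = child
    node[parts[-1]] = value
    return result


def resolve(files: Iterable[Config], overrides: List[str]) -> Config:
    """Merge already-parsed config files in order, then apply key=value overrides."""
    config: Config = {}
    for parsed in files:
        config = deep_merge(config, parsed)
    for assignment in overrides:
        config = apply_key_value(config, assignment)
    return config


if __name__ == "__main__":
    base = {"api": {"host": "0.0.0.0", "port": "1234"}}        # hypothetical keys
    site = {"api": {"port": "4321"}, "datadir": "/tmp/demo"}  # hypothetical keys
    print(resolve([base, site], ["api.host=127.0.0.1"]))
```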
Enhanced Data Directory Structure
- Improved Organization: We've clearly separated compute and orchestrator related data, providing a cleaner structure.
- Consolidated Metadata: System metadata is now consolidated into a single system_metadata.json file for easier management.
New WebUI
- Embedded Management Interface: Introduced a comprehensive WebUI for easier management and monitoring of your Bacalhau network. This significant feature allows users to visualize and interact with their Bacalhau deployment without relying solely on the CLI.
Enhanced Job Visibility and Reporting
- Granular Event Reporting: Improved reporting on job progress, including detailed scheduling actions, failures, and retries.
- Better Error Messages: Enhanced error reporting system with meaningful messages and debugging hints.
API Enhancements
- Pagination for Job History: Implemented pagination support for job history, improving the user experience when dealing with a large number of jobs and making it easier to navigate through job and execution history events.
Upgrade Notes and Backward Compatibility
While Bacalhau 1.5.0 introduces some breaking changes, we've ensured a smooth upgrade path:
- Most CLI flags have been removed in favor of configuration files, but we gracefully handle deprecated flags for backward compatibility.
- The structure of the data directory has changed, but we automatically handle the migration when you first run the new Bacalhau version.
- Many old configuration options have been deprecated in favor of the new structure and config keys.
Please refer to our [updated documentation](https://docs.bacalhau.org/) for detailed instructions on upgrading to Bacalhau 1.5.0 and taking advantage of the new configuration system.
We're excited for you to explore the new features and enhancements in Bacalhau 1.5.0. Whether you're a seasoned Bacalhau user or just getting started, this update will empower you to build and run distributed compute networks more effectively than ever before.
v1.4.0
Announcing Bacalhau 1.4.0
We’re excited to announce the release of Bacalhau 1.4.0, a significant update that introduces powerful new features and enhancements. Building on the momentum from our previous releases this year (1.2.0, 1.3.0, 1.3.1, and 1.3.2), Bacalhau 1.4 strengthens our platform’s performance, scalability, and user experience, solidifying its position as a leading platform for building and running distributed compute networks.
In this release, we focused on three major efforts, with particular emphasis on those deploying Bacalhau at scale:
Performance and Scalability Enhancements
- Extended Job Queuing: Bacalhau 1.4.0 introduces a more robust queuing system, improving job scheduling and execution efficiency, especially in high-demand or globally distributed networks. By intelligently managing job queues, Bacalhau ensures smoother operations and increased throughput, leading to higher success rates for your distributed compute tasks.
- Migration to NATS, Deprecation of libp2p and Embedded IPFS Node: We’ve fully transitioned to NATS.io as Bacalhau’s communication backbone, moving away from libp2p and the embedded IPFS node. This change streamlines communication and reduces overhead, marking a significant step towards a more efficient and scalable network. IPFS integration remains available with external nodes for those who need it.
Improved User Experience
- Updated CLI and HTTP API: Bacalhau 1.4.0 introduces a revamped command-line interface (CLI) and HTTP API. These updates align the CLI commands with the new API structure and enhance overall usability. While most changes are seamless for existing users, some command adjustments have been made (e.g., bacalhau create becomes bacalhau job run). Our updated documentation will guide you through the transition smoothly.
- Job Spec Updates: We've introduced an updated Job Specification format while deprecating some features of the previous format. This change requires users to update their job specs but brings improved clarity and consistency.
- Enhanced Error Reporting: Bacalhau 1.4.0 improves error reporting, making it easier to diagnose and troubleshoot issues. This enhancement contributes to a more stable and reliable experience, helping users quickly resolve any problems that arise. For detailed guidance, please consult our documentation on the new Job Spec requirements.
- Introduction of Node Manager: In Bacalhau 1.4.0, we’re introducing the Node Manager. This feature simplifies node operations, providing a clear view of all compute nodes and their status. You can approve, deny, or delete nodes as needed, making management straightforward. Heartbeats from nodes keep the Node Manager updated on their connectivity, enhancing overall stability and performance.
Smooth Transition for Existing Users
- Error Handling and Guidance: We understand that transitioning to a new version can be challenging. To ease this process, we’ve implemented helpful error messages and guidance for those adjusting to the changes in CLI behavior and job specifications. We’ve also created a table to show how some of the Bacalhau API endpoints have been remapped. If you’re not ready to upgrade, you can continue using version 1.3.1 while maintaining your private Bacalhau cluster.
Join Us on the Journey
We’re excited for you to explore the new features and enhancements in Bacalhau 1.4.0. Over the next five days, we’ll dive deeper into each topic in our “5 Days of Bacalhau” blog series. Whether you’re a seasoned Bacalhau user or just getting started, this update will empower you to build and run distributed compute networks more effectively than ever before.
v1.3.2
What's Changed
- Splitting to Consumer and Producer Client by @udsamani in #4027
- refactor: define config instance. config is no longer global by @frrist in #3959
- ops: update prod cluster vars to reflect current state by @frrist in #4034
- introduce job queueing when no nodes were found by @wdbaruni in #4049
- filter out nodes with high queue capacity by @wdbaruni in #4051
- Update canaries to 1.3.1 by @wdbaruni in #4039
Full Changelog: v1.3.1...v1.3.2
v1.3.1
What's Changed
- Update prod to v1.3.0. by @simonwo in #3708
- fix: update bacalhau install script to use go v1.21.8 by @frrist in #3675
- Add authentication to the Web UI by @simonwo in #3711
- fix: re-use client TLS config for establishing ws connections by @frrist in #3722
- Add Basic Bashtub tests by @js-ts in #3669
- Replace use of multierror libraries with now standardised errors.Join by @simonwo in #3744
- Load WASM module dependencies if non-WASM input data is specified by @simonwo in #3747
- Adds a node delete command by @rossjones in #3716
- Add error output to job selection exec hooks by @simonwo in #3745
- feat: create a bacalhau user to run bacalhau process by @frrist in #3733
- refactor: bidder to simplfiy exposing errors by @frrist in #3680
- Improve error reporting when job resources exceed capacity by @simonwo in #3749
- Print ranking failures in the CLI by @simonwo in #3752
- Return failed bid when image does not exist by @rossjones in #3755
- Generate certificates requester three by @olgibbons in #3707
- Just a bunch of spelling fixes by @aronchick in #3760
- Bump go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetrichttp from 0.45.0 to 1.24.0 by @dependabot in #3615
- Improve ranking error output by @simonwo in #3763
- Node heartbeats by @rossjones in #3709
- Ensure compute node labels are published over LibP2P connections by @simonwo in #3769
- Use RunE instead of Run for cobra commands by @rossjones in #3767
- Optionally constrain to connected, approved compute nodes when selecting/ranking them by @rossjones in #3768
- Print command line executable when displaying additional commands by @simonwo in #3770
- Deprecates some IPFS flags by @rossjones in #3778
- Make it easier to get in progress jobs of a certain type by @rossjones in #3777
- Fixes a panic in the wasm executor when entrypoint is not known by @rossjones in #3779
- Structured events in place of comment strings by @simonwo in #3771
- fix: remove Liveness and Acceptance from NodeInfo by @frrist in #3785
- Add issue templates by @wdbaruni in #3951
- Bump golang.org/x/net from 0.21.0 to 0.23.0 by @dependabot in #3809
- Bump golang.org/x/net from 0.21.0 to 0.23.0 in /ops/aws/canary/lambda by @dependabot in #3808
- Bump idna from 3.3 to 3.7 in /ops by @dependabot in #3786
- Bump idna from 3.6 to 3.7 in /python by @dependabot in #3788
- Bump idna from 3.6 to 3.7 in /integration/airflow by @dependabot in #3789
- Bump sqlparse from 0.4.4 to 0.5.0 in /integration/airflow by @dependabot in #3792
- Bump gunicorn from 21.2.0 to 22.0.0 in /integration/airflow by @dependabot in #3794
- Bump aiohttp from 3.9.3 to 3.9.4 in /integration/airflow by @dependabot in #3796
- Bump apache-airflow from 2.8.1 to 2.9.0 in /integration/airflow by @dependabot in #3798
- Bump go.opentelemetry.io/otel/exporters/otlp/otlptrace from 1.23.1 to 1.26.0 by @dependabot in #3955
- Removing placeholders by @MichaelHoepler in #3795
- adding ok-to-test functionality by @aronchick in #3960
- 3444 adding all fake data by @aronchick in #3445
- feat: premit update check configuration w/ envvar by @frrist in #3967
- fix: prevent panic in test due to nil pointer dereference by @frrist in #3977
- fix: require connected and approved nodes for scheduling by @frrist in #3957
- fix: allow timeouts in the client methods that make API call by @frrist in #3964
- refactor: add deprecation message to libp2p & ipfs flags by @frrist in #3966
- Support MP deployments with self-generated TLS certificate by @frrist in #3721
- fix: update checker doesn't require server connection to work by @frrist in #3775
- ExecutionTimeout to fail executions instead of jobs by @wdbaruni in #3974
- adding update checker test env variable by @aronchick in #3997
- Fix Repo Clone by @js-ts in #3973
- urldownload GetVolumeSize check for content length by @udsamani in #4006
- add atomic job store transactions by @wdbaruni in #3996
- compute nodes no longer reject bids based on queued capacity by @wdbaruni in #4002
- fix txContext auto-cancellation of context by @wdbaruni in #4012
- single requester node in demo networks by @wdbaruni in #3673
- fixing circle ci commenting by @aronchick in #4022
- run demo network on nats by @wdbaruni in #4009
- add evaluation watcher and enable atomic enqueues by @wdbaruni in #3998
- make job name optional by @wdbaruni in #4005
- fix: implement NodeStore migration by @frrist in #4029
Full Changelog: v1.3.0...v1.3.1
v1.3.0
We are thrilled to announce the release of Bacalhau v1.3.0, a significant milestone in our quest to help organizations of all sizes deal with the world of distributed compute. Packed with exciting new features like user access control, local results publishing, and TLS support, this release is built to address the needs of even the largest organizations without the complexity of traditional distributed platforms! 🚀
Without further ado, let’s dive in!
New Features
User access control
Bacalhau v1.3.0 now supports authentication and authorization of individual users with a flexible and customizable auth system that remains simple for single-node clusters but scales up well to wide enterprise deployments.
Bacalhau auth integrates well with whatever auth systems users already have. Bacalhau can use private keys, shared secrets, usernames and passwords, and 2FA. Additionally, Bacalhau offers OAuth2/OIDC for authentication and can apply access control to single users, groups, and teams, using RBAC or ABAC mechanisms as desired.
The default behaviour is unchanged. Users will be authenticated based on their private key and authorized to submit and control their own jobs, and read-only information about the cluster will be available with authentication.
To start using user authentication, check out the auth docs and install a custom policy to control user access and their permissions.
Publishing and serving results on local disks
In Bacalhau v1.3.0 we are introducing a new publisher type that lets users publish to the local disk of the compute node. This will streamline the process of testing the publisher functionality without the need for a remote storage service. This is especially handy for those who are just getting started with Bacalhau.
The local publisher is composed of two parts: the publisher that compresses and moves job outputs to a specified location, and an HTTP server that delivers the content back to the user.
By default, the HTTP server listens on port 6001, but this can be modified using the --local-publisher-port flag. The server will deliver content from the directory specified by the local-publisher-directory flag, or, if not set, from a subdirectory of the configured Bacalhau storage directory. The --local-publisher-address flag can be used to set the address that the HTTP server listens on. Default values for this vary by environment (e.g., localhost for test and development environments, public for production environments), but users can set these values in the config if the defaults are not suitable.
We should stress that managing the storage is still the administrator’s responsibility. Because local storage necessarily means storing on a single node, you should think through cleanup, persistence, and similar concerns before moving into production!
NATS-based networking
In the Bacalhau v1.3.0 release, we are introducing a new transport layer to improve inter-node connectivity. This new layer utilizes NATS, a robust messaging system, instead of the existing libp2p transport.
With the introduction of NATS, we are simplifying the network requirements for Compute nodes. Now, only Orchestrator nodes (also known as Requester nodes) need to be publicly reachable. As a result, Compute nodes only need to know the address of a single Orchestrator node, and they can learn about and connect to other Orchestrators at runtime. This change not only simplifies the setup process but also enhances resilience as it allows Compute nodes to failover and reconnect to other Orchestrators when necessary. This change only affects inter-node communication, and the Bacalhau HTTP API is unchanged.
We acknowledge that adapting to new technologies takes time. In recognition of this, libp2p will continue to be supported as an alternative during this transition period. This ensures that you have the flexibility to migrate at your own pace. Users who wish to continue using libp2p need to specify the Node.Network.Type config option or --network flag as libp2p explicitly when running their network.
Persistent memory of connected nodes
The Bacalhau v1.3.0 release introduces a significant upgrade ensuring the persistence of node information across requester node restarts. This addresses a shortcoming of the previous in-memory store, which would lose all knowledge of compute nodes upon a restart. The new persistent store is a major advancement towards maintaining more accurate node information and tracking compute nodes that may be temporarily inaccessible to the cluster.
The new persistent store is used automatically when NATS-based networking is used.
TLS support for Bacalhau CLI
Bacalhau v1.3.0 now supports TLS requests to the requester node for all CLI commands. While the default communication remains HTTP, users can activate TLS calls using the command line flag --tls, setting the Node.ClientAPI.ClientTLS.UseTLS config option to true, or by exporting the BACALHAU_API_TLS=1 environment variable.
For self-signed certificates, users can either accept insecure requests or provide a CA certificate file. The Node.ClientAPI.ClientTLS.CACert config option, BACALHAU_API_CACERT environment variable and --cacert flag can be used to verify the certificate with a provided CA certificate file. Alternatively, the Node.ClientAPI.ClientTLS.Insecure config option, --insecure flag or BACALHAU_API_INSECURE environment variable can be used to make API requests without certificate verification.
Customizable node names
In the Bacalhau v1.3.0 release, we've introduced a new feature that allows users to set their own nodeID. This addition gives users the flexibility to tailor their node names according to their preferences and needs.
Users have the option to manually set the node name, or they can opt for automatic generation using various providers. These providers include puuid (which is the default option), uuid, hostname, aws, and gcp.
The puuid option generates a node name using the n-{uuid} pattern, such as n-f1bab231-68ad-4c72-bab6-580cd49bf521. The uuid option generates a uuid as a node name. The hostname option uses the hostname as the node id, replacing any . with - to ensure compatibility with NATS. The aws option uses the EC2 instance name if the node is deployed on AWS, and the gcp option uses the VM's id if the node is deployed on GCP.
It's important to note that these providers will only be called into action if no existing node name is found in config.yaml, the CLI --name flag, or environment variables. Once a node name is generated, it will be persisted in config.yaml, ensuring that the node names are consistent across sessions.
To set the node name manually:
bacalhau serve --name my-custom-name
To use a puuid as the node name (which is the default):
bacalhau serve
To use the hostname as the node name:
bacalhau serve --name-provider hostname
This new feature is aimed at enhancing user customization and control, making Bacalhau even more user-friendly and adaptable to different user needs.
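The naming rules above can be mirrored in a few lines of Python. This is a sketch of the described behaviour only, not Bacalhau's code: the function names are hypothetical, the aws and gcp providers are left out because they depend on cloud metadata, and persisting the name to config.yaml is not shown.

```python
import socket
import uuid
from typing import Optional


def puuid_name() -> str:
    """Name in the n-{uuid} pattern described for the puuid provider."""
    return f"n-{uuid.uuid4()}"


def uuid_name() -> str:
    """Plain uuid as the node name."""
    return str(uuid.uuid4())


def hostname_name(hostname: Optional[str] = None) -> str:
    """Hostname with every '.' replaced by '-' for NATS compatibility."""
    host = hostname if hostname is not None else socket.gethostname()
    return host.replace(".", "-")


def resolve_node_name(configured: Optional[str], provider: str = "puuid",
                      hostname: Optional[str] = None) -> str:
    """Use an existing name if there is one; otherwise ask the provider."""
    if configured:
        return configured
    if provider == "puuid":
        return puuid_name()
    if provider == "uuid":
        return uuid_name()
    if provider == "hostname":
        return hostname_name(hostname)
    raise ValueError(f"provider not covered by this sketch: {provider}")


if __name__ == "__main__":
    print(resolve_node_name("my-custom-name"))
    print(resolve_node_name(None, "hostname", "worker-1.example.com"))
    print(resolve_node_name(None))
```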
Improved telemetry and metrics
Bacalhau Telemetry Suite
In this update we have introduced a docker-compose based telemetry suite complete with open-telemetry, Prometheus, Grafana, and Jaeger containers for collecting and inspecting telemetry data emitted from bacalhau nodes. For details on running the suite see the respective README.md
Improved Visibility via New Metrics
In this update we have added new metrics to improve the observability of bacalhau nodes. These metrics include:
- job_submitted: Number of jobs submitted to the Bacalhau node.
- job_publish_duration_milliseconds: Duration of publishing a job on the compute node in milliseconds.
- job_storage_upload_duration_milliseconds: Duration of uploading job storage input on the compute node in milliseconds.
- job_storage_prepare_duration_milliseconds: Duration of preparing job storage input on the compute node in milliseconds.
- job_storage_cleanup_duration_milliseconds: Duration of job storage input cleanup on the compute node in milliseconds.
- job_duration_milliseconds: Duration of a job on the compute node in milliseconds.
- docker_active_executions: Number of active docker executions on the compute node.
- wasm_active_executions: Number of active WASM executions on the compute node.
- bacalhau_node_info: A static metric with labels describing the bacalhau node.
  - node_id: ID of the bacalhau node emitting the metric.
  - node_network_transport: bacalhau node network transport type (libp2p or NATS).
  - node_is_compute: true if the node is accepting compute jobs.
  - node_is_requester: true if the node is serving as a requester node.
  - node_engines: list of engines the node supports.
  - node_publishers: list of publishers the node supports.
  - node_storages: list of storages the node supports.
Improved Out of Memory handling for Docker jobs
The Bacalhau CLI will now explain when Docker jobs run out of memory and include links to the Bacalhau documentation showing how to increase the memory limit for a job.
Improved configuration for IPFS
In this update, we have allowed for the embedded IPFS node's gateway, API, and swarm listening multi-addresses to be configured, providing users with more control and determinism, particularly when configuring firewall rules.
This update also introduces changes when the --ipfs-serve-path flag is set, now preserving the content of the embedded IPFS node's repo across Bacalhau restarts, maintaining any data the embedded IPFS node stored as well as its identity.
F...
v1.3.0-rc3
What's Changed
- Clarifies error message when bacalhau encounters a newer repo version by @rossjones in #3638
- Approve/Reject node membership from CLI by @rossjones in #3594
- Fix low cleanup frequency in cache tests by @rossjones in #3636
- Add more auth documentation. by @simonwo in #3642
- release v1.2.3 by @wdbaruni in #3650
- Default local publisher to binding to all addresses by @rossjones in #3655
- Configure IP for local publisher address by @rossjones in #3659
- Upgrades the codebase to Go 1.21 by @rossjones in #3640
- Enable dark mode system-pref and switcher for docs by @rossjones in #3661
- Remove custom yaml tag on TLS ServerKey by @rossjones in #3639
- Local publisher address tf by @rossjones in #3660
- Publisher processing for exec command by @rossjones in #3656
- Default for NetworkType is NATS by @rossjones in #3628
- Use local node in Bash tests when starting hybrid node by @simonwo in #3667
- stop auto-generating auth tokens by @wdbaruni in #3670
- run staging on nats by @wdbaruni in #3671
- Allow nodes to call register on startup, even if previously registered by @rossjones in #3672
Full Changelog: v1.2.3...v1.3.0-rc3