
Add streaming interface to era transition for initial funds injection#5572

Open
f-f wants to merge 1 commit into master from f-f/streaming-genesis-data1

Conversation

f-f (Contributor) commented Feb 11, 2026

Description

This patch lays the groundwork for #5549: supporting data injection from genesis files using a streaming framework rather than relying on lazy IO.

The spec is the one that @lehins laid out in the issue, but here we only implement streaming for initialFunds; the rest can easily be added once the approach has proven solid.

I have a few code-specific notes that I will add as comments on the diff.
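To make the shape of the approach concrete, here is a minimal, self-contained sketch of the fold-based interface. The InjectionSource and foldInjectionSource names mirror this diff, but the types and bodies below are illustrative stand-ins (the real FromFile case streams chunks from a handle, hashes them incrementally, and verifies the content hash at the end):

```haskell
{-# LANGUAGE BangPatterns #-}

module Main where

-- Illustrative stand-in for the type in the diff: Embedded keeps the data in
-- memory (the legacy path), FromFile points at data to be streamed from disk.
data InjectionSource k v
  = Embedded [(k, v)]
  | FromFile FilePath

-- A strict left fold over the injected pairs. Because consumers only ever see
-- one pair at a time, the FromFile case never needs the whole map in memory.
foldInjectionSource ::
  InjectionSource k v -> (acc -> (k, v) -> acc) -> acc -> IO acc
foldInjectionSource (Embedded kvs) f z = pure (go z kvs)
  where
    go !acc [] = acc
    go !acc (kv : rest) = go (f acc kv) rest
foldInjectionSource (FromFile _) _ _ =
  fail "sketch: streaming, hashing, and verification elided"

main :: IO ()
main = do
  let src = Embedded [(i, i) | i <- [1 .. 1000 :: Int]]
  count <- foldInjectionSource src (\(!acc) _ -> acc + 1) (0 :: Int)
  print count
```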

Checklist

  • Commits in meaningful sequence and with useful messages.
  • Tests added or updated when needed.
  • CHANGELOG.md files updated for packages with externally visible changes.
    NOTE: New section is never added with the code changes. (See RELEASING.md).
  • Versions updated in .cabal and CHANGELOG.md files when necessary, according to the
    versioning process.
  • Version bounds in .cabal files updated when necessary.
    NOTE: If bounds change in a cabal file, that package itself must have a version increase. (See RELEASING.md).
  • Code formatted (use scripts/fourmolize.sh).
  • Cabal files formatted (use scripts/cabal-format.sh).
  • CDDL files are up to date (use scripts/gen-cddl.sh).
  • hie.yaml updated (use scripts/gen-hie.sh).
  • Self-reviewed the diff.

f-f requested a review from a team as a code owner on February 11, 2026 at 14:03
instance EncCBOR ShelleyGenesis

instance ToCBOR ShelleyGenesis where
-- Note: sgExtraConfig is intentionally not serialized for backward compatibility
f-f (Contributor Author):

I didn't de/serialize this so it doesn't break existing files, but I'm not sure what's polite here, since there are other breaking changes.

Collaborator:

I don't think CBOR for genesis files is used anywhere, so ignoring it is fine from that perspective. However, that will break roundtrip tests, so maybe it would be best to actually add the proper de/serialization for this field

f-f (Contributor Author):

Ah makes sense, will do!

}
}
m (NewEpochState era)
registerInitialFunds withFileHandle tc newEpochState = do
f-f (Contributor Author):

This function is now monadic, and I did not add a deprecated alternative because the signature propagates pretty much everywhere. Happy to hear feedback on whether we should handle this more gracefully.

f-f force-pushed the f-f/streaming-genesis-data1 branch from 3e9ddc0 to a949d55 on February 11, 2026 14:11
-- | Fold over an injection source: for Embedded sources the data is folded in
-- memory, while for the FromFile source the data is streamed, hashed, and parsed
-- incrementally, with content hash verification at the end.
foldInjectionSource ::
f-f (Contributor Author):

I am not 100% happy with this name, but it does describe what this function does 😄

Collaborator:

foldInjectionData 😄

f-f (Contributor Author):

Yeah I was not a big fan of that either, but I'll move to that 👍

Cardano.Ledger.Coin
Cardano.Ledger.Compactible
Cardano.Ledger.Core
Cardano.Ledger.Crypto.Blake2b.Incremental
f-f (Contributor Author):

This module should obviously not be here - I noted in the comments that it should live in base. If we want to merge this before updating base, we could move it to other-modules (in the shelley package, I guess); it's exposed right now only so I could test that it does the right thing.

f-f (Contributor Author) commented Feb 11, 2026

The elephant in the room is "tests": currently there are no tests, and I'm not sure what kind of tests we would like for this.

I have a few scripts that I used to verify that:

  1. the incremental hashing is correct against the standard hashing
  2. the streaming code is properly streaming

The tests could be some variation of these, feedback welcome:

Hash checking: foldInjectionSource fails if the hash doesn't match. (The tests for this should go in base, since the hashing code goes there, but if we merge it here first then it makes sense to have some tests for it here as well.)

{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeApplications #-}

module Main where

import Cardano.Crypto.Hash (Blake2b_256)
import qualified Cardano.Crypto.Hash.Class as Hash
import Cardano.Crypto.Libsodium (sodiumInit)
import Cardano.Ledger.Shelley.Genesis (InjectionSource (..), foldInjectionSource)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
import Data.Text (Text)
import System.Directory (removeFile)
import System.IO (BufferMode (..), IOMode (..), hSetBuffering, withFile)

main :: IO ()
main = do
  sodiumInit
  let path = "/tmp/hash-check.json"
      n = 1000 :: Int

  generateJsonFile path n

  fileHash <- Hash.hashWith @Blake2b_256 id <$> BS.readFile path

  let source = FromFile path fileHash :: InjectionSource Text Text
      wfh fp k = withFile fp ReadMode k
  count <- foldInjectionSource wfh source (\(!acc) _ -> acc + 1 :: Int) 0

  removeFile path
  putStrLn $ "Entries: " <> show count <> " (expected " <> show n <> ")"

generateJsonFile :: FilePath -> Int -> IO ()
generateJsonFile path n =
  withFile path WriteMode $ \h -> do
    hSetBuffering h (BlockBuffering (Just (64 * 1024)))
    BS8.hPut h "{"
    mapM_
      ( \i -> do
          let comma = if i == (1 :: Int) then "" else ","
          BS8.hPut h $ comma <> "\"key_" <> BS8.pack (show i) <> "\": \"val_" <> BS8.pack (show i) <> "\""
      )
      [1 .. n]
    BS8.hPut h "}"

Streaming check: this creates a file with 3 GB of JSON, then reads it back with streaming.
Limit the heap when running, e.g. to 300 MB - the program would fail if it tried to read the file all at once: cabal run streaming-check -- +RTS -s -M300M

{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE NumericUnderscores #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeApplications #-}

module Main where

import Cardano.Crypto.Hash (Blake2b_256)
import qualified Cardano.Crypto.Hash.Class as Hash
import Cardano.Crypto.Libsodium (sodiumInit)
import qualified Cardano.Ledger.Crypto.Blake2b.Incremental as Blake2b
import Cardano.Ledger.Shelley.Genesis (InjectionSource (..), foldInjectionSource)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
import Data.Text (Text)
import System.Directory (removeFile)
import System.IO (BufferMode (..), IOMode (..), hSetBuffering, withFile)

main :: IO ()
main = do
  sodiumInit
  let path = "/tmp/streaming-check.json"
      n = 100_000_000 :: Int

  putStrLn $ "Generating " <> show n <> " entries..."
  fileHash <- generateAndHash path n

  putStrLn "Streaming fold..."
  let source = FromFile path fileHash :: InjectionSource Text Text
      wfh fp k = withFile fp ReadMode k
  count <- foldInjectionSource wfh source (\(!acc) _ -> acc + 1 :: Int) 0

  removeFile path
  putStrLn $ "Entries: " <> show count <> " (expected " <> show n <> ")"

generateAndHash :: FilePath -> Int -> IO (Hash.Hash Blake2b_256 BS.ByteString)
generateAndHash path n = do
  ctx <- Blake2b.blake2bInit @32
  withFile path WriteMode $ \h -> do
    hSetBuffering h (BlockBuffering (Just (64 * 1024)))
    let w bs = BS8.hPut h bs >> Blake2b.blake2bUpdate ctx bs
    w "{"
    mapM_
      ( \i -> do
          let comma = if i == (1 :: Int) then "" else ","
          w $ comma <> "\"key_" <> BS8.pack (show i) <> "\": \"val_" <> BS8.pack (show i) <> "\""
      )
      [1 .. n]
    w "}"
  Blake2b.blake2b256Finalize ctx

Cabal stanzas (to go in the shelley cabal file):

executable hash-check
  main-is: HashCheck.hs
  hs-source-dirs: app
  default-language: Haskell2010
  ghc-options:
    -Wall
    -Wcompat
    -Wincomplete-record-updates
    -Wincomplete-uni-patterns
    -Wredundant-constraints
    -Wpartial-fields
    -Wunused-packages
    -threaded
    -rtsopts
    -with-rtsopts=-N

  build-depends:
    base >=4.18 && <5,
    bytestring,
    cardano-crypto-class,
    cardano-ledger-shelley,
    directory,
    text,

executable streaming-check
  main-is: StreamingCheck.hs
  hs-source-dirs: app
  default-language: Haskell2010
  ghc-options:
    -Wall
    -Wcompat
    -Wincomplete-record-updates
    -Wincomplete-uni-patterns
    -Wredundant-constraints
    -Wpartial-fields
    -Wunused-packages
    -threaded
    -rtsopts
    -with-rtsopts=-N

  build-depends:
    base >=4.18 && <5,
    bytestring,
    cardano-crypto-class,
    cardano-ledger-core,
    cardano-ledger-shelley,
    directory,
    text,

errno <- getErrno
ioException $
errnoToIOError "blake2bFinalize: c_crypto_generichash_blake2b_final" errno Nothing Nothing
BS.packCStringLen (castPtr outptr, outlen)
f-f (Contributor Author):

I just realised that mlsbFinalize needs to be called after this and before returning; the docs for mlsbNew say:

The caller is responsible for deallocating it ('mlsbFinalize') when done with it. The contents of the memory block is undefined.
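The standard way to make that guarantee is Control.Exception.bracket, which runs the release action even if the body throws; the real fix would presumably look like bracket (mlsbNew ...) mlsbFinalize (\buf -> ...). A self-contained illustration of the pattern (the buffer and finalizer here are stand-ins, not the real mlocked types):

```haskell
module Main where

import Control.Exception (bracket)
import Data.IORef (newIORef, readIORef, writeIORef)

main :: IO ()
main = do
  finalized <- newIORef False
  -- acquire / release / use: the release action runs even if the body
  -- throws, which is exactly the guarantee mlsbFinalize needs.
  bracket
    (pure ())                          -- stand-in for mlsbNew
    (\() -> writeIORef finalized True) -- stand-in for mlsbFinalize
    (\() -> putStrLn "using the secure buffer")
  readIORef finalized >>= print  -- True: the finalizer ran
```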

f-f (Contributor Author):

This is fixed in IntersectMBO/cardano-base#632 - I will either wait for that to be released, or backport the changes here once it's merged.

hashBytes <- blake2bFinalize ctx
case Hash.hashFromBytes hashBytes of
Just h -> pure h
Nothing -> error "blake2b256Finalize: unexpected hash size mismatch"
Contributor:

Can this be replaced by a fail from MonadFail?

Collaborator:

It can, but it should not be, since fail for IO is also an exception, but a much uglier one.
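For completeness, throwIO with a dedicated exception type is the cleaner alternative to both: unlike error (ErrorCall) and fail (IOError), it can be caught and rendered precisely. A self-contained sketch with a hypothetical error type (names not from this diff):

```haskell
module Main where

import Control.Exception (Exception, throwIO, try)

-- Hypothetical exception type for this failure mode.
data Blake2bError = Blake2bHashSizeMismatch
  deriving (Show)

instance Exception Blake2bError

-- Sketch of the finalize tail: throw a typed exception instead of
-- calling error or fail when the hash size is wrong.
checkedHash :: Maybe Int -> IO Int
checkedHash (Just h) = pure h
checkedHash Nothing = throwIO Blake2bHashSizeMismatch

main :: IO ()
main = do
  r <- try (checkedHash Nothing)
  case r of
    Left Blake2bHashSizeMismatch -> putStrLn "caught typed error"
    Right h -> print h
```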

Comment on lines +797 to +799
(Nothing, Nothing) -> pure $ Embedded LM.empty
(Just _, Just _) ->
fail "InjectionSource: cannot specify both 'file' and 'data' fields"
Contributor:

Just a thought: perhaps it would be prudent to report via stdout that there is no source in the (Nothing, Nothing) case. Or maybe not 🙂.

Collaborator:

Reporting on stdout implies you are working in IO, but we are in a JSON Parser.

The (Nothing, Nothing) case is a legitimate case, so why should there be any reporting about it?

These are JSON keys, so it makes more sense to use double quotes instead, even though it is slightly less convenient.

Suggested change
(Nothing, Nothing) -> pure $ Embedded LM.empty
(Just _, Just _) ->
fail "InjectionSource: cannot specify both 'file' and 'data' fields"
fail "InjectionSource: cannot specify both \"file\" and \"data\" fields"

-- | Errors that can occur during injection source resolution
data InjectionError
= InjectionHashMismatch !FilePath !Text !Text
| InjectionParseError !FilePath !String
aniketd (Contributor) commented Feb 18, 2026:

I could not find a usage of this InjectionParseError, but I suppose it was intended to be used.


-- | Source for injectable data
--
-- @since 1.19.0.0
Contributor:

The other @since annotations in this PR are 1.18.

Collaborator:

Yeah, I don't think it is very useful for ledger packages to start adding @since, since:

  • for those to be useful we would need to add them consistently for all functions (which would be a lot of work to do retroactively)
  • we do a lot of re-exports of definitions, which totally breaks @since

f-f (Contributor Author):

Right, thanks for the context; I was not sure what our convention was on that. Will remove them 🙂

lehins (Collaborator) left a comment:

This is some preliminary feedback.
So far this is looking really good, despite all my pivoting suggestions 👍

, sgStaking :: ShelleyGenesisStaking
-- ^ 'sgStaking' is intentionally kept lazy, as it can otherwise cause
-- out-of-memory problems in testing and benchmarking.
, sgExtraConfig :: !(Maybe ShelleyExtraConfig)
Collaborator:

Suggested change
, sgExtraConfig :: !(Maybe ShelleyExtraConfig)
, sgExtraConfig :: !(StrictMaybe ShelleyExtraConfig)
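For background on why StrictMaybe matters here: a bang on a Maybe field only forces the outer Just/Nothing constructor (WHNF), so the payload can still hide a thunk, whereas SJust is strict in its payload. A self-contained illustration using a local stand-in for Cardano's StrictMaybe:

```haskell
{-# LANGUAGE BangPatterns #-}

module Main where

-- Local stand-in for Cardano's StrictMaybe: SJust is strict in its payload,
-- so forcing the constructor also forces the value inside it.
data StrictMaybe a = SNothing | SJust !a
  deriving (Show)

main :: IO ()
main = do
  -- Forcing a Maybe only evaluates to the constructor; the payload stays
  -- lazy, so this error thunk is never evaluated.
  let !m = Just (error "unevaluated thunk" :: Int)
  case m of
    Just _ -> putStrLn "Maybe: payload can still be a thunk"
    Nothing -> pure ()
  -- Forcing a StrictMaybe also forces the payload, so no thunk survives.
  let !s = SJust (1 + 1 :: Int)
  print s  -- SJust 2
```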


deriving (Show, Eq, Generic, NFData, NoThunks)

instance Default (InjectionSource k v) where
def = Embedded LM.empty
Collaborator:

Suggested change
def = Embedded LM.empty
def = NoInjection


-- each chunk updates the hash as side effect, then feeds the parser
let chunks = StreamingBS.toChunks (StreamingBS.fromHandle handle)
hashingChunks = Streaming.mapM (\c -> Blake2b.blake2bUpdate ctx c >> pure c) chunks
Collaborator:

Slight simplification

Suggested change
hashingChunks = Streaming.mapM (\c -> Blake2b.blake2bUpdate ctx c >> pure c) chunks
hashingChunks = Streaming.mapM (\c -> c <$ Blake2b.blake2bUpdate ctx c) chunks

Comment on lines +127 to +128
(MonadIO m, MonadFail m) =>
WithFileHandle m ->
Collaborator:

Let's instead use the same interfaces that Consensus already depends on, namely fs-api and io-classes.

All the fails can then be converted into proper exceptions.

Suggested change
(MonadIO m, MonadFail m) =>
WithFileHandle m ->
(HasCallStack, MonadThrow m) =>
HasFS m h ->

