Add streaming interface to era transition for initial funds injection#5572
Conversation
```haskell
instance EncCBOR ShelleyGenesis

instance ToCBOR ShelleyGenesis where
  -- Note: sgExtraConfig is intentionally not serialized for backward compatibility
```
I didn't de/serialize this so it's not breaking for existing files, but I'm not sure what's polite here since there are other breaking changes.
I don't think CBOR for genesis files is used anywhere, so ignoring it is fine from that perspective. However, that will break roundtrip tests, so maybe it would be best to actually add the proper de/serialization for this field
Ah makes sense, will do!
```haskell
  }
  }
  m (NewEpochState era)
registerInitialFunds withFileHandle tc newEpochState = do
```
This function is now monadic, and I did not add a deprecated alternative because the signature propagates pretty much everywhere. Happy to hear feedback on whether we should handle this more gracefully.
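For reference, one low-friction way to keep old pure call sites compiling when a function gains a `Monad m` constraint is to let them instantiate `m ~ Identity`. A minimal base-only sketch with hypothetical stand-in names (a plain `Int` in place of `NewEpochState`):

```haskell
import Data.Functor.Identity (runIdentity)

-- Hypothetical stand-in: a plain Int instead of NewEpochState.
oldRegister :: Int -> Int
oldRegister = (+ 1)

-- The generalized, monadic version; behavior is unchanged for pure callers.
newRegister :: Monad m => Int -> m Int
newRegister = pure . oldRegister

-- Old pure call sites can recover the original signature via Identity.
compatRegister :: Int -> Int
compatRegister = runIdentity . newRegister

main :: IO ()
main = print (compatRegister 41)
```

This doesn't remove the churn in signatures that now thread the monad through, but it does give purely-functional callers a mechanical migration path.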
```haskell
-- | Fold over an injection source: for Embedded sources the data is folded in
-- memory, while for the FromFile source the data is streamed, hashed, and parsed
-- incrementally, with content hash verification at the end.
foldInjectionSource ::
```
I am not 100% happy with this name, but it does describe what this function does 😄
Yeah I was not a big fan of that either, but I'll move to that 👍
```cabal
    Cardano.Ledger.Coin
    Cardano.Ledger.Compactible
    Cardano.Ledger.Core
    Cardano.Ledger.Crypto.Blake2b.Incremental
```
This module should obviously not be here. I noted in the comments that it should be in base, but if we want to merge this before updating base we could move it to other-modules (in the shelley package, I guess). It's exposed right now only so I could test that it does the right thing.
The elephant in the room is "tests": currently there are no tests, and I'm not sure what kind of tests we would like for this. I have a few scripts that I used to verify the behavior; the tests could be some variation of these, feedback welcome.

Hash checking:

```haskell
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeApplications #-}

module Main where

import Cardano.Crypto.Hash (Blake2b_256)
import qualified Cardano.Crypto.Hash.Class as Hash
import Cardano.Crypto.Libsodium (sodiumInit)
import Cardano.Ledger.Shelley.Genesis (InjectionSource (..), foldInjectionSource)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
import Data.Text (Text)
import System.Directory (removeFile)
import System.IO (BufferMode (..), IOMode (..), hSetBuffering, withFile)

main :: IO ()
main = do
  sodiumInit
  let path = "/tmp/hash-check.json"
      n = 1000 :: Int
  generateJsonFile path n
  fileHash <- Hash.hashWith @Blake2b_256 id <$> BS.readFile path
  let source = FromFile path fileHash :: InjectionSource Text Text
      wfh fp k = withFile fp ReadMode k
  count <- foldInjectionSource wfh source (\(!acc) _ -> acc + 1 :: Int) 0
  removeFile path
  putStrLn $ "Entries: " <> show count <> " (expected " <> show n <> ")"

generateJsonFile :: FilePath -> Int -> IO ()
generateJsonFile path n =
  withFile path WriteMode $ \h -> do
    hSetBuffering h (BlockBuffering (Just (64 * 1024)))
    BS8.hPut h "{"
    mapM_
      ( \i -> do
          let comma = if i == (1 :: Int) then "" else ","
          BS8.hPut h $ comma <> "\"key_" <> BS8.pack (show i) <> "\": \"val_" <> BS8.pack (show i) <> "\""
      )
      [1 .. n]
    BS8.hPut h "}"
```

Streaming check: this creates a file with 3GB of JSON, then reads it back with streaming.

```haskell
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE NumericUnderscores #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeApplications #-}

module Main where

import Cardano.Crypto.Hash (Blake2b_256)
import qualified Cardano.Crypto.Hash.Class as Hash
import Cardano.Crypto.Libsodium (sodiumInit)
import qualified Cardano.Ledger.Crypto.Blake2b.Incremental as Blake2b
import Cardano.Ledger.Shelley.Genesis (InjectionSource (..), foldInjectionSource)
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as BS8
import Data.Text (Text)
import System.Directory (removeFile)
import System.IO (BufferMode (..), IOMode (..), hSetBuffering, withFile)

main :: IO ()
main = do
  sodiumInit
  let path = "/tmp/streaming-check.json"
      n = 100_000_000 :: Int
  putStrLn $ "Generating " <> show n <> " entries..."
  fileHash <- generateAndHash path n
  putStrLn "Streaming fold..."
  let source = FromFile path fileHash :: InjectionSource Text Text
      wfh fp k = withFile fp ReadMode k
  count <- foldInjectionSource wfh source (\(!acc) _ -> acc + 1 :: Int) 0
  removeFile path
  putStrLn $ "Entries: " <> show count <> " (expected " <> show n <> ")"

generateAndHash :: FilePath -> Int -> IO (Hash.Hash Blake2b_256 BS.ByteString)
generateAndHash path n = do
  ctx <- Blake2b.blake2bInit @32
  withFile path WriteMode $ \h -> do
    hSetBuffering h (BlockBuffering (Just (64 * 1024)))
    let w bs = BS8.hPut h bs >> Blake2b.blake2bUpdate ctx bs
    w "{"
    mapM_
      ( \i -> do
          let comma = if i == (1 :: Int) then "" else ","
          w $ comma <> "\"key_" <> BS8.pack (show i) <> "\": \"val_" <> BS8.pack (show i) <> "\""
      )
      [1 .. n]
    w "}"
  Blake2b.blake2b256Finalize ctx
```

Cabal stanzas (to go in the shelley cabal file):

```cabal
executable hash-check
  main-is: HashCheck.hs
  hs-source-dirs: app
  default-language: Haskell2010
  ghc-options:
    -Wall
    -Wcompat
    -Wincomplete-record-updates
    -Wincomplete-uni-patterns
    -Wredundant-constraints
    -Wpartial-fields
    -Wunused-packages
    -threaded
    -rtsopts
    -with-rtsopts=-N
  build-depends:
    base >=4.18 && <5,
    bytestring,
    cardano-crypto-class,
    cardano-ledger-shelley,
    directory,
    text,

executable streaming-check
  main-is: StreamingCheck.hs
  hs-source-dirs: app
  default-language: Haskell2010
  ghc-options:
    -Wall
    -Wcompat
    -Wincomplete-record-updates
    -Wincomplete-uni-patterns
    -Wredundant-constraints
    -Wpartial-fields
    -Wunused-packages
    -threaded
    -rtsopts
    -with-rtsopts=-N
  build-depends:
    base >=4.18 && <5,
    bytestring,
    cardano-crypto-class,
    cardano-ledger-core,
    cardano-ledger-shelley,
    directory,
    text,
```
```haskell
    errno <- getErrno
    ioException $
      errnoToIOError "blake2bFinalize: c_crypto_generichash_blake2b_final" errno Nothing Nothing
  BS.packCStringLen (castPtr outptr, outlen)
```
I just realised that mlsbFinalize needs to be called after this and before returning; the docs for mlsbNew say:

> The caller is responsible for deallocating it ('mlsbFinalize') when done with it. The contents of the memory block is undefined.
This is fixed in IntersectMBO/cardano-base#632 - I will either wait for that to be released, or backport the changes here once that's merged.
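Until that lands, a `bracket` around the context's lifetime is the usual way to guarantee the finalizer runs even when the body throws. A minimal base-only sketch, with `putStrLn` and a plain `Int` standing in for `mlsbNew`/`mlsbFinalize` and the mlocked block:

```haskell
import Control.Exception (bracket)

-- 'bracket' guarantees the release action (standing in for mlsbFinalize)
-- runs after the body completes, even if the body throws.
withContext :: (Int -> IO r) -> IO r
withContext =
  bracket
    (putStrLn "alloc (mlsbNew)" >> pure 0)    -- acquire the mlocked block
    (\_ -> putStrLn "release (mlsbFinalize)") -- always deallocate it

main :: IO ()
main = withContext $ \ctx -> print (ctx + 1)
```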
```haskell
hashBytes <- blake2bFinalize ctx
case Hash.hashFromBytes hashBytes of
  Just h -> pure h
  Nothing -> error "blake2b256Finalize: unexpected hash size mismatch"
```
Can this be replaced by a fail from MonadFail?
It can, but it should not be, since fail for IO is also an exception, but a much uglier one.
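To make the trade-off concrete: `fail` in `IO` is `ioError . userError` under the hood, so the caller sees a generic `IOException` rather than a purposeful error. A small base-only demonstration:

```haskell
import Control.Exception (IOException, try)

-- 'fail' in IO surfaces as a generic "user error (...)" IOException,
-- which is legal but uglier than a purposeful 'error' message.
failAsException :: IO String
failAsException = do
  r <- try (fail "hash size mismatch") :: IO (Either IOException ())
  pure (either show (const "no exception") r)

main :: IO ()
main = failAsException >>= putStrLn  -- prints: user error (hash size mismatch)
```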
```haskell
(Nothing, Nothing) -> pure $ Embedded LM.empty
(Just _, Just _) ->
  fail "InjectionSource: cannot specify both 'file' and 'data' fields"
```
Just a thought: perhaps it would be prudent to report via stdout that there is no source in the (Nothing, Nothing) case - or maybe not 🙂.
Reporting on stdout implies you are working in IO, but we are in a json Parser. Besides, the (Nothing, Nothing) case is a legitimate case, so why should there be any reporting about it?
These are json keys, so it makes more sense to use double quotes instead, even though it is slightly less convenient.
```diff
  (Nothing, Nothing) -> pure $ Embedded LM.empty
  (Just _, Just _) ->
-   fail "InjectionSource: cannot specify both 'file' and 'data' fields"
+   fail "InjectionSource: cannot specify both \"file\" and \"data\" fields"
```
```haskell
-- | Errors that can occur during injection source resolution
data InjectionError
  = InjectionHashMismatch !FilePath !Text !Text
  | InjectionParseError !FilePath !String
```
Could not find a usage of this InjectionParseError, but I suppose it was intended to be used.
```haskell
-- | Source for injectable data
--
-- @since 1.19.0.0
```
Other @since are 1.18 in this PR.
Yeah, I don't think it is very useful for ledger packages to start adding @since, since:
- in order for those to be useful we need to be consistently adding them for all functions (which would be a lot of work to add retroactively)
- we do a lot of re-exports of definitions, which totally breaks @since
Right, thanks for the context; I was not sure what our convention was on that. Will remove them 🙂
lehins left a comment:
This is some preliminary feedback.
So far this is looking really good, despite all my pivoting suggestions 👍
```haskell
  , sgStaking :: ShelleyGenesisStaking
  -- ^ 'sgStaking' is intentionally kept lazy, as it can otherwise cause
  -- out-of-memory problems in testing and benchmarking.
  , sgExtraConfig :: !(Maybe ShelleyExtraConfig)
```
```diff
- , sgExtraConfig :: !(Maybe ShelleyExtraConfig)
+ , sgExtraConfig :: !(StrictMaybe ShelleyExtraConfig)
```
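The practical difference this suggestion is after: a strict field is forced as soon as its constructor is evaluated to WHNF, so no thunk can accumulate behind the record field. A base-only sketch with a hypothetical `SMaybe` mirroring `StrictMaybe`:

```haskell
import Control.Exception (ErrorCall, evaluate, try)

-- Hypothetical mirror of StrictMaybe: evaluating SJust to WHNF also
-- forces the payload, so no thunk can hide inside the record field.
data SMaybe a = SNothing | SJust !a

-- True if forcing the outer constructor also forces the payload
strictForces :: IO Bool
strictForces = do
  r <- try (evaluate (SJust (error "thunk") :: SMaybe ())) :: IO (Either ErrorCall (SMaybe ()))
  pure (either (const True) (const False) r)

lazyForces :: IO Bool
lazyForces = do
  r <- try (evaluate (Just (error "thunk") :: Maybe ())) :: IO (Either ErrorCall (Maybe ()))
  pure (either (const True) (const False) r)

main :: IO ()
main = do
  l <- lazyForces   -- False: Just <thunk> is already in WHNF
  s <- strictForces -- True: the strict field forces the payload
  print (l, s)
```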
```haskell
  deriving (Show, Eq, Generic, NFData, NoThunks)

instance Default (InjectionSource k v) where
  def = Embedded LM.empty
```
```diff
- def = Embedded LM.empty
+ def = NoInjection
```
```haskell
-- each chunk updates the hash as side effect, then feeds the parser
let chunks = StreamingBS.toChunks (StreamingBS.fromHandle handle)
    hashingChunks = Streaming.mapM (\c -> Blake2b.blake2bUpdate ctx c >> pure c) chunks
```
Slight simplification
```diff
- hashingChunks = Streaming.mapM (\c -> Blake2b.blake2bUpdate ctx c >> pure c) chunks
+ hashingChunks = Streaming.mapM (\c -> c <$ Blake2b.blake2bUpdate ctx c) chunks
```
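For anyone unfamiliar with the operator: `c <$ action` runs the effect and returns `c`, exactly like `action >> pure c`. A small base-only illustration, using an `IORef` in place of the incremental hash context:

```haskell
import Data.IORef (IORef, modifyIORef', newIORef, readIORef)

-- 'c <$ action' runs the effect and passes c through unchanged, the
-- same as 'action >> pure c'; modifyIORef' stands in for blake2bUpdate.
tagThrough :: IORef Int -> Int -> IO Int
tagThrough ref c = c <$ modifyIORef' ref (+ c)

main :: IO ()
main = do
  ref <- newIORef 0
  cs <- mapM (tagThrough ref) [1, 2, 3]  -- every chunk passes through
  total <- readIORef ref                 -- the state saw every chunk
  print (cs, total)
```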
```haskell
(MonadIO m, MonadFail m) =>
WithFileHandle m ->
```
Let's instead utilize the same interface that Consensus already depends on, namely fs-api and io-classes
All fails can be converted into proper exceptions.
```diff
- (MonadIO m, MonadFail m) =>
- WithFileHandle m ->
+ (HasCallStack, MonadThrow m) =>
+ HasFS m h ->
```
Description
This patch implements the ground work for #5549, to support data injection from genesis files using a streaming framework rather than relying on lazy IO.
The spec is the one that @lehins laid out in the issue, but here we only implement streaming for `initialFunds`, as the rest can be easily added once the approach is solid enough. I have a few code-specific notes that I will add as comments on the diff.
Checklist
- `CHANGELOG.md` files updated for packages with externally visible changes. NOTE: a new section is never added with the code changes. (See RELEASING.md.)
- Versions updated in `.cabal` and `CHANGELOG.md` files when necessary, according to the versioning process.
- Version bounds in `.cabal` files updated when necessary. NOTE: if bounds change in a cabal file, that package itself must have a version increase. (See RELEASING.md.)
- Code formatted (use `scripts/fourmolize.sh`).
- Cabal files formatted (use `scripts/cabal-format.sh`).
- CDDL files regenerated when necessary (use `scripts/gen-cddl.sh`).
- `hie.yaml` updated (use `scripts/gen-hie.sh`).