WIP feat(patterns): pattern-based compression take2 #1584
Draft: erights wants to merge 1 commit into markm-prepare-for-extended-matchers from markm-pattern-based-compression-2
Staged on #2248
closes: #2112
refs: #1564 Agoric/agoric-sdk#6432
Description
Adds two new exports to @endo/patterns: `mustCompress` and its "inverse", `mustDecompress`.
(From Agoric/agoric-sdk#6432 (comment) ):
For example without compression, the Zoe proposal
is stored with a smallcaps body of
'#{"exit":{"afterDeadline":{"deadline":"+11","timer":"$0.Alleged: timer"}},"give":{"Bid":{"brand":"$1.Alleged: simoleans","value":"+37"}},"want":{"Winnings":{"brand":"$2.Alleged: moola","value":{"#tag":"copyBag","payload":[[{"foo":"c"},"+1"],[{"foo":"b"},"+1"],[{"foo":"a"},"+1"]]}}}}'
But it compresses with the proposalShape
to
whose smallcaps body is
'#[[["c"],["b"],["a"]],"+37","+11"]'
which is 12% as long.
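As a quick check of the figure above, the two smallcaps bodies quoted in this description can be compared directly. This is only a sketch; the variable names are illustrative and nothing here calls into @endo/patterns or @endo/marshal.

```javascript
// The two smallcaps bodies, copied verbatim from the description above.
const uncompressedBody =
  '#{"exit":{"afterDeadline":{"deadline":"+11","timer":"$0.Alleged: timer"}},"give":{"Bid":{"brand":"$1.Alleged: simoleans","value":"+37"}},"want":{"Winnings":{"brand":"$2.Alleged: moola","value":{"#tag":"copyBag","payload":[[{"foo":"c"},"+1"],[{"foo":"b"},"+1"],[{"foo":"a"},"+1"]]}}}}';
const compressedBody = '#[[["c"],["b"],["a"]],"+37","+11"]';

// Ratio of compressed length to uncompressed length, in characters.
const ratio = compressedBody.length / uncompressedBody.length;
console.log(`compressed body is ${Math.round(ratio * 100)}% as long`);

// Both bodies shown start with '#'; what follows in the compressed body
// is plain JSON, so the nested-array structure can be inspected directly.
const compressedValue = JSON.parse(compressedBody.slice(1));
console.log(compressedValue);
```

The parsed value makes the nested-array shape visible: only the bag keys, the bid value, and the deadline survive, and everything else is expected to be recovered from the pattern.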
It would take much more work, but if we were able to use matching interface guards on the sending and receiving sides, we'd get similar savings for messages. Agoric/agoric-sdk#6355 may help get there. But note the difficulties explained in "Upgrade Considerations" below.
`mustCompress` is analogous to `mustMatch`, which, as a reminder, has the form `mustMatch(specimen, pattern, label?)`.

The following equivalences must hold:

- `mustMatch(s,p,l1?)` must succeed iff `mustCompress(s,p,l2?)` succeeds. When they succeed, the label does not matter. When they fail, the `label` serves to make the diagnostic more informative. Thus, one throws iff the other throws. The diagnostics are not necessarily the same.
- `mustMatch(s,p,l1?)`, and therefore `mustCompress(s,p,l2?)`, succeeds iff `matches(s,p) === true`.
- `mustCompress(s,p,l?) === c` iff `mustDecompress(c,p,l) === s2`, where `s` and `s2` have the same distributed object semantics: `compareRank(s, s2) === 0`, `isKey(s) === isKey(s2)`, and `isKey(s) => keyEQ(s,s2)`.

The point is that typically `c` is smaller than `s`, though in some cases it may be larger. The space savings should typically be similar to the space savings from schema-based encodings like protobuf or Cap'n Proto. The pattern is analogous to the schema. Anything that must be in all specimens that match a given pattern can be omitted from the compressed form, since those parts can be recovered from the pattern on decompression. Unlike schema-based compression, this can include dynamic elements like brand identity, potentially resulting in greater savings and tighter error checking.

Unlike schema-based compression schemes like protobuf or Cap'n Proto, the layering here makes compression mostly independent of encoding/serialization, as shown by the above example: the compression is independent of whether the result will be encoded with smallcaps, and the smallcaps encoding is independent of whether its input was a compressed or uncompressed specimen. Or rather, mostly independent. We chose a nested-array compression because of its compact JSON representation, preserved by smallcaps.
Security Considerations
If sender and receiver can be led into compressing and decompressing with different patterns, or with different compression/decompression algorithms associated with that pattern's matchers, then compressed data might be decompressed into something arbitrarily different from what the sender meant to send. See "Upgrade Considerations" below.
Aside from that, none.
Scaling Considerations
The whole point. Compression could result in tremendously less data stored, sent, and received. Unfortunately, so far, the informal measurements of the time taken to compress are not encouraging. This needs to be measured carefully, and probably needs to be improved tremendously, before this PR is ready for production use. Ideally:

- `encode(mustCompress(data, pattern))` typically takes both less time and less space than `mustMatch(data, pattern) && encode(data)`.
- `mustDecompress(decode(encodedCompressedData), pattern)` typically takes less time than `decode(encodedUncompressedData)`.

This will depend of course on what `encode` scheme is used.

Documentation Considerations
Testing Considerations
Already includes good manual tests.
Compatibility Considerations
A big advantage of the smallcaps encoding of an uncompressed specimen is that the result is still mostly human readable, and processable using JSON-oriented tooling like jq. The compressed form loses both of these benefits, which also calls into question whether there's any point in smallcaps encoding the compressed form rather than using an unreadable binary encoding like `compactOrdered`, `syrup` or `cbor`.

`compactOrdered` is both rank equality preserving and rank order preserving. Holding the pattern constant, `compactOrdered` of the compressed form would still be rank equality preserving, but not rank order preserving. Thus, stores will probably continue to encode their keys using `compactOrdered` on the uncompressed form, forfeiting the opportunity to use `keyShape` for compression.

Upgrade Considerations
When the compressed form is communicated instead of the uncompressed form, the sender and receiver must agree precisely on the pattern. If a different pattern is used to decompress than was used to compress, the compressed data might silently decompress into data arbitrarily different from the original specimen. The best way to ensure agreement is to send the pattern as well, somehow, from the sender to the receiver. For small data, this may cost more space than it saves.
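The silent-mismatch hazard can be seen even in a deliberately tiny model. The following sketch uses hypothetical names (`HOLE`, `flatCompress`, `flatDecompress`) and flat record patterns only; it is not how @endo/patterns compresses, but it shows why a pattern disagreement produces wrong data rather than an error.

```javascript
// Toy model only; these names are hypothetical, not part of @endo/patterns.
const HOLE = Symbol('HOLE');

// Keep only the fields the pattern leaves open, in sorted-key order.
// (For brevity this toy does not verify that the literal fields match.)
const flatCompress = (specimen, pattern) =>
  Object.keys(pattern)
    .sort()
    .filter(k => pattern[k] === HOLE)
    .map(k => specimen[k]);

// Fill the pattern's open fields from the compressed array, in the same order.
const flatDecompress = (compressed, pattern) => {
  let i = 0;
  const entries = [];
  for (const k of Object.keys(pattern).sort()) {
    if (pattern[k] === HOLE) {
      entries.push([k, compressed[i]]);
      i += 1;
    } else {
      entries.push([k, pattern[k]]);
    }
  }
  return Object.fromEntries(entries);
};

const senderPattern = { brand: 'moola', value: HOLE };
const receiverPattern = { brand: 'simoleans', value: HOLE };

const sent = flatCompress({ brand: 'moola', value: 37 }, senderPattern);
const intended = flatDecompress(sent, senderPattern);
const received = flatDecompress(sent, receiverPattern);

// No error is raised, yet the receiver reconstructs a different brand.
console.log(intended, received);
```

Because the compressed form carries no trace of the omitted literals, nothing in it lets the receiver detect the disagreement, which is why the pattern itself (or a reliable reference to it) has to travel with the data.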
SwingSet already stores optional patterns with some large data stores, with an error check to ensure that the data matches the pattern: `keyShape`, `valueShape`, and `stateShape`. Agoric/agoric-sdk#6432 modifies SwingSet to also use the `valueShape` and `stateShape` for compression.

A pattern is a tree of copy-data to be matched literally (the key-like parts), and Matchers, typically expressed in code like `M.bagOf(keyShape, countShape)` in the example above. The overall compression/decompression algorithms are composed from compression/decompression algorithms for each matcher kind. Not only must the sender and receiver agree exactly on the pattern, they must agree exactly on the algorithms associated with each matcher in the pattern. But we'd also like to improve these over time. Thus, this PR includes in each matcher kind definition an optional version number of the compression algorithm it uses. If omitted, that matcher does not compress. Version numbers are assigned in increasing sequence starting with `1`. The algorithm associated with a given sequence number must never change. If a given version of endo supports matcher M sequence number N, then it should also support all sequence numbers prior to N, unless there is a compelling reason to retire an old one.

The `M.something(...)` matcher makers should generally produce a matcher with the latest locally supported sequence number. Thus, this system supports older senders sending to newer receivers. This works fine for intra-vat storage, as in Agoric/agoric-sdk#6432, since intra-vat storage communicates data only forward in time/versions. However, inter-vat communications must tolerate some version slippage in both directions, which will require design of some kind of pattern negotiation.

- [ ] Includes `*BREAKING*:` in the commit message with migration instructions for any breaking change.

This PR itself does not introduce any breaking changes. But PRs based on it will have more hazards of breaking changes as explained above.

- `NEWS.md` for user-facing changes.

Many of the points made in this PR note should be summarized in a NEWS.md entry.
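The versioning rule described under "Upgrade Considerations" can be sketched as a small registry. Everything below is hypothetical and for illustration only: `makeCodecRegistry`, the `'bagOf'` kind label, and the two stand-in codecs are assumptions of this sketch, not the data structures or algorithms used by this PR.

```javascript
// Hypothetical registry: matcher kind -> version number -> codec.
const makeCodecRegistry = () => {
  const kinds = new Map();
  return {
    register(kind, version, codec) {
      const versions = kinds.get(kind) ?? new Map();
      if (versions.has(version)) {
        // The algorithm behind a given version number must never change.
        throw new Error(`${kind} v${version} is already defined`);
      }
      versions.set(version, codec);
      kinds.set(kind, versions);
    },
    // Matcher makers use the latest locally supported version.
    latest(kind) {
      const versions = kinds.get(kind);
      return versions ? Math.max(...versions.keys()) : undefined;
    },
    lookup(kind, version) {
      const codec = kinds.get(kind)?.get(version);
      if (!codec) throw new Error(`${kind} v${version} is not supported here`);
      return codec;
    },
  };
};

// Stand-in codecs over [key, count] entries.
const bagV1 = { compress: xs => xs, decompress: xs => xs };
const bagV2 = {
  // Omit counts equal to 1; restore them on decompression.
  compress: xs => xs.map(([k, n]) => (n === 1 ? [k] : [k, n])),
  decompress: xs => xs.map(([k, n = 1]) => [k, n]),
};

const olderEndo = makeCodecRegistry();
olderEndo.register('bagOf', 1, bagV1);

const newerEndo = makeCodecRegistry();
newerEndo.register('bagOf', 1, bagV1);
newerEndo.register('bagOf', 2, bagV2);

// Older sender, newer receiver: the receiver still supports version 1.
const oldVersion = olderEndo.latest('bagOf');
const oldPayload = olderEndo.lookup('bagOf', oldVersion).compress([['a', 1]]);
const readByNewer = newerEndo.lookup('bagOf', oldVersion).decompress(oldPayload);

// Newer sender, older receiver: the lookup fails instead of mis-decoding.
const newVersion = newerEndo.latest('bagOf');
let olderReceiverError;
try {
  olderEndo.lookup('bagOf', newVersion);
} catch (e) {
  olderReceiverError = e;
}
console.log(readByNewer, olderReceiverError && olderReceiverError.message);
```

Recording the version alongside the compressed data, as this sketch assumes, turns the newer-to-older case into an explicit failure, which is the situation a pattern negotiation scheme for inter-vat communication would need to handle.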