-
Notifications
You must be signed in to change notification settings - Fork 1.5k
refactor!: Centralize KVS value semantics in @crawlee/core #3774
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
14 commits
Select commit
Hold shift + click to select a range
2a002f1
refactor!: Centralize KVS value semantics
janbuchar d0e3433
Missing awaits
janbuchar de6c16b
Add failing tests according to review feedback
janbuchar 76923d1
Fix serializeValue logic
janbuchar c5707e2
Add KeyValueStore.getRecord method
janbuchar cd4e933
Avoid empty string sentinels
janbuchar 441e794
Tighten down test assertions
janbuchar 12ff6e8
Extract shared buffer/stream detection to utils, handle web streams c…
janbuchar 22a4475
Fix dependencies
janbuchar aef1874
Improve KeyValueStoreRecord.value type, patch up FS storage roundrip
janbuchar b75ad02
Tighten down types
janbuchar a9a77e2
Tighten the types down for real
janbuchar 3b67dc9
Tidy up loose ends in the KVS value-semantics refactor
janbuchar 19fb52a
Fix fs-storage tests for the byte-transport contract
janbuchar File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,138 @@ | ||
| import type { Dictionary } from '@crawlee/types'; | ||
| import contentTypeParser from 'content-type'; | ||
| import { isBuffer, isStream } from '@crawlee/utils'; | ||
| import JSON5 from 'json5'; | ||
|
|
||
| import { jsonStringifyExtended } from '@apify/utilities'; | ||
|
|
||
| const CONTENT_TYPE_JSON = 'application/json'; | ||
| const STRINGIFIABLE_CONTENT_TYPE_RXS = [new RegExp(`^${CONTENT_TYPE_JSON}$`, 'i'), /^application\/.*xml$/i, /^text\//i]; | ||
|
|
||
| /** | ||
| * Canonical write path for key-value store records. | ||
| * | ||
| * When a content type is provided, the value passes through unchanged — it is the caller's | ||
| * responsibility to supply a String/Buffer/Stream (the frontend validates this). | ||
| * | ||
| * When no content type is provided, it is inferred from the value's shape: | ||
| * - Buffer / typed array / ArrayBuffer / stream → `application/octet-stream` (passthrough) | ||
| * - `string` → `text/plain; charset=utf-8` (passthrough) | ||
| * - anything else → `application/json; charset=utf-8` (serialized via `jsonStringifyExtended`) | ||
| * | ||
| * Does NOT drain streams — that is storage mechanics and stays in the storage client. | ||
| * | ||
| * Backend-independent. | ||
| */ | ||
| export function serializeValue( | ||
| value: unknown, | ||
| contentType?: string, | ||
| ): { | ||
| value: Buffer | ArrayBuffer | ArrayBufferView | string | NodeJS.ReadableStream | ReadableStream; | ||
| contentType: string; | ||
| } { | ||
| if (contentType !== null && contentType !== undefined) { | ||
| return { value: value as Buffer | string | NodeJS.ReadableStream | ReadableStream, contentType }; | ||
| } | ||
|
|
||
| if (isStream(value) || isBuffer(value)) { | ||
| return { | ||
| value, | ||
| contentType: 'application/octet-stream', | ||
| }; | ||
| } | ||
|
|
||
| if (typeof value === 'string') { | ||
| return { value, contentType: 'text/plain; charset=utf-8' }; | ||
| } | ||
|
|
||
| let serialized: string; | ||
| try { | ||
| // Format JSON to simplify debugging, the overheads with compression is negligible | ||
| serialized = jsonStringifyExtended(value as Dictionary, null, 2); | ||
| } catch (e) { | ||
| const error = e as Error; | ||
| // Give more meaningful error message | ||
| if (error.message?.includes('Invalid string length')) { | ||
| error.message = 'Object is too large'; | ||
| } | ||
| throw new Error(`The "value" parameter cannot be stringified to JSON: ${error.message}`); | ||
| } | ||
|
|
||
| if (serialized === undefined) { | ||
| throw new Error( | ||
| 'The "value" parameter was stringified to JSON and returned undefined. ' + | ||
| "Make sure you're not trying to stringify an undefined value.", | ||
| ); | ||
| } | ||
|
|
||
| return { value: serialized, contentType: 'application/json; charset=utf-8' }; | ||
| } | ||
|
|
||
| /** | ||
| * Parses a Buffer or ArrayBuffer using the provided content type header. | ||
| * | ||
| * - application/json is returned as a parsed object. | ||
| * - application/*xml and text/* are returned as strings. | ||
| * - everything else is returned as original body. | ||
| * | ||
| * If the header includes a charset, the body will be stringified only | ||
| * if the charset represents a known encoding to Node.js or Browser. | ||
| * | ||
| * Backend-independent — this is the canonical read path for the {@apilink KeyValueStore} frontend. | ||
| */ | ||
| export function parseValue( | ||
| body: Buffer | ArrayBuffer | string, | ||
| contentTypeHeader: string | null, | ||
| ): string | Buffer | ArrayBuffer | Record<string, unknown> { | ||
| // No content type at all → we have no basis for interpretation; hand back the raw value. | ||
| if (contentTypeHeader === null) return body; | ||
|
|
||
| let contentType: string; | ||
| let charset: BufferEncoding; | ||
| try { | ||
| const result = contentTypeParser.parse(contentTypeHeader); | ||
| contentType = result.type; | ||
| charset = result.parameters.charset as BufferEncoding; | ||
| } catch { | ||
| // Unparseable header → keep the original value rather than a mangled string. | ||
| return body; | ||
| } | ||
|
|
||
| // If we can't successfully interpret it, we return the original value rather than mangling it. | ||
| if (!areDataStringifiable(contentType, charset)) return body; | ||
|
|
||
| // Decode raw bytes using the resolved charset. An already-decoded string passes through (callers | ||
| // may hand us one directly), avoiding a needless re-encode round-trip. | ||
| const dataString = typeof body === 'string' ? body : isomorphicBufferToString(body, charset); | ||
|
|
||
| return contentType === CONTENT_TYPE_JSON ? JSON5.parse(dataString) : dataString; | ||
| } | ||
|
|
||
| function isomorphicBufferToString(buffer: Buffer | ArrayBuffer, encoding: BufferEncoding): string { | ||
|
janbuchar marked this conversation as resolved.
|
||
| if (buffer.constructor.name !== ArrayBuffer.name) { | ||
| return (buffer as Buffer).toString(encoding); | ||
| } | ||
|
|
||
| // In Node, wrap the ArrayBuffer in a Buffer so the resolved charset is honored (the caller already | ||
| // checked it via `Buffer.isEncoding`). Only the browser, which lacks Buffer, is limited to UTF-8. | ||
| if (typeof Buffer !== 'undefined') { | ||
| return Buffer.from(buffer as ArrayBuffer).toString(encoding); | ||
| } | ||
|
|
||
| const decoder = new TextDecoder(encoding); | ||
| return decoder.decode(new Uint8Array(buffer as ArrayBuffer)); | ||
| } | ||
|
|
||
| function isCharsetStringifiable(charset: string): charset is BufferEncoding { | ||
| if (!charset) return true; // hope that it's utf-8 | ||
| return Buffer.isEncoding(charset); | ||
| } | ||
|
|
||
| function isContentTypeStringifiable(contentType: string): boolean { | ||
| if (!contentType) return false; // keep buffer | ||
| return STRINGIFIABLE_CONTENT_TYPE_RXS.some((rx) => rx.test(contentType)); | ||
| } | ||
|
|
||
| function areDataStringifiable(contentType: string, charset: string): boolean { | ||
| return isContentTypeStringifiable(contentType) && isCharsetStringifiable(charset); | ||
| } | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.