Skip to content

[draft] feat: add embeddings field for hybrid search#78

Open
yutin1987 wants to merge 1 commit into
masterfrom
feature/embedding-schema
Open

[draft] feat: add embeddings field for hybrid search#78
yutin1987 wants to merge 1 commit into
masterfrom
feature/embedding-schema

Conversation

@yutin1987

Copy link
Copy Markdown

No description provided.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for dense-vector embeddings across the airesponses, articles, and replies schemas to facilitate hybrid search. The changes include updates to Zod schemas and Elasticsearch mappings, along with version increments. Feedback suggests extracting the duplicated embeddings schema into a shared utility and strengthening Zod validation—specifically enforcing a vector length of 768 and ensuring offsets are non-negative integers—to maintain consistency with the Elasticsearch mapping requirements.

Comment thread schema/airesponses.ts
Comment on lines +43 to +53
embeddings: z
.array(
z
.object({
vector: z.array(z.number()),
startOffsetSec: z.number().optional(),
endOffsetSec: z.number().optional(),
})
.strict()
)
.optional(),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The embeddings schema is duplicated across airesponses.ts, articles.ts, and replies.ts. To improve maintainability and ensure consistency, consider extracting this definition into a shared schema in util/sharedSchema.ts. Additionally, the validation should be strengthened to match the Elasticsearch mapping requirements: 1. Use .length(768) on the vector array to prevent indexing errors (as the dense_vector mapping specifies dims: 768). 2. Use .int().nonnegative() for startOffsetSec and endOffsetSec to align with the integer type in Elasticsearch.

Suggested change
embeddings: z
.array(
z
.object({
vector: z.array(z.number()),
startOffsetSec: z.number().optional(),
endOffsetSec: z.number().optional(),
})
.strict()
)
.optional(),
embeddings: z
.array(
z
.object({
vector: z.array(z.number()).length(768),
startOffsetSec: z.number().int().nonnegative().optional(),
endOffsetSec: z.number().int().nonnegative().optional(),
})
.strict()
)
.optional(),

Comment thread schema/articles.ts
Comment on lines +149 to +159
embeddings: z
.array(
z
.object({
vector: z.array(z.number()),
startOffsetSec: z.number().optional(),
endOffsetSec: z.number().optional(),
})
.strict()
)
.optional(),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This embeddings schema is identical to the one in airesponses.ts and replies.ts. It is recommended to extract it to a shared utility. Also, adding length validation for the vector (768) and integer validation for the offsets will ensure the data conforms to the Elasticsearch mapping.

Suggested change
embeddings: z
.array(
z
.object({
vector: z.array(z.number()),
startOffsetSec: z.number().optional(),
endOffsetSec: z.number().optional(),
})
.strict()
)
.optional(),
embeddings: z
.array(
z
.object({
vector: z.array(z.number()).length(768),
startOffsetSec: z.number().int().nonnegative().optional(),
endOffsetSec: z.number().int().nonnegative().optional(),
})
.strict()
)
.optional(),

Comment thread schema/replies.ts
Comment on lines +40 to +50
embeddings: z
.array(
z
.object({
vector: z.array(z.number()),
startOffsetSec: z.number().optional(),
endOffsetSec: z.number().optional(),
})
.strict()
)
.optional(),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This embeddings schema is identical to the one in airesponses.ts and articles.ts. It is recommended to extract it to a shared utility. Also, adding length validation for the vector (768) and integer validation for the offsets will ensure the data conforms to the Elasticsearch mapping.

Suggested change
embeddings: z
.array(
z
.object({
vector: z.array(z.number()),
startOffsetSec: z.number().optional(),
endOffsetSec: z.number().optional(),
})
.strict()
)
.optional(),
embeddings: z
.array(
z
.object({
vector: z.array(z.number()).length(768),
startOffsetSec: z.number().int().nonnegative().optional(),
endOffsetSec: z.number().int().nonnegative().optional(),
})
.strict()
)
.optional(),

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant