Skip to content

Conversation

@bfoley13
Copy link
Collaborator

Reason for Change:
This is the first PR in line with the AutoIndexer Proposal that create the AutoIndexer api spec. The proposal lays out the definitions and reasons for them. This PR includes the spec definition, additions to the Makefile and output of the run of make manifests, as well as the

Requirements

  • added unit tests and e2e tests (if applicable).

Issue Fixed:

Notes for Reviewers:

: AutoIndexer api spec
@kaito-pr-agent
Copy link

Title

feat: Add AutoIndexer API and controller implementation


Description

  • Introduced AutoIndexer API for automated document indexing

  • Added validation logic for AutoIndexer specifications

  • Implemented unit tests for validation scenarios

  • Created Helm chart for AutoIndexer deployment


Changes walkthrough 📝

Relevant files
Api
1 files
autoindexer_types.go
Define AutoIndexer CRD structure                                                 
+230/-0 
Validation
1 files
autoindexer_validation.go
Implement validation for AutoIndexer                                         
+182/-0 
Tests
1 files
autoindexer_validation_test.go
Add unit tests for validation logic                                           
+93/-0   
Configuration
1 files
condition_types.go
Add AutoIndexer condition types                                                   
+8/-6     
Defaulting
1 files
autoindexer_default.go
Add defaulting logic for AutoIndexer                                         
+22/-0   
Helm
4 files
_helpers.tpl
Add Helm template helpers                                                               
+52/-0   
deployment.yaml
Create Deployment template                                                             
+82/-0   
values.yaml
Set Helm chart values                                                                       
+35/-0   
Chart.yaml
Define Helm chart metadata                                                             
+23/-0   
Crd
2 files
kaito.sh_autoindexers.yaml
Define AutoIndexer CRD                                                                     
+321/-0 
kaito.sh_autoindexers.yaml
Add generated CRD manifest                                                             
+321/-0 
Rbac
2 files
clusterrole.yaml
Define ClusterRole permissions                                                     
+53/-0   
role.yaml
Define Role permissions                                                                   
+21/-0   
Logging
1 files
knative-logging-configmap.yaml
Add logging configuration                                                               
+35/-0   
Security
1 files
secret-webhook-cert.yaml
Add certificate secret template                                                   
+11/-0   
Additional files
6 files
Makefile +1/-0     
clusterrole_binding.yaml +14/-0   
role_binding.yaml +15/-0   
service.yaml +20/-0   
serviceaccount.yaml +7/-0     
webhooks.yaml +26/-0   

Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.
  • @kaito-pr-agent
    Copy link

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Schedule Validation

    The schedule validation logic only checks for basic cron format indicators but doesn't validate against the full regex pattern used in the CRD spec. This could allow invalid schedules to pass validation.

    if a.Spec.Schedule != nil && *a.Spec.Schedule != "" {
    	// Basic check for cron or @every format
    	s := *a.Spec.Schedule
    	if !(s[0] == '@' || len(s) >= 5) {
    		errs = errs.Also(apis.ErrInvalidValue(s, "schedule"))
    	}
    }
    Git Path Validation

    The git path validation regex patterns might be too restrictive for real-world use cases. Consider testing with more diverse path patterns to ensure flexibility.

    	patSimple           = regexp.MustCompile(`^([a-zA-Z0-9._-]+)(/[a-zA-Z0-9._-]+)*$`)
    	patStar             = regexp.MustCompile(`^(\*|\*\*/[a-zA-Z0-9._-]+|[a-zA-Z0-9._-]+/\*|[a-zA-Z0-9._-]+/\*\.[a-zA-Z0-9]+|\*\.[a-zA-Z0-9]+)$`)
    	patGlobstar         = regexp.MustCompile(`^(\*\*/[a-zA-Z0-9._-]+(/[a-zA-Z0-9._-]+)*)$`)
    	patTrailingGlobstar = regexp.MustCompile(`^([a-zA-Z0-9._-]+(/[a-zA-Z0-9._-]+)*)/\*\*$`)
    	patMiddleGlobstar   = regexp.MustCompile(`^([a-zA-Z0-9._-]+/)?\*\*/[a-zA-Z0-9._-]+(/[a-zA-Z0-9._-]+)*$`)
    )
    isValidGitignorePattern := func(p string) bool {
    	if p == "" {
    		return false
    	}
    	if patSimple.MatchString(p) {
    		return true
    	}
    	if patStar.MatchString(p) {
    		return true
    	}
    	if patGlobstar.MatchString(p) {
    		return true
    	}
    	if patTrailingGlobstar.MatchString(p) {
    		return true
    	}
    	if patMiddleGlobstar.MatchString(p) {
    		return true
    	}
    	return false
    }
    for i, p := range g.Paths {
    	if !isValidGitignorePattern(p) {
    		return apis.ErrInvalidValue(p, "paths").ViaFieldIndex("paths", i)
    	}
    }
    for i, p := range g.ExcludePaths {
    	if !isValidGitignorePattern(p) {
    		return apis.ErrInvalidValue(p, "excludePaths").ViaFieldIndex("excludePaths", i)
    	}
    }
    Test Coverage Gap

    Add test cases for valid schedule patterns and credentials with proper secretRef to improve validation coverage.

    {"schedule invalid", func(a *AutoIndexer) {
    	s := "bad"
    	a.Spec.Schedule = &s
    }, true},

    @kaito-pr-agent
    Copy link

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Fix invalid multiline regex pattern

    The regex pattern for the schedule field is split across two lines which will cause
    YAML parsing errors. The pattern string must be on a single line without breaks to
    be valid YAML syntax.

    config/crd/bases/kaito.sh_autoindexers.yaml [170-172]

    -pattern: ^(@(annually|yearly|monthly|weekly|daily|hourly|reboot))|(@every (\d+(ns|us|µs|ms|s|m|h))+)|((((\d+,)+\d+|(\d+(\/|-)\d+)|\d+|\*)
    -              ?){5,7})$
    +pattern: ^(@(annually|yearly|monthly|weekly|daily|hourly|reboot))|(@every (\d+(ns|us|µs|ms|s|m|h))+)|((((\d+,)+\d+|(\d+(\/|-)\d+)|\d+|\*) ?){5,7})$
    Suggestion importance[1-10]: 10

    __

    Why: The multiline regex pattern would cause YAML parsing errors and prevent CRD validation. This is a critical fix for functionality.

    High
    Strengthen schedule regex validation

    The current schedule validation is too lenient and doesn't match the CRD's regex
    pattern.
    Use the same regex pattern defined in the CRD to ensure consistent
    validation between
    the API server and application logic. This prevents invalid
    schedules from being accepted.

    api/v1alpha1/autoindexer_validation.go [77-83]

     if a.Spec.Schedule != nil && *a.Spec.Schedule != "" {
    -    // Basic check for cron or @every format
         s := *a.Spec.Schedule
    -    if !(s[0] == '@' || len(s) >= 5) {
    +    pattern := regexp.MustCompile(`^(@(annually|yearly|monthly|weekly|daily|hourly|reboot))|(@every (\d+(ns|us|µs|ms|s|m|h))+)|((((\d+,)+\d+|(\d+(\/|-)\d+)|\d+|\*) ?){5,7})$`)
    +    if !pattern.MatchString(s) {
             errs = errs.Also(apis.ErrInvalidValue(s, "schedule"))
         }
     }
    Suggestion importance[1-10]: 8

    __

    Why: Ensures consistent validation with the CRD regex pattern, preventing invalid schedules from being accepted by the webhook.

    Medium
    Ensure consistent enum quoting

    The enum value 'Unknown' is unquoted while others are quoted, causing YAML
    inconsistency. All enum values should be consistently quoted to prevent parsing
    errors.

    config/crd/bases/kaito.sh_autoindexers.yaml [233-236]

     enum:
     - "True"
     - "False"
    -- Unknown
    +- "Unknown"
    Suggestion importance[1-10]: 7

    __

    Why: Inconsistent quoting could cause YAML parsing issues. While not critical, it improves schema reliability.

    Medium
    Security
    Restrict URL schemes to http/https

    Add validation to restrict URL schemes to http/https only. Allowing other schemes
    like ftp or file
    could introduce security vulnerabilities or unexpected behavior in
    the indexing process.

    api/v1alpha1/autoindexer_validation.go [169-175]

     for i, url := range s.URLs {
    -    // Basic URL validation
         u, err := apis.ParseURL(url)
         if err != nil || u.Scheme == "" || u.Host == "" {
             return apis.ErrInvalidValue(url, "urls").ViaFieldIndex("urls", i)
         }
    +    if u.Scheme != "http" && u.Scheme != "https" {
    +        return apis.ErrInvalidValue(url, "urls").ViaFieldIndex("urls", i).WithDetails("only http and https schemes are allowed")
    +    }
     }
    Suggestion importance[1-10]: 7

    __

    Why: Improves security by disallowing potentially unsafe URL schemes, but note that the system might only support http/https.

    Medium
    General
    Clarify Git data source name

    Rename 'DataSourceTypeGitHub' to 'DataSourceTypeGit' since it represents generic Git
    repositories,
    not just GitHub. The current name is misleading and could cause
    confusion for users
    implementing non-GitHub repositories.

    api/v1alpha1/autoindexer_types.go [93-96]

     const (
    -    DataSourceTypeGitHub DataSourceType = "Git"
    +    DataSourceTypeGit DataSourceType = "Git"
         DataSourceTypeStatic DataSourceType = "Static"
     )
    Suggestion importance[1-10]: 5

    __

    Why: Renames the constant for clarity, but it's a refactoring that doesn't affect functionality.

    Low
    Update test for Git constant

    Update the test to use the renamed constant 'DataSourceTypeGit' for consistency.

    This ensures tests align with the actual implementation and prevents false test
    failures
    after the constant renaming.

    api/v1alpha1/autoindexer_validation_test.go [23]

     validSpec := AutoIndexerSpec{
         RAGEngine:  "rag",
         IndexName:  "my-index",
    -    DataSource: DataSourceSpec{Type: DataSourceTypeGitHub, Git: validGit},
    +    DataSource: DataSourceSpec{Type: DataSourceTypeGit, Git: validGit},
     }
    Suggestion importance[1-10]: 3

    __

    Why: Necessary only if the constant is renamed, and it's a minor test update.

    Low

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Projects

    Status: No status

    Development

    Successfully merging this pull request may close these issues.

    1 participant