-
Notifications
You must be signed in to change notification settings - Fork 12
Introduce a new, simplified documentation scraping backend #347
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
What I am calling "backendv2". This is intended to be a simpler, more maintainable replacement for the existing backend, with improved performance and reliability. This approach aims to reduce some of the layers of abstraction but also make much better use of go worktrees for efficient git operations. It also introduces the idea of using a database to track a lot of the state of the system, rather than relying on filesystem state and S3 objects. Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Deploying registry-ui with
|
| Latest commit: |
e56e071
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://98f1690a.registry-ui-dxi.pages.dev |
| Branch Preview URL: | https://backendv2.registry-ui-dxi.pages.dev |
Signed-off-by: James Humphries <[email protected]>
| latestURL = nightliesBaseURL + "/nightlies/latest.json" | ||
| ) | ||
|
|
||
| // TODO: Investigate if we should bring in tofudl here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that nightly support was merged, we should use it here + dogfood it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that we're actively maintaining tofudl for the language server as well, but I wasn't sure if we were recommending that it was the right way to download nightlies.
This ties into a conversation I had with @diofeher thismorning. He made a small comment about nightlies and their support here: opentofu/opentofu#3193 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree here.
On top of that, the version I've implemented in tofudl has an additional benefit of SHA verification.
| for _, file := range licenseFiles { | ||
| for _, l := range filesWithLicenses[file] { | ||
| // Exit early (keeping in mind the sort order above) | ||
| if l.Confidence >= d.config.ConfidenceOverrideThreshold { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to store all the results in the DB and filter there instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, maybe I could store all results over a certain threshold, thats not a bad idea.
| if _, err := os.Stat(filepath.Join(localPath, ".git")); err == nil { | ||
| repository, err := git.PlainOpen(localPath) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to open existing repository at %s: %w", localPath, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to try to wipe the dir and start over here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can wipe the dir, I don't think we need to start over and re-clone again at this level. maybe we could add retries to cloning though.
backendv2/pkg/git/repo.go
Outdated
| } | ||
|
|
||
| // Clone clones the repository if it hasn't been cloned already | ||
| func (r *Repo) Clone(ctx context.Context) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: EnsureCloned(ctx)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually renamed this from EnsureCloned to Clone with a different method to check that it was cloned for claification, but im happy to go with EnsureCloned if you want.
Signed-off-by: James Humphries <[email protected]>
backendv2/pkg/module/parser.go
Outdated
| } | ||
|
|
||
| // BuildCompleteModuleStructure creates the complete module structure required by the registry API | ||
| func (p *Parser) BuildCompleteModuleStructure(ctx context.Context, rootModuleData map[string]interface{}, submodulesData map[string]map[string]interface{}, examplesData map[string]map[string]interface{}, licenses []license.License) (map[string]interface{}, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This signature could benefit from some type aliases :)
backendv2/pkg/module/parser.go
Outdated
|
|
||
| // Build the complete structure | ||
| moduleStructure := map[string]interface{}{ | ||
| "id": p.version, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you are not using a more clearly defined structure here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to test it with a lot of providers first, and against the UI. Until i've tested that I wanted to keep it flexible for now.
I've added a TODO that we should introduce better typing in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, great work!
Obviously, with this amount of changes, there are a lot of things to comment on, so I tested most of the commands and reviewed the code that is reachable from those.
I will do a full review later, when it's marked as ready for review.
OTEL Consistency
grep -r "tracer.Start(.*)" . shows a lack of consistency in the tracer names.
Things I had to change to make this work on my local machine
diff --git a/backendv2/pkg/config/bucket.go b/backendv2/pkg/config/bucket.go
index be6192f..26f95a8 100644
--- a/backendv2/pkg/config/bucket.go
+++ b/backendv2/pkg/config/bucket.go
@@ -31,9 +31,9 @@ func (c *BucketConfig) Validate() error {
if c.SecretAccessKey == "" {
return fmt.Errorf("bucket.secretAccessKey is required")
}
- if c.Endpoint == "" {
- return fmt.Errorf("bucket.endpoint is required")
- }
+ // if c.Endpoint == "" {
+ // return fmt.Errorf("bucket.endpoint is required")
+ // }
// Set default region if empty
if c.Region == "" {
@@ -50,7 +50,7 @@ func (c *BucketConfig) GetClient(ctx context.Context) (*s3.Client, error) {
cfg, err := awsconfig.LoadDefaultConfig(ctx,
awsconfig.WithCredentialsProvider(credentials.NewStaticCredentialsProvider(c.AccessKeyID, c.SecretAccessKey, "")),
- awsconfig.WithRegion("auto"),
+ awsconfig.WithRegion(c.Region),
)
if err != nil {
slog.ErrorContext(ctx, "Failed to load AWS config", "error", err)
@@ -61,6 +61,9 @@ func (c *BucketConfig) GetClient(ctx context.Context) (*s3.Client, error) {
otelaws.AppendMiddlewares(&cfg.APIOptions)
client := s3.NewFromConfig(cfg, func(o *s3.Options) {
+ if c.Endpoint == "" {
+ return
+ }
o.BaseEndpoint = aws.String(c.Endpoint)
})Additionally, had to:
- comment out ID: 25 from migration because of a query that still uses
tag_created_atthat was dropped in that version. - manually run
insert into licenses(spdx_id, name) values ('Apache-2.0', 'apache');- do you plan to have a seed for these licenses? 🤔 Otherwise, when inserting into the
provider_version_licensesormodule_version_licenses, there will be a fk violation
- do you plan to have a seed for these licenses? 🤔 Otherwise, when inserting into the
| slog.ErrorContext(ctx, "Error loading config", "error", err) | ||
| log.Fatalf("Error loading config: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason to use both log and slog? Using log for the os.Exit() inside Fatalf? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's it exactly. What do you think about keeping it? or would you prefer i use slog + explicit os.exit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, yes, if it's only for that.
For the maintenance purpose, I would say that the intent of the log.Fatalf is unclear here and we can end up with somebody removing that line strictly because "this is duplicated with the log above".
A call to os.Exit() is definitive in its intent. Worse case, create a common function somewhere and log with slog and call os.Exit. 🤷
| err := k.Load(env.Provider(".", env.Opt{ | ||
| Prefix: EnvVarPrefix, | ||
| TransformFunc: func(k, v string) (string, any) { | ||
| k = strings.ReplaceAll(strings.ToLower(strings.TrimPrefix(k, EnvVarPrefix)), "_", ".") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this meant to format the keys as the ones in config.example.yaml?
If so, isn't that ToLower will mess everything up?
backendv2/pkg/tofu/dl.go
Outdated
| // TODO: Investigate if we should bring in tofudl here? | ||
|
|
||
| func DownloadLatestNightly(ctx context.Context, destination string) error { | ||
| tracer := otel.Tracer("opentofu-registry-backend") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this particular tracer name opentofu-registry-backend is scattered all over the place, wouldn't you think that would benefit of having a common function somewhere to build this?
I am also thinking of opentofu/opentofu#3448
| &cli.StringFlag{ | ||
| Name: "namespace", | ||
| Usage: "Module namespace (e.g., hashicorp)", | ||
| Required: true, | ||
| }, | ||
| &cli.StringFlag{ | ||
| Name: "name", | ||
| Usage: "Module name (e.g., vpc)", | ||
| Required: true, | ||
| }, | ||
| &cli.StringFlag{ | ||
| Name: "target", | ||
| Usage: "Module target (e.g., aws)", | ||
| Required: true, | ||
| }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably it's me, but man these args are confusing for modules 😢
Namespace, makes sense, but name and target is not clear at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I've never understood why the need to put the target (e.g. alicloud) in the end, like: https://search.opentofu.org/module/terraform-alicloud-modules/vpn-gateway/alicloud/latest
backendv2/pkg/module/storage/db.go
Outdated
| processed_at = EXCLUDED.processed_at, | ||
| tag_created_at = EXCLUDED.tag_created_at` | ||
|
|
||
| _, err = tx.Exec(ctx, query, namespace, name, target, version, jsonData, tagCreatedAt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
failed to store module version: ERROR: column "tag_created_at" of relation "module_versions" does not exist (SQLSTATE 42703)
And I ran migrate up
|
|
||
| -- Remove unused column from module_versions table | ||
| ALTER TABLE module_versions | ||
| DROP COLUMN IF EXISTS tag_created_at;`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is still used in StoreModuleVersion
| _, err = tx.Exec(ctx, insertQuery, | ||
| namespace, name, target, version, | ||
| lic.SPDX, float64(lic.Confidence), lic.File, matchType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Message = {string} "insert or update on table "module_version_licenses" violates foreign key constraint "module_version_licenses_license_spdx_id_fkey""
Detail = {string} "Key (license_spdx_id)=(Apache-2.0) is not present in table "licenses"."
Running sync-module --namespace=coralogix --name=aws --target=coralogix --version=v3.10.8
|
|
||
| // Store documents and complete the scraping process only if license was accepted | ||
| if licenseAccepted { | ||
| err = docScraper.ScrapeAndStore(ctx, namespace, name, version, workDir, licenses, tx) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the line 128, we scrape the docs and we scrape those again here? 🤔 Wouldn't be more performant to just store what was already scrapped?
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
…d> [args]
The available commands for execution are listed below.
The primary workflow commands are given first, followed by
less common or more advanced commands.
Main commands:
init Prepare your working directory for other commands
validate Check whether the configuration is valid
plan Show changes required by the current configuration
apply Create or update infrastructure
destroy Destroy previously-created infrastructure
All other commands:
console Try OpenTofu expressions at an interactive command prompt
fmt Reformat your configuration in the standard style
force-unlock Release a stuck lock on the current workspace
get Install or upgrade remote OpenTofu modules
graph Generate a Graphviz graph of the steps in an operation
import Associate existing infrastructure with a OpenTofu resource
login Obtain and save credentials for a remote host
logout Remove locally-stored credentials for a remote host
metadata Metadata related commands
output Show output values from your root module
providers Show the providers required for this configuration
refresh Update the state to match remote systems
show Show the current state or a saved plan
state Advanced state management
taint Mark a resource instance as not fully functional
test Execute integration tests for OpenTofu modules
untaint Remove the 'tainted' state from a resource instance
version Show the current OpenTofu version
workspace Workspace management
Global options (use these before the subcommand, if any):
-chdir=DIR Switch to a different working directory before executing the
given subcommand.
-help Show this help output, or the help for a specified subcommand.
-version An alias for the "version" subcommand. package being badly gitignored
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
…ed if they fail to scrape and tweaked some logging Signed-off-by: James Humphries <[email protected]>
- Cleaned up ScrapeVersion for providers and modules to reduce calls to registry clients - Added checksum tracking for s3 client - Improved pgx tracing - Improved http tracing - General cleanup of comments that were useless or old TODOs that were done - Improved tracking of failed scraping of modules and providers Signed-off-by: James Humphries <[email protected]>
|
|
||
| // Set default region if empty | ||
| if c.Region == "" { | ||
| c.Region = "auto" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we should include some docs indicating that this is the default region for R2 to make it clear that this is the expected default system, rather than AWS.
| awsconfig.WithCredentialsProvider(credentials.NewStaticCredentialsProvider(c.AccessKeyID, c.SecretAccessKey, "")), | ||
| awsconfig.WithRegion(c.Region), | ||
| awsconfig.WithHTTPClient(httpClient), | ||
| ) | ||
| if err != nil { | ||
| slog.ErrorContext(ctx, "Failed to load AWS config", "error", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But now I'm uncertain about my previous comment since this seems to be heavily tied to AWS - although I think this would work fine for S3-like systems.
| return fmt.Errorf("registryPath is required") | ||
| } | ||
|
|
||
| // make the directory if it doesnt eixst |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // make the directory if it doesnt eixst | |
| // make the directory if it doesn't exist |
| return err | ||
| } | ||
|
|
||
| // no need to Validate telemetry right noww |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this comment necessary? I mean, is this a call to action to implement later, but it's not needed at this point in time, or is it just saying that we don't ever need to validate telemetry config?
| ConfidenceThreshold float32 `koanf:"confidenceThreshold"` | ||
| // ConfidenceOverrideThreshold is the limit at which a detected license overrides all other detected licenses. | ||
| ConfidenceOverrideThreshold float32 `koanf:"confidenceOverrideThreshold"` | ||
| } | ||
|
|
||
| func (c *LicenseConfig) Validate() error { | ||
| // compatible licenses shouldn't be empty | ||
| if len(c.CompatibleLicenses) == 0 { | ||
| return fmt.Errorf("compatible license list was empty or not provided") | ||
| } | ||
|
|
||
| if c.ConfidenceThreshold == 0.0 { | ||
| c.ConfidenceThreshold = 0.85 | ||
| } | ||
| if c.ConfidenceOverrideThreshold == 0 { | ||
| c.ConfidenceOverrideThreshold = 0.98 | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we describe these thresholds more clearly and explain how they're calculated?
| parent_organisation = NULL, | ||
| parent_name = NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think there's another way to rebuild this query without duplicating it only because of the parent's fork information?
|
|
||
| // Expressions" describes the resource-type-specific content of the | ||
| // configuration block. | ||
| Expressions map[string]any `json:"expressions,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Expressions map[string]any `json:"expressions,omitempty"` | |
| Expressions map[string]expression `json:"expressions,omitempty"` |
Shouldn't it be like this?
| func matchesModuleFilter(parts []string, filterParts []string) bool { | ||
| if len(filterParts) == 0 { | ||
| return true | ||
| } | ||
|
|
||
| if len(filterParts) == 1 { | ||
| return matchPattern(filterParts[0], parts[0]) | ||
| } | ||
|
|
||
| if len(filterParts) == 2 { | ||
| if !matchPattern(filterParts[0], parts[0]) { | ||
| return false | ||
| } | ||
|
|
||
| if len(parts) >= 2 && matchPattern(filterParts[1], parts[1]) { | ||
| return true | ||
| } | ||
| if len(parts) >= 3 && matchPattern(filterParts[1], parts[2]) { | ||
| return true | ||
| } | ||
| return false | ||
| } | ||
|
|
||
| if len(filterParts) == 3 { | ||
| if len(parts) < 3 { | ||
| return false | ||
| } | ||
| return matchPattern(filterParts[0], parts[0]) && | ||
| matchPattern(filterParts[1], parts[1]) && | ||
| matchPattern(filterParts[2], parts[2]) | ||
| } | ||
|
|
||
| return false | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function looks very strange to me. I'm not sure what it's doing, and the code paths look too complicated.
| if len(filterParts) > 0 && !matchPattern(filterParts[0], namespace) { | ||
| continue | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason for skipping this one? It's not clear to me
| firstLetter := strings.ToLower(string(namespace[0])) | ||
| if firstLetter >= "0" && firstLetter <= "9" { | ||
| firstLetter = string(namespace[0]) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen this condition in other places but I'm not sure why it's needed and the motivation of the assignment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though the backend appears to be working, I found it very hard to review a PR this size. At least separating the PRs by package would make it easier to review.
That being said, I like the direction the backend is going, the new functionality is really needed, and the architecture seems to fit our needs better.
EDIT: Sorry for the request for changes, I just noticed it's a draft 🤦
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
…ument Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
Signed-off-by: James Humphries <[email protected]>
What I am calling "backendv2". This is intended to be a simpler, more maintainable replacement for the existing backend, with improved performance and reliability.
This approach aims to reduce some of the layers of abstraction but also make much better use of go worktrees for efficient git operations.
It also introduces the idea of using a database to track a lot of the state of the system, rather than relying on filesystem state and S3 objects.
This PR also
tofu show -json module=<PATH>to get module metadataChecklist