Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define TR caching comms in ATD #353

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
64 changes: 63 additions & 1 deletion semgrep_output_v1.atd
Original file line number Diff line number Diff line change
Expand Up @@ -555,6 +555,8 @@ type sca_match = {
(* Note that in addition to "reachable" there are also the notions of
* "vulnerable" and "exploitable".
* coupling: see also SCA_match.ml
* TODO? have a Direct of xxx and Transitive of sca_transitive_match_kind?
* better so can be reused in other types such as tr_cache_result?
*)
type sca_match_kind = [
(* This is used for "parity" or "upgrade-only" rules. transitivity
Expand Down Expand Up @@ -1831,6 +1833,62 @@ type scan_config = {
?ci_config_from_cloud: ci_config_from_cloud option;
}

(* ------------------------------------------- *)
(* Transitive reachabilitiy (TR) caching comms *)
(* ------------------------------------------- *)
(* We want essentially to cache semgrep computation on third party packages
* to quickly know (rule_id x package_version) -> sca_transitive_match_kind
* to avoid downloading and recomputing each time the same thing.
*)

(* The "key".
* The rule_id and resolved_url should form a valid key for our TR cache
* database table. Indeed, semgrep should always return the same result when
* using the same rule and same resolved_url package. The content at the
* URL should hopefully not change (we could md5sum it just in case) and
* the content of the rule_id should also not change (could md5sum it maybe too).
* I've added tr_version below just in case we want to invalidate past
* cached entries (e.g., the semgrep engine itself changed enough that
* some past cached results might be wrong and should be recomputed)
*)
type tr_cache_key = {
rule_id: rule_id;
(* ex: http://some-website/hello-world.0.1.2.tgz like in found_dependency *)
resolved_url: string;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if this will be easy to find in many cases, though I agree that if it's possible it makes the best key. I guess it is probably fine to start with this, and add another key later if needed.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes we can always refine. This is just defining the interface. Once we start the implementation we will discover
we need to refine it.

(* to bump just in case of problem
* TODO: to be set in Transitive_reachability.ml tr_version constant
*)
tr_version: int;
}

(* The "value" *)
type tr_cache_match_result = {
(* TODO: could define separate sca_transitive_match type? Should only be
* one of TransitiveXxx case of sca_match_kind
* TODO? add other fields from sca_match or just sca_match_kind is enough?
* add location maybe? or make it part of the [transitive_reachable]
* and [transitive_unreachable] records?
* TODO? make it a list? match_results: ... list; ?
*)
match_result: sca_match_kind;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit odd, since when scanning a package with a rule it will result in direct code matches, not sca matches. Maybe we should just return the match here? Then, the CLI could treat those matches the same as matches that it receives from a call to Semgrep locally.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could. This is a bigger data structure to store then though, and for TR what we really need is actually just the sca_transitive_match_kind; that's the thing we try to optimize to avoid downloading the dependency and run semgrep on it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should at least store the match locations. I don't think it makes sense to store sca_match_kind because if there are matches in multiple packages, we will somehow need to combine those into a single finding in the cli

}

(* Sent by the CLI to the POST /api/???? *)
type tr_query_cache_request = {
entries: tr_cache_key list;
}

(* Response by the backend the the POST /api/???? *)
type tr_query_cache_response = {
cached: (tr_cache_key * tr_cache_match_result) list;
}

(* Sent by the CLI to the POST /api/??? *)
type tr_add_cache_request = {
new_entries: (tr_cache_key * tr_cache_match_result) list;
}
(* TODO: tr_add_cache_response: string result (Ok | Error) *)

(* ----------------------------- *)
(* TODO a better CI config from cloud *)
(* ----------------------------- *)
Expand Down Expand Up @@ -2307,6 +2365,10 @@ type resolution_result = [
| ResolutionError of resolution_error_kind list
]

(* ----------------------------- *)
(* SCA transitive reachability *)
(* ----------------------------- *)

type transitive_finding = {
(* the important part is the sca_match in core_match_extra that
* we need to adjust and especially the sca_match_kind.
Expand All @@ -2318,7 +2380,7 @@ type transitive_finding = {
}

(* ----------------------------- *)
(* SCA part 4: Symbol analysis *)
(* Symbol analysis *)
(* ----------------------------- *)

(* "Symbol analysis" is about determining the third-party functions which
Expand Down
62 changes: 62 additions & 0 deletions semgrep_output_v1.jsonschema

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

24 changes: 23 additions & 1 deletion semgrep_output_v1.proto

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

146 changes: 146 additions & 0 deletions semgrep_output_v1.py

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

Loading