Presto is a distributed SQL query engine that can be used to query the data stored in CLP. This guide describes how to set up and use Presto with CLP.
:::{warning}
Currently, only the clp-json flavor of CLP supports queries through Presto.
:::
:::{note}
This integration with Presto is under development and may change in the future. It is currently maintained in a fork of the Presto project; once these changes are merged into the main Presto repository, you will be able to use official Presto releases with CLP.
:::
CLP supports Presto through two deployment methods:
- Kubernetes (Helm): Presto is deployed as part of the CLP Helm chart. This is the simplest option if you are already using the Kubernetes deployment.
- Docker Compose: Presto is deployed separately using Docker Compose alongside a CLP package installation.
When deploying CLP on Kubernetes using Helm, Presto can be enabled by setting `clpConfig.presto` to
a non-null configuration and `webui.query_engine` to `"presto"`. The `query_engine` setting controls
which search interface the Web UI displays. Presto runs alongside the existing compression pipeline;
the clp-s native query components can optionally be disabled to save resources.
This deployment method requires:

- A running CLP Kubernetes deployment (see the Kubernetes deployment guide)
- Create a values file to enable Presto:

  ```{code-block} yaml
  :caption: presto-values.yaml
  clpConfig:
    webui:
      query_engine: "presto"

    # Optional: Disable the clp-s native query pipeline to save resources.
    # NOTE: The API server depends on the clp-s native query pipeline.
    api_server: null
    query_scheduler: null
    query_worker: null
    reducer: null

    # Disable results cache retention since the Presto integration doesn't yet support
    # garbage collection of search results.
    results_cache:
      retention_period: null

    presto:
      port: 30889
      coordinator:
        logging_level: "INFO"
        query_max_memory_gb: 1
        query_max_memory_per_node_gb: 1
      worker:
        query_memory_gb: 4
        system_memory_gb: 8
      # Split filter config for the Presto CLP connector. For each dataset, add a filter entry.
      # Replace <dataset> with the dataset name (use "default" if you didn't specify one when
      # compressing) and <timestamp-key> with the timestamp key used during compression.
      # See https://docs.yscope.com/presto/connector/clp.html#split-filter-config-file
      split_filter:
        clp.default.<dataset>:
          - columnName: "<timestamp-key>"
            customOptions:
              rangeMapping:
                lowerBound: "begin_timestamp"
                upperBound: "end_timestamp"
            required: false
  ```
- Install (or upgrade) the Helm chart with the Presto values:

  ```bash
  helm install clp clp/clp DOCS_VAR_HELM_VERSION_FLAG -f presto-values.yaml
  ```
- Verify that the Presto coordinator and worker pods are running:

  ```bash
  kubectl get pods -l "app.kubernetes.io/component in (presto-coordinator, presto-worker)"
  ```
Once the pods are ready, you can query your logs through Presto using CLP's Web UI.
:::{note}
When using Kubernetes, Presto worker scheduling can be configured using the `prestoWorker.scheduling`
key in Helm values. See the worker scheduling section of the Kubernetes deployment
guide for details.
:::
To use Presto with CLP via Docker Compose, you'll need the following software:

- CLP (clp-json) v0.5.0 or higher
- Docker v28 or higher
- Docker Compose v2.20.2 or higher
- Python
- python3-venv (for the version of Python installed)
At a high level, using Presto with CLP via Docker Compose involves:
- Setting up CLP and compressing some logs.
- Setting up Presto to query CLP's metadata database and archives.
- Follow the quick-start guide to download and extract the CLP package, but don't start the package just yet.
- Before starting the package, update the package's config file (`etc/clp-config.yaml`) as follows:

  - Set the `package.query_engine` key to `"presto"`:

    ```yaml
    package:
      storage_engine: "clp-s"
      query_engine: "presto"
    ```

  - Set the `results_cache.retention_period` key to `null`, since the CLP + Presto integration doesn't yet support garbage collection:

    ```yaml
    results_cache:
      # host: "localhost"
      # port: 27017
      # db_name: "clp-query-results"
      # stream_collection_name: "stream-files"
      #
      # Retention period for search results, in minutes. Set to null to disable automatic deletion.
      retention_period: null
    ```

  - Update the `presto` key with the host and port of the Presto cluster. If you follow the Setting up Presto section below, the host is `localhost` and the port is `8889`:

    ```yaml
    presto:
      host: "<ip-address>"
      port: <port>
    ```

    :::{note}
    Presto doesn't need to be running before you start CLP.
    :::
- If you'd like to store your compressed logs on S3, follow the using object storage guide.

  :::{note}
  Currently, the Presto integration only supports the `credentials` authentication type.
  :::
- Continue following the quick-start guide to start CLP and compress your logs. A sample dataset that works well with Presto is postgresql.
- Clone the CLP repository:

  ```bash
  git clone --branch DOCS_VAR_CLP_GIT_REF https://github.com/y-scope/clp.git
  ```
- Navigate to the `tools/deployment/presto-clp` directory in your terminal.
- Generate the necessary config for Presto to work with CLP:

  ```bash
  scripts/set-up-config.sh <clp-json-dir>
  ```
  - Replace `<clp-json-dir>` with the location of the clp-json package you set up in the previous section.
- Configure Presto to use CLP's metadata database as follows:

  - Open and edit `coordinator/config-template/split-filter.json`.
  - For each dataset you want to query, add a filter config of the form:

    ```json
    {
      "clp.default.<dataset>": [
        {
          "columnName": "<timestamp-key>",
          "customOptions": {
            "rangeMapping": {
              "lowerBound": "begin_timestamp",
              "upperBound": "end_timestamp"
            }
          },
          "required": false
        }
      ]
    }
    ```

    - Replace `<dataset>` with the name of the dataset you want to query. (If you didn't specify a dataset when compressing your logs, they will have been compressed into the `default` dataset.)
    - Replace `<timestamp-key>` with the timestamp key you specified when compressing logs for this particular dataset.
  - The complete syntax for this file is here.
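If you have many datasets, the per-dataset entries in `split-filter.json` can be generated with a short script. The sketch below is illustrative, not part of the CLP tooling; the dataset names and timestamp keys in `DATASETS` are placeholders you'd replace with your own:

```python
import json

# Hypothetical mapping from dataset name to the timestamp key used when
# compressing that dataset; adjust to match your deployment.
DATASETS = {
    "default": "timestamp",
    "postgresql": "timestamp",
}


def build_split_filter(datasets: dict[str, str]) -> dict:
    """Build a split-filter config with one range-mapped entry per dataset."""
    config = {}
    for dataset, timestamp_key in datasets.items():
        config[f"clp.default.{dataset}"] = [
            {
                "columnName": timestamp_key,
                "customOptions": {
                    "rangeMapping": {
                        "lowerBound": "begin_timestamp",
                        "upperBound": "end_timestamp",
                    }
                },
                "required": False,
            }
        ]
    return config


if __name__ == "__main__":
    # Print the generated config; redirect the output into split-filter.json.
    print(json.dumps(build_split_filter(DATASETS), indent=2))
```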
- Start a Presto cluster by running:

  ```bash
  docker compose up --wait
  ```
  - To use more than one Presto worker, you can use the `--scale` option as follows:

    ```bash
    docker compose up --wait --scale presto-worker=<num-workers>
    ```

    - Replace `<num-workers>` with the number of Presto worker nodes you want to run.
- To stop the Presto cluster:

  ```bash
  docker compose stop
  ```

- To clean up the Presto cluster entirely:

  ```bash
  docker compose down
  ```

You can query your compressed logs in your browser from CLP's UI, or from the command line using the Presto CLI.
Each dataset in CLP shows up as a table in Presto. To show all available datasets:
```sql
SHOW TABLES;
```

:::{note}
If you didn't specify a dataset when compressing your logs in CLP, your logs will have been stored
in the default dataset.
:::
To show all available columns in the default dataset:
```sql
DESCRIBE default;
```

If you wish to show the columns of a different dataset, replace `default` above.
To query the logs in this dataset:
```sql
SELECT * FROM default LIMIT 1;
```

All kv-pairs in each log event can be queried directly using dot notation. For example, if your logs
contain the field `foo.bar`, you can query it using:
```sql
SELECT foo.bar FROM default LIMIT 1;
```

CLP's UI should be available at http://localhost:4000 (if you changed
`webui.host` or `webui.port` in `etc/clp-config.yaml`, use the new values).
:::{note}
The UI can only run one query at a time, and queries must not end with a `;`.
:::
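If you generate queries programmatically before pasting or submitting them to the UI, a small helper can enforce the no-trailing-semicolon rule. This is a hypothetical convenience function, not part of CLP:

```python
def normalize_for_ui(query: str) -> str:
    """Strip surrounding whitespace and any trailing ';' so the query
    satisfies the web UI's no-trailing-semicolon rule."""
    return query.strip().rstrip(";").rstrip()


print(normalize_for_ui("SELECT * FROM default LIMIT 1;"))
# → SELECT * FROM default LIMIT 1
```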
To access the Presto CLI, navigate to the `tools/deployment/presto-clp` directory and run:
```bash
docker compose exec presto-coordinator \
    presto-cli \
    --catalog clp \
    --schema default
```

The Presto CLP integration currently has the following limitations:
- Nested fields containing special characters cannot be queried (see y-scope/presto#8). The only allowed characters are alphanumerics and underscores. To work around this limitation, you'll need to preprocess your logs to remove any special characters from field names.
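One way to do that preprocessing is to rewrite field names before compression so they contain only alphanumerics and underscores. The following is a minimal sketch, assuming JSON log events; replacing unsupported characters with `_` is a choice made here for illustration, not something CLP mandates:

```python
import json
import re

# Matches any character that is not alphanumeric or an underscore.
_INVALID = re.compile(r"[^0-9A-Za-z_]")


def sanitize_keys(value):
    """Recursively replace unsupported characters in object keys with '_'."""
    if isinstance(value, dict):
        return {_INVALID.sub("_", k): sanitize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [sanitize_keys(v) for v in value]
    return value


if __name__ == "__main__":
    event = {"http.request": {"status-code": 200}, "msg": "ok"}
    print(json.dumps(sanitize_keys(event)))
    # → {"http_request": {"status_code": 200}, "msg": "ok"}
```

You would apply this to each log event (e.g., each line of a newline-delimited JSON file) before passing the logs to CLP for compression.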
These limitations will be addressed in a future release of the Presto integration.