|
| 1 | +--- |
| 2 | +title: 'Adam Kennedy (Voltron Data): Polygraph' |
| 3 | + |
| 4 | +event: 'Eusocial Interest Group Meeting' |
| 5 | +#event_url: https://example.org |
| 6 | + |
| 7 | +location: University of California, Santa Cruz |
| 8 | +address: |
| 9 | + street: 1156 High St |
| 10 | + city: Santa Cruz |
| 11 | + region: CA |
| 12 | + postcode: '95064' |
| 13 | + country: United States |
| 14 | + |
| 15 | +summary: Adam Kennedy (Voltron Data) is speaking about Polygraph, a new effort to make the processing and optimization of query plans more efficient |
| 16 | +abstract: '' |
| 17 | + |
| 18 | +# Talk start and end times. |
| 19 | +# End time can optionally be hidden by prefixing the line with `#`. |
| 20 | +date: '2023-09-07T14:00:00-0700' |
| 21 | +date_end: '2023-09-07T15:00:00-0700' |
| 22 | +all_day: false |
| 23 | + |
| 24 | +# Schedule page publish date (NOT talk date). |
| 25 | +publishDate: '2023-09-06' |
| 26 | + |
| 27 | +authors: [adam.kennedy] |
| 28 | +tags: [] |
| 29 | + |
| 30 | +# Is this a featured talk? (true/false) |
| 31 | +featured: false |
| 32 | + |
| 33 | +image: |
| 34 | + caption: '' |
| 35 | + focal_point: Right |
| 36 | + |
| 37 | +url_code: '' |
| 38 | +url_pdf: '' |
| 39 | +url_slides: '' |
| 40 | +url_video: 'https://www.icloud.com/iclouddrive/0920UPOGUXIosE6viyjHJe6BQ#video1862580471' |
| 41 | + |
| 42 | +# Markdown Slides (optional). |
| 43 | +# Associate this talk with Markdown slides. |
| 44 | +# Simply enter your slide deck's filename without extension. |
| 45 | +# E.g. `slides: example-slides` references `content/slides/example-slides.md`. |
| 46 | +# Otherwise, set `slides: ''`. |
| 47 | +slides: '' |
| 48 | + |
| 49 | +# Projects (optional). |
| 50 | +# Associate this post with one or more of your projects. |
| 51 | +# Simply enter your project's folder or file name without extension. |
| 52 | +# E.g. `projects: [internal-project]` references `content/project/internal-project/index.md`. |
| 53 | +# Otherwise, set `projects: []`. |
| 54 | +projects: [] |
| 55 | +--- |
| 56 | + |
| 57 | +{{% callout note %}} |
| 58 | +This is an expanded version of Adam Kennedy's presentation at the [2nd International Workshop on Composable Data Management Systems 2023 (CDMS)](https://ceur-ws.org/Vol-3462/CDMS0.pdf) ([agenda](https://ceur-ws.org/Vol-3462)). The following abstract is copied from [there](https://ceur-ws.org/Vol-3462/CDMS12.pdf). |
| 59 | +{{% /callout %}} |
| 60 | + |
| 61 | +The maturity and substantial investment in Apache Calcite establish it as the open source standard for query planning and |
| 62 | +optimization across numerous data tools. Nevertheless, utilizing Apache Calcite for dynamic query planning in a diverse tool |
| 63 | +stack with multiple languages has proven challenging. Through the integration of Apache Arrow, we introduce Polygraph: |
| 64 | +a language-independent, parse-free, and efficient format for query plans. Its purpose is to enhance plan interoperability, |
| 65 | +diminish latency and overheads, and facilitate dynamic query optimization. This experimental format allows for the efficient |
| 66 | +exchange of query plans between tools in diverse languages with minimal serialization overhead. |
| 67 | + |
| 68 | +While future query engines are steering away from Java, Calcite remains the solitary mature option for query planning |
| 69 | +across a broad spectrum of workloads. Few alternatives come close to matching its features. However, Calcite relies on |
| 70 | +tree-based JSON or XML plan representations that do not readily lend themselves to certain optimizations and necessitate |
| 71 | +substantial overhead for serialization, I/O, and parsing. The commingling of planners and engines across languages is rare, |
| 72 | +unusual, and complex. Such approaches typically result in ad hoc, internal formats with limited reusability. Addressing |
| 73 | +these challenges, Polygraph relocates the query plan to Arrow. Polygraph employs a graph structure encoded with columnar |
| 74 | +storage techniques. Preliminary experiments indicate an order of magnitude reduction in query plan size compared to JSON |
| 75 | +encoding, without incurring copying and serialization overheads. Arrow provides zero-copy, shared-memory, and parse-free |
| 76 | +capabilities, along with fast RPC via Arrow Flight. In this representation, plan consumers only need to load the components |
| 77 | +and properties of a query plan required for a given computation. These efficiencies substantially reduce the latency between |
| 78 | +plan generation and query execution. Moreover, we envision significant potential for other advancements, including resource |
| 79 | +planning, ML preprocessing, and integration into ML training and inference. |
| 80 | + |
| 81 | +Until recently, there was no urgent imperative to represent query plans efficiently. However, the escalating complexity |
| 82 | +and size of query graphs will persist as data tools become more deeply integrated into intricate ML workloads. Polygraph’s |
| 83 | +agile and decomposable graph representation empowers data engines to contribute to query optimization and resource |
| 84 | +management. Enhanced integration with top-tier ML systems becomes more viable, facilitating the incorporation of run-time |
| 85 | +compute planning and resource management into the query plan, utilizing tools like Apache Acero. The benefits extend |
| 86 | +beyond improvements in space efficiency and latency. Query sub-plans can be optimized in-situ using real-time hardware |
| 87 | +metrics. Value relations and broadcast tables can be seamlessly embedded in the plan as Arrow objects, accessed in a zero-copy |
| 88 | +manner. Large models can be directly incorporated into the query plan, incurring no loading cost until required. Increased |
| 89 | +investment in query plan representations, exemplified by Polygraph, supports the data community in keeping pace with |
| 90 | +advancements in new architectures and problem domains, such as AI. |
| 91 | + |
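The abstract describes a query plan stored as a graph in columnar form, where a consumer loads only the components and properties it needs. The sketch below illustrates that idea under loud assumptions: plain Python lists stand in for Arrow arrays, and the operator names, property fields, and the `to_columnar` and `operators_only` helpers are hypothetical. None of this is Polygraph's actual schema, which this page does not specify.

```python
import json

# Hypothetical toy plan: Scan -> Filter -> Project. Operator names and
# properties are illustrative; they are not Polygraph's actual schema.
tree_plan = {
    "op": "Project", "props": {"columns": ["a"]},
    "inputs": [{
        "op": "Filter", "props": {"predicate": "a > 1"},
        "inputs": [{"op": "Scan", "props": {"table": "t"}, "inputs": []}],
    }],
}


def to_columnar(root):
    """Flatten a nested plan into node columns plus an edge list."""
    nodes = {"id": [], "op": [], "props": []}
    edges = {"src": [], "dst": []}  # src consumes the output of dst

    def visit(node):
        node_id = len(nodes["id"])
        nodes["id"].append(node_id)
        nodes["op"].append(node["op"])
        # Properties stay opaque strings here; a real format would type them.
        nodes["props"].append(json.dumps(node["props"]))
        for child in node["inputs"]:
            child_id = visit(child)
            edges["src"].append(node_id)
            edges["dst"].append(child_id)
        return node_id

    visit(root)
    return nodes, edges


def operators_only(nodes):
    """A consumer that needs only operator kinds reads a single column."""
    return nodes["op"]


nodes, edges = to_columnar(tree_plan)
print(operators_only(nodes))
print(list(zip(edges["src"], edges["dst"])))
```

In the nested form, a consumer has to walk the whole tree to list the operators. In the columnar form, it reads the `op` column and never touches the property payloads. With Arrow, each list would presumably become an Arrow array, so the same projection could avoid parsing the rest of the plan.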