Set DataFusion runtime configurations through SQL interface #15594

Open. 3 commits to be merged into base branch main.

@kumarlokesh kumarlokesh commented Apr 5, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

  1. Extended the session context to handle runtime configurations.
  2. Currently supports the memory limit configuration via SET datafusion.runtime.memory_limit = '100M'.
  3. Updated documentation.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added documentation Improvements or additions to documentation core Core DataFusion crate labels Apr 5, 2025
@kumarlokesh kumarlokesh force-pushed the allow-setting-runtime-configs-through-sql branch 3 times, most recently from 3d743e5 to bfb1d0e Compare April 5, 2025 17:14
@github-actions github-actions bot added the common Related to common crate label Apr 5, 2025
@kumarlokesh kumarlokesh force-pushed the allow-setting-runtime-configs-through-sql branch from bfb1d0e to 8d894be Compare April 5, 2025 17:43
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Apr 5, 2025
@berkaysynnada
Contributor

Hello @kumarlokesh. Thank you for working on this. I have 2 questions/concerns. Let's discuss them a bit to arrive at a future-proof design.

  1. There is also a runtime_env: Arc<RuntimeEnv> field in the SessionState and TaskContext structs, and RuntimeEnv has a pub memory_pool: Arc<dyn MemoryPool> field. The MemoryPool implementations carry their own memory-limit configuration as well. Keeping two different mechanisms for the same purpose always brings trouble, so I wonder whether we can somehow connect the two.

  2. Do you plan to exercise this configuration in any part of the execution? When introducing such a feature, implementing at least one use case makes things clearer and serves as guidance for further development.

@alamb alamb added the enhancement New feature or request label Apr 7, 2025
@kumarlokesh kumarlokesh force-pushed the allow-setting-runtime-configs-through-sql branch from 8d894be to fa34e62 Compare April 7, 2025 18:43
@github-actions github-actions bot added development-process Related to development process of DataFusion execution Related to the execution crate and removed sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Apr 7, 2025
@kumarlokesh kumarlokesh force-pushed the allow-setting-runtime-configs-through-sql branch 5 times, most recently from bae6d2f to 08d038c Compare April 7, 2025 19:40
@kumarlokesh
Contributor Author

@berkaysynnada Thank you for the feedback!

  1. I have revised the PR so that runtime environment configuration such as the memory limit is exposed through the already defined RuntimeEnv -> memory_pool setting.

  2. Since a user can set datafusion.runtime.memory_limit (and other available runtime environment configurations) through the SQL interface, I think tests around this behaviour are a good starting point. I can't think of another good use case besides this. Thoughts?

let ctx = SessionContext::new();

// Set memory limit to 100MB using SQL - note the quotes around the value
ctx.sql("SET datafusion.runtime.memory_limit = '100M'")
Contributor

Running this test can't ensure the config is actually applied: the test would also pass without the memory limit, using the default UnboundedMemoryPool. Instead, I think it should assert the spill count for a query that spills some intermediate data.

SET datafusion.runtime.memory_limit = '1M';
SET datafusion.execution.sort_spill_reservation_bytes = 0;

select * from generate_series(1, 100000) as t1(v1) order by v1;
-- And assert spill-count from the query is > 0

You can check this for how to assert the spill file count:

let spill_count = metrics.spill_count().unwrap();

Later, once all configurations are added, I think we should make this test case stronger by setting more runtime configs and doing some property testing to ensure all of them are applied properly.

BTW, I tried the above test locally, and the generate_series UDTF does not seem to be registered in the PR branch, although it works in the main branch.
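
To make the reasoning behind the spill-count assertion concrete, here is a toy model in plain Rust. It is only a sketch: ToyPool and sort_with_spill are hypothetical names invented for illustration and are not DataFusion types. The 64-byte row size is likewise an arbitrary assumption. The point is that a bounded pool forces spills while an unbounded one never does, so a test that only checks query success cannot tell the two apart.

```rust
/// Hypothetical stand-in for a memory pool (NOT a DataFusion type).
/// `limit: None` models an unbounded pool.
struct ToyPool {
    limit: Option<usize>,
    used: usize,
}

impl ToyPool {
    /// Try to reserve `bytes`; returns false if that would exceed the limit.
    fn try_grow(&mut self, bytes: usize) -> bool {
        match self.limit {
            Some(limit) if self.used + bytes > limit => false,
            _ => {
                self.used += bytes;
                true
            }
        }
    }

    /// Release everything that is currently reserved.
    fn shrink_all(&mut self) {
        self.used = 0;
    }
}

/// Buffer `rows` rows of `row_bytes` each. Whenever a reservation fails,
/// count a spill, release the buffer, and retry the reservation once.
/// Returns the number of spills.
fn sort_with_spill(rows: usize, row_bytes: usize, pool: &mut ToyPool) -> usize {
    let mut spill_count = 0;
    for _ in 0..rows {
        if !pool.try_grow(row_bytes) {
            spill_count += 1;
            pool.shrink_all();
            // Retry after spilling; the result is ignored in this toy model.
            let _ = pool.try_grow(row_bytes);
        }
    }
    spill_count
}
```

With a 1 MiB limit and 100,000 rows of 64 bytes, the model reports at least one spill. With no limit it reports none, which is the difference the suggested assertion is meant to detect.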

Contributor Author

@2010YOUY01 @berkaysynnada addressed above in 6501f21.


/// Parse memory limit from string to number of bytes
/// e.g. '1.5G', '100M'
fn parse_memory_limit(&self, limit: &str) -> Result<usize> {
Contributor

Maybe we can make this function public so it can be used externally?
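
For readers who want to see what such a helper might look like as a free-standing function, here is a minimal sketch. It is not the PR's implementation: the accepted suffixes (K, M, G), the use of powers of 1024, and the String error type are all assumptions made for illustration.

```rust
/// Parse a human-readable memory limit such as "100M" or "1.5G" into bytes.
/// Assumption: suffixes K/M/G are case-insensitive and mean powers of 1024.
fn parse_memory_limit(limit: &str) -> Result<usize, String> {
    let limit = limit.trim();
    // The last character is taken to be the unit suffix.
    let unit = limit
        .chars()
        .last()
        .ok_or_else(|| "empty memory limit".to_string())?;
    let number = &limit[..limit.len() - unit.len_utf8()];
    let value: f64 = number
        .parse()
        .map_err(|_| format!("invalid number in memory limit '{limit}'"))?;
    if !value.is_finite() || value < 0.0 {
        return Err(format!("memory limit '{limit}' must be a non-negative number"));
    }
    let multiplier: f64 = match unit.to_ascii_uppercase() {
        'K' => 1024.0,
        'M' => 1024.0 * 1024.0,
        'G' => 1024.0 * 1024.0 * 1024.0,
        other => {
            return Err(format!("unsupported unit '{other}' in memory limit '{limit}'"))
        }
    };
    // Fractional values such as "1.5G" are truncated to whole bytes.
    Ok((value * multiplier) as usize)
}
```

In this sketch a value without a suffix, such as "100", is rejected rather than interpreted as bytes. Whether that is the desired behaviour is a design choice the thread does not settle.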

@berkaysynnada
Contributor

Thank you @kumarlokesh for addressing my comments. I don't have any further suggestions or concerns beyond those @2010YOUY01 shared.

@kumarlokesh kumarlokesh force-pushed the allow-setting-runtime-configs-through-sql branch from 08d038c to 6501f21 Compare April 12, 2025 17:09
@kumarlokesh kumarlokesh requested a review from 2010YOUY01 April 12, 2025 17:43
Contributor

@2010YOUY01 2010YOUY01 left a comment

Thank you. I plan to leave this PR open for 2 more days before merging, in case anyone has additional comments.


let mut state = self.state.write();
let mut builder =
    RuntimeEnvBuilder::from_runtime_env(state.runtime_env());
Contributor

This implementation only clones the Arcs inside RuntimeEnv, so I don't think it should cause any issues.
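
The point about Arc cloning can be illustrated with a small stand-alone sketch. Every type below (ToyRuntimeEnv, ToyRuntimeEnvBuilder, DiskManager, the two pools) is a hypothetical stand-in written for this example, not DataFusion's actual definition. It only mirrors the pattern under discussion: build from an existing environment, swap the memory pool, and share everything else.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a memory pool trait.
trait MemoryPool: Send + Sync {
    /// Maximum number of bytes, or None if unbounded.
    fn limit(&self) -> Option<usize>;
}

struct UnboundedPool;
impl MemoryPool for UnboundedPool {
    fn limit(&self) -> Option<usize> {
        None
    }
}

struct BoundedPool {
    max_bytes: usize,
}
impl MemoryPool for BoundedPool {
    fn limit(&self) -> Option<usize> {
        Some(self.max_bytes)
    }
}

/// Hypothetical stand-in for another shared runtime component.
struct DiskManager;

/// Toy runtime environment holding shared components behind Arcs.
struct ToyRuntimeEnv {
    memory_pool: Arc<dyn MemoryPool>,
    disk_manager: Arc<DiskManager>,
}

struct ToyRuntimeEnvBuilder {
    memory_pool: Arc<dyn MemoryPool>,
    disk_manager: Arc<DiskManager>,
}

impl ToyRuntimeEnvBuilder {
    /// Start from an existing environment. Only reference counts are bumped;
    /// the underlying components are not copied.
    fn from_runtime_env(env: &ToyRuntimeEnv) -> Self {
        Self {
            memory_pool: Arc::clone(&env.memory_pool),
            disk_manager: Arc::clone(&env.disk_manager),
        }
    }

    fn with_memory_pool(mut self, pool: Arc<dyn MemoryPool>) -> Self {
        self.memory_pool = pool;
        self
    }

    fn build(self) -> ToyRuntimeEnv {
        ToyRuntimeEnv {
            memory_pool: self.memory_pool,
            disk_manager: self.disk_manager,
        }
    }
}

/// Rebuild the environment with a bounded pool, sharing all other components.
fn with_memory_limit(env: &ToyRuntimeEnv, max_bytes: usize) -> ToyRuntimeEnv {
    ToyRuntimeEnvBuilder::from_runtime_env(env)
        .with_memory_pool(Arc::new(BoundedPool { max_bytes }))
        .build()
}
```

In this model the rebuilt environment reports the new limit while its disk manager is pointer-equal to the original's, which is why the rebuild is cheap and leaves the other shared state untouched.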

Labels
core Core DataFusion crate development-process Related to development process of DataFusion documentation Improvements or additions to documentation enhancement New feature or request execution Related to the execution crate
Development

Successfully merging this pull request may close these issues.

Set DataFusion runtime configurations through SQL interface
4 participants