Contributor

Copilot AI commented Dec 22, 2025

A user attempted to implement CORR() as a custom aggregate but found that AlaSQL's aggregate architecture supports only single-column inputs. Instead of hardcoding CORR, this PR extends the REDUCE mechanism so that users can define custom aggregate functions that take multiple columns.
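
A small sketch of why a four-parameter aggregate misbehaves under a single-column calling convention. It assumes the engine previously called user aggregates as (value, accumulator, stage), the signature described on the User-Defined Functions wiki page; the function below is a simplified, hypothetical stand-in for the one in the issue, and the call is a stand-in for the engine's stage 3 call.

```javascript
// Simplified stand-in for the issue's function: four declared parameters,
// returns the accumulator unless stage === 3.
function corrFromIssue(valueX, valueY, accumulator, stage) {
  if (stage === 3) {
    return 'finalized';
  }
  return accumulator;
}

// Assumed single-column convention: (value, accumulator, stage).
// With four declared parameters every argument shifts one slot to the left:
// valueX = undefined, valueY = { count: 2 }, accumulator = 3, stage = undefined.
const result = corrFromIssue(undefined, { count: 2 }, 3);

console.log(result); // 3
```

Under that assumption the stage number itself is returned as the aggregate's value, which would be consistent with the {corr: 3} reported in the issue.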

Changes

Extended REDUCE aggregate mechanism for multi-column support:

  • Modified parser grammar (src/alasqlparser.jison and src/alasqlparser.js) to store all arguments in the args array when creating REDUCE aggregates
  • Replaced exprlist.pop() with exprlist[0] plus args: exprlist, so that all column expressions are preserved instead of only one
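
The effect of the parser change can be sketched with plain objects. The node shape below (aggregatorid, funcid, expression, args) and the columnid entries are assumptions made for illustration, not a copy of the grammar action in src/alasqlparser.jison.

```javascript
// Hypothetical stand-in for the parsed argument list of CORR(x, y).
const exprlist = [{ columnid: 'x' }, { columnid: 'y' }];

// Before: pop() keeps a single expression and drops the others.
// slice() copies the list so this demo does not mutate it.
const before = {
  aggregatorid: 'REDUCE',
  funcid: 'CORR',
  expression: exprlist.slice().pop(),
};

// After: the first expression stays in `expression`, and every argument
// expression is kept in `args`.
const after = {
  aggregatorid: 'REDUCE',
  funcid: 'CORR',
  expression: exprlist[0],
  args: exprlist,
};

// With a single argument, pop() and [0] select the same expression.
const single = [{ columnid: 'v' }];
const sameForOneArg = single.slice().pop() === single[0];

console.log(before.expression.columnid, after.args.length, sameForOneArg); // y 2 true
```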

Updated GROUP BY compiler (src/423groupby.js):

  • Stage 1 & 2: Detect multi-column aggregates and pass all column values to the user-defined function
  • Maintains backward compatibility with single-column aggregates
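
A minimal sketch of the stage 1 and stage 2 behavior described above. The accumulateGroup driver is hypothetical and is not the code generated by src/423groupby.js; it only illustrates evaluating every argument expression per row and spreading the values ahead of the accumulator and stage.

```javascript
// Hypothetical driver: mimics stages 1 and 2 for one group of rows.
function accumulateGroup(aggregate, argGetters, rows) {
  let accumulator;
  rows.forEach((row, index) => {
    // Evaluate every argument expression for this row.
    const values = argGetters.map((getValue) => getValue(row));
    // The first row of a group is stage 1, later rows are stage 2.
    const stage = index === 0 ? 1 : 2;
    accumulator = aggregate(...values, accumulator, stage);
  });
  return accumulator;
}

// Two-column aggregate: running sum of x * y.
function sumProduct(x, y, accumulator, stage) {
  return stage === 1 ? x * y : accumulator + x * y;
}

// Single-column aggregate with the existing (value, accumulator, stage) shape.
function sumValue(value, accumulator, stage) {
  return stage === 1 ? value : accumulator + value;
}

const rows = [{ x: 1, y: 2 }, { x: 3, y: 4 }];
const twoColumn = accumulateGroup(sumProduct, [(r) => r.x, (r) => r.y], rows);
const oneColumn = accumulateGroup(sumValue, [(r) => r.x], rows);

console.log(twoColumn, oneColumn); // 14 4
```

Because the number of values passed equals the number of argument expressions, a one-argument aggregate receives exactly the call it received before.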

Updated finalization logic (src/38query.js):

  • Stage 3: Passes one undefined placeholder per column argument, followed by the accumulator and the stage, so the function sees the same signature in every stage
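
The finalization call can be sketched as follows. The finalize helper and the meanOfPairs aggregate are hypothetical names used for illustration; the actual code in src/38query.js may be structured differently.

```javascript
// Hypothetical helper: builds the stage 3 call with one undefined per column.
function finalize(aggregate, columnCount, accumulator) {
  const placeholders = new Array(columnCount).fill(undefined);
  return aggregate(...placeholders, accumulator, 3);
}

// Two-column aggregate: mean of all x and y values seen.
function meanOfPairs(x, y, accumulator, stage) {
  if (stage === 1) {
    return { sum: x + y, count: 2 };
  }
  if (stage === 2) {
    accumulator.sum += x + y;
    accumulator.count += 2;
    return accumulator;
  }
  // Stage 3: x and y are undefined placeholders, only the accumulator is used.
  return accumulator.sum / accumulator.count;
}

const finalValue = finalize(meanOfPairs, 2, { sum: 10, count: 4 });

console.log(finalValue); // 2.5
```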

Added comprehensive tests (test/test2600.js):

  • 2-column CORR function (Pearson correlation)
  • 3-column WEIGHTED_AVG function
  • Backward compatibility test with single-column aggregate
  • NULL/NaN value handling

Usage

Users can now define custom aggregates with multiple columns:

alasql.aggr.CORR = function(valueX, valueY, accumulator, stage) {
    if (stage === 1) {
        // Initialize accumulator with first row
        return {sumX: valueX, sumY: valueY, sumXY: valueX * valueY, 
                sumX2: valueX * valueX, sumY2: valueY * valueY, count: 1};
    } else if (stage === 2) {
        // Accumulate subsequent rows
        accumulator.sumX += valueX;
        accumulator.sumY += valueY;
        accumulator.sumXY += valueX * valueY;
        accumulator.sumX2 += valueX * valueX;
        accumulator.sumY2 += valueY * valueY;
        accumulator.count++;
        return accumulator;
    } else if (stage === 3) {
        // Finalize and return result
        const numerator = accumulator.count * accumulator.sumXY - accumulator.sumX * accumulator.sumY;
        const denominator = Math.sqrt(
            (accumulator.count * accumulator.sumX2 - accumulator.sumX ** 2) *
            (accumulator.count * accumulator.sumY2 - accumulator.sumY ** 2)
        );
        return numerator / denominator;
    }
};

// Use in SQL
alasql('SELECT CORR(x, y) as correlation FROM dataset');
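
The example above divides without checking the denominator and does not skip NULL or NaN inputs. A possible defensive variant is sketched below, following the null-returning approach from the original issue. The corrSafe and runAggregate names are hypothetical, the driver only imitates the three-stage calls, and whether the PR's tests expect exactly this behavior is an assumption.

```javascript
// Variant that skips non-numeric pairs and returns null when the
// correlation is undefined (no valid pairs or zero variance).
function corrSafe(valueX, valueY, accumulator, stage) {
  if (stage === 1) {
    accumulator = { sumX: 0, sumY: 0, sumXY: 0, sumX2: 0, sumY2: 0, count: 0 };
  }
  if (stage === 1 || stage === 2) {
    if (Number.isFinite(valueX) && Number.isFinite(valueY)) {
      accumulator.sumX += valueX;
      accumulator.sumY += valueY;
      accumulator.sumXY += valueX * valueY;
      accumulator.sumX2 += valueX * valueX;
      accumulator.sumY2 += valueY * valueY;
      accumulator.count++;
    }
    return accumulator;
  }
  // Stage 3: finalize.
  if (!accumulator || accumulator.count === 0) {
    return null;
  }
  const { sumX, sumY, sumXY, sumX2, sumY2, count } = accumulator;
  const denominator = Math.sqrt(
    (count * sumX2 - sumX ** 2) * (count * sumY2 - sumY ** 2)
  );
  if (!(denominator > 0)) {
    return null;
  }
  return (count * sumXY - sumX * sumY) / denominator;
}

// Hypothetical driver imitating the three-stage call sequence for one group.
function runAggregate(aggregate, pairs) {
  let accumulator;
  pairs.forEach(([x, y], index) => {
    accumulator = aggregate(x, y, accumulator, index === 0 ? 1 : 2);
  });
  return aggregate(undefined, undefined, accumulator, 3);
}

const perfect = runAggregate(corrSafe, [[1, 2], [null, 5], [2, 4], [3, 6]]);
const constant = runAggregate(corrSafe, [[1, 7], [2, 7], [3, 7]]);

console.log(perfect, constant); // 1 null
```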

Backward Compatibility

Single-column user-defined aggregates continue to work exactly as before. The system automatically detects the number of arguments and handles them appropriately.

Original prompt

This section details the original issue you should resolve

<issue_title>Trying to implement CORR() using custom function logic, but it is not working</issue_title>
<issue_description>Hello. I am trying to implement CORR() using custom function logic provided here:
https://github.com/alasql/alasql/wiki/User-Defined-Functions

This is my code.

alasql.aggr.CORR = function(valueX, valueY, accumulator, stage) {
    if (stage === 1) {
        // Initialize the accumulator object
        accumulator = {
            sumX: 0,
            sumY: 0,
            sumXY: 0,
            sumX2: 0,
            sumY2: 0,
            count: 0
        };
        console.log('Stage 1: Initialized accumulator');
    }

    if (stage === 1 || stage === 2) {
        // Check if valueX and valueY are valid numbers
        if (typeof valueX === 'number' && typeof valueY === 'number') {
            accumulator.sumX += valueX;
            accumulator.sumY += valueY;
            accumulator.sumXY += valueX * valueY;
            accumulator.sumX2 += valueX * valueX;
            accumulator.sumY2 += valueY * valueY;
            accumulator.count++;
            console.log('Stage 1/2: Updated accumulator with values:', valueX, valueY);
        } else {
            console.log('Stage 1/2: Skipped non-numeric values:', valueX, valueY);
        }
    }

    if (stage === 3) {
        console.log('Stage 3: Final accumulator state:', accumulator);

        // Calculate the Pearson correlation coefficient
        const count = accumulator.count;
        const sumX = accumulator.sumX;
        const sumY = accumulator.sumY;
        const sumXY = accumulator.sumXY;
        const sumX2 = accumulator.sumX2;
        const sumY2 = accumulator.sumY2;

        const numerator = (count * sumXY) - (sumX * sumY);
        const denominatorX = Math.sqrt((count * sumX2) - (sumX * sumX));
        const denominatorY = Math.sqrt((count * sumY2) - (sumY * sumY));
        const denominator = denominatorX * denominatorY;

        console.log('Stage 3: Calculated values:');

        // Check if the denominator is zero or if there are no valid data points
        if (denominator === 0 || count === 0) {
            console.log('Stage 3: Undefined correlation');
            return null; // Return null to indicate undefined correlation
        }

        const result = numerator / denominator;
        console.log('Stage 3: Correlation coefficient:', result);
        return result;
    }

    return accumulator;
};

But somehow this returned {0: {corr: 3}}, which indicates the calculation is not being done properly, since the Pearson correlation always lies between -1 and 1.
Any guess or help to figure out what to do?
Thank you in advance.

</issue_description>

<agent_instructions>Please look into this issue and determine if there is a bug for us to solve, or if he is using SQL the wrong way. </agent_instructions>

Comments on the Issue (you are @copilot in this section)

@mathiasrw If you provide some example data and expected output it makes it easier to help out.

From the AlaSQL Bot

Your implementation of the CORR() function logic looks well-structured, but there seems to be an issue with the Pearson correlation coefficient calculation, possibly due to either the data or the implementation of the accumulation logic.

Here are a few potential issues to check:

  1. Data Points: Make sure that the data passed to CORR() contains valid numeric values and there are no undefined or non-numeric values. If valueX or valueY is non-numeric or undefined, it should not contribute to the calculation.

  2. Numerator and Denominator Calculation:

    • Numerator: Ensure that the numerator calculation (count * sumXY) - (sumX * sumY) is correct and makes sense given the data.
    • Denominator: Ensure that the denominator (denominatorX * denominatorY) does not lead to a division by zero. If denominatorX or denominatorY is zero, the correlation should be undefined, hence returning null (which you already handle).
  3. Edge Cases: Ensure your function handles edge cases, such as when there is only one data point or when all valueX or valueY are the same, which would result in a zero denominator.

Here's an improved implementation suggestion:

  • Verify the calculation at each step by logging intermediate results.
  • Ensure stage 1/2 accumulates data correctly.
  • Return null when invalid data (e.g., single data point or zero variance) is encountered.
alasql.aggr.CORR = function(valueX, valueY, accumulator, stage) {
    if (stage === 1) {
        // Initialize the accumulator object
        accumulator = {
            sumX: 0,
            sumY: 0,
            sumXY: 0,
            sumX2: 0...

</details>



- Fixes AlaSQL/alasql#1966

Copilot AI changed the title from "[WIP] Implement custom CORR() function logic" to "Add CORR() aggregate function for two-column correlation" Dec 22, 2025
Copilot AI requested a review from mathiasrw December 22, 2025 02:36
@mathiasrw
Member

@copilot There is a slight misunderstanding.

We don't want to hardcode the CORR function. We want to make it possible for users to create aggregate functions that take more than one column. I imagine we would have to look into managing multiple inputs in custom functions and then make sure they are passed to the function in a generic way without breaking changes.

Contributor Author

Copilot AI commented Dec 22, 2025

@copilot There is a slight misunderstanding.

We don't want to hardcode the CORR function. We want to make it possible for users to create aggregate functions that take more than one column. I imagine we ...

You're absolutely right! I've updated the implementation to enable generic multi-column user-defined aggregates instead of hardcoding CORR.

The changes allow users to create custom aggregate functions with multiple columns by defining them like:

alasql.aggr.MY_FUNC = function(col1, col2, ..., accumulator, stage) {
    // Function logic with stage 1 (init), stage 2 (accumulate), stage 3 (finalize)
}

Then use them in SQL: SELECT MY_FUNC(column1, column2) FROM table

The implementation is backward compatible - single-column aggregates work exactly as before. All changes are in the parser and groupby compiler to pass multiple arguments through the 3-stage aggregation process.
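
The positional convention (column values first, then accumulator, then stage) can be sketched with rest parameters. The splitAggregateArgs helper and the sumAll aggregate are hypothetical, and the calls below are made by hand; whether AlaSQL inspects a function's declared parameter count is not stated here, so the rest-parameter form is illustrative only.

```javascript
// Hypothetical helper: splits positional arguments into column values,
// accumulator, and stage. The last two arguments are accumulator and stage.
function splitAggregateArgs(args) {
  const stage = args[args.length - 1];
  const accumulator = args[args.length - 2];
  const columns = args.slice(0, -2);
  return { columns, accumulator, stage };
}

// Sums every column value it is given, for any number of columns.
function sumAll(...args) {
  const { columns, accumulator, stage } = splitAggregateArgs(args);
  if (stage === 3) {
    // Finalize: the column slots are undefined placeholders here.
    return accumulator;
  }
  const rowTotal = columns.reduce((total, value) => total + value, 0);
  return stage === 1 ? rowTotal : accumulator + rowTotal;
}

let acc = sumAll(1, 2, 3, undefined, 1); // stage 1 with three columns
acc = sumAll(4, 5, 6, acc, 2); // stage 2
const total = sumAll(undefined, undefined, undefined, acc, 3); // stage 3

console.log(total); // 21
```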

See commit 2e20581 for the full implementation with test cases for 2-column and 3-column user-defined aggregates.

Copilot AI changed the title from "Add CORR() aggregate function for two-column correlation" to "Enable multi-column user-defined aggregate functions" Dec 22, 2025
Comment on lines +100 to +101
assert.deepEqual(res.length, 1);
assert(Math.abs(res[0].corr - 1) < 0.0001, 'Expected correlation close to 1');
Member

I want you to correlate a deep equal to the exact output you are expecting. This applies to all tests.
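
A sketch of what an exact comparison could look like. The pearson helper and the sample data are hypothetical and only stand in for the query result; for integer data that is perfectly linear, such as the pairs below, the sums-based formula evaluates to exactly 1, so no tolerance is needed.

```javascript
const { deepEqual } = require('assert');

// Hypothetical stand-in for the aggregate: sums-based Pearson correlation.
function pearson(pairs) {
  let sumX = 0, sumY = 0, sumXY = 0, sumX2 = 0, sumY2 = 0;
  for (const [x, y] of pairs) {
    sumX += x;
    sumY += y;
    sumXY += x * y;
    sumX2 += x * x;
    sumY2 += y * y;
  }
  const n = pairs.length;
  const denominator = Math.sqrt((n * sumX2 - sumX ** 2) * (n * sumY2 - sumY ** 2));
  return (n * sumXY - sumX * sumY) / denominator;
}

// Stand-in for the rows a query such as SELECT CORR(x, y) AS corr would return.
const res = [{ corr: pearson([[1, 2], [2, 4], [3, 6]]) }];

// Exact comparison against the full expected output instead of a tolerance check.
deepEqual(res, [{ corr: 1 }]);
```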

Member

Can we please also test with data coming in as parameters (corr(?,?))?

Member

Do we have any tests of 3 or 4 parameters?
