Closed
99 commits
c6ebd06
add mising util dir to packages list in pyproject.toml
Mateusz-Switala Apr 16, 2026
b6e16e3
fix(components, automl): Use svg image as HTML <img> in AutoML notebooks
DorotaDR Apr 13, 2026
bcc83b3
Merge pull request #39 from Mateusz-Switala/fix-add-utils-to-packages…
openshift-merge-bot[bot] Apr 16, 2026
50c4c80
fix: Remove redundant `--extra test` flag in Dockerfile's `uv sync` c…
hbelmiro Apr 16, 2026
4d8f5a9
chore: update autogluon package versions to 1.5.0+rhaiv.2 in requirem…
DorotaDR Apr 20, 2026
01d2b84
Merge pull request #41 from DorotaDR/fix_automl_banners_odh
openshift-merge-bot[bot] Apr 21, 2026
68e18f5
fix NaNs in AutoML: regression
LukaszCmielowski Apr 21, 2026
bed1121
duplicates and infs handling + unit tests
LukaszCmielowski Apr 21, 2026
830a06f
pylint check fixes
LukaszCmielowski Apr 21, 2026
71f50ed
more pylint checks
LukaszCmielowski Apr 21, 2026
350a869
clean frame after reading
LukaszCmielowski Apr 21, 2026
16880e1
unit tests added
LukaszCmielowski Apr 21, 2026
6b57fd7
update docstrings
LukaszCmielowski Apr 21, 2026
3ce9e4c
PR comments submitted by claude addressed
LukaszCmielowski Apr 21, 2026
246fbfe
update test configs based on review comments
LukaszCmielowski Apr 22, 2026
5f3ac73
keep cleansing in data loading component only
LukaszCmielowski Apr 22, 2026
2b0d44f
ruff check fix
LukaszCmielowski Apr 22, 2026
0b8acb5
Add extra-index-url for RH test index
DorotaDR Apr 22, 2026
cd2437b
update the inf handling from math to np
LukaszCmielowski Apr 22, 2026
eadbe00
Revert "update the inf handling from math to np"
LukaszCmielowski Apr 22, 2026
91d7b65
Merge pull request #47 from LukaszCmielowski/fix_automl_nans
openshift-merge-bot[bot] Apr 22, 2026
9b1ae78
Merge pull request #44 from DorotaDR/fix-autogluon-v150-rhaiv2
openshift-merge-bot[bot] Apr 22, 2026
065e1df
align model.json between tabular and timeseries
Mateusz-Switala Apr 23, 2026
e6e5d3d
refactor: Update S3 model download logic in notebooks for improved pa…
DorotaDR Apr 21, 2026
58f2409
chore(automl): added upper limitation for resources
DorotaDR Apr 24, 2026
dc41f88
change field name to single
Mateusz-Switala Apr 24, 2026
1e3be86
chore(autorag, automl): added upper limitation for resources
DorotaDR Apr 24, 2026
73274f7
chore: ruff changes
DorotaDR Apr 24, 2026
fe4396d
refactor(pipelines): Defined constants for max cpus and memory
DorotaDR Apr 27, 2026
08239e2
Merge pull request #49 from Mateusz-Switala/fix_align_tabula_timesere…
openshift-merge-bot[bot] Apr 27, 2026
9e4898d
Merge pull request #50 from DorotaDR/dev-limit-max-resources
openshift-merge-bot[bot] Apr 27, 2026
448c82a
Merge pull request #51 from DorotaDR/fix_automl_58719
openshift-merge-bot[bot] Apr 28, 2026
53207ac
Update training/automl file to use kfp 2.16.0
Wojciech-Rebisz Apr 30, 2026
d94f89b
Update training/autorag files to use kfp 2.16.0
Wojciech-Rebisz Apr 30, 2026
8cf4b06
fix logic mixup triggering MPS; modified MPS phase returned values to…
filip-komarzyniec Apr 30, 2026
a0e7886
Update ai4rag to 0.5.5
jakub-walaszczyk Apr 30, 2026
1b5e285
Update according to code rabbit review
jakub-walaszczyk Apr 30, 2026
f0462e3
Downgrade protobuf to v6.33.6
Wojciech-Rebisz Apr 30, 2026
6ef8903
Inference notebook template and source code (nb generation part) upda…
filip-komarzyniec May 4, 2026
88523ae
chore(pipelines): Removed .git, Readme.md and .png files from list of…
DorotaDR May 4, 2026
973117c
Merge pull request #56 from filip-komarzyniec/RHOAIENG-59759-AutoRAG-…
openshift-merge-bot[bot] May 4, 2026
7f94b91
Merge pull request #59 from filip-komarzyniec/RHOAIENG-58373-lack-of-…
openshift-merge-bot[bot] May 4, 2026
0489554
Merge pull request #60 from DorotaDR/autorag-rhoai34-remove-additiona…
openshift-merge-bot[bot] May 4, 2026
191b41e
Update ai4rag version in notebooks
jakub-walaszczyk May 4, 2026
f1f4d64
Merge main
jakub-walaszczyk May 4, 2026
24128e6
Set fixed version in metadata.yaml
Wojciech-Rebisz May 4, 2026
9f896c7
Merge pull request #58 from jakub-walaszczyk/update-ai4rag-0.5.5
openshift-merge-bot[bot] May 5, 2026
a697f27
Merge pull request #55 from Wojciech-Rebisz/dev-use-kfp-2-16
openshift-merge-bot[bot] May 5, 2026
41d03bf
move tmp changes to run integration tests in disconnected env
Mateusz-Switala May 5, 2026
6bb5234
fix after coderabbit review
Mateusz-Switala May 5, 2026
490d490
Add required packages to requirements
Wojciech-Rebisz May 5, 2026
9cb2443
Remove --extra-index-url
Wojciech-Rebisz May 5, 2026
faf67e6
make AWS_DEFAULT_REGION optional
witold-nowogorski May 6, 2026
d0ab74f
update README.md
witold-nowogorski May 6, 2026
8255e05
Merge pull request #61 from Mateusz-Switala/tests-automl-integration-…
openshift-merge-bot[bot] May 6, 2026
d805630
Merge pull request #63 from witold-nowogorski/main
openshift-merge-bot[bot] May 6, 2026
3f35e70
Sort AutoRAG requirements
Wojciech-Rebisz May 7, 2026
ca4a219
Update requirements to use kfp 2.16.1
Wojciech-Rebisz May 7, 2026
7d44deb
Update metadata&readmes to use kfp 2.16.1
Wojciech-Rebisz May 7, 2026
106a26f
Update lock to kfp 2.16.1
Wojciech-Rebisz May 7, 2026
e08731b
add DorotaDR to the owners files
Mateusz-Switala May 7, 2026
510180c
Merge pull request #65 from Mateusz-Switala/autox-add-approver
openshift-merge-bot[bot] May 7, 2026
72d878b
Update aiohttp to v3.13.5
Wojciech-Rebisz May 7, 2026
c2cce6e
fix error causing unboundLocalError when MPS path was triggered
filip-komarzyniec May 7, 2026
98f4eb3
Revert "Update lock to kfp 2.16.1"
Wojciech-Rebisz May 7, 2026
ae3c662
Update kfp in pyproject.toml
Wojciech-Rebisz May 7, 2026
e067ab4
Brought back some changes introduced in PR #59 (and mistakenly deleted)
filip-komarzyniec May 7, 2026
4a01553
Update pillow to safe version
Wojciech-Rebisz May 7, 2026
417c99d
dictionary initialisation changed in nb template so that it does not …
filip-komarzyniec May 7, 2026
b64aceb
Update kfp-kubernetes to v2.16.1
Wojciech-Rebisz May 8, 2026
90d1361
special guard for ranker_strategy param (it is always a string in tem…
filip-komarzyniec May 8, 2026
20f0a72
Merge pull request #66 from filip-komarzyniec/61185-unboundLocalError
openshift-merge-bot[bot] May 8, 2026
f847b8f
Rebranding Llama Stack to ogx, removal of in-memory scenario, change …
jakub-walaszczyk May 11, 2026
e3df171
Merge main
jakub-walaszczyk May 11, 2026
f5a3923
Update notebook
jakub-walaszczyk May 11, 2026
4548b9b
static code checks fixes
jakub-walaszczyk May 11, 2026
570c3e4
Lint updates
jakub-walaszczyk May 11, 2026
06379b9
Merge pull request #62 from Wojciech-Rebisz/dev-add-missing-kfp-depen…
openshift-merge-bot[bot] May 11, 2026
4cdb465
Update metadata and .toml
jakub-walaszczyk May 11, 2026
cde7abd
Updat README.md
jakub-walaszczyk May 11, 2026
9aefdf4
Adding support for Hermetic builds in midstream
nsingla Apr 29, 2026
3039cd1
Update for ogx 1.0.0 and ai4rag 0.6.1
jakub-walaszczyk May 13, 2026
932cb2e
fix(autorag): support pre-compiled pipeline YAML override and retry S…
angaduom May 13, 2026
e125dca
fix(autorag): address review feedback on PR #71
angaduom May 13, 2026
7eb5f3c
fix(autorag): respect KFP_VERIFY_SSL in s3_client fixture
angaduom May 14, 2026
8edb421
Update metadata, validation script with allowed ~= operator and noteb…
jakub-walaszczyk May 14, 2026
99e759a
Merge main
jakub-walaszczyk May 14, 2026
40953db
Update after code rabbit review
jakub-walaszczyk May 14, 2026
51e752e
Rename variables in consts.py
Wojciech-Rebisz May 14, 2026
f8c1da6
Update missing param in patter.json
jakub-walaszczyk May 14, 2026
5cb16fa
Update after code rabbit review
jakub-walaszczyk May 14, 2026
faf7d62
Update tests for new parameter constraints
jakub-walaszczyk May 14, 2026
c799309
style(autorag): fix ruff formatting in test_pipeline_integration.py
angaduom May 14, 2026
9ab65b8
Merge pull request #67 from jakub-walaszczyk/rebranding-llama-stack-t…
openshift-merge-bot[bot] May 14, 2026
41d5e31
Merge pull request #71 from angaduom/fix/autorag-pipeline-override-an…
openshift-merge-bot[bot] May 14, 2026
b0ca633
Merge pull request #72 from Wojciech-Rebisz/dev-rename-autox-variables
openshift-merge-bot[bot] May 14, 2026
6df3cc2
implement pr suggestions
nsingla May 12, 2026
66be6f9
Merge pull request #68 from nsingla/RHOAIENG-59984-odh
nsingla May 15, 2026
d2ba48e
Merge remote-tracking branch 'opendatahub-io/main' into merge-from-od…
hbelmiro May 15, 2026
30 changes: 30 additions & 0 deletions .github/workflows/sync-requirements.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
---
name: Check requirements.txt

on:
  pull_request:
    paths:
      - pyproject.toml
      - uv.lock
      - requirements.txt

permissions:
  contents: read

concurrency:
  group: check-requirements-${{ github.head_ref }}
  cancel-in-progress: true

jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6

      - uses: astral-sh/setup-uv@37802adc94f370d6bfd71619e3f0bf239e1f3b78 # v7.6.0

      - name: Verify requirements.txt is up-to-date
        run: |
          make requirements
          git diff --exit-code requirements.txt \
            || { echo ""; echo "requirements.txt is out of sync."; echo "Run 'make requirements' and commit the result."; exit 1; }

Check warning on line 30 in .github/workflows/sync-requirements.yml (GitHub Actions / yaml-lint):
30:121 [line-length] line too long (131 > 120 characters)
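The workflow above regenerates `requirements.txt` and fails when the result differs from the committed file. As a rough illustration of that comparison in Python (the function name `requirements_in_sync` and the sample strings are hypothetical and not part of this PR; the workflow itself relies on `git diff --exit-code`):

```python
import difflib


def requirements_in_sync(committed: str, regenerated: str) -> bool:
    """Return True when the regenerated text matches the committed text."""
    diff = list(
        difflib.unified_diff(
            committed.splitlines(),
            regenerated.splitlines(),
            fromfile="requirements.txt (committed)",
            tofile="requirements.txt (regenerated)",
            lineterm="",
        )
    )
    # Print the differences so a failing check shows what changed.
    for line in diff:
        print(line)
    return not diff


# Placeholder contents, used only to exercise the function.
in_sync = requirements_in_sync("pkg-a==1.0\n", "pkg-a==1.0\n")
out_of_sync = requirements_in_sync("pkg-a==1.0\n", "pkg-a==1.1\n")
```

A caller would turn a `False` result into a non-zero exit status, which is the role `exit 1` plays in the shell step.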
6 changes: 6 additions & 0 deletions .pre-commit-config.yaml
@@ -8,6 +8,12 @@ repos:
    language: system
    files: (^pyproject\.toml$|^uv\.lock$)
    pass_filenames: false
  - id: sync-requirements
    name: sync requirements
    entry: make requirements
    language: system
    files: (^pyproject\.toml$|^uv\.lock$)
    pass_filenames: false
  - id: ruff-format
    name: ruff format
    # --force-exclude respects pyproject.toml excludes when files are passed directly
12 changes: 10 additions & 2 deletions .tekton/odh-pipelines-components-ci-on-pull-request.yaml
@@ -6,7 +6,7 @@ metadata:
 build.appstudio.redhat.com/commit_sha: '{{revision}}'
 build.appstudio.redhat.com/target_branch: '{{target_branch}}'
 build.appstudio.redhat.com/pull_request_number: '{{pull_request_number}}'
-pipelinesascode.tekton.dev/cancel-in-progress: "false"
+pipelinesascode.tekton.dev/cancel-in-progress: "true"
 pipelinesascode.tekton.dev/max-keep-runs: "3"
 pipelinesascode.tekton.dev/on-cel-expression: event == "pull_request" && target_branch
   == "main"
@@ -26,9 +26,17 @@ spec:
 - name: output-image
   value: quay.io/opendatahub/odh-pipelines-components:odh-pr
 - name: dockerfile
-  value: Dockerfile
+  value: Dockerfile.konflux.pipelines-components
 - name: path-context
   value: .
+- name: hermetic
+  value: 'true'
+- name: prefetch-input
+  value: >-
+    {"type": "pip", "path": ".",
+    "requirements_files": ["requirements.txt"],
+    "requirements_build_files": ["requirements-build.txt"],
+    "binary": {"arch": ":all:"}}
 - name: additional-tags
   value:
   - 'odh-pr-{{revision}}'
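The `prefetch-input` value is a folded YAML scalar (`>-`), so its four lines are joined with spaces into a single JSON string. A minimal sketch, assuming only Python's `json` module, that parses that string and checks the keys used in this diff; `check_prefetch_input` is a hypothetical helper for local sanity checks, not something the pipeline provides:

```python
import json

# The folded scalar from the diff, written out as the single string YAML produces.
PREFETCH_INPUT = (
    '{"type": "pip", "path": ".", '
    '"requirements_files": ["requirements.txt"], '
    '"requirements_build_files": ["requirements-build.txt"], '
    '"binary": {"arch": ":all:"}}'
)


def check_prefetch_input(raw: str) -> dict:
    """Parse the value and confirm the keys used in this diff are present."""
    config = json.loads(raw)
    for key in ("type", "path", "requirements_files", "requirements_build_files"):
        if key not in config:
            raise KeyError(f"prefetch-input is missing {key!r}")
    return config


config = check_prefetch_input(PREFETCH_INPUT)
print(config["requirements_files"])
```

Parsing the value this way catches quoting mistakes in the YAML before a build is triggered.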
14 changes: 9 additions & 5 deletions .tekton/odh-pipelines-components-ci-on-push.yaml
@@ -25,13 +25,17 @@ spec:
 - name: output-image
   value: quay.io/opendatahub/odh-pipelines-components:odh-stable
 - name: dockerfile
-  value: Dockerfile
+  value: Dockerfile.konflux.pipelines-components
 - name: path-context
   value: .
-- name: build-platforms
-  value:
-  - linux/x86_64
-  - linux/aarch64
+- name: hermetic
+  value: 'true'
+- name: prefetch-input
+  value: >-
+    {"type": "pip", "path": ".",
+    "requirements_files": ["requirements.txt"],
+    "requirements_build_files": ["requirements-build.txt"],
+    "binary": {"arch": ":all:"}}
 pipelineRef:
   resolver: git
   params:
2 changes: 1 addition & 1 deletion Dockerfile
@@ -16,7 +16,7 @@ COPY utils/ utils/
 RUN chown -R 1001:1001 /app
 USER 1001

-RUN uv sync --no-cache --extra test
+RUN uv sync --no-cache

 RUN uv run python -m scripts.generate_managed_pipelines.generate_managed_pipelines

23 changes: 23 additions & 0 deletions Dockerfile.konflux.pipelines-components
@@ -0,0 +1,23 @@
FROM registry.redhat.io/ubi9/python-312@sha256:ff373f4b42b662e99954adea770ca87b4ea963186cc752174ccb94aa08fa702d

WORKDIR /app

USER root

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY pyproject.toml __init__.py ./
COPY components/ components/
COPY pipelines/ pipelines/
COPY scripts/ scripts/
COPY utils/ utils/

RUN chown -R 1001:1001 /app
USER 1001

RUN pip install --no-cache-dir --no-deps .

RUN python -m scripts.generate_managed_pipelines.generate_managed_pipelines

CMD ["python", "-m", "scripts.init_managed_pipelines.init_managed_pipelines"]
15 changes: 15 additions & 0 deletions Makefile
@@ -105,3 +105,18 @@ readme:

sync-packages:
	@$(UVRUN) python -m scripts.sync_packages.sync_packages

AIPCC_INDEX_URL := https://console.redhat.com/api/pypi/public-rhai/rhoai/3.4/cpu-ubi9/simple

requirements:
	echo "--index-url $(AIPCC_INDEX_URL)" > requirements.txt
	echo "" >> requirements.txt
	uv pip compile pyproject.toml --generate-hashes --no-header --no-annotate \
		--no-emit-package kfp-components \
		--python-version 3.12 \
		--index-url $(AIPCC_INDEX_URL) >> requirements.txt
	echo "--index-url $(AIPCC_INDEX_URL)" > requirements-build.txt
	echo "" >> requirements-build.txt
	printf 'setuptools\nwheel\n' | uv pip compile --generate-hashes --no-header --no-annotate \
		--python-version 3.12 \
		--index-url $(AIPCC_INDEX_URL) - >> requirements-build.txt
1 change: 1 addition & 0 deletions components/data_processing/automl/OWNERS
@@ -1,5 +1,6 @@
approvers:
- LukaszCmielowski
- DorotaDR
reviewers:
- Mateusz-Switala
- DorotaDR
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
approvers:
- LukaszCmielowski
- DorotaDR
reviewers:
- Mateusz-Switala
- DorotaDR
Original file line number Diff line number Diff line change
@@ -16,6 +16,11 @@ The component reads data in chunks to efficiently handle large files without loa

For **regression** tasks the split is random; for **binary** and **multiclass** tasks the split is **stratified** by the label column by default.

Rows with a missing label (NaN / empty in ``label_column``) are dropped after load and before splitting, so regression runs do not propagate null targets into splits or the ``sample_row`` JSON (stratified sampling already dropped per chunk; this applies the same rule to random and first-n-rows paths).

After sampling, **+/- infinity** values in the frame are replaced with **NaN** (same idea as AutoAI ``loadXy``), then **full-row duplicates** are dropped before the label drop and train/test split.
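
The order of these steps matters: infinities become NaN first, so a row whose label is infinite is later removed by the missing-label drop. Below is a stdlib-only sketch of that order on a list of dict rows. It is an illustration, not the component's pandas code, and `clean_rows` and the sample rows are hypothetical:

```python
import math


def clean_rows(rows, label):
    """Apply the three cleansing steps to a list of dict rows, in the documented order."""

    def is_nan(value):
        return isinstance(value, float) and math.isnan(value)

    def is_missing(value):
        # None, empty string, or float NaN all count as missing.
        return value is None or value == "" or is_nan(value)

    # 1. Replace +/- infinity with NaN.
    step1 = [
        {k: (float("nan") if isinstance(v, float) and math.isinf(v) else v) for k, v in row.items()}
        for row in rows
    ]

    # 2. Drop full-row duplicates, keeping the first occurrence.
    #    NaN != NaN in Python, so every NaN is mapped to one sentinel for comparison.
    seen, step2 = set(), []
    for row in step1:
        key = tuple("__NAN__" if is_nan(v) else v for v in row.values())
        if key not in seen:
            seen.add(key)
            step2.append(row)

    # 3. Drop rows whose label is missing.
    return [row for row in step2 if not is_missing(row[label])]


rows = [
    {"distance": 19.35, "price": 36.26},
    {"distance": 19.35, "price": 36.26},         # full-row duplicate, dropped
    {"distance": math.inf, "price": 52.90},      # inf feature becomes NaN, row is kept
    {"distance": 30.45, "price": float("nan")},  # missing label, row is dropped
    {"distance": 47.59, "price": math.inf},      # inf label becomes NaN, row is dropped
]
cleaned = clean_rows(rows, label="price")
print(len(cleaned))  # prints 2
```

Only the first row and the row with the infinite feature survive; the feature NaN is left in place because only the label column is checked for missing values.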

Authentication uses AWS-style credentials provided via environment variables (e.g. from a Kubernetes secret).

## Inputs 📥
@@ -90,6 +95,7 @@ def example_pipeline(
- **Owners**:
  - Approvers:
    - LukaszCmielowski
    - DorotaDR
  - Reviewers:
    - Mateusz-Switala
    - DorotaDR
47 changes: 47 additions & 0 deletions components/data_processing/automl/tabular_data_loader/component.py
@@ -45,6 +45,15 @@ def automl_data_loader( # noqa: D417
For **regression** tasks the split is random; for **binary** and **multiclass**
tasks the split is **stratified** by the label column by default.

Rows with a missing label (NaN / empty in ``label_column``) are dropped after load
and before splitting, so regression runs do not propagate null targets into splits
or the ``sample_row`` JSON (stratified sampling already dropped per chunk; this
applies the same rule to random and first-n-rows paths).

After sampling, **+/- infinity** values in the frame are replaced with **NaN** (same
idea as AutoAI ``loadXy``), then **full-row duplicates** are dropped before the
label drop and train/test split.

Authentication uses AWS-style credentials provided via environment variables
(e.g. from a Kubernetes secret).

@@ -67,6 +76,7 @@ def automl_data_loader( # noqa: D417
""" # noqa: E501
import io
import logging
import math
import os

import boto3
@@ -305,6 +315,43 @@ def load_data_in_batches(
label_column=label_column,
)

if label_column not in sampled_dataframe.columns:
raise ValueError(
f"Label column {label_column!r} not found in the dataset. "
f"Available columns: {list(sampled_dataframe.columns)}"
)

sampled_dataframe.replace([math.inf, -math.inf], float("nan"), inplace=True)

n_before_dedup = len(sampled_dataframe)
sampled_dataframe.drop_duplicates(inplace=True)
n_dup_dropped = n_before_dedup - len(sampled_dataframe)
if n_dup_dropped:
logger.info("Dropped %s full-row duplicate(s) (%s rows remaining).", n_dup_dropped, len(sampled_dataframe))

if sampled_dataframe.empty:
raise ValueError(
"No valid data rows remain after replacing infinite values and dropping duplicates. "
"The source CSV may contain only infinite/NaN values or duplicate rows."
)

n_before_drop = len(sampled_dataframe)
sampled_dataframe = sampled_dataframe.dropna(subset=[label_column])
n_dropped = n_before_drop - len(sampled_dataframe)
if n_dropped:
logger.info(
"Dropped %s row(s) with missing label in column %r before splitting (loaded %s rows, %s remaining).",
n_dropped,
label_column,
n_before_drop,
len(sampled_dataframe),
)
if sampled_dataframe.empty:
raise ValueError(
f"No rows remain after removing missing values in label column {label_column!r}. "
"Ensure the dataset has at least one row with a non-null label (e.g. empty cells in the target column)."
)

n_samples = len(sampled_dataframe)
logger.info("Read %d rows from s3://%s/%s (sampling_method=%s)", n_samples, bucket_name, file_key, sampling_method)

Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
Trip_Distance_km,Time_of_Day,Day_of_Week,Passenger_Count,Traffic_Conditions,Weather,Base_Fare,Per_Km_Rate,Per_Minute_Rate,Trip_Duration_Minutes,Trip_Price
19.35,Morning,Weekday,3.0,Low,Clear,3.56,0.8,0.32,53.82,36.2624
47.59,Afternoon,Weekday,1.0,High,Clear,,0.62,0.43,40.57,
36.87,Evening,Weekend,1.0,High,Clear,2.7,1.21,0.15,37.27,52.9032
30.33,Evening,Weekday,4.0,Low,,3.48,0.51,0.15,116.81,36.4698
,Evening,Weekday,3.0,High,Clear,2.93,0.63,0.32,22.64,15.618000000000002
8.64,Afternoon,Weekend,2.0,Medium,Clear,2.55,1.71,0.48,89.33,60.202799999999996
3.85,Afternoon,Weekday,4.0,High,Rain,3.51,1.66,,5.05,11.2645
43.44,Evening,Weekend,3.0,,Clear,2.97,1.87,0.23,,101.1216
30.45,Morning,Weekday,3.0,High,Clear,2.77,1.78,0.34,110.33,
35.7,Afternoon,Weekday,2.0,Low,Rain,3.39,1.52,0.47,,75.5657
,Morning,Weekday,4.0,,Clear,2.4,0.58,0.43,26.34,14.892
48.53,Night,Weekday,3.0,Low,Clear,4.78,,0.5,79.94,
41.79,Night,Weekend,3.0,High,Clear,4.6,1.77,0.11,86.95,88.13279999999999
11.4,Morning,Weekday,3.0,,Clear,4.12,,0.15,84.12,36.118
9.91,Evening,Weekday,2.0,High,Clear,2.32,1.26,0.34,41.72,28.991400000000002
9.99,Night,Weekday,4.0,High,Clear,4.33,0.85,0.43,34.0,27.441499999999998
15.91,Morning,Weekday,4.0,Low,Clear,4.42,1.77,0.21,114.93,56.716
26.71,Afternoon,Weekend,4.0,Low,Rain,4.3,1.59,0.2,111.18,69.0049
22.17,Night,,4.0,Low,Clear,2.34,1.97,0.41,57.59,69.6268
15.27,Morning,,,Low,Clear,3.93,0.73,0.12,,27.354300000000002
30.98,Afternoon,Weekend,1.0,Low,Rain,4.5,0.84,0.25,57.02,44.7782
7.84,Morning,Weekday,4.0,Medium,,3.73,0.82,0.3,53.8,26.298799999999996
105.94355003672595,Night,Weekend,2.0,Low,Rain,3.94,1.69,0.32,23.03,201.86950918612797
18.95,Night,Weekday,1.0,Low,Clear,3.38,0.78,0.39,54.04,39.2366
23.35,Night,,3.0,Low,Rain,3.59,0.6,0.24,66.8,33.632000000000005
39.47,Afternoon,Weekday,1.0,Low,Clear,,,0.35,7.59,83.69649999999999
10.78,Evening,,3.0,High,Rain,3.92,0.54,0.33,56.07,28.2443
138.09832791310237,Evening,Weekend,4.0,Medium,Rain,2.24,1.75,0.32,94.86,280.87730155406564
30.03,,Weekday,1.0,High,Clear,3.31,1.05,0.36,83.21,64.7971
3.28,Evening,Weekday,2.0,Medium,Clear,2.88,1.76,0.2,78.04,24.260800000000003
30.77,Morning,Weekday,1.0,Low,Clear,3.64,1.33,0.13,109.6,58.8121
9.36,Afternoon,Weekday,1.0,Medium,Clear,2.4,1.85,0.15,7.07,20.7765
4.19,Morning,Weekday,1.0,Low,Clear,4.07,1.89,0.19,69.06,
47.5,Morning,Weekend,,Low,Clear,4.39,0.51,0.3,95.55,57.28
Original file line number Diff line number Diff line change
@@ -7,6 +7,7 @@
import csv
import io
import json
import math
import random
from collections import Counter

@@ -73,9 +74,82 @@ def dropna(self, subset=None):
         if not subset:
             return self
         col_indices = [self._columns.index(c) for c in subset]
-        new_rows = [row for row in self._rows if all(row[i] != "" and row[i] is not None for i in col_indices)]
+
+        def _cell_missing(val) -> bool:
+            if val is None or val == "":
+                return True
+            try:
+                return math.isnan(float(val))
+            except (TypeError, ValueError):
+                return False
+
+        new_rows = [row for row in self._rows if all(not _cell_missing(row[i]) for i in col_indices)]
         return MockedDataFrame(self._columns, new_rows)

+    def replace(self, to_replace, value, inplace=False):
+        """Minimal ``DataFrame.replace``: map ±infinity to NaN (float), matching production pandas."""
+        inf_like = False
+        if isinstance(to_replace, (list, tuple)):
+            for x in to_replace:
+                if isinstance(x, float) and math.isinf(x):
+                    inf_like = True
+                    break
+        if not inf_like:
+            out = MockedDataFrame(self._columns, [list(r) for r in self._rows])
+            return None if inplace else out
+
+        def _map_cell(v):
+            try:
+                fv = float(v)
+                if math.isinf(fv):
+                    return float("nan")
+            except (TypeError, ValueError):
+                pass
+            return v
+
+        new_rows = [[_map_cell(c) for c in row] for row in self._rows]
+        out = MockedDataFrame(self._columns, new_rows)
+        if inplace:
+            self._columns = out._columns
+            self._rows = out._rows
+            return None
+        return out
+
+    def drop_duplicates(self, inplace=False):
+        """Drop full-row duplicates (first occurrence kept).
+
+        NaN in any cell is treated like pandas duplicate detection (two NaNs in the
+        same column positions count as equal), not Python ``tuple`` equality.
+        """
+
+        def _dedup_key_part(cell):
+            if isinstance(cell, float):
+                if math.isnan(cell):
+                    return "__PANDAS_NAN__"
+                return ("float", cell)
+            try:
+                fv = float(cell)
+                if math.isnan(fv):
+                    return "__PANDAS_NAN__"
+            except (TypeError, ValueError):
+                pass
+            return cell
+
+        seen: set[tuple] = set()
+        new_rows = []
+        for row in self._rows:
+            key = tuple(_dedup_key_part(c) for c in row)
+            if key in seen:
+                continue
+            seen.add(key)
+            new_rows.append(list(row))
+        out = MockedDataFrame(self._columns, new_rows)
+        if inplace:
+            self._columns = out._columns
+            self._rows = out._rows
+            return None
+        return out
+
     def _col_index(self, col):
         """Return the index of the given column name."""
         return self._columns.index(col)
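
`_dedup_key_part` maps NaN to a sentinel because a plain tuple of cells is not enough: two distinct NaN objects never compare equal. A small sketch of that behaviour, with `dedup_key` as a simplified, hypothetical stand-in for the mock's helper:

```python
import math


def dedup_key(row):
    """Build a hashable key where every NaN maps to one shared sentinel."""
    return tuple("__NAN__" if isinstance(c, float) and math.isnan(c) else c for c in row)


row_a = [3.85, float("nan")]
row_b = [3.85, float("nan")]  # same values, but a different NaN object

# Plain tuple comparison treats the rows as different because NaN != NaN.
print(tuple(row_a) == tuple(row_b))  # prints False

# With the sentinel, both rows produce the same key and count as duplicates.
print(dedup_key(row_a) == dedup_key(row_b))  # prints True
```

Without the sentinel, a `set` of row tuples would keep both rows, which would not match the pandas behaviour described in the `drop_duplicates` docstring.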