
Conversation

@MarekWadinger (Contributor) commented Mar 6, 2024

Hello @MaxHalford, @hoanganhngo610, and everyone 👋,

In #1366, @MaxHalford showed interest in an implementation of OnlinePCA and OnlineSVD methods in river.

Given my current project involvement with online decomposition methods, I believe the community could benefit from having access to these methods and their maintenance over time. Additionally, I am particularly interested in DMD, which combines the advantages of PCA and FFT. Hence, I propose the introduction of three new methods as part of the new decomposition module:

decomposition.OnlineSVD, implemented based on Brand, M. (2006) (proposed by @MaxHalford in the issue), with some considerations on re-orthogonalization. Since re-orthogonalization is required quite often, which compromises computation speed, it could be interesting to align with Zhang, Y. (2022) (I made some effort to implement it, but I have yet to explore its validity and the possibility of implementing revert in a similar vein).
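For context, here is a minimal NumPy sketch of Brand-style rank-one updating of a thin SVD when a new column arrives, with the optional QR re-orthogonalization step that dominates the cost. This only illustrates the general technique; the variable names and the re-orthogonalization policy are my own assumptions, not the code in this PR.

```python
import numpy as np


def svd_append_column(U, s, V, c, rank, reorthogonalize=False):
    """Rank-one update of a thin SVD A ~= U @ diag(s) @ V.T after appending column c."""
    m = U.T @ c                      # projection of c onto the current left subspace
    p = c - U @ m                    # component of c orthogonal to that subspace
    p_norm = np.linalg.norm(p)
    P = p / p_norm if p_norm > 1e-10 else np.zeros_like(p)

    # Small (k+1) x (k+1) core matrix whose dense SVD rotates the old factors.
    K = np.zeros((len(s) + 1, len(s) + 1))
    K[:-1, :-1] = np.diag(s)
    K[:-1, -1] = m
    K[-1, -1] = p_norm
    Uk, sk, Vkt = np.linalg.svd(K)

    U_new = np.hstack([U, P[:, None]]) @ Uk
    V_new = np.block([
        [V, np.zeros((V.shape[0], 1))],
        [np.zeros((1, V.shape[1])), np.ones((1, 1))],
    ]) @ Vkt.T

    # Numerical drift accumulates over many updates; re-orthonormalizing U
    # (valid while the drift is small) is the step that slows things down.
    if reorthogonalize:
        U_new, _ = np.linalg.qr(U_new)

    return U_new[:, :rank], sk[:rank], V_new[:, :rank]
```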

decomposition.OnlinePCA, implemented based on Eftekhari, A. (2019) (proposed by @MaxHalford in the issue), as it is currently state of the art, with proofs and guarantees. I would be happy to validate together whether all considerations are handled in the proposed OnlinePCA.
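I have not reproduced the Eftekhari (2019) algorithm here; as a rough illustration of the streaming-PCA update pattern the proposed OnlinePCA follows (one cheap rank-one update per sample, no full eigendecomposition), here is plain Oja's subspace rule with a QR renormalization:

```python
import numpy as np


def oja_step(W, x, lr=0.01):
    """One streaming-PCA step (Oja's subspace rule, not Eftekhari's method).
    W has shape (n_features, n_components); x is one centred sample."""
    y = W.T @ x                                       # project onto current components
    W = W + lr * (np.outer(x, y) - W @ np.outer(y, y))
    W, _ = np.linalg.qr(W)                            # keep the basis orthonormal
    return W
```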

decomposition.OnlineDMD, implemented based on Zhang, H. (2019). It can operate as a MiniBatchTransformer or a MiniBatchRegressor (sort of), and it works with Rolling, so I would need some help figuring out how we'd like to classify it (maybe a new base class, Decomposer).
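To make the core of OnlineDMD concrete, here is a minimal NumPy sketch of the rank-one update from Zhang et al. (2019), written against raw arrays rather than river's dict interface; the class and parameter names are illustrative assumptions, not the API of this PR.

```python
import numpy as np


class OnlineDMDSketch:
    """Rank-one online DMD (Zhang et al., 2019) with exponential forgetting."""

    def __init__(self, n_features, forgetting=1.0, init_scale=1e6):
        self.rho = forgetting                        # forgetting factor in (0, 1]
        self.A = np.zeros((n_features, n_features))  # current DMD operator estimate
        self.P = init_scale * np.eye(n_features)     # inverse-covariance state, weak prior

    def update(self, x, y):
        """Incorporate one snapshot pair (x, y) with y ~= A @ x."""
        Px = self.P @ x
        gamma = 1.0 / (1.0 + x @ Px)
        self.A += gamma * np.outer(y - self.A @ x, Px)
        self.P = (self.P - gamma * np.outer(Px, Px)) / self.rho

    def eig(self):
        """Eigenvalues and modes of the current operator."""
        return np.linalg.eig(self.A)
```

Fed with consecutive snapshot pairs (x_t, x_{t+1}), eig() gives continuously updated DMD eigenvalues and modes.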

Additionally, I propose preprocessing.Hankelizer, which could be beneficial for various regressors and is particularly useful for enriching the feature space with a time-delayed embedding.
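As a quick illustration of the idea only (the class name, the w parameter, and the key-naming scheme below are assumptions, not the PR's actual API), a dict-in/dict-out time-delay embedding could look roughly like this:

```python
from collections import deque


class HankelizerSketch:
    """Stack the current sample with its w - 1 most recent predecessors
    (hypothetical sketch of a time-delay / Hankel embedding transformer)."""

    def __init__(self, w=3):
        self.w = w
        self._buffer = deque(maxlen=w - 1)  # previously learned samples

    def learn_one(self, x):
        self._buffer.append(x)

    def transform_one(self, x):
        # Assumes transform_one(x) is called before learn_one(x) for the same sample;
        # while the buffer is warming up, the oldest available sample is repeated.
        past = list(self._buffer)
        pad = [past[0] if past else x] * (self.w - 1 - len(past))
        window = pad + past + [x]
        out = {}
        for lag, sample in enumerate(reversed(window)):  # lag 0 = current sample
            out.update({f"{k}_t-{lag}": v for k, v in sample.items()})
        return out
```

For a univariate stream {"y": ...} with w=3, each output would carry the keys y_t-0, y_t-1, and y_t-2.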

I've tried to include all necessary tests. However, I need to investigate why re-orthogonalization in OnlineSVD yields significantly different values when tested on various operating systems (locally, all tests pass).

Looking forward to your comments and reviews. 😌

MarekWadinger and others added 30 commits February 13, 2024 17:16
…eig + FIX: exponential w in learn many + MINOR: robustness
Standardization of input shapes
@review-notebook-app
Check out this pull request on ReviewNB to see visual diffs and provide feedback on the Jupyter notebooks.

@MarekWadinger (Contributor, Author)

Hello @MaxHalford and @hoanganhngo610, 👋

I believe the methods are ready for benchmarking. The results are published in this notebook.

In the plot below, I combine two checks: performance w.r.t. the number of features, and the delay imposed by the conversion from pd.DataFrame (dict) to the np.array used in the core.
[plot: perf-pd_np-n_features]

The mean number of processed samples per second is given here (averaged over n features in range(3, 20), as it remains fairly stable):

  • np.array
    3102 OnlineDMD
    19553 OnlinePCA
    631 OnlineSVD (probably to be fully replaced by the Zhang implementation below)
    3503 OnlineSVDZhang
  • pd.DataFrame
    1267 OnlineDMD
    18012 OnlinePCA
    683 OnlineSVD
    1718 OnlineSVDZhang

The results in the notebook indicate that using pd.DataFrame slows down OnlinePCA, which is the fastest decomposition implementation, by up to 14%. However, I believe your concerns are likely related to the fact that the core of the decomposition methods works with np.arrays, correct?
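For reference, here is a rough sketch of how the dict-vs-array throughput comparison above could be reproduced; the model.update / model.learn_one names in the commented lines are placeholders for whatever the final API exposes, not calls taken from this PR or from the notebook's benchmark code.

```python
import time

import numpy as np


def samples_per_second(learn, samples):
    """Time a bare learning loop and return its throughput."""
    start = time.perf_counter()
    for s in samples:
        learn(s)
    return len(samples) / (time.perf_counter() - start)


rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 10))                             # 10 features, 10k samples
X_dicts = [{f"x{i}": v for i, v in enumerate(row)} for row in X]  # dict form, as river passes it

# throughput_np = samples_per_second(model.update, X)              # core np.array path
# throughput_dict = samples_per_second(model.learn_one, X_dicts)   # dict/DataFrame path
```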

What are your thoughts on the performance and adequacy of the evaluation?

Thanks for your time 🙏

@s-bessing commented Jul 23, 2025

Is this still active?

@MarekWadinger (Contributor, Author) commented Jul 23, 2025

Hey @s-bessing. I would love to have this reviewed and published. I'm actively working with OnlineDMD.

@hoanganhngo610 @MaxHalford are you available to refresh discussion on this?

I'm ready to fix the checks if you could also provide some feedback on my latest comment. :)

Thx

@s-bessing

@MarekWadinger, thanks for the reply. I am currently working on an online topic model. For this, I came across the river package and like the approach. Currently, I use UMAP as a reducer, but I am not entirely satisfied with it since it is static.
As input, I use a growing list of small documents that I convert into embeddings. Would OnlineDMD be a suitable solution for this use case?

@MarekWadinger (Contributor, Author)

Hey @s-bessing.

I used to really like UMAP in my fault detection projects. I believe OnlineDMD could be a match, but there are some bottlenecks. It works much better on reasonably noisy data, as high autocorrelation may break the underlying SVD computation in the case of piecewise-constant behavior (this happens to me, for instance, in OnlineDMDc, where the control signal is noiseless and does not change for a while). But if there are periodic components and dominant patterns in your data, I think you should give it a shot. :)

@kulbachcedric (Contributor)

Hi @MarekWadinger and @s-bessing,
your PR is quite impressive!
However, we’re currently doing a cleanup of older Pull Requests in the river repository and wanted to check in with you.
If you plan to continue, feel free to push updates or let us know if you need any help or feedback.

Thanks a lot for your contribution!
