Skip to content

Further Croissant RDF integration #858

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

stefanches7
Copy link

@stefanches7 stefanches7 commented Apr 22, 2025

According to #848 comments there are further points in the croissant-rdf code integration

  • Use croissant module for getting some dependencies (but keep the package separate)
  • see the comment below
  • Switch to JAX convention for source tree
  • Improve notebook documentation (and fix networkx failure)

@stefanches7 stefanches7 requested a review from a team as a code owner April 22, 2025 10:22
Copy link

github-actions bot commented Apr 22, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@stefanches7 stefanches7 changed the title fix test imports according to the JAX style tree, fix Huggingface mockup path Croissant RDF integration Apr 22, 2025
@stefanches7 stefanches7 changed the title Croissant RDF integration Further Croissant RDF integration Apr 22, 2025
@stefanches7
Copy link
Author

About the common dependencies - I am not quite sure how it should work. Should I somehow reference mlcroissant\pyproject.toml in the croissant-rdf\pyproject.toml? @marcenacp maybe you would have an insight from browsing the previous PR codebase

cc @david4096

@marcenacp
Copy link
Contributor

About the common dependencies - I am not quite sure how it should work

@stefanches7 I meant that croissant-rdf code would be included in mlcroissant's library. But we didn't go that path. Thanks a lot for the follow up!

@@ -25,8 +25,7 @@ dependencies = [
"kaggle >=1.6.17",
"openml >=0.15.1",
"rich >=13.9.4",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Remove rich in favour of tqdm as pointed in the initial PR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We started with tqdm but moved to rich because it's... richer ;) Prefer we keep it otherwise we're sort of thrashing. https://medium.com/pythoneers/from-tqdm-to-rich-my-quest-for-better-progress-bars-afff39985ffc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's stick to rich then, and take the decision if we ever merge the code with mlcroissant.

@@ -3,7 +3,7 @@
import pytest
from rdflib import Graph

from croissant_rdf.providers import HuggingfaceHarvester
from croissant_rdf import HuggingfaceHarvester

OUTPUT_FILEPATH = "./tests/test_output.ttl"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We should avoid this pattern as pointed out in the initial review. This raises issues when distributing the tests or running them in parallel. Instead we should create temporary files within the test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same comment would apply to all tests in this PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, overlooked. Thanks for the link!

Copy link
Contributor

@marcenacp marcenacp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@david4096
Copy link
Contributor

This is really great to see, thanks @stefanches7 !

@stefanches7
Copy link
Author

About the common dependencies - I am not quite sure how it should work

@stefanches7 I meant that croissant-rdf code would be included in mlcroissant's library. But we didn't go that path. Thanks a lot for the follow up!

Discussed with @david4096 that dependending on the mlcroissant package is probably the way to go, especially since we plan to not only dissect CroissantLD into RDF, but also be able to go back. For that, we will use the mlcroissant functionality.

@marcenacp
Copy link
Contributor

dependending on the mlcroissant package is probably the way to go

If you ever need to do that, let's just merge both packages. Introducing a dependency would be too hard to maintain - for zero return as I underline in the first PR. For now, we can keep them distinct, ping me if we ever need mlcroissant from croissant-rdf.

Thanks again for the good work! Ready to merge once the temporary file issue is solved.

@stefanches7
Copy link
Author

ready to merge once the temporary file issue is solved

Thanks to being on Windows the NamedTemporaryFile issue popped up, people report all kind of headaches with writing / accessing: bravoserver/bravo#111 (comment) (and just see the list of mentions below this comment!).
I've employed a context approach, hope it works well in the long run.

@stefanches7
Copy link
Author

Could it be merged?

@stefanches7
Copy link
Author

How about merging this @marcenacp @ccl-core or does anything prevent this? Thanks

@ccl-core
Copy link
Contributor

ccl-core commented May 7, 2025

It seems like the CI tests are still failing? Is it related to this PR?

@stefanches7
Copy link
Author

@ccl-core no, those relate to the audio datasets this PR has nothing to do with.

Btw I think #857 aims to fix these tests, but it fails those itself too.

@ccl-core
Copy link
Contributor

ccl-core commented May 7, 2025

Thanks! I'll have a look at the other PR :)
For this PR, I'm ok with squashing and merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants