-
Notifications
You must be signed in to change notification settings - Fork 701
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-17023: Use Modern NLP Models via ONNX and Apache OpenNLP with Solr #1999
base: main
Are you sure you want to change the base?
Conversation
Check it out @jzonthemtn |
configurations.all { | ||
resolutionStrategy { | ||
force 'org.apache.opennlp:opennlp-tools:2.2.0' | ||
force 'org.apache.opennlp:opennlp-dl:2.2.0' | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need for force a resolution? This shouldn't be necessary at all...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need apache/lucene#448 to be merged to get us to 2.2, or we ahve to work around Lucene... Is there a better way? This is what Jeff and I hacked in..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try org.apache.opennlp:opennlp*=2.2.0
in versions.props and remove this from here. It should force the upgrade to 2.2.0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lots of errors when I took out the resolutionStrategy
:
* What went wrong:
Could not determine the dependencies of task ':solr:modules:analysis-extras:compileJava'.
> Could not resolve all task dependencies for configuration ':solr:modules:analysis-extras:compileClasspath'.
> Could not resolve org.apache.opennlp:opennlp-tools:2.2.0.
Required by:
project :solr:modules:analysis-extras
project :solr:modules:analysis-extras > org.apache.opennlp:opennlp-dl:2.2.0
> Cannot find a version of 'org.apache.opennlp:opennlp-tools' that satisfies the version constraints:
Dependency path 'org.apache.solr:analysis-extras:10.0.0-SNAPSHOT' --> 'org.apache.opennlp:opennlp-tools:2.2.0'
Constraint path 'org.apache.solr:analysis-extras:10.0.0-SNAPSHOT' --> 'org.apache.solr:core:10.0.0-SNAPSHOT' (apiElements) --> 'org.apache.opennlp:opennlp-tools:1.9.4' because of the following reason: Computed from com.palantir.consistent-versions' versions.lock in solr-root
Constraint path 'org.apache.solr:analysis-extras:10.0.0-SNAPSHOT' --> 'org.apache.solr:solrj:10.0.0-SNAPSHOT' (apiElements) --> 'org.apache.opennlp:opennlp-tools:1.9.4' because of the following reason: Computed from com.palantir.consistent-versions' versions.lock in solr-root
Dependency path 'org.apache.solr:analysis-extras:10.0.0-SNAPSHOT' --> 'org.apache.lucene:lucene-analysis-opennlp:9.8.0' (compile) --> 'org.apache.opennlp:opennlp-tools:1.9.1'
also, feel free to push up changes to this branch ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure to run ./gradlew --write-locks
to make sure it takes into account versions.props
changes.
versions.props
Outdated
@@ -70,3 +70,4 @@ org.semver4j:semver4j=5.2.1 | |||
org.slf4j:*=2.0.9 | |||
org.xerial.snappy:snappy-java=1.1.10.5 | |||
software.amazon.awssdk:*=2.20.155 | |||
org.apache.opennlp:opennlp-dl=2.2.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
try org.apache.opennlp:opennlp*=2.2.0
Thanks @risdenk , your suggestions worked. One slight wrinkle that I don't know how to deal with is that the |
|
This is after uploading the model to the PackageStore and reference it in there! |
@test "Check lifecycle of sentiment classification" { | ||
|
||
# GPU versions is linux and windows only, not OSX. So swap jars. | ||
rm -f ${SOLR_TIP}/modules/analysis-extras/lib/onnxruntime_gpu-1.14.0.jar |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just always run on cpu? Why need to test if there is a capable GPU or not? Could be on a VM without a GPU too even w/ linux and windows...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for the purposes of this, I totally agree.. I think I need some Gradle help to figure this out ;-( I've banged at it with no luck.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just CPU is definitely a better choice here. And just a note, ONNX Runtime does not currently support Mac silicon so this will only work on x64 machines.
OpenNLP brings in the onnxruntime-gpu
dependency. I will open a PR in OpenNLP to use onnxruntime
instead for all of the stated reasons.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there is a workaround, please let me know...!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Release is done. New artifacts should be available soon ;-) (not depending on gpu anymore)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!!!
rm -f ${SOLR_TIP}/modules/analysis-extras/lib/onnxruntime_gpu-1.14.0.jar | ||
# restore | ||
#curl --insecure -o ${SOLR_TIP}/modules/analysis-extras/lib/onnxruntime-1.14.0.jar https://repo1.maven.org/maven2/com/microsoft/onnxruntime/onnxruntime/1.14.0/onnxruntime-1.14.0.jar | ||
cp /Users/epugh/Documents/projects/solr-epugh/onnxruntime-1.14.0.jar ${SOLR_TIP}/modules/analysis-extras/lib/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is using your path...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah... I was in the back of David Smiley's car in the wilds of New Brunswick Canada while I was working on this and didn't want to burn up my data plan ;-). Likewise I need to figure out a better apporoach with the model since it's 600 mb.... I don't know if I can find a toy one that is super small, or if I add docs to download it from hugging facet etc. Longer term, I doin't even think this bats test gets committed the way it is because of all the external dependencies.. Though.. if I had a super small model that I could check into Git, maybe that owuld be okay? Like we have a binary package for some of the package tests..
@risdenk @jzonthemtn so, any ideas how to force the |
also, i am seeing two errors on looking up licenses...
|
@jzonthemtn did open https://issues.apache.org/jira/browse/OPENNLP-1515, but that will require a release of OpenNLP before we can piggy back on it |
I think the first draft is done, and it "works". There are some caveats and things to be figured out, like the |
…n the OpenNLP project
This PR had no visible activity in the past 60 days, labeling it as stale. Any new activity will remove the stale label. To attract more reviewers, please tag someone or notify the [email protected] mailing list. Thank you for your contribution! |
This PR is now closed due to 60 days of inactivity after being marked as stale. Re-opening this PR is still possible, in which case it will be marked as active again. |
this is still valid ;-) |
I updated this PR with main, to see what happened, and we're closer. Things that I think are still holding us back:
|
// Initialize the categorizer. | ||
FileStore fs = req.getCoreContainer().getFileStore(); | ||
|
||
var path = solrHome.resolve(ClusterFileStore.FILESTORE_DIRECTORY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gerlowskija this is the line I was referenceing, where the below modelFile
and vocabFile
feel like they are not using the ClusterFileStore properly....
Yep, need to wait for Lucene 10, otherwise we get some unit test failures:
|
https://issues.apache.org/jira/browse/SOLR-17023
Description
Code and BATS test that demonstrates downloading a sentiment classification model from Huggingface, converting it to Onnx model, uploading the model to Solr's FileStore, and then configuring a collection and associated pipeline to use the model for enriching content.
Solution
New DocumentCategorizerUpdateProcessorFactory that interacts with the model.
Tests
BATS test. Check this PR out and run:
./gradlew integrationTests --tests test_opennlp.bats
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.