Loosen model interface checks #1621

Open: shkwsk wants to merge 4 commits into tensorflow:master from shkwsk:shkwsk-patch1

Conversation

@shkwsk (Contributor) commented May 9, 2020

Hello.

This patch loosens the model interface check in the gRPC Predict API.
If the keys of "request.inputs()" cover the required keys of "signature.inputs()", the request can be served; any extra inputs are simply ignored.
This allows a running gRPC client to send additional features at any time, even though they are not used for prediction.
The advantage of this patch is that the model's API remains backward compatible when A/B testing a model with added features on a running service.
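
To illustrate the idea, here is a minimal sketch of the loosened check; the function and type names are hypothetical, not the exact code touched by this PR:

#include <map>
#include <string>

// Sketch: accept a request whose input keys are a superset of the
// signature's required inputs, instead of requiring an exact key match.
template <typename RequestMap, typename SignatureMap>
bool RequestSatisfiesSignature(const RequestMap& request_inputs,
                               const SignatureMap& signature_inputs) {
  for (const auto& entry : signature_inputs) {
    if (request_inputs.find(entry.first) == request_inputs.end()) {
      return false;  // A required input is missing: reject the request.
    }
  }
  return true;  // All required inputs present; extras are ignored.
}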

Let me know what you think.

@geraldstanje commented May 15, 2020

@shkwsk
I also ran into the same issue. Please let me know if the following is what you mean:

My TFS model uses 10 features, but my gRPC request sends 11 features (the 10 features the model needs, plus 1 additional feature called "dummy" that the model does not need).
--> TF Serving fails with the following error:

{
   "error": "Failed to process element: 0 key: dummy of 'instances' list. Error: Invalid argument: JSON object: does not have named input: dummy"
}
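
For concreteness, a hypothetical REST predict payload that would trigger this error (feature names are illustrative: assume the model's signature defines f1, f2, and f3, but not "dummy"):

{
  "instances": [
    {"f1": 1.0, "f2": 2.0, "f3": 3.0, "dummy": 0.0}
  ]
}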

When you want to do A/B testing as follows, let's say the gRPC client sends a request with 20 features:

  • modelA (needs 10 of the features from that request)
  • modelB (needs 5 of the features from that request)

Currently, TensorFlow Serving would not allow that.

cc @rmothukuru

@shkwsk (Contributor, Author) commented May 16, 2020

@geraldstanje
Thanks for your comment.

It seems to be the same issue as mine, but from the error message you posted, it looks like you're using the RESTful API.
This patch loosens the check on the keys of the protocol buffer map in the gRPC API.

A similar fix may be possible for the RESTful API, but I don't have a good approach right now.
In #1622 I modified the RESTful API to use protocol buffers; you can use it in conjunction with this PR to resolve your error message. :)

@geraldstanje commented May 16, 2020

@shkwsk Thanks for the PR; let's get it merged.
I'm using gRPC, but in this case I only tested with the REST API...
That PR will make testing different models possible.

@shkwsk (Contributor, Author) commented May 17, 2020

@geraldstanje
I won't merge that PR into this one, because the issues they solve are different.
Essentially, it would be better to loosen the checks in the RESTful API as well.

In any case, I'll wait for the official reviewers' comments.

@geraldstanje commented May 17, 2020

@shkwsk I think both the gRPC and REST interfaces need to implement the same checks...

cc @misterpeddy @lilao

@shkwsk (Contributor, Author) commented May 23, 2020

@geraldstanje
I tried loosening the checks in the RESTful API as well.
Unfortunately, I can't prepare a model that handles the instances format, so I haven't tested it.
Could you please check whether the error still occurs?

@geraldstanje commented

@shkwsk OK, let me test over the weekend.

@geraldstanje commented Jun 6, 2020

@shkwsk Do you have a script that builds the TFS Docker image based on your branch?

@shkwsk (Contributor, Author) commented Jun 7, 2020

@geraldstanje
I think you can switch branches and build as documented:
https://github.com/tensorflow/serving/blob/master/tensorflow_serving/g3doc/setup.md#build
I've also merged in the latest commits, so the branch can be built.

@geraldstanje commented Jun 7, 2020

@shkwsk I tried, but that failed, since Dockerfile.devel clones https://github.com/tensorflow/serving... but your code is in https://github.com/shkwsk/serving/tree/shkwsk-patch1

#!/bin/bash

USER=geri
TF_SERVING_VERSION_GIT_BRANCH=shkwsk-patch1
TAG=shkwsk-patch1

git clone --branch="${TF_SERVING_VERSION_GIT_BRANCH}" https://github.com/shkwsk/serving

# Build the devel image, then build the serving image from it.
cd serving && \
  docker build --pull -t "$USER/tensorflow-serving-devel:$TAG" \
    --build-arg TF_SERVING_VERSION_GIT_BRANCH="${TF_SERVING_VERSION_GIT_BRANCH}" \
    -f tensorflow_serving/tools/docker/Dockerfile.devel .

# Already inside ./serving after the cd above, so no second cd is needed.
docker build -t "$USER/tensorflow-serving:$TAG" \
  --build-arg TF_SERVING_BUILD_IMAGE="$USER/tensorflow-serving-devel:$TAG" \
  -f tensorflow_serving/tools/docker/Dockerfile .

any idea?

@shkwsk (Contributor, Author) commented Jun 7, 2020

@geraldstanje
How about temporarily rewriting Dockerfile.devel?
Could you edit it as follows after the git clone?

diff --git a/tensorflow_serving/tools/docker/Dockerfile.devel b/tensorflow_serving/tools/docker/Dockerfile.devel
index fb247c1..3ac4a35 100644
--- a/tensorflow_serving/tools/docker/Dockerfile.devel
+++ b/tensorflow_serving/tools/docker/Dockerfile.devel
@@ -88,8 +88,8 @@ RUN mkdir /bazel && \

 # Download TF Serving sources (optionally at specific commit).
 WORKDIR /tensorflow-serving
-RUN git clone --branch=${TF_SERVING_VERSION_GIT_BRANCH} https://github.com/tensorflow/serving . && \
-    git remote add upstream https://github.com/tensorflow/serving.git && \
+RUN git clone --branch=${TF_SERVING_VERSION_GIT_BRANCH} https://github.com/shkwsk/serving . && \
+    git remote add upstream https://github.com/tensorflow/serving.git && \
     if [ "${TF_SERVING_VERSION_GIT_COMMIT}" != "head" ]; then git checkout ${TF_SERVING_VERSION_GIT_COMMIT} ; fi

@geraldstanje commented Jun 15, 2020

@shkwsk I tested both the gRPC and REST APIs; both worked fine.

Please take a look: @misterpeddy @lilao @netfs

@shkwsk (Contributor, Author) commented Jun 15, 2020

@geraldstanje Thanks for testing!

@shkwsk (Contributor, Author) commented Aug 13, 2020

@misterpeddy @lilao @netfs Could you please review and merge this PR?

@netfs (Collaborator) commented Dec 3, 2020

I think having a different number of features in the request compared to what the model expects is usually an indicator of training-vs-serving "skew".

You are assuming here that it's kosher to send a request with more features than the model expects. That may be OK for your scenario, but in other cases it might indicate a bad model pushed to serving (and you would want to fail the request in such cases, so monitoring can alert appropriately).

IOW, I am not convinced that this change is safe for all situations and that it's OK to loosen this check.

netfs self-requested a review on December 3, 2020 02:57

@shkwsk (Contributor, Author) commented Dec 5, 2020

@netfs
I understand what you mean.

But I need this patch for model migration on a running service.
Loosening the checks allows features to be added or removed seamlessly.

If you are not convinced the validation checks should be loosened by default, how about adding an option to opt in to the looser behavior?

@netfs (Collaborator) commented Dec 7, 2020

> I understand what you mean.
>
> But I need this patch for model migration on a running service.
> Loosening the checks allows features to be added or removed seamlessly.

Why can't versioning be used to do this safely? IOW, send versioned requests, and maybe use labels to make the transition easier [0].

> If you are not convinced the validation checks should be loosened by default, how about adding an option to opt in to the looser behavior?

[0] https://www.tensorflow.org/tfx/serving/serving_config#assigning_string_labels_to_model_versions_to_simplify_canary_and_rollback
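
For reference, a rough sketch of what that could look like in a model server config, per the linked doc (model name, path, and version numbers are illustrative):

model_config_list {
  config {
    name: "my_model"
    base_path: "s3://my-bucket/models/my_model"
    model_platform: "tensorflow"
    model_version_policy {
      specific {
        versions: 1
        versions: 2
      }
    }
    version_labels { key: "stable" value: 1 }
    version_labels { key: "canary" value: 2 }
  }
}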

@shkwsk (Contributor, Author) commented Dec 28, 2020

@netfs
If you want to switch models by version, you need to distribute a configuration file to the servers.

It is difficult to manage the status of all servers while distributing the file to each of them.
Also, the client would need to know the status of each server and which model version each stateful server is running.

I update multiple models daily in our S3 bucket and run A/B tests between the updated models.
Switching the model version on the client side against stateful servers every day is not practical.
That is very costly, so I want to route requests by model_name instead; see the sketch below.
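
A rough sketch of that setup, with two independently named models so clients pick one via model_spec.name in the request (names and paths are illustrative):

model_config_list {
  config {
    name: "model_a"
    base_path: "s3://my-bucket/models/model_a"
    model_platform: "tensorflow"
  }
  config {
    name: "model_b"
    base_path: "s3://my-bucket/models/model_b"
    model_platform: "tensorflow"
  }
}

By default, each entry serves the latest version found under its base_path, so daily S3 updates are picked up without touching the clients.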

@netfs (Collaborator) commented Jan 6, 2021

> I update multiple models daily in our S3 bucket and run A/B tests between the updated models.
> Switching the model version on the client side against stateful servers every day is not practical.

You still need to notify the client to switch its inference requests to contain a different set of features (corresponding to the new B version of the model), no? How is this switch different from asking the client to also use an explicit version?

> That is very costly, so I want to route requests by model_name instead.

@Arnold1 commented Jun 3, 2022

@netfs @shkwsk @chrisolston Any progress on this?
