About Concrete ML's philosophy for data preprocessing

While writing another post, a question came up about the philosophy of Concrete ML: how would you recommend setting up the data preprocessing?

In your examples, the encrypted model takes as input the same dimensions that were used to train the model in plaintext. This looks fine, but in practice you often apply transformations to the raw data, so the local data is usually transformed on the inference server before calling the ML model. I wonder whether this question already came up in your discussions: should this transformation (e.g. a PCA or a one-hot encoding) be done by the “encryptor” on its plaintext data (i.e. directly encrypting the transformed data), or by the “model holder” using FHE?

I have not used Concrete ML yet, so I may be missing some details about its philosophy, sorry about that. Thanks in advance for your answer!

Hi!

Indeed, that’s a discussion we have had. Currently, the assumption is that the preprocessing is done on the client machine before encrypting the data. However, we are working on making Concrete ML and Concrete Numpy interoperate smoothly so that a PCA or other transformation could be applied on the server side, right before the model, in FHE. Note that this could impact the final accuracy of the model, since the preprocessing would then be applied in the quantized realm (with 8 bits of precision max) rather than in full precision (fp32).
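To make the accuracy concern concrete, here is a minimal NumPy sketch (not Concrete ML code, just an illustration of the quantization effect mentioned above): it applies the same linear preprocessing step once in full fp32 precision and once after uniformly quantizing the inputs to 8 bits, then measures the difference.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)).astype(np.float32)   # raw data
W = rng.normal(size=(5, 2)).astype(np.float32)     # a linear preprocessing step (e.g. a PCA projection)

def quantize(a, n_bits=8):
    """Uniformly quantize an array to n_bits, then dequantize back to floats."""
    scale = (a.max() - a.min()) / (2**n_bits - 1)
    q = np.round((a - a.min()) / scale)
    return q * scale + a.min()

# Full-precision preprocessing, as done on the client machine today
out_fp32 = X @ W

# The same step applied in the 8-bit quantized realm, as it would run server side in FHE
out_q = quantize(X) @ quantize(W)

# The quantized version deviates slightly from the fp32 reference
err = np.abs(out_fp32 - out_q).mean()
```

With only 8 bits of precision the error is small but nonzero; depending on the model, it can translate into a measurable accuracy drop downstream.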

So, for now, I would advise doing the preprocessing on the client machine when possible. Once we unlock more bits of precision for our FHE computations, we will naturally allow any preprocessing to be done on the server side.
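The recommended client-side flow can be sketched like this. Note that this is a plain NumPy illustration: the PCA is fit and applied in plaintext on the client, and only the transformed vector would then be encrypted and sent to the server (the encryption call at the end is a hypothetical placeholder, not the actual Concrete ML API).

```python
import numpy as np

rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 10))  # plaintext training data, client side
x_new = rng.normal(size=(1, 10))      # a new sample to run inference on

# Client side: fit the preprocessing (here a 3-component PCA via SVD) in full precision
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = Vt[:3]

def preprocess(x):
    """Apply the PCA learned client side, in plaintext fp32/fp64 precision."""
    return (x - mean) @ components.T

# The transformed sample matches the dimensions the model was trained on
x_ready = preprocess(x_new)  # shape (1, 3)

# Hypothetical next step, for illustration only (not real API names):
# encrypted = client.encrypt(x_ready)  -> sent to the server for FHE inference
```

This way the encrypted model only ever sees inputs in the (lower-dimensional) preprocessed space, and the preprocessing itself never pays the quantization cost.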