Training ML models on encrypted data

Hi,

Can I use Concrete ML to train, for example, an XGBoost model on encrypted data/features?


Hello @mdumitruguzu

Thanks for your question.

This is a question we’re often asked; it was answered, for example, in Training on encrypted data. You’ll see there that it is currently not possible: Zama focuses on private inference today, and training is left for later, or for other tools such as Federated Learning or Multi-Party Computation.

We train on a clear input set and then run inference on encrypted data. This protects an individual’s data (e.g., when you take a cancer test), but it is not designed to protect or share datasets between entities.

Cheers

Thank you for the quick answer.


And now? Some of your videos seem to say that it is now possible to train with encrypted data.

Hello @enricoversino, it is now possible to train a linear binary classifier on encrypted data using Concrete ML.

We are working on extending this to other models but we don’t support XGBoost training on encrypted data yet.
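
For context, here is a minimal cleartext sketch of what training a linear binary classifier involves: logistic regression fitted by gradient descent. It is plain Python and does not use the Concrete ML API; the function names, the toy data, and the hyperparameters are illustrative assumptions. The iteration count is fixed in advance because a program that runs on encrypted values cannot branch on those values, so a data-dependent stopping criterion is not an option there.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, n_iterations=300, learning_rate=0.5):
    """Full-batch gradient descent with a fixed number of iterations."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(n_iterations):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            # Prediction error for this sample: sigmoid(w.x + b) - y
            error = sigmoid(sum(w * v for w, v in zip(weights, row)) + bias) - label
            for j, v in enumerate(row):
                grad_w[j] += error * v
            grad_b += error
        # Average the gradient over the batch and take one step.
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict(weights, bias, row):
    """Return the predicted class (0 or 1) for one sample."""
    score = sum(w * v for w, v in zip(weights, row)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Toy, linearly separable data: the label is 1 when x0 + x1 > 1.
rng = random.Random(0)
X_train = [[rng.random(), rng.random()] for _ in range(200)]
y_train = [1 if a + b > 1.0 else 0 for a, b in X_train]

weights, bias = train_logistic_regression(X_train, y_train)
accuracy = sum(
    predict(weights, bias, row) == label for row, label in zip(X_train, y_train)
) / len(y_train)
```

In an encrypted setting the same arithmetic would be evaluated on ciphertexts, and the resulting weights would only be readable by the holder of the secret key.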

Can I have some code?

Yes, of course. We have a tutorial in concrete-ml/docs/advanced_examples/LogisticRegressionTraining.ipynb at main · zama-ai/concrete-ml · GitHub, and if you want to look at the source code, it can be found in concrete-ml/src/concrete/ml/sklearn/linear_model.py at main · zama-ai/concrete-ml · GitHub

Thanks. But, as I understand it, the LogisticRegressionTraining notebook doesn’t seem to use encrypted data to train the model: the code seems to use “normal” Xtrain and ytrain partitions of the data. An example that shows us how to encrypt the input data, train the model, decrypt the model, and then use it with non-encrypted data is of capital importance to us, because our AI team urgently needs a solution for anonymizing real-world data. In the meantime, would you be interested in collaborating with us? We are the AI team of Università vita e Salute - San Raffaele, based in Milan, Italy.

The current implementation does indeed run the computation in FHE under the hood, but we haven’t yet exposed the client-server part needed for deployment.

This is something that we have planned but not yet integrated.

My colleague @jfrery created a GitHub gist to play around with it: Example of deployement code for FHE training · GitHub

Thanks for the information.

Thanks! Enjoyed the discussion.

Actually, I was planning to submit a PR that aims to decouple training from the encryption of the training set.

The current implementation takes in plaintext data and performs key generation, encryption, and training internally.

I wanted to understand how difficult it would be to decouple the key generation and encryption process from the training process.

Thanks in advance.

@nlok5923 please see Jordan’s code in this gist. I think it does exactly what you need.

Hi @andrei-stoian-zama

Thanks for the response!

I had seen the replication code for the deployment API, but I think my use case is a bit different.

Based on my understanding, you are creating the fhe_circuit, which is generated during model training, then saving this fhe_circuit on the server side and running model inference there on the encrypted inputs provided by the client (the client first encrypts the inputs with their own key and then sends them to the server to compute on).

My goal, however, is to train the model on the encrypted data itself. I don’t have the data in plaintext form; I might only get access to its metadata (column names, number of rows, types, etc.). But this example uses plaintext data for training and for generating the fhe_circuit.

Also, based on my understanding, with the fhe="execute" parameter the fit function first generates the key, encrypts the data, and then trains the model on the encrypted data. I wanted to know whether, instead of doing key generation and encryption inside the fit function, we can decouple them so that the fit function directly takes in encrypted data, trains the model, and builds the fhe_circuit.

Thanks!

I think there might be a bit of confusion here:

The model training “circuit” (a compiled program working only on encrypted data) is generated on a development machine. It can then be deployed to the server. To generate this circuit, some bounds on the typical data that a future user of the training system will send to the server are used. This is called the “input-set”.

To generate this circuit no actual data is used.

Finally, once the circuit is deployed, a future user, who has real data, will encrypt said data and send it to the server to be trained on with the compiled circuit. The user will get back an encrypted model that they decrypt with their secret key.
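
To make the “input-set” idea concrete, here is a minimal sketch of how a development machine could build one from metadata alone (column names and value ranges), without reading any real record. It is plain Python and not the Concrete ML API: `FeatureBounds`, `make_input_set`, and the column names and ranges are hypothetical, and the step that hands the input-set to the compiler is deliberately not shown because the thread does not spell it out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureBounds:
    """Hypothetical description of one column, taken from metadata only."""
    name: str
    low: float
    high: float


def make_input_set(bounds, n_samples, seed=0):
    """Draw synthetic rows uniformly within the declared per-feature bounds.

    Random binary labels are produced as well, so the result has the same
    shape as a (features, target) training batch. No real data is used.
    """
    if n_samples < 2:
        raise ValueError("n_samples must be at least 2")
    rng = random.Random(seed)
    features = [
        [rng.uniform(b.low, b.high) for b in bounds] for _ in range(n_samples)
    ]
    # Include both extremes so the observed ranges match the declared bounds.
    features[0] = [b.low for b in bounds]
    features[1] = [b.high for b in bounds]
    labels = [rng.randint(0, 1) for _ in range(n_samples)]
    labels[0], labels[1] = 0, 1  # make sure both classes are present
    return features, labels


# Column names and ranges below are made up for illustration.
bounds = [
    FeatureBounds("age", 0.0, 100.0),
    FeatureBounds("systolic_bp", 60.0, 250.0),
    FeatureBounds("cholesterol", 50.0, 400.0),
]
X_synthetic, y_synthetic = make_input_set(bounds, n_samples=64)
```

The point of the sketch is only that the bounds, not the records, drive the synthetic data; the real records stay with the user, who encrypts them later.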

Oh, got it!

Actually, I followed this piece of code, where it trains the model using plain data.

It would be amazing if you could share an example of how we can generate the circuit without using the actual data.

Here’s the code that generates the training circuit:

Thanks @andrei-stoian-zama for sharing this piece of code.

Got it clarified now!