What is meant by 'compiling' the model?

Hello Zama Community,

I am back again with some questions.

I would like to clarify what it means to compile the model with the command model.compile(x_train) in the following Jupyter Notebook: concrete-ml/DecisionTreeClassifier.ipynb at main · zama-ai/concrete-ml · GitHub. To be precise, I am using a Random Forest, not a Decision Tree.

Specifically, I would like to know what difference compiling makes. I am trying to run a random forest, but it is very slow and hangs at the model.predict() step. My data file has 67795 rows × 19 columns, and the training set is 23433 rows × 15 columns.

Model compilation means generating, for a specific ML model, executable code that computes the model's prediction on encrypted data. This includes, among other steps, automatically analyzing the model's computation graph to find the best cryptographic parameters, converting floating-point computation to integer computation, and finally generating the executable code.

While it is possible to execute Concrete-ML models on non-encrypted data (without compiling) for development purposes, in order to perform secure computation on encrypted data, the model needs to be compiled.

A model that is compiled can be said to “execute in FHE”. When executed this way on encrypted data, the execution time is much greater than execution time on clear data, due to the complexity involved in homomorphic computations.

I would suggest you first measure the execution time on a single data point by passing a single row to .predict(..., execute_in_fhe=True).
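A minimal sketch of such a single-row timing measurement follows. The helper name time_single_prediction and the stand-in dummy_predict are hypothetical and exist only so the example runs without any FHE library. With a compiled model, the callable passed in would presumably wrap the .predict(..., execute_in_fhe=True) call mentioned above.

```python
import time


def time_single_prediction(predict_fn, row):
    """Call predict_fn on a batch holding one row; return (prediction, seconds)."""
    start = time.perf_counter()
    prediction = predict_fn([row])  # a batch that contains a single row
    elapsed = time.perf_counter() - start
    return prediction, elapsed


# Hypothetical stand-in predictor, used only to make the sketch runnable.
def dummy_predict(batch):
    return [int(sum(r) > 0) for r in batch]


pred, seconds = time_single_prediction(dummy_predict, [0.5, -0.2, 1.0])
print(f"prediction={pred}, elapsed={seconds:.6f}s")
```

Multiplying the single-row time by the number of rows gives a rough estimate of how long a full predict() call would take, which helps tell a slow run apart from a hung one.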

To improve the FHE runtime performance of RandomForestClassifier (and other tree ensemble methods), experimentation is key. Here are a few things you can try:

  • reduce the depth of the trees by setting the max_depth parameter to 2 or 3.
  • reduce the number of estimators in the ensemble
  • reduce the bitwidth of quantization (4-5 bits are the best compromise between accuracy and runtime speed)

This way the execution time for a single data point can be brought down to as low as ~1-2 seconds while maintaining accuracy.
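To get a feel for why tree depth and ensemble size matter so much, here is a rough back-of-the-envelope sketch. It assumes that every decision node of every tree is evaluated on encrypted data regardless of the input, which is a simplifying assumption made for this example and not a measured cost model of Concrete ML. The function max_comparisons is a hypothetical helper.

```python
def max_comparisons(n_estimators, max_depth):
    """Upper bound on decision nodes in an ensemble of binary trees.

    A binary tree of depth d has at most 2**d - 1 internal (decision) nodes.
    """
    return n_estimators * (2 ** max_depth - 1)


# Compare a few (n_estimators, max_depth) settings.
for n_estimators, max_depth in [(100, 10), (50, 5), (10, 3), (10, 2)]:
    bound = max_comparisons(n_estimators, max_depth)
    print(f"n_estimators={n_estimators:3d} max_depth={max_depth:2d} "
          f"max decision nodes={bound}")
```

Under this assumption the bound grows exponentially with max_depth and only linearly with the number of estimators, so reducing depth is usually the first thing to try.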

Dear Andrei,

This is very informative; I was already guessing something along the lines of what you had mentioned, but it is nice to formally hear this.

I would suggest you first measure the execution time on a single data point first, by passing a single row to .predict(…,execute_in_fhe=True).

Good idea, in addition to the three other suggestions.

A short question in this context: how do I reduce the bit width of the quantisation?

Your answer is very informative and clear. Thank you,

To tune the quantization bits you can adjust n_bits in the RandomForestClassifier constructor:

    model = RandomForestClassifier(
        max_depth=max_depth,
        n_estimators=n_estimators,
        n_bits=n_bits
    )

n_bits defaults to 6, which may exceed the precision required for your dataset.

Let us know if you encounter any issues with tweaking these parameters!
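To illustrate what fewer bits mean for precision, here is a small sketch of plain uniform quantization in pure Python. It is not Concrete ML's actual quantizer; the quantize function and the sample values are hypothetical and only show how the rounding error grows as n_bits shrinks.

```python
def quantize(values, n_bits):
    """Uniformly quantize values to integers in [0, 2**n_bits - 1].

    Returns the integer codes and the dequantized approximations.
    """
    lo, hi = min(values), max(values)
    levels = 2 ** n_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    approx = [lo + c * scale for c in codes]
    return codes, approx


values = [0.0, 0.13, 0.5, 0.77, 1.0]  # made-up feature values
for n_bits in (2, 4, 6):
    codes, approx = quantize(values, n_bits)
    max_err = max(abs(v - a) for v, a in zip(values, approx))
    print(f"n_bits={n_bits} codes={codes} max_error={max_err:.4f}")
```

If the accuracy of the model on clear data stays acceptable at a lower n_bits, the extra precision was not needed for that dataset.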

Thank you!

After tweaking the parameters, I have at least been able to get the prediction to terminate with a reasonably correct result.

Actually, n_bits is quite straightforward, but it does not seem to be in the documentation yet (maybe I should look more carefully?), so I didn’t know whether it was even in the constructor. I understand that the project is in its early phases, so I sometimes try to hack around.

A few more questions: since I am running in Docker on macOS, I get the following error.

WARNING: You are currently using the software variant of concrete-csprng which does not have access to a hardware source of randomness. To ensure the security of your application, please arrange to provide a secret by using the concrete_csprng::set_soft_rdseed_secret function.

This may be because Docker does not have access to /dev/urandom, but I do not know how to give it access either. It is actually a very cool feature; I did not know Apple had a hardware RNG.

Everything else is perfect now. I do have other questions, but I will post them as a separate question, as they are of a different nature.

Hello @dalvi

Great that you were able to make some progress with @andrei-stoian-zama's support! Let me continue here, since I’m a proud macOS user.

  • n_bits is discussed briefly in Quantization - Concrete ML. In the new documentation (a new version of the library is coming soon), it should be even clearer.
  • if you are on macOS, you can use our tools natively by running pip install concrete-ml. I do that, for example, and I have found it much faster than with Docker.
  • if you want to continue with Docker, we ran into the same issue, reported it to the developers, and finally got this workaround:

could you enable the “Use the new Virtualization framework” experimental feature in Preferences and try again? On my Intel laptop I can see rdseed with the new framework.

Tell us how it works for you
Cheers

Dear Benoit,

Thank you so much for your reply. I am also a MacOS user, not sure a proud one :laughing: .

  • Yes I now see it in the Quantization link. Excited to see the new version of the library!

  • Yes, I just installed it natively on macOS and I am getting the same response. Since this may be a hardware question: my Mac is a 2015 MacBook Pro.

  • It is indeed working quicker now.

Thank you for your help, I will be soon posting a new question regarding another query I have.

danish

Hey @dalvi

So yes, the hardware is certainly old enough to lack some features that the Concrete Core library uses to generate random numbers. Note that this is not an error but a warning. If you want to know more about the details, it might be better to create a separate question about this warning in the #concrete-lib category.

Happy to see you’re making some progress. Yes sure, please create as many questions as you have. And please star our repo!

Cheers