Impact of dataset size on compile time

Hello Zama Community,

Will the dataset be compiled when the trained model is compiled?

This is because:
When I run the example (digits dataset) in ConvolutionalNeuralNetwork.ipynb, the model compiles quickly.
But when I run the same CNN model on the MNIST handwritten digits dataset, it takes almost an hour to compile.

This is the dataset size:
train data: (60000, 1, 28, 28)
test data: (10000, 1, 28, 28)
Does dataset size have an impact on compile time?

This is the network structure:

I’m not a native English speaker, so please forgive my poor expression.

With the current version of the Concrete-ML package, there may be an issue with long compile times. But in general, compilation time is not strongly related to dataset size.

In an upcoming version, which will be released very soon, compilation takes 1-2 seconds irrespective of the calibration dataset size.

For now, as a workaround, I propose sampling only 500 random samples from your training data to use as the calibration set for compilation:

    import numpy
    from concrete.ml.torch.compile import compile_torch_model

    # Pick 500 random indices from the training set to use for calibration
    calib_samples = numpy.random.choice(x_train.shape[0], 500, replace=False)
    q_module = compile_torch_model(
        net,
        x_train[calib_samples, ::],
        n_bits=n_bits,
    )

(This is a sketch of the workaround; it may need some modification to run.)


Hello andrei-stoian-zama,

Thank you very much for your reply!

Following your suggestion, I randomly sampled 500 data points. The results are amazing: the compilation time is reduced by a factor of several hundred.

Also, I tried using only one data point. Compilation is even faster, taking only about 4 seconds, and the evaluation accuracy of the model did not change (using virtual_lib).

    q_module = compile_torch_model(
        net,
        x_train[[0], ::],
        n_bits=n_bits,
    )

Thanks again for your reply!

In addition, I would like to know whether the upcoming version supports MaxPooling or AvgPooling, and when the new version will be released, if that is something you can disclose.

We do not recommend using a single data point for compilation, as this data is used for calibration: determining the ranges of possible values that each intermediate value in the computation graph can take. Not only is this used for quantizing the neural network, but also to determine the cryptographic parameters. Thus, a representative sample of data should be given; usually we use the whole training set.
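As a rough illustration of why the calibration set matters, here is a simplified sketch of uniform quantization (this is not the actual Concrete-ML implementation): the value range observed on the calibration data fixes the quantization scale for a given bit width, so a narrow or unrepresentative range leads to clipping at inference time.

    import numpy

    def calibration_scale(calibration_values, n_bits):
        """Uniform quantization scale derived from the observed value range."""
        v_min = float(calibration_values.min())
        v_max = float(calibration_values.max())
        # Values outside the calibrated range get clipped at inference time,
        # so the calibration set should cover the ranges seen in practice.
        return (v_max - v_min) / (2**n_bits - 1)

    # Toy comparison: the range observed on 1 sample vs. 500 samples
    rng = numpy.random.default_rng(0)
    activations = rng.normal(size=(10000, 32))
    print(calibration_scale(activations[:1], n_bits=3))    # narrow, unrepresentative
    print(calibration_scale(activations[:500], n_bits=3))  # closer to the true range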

We will be supporting AvgPool very soon, but I can’t disclose the exact release date. We usually release new versions at the end of each quarter.

Please note that in VL (Virtual Lib) mode you are not actually processing data securely; this is just a simulation mode to help you develop models quickly. You will need to compile without this mode (as in the example I gave above) to have secure FHE computation.
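For example, compiling without the Virtual Lib and running one encrypted inference could look like the sketch below. The quantize_input, forward_fhe.encrypt_run_decrypt and dequantize_output calls are assumptions based on the Concrete-ML version discussed in this thread and may differ in other releases; x_test stands for your test data.

    from concrete.ml.torch.compile import compile_torch_model

    # Compile without the Virtual Lib so that inference runs in real FHE
    q_module = compile_torch_model(
        net,
        x_train[calib_samples, ::],
        n_bits=n_bits,
    )

    # Quantize one test sample, run it encrypted, then dequantize the result
    # (method names are assumptions for the Concrete-ML version in this thread)
    q_x = q_module.quantize_input(x_test[[0], ::])
    q_y = q_module.forward_fhe.encrypt_run_decrypt(q_x)
    y = q_module.dequantize_output(q_y)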


Thanks for your quick reply, I understand what you mean.

Thanks again, your work is awesome.


Our pleasure and thanks for the nice words!

The easiest way to thank us is to star our repo and to tell your friends and colleagues about the product :) And when you have nice things to share, please do share them; we love to see what users can do with our hard work.

Cheers