I am working on a setup where I want to use a hybrid LoRA training approach combined with Concrete ML’s deployment architecture, specifically utilising:
FHEModelDev: For model compilation and saving artefacts,
FHEModelClient: For key generation, encryption of payloads, and decryption of results,
FHEModelServer: For executing FHE computations on encrypted data.
I aim to fine-tune a pre-trained model on encrypted data in a deployment similar to a typical Concrete ML setup (sketched below):
The server hosts the compilation artefact, including client specifications and the FHE executable.
The client generates keys, encrypts payloads, and sends them to the server, receiving encrypted results.
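For reference, this is the standard deployment flow I have in mind, shown with a built-in model just to illustrate the client/server roles. It is a minimal sketch: the class and method names follow the Concrete ML deployment documentation, but exact signatures may vary between versions.

```python
from sklearn.datasets import make_classification
from concrete.ml.sklearn import LogisticRegression
from concrete.ml.deployment import FHEModelDev, FHEModelClient, FHEModelServer

X, y = make_classification(n_samples=100, n_features=10, random_state=0)

# Dev side: train, compile, and save the deployment artefacts
model = LogisticRegression()
model.fit(X, y)
model.compile(X)
FHEModelDev(path_dir="deployment", model=model).save()

# Client side: generate keys and encrypt one input
client = FHEModelClient(path_dir="deployment", key_dir="keys")
client.generate_private_and_evaluation_keys()
evaluation_keys = client.get_serialized_evaluation_keys()
encrypted_input = client.quantize_encrypt_serialize(X[[0]])

# Server side: load the compiled circuit and run it on the encrypted payload
server = FHEModelServer(path_dir="deployment")
server.load()
encrypted_result = server.run(encrypted_input, evaluation_keys)

# Client side: decrypt and de-quantize the result
prediction = client.deserialize_decrypt_dequantize(encrypted_result)
```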
My questions are:
How can I adapt the HybridFHEModel for LoRA fine-tuning in such a deployment?
During LoRA fine-tuning, only the LoRA-specific weights (AB) are updated while the base weights (W) remain frozen. How would this work with FHEModelServer executing computations on encrypted activations?
Are there specific configurations or pitfalls to watch out for when splitting the LoRA fine-tuning process between FHE-enabled server computations (for W) and local computations (for AB) on the client? (I sketch how I picture this split below.)
Any insights, code snippets, or references to existing examples would be highly appreciated!
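To make the third question concrete, here is roughly how I picture the split for a single LoRA linear layer. This is plain PyTorch with no FHE: `remote_base_forward` is just a placeholder for the round trip through FHEModelClient/FHEModelServer, and all class and attribute names here are my own, not Concrete ML API.

```python
import torch
from torch import nn

class SplitLoRALinear(nn.Module):
    """One LoRA linear layer, conceptually split between server and client.

    The frozen base weight W would live on the server and only ever be applied
    to encrypted activations; the small LoRA matrices A and B stay on the
    client and are the only trainable parameters.
    """

    def __init__(self, base_linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_linear  # frozen W ("server side")
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_a = nn.Linear(base_linear.in_features, rank, bias=False)   # client, trainable
        self.lora_b = nn.Linear(rank, base_linear.out_features, bias=False)  # client, trainable
        nn.init.zeros_(self.lora_b.weight)  # start with a zero LoRA update
        self.scaling = alpha / rank

    def remote_base_forward(self, x: torch.Tensor) -> torch.Tensor:
        # Placeholder for the deployment round trip: encrypt x, send it to
        # FHEModelServer, evaluate x @ W^T under FHE, return and decrypt.
        # (The backward pass through W presumably also needs the server,
        # which is part of what I am asking about.)
        return self.base(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Server path on the frozen W plus the local LoRA correction A, B
        return self.remote_base_forward(x) + self.scaling * self.lora_b(self.lora_a(x))
```

A wrapper like this per target linear layer is what I would expect HybridFHEModel to automate, but I am not sure how the training (backward) direction is handled.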
That being said, we haven't tried the full integration of deployed LoRA ourselves, so I can't say whether it works right out of the box. If you encounter bugs, let us know and we'll guide you through the process of making it work.
We'll have more updates on hybrid LoRA at the end of Q4.
I have a question regarding the communication protocol used in hybrid FHE LoRA training. Specifically, assuming the client encrypts their input data and sends it to the server:
Does the server compute the entire forward pass of the model with the original weights W in one step and return the final encrypted activations (and the gradient of the loss with respect to W) to the client, so that a single forward pass requires only one round of communication? In this scenario the client would, in parallel, perform the same computations but with the AB matrices.
Or is the communication done in multiple steps for a single forward pass, where the client decrypts intermediate activations layer by layer (i.e., the server sends encrypted activations to the client, the client decrypts and applies LoRA computations, and this repeats for every layer with LoRA adapters)?
Understanding this distinction would help clarify the protocol’s computational and communication overhead. Thank you!
The communication is done in multiple steps: one round trip is performed for each linear layer. This cannot be avoided in the client-server setting.
The current implementation in Concrete ML has a high communication overhead, but we are releasing a large improvement on this front at the end of the quarter.
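Schematically, one hybrid forward pass therefore looks like the loop below, with one round trip per linear layer whose frozen weight stays on the server. This is only a sketch of the protocol, not the actual Concrete ML implementation; `lora_delta` and `activation` are hypothetical names for the client-side LoRA path and the following non-linearity.

```python
def hybrid_forward(x, layers, client, server, evaluation_keys):
    """Sketch of one forward pass in the hybrid client/server setting.

    `layers` is a list of hypothetical objects holding the client-side parts
    of each layer (`lora_delta` for the trainable A/B path, `activation` for
    the non-linearity); the frozen base weight W of each layer is only ever
    evaluated on the server, on encrypted activations.
    """
    for layer in layers:
        # Client: encrypt the current activation and send it to the server
        encrypted_x = client.quantize_encrypt_serialize(x)

        # Server: apply the frozen base weight W under FHE -- one round trip
        encrypted_wx = server.run(encrypted_x, evaluation_keys)

        # Client: decrypt the base output, add the local LoRA correction,
        # then apply the non-linearity in the clear
        wx = client.deserialize_decrypt_dequantize(encrypted_wx)
        x = layer.activation(wx + layer.lora_delta(x))
    return x
```

So for a model with N adapted linear layers, a single forward pass costs N encrypt/send/decrypt round trips, which is where the communication overhead mentioned above comes from.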
You can contact me directly by email if you need more details: hello@zama.ai