Secure Query Execution on Encrypted Data in HDFS Using Apache Spark and Fully Homomorphic Encryption

In a private database query system, a client issues queries to a database and obtains the results without learning any additional information about the database, while ensuring that the server does not learn the query itself. We propose a novel framework for enabling secure query execution on the Hadoop Distributed File System (HDFS) by integrating Apache Spark with Fully Homomorphic Encryption (FHE). This framework allows Spark to perform computations directly on encrypted data without requiring access to the secret key, thereby preserving user data privacy.

Once encrypted, the dataset is uploaded to HDFS for secure storage. We implement fundamental query operations such as min/max functions, comparison, and sorting on encrypted data, which are essential for filtering, aggregation, and other SQL-like queries. To enable seamless querying, we introduce a trusted third party (TTP) that transforms user queries into their encrypted equivalents. For example, a query such as 'salary' < 5000 is rewritten as less_than(salary, enc(5000)). The transformed query is then executed on the encrypted database, and the encrypted results are returned to the TTP, which decrypts the data before forwarding it to the user.
So how can we do that with this library? Please give some direction on the most efficient approach. (Most of our columns are of boolean data type, some columns are integers, and some are strings.)
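To make the TTP query-rewriting step above concrete, here is a minimal sketch in Python. The `enc`, `dec`, and `less_than` names come from the example in the post; the "encryption" itself is a trivial XOR placeholder, NOT real FHE, since with an actual library (e.g. TFHE-rs) the comparison would run on ciphertexts without ever touching the key:

```python
# Mock illustration of the TTP query-rewriting step.
# enc/dec stand in for real FHE encryption; here they are trivially
# reversible placeholders (NOT secure, for illustration only).

KEY = 0x2A  # stands in for the user's key material

def enc(value):
    """Placeholder 'encryption': XOR with a key."""
    return value ^ KEY

def dec(ciphertext):
    """Placeholder 'decryption': XOR is its own inverse."""
    return ciphertext ^ KEY

def less_than(enc_a, enc_b):
    """With real FHE this comparison operates on ciphertexts without the
    secret key; the mock cheats by decrypting internally."""
    return dec(enc_a) < dec(enc_b)

# TTP rewrites:  'salary' < 5000  ->  less_than(salary, enc(5000))
encrypted_salaries = [enc(s) for s in [3200, 7100, 4800]]
threshold = enc(5000)
matches = [ct for ct in encrypted_salaries if less_than(ct, threshold)]
print([dec(ct) for ct in matches])  # the TTP decrypts before forwarding
```

The point of the sketch is only the shape of the rewrite: the server side sees `less_than(ciphertext, ciphertext)` and never a plaintext threshold.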

Hello @Mrinmoy

The TTP is a security risk: if the third party's systems are compromised, an attacker can retrieve decrypted query results if they were not permanently deleted from the machine.

Having said that, we had an SQL bounty for TFHE-rs; I guess the winning submission will be of interest to you:

bounty topic:

Regards

But in this bounty the query is encrypted and it is run on clear data. In our problem we want to query encrypted data, and our query is also encrypted.

Can you describe the full workflow or have a diagram showing what you want to achieve?

I have structured data, basically stored in a CSV file in HDFS, and we use Spark to run SQL-like queries on it.
Now we want to do the above task end to end. That is, we encrypt all of the data element-wise (take a column, then take each element and encrypt it) and store it in a CSV file, without encrypting the table schema. Next we store the CSV file in HDFS.
Now if I want to query the encrypted data, we first rewrite the query, e.g. 'salary' < 5000 becomes less_than(salary, enc(5000)). We run this custom query in Spark, and HDFS returns the corresponding encrypted data, which is decrypted at the user end.
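The element-wise encryption step described here (encrypt every cell, leave the schema in clear) could be sketched roughly as below. The `enc` function is a toy XOR placeholder standing in for a real FHE library call, and the in-memory strings stand in for the HDFS files:

```python
import csv
import io

KEY = 0x2A  # placeholder for the user's key material

def enc(value):
    """Placeholder 'encryption' (NOT secure); a real deployment would
    call into an FHE library here."""
    return int(value) ^ KEY

# Encrypt every element of every row, but keep the header (schema) in
# clear, as described above.
plain_csv = "id,salary\n1,3200\n2,7100\n"
reader = csv.reader(io.StringIO(plain_csv))
header = next(reader)                                   # schema stays unencrypted
rows = [[enc(cell) for cell in row] for row in reader]  # element-wise encryption

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(rows)
encrypted_csv = out.getvalue()  # this file would then be put on HDFS
```

Because the schema is in clear, Spark can still load the file into a DataFrame with the original column names; only the cell values are ciphertexts.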

While I understand the broad idea, I would like to understand how that protocol would look in practice, @Mrinmoy, the reason being that TFHE-rs may or may not be able to do what you want today.

So something like:

  • client has a public key,
  • server does something
  • client sends an encrypted request to make sure only the client knows what’s in the request

etc.

So that we can give you pointers to see if it’s doable with TFHE-rs.

We understand the general question, which is executing encrypted requests (and apparently on encrypted data as well), but depending on what your protocol looks like there may be security problems (like data being under a key from someone who is not supposed to access the data, which in that case would break the protocol).

Initially, I need to encrypt a CSV file element-wise, where all the columns in the file contain boolean values. I want to upload this file and perform all possible operations on the boolean data. After that, I will think about the key-related issues.
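Since the columns here are booleans, the operations in question are bitwise ones (AND, OR, XOR, NOT), which FHE schemes such as TFHE support natively on encrypted bits. A mock sketch of what an element-wise boolean operation over two encrypted columns looks like (again with toy XOR "encryption", not a real scheme):

```python
# Mock of element-wise boolean operations on 'encrypted' boolean columns.
# Real FHE booleans support AND/OR/XOR/NOT directly on ciphertexts; the
# mock cheats by decrypting internally.

KEY = 1  # placeholder single-bit key

def enc(bit):
    return bit ^ KEY

def dec(ct):
    return ct ^ KEY

def fhe_and(ct_a, ct_b):
    """Homomorphic AND (mock: decrypts internally; real FHE does not)."""
    return enc(dec(ct_a) & dec(ct_b))

col_a = [enc(b) for b in [1, 0, 1]]   # encrypted boolean column A
col_b = [enc(b) for b in [1, 1, 0]]   # encrypted boolean column B
result = [fhe_and(x, y) for x, y in zip(col_a, col_b)]
print([dec(c) for c in result])       # decrypted at the user end
```

In a real pipeline each `fhe_and` call would be a ciphertext-level gate evaluation, so the server computing the column never learns the plaintext bits.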

Hello @Mrinmoy

You did not describe how your protocol works. I cannot provide more assistance than giving you the SQL bounty link and letting you adapt it to work on encrypted data.

Have a good day

My protocol is the following:

1. Data Encryption & Storage

  • Step 1.1: Encrypt each element of selected columns in a CSV file using a homomorphic encryption scheme (e.g., BFV, CKKS, or TFHE).
  • Step 1.2: Store the encrypted CSV file securely in HDFS while preserving the structure for efficient querying.

2. Integration with Spark

  • Step 2.1: Connect Apache Spark to HDFS and load the encrypted data into a DataFrame.
  • Step 2.2: Ensure Spark operations work directly on encrypted values without decryption.

3. Encrypted Query Execution

  • Step 3.1: Perform SQL queries (WHERE, MIN, MAX, etc.) directly on encrypted data using homomorphic encryption-compatible computations in Spark.
  • Step 3.2: The results of these queries remain encrypted throughout processing.

4. Secure Result Retrieval

  • Step 4.1: Retrieve the encrypted results from Spark.
  • Step 4.2: The user decrypts the results locally using their private key.

5. Future Work: Key Management

  • Step 5.1: Implement a secure key distribution mechanism to manage encryption and decryption keys across users.
  • Step 5.2: Consider integrating access control policies for secure multi-user access.
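Steps 1 through 4 above can be walked through end to end with a small mock. Real FHE calls (e.g. via an FHE library such as TFHE-rs) would replace `enc`/`dec`, and the Spark/HDFS I/O is elided; `fhe_min` is a hypothetical stand-in for a homomorphic MIN aggregate:

```python
# Mock end-to-end walkthrough of steps 1-4 of the protocol above.

KEY = 0x2A  # stands in for the user's key pair

def enc(value):
    """Step 1.1: element-wise encryption (toy placeholder, NOT secure)."""
    return value ^ KEY

def dec(ct):
    """Step 4.2: client-side decryption."""
    return ct ^ KEY

def fhe_min(ciphertexts):
    """Step 3.1: MIN over an encrypted column.  With real FHE this runs
    on ciphertexts without the key; the mock decrypts internally."""
    return enc(min(dec(ct) for ct in ciphertexts))

# Steps 1.1-1.2: encrypt the column and (conceptually) store it in HDFS.
column = [enc(v) for v in [3200, 7100, 4800]]

# Steps 2.x-3.x: the server (Spark) computes on ciphertexts only.
enc_result = fhe_min(column)

# Steps 4.1-4.2: the encrypted result comes back and the user decrypts.
print(dec(enc_result))
```

Note that throughout steps 2 and 3 the server only ever holds `column` and `enc_result`, never a plaintext, which is the property step 3.2 requires.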

Hello @Mrinmoy

Thanks for detailing the process more, some questions to make sure I understand:

1.1 : who owns the key to the data ?
3.1 : who makes the encrypted request and with what key ?
4.2 : here the user decrypts with their private key, implying that the data and requests are either under the user key, or some form of keyswitching is done from the server key to the user key, correct ?
5.1 : what problem does the secure key distribution mechanism address ?

Thanks

Initially, we assume that there is only one user who possesses both the public key and the private key. The user uses the public key to encrypt the data, then uploads it to HDFS. Afterward, queries are performed on the encrypted data in Spark, and the results are returned in an encrypted form. The user then decrypts the results using their private key.

Once this process is validated with a single user, we can explore how to extend this concept to multiple users.

Then the SQL bounty, updated to work with encrypted data, should fit your use case for one user as far as I can tell.

Also, you can still encrypt with a private key in that case, as the user owns the data.

Regards