Secure Query Execution on Encrypted Data in HDFS Using Apache Spark and Fully Homomorphic Encryption

In a private database query system, a client issues queries to a database and obtains the results without learning any additional information about the database, while ensuring that the server does not learn the query itself. We propose a novel framework for enabling secure query execution on the Hadoop Distributed File System (HDFS) by integrating Apache Spark with Fully Homomorphic Encryption (FHE). This framework allows Spark to perform computations directly on encrypted data without requiring access to the secret key, thereby preserving user data privacy.

Once encrypted, the dataset is uploaded to HDFS for secure storage. We implement fundamental query operations such as min/max functions, comparison, and sorting on encrypted data, which are essential for filtering, aggregation, and other SQL-like queries. To enable seamless querying, we introduce a trusted third party (TTP) that transforms user queries into their encrypted equivalents. For example, a query such as 'salary' < 5000 is rewritten as less_than(salary, enc(5000)). The transformed query is then executed on the encrypted database, and the encrypted results are returned to the TTP, which decrypts the data before forwarding it to the user.
So how can we do that with this library? Please give some direction on the most efficient approach. (Most of our columns are of boolean data type, some columns are integers, and some are strings.)
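To make the TTP query-rewriting step above concrete, here is a minimal sketch in Python. The `enc`, `dec`, and `less_than` names come from the example in the post; the "encryption" itself is a trivial XOR placeholder, NOT real FHE, since with an actual library (e.g. TFHE-rs) the comparison would run on ciphertexts without ever touching the key:

```python
# Mock illustration of the TTP query-rewriting step.
# enc/dec stand in for real FHE encryption; here they are trivially
# reversible placeholders (NOT secure, for illustration only).

KEY = 0x2A  # stands in for the user's key material

def enc(value):
    """Placeholder 'encryption': XOR with a key."""
    return value ^ KEY

def dec(ciphertext):
    """Placeholder 'decryption': XOR is its own inverse."""
    return ciphertext ^ KEY

def less_than(enc_a, enc_b):
    """With real FHE this comparison operates on ciphertexts without the
    secret key; the mock cheats by decrypting internally."""
    return dec(enc_a) < dec(enc_b)

# TTP rewrites:  'salary' < 5000  ->  less_than(salary, enc(5000))
encrypted_salaries = [enc(s) for s in [3200, 7100, 4800]]
threshold = enc(5000)
matches = [ct for ct in encrypted_salaries if less_than(ct, threshold)]
print([dec(ct) for ct in matches])  # the TTP decrypts before forwarding
```

The point of the sketch is only the shape of the rewrite: the server side sees `less_than(ciphertext, ciphertext)` and never a plaintext threshold.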

Hello @Mrinmoy

The TTP is a security risk: if the third party's systems are compromised, an attacker can retrieve decrypted query results if they were not permanently deleted from the machine.

Having said that, we had an SQL bounty for TFHE-rs; I guess the winning submission will be of interest to you:

bounty topic:

Regards

But in this bounty the query is encrypted and it is run on clear data. In our problem we want to query encrypted data, and our query is also encrypted.

Can you describe the full workflow or have a diagram showing what you want to achieve?

I have structured data, basically stored in a CSV file in HDFS, and we use Spark to run SQL-like queries on it.
Now we want to do the above task end to end. That is, we encrypt all of the data element-wise (take a column, then take each element and encrypt it) and store it in a CSV file, without encrypting the table schema. Next we store the CSV file in HDFS.
Now if I want to query the encrypted data, we first rewrite the query, e.g. 'salary' < 5000 becomes less_than(salary, enc(5000)). We run this custom query in Spark, and HDFS returns the corresponding encrypted data, which is decrypted at the user end.
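The element-wise encryption step described here (encrypt every cell, leave the schema in clear) could be sketched roughly as below. The `enc` function is a toy XOR placeholder standing in for a real FHE library call, and the in-memory strings stand in for the HDFS files:

```python
import csv
import io

KEY = 0x2A  # placeholder for the user's key material

def enc(value):
    """Placeholder 'encryption' (NOT secure); a real deployment would
    call into an FHE library here."""
    return int(value) ^ KEY

# Encrypt every element of every row, but keep the header (schema) in
# clear, as described above.
plain_csv = "id,salary\n1,3200\n2,7100\n"
reader = csv.reader(io.StringIO(plain_csv))
header = next(reader)                                   # schema stays unencrypted
rows = [[enc(cell) for cell in row] for row in reader]  # element-wise encryption

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(header)
writer.writerows(rows)
encrypted_csv = out.getvalue()  # this file would then be put on HDFS
```

Because the schema is in clear, Spark can still load the file into a DataFrame with the original column names; only the cell values are ciphertexts.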

While I understand the broad idea, I would like to understand how that protocol would look in practice, @Mrinmoy, the reason being that TFHE-rs may or may not be able to do what you want today.

So something like:

  • client has a public key,
  • server does something
  • client sends an encrypted request to make sure only the client knows what’s in the request

etc.

So that we can give you pointers to see if it’s doable with TFHE-rs.

We understand the general question, which is executing encrypted requests (and apparently on encrypted data as well), but depending on what your protocol looks like there may be security problems (like data being under a key from someone who is not supposed to access the data, which in that case would break the protocol).

Initially, I need to encrypt a CSV file element-wise, where all the columns in the file contain boolean values. I want to upload this file and perform all possible operations on the boolean data. After that, I will think about the key-related issues.
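Since the columns here are booleans, the operations in question are bitwise ones (AND, OR, XOR, NOT), which FHE schemes such as TFHE support natively on encrypted bits. A mock sketch of what an element-wise boolean operation over two encrypted columns looks like (again with toy XOR "encryption", not a real scheme):

```python
# Mock of element-wise boolean operations on 'encrypted' boolean columns.
# Real FHE booleans support AND/OR/XOR/NOT directly on ciphertexts; the
# mock cheats by decrypting internally.

KEY = 1  # placeholder single-bit key

def enc(bit):
    return bit ^ KEY

def dec(ct):
    return ct ^ KEY

def fhe_and(ct_a, ct_b):
    """Homomorphic AND (mock: decrypts internally; real FHE does not)."""
    return enc(dec(ct_a) & dec(ct_b))

col_a = [enc(b) for b in [1, 0, 1]]   # encrypted boolean column A
col_b = [enc(b) for b in [1, 1, 0]]   # encrypted boolean column B
result = [fhe_and(x, y) for x, y in zip(col_a, col_b)]
print([dec(c) for c in result])       # decrypted at the user end
```

In a real pipeline each `fhe_and` call would be a ciphertext-level gate evaluation, so the server computing the column never learns the plaintext bits.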

Hello @Mrinmoy

You did not describe how your protocol works. I cannot provide more assistance than giving you the SQL bounty link and letting you adapt it to work on encrypted data.

Have a good day

My protocol is the following:

1. Data Encryption & Storage

  • Step 1.1: Encrypt each element of selected columns in a CSV file using a homomorphic encryption scheme (e.g., BFV, CKKS, or TFHE).
  • Step 1.2: Store the encrypted CSV file securely in HDFS while preserving the structure for efficient querying.

2. Integration with Spark

  • Step 2.1: Connect Apache Spark to HDFS and load the encrypted data into a DataFrame.
  • Step 2.2: Ensure Spark operations work directly on encrypted values without decryption.

3. Encrypted Query Execution

  • Step 3.1: Perform SQL queries (WHERE, MIN, MAX, etc.) directly on encrypted data using homomorphic encryption-compatible computations in Spark.
  • Step 3.2: The results of these queries remain encrypted throughout processing.

4. Secure Result Retrieval

  • Step 4.1: Retrieve the encrypted results from Spark.
  • Step 4.2: The user decrypts the results locally using their private key.

5. Future Work: Key Management

  • Step 5.1: Implement a secure key distribution mechanism to manage encryption and decryption keys across users.
  • Step 5.2: Consider integrating access control policies for secure multi-user access.
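Steps 1 through 4 above can be walked through end to end with a small mock. Real FHE calls (e.g. via an FHE library such as TFHE-rs) would replace `enc`/`dec`, and the Spark/HDFS I/O is elided; `fhe_min` is a hypothetical stand-in for a homomorphic MIN aggregate:

```python
# Mock end-to-end walkthrough of steps 1-4 of the protocol above.

KEY = 0x2A  # stands in for the user's key pair

def enc(value):
    """Step 1.1: element-wise encryption (toy placeholder, NOT secure)."""
    return value ^ KEY

def dec(ct):
    """Step 4.2: client-side decryption."""
    return ct ^ KEY

def fhe_min(ciphertexts):
    """Step 3.1: MIN over an encrypted column.  With real FHE this runs
    on ciphertexts without the key; the mock decrypts internally."""
    return enc(min(dec(ct) for ct in ciphertexts))

# Steps 1.1-1.2: encrypt the column and (conceptually) store it in HDFS.
column = [enc(v) for v in [3200, 7100, 4800]]

# Steps 2.x-3.x: the server (Spark) computes on ciphertexts only.
enc_result = fhe_min(column)

# Steps 4.1-4.2: the encrypted result comes back and the user decrypts.
print(dec(enc_result))
```

Note that throughout steps 2 and 3 the server only ever holds `column` and `enc_result`, never a plaintext, which is the property step 3.2 requires.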

Hello @Mrinmoy

Thanks for detailing the process more, some questions to make sure I understand:

1.1 : who owns the key to the data ?
3.1 : who makes the encrypted request and with what key ?
4.2 : here the user decrypts with their private key, implying that the data and requests are either under the user key, or some form of keyswitching is done from the server key to the user key, correct ?
5.1 : what problem does the secure key distribution mechanism address ?

Thanks

Initially, we assume that there is only one user who possesses both the public key and the private key. The user uses the public key to encrypt the data, then uploads it to HDFS. Afterward, queries are performed on the encrypted data in Spark, and the results are returned in an encrypted form. The user then decrypts the results using their private key.

Once this process is validated with a single user, we can explore how to extend this concept to multiple users.

Then the SQL bounty, updated to work with encrypted data, should fit your use case for one user as far as I can tell.

Also, you can still encrypt with a private key in that case, as the user owns the data.

Regards