Secure Query Execution on Encrypted Data in HDFS Using Apache Spark and Fully Homomorphic Encryption

In a private database query system, a client issues queries to a database and
obtains the results without learning any additional information about the database
while ensuring that the server does not learn the query itself. We propose a novel
framework for enabling secure query execution on the Hadoop Distributed File
System (HDFS) by integrating Apache Spark with Fully Homomorphic Encryption (FHE). This framework allows Spark to perform computations directly
on encrypted data without requiring access to the secret key, thereby preserving
user data privacy.

Once encrypted, the dataset is uploaded to HDFS for secure storage. We implement fundamental query operations such as min/max functions, comparison, and sorting on encrypted
data, which are essential for filtering, aggregation, and other SQL-like queries.
To enable seamless querying, we introduce a trusted third party (TTP)
that transforms user queries into their encrypted equivalents. For example, a query
such as 'salary' < 5000 is rewritten as less_than(salary, enc(5000)). The
transformed query is then executed on the encrypted database, and the encrypted
results are returned to the TTP, which decrypts the data before forwarding it to
the user.
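
As an illustration of what such an encrypted comparison could look like, here is a minimal sketch using TFHE-rs' high-level API (assuming a recent version where comparisons return an encrypted boolean; less_than is just an illustrative helper, not a library function):

```rust
use tfhe::prelude::*;
use tfhe::{generate_keys, set_server_key, ConfigBuilder, FheBool, FheUint32};

// Illustrative helper for less_than(salary, enc(5000)): the comparison is
// evaluated on ciphertexts only, the secret key is never needed server-side.
fn less_than(salary: &FheUint32, threshold: &FheUint32) -> FheBool {
    salary.lt(threshold)
}

fn main() {
    let config = ConfigBuilder::default().build();
    let (client_key, server_key) = generate_keys(config);
    set_server_key(server_key); // evaluation key used by the (untrusted) server

    let salary = FheUint32::encrypt(4200u32, &client_key);
    let threshold = FheUint32::encrypt(5000u32, &client_key); // enc(5000)

    let keep = less_than(&salary, &threshold); // encrypted boolean
    let smallest = salary.min(&threshold);     // encrypted min, useful for aggregation

    // Only the secret-key holder (the TTP in the proposed design) can decrypt.
    let keep_clear: bool = keep.decrypt(&client_key);
    let smallest_clear: u32 = smallest.decrypt(&client_key);
    assert!(keep_clear);
    assert_eq!(smallest_clear, 4200);
}
```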
So how can we do that using this library? Please give some direction on the most efficient approach (most of our columns are of boolean data type, some columns are int, and some are string).
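
For the column types mentioned, a rough direction (again assuming TFHE-rs' high-level API; this is a sketch, not a benchmark-backed recommendation) would be FheBool for boolean columns, the smallest FheUint that fits for integer columns, and an integer encoding with equality-only queries for strings, e.g.:

```rust
use tfhe::prelude::*;
use tfhe::{generate_keys, set_server_key, ConfigBuilder, FheBool, FheUint32};

fn main() {
    let config = ConfigBuilder::default().build();
    let (client_key, server_key) = generate_keys(config);
    set_server_key(server_key);

    // Boolean column -> FheBool (the cheapest ciphertexts and operations).
    let is_active = FheBool::encrypt(true, &client_key);

    // Integer column -> the smallest FheUint width that fits the values;
    // comparisons get cheaper as the bit width shrinks.
    let salary = FheUint32::encrypt(4200u32, &client_key);
    let limit = FheUint32::encrypt(5000u32, &client_key);

    // Predicates combine homomorphically, e.g. WHERE is_active AND salary < 5000.
    let row_matches = is_active & salary.lt(&limit);
    let row_matches_clear: bool = row_matches.decrypt(&client_key);
    assert!(row_matches_clear);

    // String columns: no FHE string type is assumed here; one option is to
    // encode each string as integers (per byte or hashed) and only support
    // equality tests on those columns.
}
```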

Hello @Mrinmoy

The TTP is a security risk: if the third party's systems are compromised, an attacker can retrieve decrypted query results if they were not permanently deleted from the machine.

Having said that, we had an SQL bounty for TFHE-rs; I guess the winning submission will be of interest to you:

bounty topic:

Regards

But in this bounty the query is encrypted and they are querying on clear data. In our problem we want to query on encrypted data, and our query is also encrypted.

Can you describe the full workflow or have a diagram showing what you want to achieve?

I have structured data that is basically stored in a CSV file in HDFS, and we are using Spark to run SQL-like queries on the data.
Now we want to do the above task end to end. That is, we encrypt all of the data element-wise (take a column, then take each element and encrypt it) and store it in a CSV file, without encrypting the table schema. Next, we store the CSV file in HDFS.
Now, if I want to query the encrypted data, we first rewrite a query such as 'salary' < 5000 as less_than(salary, enc(5000)). We run this custom query through Spark, and HDFS returns the corresponding encrypted data, which is then decrypted at the user end.
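
Note that ciphertexts are binary blobs, so storing them element-wise in a CSV file implies serializing and text-encoding every cell. A minimal sketch of that step (assuming the bincode and base64 crates on top of TFHE-rs; these are not part of the library) could be:

```rust
use base64::{engine::general_purpose::STANDARD, Engine};
use tfhe::prelude::*;
use tfhe::{generate_keys, ConfigBuilder, FheUint32};

fn main() {
    let config = ConfigBuilder::default().build();
    let (client_key, _server_key) = generate_keys(config);

    // Element-wise encryption of a single column value.
    let ciphertext = FheUint32::encrypt(4200u32, &client_key);

    // Serialize + base64-encode so the ciphertext fits in a CSV cell
    // before the file is uploaded to HDFS.
    let bytes = bincode::serialize(&ciphertext).expect("serialization failed");
    let csv_cell = STANDARD.encode(&bytes);

    // On the Spark/query side, reverse the encoding before evaluating
    // less_than(salary, enc(5000)) on the restored ciphertext.
    let decoded = STANDARD.decode(&csv_cell).expect("base64 decode failed");
    let restored: FheUint32 = bincode::deserialize(&decoded).expect("deserialization failed");

    let clear: u32 = restored.decrypt(&client_key);
    assert_eq!(clear, 4200);
}
```

Be aware that each TFHE ciphertext is typically in the kilobyte range, so the encrypted CSV will be much larger than the clear one, and decryption still requires the client/TTP secret key.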

While I understand the broad idea, I would like to understand what that protocol would look like in practice @Mrinmoy, the reason being that TFHE-rs may or may not be able to do what you want today.

So something like:

  • client has a public key,
  • server does something,
  • client sends an encrypted request to make sure only the client knows what's in the request,

etc.

So that we can give you pointers to see if it’s doable with TFHE-rs.
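
To make the question concrete, here is roughly what that kind of flow looks like with TFHE-rs' high-level API (a sketch only: the client keeps the secret key, the server only ever sees the evaluation key and ciphertexts; a public key could additionally let other parties encrypt without holding the secret key):

```rust
use tfhe::prelude::*;
use tfhe::{generate_keys, set_server_key, ClientKey, ConfigBuilder, FheBool, FheUint32, ServerKey};

// Client side: generates the key pair; only the evaluation (server) key
// and ciphertexts ever leave the client.
fn client_setup() -> (ClientKey, ServerKey) {
    let config = ConfigBuilder::default().build();
    generate_keys(config)
}

// Server side: holds the evaluation key and ciphertexts, evaluates the
// encrypted request without learning the inputs or the result.
fn server_evaluate(server_key: ServerKey, salary: &FheUint32, threshold: &FheUint32) -> FheBool {
    set_server_key(server_key);
    salary.lt(threshold)
}

fn main() {
    let (client_key, server_key) = client_setup();

    // Client encrypts the request parameters under its own key.
    let enc_salary = FheUint32::encrypt(4200u32, &client_key);
    let enc_threshold = FheUint32::encrypt(5000u32, &client_key);

    // Server evaluates blindly and returns an encrypted result.
    let enc_result = server_evaluate(server_key, &enc_salary, &enc_threshold);

    // Only the client (secret-key holder) can decrypt.
    let result: bool = enc_result.decrypt(&client_key);
    assert!(result);
}
```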

We understand the general question, which is doing some encrypted requests (and apparently on encrypted data as well), but depending on what your protocol looks like there may be security problems (like data being under a key from someone who is not supposed to access the data, which in that case would break the protocol).