Inaccurate results after statistical computations

Hi,

I’m working on computing the variance of encrypted data. However, I noticed that as the input range increases (i.e., higher maximum values in the data), the results tend to become inaccurate.

Could someone help me understand why this is happening, and how I might improve the accuracy?

Here’s the relevant part of my code:

from concrete import fhe
import numpy as np


def calculate_list_into_sum(array, array_sum, array_size):
    # Computes sum_i(array[i] * array_sum), i.e. array_sum * sum(array)
    total = fhe.zero()
    for i in range(array_size):
        total += array[i] * array_sum
    return total


def calculate_sum_and_square_sum(array, array_size):
    total = fhe.zero()
    total_sq = fhe.zero()
    for i in range(array_size):
        total += array[i]
        total_sq += array[i] * array[i]
    return total, total_sq


@fhe.compiler({"array": "encrypted"})
def calculate_variance_numerator(array):
    array_size = array.size
    array_sum, array_sq_sum = calculate_sum_and_square_sum(array, array_size)
    list_squared = np.square(array)  # note: computed but never used
    list_into_sum = calculate_list_into_sum(array, array_sum, array_size)
    component_one = array_sq_sum * (array_size * array_size)
    component_two = list_into_sum * (array_size * 2)
    component_three = array_sum * array_sum * array_size
    return (component_three + component_one) - component_two
lrange = int(input("Please provide the range upper limit: "))
lsize = int(input("Please provide the array length: "))
inputset = [np.random.randint(0, lrange, size=lsize) for _ in range(5)]
circuit = calculate_variance_numerator.compile(inputset)
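For reference, the algebra above can be sanity-checked on plaintext data. The returned numerator simplifies to n²·Σx² − n·(Σx)², which is n³ times the variance, so it can be compared against NumPy directly (plain NumPy, no FHE involved):

```python
import numpy as np

def variance_numerator_plain(array):
    # Same algebra as the FHE circuit, on plaintext integers:
    # n*(sum x)^2 + n^2*sum(x^2) - 2n*(sum x)^2  ==  n^2*sum(x^2) - n*(sum x)^2
    n = array.size
    s = int(array.sum())
    sq = int((array * array).sum())
    return n * s * s + n * n * sq - 2 * n * s * s

arr = np.array([22, 3, 21, 11, 22])
print(variance_numerator_plain(arr) / arr.size**3)  # should equal np.var(arr)
```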

Thank you.

Hello @rraj,

That should not happen if the data used at evaluation time is well represented by the inputset used at compile time. Your code doesn’t show the evaluation, so I can’t tell whether that data differs from the inputset used for the compilation.

In the code you show, the inputset is 5 random arrays with values between 0 and the upper limit. The higher that upper limit, the less likely it is that the inputset represents the actual range of possible values, and the compilation pipeline of concrete-python then cannot infer the right bounds for the intermediate values.
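To illustrate the sampling issue with a quick back-of-the-envelope calculation (plain Python, not concrete-python itself): the probability that 5 random arrays ever contain the maximum possible value shrinks quickly as the range grows, so the inferred bounds are usually too small.

```python
def prob_inputset_hits_max(lrange, lsize, n_samples=5):
    # Probability that at least one of n_samples random arrays of length
    # lsize (values drawn uniformly from 0..lrange-1) contains lrange - 1,
    # i.e. that the compile-time bounds actually cover the worst case.
    draws = n_samples * lsize
    return 1.0 - ((lrange - 1) / lrange) ** draws

for r in (10, 100, 1000):
    print(r, round(prob_inputset_hits_max(r, 5), 3))
```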

In your case, the following inputset should give more accurate bounds.

inputset = [
    np.array([0 for _ in range(lsize)]),
    np.array([lrange for _ in range(lsize)])
]

Best,
Quentin.

Hi @yundsi

Thank you for your response.

I’m basically performing the calculations in a loop by generating random arrays:

array = np.random.randint(1, lrange, size=lsize)
enc_array = circuit.encrypt(array)
enc_result = circuit.run(enc_array)

As per your suggestion, I tried:

inputset = [np.array([0 for _ in range(lsize)]), np.array([lrange for _ in range(lsize)])]

and even:

inputset = [np.array([0 for _ in range(lsize)]), np.array([lrange for _ in range(lsize)])]
inputset += [np.random.randint(0, lrange, size=lsize) for _ in range(3)]

However, with these inputsets the script gets terminated with the message “Killed” (I’m running inside Docker).

I noticed that the result is inaccurate for input arrays that contain larger (or multiple large) numbers, for example: 22, 3, 21, 11, 22. Input arrays with smaller numbers return an accurate result, for example: 17, 15, 7, 10, 4.

Is the inaccuracy because the numbers exceed a certain bit-width during the calculations, since there are a lot of computations (squaring each element of the array and summing)? Also, why does the script get terminated with these inputsets, while the previous inputset doesn’t cause this?

Thanks

Hey @rraj ,

Yes, I think the issue is that some of your values go out of the bounds calculated by concrete-python during compilation; that’s why I suggested using the worst-case inputset for the compilation.
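To give an idea of the magnitudes involved, here is a rough bit-width estimate for the intermediate terms of your circuit (plain Python; the exact bounds concrete-python tracks during compilation may differ):

```python
def required_bits(lrange, lsize):
    # Worst case: every element equals lrange - 1
    # (np.random.randint's upper bound is exclusive).
    m = lrange - 1
    n = lsize
    sq_sum = n * m * m            # max of array_sq_sum
    s = n * m                     # max of array_sum
    terms = [
        sq_sum * n * n,           # component_one
        s * s * 2 * n,            # component_two
        s * s * n,                # component_three
    ]
    return max(t.bit_length() for t in terms)

# e.g. a range that allows the 22, 3, 21, 11, 22 example, with 5 elements
print(required_bits(23, 5))
```

As the range upper limit grows, the dominant term grows quadratically, so the required bit-width increases by about 2 bits every time the range doubles.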

I don’t know why the script gets Killed… It could be the OOM killer or something like that, but I don’t see an obvious reason; I need to investigate more to understand what is happening here. I’ll try to take a deeper look soon; let me know if you have more insight.