Comparing SGDClassifier accuracy on the Spambase dataset: Concrete ML vs. scikit-learn

Hi everyone,

I’m working with the SGDClassifier on encrypted data, and I’ve noticed something interesting. On most datasets, the Concrete ML model’s accuracy is close to the scikit-learn model’s. However, when I tried the Spambase dataset from OpenML (id=44), the Concrete ML model’s accuracy (60.97%) was far lower than scikit-learn’s (87.68%).

I was wondering if anyone has any insights into why there might be such a big difference for this particular dataset?

Thanks!

Here is my code:

import time

from sklearn.linear_model import SGDClassifier as SklearnSGDClassifier
from sklearn.metrics import accuracy_score

clf_sklearn = SklearnSGDClassifier(
    random_state=42,
    max_iter=15,
    loss="log_loss",
)

train_start = time.time()
clf_sklearn.fit(X_train_scaled, y_train)
train_end = time.time()
print(f"Training time in seconds: {(train_end - train_start):.4f}")

test_start = time.time()
y_pred_sklearn_scaled = clf_sklearn.predict(X_test_scaled)
test_end = time.time()
print(f"Test time in seconds: {(test_end - test_start):.4f}")

accuracy_sklearn = accuracy_score(y_test, y_pred_sklearn_scaled)
print(f"Sklearn scaled data accuracy: {accuracy_sklearn*100:.2f}%")

Training time in seconds: 0.0206
Test time in seconds: 0.0020
Sklearn scaled data accuracy: 87.68%

# Using Concrete ML model on scaled data
from concrete.ml.sklearn import SGDClassifier as ConcreteSGDClassifier

parameters_range = (-1.0, 1.0)

clf_concrete = ConcreteSGDClassifier(
    random_state=42,
    max_iter=1000,
    fit_encrypted=True,
    parameters_range=parameters_range,
    # verbose=True,
)

# Train with simulated FHE execution
start_time = time.time()
clf_concrete.fit(X_train_scaled, y_train, fhe="simulate")  # Change to fhe="execute" for actual encrypted training (much slower)
end_time = time.time()
print(f"Training time in seconds: {(end_time - start_time):.4f}")

# Compile model using training data
compile_start = time.time()
clf_concrete.compile(X_train_scaled)
compile_end = time.time()
print(f"Compile time in seconds: {(compile_end - compile_start):.4f}")

# Measure accuracy on the test set using execute
test_start = time.time()
y_pred_fhe = clf_concrete.predict(X_test_scaled, fhe="simulate")
test_end = time.time()
print(f"Test time in seconds: {(test_end - test_start):.4f}")
print(f"Test time per sample: {(test_end - test_start)/X_test_scaled.shape[0]:.4f}")

accuracy_fhe = accuracy_score(y_test, y_pred_fhe)
print(f"Full encrypted fit (simulated) accuracy: {accuracy_fhe*100:.2f}%")

Training time in seconds: 89.9006
Compile time in seconds: 1.8087
Test time in seconds: 6.8132
Test time per sample: 0.0071
Full encrypted fit (simulated) accuracy: 60.97%

Hi @Clara_Le,

Could you share how you are scaling the input?

Concrete ML SGD training uses quantization, so input normalization is very important. You could also increase parameters_range to something like (-4.0, 4.0) and check whether there is any change.

We will be able to help more once we know the preprocessing you apply.

Hi jfrery,

Thank you for your reply! I scaled the input to the (-1.0, 1.0) range as recommended:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Edit: I tried different parameters_range values such as (-4.0, 4.0), (-10.0, 10.0), and (-15.0, 15.0). The accuracy score improved to 67.08%.

Thanks. I checked a bit, but it’s not obvious to me what’s going on with that dataset. What I can see is that the sklearn SGD weights need to reach quite high absolute values, which is not easy to achieve in quantized training. A bigger range (parameters_range) helps, but at the cost of precision on the parameters (i.e. the bigger the range, the less granularity we can have; at some point the parameters can’t be updated at all because the gradient becomes smaller than a quantization step).
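To make the trade-off concrete, here is a small back-of-the-envelope sketch. The n_bits value below is a hypothetical bit width chosen just for illustration, not Concrete ML’s actual internal setting:

```python
# Illustration of the range/granularity trade-off: a wider
# parameters_range means a coarser quantization grid.
n_bits = 8  # hypothetical bit width, for illustration only

for lo, hi in [(-1.0, 1.0), (-4.0, 4.0), (-15.0, 15.0)]:
    step = (hi - lo) / (2**n_bits - 1)  # size of one quantization step
    print(f"range ({lo}, {hi}): step = {step:.5f}")

# A gradient update much smaller than `step` rounds away to nothing,
# so the weight never moves: with range (-15, 15) the step is ~0.118,
# and an update of 0.05 is lost entirely.
```

So widening the range lets the weights reach higher absolute values, but small gradient updates stop having any effect.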

I don’t have a solution for this dataset right now. I have opened an issue on our end to look at this more in depth.


Thanks for opening an issue. I’ll keep an eye on it.

May I suggest applying PCA to your data first? You could also try PCA with whitening, which should give you a much nicer distribution for your features.