Hi everyone,
I’m working with the SGDClassifier on encrypted data, and I’ve noticed something interesting. On most datasets, the accuracy of the concrete-ml model is pretty similar to the scikit-learn model. However, when I tried it on the spambase dataset from OpenML (id=44), the concrete-ml model’s accuracy (60.97%) was way lower than scikit-learn’s (87.68%).
I was wondering if anyone has any insights into why there might be such a big difference for this particular dataset?
Thanks!
Here is my code:
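The snippets below use `X_train_scaled`, `X_test_scaled`, `y_train` and `y_test` without showing how they were built. As a rough sketch of one plausible setup (the 80/20 stratified split, the `StandardScaler`, and the helper names `prepare_splits` and `load_spambase` are all assumptions for illustration, not taken from my actual script):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def prepare_splits(X, y, test_size=0.2, random_state=42):
    """Hypothetical helper: split first, then scale with training statistics only."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )
    scaler = StandardScaler()  # assumption: the scaler actually used may differ
    X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
    X_test_scaled = scaler.transform(X_test)  # reuse the training mean/std
    return X_train_scaled, X_test_scaled, y_train, y_test


def load_spambase():
    """Hypothetical helper: fetch spambase (OpenML id=44); needs network access."""
    from sklearn.datasets import fetch_openml

    X, y = fetch_openml(data_id=44, as_frame=False, return_X_y=True)
    # Assumption: the class labels arrive as strings and convert cleanly to int.
    return np.asarray(X, dtype=float), np.asarray(y).astype(int)


# Usage (not executed here because it downloads the dataset):
# X, y = load_spambase()
# X_train_scaled, X_test_scaled, y_train, y_test = prepare_splits(X, y)
```

Fitting the scaler on the training split only keeps test-set statistics out of training, so both models see the same inputs.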
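import time
from sklearn.linear_model import SGDClassifier as SklearnSGDClassifier
from sklearn.metrics import accuracy_score
from concrete.ml.sklearn import SGDClassifier as ConcreteSGDClassifier
# Plain scikit-learn baseline on scaled data
clf_sklearn = SklearnSGDClassifier(
    random_state=42,
    max_iter=15,
    loss="log_loss",
)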
train_start = time.time()
clf_sklearn.fit(X_train_scaled, y_train)
train_end = time.time()
print(f"Training time in seconds: {(train_end - train_start):.4f}")
test_start = time.time()
y_pred_sklearn_scaled = clf_sklearn.predict(X_test_scaled)
test_end = time.time()
print(f"Test time in seconds: {(test_end - test_start):.4f}")
accuracy_sklearn = accuracy_score(y_test, y_pred_sklearn_scaled)
print(f"Sklearn scaled data accuracy: {accuracy_sklearn*100:.2f}%")
Training time in seconds: 0.0206
Test time in seconds: 0.0020
Sklearn scaled data accuracy: 87.68%
# Using Concrete model on scaled data
parameters_range = (-1.0, 1.0)
clf_concrete = ConcreteSGDClassifier(
    random_state=42,
    max_iter=1000,
    fit_encrypted=True,
    parameters_range=parameters_range,
    # verbose=True,
)
# Train on encrypted data (simulated FHE)
start_time = time.time()
clf_concrete.fit(X_train_scaled, y_train, fhe="simulate")  # fhe="simulate" is faster; fhe="execute" runs real FHE
end_time = time.time()
print(f"Training time in seconds: {(end_time - start_time):.4f}")
# Compile model using training data
compile_start = time.time()
clf_concrete.compile(X_train_scaled)
compile_end = time.time()
print(f"Compile time in seconds: {(compile_end - compile_start):.4f}")
# Measure accuracy on the test set using FHE simulation
test_start = time.time()
y_pred_fhe = clf_concrete.predict(X_test_scaled, fhe="simulate")
test_end = time.time()
print(f"Test time in seconds: {(test_end - test_start):.4f}")
print(f"Test time per sample: {(test_end - test_start)/X_test.shape[0]:.4f}")
accuracy_fhe = accuracy_score(y_test, y_pred_fhe)
print(f"Full encrypted fit (simulated) accuracy: {accuracy_fhe*100:.2f}%")
Training time in seconds: 89.9006
Compile time in seconds: 1.8087
Test time in seconds: 6.8132
Test time per sample: 0.0071
Full encrypted fit (simulated) accuracy: 60.97%