Making a fast satellite image classification use case [was: Some questions about GPU acceleration] #839

summer-xrx · 2024-08-18T08:36:16Z

Hello, I am very curious about GPU acceletarion. Now, I have the following questions:

Why does the concrete library support GPU acceleration, while other mainstream libraries such as google/fully-homomorphic-encryption, SEAL, OpenFHE, TenSEAL (based on SEAL), and secretflow do not support GPU acceleration?
Compared to the aforementioned libraries, what are the advantages of the concrete library?
Why can't GPU acceleration be implemented in those aforementioned libraries now? What are the technical challenges of using GPU to accelerate homomorphic encryption?
Why does the concrete library implement GPU acceleration while concrete-ml does not? Are there any technical challenges in implementing GPU acceleration in homomorphic machine learning?
Looking forward to your answer, thank you!

bcm-at-zama · 2024-08-19T07:51:03Z

Hey,

Why does the concrete library support GPU acceleration, while other mainstream libraries such as google/fully-homomorphic-encryption, SEAL, OpenFHE, TenSEAL (based on SEAL), and secretflow do not support GPU acceleration?

Well, we can only answer for Zama and its libraries. For questions to other products, you should go to their support, or maybe to discord.fhe.org in their respective channel.

Having a GPU acceleration is for sure a good thing, since we can make it faster than on CPU. At Zama, we are at the beginning for what's related to Concrete and Concrete ML and already, we see very significant improvements.

Compared to the aforementioned libraries, what are the advantages of the concrete library?

Hard to summarize in a few sentences. We have very easy tools to use, for developers to build privacy in their app without knowing anything in cryptography. Same API than Torch and scikit-learn, which is convenient for users. Exact computations thanks to TFHE, as opposed to other libraries which use CKKS. Open source, everything which is claimed can be reproduced on your side. To name a few.

Why can't GPU acceleration be implemented in those aforementioned libraries now? What are the technical challenges of using GPU to accelerate homomorphic encryption?

Again, ask the other companies / maintainers. Pretty sure they also work on GPU, if they are still active (some libs you mentioned are a bit abandoned or deprecated) and sufficiently staffed to do it.

The challenges is about making it faster than CPU, it requires knowledge and expertise in GPU programming, and time.

Why does the concrete library implement GPU acceleration while concrete-ml does not? Are there any technical challenges in implementing GPU acceleration in homomorphic machine learning?
Looking forward to your answer, thank you!

Concrete GPU was released in Q2 2024, Concrete ML GPU will be released in Q3 2024, so just a bit of patience here.

Cheers

summer-xrx · 2024-08-19T09:53:44Z

Hi, @bcm-at-zama,
Thank you very much for your generous reply! May I ask another question? When we run a CNN network using concrete-ml library on the server for image classification, it takes a long time to classify an image when the network is large. When the network becomes smaller, the time consumption will decrease, but the accuracy will decrease significantly. From this, we infer that this library is of scientific or experimental nature and still has a long way to go before it can be practically applied. May I ask if our inference is correct?

bcm-at-zama · 2024-08-19T09:59:33Z

Hey, could you share your code, maybe? Hard to say without knowing, it depends on how big your NN is. Having a good accuracy with a smaller NN also depends quite a lot on the task you want to perform. We can perform not-that-small NN in FHE on non-trivial tasks as CIFAR, in less than a minute (and already less than 30s with GPU), eg, https://github.com/zama-ai/concrete-ml/blob/main/use_case_examples/cifar/cifar_brevitas_training/README.md#accuracy-and-performance . And yes, we're making progress on the speed side, quarters after quarters.

summer-xrx · 2024-08-19T10:52:42Z

Hi, @bcm-at-zama,
The CPU of the server has 64 cores and 128 threads. The memory is 996G.
The task is satellite image classification, which is a "10-classification" task. The dataset is NaSC-TG2, which contains 20000 RGB images that are 128*128, including 16000 for training and 4000 for test. In our code, to reduce time consumption, we "resized" the images into 64 * 64 images.
The definition of "small network" is as follows:

The definition of "big network" is as follows:

In the task, using the "small network" can achieve an accuracy of 71% in plaintext. When nbits=6, the accuracy in ciphertext decreases to 61.2%, and the processing time for each image is about 5 minutes. When nbits=7, the accuracy in ciphertext is 67%, but the processing time for each image is about 27 minutes.
If the "big network" is used, an accuracy of 79% can be achieved in plaintext. However, when nbits=6, the accuracy in ciphertext is around 61% (10 minutes) . When nbits=7, higher accuracy can be guaranteed, but the processing time for each image is longer.
Looking forward to your reply, thank you!

The code of "small network" is as follows:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, Subset
from dataloader import train_loader, test_loader,test_dataset
import torch.nn.functional as F
from concrete.ml.torch.compile import compile_torch_model
import numpy as np
import time
from tqdm import tqdm


device = torch.device('cpu')


class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 8, kernel_size=7, stride=2)
        self.conv2 = nn.Conv2d(8, 12, kernel_size=3, stride=1)
        self.conv3 = nn.Conv2d(12, 8, kernel_size=3, stride=1)
        self.pool = nn.AvgPool2d(kernel_size=3, stride=2)
        self.fc1 = nn.Linear(72,10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.conv1(x))
        x = self.pool(x)
        x = self.conv2(x)
        x = self.pool(x)
        x = self.conv3(x)
        x = x.view(-1, 72)
        x = self.fc1(x)
        return x

def test(model, test_loader, device):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for inputs, labels in test_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            _, predicted = torch.max(outputs.data, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    print(f'Accuracy of the model on the 4000 test images: {100 * correct / total:.2f}%')
    return 100*correct/total


#def train(model, train_loader, criterion, optimizer, scheduler, device, epochs=5):
def train(model, train_loader, criterion, optimizer, device, epochs=5):
    model.train()
    max_accurrancy_rate = 0
    for epoch in range(epochs):
        running_loss = 0.0
        for i, (inputs, labels) in enumerate(train_loader):
            inputs, labels = inputs.to(device), labels.to(device)

            outputs = model(inputs)
            loss = criterion(outputs, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            if i % 25 == 24:
                print(f'Epoch [{epoch + 1}/{epochs}], Step [{i + 1}/{len(train_loader)}], Loss: {running_loss / 100:.4f}')
                running_loss = 0.0
        accurrancy_rate = test(model, test_loader, device)
        torch.save(model.state_dict(),f"./models_new/model_epoch{epoch}_{accurrancy_rate}.pth")
        #test(model, test_loader, device)
    #scheduler.step()

def test_with_concrete(quantized_module, test_loader, use_sim):
    all_y_pred = np.zeros((len(test_loader.dataset)), dtype=np.int64)
    all_targets = np.zeros((len(test_loader.dataset)), dtype=np.int64)
    idx = 0
    for data, target in tqdm(test_loader):
        data = data.numpy()
        target = target.numpy()
        fhe_mode = "simulate" if use_sim else "execute"
        y_pred = quantized_module.forward(data, fhe=fhe_mode)
        endidx = idx + target.shape[0]
        all_targets[idx:endidx] = target
        y_pred = np.argmax(y_pred, axis=1)
        all_y_pred[idx:endidx] = y_pred
        idx += target.shape[0]
    n_correct = np.sum(all_targets == all_y_pred)
    return n_correct / len(test_loader.dataset)


model = CNN().to(device)

model.load_state_dict(torch.load('models_new64/model_epoch84_70.5.pth'))
test(model, test_loader, device)
# criterion = nn.CrossEntropyLoss()
# optimizer = optim.Adam(model.parameters(), lr=0.0005)
#optim_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 20, gamma=0.8)


#train(model, train_loader, criterion, optimizer, optim_scheduler, device, epochs=1)
# train(model, train_loader, criterion, optimizer, device, epochs=100)
#exit()

#model = CNN()
#model.load_state_dict(torch.load('models/model_epoch4.pht'))
# model=torch.load('models\model_epoch29.pht')

#model = CNN().to(device)
#model.load_state_dict(torch.load('models/model.pth'),strict=False)

#test(model, test_loader, device)

#import torch.onnx
#dummy_input = torch.randn(32, 1, 128, 128)
#torch.onnx.export(model, dummy_input, "model.onnx", do_constant_folding=True)

train_features = []
train_labels = []

for inputs, labels in train_loader:
    train_features.append(inputs)
    train_labels.append(labels)

train_features = torch.cat(train_features) #tensor
train_labels = torch.cat(train_labels)

x_train = train_features.to(device)#.numpy()
y_train = train_labels.to(device)#.numpy()

n_bits = 6

test_features = []
test_labels = []

for inputs, labels in test_loader:
    test_features.append(inputs)
    test_labels.append(labels)

test_features = torch.cat(test_features) #tensor
test_labels = torch.cat(test_labels)

x_test = test_features.to(device)#.numpy()
y_test = test_labels.to(device)#.numpy()

print("===================Start Compile========================")
q_module = compile_torch_model(model, x_train[:,:], n_bits=n_bits,rounding_threshold_bits={"n_bits": n_bits+1, "method": "approximate"})
# # q_module = compile_torch_model(model, x_train, n_bits=6,rounding_threshold_bits={"n_bits": 6, "method": "approximate"})


print(q_module.fhe_circuit.statistics)

start_time = time.time()
accs = test_with_concrete(
    q_module,
    test_loader,
    use_sim=True,
)
sim_time = time.time() - start_time

print(f"Simulated FHE execution for {n_bits} bit network accuracy: {(100*accs):.2f}%")
 # Generate keys first
t = time.time()
q_module.fhe_circuit.keygen()
print(f"Keygen time: {time.time()-t:.2f}s")
# Run inference in FHE on a single encrypted example
mini_test_dataset = TensorDataset(torch.Tensor(x_test[:1, :]), torch.Tensor(y_test[:1]))
mini_test_dataloader = DataLoader(mini_test_dataset)

t = time.time()
accuracy_test = test_with_concrete(
    q_module,
    mini_test_dataloader,
    use_sim=False,
)
elapsed_time = time.time() - t
time_per_inference = elapsed_time / len(mini_test_dataset)
accuracy_percentage = 100 * accuracy_test

print(
    f"Time per inference in FHE: {time_per_inference:.2f} "
    f"with {accuracy_percentage:.2f}% accuracy")
'''

bcm-at-zama · 2024-08-19T11:46:26Z

Thanks a lot for this information, it's very interesting. I need to talk with the team, to analyse that. As it's summer here, it may take a bit of time, but we'll come back to you.

What about we have a zoom call, to discuss about your use-case? If it interests you, would you send an email to [email protected] and say you want to speak with Benoit, please? We can certainly help you making your use-case even better. Cheers

bcm-at-zama · 2024-08-19T12:50:01Z

Also @summer-xrx could you send a self-contained piece of code, please? Eg, it seems that your code in #839 (comment) does not work / has no reference to its dataset, eg.

bcm-at-zama · 2024-08-19T13:02:09Z

Also, one thing which would help would be to compile with show_mlir=True and attach the MLIR. Doing this, we can see the number of PBS, which are the expensive operations.

summer-xrx · 2024-08-20T04:56:15Z

OK, @bcm-at-zama, thank you for your help!
The result of "small network" with nbits=6 is as follows. From the picture, we can see that programmable_bootstrap_count=10396, and "simulate" mode is very fast but "execute" mode is slow.

The result of "small network" with nbits=7 is as follows. From the picture, we can see that programmable_bootstrap_count=10396, and "simulate" mode is fast but "execute" mode is very slow (56min/image).

The result of "big network" with nbits=6 is as follows. From the picture, we can see that programmable_bootstrap_count=20792, and "simulate" mode is fast but "execute" mode is very slow (10min/image).

When running the code, CPU utilization rate is nearly 100%:

Our goal is to achieve a high level of accuracy (>=75%) with low time consumption. (<=5 minutes)
The complete code will be sent to you via email later. Looking forward to your reply, thank you!

bcm-at-zama · 2024-08-20T13:46:58Z

Thanks for the extra info here! In particular, your 'When running the code, CPU utilization rate is nearly 100%' is worth having a look. We'll have a look, it's some work, so be patient please.

Regarding your 10k PBS or 20k PBS: we have about 150k PBS in our CIFAR example, with a nbits=6 if I am not wrong, and it runs in about ~40s on CPU and ~20s in CPU. Which means I think doing the 10k or 20k PBS in less than 5 min is really achievable, at first sight. The tech team will have a look and we'll tell you more.

Yes please send the code, we need it, we can't wait without.

bcm-at-zama · 2024-08-20T13:49:52Z

CC @andrei-stoian-zama @jfrery

bcm-at-zama · 2024-08-20T13:51:02Z

@summer-xrx : please send a mail to [email protected] for Benoit, and let's meet to discuss over Zoom

summer-xrx · 2024-08-20T17:39:32Z

Hello, @bcm-at-zama,
The complete code has been sent to your email [email protected]. It is the "small network" with nbits=6. I'm sorry that I don't have the conditions to have a Zoom meeting with you on my end.

Looking forward to your reply, thank you!

bcm-at-zama · 2024-08-21T08:02:34Z

Let me have a look to your email, thanks.

bcm-at-zama · 2024-08-21T08:03:32Z

Why can't you use Zoom, @summer-xrx ? Or maybe tell me what kind of call we can have, I'm pretty open (Google Meet, eg). I would like to discuss the use-case in more details with you. Cheers

bcm-at-zama · 2024-08-21T08:07:28Z

@summer-xrx : could you re-send your mail please

with the .py directly, we don't take zip files from external sources
and with a real "From" such that we can answer to you
Once I have your email, we can follow up privately if needed
Thanks

bcm-at-zama · 2024-08-21T08:09:50Z

And when I read "78.83M", I am a bit worried: we'll not be able to audit such a large code. Hopefully it's just data, that we don't really need and that you could replace with random inputs / datasets?

summer-xrx · 2024-08-21T08:26:32Z

Hello, @bcm-at-zama,
Don't worry. The codes are 9KB at all. The remaining files are datasets and our trained models, I think they would help. In the new zip file, I have deleted the trainning dataset.
Thanks.

[edit by @bcm-at-zama : removed the zip file]

summer-xrx · 2024-08-21T08:43:55Z

My e-mail is [email protected]. And about why I can't use Zoom, I have 3 reasons.

Because of the nature of my work, I cannot hold video meetings with foreigners.
My network here is not very stable, I'm worried it will affect the quality of the meeting.
My spoken English is not good.

I hope you can understand my difficulties, thank you!

bcm-at-zama · 2024-08-21T09:34:35Z

@summer-xrx : we can only offer support if users send us the code copy pasted in text in an email, or if they create a private/public github repo, as we cannot open external files. Thus, I deleted the zip file in your previous message

summer-xrx · 2024-08-23T04:40:40Z

Hello, @bcm-at-zama,
I have created a public GitHub repo named "[satellite-image-classification]"(https://github.com/summer-xrx/satellite-image-classification). I'm sorry for keeping you waiting for so long due to my work reasons.

Looking forward to your reply, thank you!

bcm-at-zama · 2024-08-23T07:00:40Z

Thanks a lot, we'll have a look! It may take a bit of time, we'll keep you updated.

summer-xrx · 2024-08-27T02:22:38Z

Hello, may I ask if there is any new information?

bcm-at-zama · 2024-08-27T08:02:37Z

No there isn't: we'll tell you when we have updates, but it may not be that soon. We have some priorities to take care of + some people are in holidays. We'll keep you updated

andrei-stoian-zama · 2024-08-28T14:41:38Z

Could you post your model's MLIR ? See here how to get it: https://docs.zama.ai/concrete-ml/deep-learning/fhe_assistant#complexity-analysis

You should also check out these performance tips:
https://docs.zama.ai/concrete-ml/deep-learning/optimizing_inference

bcm-at-zama mentioned this issue Aug 19, 2024

Some questions about GPU acceleration zama-ai/concrete#1003

Closed

bcm-at-zama changed the title ~~Some questions about GPU acceleration~~ Making a fast satellite image classification use case [was: Some questions about GPU acceleration] Aug 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Making a fast satellite image classification use case [was: Some questions about GPU acceleration] #839

Making a fast satellite image classification use case [was: Some questions about GPU acceleration] #839

summer-xrx commented Aug 18, 2024

bcm-at-zama commented Aug 19, 2024

summer-xrx commented Aug 19, 2024

bcm-at-zama commented Aug 19, 2024

summer-xrx commented Aug 19, 2024 •

edited

Loading

bcm-at-zama commented Aug 19, 2024

bcm-at-zama commented Aug 19, 2024

bcm-at-zama commented Aug 19, 2024

summer-xrx commented Aug 20, 2024

bcm-at-zama commented Aug 20, 2024

bcm-at-zama commented Aug 20, 2024

bcm-at-zama commented Aug 20, 2024

summer-xrx commented Aug 20, 2024

bcm-at-zama commented Aug 21, 2024

bcm-at-zama commented Aug 21, 2024

bcm-at-zama commented Aug 21, 2024

bcm-at-zama commented Aug 21, 2024

summer-xrx commented Aug 21, 2024 •

edited by bcm-at-zama

Loading

summer-xrx commented Aug 21, 2024

bcm-at-zama commented Aug 21, 2024

summer-xrx commented Aug 23, 2024

bcm-at-zama commented Aug 23, 2024

summer-xrx commented Aug 27, 2024

bcm-at-zama commented Aug 27, 2024

andrei-stoian-zama commented Aug 28, 2024

Making a fast satellite image classification use case [was: Some questions about GPU acceleration] #839

Making a fast satellite image classification use case [was: Some questions about GPU acceleration] #839

Comments

summer-xrx commented Aug 18, 2024

bcm-at-zama commented Aug 19, 2024

summer-xrx commented Aug 19, 2024

bcm-at-zama commented Aug 19, 2024

summer-xrx commented Aug 19, 2024 • edited Loading

bcm-at-zama commented Aug 19, 2024

bcm-at-zama commented Aug 19, 2024

bcm-at-zama commented Aug 19, 2024

summer-xrx commented Aug 20, 2024

bcm-at-zama commented Aug 20, 2024

bcm-at-zama commented Aug 20, 2024

bcm-at-zama commented Aug 20, 2024

summer-xrx commented Aug 20, 2024

bcm-at-zama commented Aug 21, 2024

bcm-at-zama commented Aug 21, 2024

bcm-at-zama commented Aug 21, 2024

bcm-at-zama commented Aug 21, 2024

summer-xrx commented Aug 21, 2024 • edited by bcm-at-zama Loading

summer-xrx commented Aug 21, 2024

bcm-at-zama commented Aug 21, 2024

summer-xrx commented Aug 23, 2024

bcm-at-zama commented Aug 23, 2024

summer-xrx commented Aug 27, 2024

bcm-at-zama commented Aug 27, 2024

andrei-stoian-zama commented Aug 28, 2024

summer-xrx commented Aug 19, 2024 •

edited

Loading

summer-xrx commented Aug 21, 2024 •

edited by bcm-at-zama

Loading