ggml-backend update: buffer types, backend registry, graph compare, tests #620
Conversation
Nice update. Normally, one has to create the graph every time before computation; it cannot be reused, and simply assigning IDs to the input tensors is not enough today:

ggml_allocr_reset(compute_alloc);
struct ggml_cgraph * gf = build_graph(z, decode_graph);
ggml_allocr_alloc_graph(compute_alloc, gf);
if (ggml_backend_is_cpu(backend)) {
    ggml_backend_cpu_set_n_threads(backend, n_threads);
}
ggml_backend_graph_compute(backend, gf);
ggml_backend_tensor_get(gf->nodes[gf->n_nodes - 1], work_result->data, 0, ggml_nbytes(work_result));

Ideally, the graph could be created once and reused, with the inputs set by ID:

ggml_allocr_reset(compute_alloc);
ggml_set_input(gf, compute_alloc, Z_INPUT, z);
ggml_allocr_alloc_graph(compute_alloc, gf);
if (ggml_backend_is_cpu(backend)) {
    ggml_backend_cpu_set_n_threads(backend, n_threads);
}
// gf is created once; it can also be the graph used to measure the buffer size
ggml_backend_graph_compute(backend, gf);
ggml_backend_tensor_get(gf->nodes[gf->n_nodes - 1], work_result->data, 0, ggml_nbytes(work_result));
// constant tensor: allocate it once when initializing the model params
attn_scale = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 1);
ggml_allocr_alloc(alloc, attn_scale);
float scale = 1.0f / sqrt((float) d_head);
ggml_backend_tensor_set(attn_scale, &scale, 0, sizeof(scale));

// to avoid assert(data == NULL)
// alloc all tensors linked to this context
for (struct ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) {
    if (t->data == NULL) {
        ggml_allocr_alloc(alloc, t);
    }
}
If possible, it could be beneficial for parameter tensors to be dynamically passed to a backend (dynamic offloading) as needed during computation.
I think that what you mean here is that graphs used for measure cannot be re-used for computation, which is not very good for graphs that only need to be evaluated once. I think we could solve this by assigning offsets rather than absolute addresses to tensors (ie. in the
I will do this.
The goal is indeed to help debugging issues with the backends. I already found some differences in the GELU op between the CPU and CUDA backends while testing this with the gpt-2 example, and I am sure that there are more cases. I will try to add automated tests for all the ops.
I don't really understand what you mean by this. If you prefer to use Spanish, you can send me a message at the email address in my GitHub profile.
@ggerganov is the GitHub CI capable of using the Metal backend? I suspect that it is failing somewhere during
Hm, it seems it does not work, but it does not print any meaningful error.
I am not sure what is happening either; it seems that the log is missing the output from
@FSSRepo can you help me define a few test cases for im2col? I need different parameter values to test; currently it is tested with these (the default values in the constructor):

// GGML_OP_IM2COL
struct test_im2col : public test_case {
    const ggml_type type_a;
    const ggml_type type_b;
    const std::array<int64_t, 4> ne_a;
    const std::array<int64_t, 4> ne_b;
    const int s0;
    const int s1;
    const int p0;
    const int p1;
    const int d0;
    const int d1;
    const bool is_2D;

    std::string vars() override {
        return VARS_TO_STR11(type_a, type_b, ne_a, ne_b, s0, s1, p0, p1, d0, d1, is_2D);
    }

    test_im2col(ggml_type type_a = GGML_TYPE_F16, ggml_type type_b = GGML_TYPE_F32,
            std::array<int64_t, 4> ne_a = {10, 10, 10, 10},
            std::array<int64_t, 4> ne_b = {10, 10, 10, 10},
            int s0 = 1, int s1 = 1,
            int p0 = 0, int p1 = 0,
            int d0 = 1, int d1 = 1,
            bool is_2D = false)
        : type_a(type_a), type_b(type_b), ne_a(ne_a), ne_b(ne_b), s0(s0), s1(s1), p0(p0), p1(p1), d0(d0), d1(d1), is_2D(is_2D) {}

    ggml_tensor * build_graph(ggml_context * ctx) override {
        ggml_tensor * a = ggml_new_tensor(ctx, type_a, 4, ne_a.data());
        ggml_tensor * b = ggml_new_tensor(ctx, type_b, 4, ne_b.data());
        ggml_tensor * out = ggml_im2col(ctx, a, b, s0, s1, p0, p1, d0, d1, is_2D);
        return out;
    }
};
My mistake, I gave you the wrong order of the tensors. Wait a little bit.
Fixed:

// GGML_OP_IM2COL
struct test_im2col : public test_case {
    const ggml_type type_input;
    const ggml_type type_kernel;
    const std::array<int64_t, 4> ne_input;
    const std::array<int64_t, 4> ne_kernel;
    // stride
    const int s0;
    const int s1;
    // padding
    const int p0;
    const int p1;
    // dilation
    const int d0;
    const int d1;
    // mode
    const bool is_2D;

    std::string vars() override {
        return VARS_TO_STR11(type_input, type_kernel, ne_input, ne_kernel, s0, s1, p0, p1, d0, d1, is_2D);
    }

    test_im2col(ggml_type type_input = GGML_TYPE_F16, ggml_type type_kernel = GGML_TYPE_F32,
            std::array<int64_t, 4> ne_input = {10, 10, 3, 1},  // [input_width, input_height, input_channels, 1]
            std::array<int64_t, 4> ne_kernel = {3, 3, 3, 1},   // [kernel_width, kernel_height, input_channels, 1]
            int s0 = 1, int s1 = 1,
            int p0 = 1, int p1 = 1,
            int d0 = 1, int d1 = 1,
            bool is_2D = true)
        : type_input(type_input), type_kernel(type_kernel), ne_input(ne_input), ne_kernel(ne_kernel), s0(s0), s1(s1), p0(p0), p1(p1), d0(d0), d1(d1), is_2D(is_2D) {}

    ggml_tensor * build_graph(ggml_context * ctx) override {
        ggml_tensor * input = ggml_new_tensor(ctx, type_input, 4, ne_input.data());
        ggml_tensor * kernel = ggml_new_tensor(ctx, type_kernel, 4, ne_kernel.data());
        ggml_tensor * out = ggml_im2col(ctx, kernel, input, s0, s1, p0, p1, d0, d1, is_2D);
        return out;
    }
};
I had to swap the kernel and input types, but otherwise it works.
The ggml-ci is also failing with Metal because it cannot find the
It is still going to fail after that because some ops are broken in Metal when the number of columns is not a multiple of 4 and there are no checks for that. Additionally, both CUDA and Metal have broken support for broadcasting with add and mul, but I think we should keep these tests enabled to remind us to fix them.
Will take a detailed look tomorrow. Great work on the tests - I was just thinking about something like this. Much needed.
Regarding multiple CUDA devices: is it fully functional at this point? Is there going to be a conflict around the shared memory pool if it is used by more than one device, or, thanks to the lock, should everything be fine? With this infrastructure, if, let's say, I wanted to implement an alternative backend for an existing GPU (for example,
I wouldn't say that it is fully functional already because I haven't run enough tests to be confident that everything works correctly, but it should work. The memory pool is already per-device, so that shouldn't cause issues. There are most definitely still synchronization issues in the CUDA backend, so it is not possible to use two devices simultaneously (e.g. in different threads). It should work for splitting a model across multiple devices with
Yes, I don't see why that wouldn't work, except maybe for the few hooks that some backends (cuda/opencl) have in ggml.c. Ultimately the goal is to be able to use a single binary for all the backends and detect their availability at runtime.
How come none of the
Edit: got it - dim[3] is not supported |
Great stuff! Merge at will
Buffer types
Backend registry
Allows enumerating and initializing the available backends.
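As a rough usage sketch, assuming the registry entry points introduced here look along the lines of ggml_backend_reg_get_count, ggml_backend_reg_get_name, and ggml_backend_reg_init_backend_from_str (check ggml-backend.h for the exact signatures):

// enumerate the registered backends and initialize one by name
size_t n_backends = ggml_backend_reg_get_count();
for (size_t i = 0; i < n_backends; i++) {
    printf("backend %zu: %s\n", i, ggml_backend_reg_get_name(i));
}

// names follow the registry, e.g. "CPU", "CUDA0", "CUDA1", "Metal"
ggml_backend_t backend = ggml_backend_reg_init_backend_from_str("CPU");
if (backend != NULL) {
    // ... build and compute graphs with this backend ...
    ggml_backend_free(backend);
}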
Graph copy & compare between backends
Copy a graph to a different backend and evaluate it on both, one op at a time. A callback can be used to compare the results of each operation.
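A sketch of how such a comparison callback might be used; the callback signature and the ggml_backend_compare_graph_backend call below are assumptions based on this description, so see ggml-backend.h and test-backend-ops.cpp for the actual interface:

// called once per node with the outputs from the two backends
static bool compare_node(int node_index, struct ggml_tensor * t1, struct ggml_tensor * t2, void * user_data) {
    (void) user_data;
    // t1 was computed by the first backend, t2 by the second; compare them here,
    // e.g. copy both to host buffers with ggml_backend_tensor_get and compute an NMSE
    fprintf(stderr, "node %d: %s (%s)\n", node_index, t1->name, ggml_op_desc(t1));
    return true; // returning true continues with the next node
}

// copy the graph to the second backend and evaluate it on both, one op at a time
// ggml_backend_compare_graph_backend(backend_cpu, backend_cuda, gf, compare_node, NULL);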
ggml_backend_alloc_ctx_tensors
Allocate all the tensors in a context into a backend buffer in one step. Ref: #578
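For example, a model loading path could look roughly like this (a sketch; backend is assumed to be an already initialized ggml_backend_t):

// create the tensors in a no_alloc context, then allocate them all at once in a backend buffer
struct ggml_init_params params = {
    /*.mem_size   =*/ ggml_tensor_overhead() * 2,
    /*.mem_buffer =*/ NULL,
    /*.no_alloc   =*/ true,
};
struct ggml_context * ctx = ggml_init(params);

struct ggml_tensor * weight = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4096, 4096);
struct ggml_tensor * bias   = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4096);

// replaces the manual ggml_allocr loop over ggml_get_first_tensor/ggml_get_next_tensor
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);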
ggml_unary_op_name and ggml_op_desc
ggml_op_desc can be used as a replacement for ggml_op_name that also returns the name of the unary op when the op is GGML_OP_UNARY.
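For instance, to dump the ops of a graph (gf is assumed to be an already built ggml_cgraph in this sketch):

for (int i = 0; i < gf->n_nodes; i++) {
    struct ggml_tensor * node = gf->nodes[i];
    // ggml_op_desc returns the unary op name (e.g. "GELU") for GGML_OP_UNARY nodes,
    // and falls back to the regular op name otherwise
    printf("%3d: %-12s %s\n", i, ggml_op_desc(node), node->name);
}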
CUDA multiple device support
ggml_backend_cuda_init takes an int device parameter that specifies the CUDA device to use. To use the default device, pass 0. Each device is registered as a different backend in the backend registry, with names CUDA0 for device 0, CUDA1 for device 1, and so on.
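A minimal sketch of initializing two devices (assuming two CUDA GPUs are available):

// one backend instance per device
ggml_backend_t cuda0 = ggml_backend_cuda_init(0); // default device, registered as CUDA0
ggml_backend_t cuda1 = ggml_backend_cuda_init(1); // second device, registered as CUDA1

// ... assign parts of the model/graph to each backend ...

ggml_backend_free(cuda1);
ggml_backend_free(cuda0);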
Backend op tests
Each op implemented in the backends is tested against the CPU backend. Tensors are initialized with random data, and the result of the op is compared using a normalized MSE. New ops implemented in the backends should add a test in test-backend-ops.cpp. Only F16/F32 tests for now; quantized types are not yet supported in the test.
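For example, a new im2col case could be registered roughly like this, following the test_im2col pattern shown earlier (the parameter values are hypothetical, and the exact registration point in test-backend-ops.cpp may differ):

// exercise a strided, padded 2D im2col in addition to the defaults
test_cases.emplace_back(new test_im2col(GGML_TYPE_F16, GGML_TYPE_F32,
    {20, 20, 3, 1},  // input  [width, height, channels, batch]
    { 3,  3, 3, 1},  // kernel [width, height, channels, 1]
    2, 2,            // stride s0, s1
    1, 1,            // padding p0, p1
    1, 1,            // dilation d0, d1
    true));          // is_2D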