Question about performance in for loops #2087

Open
jeshi96 opened this issue Oct 4, 2024 · 1 comment


jeshi96 commented Oct 4, 2024

Hi,

In a similar vein to #2086, I also had a question about dpnp's performance with respect to for loops that repeatedly go in and out of the array context. I tested dpnp on a simple covariance computation that loops over the columns and builds up the covariance matrix. However, dpnp's single-threaded performance is significantly slower than NumPy's; one guess is that this is due to repeatedly going in and out of SYCL's queue, but I was wondering whether you had more insight into this, or whether there is a way to improve the performance.

Here is the covariance scalability plot (number of threads vs. running time) comparing dpnp and NumPy:

[Attached image: covariance scalability plot, number of threads vs. running time, dpnp vs. NumPy]

As before, the test environment is an Intel Xeon Platinum 8380 processor with 80 threads. Each measurement is the median over 10 runs, with the first run discarded (so the cache is warm).
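A simplified sketch of this methodology (time_median here is just an illustrative helper, not the exact benchmark script):

import time
import numpy as np

def time_median(fn, runs=10):
    # Time `runs` executions, discard the first (warm-up) run,
    # and report the median of the rest.
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return float(np.median(times[1:]))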

For the inputs, I used M = 3200, N = 4000, with float_N = np.float64(N) and data = np.fromfunction(lambda i, j: (i * j) / M, (N, M), dtype=np.float64). For dpnp, I ran data = dpnp.asarray(data, device="cpu") (and float_N = dpnp.asarray(float_N, device="cpu"), although this didn't seem to make a difference either way) before starting the tests, so the transfers are not included in the timing results. The code for the covariance computation is as follows (shown for dpnp; the NumPy code is identical except that it uses numpy):

# data, M, float_N, and device ("cpu") are set up as described above
mean = dpnp.mean(data, axis=0)
data -= mean
cov = dpnp.zeros((M, M), dtype=data.dtype, device=device)
for i in range(M):
    # fill row i and column i of the symmetric covariance matrix
    cov[i:M, i] = cov[i, i:M] = data[:, i] @ data[:, i:M] / (float_N - 1.0)

cov.sycl_queue.wait()
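For reference, the setup performed before the timed region amounts to roughly the following (a sketch reconstructed from the description above; the NumPy runs use the same data without the asarray transfers):

import numpy as np
import dpnp

M, N = 3200, 4000
device = "cpu"
float_N = np.float64(N)
data = np.fromfunction(lambda i, j: (i * j) / M, (N, M), dtype=np.float64)

# Transfer inputs to the SYCL device up front, outside the timed region
data = dpnp.asarray(data, device=device)
float_N = dpnp.asarray(float_N, device=device)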

Any help is much appreciated -- thanks!

Best,
Jessica

@ndgrigorian (Collaborator) commented

I have been benchmarking this example.

A first recommendation is to pull the entire denominator, (float_N - 1.0), out of the for loop and instead do something like

float_N_dev = float_N - 1.0  # computed once, outside the loop
for i in range(M):
    cov[i:M, i] = cov[i, i:M] = data[:, i] @ data[:, i:M] / float_N_dev

as the subtraction with a host Python scalar (1.0) incurs the cost of a synchronization to move the value from the host to the device, as well as an additional memory allocation for float_N - 1.0 on every iteration of the loop.

This change alone saved me between 10% and 20%.
