Replies: 2 comments 6 replies
-
cc @jsignell
-
scatter will, by default, create a copy on every worker. How many workers there are depends on how many you are requesting; if you do not specify anything, it depends on the number of CPUs your machine has. In general, this is not what you are asking for, and unless you have a very specific reason to use scatter (imho, most users use this by accident) a better approach might be:

```python
import pandas
from distributed.client import Client

c = Client()

def generate_dataframe():
    return pandas.DataFrame([1, 2, 3])

# Build the dataframe on a worker instead of scattering it from the client.
df_fut = c.submit(generate_dataframe)

# Only gather the result back to the main process if you actually need it there.
df2 = c.gather(df_fut)

def foo(df):
    do_stuff_with_df(df)  # placeholder for whatever you want to do with the dataframe

# Pass the future directly; the worker resolves it without routing the data
# back through the main process.
c.submit(foo, df_fut)
```

I can also recommend looking into dask.delayed, which is a higher-level interface for a similar set of features (a minimal sketch follows below).
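For anyone new to that interface, here is a minimal sketch of the same pattern with dask.delayed; the function and column names are illustrative, not taken from this thread:

```python
import pandas
import dask
from distributed.client import Client

client = Client()  # a local cluster is assumed for this sketch

@dask.delayed
def generate_dataframe():
    # Runs on a worker, so no dataframe is copied from the main process.
    return pandas.DataFrame([1, 2, 3])

@dask.delayed
def add_doubled_column(df):
    # Also runs on a worker; the intermediate result stays in cluster memory.
    return df.assign(doubled=df[0] * 2)

# Nothing executes until compute(); only the final result is pulled back
# to the main process.
result = add_doubled_column(generate_dataframe()).compute()
```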
-
Hi guys,
I would like to find out how many data copies occur when getting an object from distributed memory. An example:
Would `df2` be a copy of `df` at the main process and `df3` at the worker process (2 copies)? Thanks in advance!