
[BUG] Building/updating top-k item index for retrieval models evaluations is quite slow #339

Closed

gabrielspmoreira opened this issue Apr 6, 2022 · 8 comments

Labels: area/performance, enhancement (New feature or request)

@gabrielspmoreira (Member) commented Apr 6, 2022

Bug description

The evaluation of retrieval models is quite slow. In particular, the batch-predict steps dominate: (1) saving the model and (2) building/updating the top-k index, which runs df.map_partitions() over all the item features to generate item embeddings.

Steps/Code to reproduce bug

  1. Download this preprocessed H&M dataset
  2. Run this retrieval training script. It trains for a single epoch and only 10 steps, yet the full run still takes more than 10 minutes:
python scripts/retrieval_train_eval.py --data_path /mnt/nvme0n1/datasets/handm_fashion/data_preproc_v03 --output_path ./ --model_type two_tower --two_tower_mlp_layers 256,128,64 --two_tower_activation relu --two_tower_dropout 0.1 --logits_temperature 0.1 --l2_reg 1e-4 --loss categorical_crossentropy --train_batch_size 4096 --eval_batch_size 128 --epochs 1 --train_steps_per_epoch 10 --lr 1e-4 --lr_decay_rate 0.90 --lr_decay_steps 1000 --optimizer adam --train_metrics_steps 100 --two_tower_embedding_sizes_multiplier 2.0 --neg_sampling inbatch --log_to_tensorboard --log_to_wandb 
  3. Set a breakpoint inside RetrievalModel._load_or_update_candidates_index() and check the runtimes of the following lines during that process:
# In TFModelEncode.__init__() ...
# Takes about 1m30s
model.save(save_path)

# In Model.encode() ...
# Takes about 4m30s. Note: it uses 100% of a CPU core with only small GPU usage
# (maybe by dask), so I wonder whether it is really using the GPU when forwarding
# features through the item tower.
outputs = concat_func([encode_func(self.model, batch) for batch in iterator_func(df)])
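
For reference, one way to capture those runtimes without stepping through a debugger is to wrap the two lines with a timer. This is a minimal sketch; model, save_path, concat_func, encode_func, iterator_func, and df are the names from the snippet above and are assumed to be in scope.

import time

# Time (1): serializing the model to disk (called from TFModelEncode.__init__())
t0 = time.perf_counter()
model.save(save_path)
print(f"model.save: {time.perf_counter() - t0:.1f}s")

# Time (2): forwarding every item-feature batch through the item tower (Model.encode())
t0 = time.perf_counter()
outputs = concat_func([encode_func(self.model, batch) for batch in iterator_func(df)])
print(f"encode loop: {time.perf_counter() - t0:.1f}s")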
@rjzamora commented Apr 11, 2022

@marcromeyn - I had trouble getting an environment working well enough to reproduce this issue, but I did confirm with a toy example that the "model-reloading" overhead (which I suggested as a possible problem offline) is probably not the problem here.

One thing that could be an issue is the use of bare compute() operations in Merlin-models. I believe this will result in many Python threads thrashing the same GPU (device 0) at once. Note that I raised a feature request in Merlin-core for a compute_dask_object utility that would make it a bit easier for us to "formalize" all dask computation in Merlin. Until that is done, perhaps you could try setting compute(scheduler="synchronous") (if you are running on a single GPU anyway)?

Clarification: The compute issue may be a red herring in this case, but I do think it is best for us to formalize that process either way.
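
For illustration, the suggested one-line change looks like this. It is a sketch assuming data is a cudf-backed dask DataFrame and model_encode is the per-partition encoding function used in Merlin-models.

# Default scheduler: multiple Python threads may compute partitions
# concurrently and thrash the same GPU (device 0).
embedding_df = data.map_partitions(model_encode).compute()

# Synchronous scheduler: run the graph in the calling thread, which avoids
# thread contention when only a single GPU is available.
embedding_df = data.map_partitions(model_encode).compute(scheduler="synchronous")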

@rnyak added this to the Merlin 22.05 milestone Apr 11, 2022
@marcromeyn (Contributor)

Thanks for looking into this @rjzamora! I added the one-line change to a PR, @gabrielspmoreira will now check if this indeed speeds up the batch-predict issues. We will keep you posted!

@gabrielspmoreira (Member, Author)

Thanks @marcromeyn and @rjzamora.
As you suggested, I have tried ddf.compute(scheduler="synchronous") in that pipeline.

Here is the line and its timing before the change:

#1m50s
embedding_df = data.map_partitions(model_encode, filter_input_columns=[id_column]).compute()

After the suggested change:

#20s
embedding_ddf = data.map_partitions(model_encode, filter_input_columns=[id_column])
#1m20s
embedding_df = embedding_ddf.compute(scheduler="synchronous")

I noticed that data was a dask_cudf.DataFrame with a single partition. So I tried to split data into more partitions before those lines, but then the ddf.compute() timing got worse:

data = data.to_ddf().repartition(npartitions=10)
...
#3m33s
embedding_df = embedding_ddf.compute(scheduler="synchronous")

@rjzamora

Thanks for testing @gabrielspmoreira !

At first, I was going to say that 20s seems slow for a map_partitions call without the compute, because that line should only be calling model_encode on a small piece of metadata (an empty pd/cudf DataFrame). However, that is probably when the underlying model is being loaded. So, unless you suggest otherwise, I'll just assume for now that it should take ~10-20s to load the model.

Now for the compute call... When I suggested passing scheduler="synchronous", I was assuming that you have cudf-backed data and only want to execute this on a single GPU. Is that the case? (Just checking)

> I noticed that data was a dask_cudf.DataFrame with a single partition. So I tried to split data into more partitions before those lines, but then the ddf.compute() timing got worse

If you only have one GPU, then it definitely makes sense that you are getting better performance with a single large partition than with multiple smaller ones (since they cannot execute in parallel anyway). Since you only have a single partition anyway, would you mind trying the same logic without using dask/map_partitions? E.g.:

df_data = data.compute(scheduler="synchronous")
embedding_df = model_encode(df_data, filter_input_columns=[id_column])

I'm interested to know if this runs any faster.
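
For anyone reproducing this, a small timing helper makes the two paths easy to compare. This is a sketch; data, model_encode, and id_column are the names used earlier in the thread.

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Print wall-clock time for the enclosed block.
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.1f}s")

# Path 1: encode through dask with the synchronous scheduler.
with timed("map_partitions + compute"):
    embedding_df = data.map_partitions(
        model_encode, filter_input_columns=[id_column]
    ).compute(scheduler="synchronous")

# Path 2: materialize the single partition first, then encode it directly.
with timed("compute + direct model_encode"):
    df_data = data.compute(scheduler="synchronous")
    embedding_df = model_encode(df_data, filter_input_columns=[id_column])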

@gabrielspmoreira (Member, Author) commented Apr 28, 2022

Hey @rjzamora. Sorry for the delay in running your suggested test; I have been busy with other tasks these past days.

I re-tested the current approach with the latest code:

#18s
embedding_ddf = data.map_partitions(model_encode, filter_input_columns=[id_column])
#51s
embedding_df = embedding_ddf.compute(scheduler="synchronous")

Then I replaced those lines with your suggestion above:

#4s
df_data = data.compute(scheduler="synchronous")
#3m18s
embedding_df = model_encode(df_data, filter_input_columns=[id_column])

And I see low GPU utilization (~1-5%) in both cases while those commands execute.
Thoughts?
@marcromeyn for viz
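
A quick sanity check for whether TensorFlow actually sees (and places ops on) the GPU during these calls - a hedged sketch, since the low utilization could also come from CPU-side dataframe conversion or dask overhead:

import tensorflow as tf

# Confirm the GPU is visible to TensorFlow in this process.
print(tf.config.list_physical_devices("GPU"))

# Log the device each op is placed on; item-tower ops should land on /GPU:0.
tf.debugging.set_log_device_placement(True)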

@EvenOldridge (Member)

@gabrielspmoreira to retest using new API

@viswa-nvidia

@gabrielspmoreira, please set the milestone we should target this bug fix for.

@gabrielspmoreira (Member, Author)

@EvenOldridge @viswa-nvidia I have re-tested this with RetrievalModel (V1), and the full pipeline - training for 10 steps, evaluating on the train set (100 steps), and evaluating on the full eval set (750 steps) - now takes only 1m15s, compared to the ~10m when this issue was opened.
I also tested with RetrievalModel (V2) and it took 1m59s.
So I am closing this issue, as we now have reasonable runtime and GPU utilization throughout the pipeline execution.
