
[BUG] Building/updating top-k item index for retrieval models evaluations is quite slow #339

Closed

gabrielspmoreira opened this issue Apr 6, 2022 · 8 comments

Labels: area/performance, enhancement (New feature or request)

@gabrielspmoreira (Member) commented Apr 6, 2022

Bug description

The evaluation of retrieval models is quite slow. In particular, the batch-predict steps dominate: (1) saving the model and (2) building/updating the top-k index, which runs df.map_partitions() over all the item features to generate item embeddings.

Steps/Code to reproduce bug

  1. Download this preprocessed H&M dataset
  2. Run this retrieval training script. It trains for a single epoch and only 10 steps, yet the full run still takes more than 10 minutes:
python scripts/retrieval_train_eval.py --data_path /mnt/nvme0n1/datasets/handm_fashion/data_preproc_v03 --output_path ./ --model_type two_tower --two_tower_mlp_layers 256,128,64 --two_tower_activation relu --two_tower_dropout 0.1 --logits_temperature 0.1 --l2_reg 1e-4 --loss categorical_crossentropy --train_batch_size 4096 --eval_batch_size 128 --epochs 1 --train_steps_per_epoch 10 --lr 1e-4 --lr_decay_rate 0.90 --lr_decay_steps 1000 --optimizer adam --train_metrics_steps 100 --two_tower_embedding_sizes_multiplier 2.0 --neg_sampling inbatch --log_to_tensorboard --log_to_wandb 
  3. Set a breakpoint inside RetrievalModel._load_or_update_candidates_index() and check the runtimes of the following lines during that process:
# In TFModelEncode.__init__() ...
# Takes about 1m30s
model.save(save_path)

# In Model.encode() ...
# Takes about 4m30s. Note: it uses 100% of a CPU core with only small GPU usage
# (maybe by dask), so I wonder whether it is really using the GPU when forwarding
# features through the item tower.
outputs = concat_func([encode_func(self.model, batch) for batch in iterator_func(df)])
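
For reference, one way to capture those runtimes without stepping through a debugger is to wrap the two lines with a timer. This is a minimal sketch; model, save_path, concat_func, encode_func, iterator_func, and df are the names from the snippet above and are assumed to be in scope.

import time

# Time (1): serializing the model to disk (called from TFModelEncode.__init__())
t0 = time.perf_counter()
model.save(save_path)
print(f"model.save: {time.perf_counter() - t0:.1f}s")

# Time (2): forwarding every item-feature batch through the item tower (Model.encode())
t0 = time.perf_counter()
outputs = concat_func([encode_func(self.model, batch) for batch in iterator_func(df)])
print(f"encode loop: {time.perf_counter() - t0:.1f}s")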
@rjzamora commented Apr 11, 2022

@marcromeyn - I had trouble getting an environment working well enough to reproduce this issue, but I did confirm with a toy example that the "model-reloading" overhead (which I suggested as a possible problem offline) is probably not the problem here.

One thing that could be an issue is the use of bare compute() operations in Merlin-models. I believe this will result in many Python threads thrashing the same GPU (device 0) at once. Note that I raised a feature request in Merlin-core for a compute_dask_object utility that would make it a bit easier for us to "formalize" all dask computation in Merlin. Until that is done, perhaps you could try setting compute(scheduler="synchronous") (if you are running on a single GPU anyway)?

Clarification: The compute issue may be a red herring in this case, but I do think it is best for us to formalize that process either way.
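
For illustration, the suggested one-line change looks like this. It is a sketch assuming data is a cudf-backed dask DataFrame and model_encode is the per-partition encoding function used in Merlin-models.

# Default scheduler: multiple Python threads may compute partitions
# concurrently and thrash the same GPU (device 0).
embedding_df = data.map_partitions(model_encode).compute()

# Synchronous scheduler: run the graph in the calling thread, which avoids
# thread contention when only a single GPU is available.
embedding_df = data.map_partitions(model_encode).compute(scheduler="synchronous")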

@rnyak added this to the Merlin 22.05 milestone Apr 11, 2022
@marcromeyn (Contributor)

Thanks for looking into this @rjzamora! I added the one-line change to a PR, @gabrielspmoreira will now check if this indeed speeds up the batch-predict issues. We will keep you posted!

@gabrielspmoreira (Member, Author)

Thanks @marcromeyn and @rjzamora.
As you suggested, I have tried ddf.compute(scheduler="synchronous") in that pipeline.

Here is the line and its timing before the change:

#1m50s
embedding_df = data.map_partitions(model_encode, filter_input_columns=[id_column]).compute()

After the suggested change:

#20s
embedding_ddf = data.map_partitions(model_encode, filter_input_columns=[id_column])
#1m20s
embedding_df = embedding_ddf.compute(scheduler="synchronous")

I noticed that data was a dask_cudf.DataFrame with a single partition. So I tried to split data into more partitions before those lines, but then the ddf.compute() timing got worse:

data = data.to_ddf().repartition(npartitions=10)
...
#3m33s
embedding_df = embedding_ddf.compute(scheduler="synchronous")

@rjzamora

Thanks for testing @gabrielspmoreira !

At first, I was going to say that 20s seems slow for a map_partitions call without the compute, because that line should only be calling model_encode on a small piece of metadata (an empty pd/cudf DataFrame). However, that is probably when the underlying model is being loaded. So, unless you suggest otherwise, I'll just assume for now that it should take ~10-20s to load the model.

Now for the compute call... When I suggested passing scheduler="synchronous", I was assuming that you have cudf-backed data and only want to execute this on a single GPU. Is that the case? (Just checking)

> I noticed that data was a dask_cudf.DataFrame with a single partition. So I tried to split data into more partitions before those lines, but then the ddf.compute() timing got worse

If you only have one GPU, then it definitely makes sense that you are getting better performance with a single large partition than with multiple smaller ones (since they cannot execute in parallel anyway). Since you only have a single partition anyway, would you mind trying the same logic without using dask/map_partitions? E.g.:

df_data = data.compute(scheduler="synchronous")
embedding_df = model_encode(df_data, filter_input_columns=[id_column])

I'm interested to know if this runs any faster.
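
For anyone reproducing this, a small timing helper makes the two paths easy to compare. This is a sketch; data, model_encode, and id_column are the names used earlier in the thread.

import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Print wall-clock time for the enclosed block.
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.1f}s")

# Path 1: encode through dask with the synchronous scheduler.
with timed("map_partitions + compute"):
    embedding_df = data.map_partitions(
        model_encode, filter_input_columns=[id_column]
    ).compute(scheduler="synchronous")

# Path 2: materialize the single partition first, then encode it directly.
with timed("compute + direct model_encode"):
    df_data = data.compute(scheduler="synchronous")
    embedding_df = model_encode(df_data, filter_input_columns=[id_column])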

@gabrielspmoreira (Member, Author) commented Apr 28, 2022

Hey @rjzamora. Sorry for the delay in running your suggested test; I have been busy with other tasks these past days.

I re-tested the current approach with the latest code:

#18s
embedding_ddf = data.map_partitions(model_encode, filter_input_columns=[id_column])
#51s
embedding_df = embedding_ddf.compute(scheduler="synchronous")

Then I replaced those lines with your suggestion above:

#4s
df_data = data.compute(scheduler="synchronous")
#3m18s
embedding_df = model_encode(df_data, filter_input_columns=[id_column])

And I see low GPU utilization (~1-5%) in both cases while those commands execute.
Thoughts?
@marcromeyn for viz
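
A quick sanity check for whether TensorFlow actually sees (and places ops on) the GPU during these calls - a hedged sketch, since the low utilization could also come from CPU-side dataframe conversion or dask overhead:

import tensorflow as tf

# Confirm the GPU is visible to TensorFlow in this process.
print(tf.config.list_physical_devices("GPU"))

# Log the device each op is placed on; item-tower ops should land on /GPU:0.
tf.debugging.set_log_device_placement(True)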

@EvenOldridge (Member)

@gabrielspmoreira to retest using new API

@viswa-nvidia

@gabrielspmoreira, please set the milestone we should target this bug fix for.

@gabrielspmoreira (Member, Author)

@EvenOldridge @viswa-nvidia I have re-tested this with RetrievalModel (V1), and the full pipeline - training for 10 steps, evaluating on the train set (100 steps), and evaluating on the full eval set (750 steps) - now takes only 1m15s, compared to the ~10m when this issue was opened.
I also tested with RetrievalModel (V2) and it took 1m59s.
So I am closing this issue, as we now have reasonable runtime and GPU utilization throughout the pipeline execution.
