unify xpu and cpu backend and use paged attention #1009

sywangyi · 2024-11-22T01:08:30Z

What does this PR do?

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Signed-off-by: Wang, Yi A <[email protected]>

* refine class IPEXPagedCache's update method
* replace tensor on xpu to List to avoid memory copy
* split IPEXPagedCache's update function into `update_for_prefill` and `update_for_decode`

Signed-off-by: Liu, Kaixuan <[email protected]>
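To illustrate the `update_for_prefill` / `update_for_decode` split named in the commit message above, here is a minimal pure-Python sketch of a paged key-value cache. It is a toy under stated assumptions, not the `IPEXPagedCache` implementation: `ToyPagedCache`, `gather`, the block-table layout and the per-sequence method signatures are all hypothetical, and a real cache stores tensors rather than Python tuples.

```python
from typing import Dict, List


class ToyPagedCache:
    """Toy paged KV cache: entries live in fixed-size blocks.

    Hypothetical illustration only; this is not the IPEXPagedCache API.
    """

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        # Each block holds up to `block_size` (key, value) entries.
        self.blocks: List[list] = [[] for _ in range(num_blocks)]
        self.free_blocks: List[int] = list(range(num_blocks))
        # Per-sequence block table: sequence id -> list of block indices.
        self.block_tables: Dict[int, List[int]] = {}
        self.seq_lens: Dict[int, int] = {}

    def _append(self, seq_id: int, key, value) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        length = self.seq_lens.get(seq_id, 0)
        # Allocate a block when the sequence is empty or its last block is full.
        if length % self.block_size == 0:
            if not self.free_blocks:
                raise RuntimeError("out of cache blocks")
            table.append(self.free_blocks.pop(0))
        self.blocks[table[-1]].append((key, value))
        self.seq_lens[seq_id] = length + 1

    def update_for_prefill(self, seq_id: int, keys: list, values: list) -> None:
        """Write the whole prompt of one sequence in a single call."""
        for key, value in zip(keys, values):
            self._append(seq_id, key, value)

    def update_for_decode(self, seq_id: int, key, value) -> None:
        """Append exactly one newly generated token for one sequence."""
        self._append(seq_id, key, value)

    def gather(self, seq_id: int) -> list:
        """Read a sequence back in order by walking its block table."""
        return [kv for block in self.block_tables.get(seq_id, []) for kv in self.blocks[block]]
```

In this sketch the two entry points share one append path; separating them mainly makes the call sites explicit about whether a full prompt or a single token is being written.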

Signed-off-by: Liu, Kaixuan <[email protected]>

* enable qkv
* split key value into 2 lists

Signed-off-by: Wang, Yi A <[email protected]>

#979)
* enable gpt2, falcon has core dump error in PagedAttention.single_query_cached_kv_attention
* enable new_decoder_arch falcon
* only keep 1 config
* rm autocast

* fix bug when run IPEXCausalModel forward directly; fix bug when using `save_pretrain`
* add LinearGelu Op support for XPU
* fix unit test error
* adjust unit test case
* fix bug

Signed-off-by: Liu, Kaixuan <[email protected]>

* skip assisted decoding unit test for models using paged attention
* XPU CI tests get almost all passed

Signed-off-by: Liu, Kaixuan <[email protected]>

Signed-off-by: Wang, Yi A <[email protected]>

HuggingFaceDocBuilderDev · 2024-11-22T01:13:49Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Signed-off-by: jiqing-feng <[email protected]>

* fix ci config
* fix test versions
* fix ipex version

Signed-off-by: jiqing-feng <[email protected]>

Signed-off-by: jiqing-feng <[email protected]>

* use python3.9 test Signed-off-by: jiqing-feng <[email protected]>

* change ipex transformers limited version in setup
* fix inc tests

Signed-off-by: jiqing-feng <[email protected]>

Signed-off-by: Liu, Kaixuan <[email protected]>

yao-matrix · 2024-11-25T02:57:16Z

@IlyasMoutawwakil @echarlaix , pls help review, we can also have a meeting to review it if needed. Thx.

* fix bert and vit patch
* fix vit and bert save

Signed-off-by: jiqing-feng <[email protected]>

IlyasMoutawwakil · 2024-11-25T09:23:18Z

@yao-matrix reviewing right now

jiqing-feng · 2024-11-25T09:23:27Z

Hi @IlyasMoutawwakil , please also merge this PR #1024. Thanks!

jiqing-feng · 2024-11-26T04:33:07Z

Hi @IlyasMoutawwakil. I have replied to and addressed your comments; please take a second round of review. Thanks~

* simplify forward and save pretrained since no jit support
* fix format
* rm warmup because no jit mode anymore
* simplify forward for causal lm model
* fix paged pkv forward
* disable use_cache when just run forward

Signed-off-by: jiqing-feng <[email protected]>

optimum/intel/ipex/modeling_base.py

echarlaix · 2024-11-26T09:51:13Z

optimum/intel/ipex/modeling_base.py


-        if isinstance(model, torch.jit.RecursiveScriptModule):


TorchScript models will no longer be compatible, which is an important breaking change; we need to catch this case and inform users

also we need to update the documentation

optimum-intel/docs/source/ipex/inference.mdx

Line 18 in ad8a4cb

For now, support is only enabled for CPUs and the original model will be exported via TorchScript. In the future `torch.compile` will be used and model exported via TorchScript will get deprecated.

echarlaix · 2024-11-26T09:51:51Z

optimum/intel/ipex/modeling_base.py

-
-        return cls(model, config=config, model_save_dir=model_save_dir, **kwargs)
+        task = cls.export_feature
+        model = TasksManager.get_model_from_task(


why not use cls.auto_model_class ?

optimum/exporters/ipex/model_patcher.py

tests/ipex/test_modeling.py

Signed-off-by: Liu, Kaixuan <[email protected]>

* nice code
* device type adjustment

Signed-off-by: Liu, Kaixuan <[email protected]>

* enable compile for non-generation tasks
* add no_grad in forward
* warmup compiled model
* disable compile not ready models
* set system level optimize for torch.compile
* fix typo
* add comments
* set torch minimum version for compiling

Signed-off-by: jiqing-feng <[email protected]>

jiqing-feng · 2024-11-29T05:24:08Z

Hi @echarlaix @IlyasMoutawwakil , please review the new changes. Thanks

echarlaix · 2024-11-29T17:16:31Z

optimum/intel/ipex/modeling_base.py

            )
+            return TSModelForCausalLM.from_pretrained(model_id, **kwargs)


An instance of TSModelForCausalLM will be created for every IPEXModel (even for encoder models), which doesn't really make sense to me. Also, from what I can see, it's not tested anywhere. I prefer to raise an error here instead of keeping support that we're not sure works or is compatible with the previous integration
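A minimal sketch of the "raise an error instead" option suggested above, assuming the legacy export can be recognised from a boolean `torchscript` flag on the loaded config. The helper name `check_not_torchscript` and the error message are hypothetical, and a stand-in config object is used so the snippet runs without any model files; this is not necessarily how the merged code detects the case.

```python
from types import SimpleNamespace


def check_not_torchscript(config) -> None:
    """Raise a clear error if `config` marks a legacy TorchScript export.

    Assumption: the checkpoint's config carries a boolean `torchscript` flag.
    """
    if getattr(config, "torchscript", False):
        raise ValueError(
            "This checkpoint was exported with TorchScript, which is no longer supported. "
            "Please load the original (non-exported) model instead."
        )


# Usage with a stand-in config object: a regular model passes silently.
check_not_torchscript(SimpleNamespace(torchscript=False))
```

Failing early with an explicit message keeps the breaking change visible to users, rather than routing every model class through a causal-LM fallback that is not covered by tests.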

tests/ipex/test_modeling.py

echarlaix · 2024-11-29T17:20:05Z

optimum/intel/ipex/modeling_base.py

-
-        return cls(model, config=config, model_save_dir=model_save_dir, **kwargs)
+        model = cls.auto_model_class.from_pretrained(model_id, **kwargs)
+        return cls(model, config=model.config, export=True, **kwargs)


why would export be needed ?

Suggested change

-        return cls(model, config=model.config, export=True, **kwargs)
+        return cls(model, config=model.config, **kwargs)

echarlaix · 2024-11-29T17:21:16Z

optimum/intel/ipex/modeling_base.py


-        if isinstance(model, torch.jit.RecursiveScriptModule):


also we need to update the documentation

optimum-intel/docs/source/ipex/inference.mdx

Line 18 in ad8a4cb

For now, support is only enabled for CPUs and the original model will be exported via TorchScript. In the future `torch.compile` will be used and model exported via TorchScript will get deprecated.

* fix readme and push to hub support
* rm export in tests
* test with torch 2.5.*

Signed-off-by: jiqing-feng <[email protected]>

jiqing-feng · 2024-12-02T05:20:37Z

Hi @echarlaix @IlyasMoutawwakil . Please review the new changes.

echarlaix · 2024-12-02T17:00:03Z

docs/source/ipex/inference.mdx

-You can load your model and apply IPEX optimizations (including weight prepacking and graph mode). For supported architectures like LLaMA, BERT and ViT, further optimizations will be applied by patching the model to use custom operators.
-For now, support is only enabled for CPUs and the original model will be exported via TorchScript. In the future `torch.compile` will be used and model exported via TorchScript will get deprecated.
+You can load your model and apply IPEX optimizations (apply torch.compile for non-generation tasks). For supported architectures like LLaMA, BERT and ViT, further optimizations will be applied by patching the model to use custom operators.
+For now, support is enabled for Intel CPU/GPU. The TorchScript is deprecated.


Suggested change

-For now, support is enabled for Intel CPU/GPU. The TorchScript is deprecated.
+For now, support is enabled for Intel CPU/GPU. Previous models converted to TorchScript will be deprecated in v1.22.

tests/ipex/test_modeling.py

echarlaix · 2024-12-02T17:03:09Z

tests/ipex/test_pipelines.py

+        dtype = torch.float32
+        if IS_XPU:
+            dtype = torch.float16


Suggested change

-        dtype = torch.float32
-        if IS_XPU:
-            dtype = torch.float16
+        dtype = torch.float16 if IS_XPU_AVAILABLE else torch.float32

echarlaix · 2024-12-02T17:05:10Z

tests/ipex/utils_tests.py



+IS_XPU = is_torch_xpu_available(check_device=True)


Suggested change

-IS_XPU = is_torch_xpu_available(check_device=True)
+IS_XPU_AVAILABLE = is_torch_xpu_available(check_device=True)

echarlaix · 2024-12-02T17:05:44Z

tests/ipex/test_pipelines.py

@@ -56,7 +59,6 @@ class PipelinesIntegrationTest(unittest.TestCase):
        "gpt2",
        "gpt_neo",
        "gpt_neox",
-        "llama",


why not keep it ?

echarlaix · 2024-12-02T17:07:34Z

tests/ipex/test_modeling.py

-    @unittest.skipIf(is_ipex_version("<", "2.3.0"), reason="Only ipex version > 2.3.0 supports ipex model patching")
-    def test_patched_model(self):


can we keep this test ? (using a new model)

echarlaix · 2024-12-02T17:42:10Z

tests/ipex/test_modeling.py

+        if IS_XPU:
+            dtype = torch.float16
+        # Test model forward do not need cache.
+        ipex_model = IPEXModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, use_cache=False)


we need to test default here :

Suggested change

-        ipex_model = IPEXModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype, use_cache=False)
+        ipex_model = IPEXModelForCausalLM.from_pretrained(model_id, torch_dtype=dtype)

echarlaix · 2024-12-02T17:43:36Z

tests/ipex/test_modeling.py

-        "llama",
        "llama2",
-        # "phi",
-        "distilgpt2",


why remove llama and distilgpt2 test ?

* fix tests
* fix typo
* add patched tests
* change forward to generate
* fix tests
* fix test model name

Signed-off-by: jiqing-feng <[email protected]>

jiqing-feng · 2024-12-03T07:40:01Z

Hi @echarlaix. I have addressed all your comments; please take another look.

For the change of using generate instead of forward in causal lm tests, we need to pass a cache_class (IPEXPagedCache) in forward just like StaticCache. Otherwise, the past_key_values.update will raise an error because past_key_values is None.

The only way to support forward without past_key_values in the inputs when use_cache=True is to create a cache object inside forward, but that is not reasonable because generate has already created one. I would like to hear your opinion. Thanks!!

jiqing-feng · 2024-12-04T05:45:38Z

Hi @echarlaix . We plan to merge this PR this year which means we need the review done before Xmas. Appreciate it if you could prioritize this PR. Thanks!

IlyasMoutawwakil · 2024-12-04T09:04:07Z

> For the change of using generate instead of forward in causal lm tests, we need to pass a cache_class (IPEXPagedCache) in forward just like StaticCache. Otherwise, the past_key_values.update will raise an error because past_key_values is None.

I'm pretty sure all calls to past_key_values.update in the forward are guarded with an `if past_key_values is not None`.

jiqing-feng · 2024-12-04T09:19:46Z

> > For the change of using generate instead of forward in causal lm tests, we need to pass a cache_class (IPEXPagedCache) in forward just like StaticCache. Otherwise, the past_key_values.update will raise an error because past_key_values is None.
>
> I'm pretty sure all calls to past_key_values.update in the forward are guarded with an `if past_key_values is not None`.

Right, I will check whether that approach works for the IPEX-patched models. Thanks.

sywangyi · 2024-12-05T01:12:38Z

> > For the change of using generate instead of forward in causal lm tests, we need to pass a cache_class (IPEXPagedCache) in forward just like StaticCache. Otherwise, the past_key_values.update will raise an error because past_key_values is None.
>
> I'm pretty sure all calls to past_key_values.update in the forward are guarded with an `if past_key_values is not None`.

That is because, if past_key_values is None and use_cache=True, a DynamicCache is created in the modeling code; see
https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L863 as an example. We do not maintain the DynamicCache logic in the optimum-intel IPEX modeling now; the IPEX modeling only supports the paged cache.

* fix forward without pkv
* patch gpt2 block forward
* fix typo
* revert causal lm tests

Signed-off-by: jiqing-feng <[email protected]>

jiqing-feng · 2024-12-05T02:46:08Z

Hi @sywangyi. use_cache comes from generation_config, which means it is only a generation parameter and should not block a model forward call without past_key_values. We can only guarantee that the inputs contain a cache during generation when use_cache=True, not when forward is called directly.

Hi @IlyasMoutawwakil. I have fixed it per your comment; please check whether any further changes are required before merging. Thanks!
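The behaviour discussed in this thread can be sketched with a toy model: `forward` touches the cache only when one is passed in, while `generate` is the place that creates it. Everything here (`ToyCache`, the sum-based "logit", the function signatures) is a hypothetical stand-in for illustration and not the optimum-intel or transformers code.

```python
from typing import List, Optional


class ToyCache:
    """Hypothetical stand-in for a cache object such as a paged KV cache."""

    def __init__(self) -> None:
        self.tokens: List[int] = []

    def update(self, new_tokens: List[int]) -> List[int]:
        self.tokens.extend(new_tokens)
        return list(self.tokens)


def forward(input_ids: List[int], past_key_values: Optional[ToyCache] = None) -> int:
    """Toy forward pass: the "logit" is just the sum of all visible tokens."""
    if past_key_values is not None:
        # Cached path: only new tokens are passed in; the cache supplies the rest.
        visible = past_key_values.update(input_ids)
    else:
        # No cache was passed: use the given inputs only and never touch a cache.
        visible = input_ids
    return sum(visible)


def generate(prompt: List[int], max_new_tokens: int, use_cache: bool = True) -> List[int]:
    """Toy generation loop: generate, not forward, owns cache creation."""
    cache = ToyCache() if use_cache else None
    output = list(prompt)
    next_input = list(prompt)
    for _ in range(max_new_tokens):
        new_token = forward(next_input, past_key_values=cache) % 10
        output.append(new_token)
        # With a cache, feed only the new token; without one, feed everything.
        next_input = [new_token] if use_cache else list(output)
    return output
```

With this shape, a plain `forward(input_ids)` call works regardless of the `use_cache` setting, and both generation paths produce the same tokens.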

sywangyi and others added 14 commits October 8, 2024 22:57

add page attention implementation remove jit logic

1c35c4f

Signed-off-by: Wang, Yi A <[email protected]>

add support in transformers 4.45

973e034

Signed-off-by: Wang, Yi A <[email protected]>

fix config (#935)

8b574d0

move patch model to init

541a236

Signed-off-by: Wang, Yi A <[email protected]>

fix bug when doing beam search (#954)

80e8071

Signed-off-by: Liu, Kaixuan <[email protected]>

enable qkv concat layer (#958)

184faea

* enable qkv
* split key value into 2 lists

add xpu cache optimization

b341db6

Signed-off-by: Wang, Yi A <[email protected]>

xpu mlp optimization

34ce74d

Signed-off-by: Wang, Yi A <[email protected]>

optimize cache ops in xpu, improve for beam search

45130c9

Signed-off-by: Wang, Yi A <[email protected]>

enable gpt2, falcon has core dump error in PagedAttention.single_quer… (

74eec8b

#979)
* enable gpt2, falcon has core dump error in PagedAttention.single_query_cached_kv_attention
* enable new_decoder_arch falcon
* only keep 1 config
* rm autocast

Merge branch 'main' into paged_attn

459c78c

Signed-off-by: Wang, Yi A <[email protected]>

sywangyi changed the title ~~Paged attn~~ unify xpu and cpu backend and use paged attention Nov 22, 2024

jiqing-feng added 3 commits November 22, 2024 09:22

fix ci config (#1010)

1ab0233

Signed-off-by: jiqing-feng <[email protected]>

Fix tests versions (#1011)

b0cd5db

* fix ci config
* fix test versions
* fix ipex version

Signed-off-by: jiqing-feng <[email protected]>

fix torch test version (#1012)

e31e6d4

Signed-off-by: jiqing-feng <[email protected]>

sywangyi marked this pull request as draft November 22, 2024 01:34

use python3.9 test (#1013)

ed35ffc

* use python3.9 test Signed-off-by: jiqing-feng <[email protected]>

sywangyi marked this pull request as ready for review November 22, 2024 03:00

jiqing-feng and others added 2 commits November 22, 2024 13:11

change ipex transformers limited version in setup (#1015)

a5c48a8

* change ipex transformers limited version in setup
* fix inc tests

Signed-off-by: jiqing-feng <[email protected]>

add XPU LinearAddAdd op (#1017)

388265f

Signed-off-by: Liu, Kaixuan <[email protected]>

fix bert and vit patch (#1022)

ad9b795

* fix bert and vit patch
* fix vit and bert save

Signed-off-by: jiqing-feng <[email protected]>

jiqing-feng mentioned this pull request Nov 25, 2024

Improve INC CI test torch version #1027

Merged

3 tasks

Merge branch 'main' into paged_attn

0d7f8b6

IlyasMoutawwakil reviewed Nov 26, 2024

echarlaix reviewed Nov 26, 2024

echarlaix mentioned this pull request Nov 26, 2024

add ipex backend UKPLab/sentence-transformers#3083

Open

IlyasMoutawwakil reviewed Nov 26, 2024

echarlaix reviewed Nov 26, 2024

echarlaix mentioned this pull request Nov 26, 2024

Add IPEX sentence transformers support #1034

Merged

kaixuanliu added 2 commits November 27, 2024 09:29

nice code (#1035)

51030e5

Signed-off-by: Liu, Kaixuan <[email protected]>

Paged attn (#1036)

587837e

* nice code
* device type adjustment

Signed-off-by: Liu, Kaixuan <[email protected]>

changwangss mentioned this pull request Nov 27, 2024

Support layerwise quantization #1018

Merged

3 tasks

echarlaix reviewed Nov 29, 2024

sywangyi and others added 2 commits December 2, 2024 10:11

Merge branch 'main' into paged_attn

52f8d32

echarlaix reviewed Dec 2, 2024

Fix tests (#1047)

b84274c

* fix tests
* fix typo
* add patched tests
* change forward to generate
* fix tests
* fix test model name

Signed-off-by: jiqing-feng <[email protected]>

Patch gpt2 block forward for passing input_lens. (#1050)

d8251d1

* fix forward without pkv
* patch gpt2 block forward
* fix typo
* revert causal lm tests

Signed-off-by: jiqing-feng <[email protected]>

IlyasMoutawwakil approved these changes Dec 5, 2024

IlyasMoutawwakil merged commit 41f0a46 into main Dec 5, 2024
37 of 45 checks passed

IlyasMoutawwakil deleted the paged_attn branch December 5, 2024 08:55
