[DO NOT MERGE] Upstream codebase diff #470
base: main
Conversation
@@ -0,0 +1,35 @@
name: cpu-test
Check failure
Code scanning / Scorecard
Token-Permissions High
Remediation tip: Visit https://app.stepsecurity.io/secureworkflow.
Tick the 'Restrict permissions for GITHUB_TOKEN'
Untick other options
NOTE: If you want to resolve multiple issues at once, you can visit https://app.stepsecurity.io/securerepo instead.
Click Remediation section below for further remediation help
@@ -0,0 +1,45 @@
name: codespell
Check failure
Code scanning / Scorecard
Token-Permissions High
Remediation tip: Visit https://app.stepsecurity.io/secureworkflow.
Tick the 'Restrict permissions for GITHUB_TOKEN'
Untick other options
NOTE: If you want to resolve multiple issues at once, you can visit https://app.stepsecurity.io/securerepo instead.
Click Remediation section below for further remediation help
def test_stateless_process_group(worker):
    port1 = get_open_port()
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", port1))
Check warning
Code scanning / CodeQL
Binding a socket to all network interfaces Medium test
Copilot Autofix AI about 1 month ago
To fix the problem, we need to bind the socket to a specific interface instead of all interfaces. In this case, we can bind it to the loopback interface 127.0.0.1, which is commonly used for local testing and development. This change limits the socket to accepting connections only from the local machine, reducing the security risk.
@@ -124,3 +124,3 @@
     with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
-        s.bind(("", port1))
+        s.bind(("127.0.0.1", port1))
     port2 = get_open_port()
sock = socket.socket(family=family, type=socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(addr)
Check warning
Code scanning / CodeQL
Binding a socket to all network interfaces Medium
Copilot Autofix AI about 1 month ago
To fix the problem, we need to ensure that the socket binds to a specific interface rather than all interfaces. This can be achieved by modifying the create_server_socket function to check whether the provided address is empty or 0.0.0.0 and either raise an error or use a specific default interface instead.
- Modify the create_server_socket function to validate the address.
- If the address is empty or 0.0.0.0, raise an error or use a default specific interface.
- Update the run_server function to handle the potential error raised by create_server_socket.
@@ -612,2 +612,5 @@
 def create_server_socket(addr: Tuple[str, int]) -> socket.socket:
+    if addr[0] in ("", "0.0.0.0"):
+        raise ValueError("Binding to all interfaces is not allowed. Please specify a valid IP address.")
+
     family = socket.AF_INET
@@ -640,3 +643,7 @@
     sock_addr = (args.host or "", args.port)
-    sock = create_server_socket(sock_addr)
+    try:
+        sock = create_server_socket(sock_addr)
+    except ValueError as e:
+        logger.error(e)
+        return
# Llama3.2 models more reliable.

TOOL_CALL_REGEX = re.compile(
    r"\[([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s)?\),\s*)*([a-zA-Z]+\w*\(([a-zA-Z]+\w*=.*,\s*)*([a-zA-Z]+\w*=.*\s*)?\)\s*)+\]",
Check failure
Code scanning / CodeQL
Inefficient regular expression High
(The same CodeQL finding, "Inefficient regular expression, High", is reported for every additional copy of this TOOL_CALL_REGEX pattern in the diff.)
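CodeQL flags this pattern because the nested quantifiers wrapped around `.*` can backtrack heavily on inputs that almost match. The diff does not show how (or whether) the pattern was changed upstream, so the sketch below only illustrates one common mitigation: restrict the ambiguous `.*` pieces to character classes that cannot overlap the delimiters, and cap the input length before matching. The names `SAFE_TOOL_CALL_REGEX`, `MAX_TOOL_CALL_LEN`, and `looks_like_tool_call` are hypothetical, and the stricter pattern assumes argument values contain no commas or parentheses, which is a semantic change.
```
import re

# Hypothetical guard; tune to the expected payload size instead of matching
# arbitrarily long inputs with an ambiguous pattern.
MAX_TOOL_CALL_LEN = 4096

# Each argument value is restricted to characters that are not ',' '(' or ')',
# so the engine never has to guess where one argument ends and the next begins.
SAFE_TOOL_CALL_REGEX = re.compile(
    r"\[\s*"
    r"[A-Za-z_]\w*\(\s*(?:[A-Za-z_]\w*=[^,()]*(?:,\s*[A-Za-z_]\w*=[^,()]*)*)?\s*\)"
    r"(?:\s*,\s*[A-Za-z_]\w*\(\s*(?:[A-Za-z_]\w*=[^,()]*(?:,\s*[A-Za-z_]\w*=[^,()]*)*)?\s*\))*"
    r"\s*\]"
)

def looks_like_tool_call(text: str) -> bool:
    """Return True if `text` looks like a bracketed list of tool calls."""
    if len(text) > MAX_TOOL_CALL_LEN:
        return False  # refuse pathological inputs instead of risking a regex blow-up
    return SAFE_TOOL_CALL_REGEX.fullmatch(text.strip()) is not None

if __name__ == "__main__":
    print(looks_like_tool_call("[get_weather(city=Paris), get_time(tz=UTC)]"))  # True
    print(looks_like_tool_call("not a tool call"))                              # False
```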
Limit decode bucket size to num_hpu_blocks
Signed-off-by: youkaichao <[email protected]>
…ject#10621) Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: chaunceyjiang <[email protected]>
…es (vllm-project#9850) Signed-off-by: Wallas Santos <[email protected]> Co-authored-by: Michael Goin <[email protected]>
Signed-off-by: simon-mo <[email protected]>
Signed-off-by: xffxff <[email protected]> Co-authored-by: Isotr0py <[email protected]>
…oject#10639) Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Signed-off-by: Sanket Kale <[email protected]> Co-authored-by: Sanket Kale <[email protected]> Co-authored-by: mgoin <[email protected]>
Signed-off-by: rickyx <[email protected]>
Signed-off-by: youkaichao <[email protected]> Co-authored-by: youkaichao <[email protected]>
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: youkaichao <[email protected]>
Fixes issue with multi LoRA during `profile_run`.
We are seeing a 10% performance regression in llama-based models due to vllm-project#10239. The mark_step() function needs to be configured differently for each model to achieve the best performance: for some models, calling mark_step() for every decoder step is optimal, while for others it is better to run it every n-th step. We add a counter so the hook is only registered for every n-th step, which can be configured with VLLM_CONFIG_HIDDEN_LAYERS.
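A minimal sketch of the counter idea described above, assuming a generic `mark_step()` callable (on Gaudi this would typically be `habana_frameworks.torch.core.mark_step()`; a no-op stand-in is used here so the sketch runs anywhere). The `MarkStepCounter` class and the loop are illustrative, not the actual vLLM hook-registration code.
```
import os

# Stand-in for habana_frameworks.torch.core.mark_step(); on HPU the real call
# flushes the accumulated graph to the device.
def mark_step() -> None:
    pass

# How many decoder layers to group between mark_step() calls; the PR exposes
# this through the VLLM_CONFIG_HIDDEN_LAYERS environment variable.
LAYERS_PER_MARK_STEP = int(os.environ.get("VLLM_CONFIG_HIDDEN_LAYERS", "1"))

class MarkStepCounter:
    """Call mark_step() only after every n-th decoder layer."""

    def __init__(self, every_n: int) -> None:
        self.every_n = max(1, every_n)
        self.count = 0

    def layer_done(self) -> None:
        self.count += 1
        if self.count % self.every_n == 0:
            mark_step()

# Sketch of use inside a decoder loop (the layer objects are hypothetical).
counter = MarkStepCounter(LAYERS_PER_MARK_STEP)
for layer_idx in range(32):      # e.g. 32 decoder layers
    # hidden_states = decoder_layers[layer_idx](hidden_states)
    counter.layer_done()         # fires every LAYERS_PER_MARK_STEP layers
```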
i think inception was a decent movie overall
Signed-off-by: Konrad Zawora <[email protected]>
With this patch, the mp executor no longer hangs at the end of the application and exits gracefully out of the box.
New useful checks were added, but we're not running them on habana_main for each PR. This PR fixes that.
@@ -0,0 +1,32 @@
name: Lint documentation
Check failure
Code scanning / Scorecard
Token-Permissions High
Remediation tip: Visit https://app.stepsecurity.io/secureworkflow.
Tick the 'Restrict permissions for GITHUB_TOKEN'
Untick other options
NOTE: If you want to resolve multiple issues at once, you can visit https://app.stepsecurity.io/securerepo instead.
Click Remediation section below for further remediation help
Without this change we can observe the error below:
```
[rank0]: File "/software/users/kdamaszke/repos/vllm-fork/vllm/model_executor/models/mllama.py", line 959, in forward
[rank0]:     full_text_row_masked_out_mask = full_text_row_masked_out_mask.view(
[rank0]: RuntimeError: shape '[4, -1, 1]' is invalid for input of size 3
```
It occurs when one of the requests is removed from the batch earlier. In that case, the language model is still working on shapes padded to the bucketed batch size, while the encoder input isn't. This change aligns the batch size of `encoder_seq_lens` to the expected one.
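A minimal sketch of the alignment idea, assuming `encoder_seq_lens` is a per-request list that must be padded with zeros up to the bucketed batch size; the function name and padding value are illustrative, not the actual hpu_model_runner code.
```
from typing import List

def pad_encoder_seq_lens(encoder_seq_lens: List[int],
                         bucketed_batch_size: int) -> List[int]:
    """Pad encoder_seq_lens so its length matches the padded (bucketed) batch.

    The language model runs on shapes padded to the bucketed batch size, so the
    encoder-side metadata must be padded the same way when a request leaves the
    batch early; otherwise reshaping the mask fails as in the reported error.
    """
    missing = bucketed_batch_size - len(encoder_seq_lens)
    return encoder_seq_lens + [0] * max(0, missing)

# Example: 3 live requests, but the batch is bucketed to size 4.
print(pad_encoder_seq_lens([17, 5, 9], 4))  # [17, 5, 9, 0]
```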
Now that the one_hot operator has an implementation for both eager and compile mode, the workaround is no longer needed.
Fix for batch size padding in multi-step scheduling by @SanjuCSudhakaran. Co-authored-by: Sanju C Sudhakaran <[email protected]>
During warmup, inference mode is used, but at runtime it is overwritten by the no_grad mode from the base class, which causes recompilations due to a dispatch key mismatch in torch.compile. This switches the base class from no_grad to inference_mode. --------- Co-authored-by: Rafal Litka <[email protected]>
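A minimal sketch of the mode switch, assuming model execution is wrapped by a decorator on the runner class; the class and method names here are illustrative, not the actual vLLM runner API.
```
import torch

class BaseRunner:
    # Before: execution wrapped in no_grad.
    @torch.no_grad()
    def execute_model_no_grad(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2

    # After: execution wrapped in inference_mode, matching what warmup uses, so
    # torch.compile sees the same dispatch keys during warmup and at runtime
    # and does not recompile.
    @torch.inference_mode()
    def execute_model_inference_mode(self, x: torch.Tensor) -> torch.Tensor:
        return x * 2

runner = BaseRunner()
x = torch.ones(2)
print(runner.execute_model_no_grad(x))
print(runner.execute_model_inference_mode(x))
```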
Generic name discovery for rope.prepare_cos_sin. It fixes errors in models that don't follow a specific naming hierarchy.
Add new member to list of codeowners.
#566 breaks the long-context + LoRA flow. It assumes that caching the sin-cos buffer for the first decoder layer is sufficient to handle all cases, which is not applicable for long-context + LoRA. This PR ignores the `_prepare_cos_sin` call prior to the HpuModelAdapter forward in the long-context + LoRA flow.
This PR solves the "ModuleNotFoundError: No Module named torch.hpu" in test_lora_manager_hpu.py::test_from_lora_tensors by importing "habana_frameworks.hpu" into the LoRA model. Co-authored-by: Vivek Goel <[email protected]>
This PR updates `test_layers_hpu.py` and `test_lora_hpu.py` to align with `PunicaWrapper` refactor. Related PR: #614
Error reported in https://jira.habana-labs.com/browse/SW-212516. Two recently merged PRs break Spec Decode functionality:
1. #491 overrides the existing WorkerWrapperBase design for speculative decoding.
```
if model_runner_cls is not None:
    ModelRunnerClass = model_runner_cls
```
is not needed, since we now use the code below to init model_runner_cls and follow the upstream design.
```
if model_runner_cls is not None:
    self.model_runner = model_runner_cls(self.model_runner)
```
2. #566 does not work in Spec Decode Eagle mode, because the input tensors now differ from the previous assumption that decode_fwd only provides one token per seq; Spec Decode provides multiple candidate tokens as q. To fix that, a new ENV - "**VLLM_COS_SIN_RECOMPUTE**=true" - was added; use it to trigger recomputation of cos and sin for spec decode. --------- Signed-off-by: Chendi.Xue <[email protected]>
This PR fixes slow sampling on HPU when repetition_penalty is set in the sampling parameters. It replaces the slow PyTorch API on HPU and mitigates the dynamic shapes in the code.

Without this PR:
SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.06, temperature=1.0, top_p=1.0, top_k=-1, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=1024, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None)
Warming up...
Profiling iterations: 100%|5/5 [03:32<00:00, 42.49s/it]
Avg latency: 42.49439047839987 seconds
10% percentile latency: 11.322476224999628 seconds
25% percentile latency: 11.32563829100036 seconds
50% percentile latency: 11.331052645000455 seconds
75% percentile latency: 11.333669468998778 seconds
90% percentile latency: 104.8302020711999 seconds
99% percentile latency: 160.92812163252054 seconds

With this PR:
Avg latency: 11.038154767800005 seconds
10% percentile latency: 10.964674918200398 seconds
25% percentile latency: 10.964709408001 seconds
50% percentile latency: 10.966433088000485 seconds
75% percentile latency: 10.967024742998547 seconds
90% percentile latency: 11.18358270219942 seconds
99% percentile latency: 11.313517477719943 seconds

Testing code: https://github.com/ccrhx4/huanxing.vllm-fork/blob/slow_repetition_penalty/benchmarks/reproduce.sh

The only difference between this PR and #442 is that I do not enable pin_memory, as this feature's readiness is poor on HPU.
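The exact HPU-friendly rewrite is not shown in this description, so the sketch below is only the generic, vectorized formulation of repetition penalty (divide positive logits and multiply negative ones for already-seen tokens) applied with fixed-shape tensors; it is not the code from this PR, and the function and argument names are illustrative.
```
import torch

def apply_repetition_penalty(logits: torch.Tensor,
                             seen_token_mask: torch.Tensor,
                             penalty: float) -> torch.Tensor:
    """Vectorized repetition penalty (generic formulation, not the PR's code).

    logits:          [batch, vocab] raw logits.
    seen_token_mask: [batch, vocab] bool mask of tokens already present in each
                     sequence's prompt/output (fixed shape, so no dynamic
                     shapes are created per step).
    """
    penalized = torch.where(logits > 0, logits / penalty, logits * penalty)
    return torch.where(seen_token_mask, penalized, logits)

logits = torch.tensor([[2.0, -1.0, 0.5]])
mask = torch.tensor([[True, True, False]])
print(apply_repetition_penalty(logits, mask, penalty=1.06))
```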
Original: #599. We have a case where topk=1 and topp<1. Adding special handling for the case topk=1 and handle_duplicate=0 (by default handle_duplicate=0, to support num-scheduling-steps).
This PR fixes a bug that results in the following RuntimeError when APC is enabled.
```
ERROR 12-19 02:30:05 engine.py:140]   File "/workspace/vllm/worker/hpu_model_runner.py", line 854, in _prepare_prompt
ERROR 12-19 02:30:05 engine.py:140]     if prefix_block_list_tensor:
ERROR 12-19 02:30:05 engine.py:140] RuntimeError: Boolean value of Tensor with more than one value is ambiguous
```
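The error comes from evaluating a multi-element tensor in a boolean context. Below is a minimal sketch of the failure and an explicit emptiness check; the variable name is taken from the traceback, and the guard shown is an illustration rather than necessarily the exact patch.
```
import torch

prefix_block_list_tensor = torch.tensor([3, 7, 11])

# This is what the traceback shows: truthiness of a tensor with more than one
# element is ambiguous and raises a RuntimeError.
try:
    if prefix_block_list_tensor:
        pass
except RuntimeError as e:
    print(e)  # "Boolean value of Tensor with more than one value is ambiguous"

# An explicit emptiness check avoids the ambiguity.
if prefix_block_list_tensor is not None and prefix_block_list_tensor.numel() > 0:
    print("have prefix blocks:", prefix_block_list_tensor.tolist())
```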
Add llava support with a prompt for benchmark throughput using images.
This is an updated version of #650. Coupled with [Use FusedSDPA for MllamaVisionSdpaAttention #620], it resolves two issues arising when running the llama3.2 vision model: GC failure when batch size > 1 on Gaudi3, and increased device memory consumption with Torch 2.5 compared to Torch 2.4. --------- Signed-off-by: yan ma <[email protected]> Co-authored-by: yisonzhu <[email protected]>
Use `FusedSDPA` instead of regular `F.scaled_dot_product_attention` in the `MllamaVisionSdpaAttention` module. The difference between these two ops is precision: `F.scaled_dot_product_attention` converts the input to float32 and performs all operations on that data type, while `FusedSDPA` does not. However, it changes accuracy only from 0.449 to 0.446 on an accuracy test based on the MMMU dataset and lm-evaluation-harness, while improving single-prompt performance from ~550ms to ~100ms.
Fix warmup for the encoder-decoder model runner by limiting the number of dummy cross-attention blocks to the number of available blocks. Without this, we encounter an error in CrossAttention due to the lack of available blocks; the sketch below illustrates the clamp.
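A minimal sketch of the clamping idea, with hypothetical names; the actual runner code is not shown in this description.
```
def num_dummy_cross_attn_blocks(requested_blocks: int,
                                available_blocks: int) -> int:
    """Never ask warmup for more cross-attention blocks than actually exist."""
    return min(requested_blocks, available_blocks)

# Example: warmup wants 128 dummy blocks but only 96 are available.
print(num_dummy_cross_attn_blocks(requested_blocks=128, available_blocks=96))  # 96
```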
Remove the workaround using torch.ops.hpu.fp8_gemm_v2 for HPU.
This PR adds changes required to enable MSS with LoRA flow. Checked there are no regressions using vllm-fork CI job https://tf-jenkins-ctrl01.habana-labs.com/job/vLLM/view/CI-jobs/job/vLLM-CI-Pipeline/429/
- Added actionlint.yaml to allow usage of self-hosted runners (without it, actionlint throws an error).
- I also tried to disable some of the shellcheck warnings/errors but couldn't, so this PR should probably be merged even though actionlint is failing.
- Updated the Trigger Jenkins workflow - it now contains 4 jobs:
  1. Dependency Scan - fails the job if a dependency with a high-severity vulnerability is part of the PR.
  2. CodeQL Scan - scans the Python code itself.
  3. Calculate Tests To Trigger - reads the .jenkins/test_config.yaml file and, based on it, triggers all the tests configured in it.
  4. Tests - the tests running on Gaudi resources.
Scope of changes: mark_step(s)