Optimization for htmlspecialchars function #18126

ArtUkrainskiy · 2025-03-21T13:26:03Z

Optimization for `htmlspecialchars` function.

A dedicated php_htmlspecialchars function instead of the “universal” php_escape_html_entities_ex.
We work with ASCII-compatible encodings, we can employ byte-by-byte scanning and a lookup table to identify special characters. For c < 0x80, the lookup table is used; for potentially multi-byte characters, we continue to rely on get_next_char.
This approach provides a noticeable performance improvement for ASCII strings and some improvement for multi-byte strings due to more optimized logic.

Important!
We have changed the entity validation logic so that the maximum length of an entity name can no longer exceed LONGEST_ENTITY_LENGTH.

Benchmarks 4k char strings:
htmlspecialchars with UTF-8 encoding

 ---------------------------------------------------------------------------------------------------------
 |                      Test |  htmlspecialchars avg(ns) | htmlspecialchars_new avg(ns) |        diff(%) |
 ---------------------------------------------------------------------------------------------------------
 | 300 specialchars          |                     40657 |                        11559 |        251.73% |
 ---------------------------------------------------------------------------------------------------------
 | 300 spchrs. with entities |                     40986 |                        11550 |        254.86% |
 ---------------------------------------------------------------------------------------------------------
 | Clean ASCII string        |                     38348 |                         7401 |        418.15% |
 ---------------------------------------------------------------------------------------------------------
 | Cyrillic                  |                     50003 |                        38302 |         30.55% |
 ---------------------------------------------------------------------------------------------------------
 | Chinese                   |                     55860 |                        46308 |         20.63% |
 ---------------------------------------------------------------------------------------------------------
 | Japanese                  |                     56832 |                        46991 |         20.94% |
 ---------------------------------------------------------------------------------------------------------

htmlspecialchars with lang-specific encoding. Without encoding hint

------------------------------------------------------------------------------------------------------
|                      Test  |  htmlspecialchars avg(ns) | htmlspecialchars_new avg(ns) |    diff(%) |
-----------------------------------------------------------------------------------------------------
| Cyrillic CP1251            |                     41518 |                        35148 |     18.12% |
-----------------------------------------------------------------------------------------------------
| Chinese  Big5              |                     57282 |                        47005 |     21.86% |
-----------------------------------------------------------------------------------------------------
| Japanese SJIS              |                     54703 |                        49887 |      9.65% |
-----------------------------------------------------------------------------------------------------

htmlspecialchars with lang-specific encoding. With encoding hint

------------------------------------------------------------------------------------------------------
|                      Test  |  htmlspecialchars avg(ns) | htmlspecialchars_new avg(ns) |    diff(%) |
-----------------------------------------------------------------------------------------------------
| Cyrillic CP1251            |                     37185 |                        29197 |     27.36% |
-----------------------------------------------------------------------------------------------------
| Chinese  Big5              |                     48877 |                        38191 |     27.98% |
-----------------------------------------------------------------------------------------------------
| Japanese SJIS              |                     46282 |                        38317 |     20.79% |
-----------------------------------------------------------------------------------------------------

We may need more benchmarks.
This is not the final optimization, as get_next_char remains suboptimal for this function due to extra character-detection steps that aren’t required under the default flags.

A dedicated php_htmlspecialchars function instead of the “universal” php_escape_html_entities_ex. We work with ASCII-compatible encodings, we can employ byte-by-byte scanning and a lookup table to identify special characters. For c < 0x80, the lookup table is used; for potentially multi-byte characters, we continue to rely on get_next_char. This approach provides a noticeable performance improvement for ASCII strings and some improvement for multi-byte strings due to more optimized logic.

The new htmlspecialchars function respects the maximum entity size, defined as LONGEST_ENTITY_LENGTH. There is no strict limit on the length of a numeric entity in the HTML and XML specifications, but in practice the maximum possible is 􏿿, which takes up 10 characters. Any numeric entities larger than this size are effectively invalid and will not be processed by browsers.

Girgias

Haven't really gone through the logic yet, but there are already things that don't really make sense.

Girgias · 2025-03-26T15:41:32Z

ext/standard/html.c

+/* {{{ php_html_entities */
+static void php_htmlspecialchars(INTERNAL_FUNCTION_PARAMETERS)
+{
+	zend_string *str, *hint_charset = NULL;
+	zend_long flags = ENT_QUOTES|ENT_SUBSTITUTE;
+	zend_string *replaced;
+	bool double_encode = 1;
+
+	ZEND_PARSE_PARAMETERS_START(1, 4)
+		Z_PARAM_STR(str)
+		Z_PARAM_OPTIONAL
+		Z_PARAM_LONG(flags)
+		Z_PARAM_STR_OR_NULL(hint_charset)
+		Z_PARAM_BOOL(double_encode);
+	ZEND_PARSE_PARAMETERS_END();
+
+	replaced = php_htmlspecialchars_ex(
+		str, (int) flags,
+		hint_charset ? ZSTR_VAL(hint_charset) : NULL, double_encode, /* quiet */ 0);
+	RETVAL_STR(replaced);
+}
+/* }}} */
+
 /* {{{ Convert special characters to HTML entities */
 PHP_FUNCTION(htmlspecialchars)
 {
-	php_html_entities(INTERNAL_FUNCTION_PARAM_PASSTHRU, 0);
+	php_htmlspecialchars(INTERNAL_FUNCTION_PARAM_PASSTHRU);
 }
 /* }}} */


Why are you using a pass through function when there is only ever one use case now? Especially as you are adding a C function call overhead, which is weird for an optimization PR.

You should also "inline" the other usage of the php_html_entities pass through function.

Girgias · 2025-03-26T15:41:46Z

ext/standard/tests/strings/bug60965.phpt

-echo htmlspecialchars('"""""""""""""""""""""""""""""""""""""""""""""&#x000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000005;',
+echo htmlspecialchars('"""""""""""""""""""""""""""""""""""""""""""""&#x123456789123456789123456789;',


I added support for the LONGEST_ENTITY_LENGTH constant to define the maximum length of an entity. While it originally applies to named entities, I think it also makes sense to use it to limit the length of numeric entities.

There’s no strict limit on numeric entity length in the HTML or XML specs, but in practice the longest valid one is 􏿿, which is 10 characters long — so there’s no point in scanning the string beyond that when looking for a semicolon.

Any numeric entities longer than that are effectively invalid and won’t be processed by browsers anyway.

Girgias · 2025-03-26T15:42:27Z

ext/standard/html.c

+/* Lookup table for php_htmlspecialchars */
+typedef struct  {
+	char* entity[256];
+	ushort entity_len[256];
+} htmlspecialchars_lut;


Move this closer to the declaration of the function.

Girgias · 2025-03-26T15:42:45Z

ext/standard/html.c

@@ -752,6 +758,60 @@ static zend_result resolve_named_entity_html(const char *start, size_t length, c
 }
 /* }}} */

+/* {{{ is_codepoint_allowed */
+static inline zend_bool is_codepoint_allowed(


Suggested change

static inline zend_bool is_codepoint_allowed(

static bool is_codepoint_allowed(

Girgias · 2025-03-26T15:42:52Z

ext/standard/html.c

+		return unicode_cp_is_allowed(cp, doctype);
+	}
+
+	return 1;


Suggested change

return 1;

return true;

Girgias · 2025-03-26T15:51:58Z

ext/standard/html.c

+    size_t replacement_len = 0;
+    if (flags & (ENT_HTML_SUBSTITUTE_ERRORS | ENT_HTML_SUBSTITUTE_DISALLOWED_CHARS)) {
+        if (charset == cs_utf_8) {
+            replacement = (const unsigned char*)"\xEF\xBF\xBD";


Why the cast?

Girgias · 2025-03-26T15:52:09Z

ext/standard/html.c

+    if (flags & (ENT_HTML_SUBSTITUTE_ERRORS | ENT_HTML_SUBSTITUTE_DISALLOWED_CHARS)) {
+        if (charset == cs_utf_8) {
+            replacement = (const unsigned char*)"\xEF\xBF\xBD";
+            replacement_len = sizeof("\xEF\xBF\xBD") - 1;


Suggested change

replacement_len = sizeof("\xEF\xBF\xBD") - 1;

replacement_len = strlen("\xEF\xBF\xBD");

Girgias · 2025-03-26T15:52:21Z

ext/standard/html.c

+            replacement = (const unsigned char*)"&#xFFFD;";
+            replacement_len = sizeof("&#xFFFD;") - 1;


Ditto prior remarks

Girgias · 2025-03-26T15:55:50Z

ext/standard/html.c

+/* Lookup table for php_htmlspecialchars */
+typedef struct  {
+	char* entity[256];
+	ushort entity_len[256];


ushort is not standard.

Suggested change

ushort entity_len[256];

uint8_t entity_len[256];

Girgias · 2025-03-26T16:01:08Z

ext/standard/html.c

+/* {{{ php_htmlspecialchars */
+PHPAPI zend_string* php_htmlspecialchars_ex(


Why is this a PHPAPI?

Thanks for the review! I’ll go through all the comments and make the necessary changes. I don’t have much experience contributing to public PHP projects yet, so I might not be fully aware of all the conventions and best practices.

I was working off the existing code and reused some parts as-is. I plan to refactor php_html_entities later. The main goal here was to switch from a hashtable to a LUT for special character replacement, and to restructure the logic accordingly.

ArtUkrainskiy added 2 commits March 21, 2025 16:17

github-actions bot added the Extension: standard label Mar 21, 2025

ArtUkrainskiy marked this pull request as ready for review March 21, 2025 14:45

ArtUkrainskiy requested a review from bukka as a code owner March 21, 2025 14:45

Girgias requested changes Mar 26, 2025

View reviewed changes

ArtUkrainskiy marked this pull request as draft March 26, 2025 19:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimization for htmlspecialchars function #18126

Optimization for htmlspecialchars function #18126

ArtUkrainskiy commented Mar 21, 2025 •

edited

Loading

Girgias left a comment

Girgias Mar 26, 2025

Girgias Mar 26, 2025

ArtUkrainskiy Mar 26, 2025 •

edited

Loading

Girgias Mar 26, 2025

Girgias Mar 26, 2025

Girgias Mar 26, 2025

Girgias Mar 26, 2025

Girgias Mar 26, 2025

Girgias Mar 26, 2025

Girgias Mar 26, 2025

Girgias Mar 26, 2025

ArtUkrainskiy Mar 26, 2025

		echo htmlspecialchars('"""""""""""""""""""""""""""""""""""""""""""""&#x000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000005;',
		echo htmlspecialchars('"""""""""""""""""""""""""""""""""""""""""""""&#x123456789123456789123456789;',

	static inline zend_bool is_codepoint_allowed(
	static bool is_codepoint_allowed(

	replacement_len = sizeof("\xEF\xBF\xBD") - 1;
	replacement_len = strlen("\xEF\xBF\xBD");

		replacement = (const unsigned char*)"�";
		replacement_len = sizeof("�") - 1;

		/* {{{ php_htmlspecialchars */
		PHPAPI zend_string* php_htmlspecialchars_ex(

Optimization for htmlspecialchars function #18126

Are you sure you want to change the base?

Optimization for htmlspecialchars function #18126

Conversation

ArtUkrainskiy commented Mar 21, 2025 • edited Loading