Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Topic partition backlog grows after bookies restart when using ExtensibleLoadManagerImpl #23908

Open
2 of 3 tasks
szkoludasebastian opened this issue Jan 29, 2025 · 30 comments
Labels
type/bug The PR fixed a bug or issue reported a bug

Comments

@szkoludasebastian
Copy link

szkoludasebastian commented Jan 29, 2025

Search before asking

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

Version

Pulsar client >= 3.3.2
Pulsar server >= 3.3.2

We made our tests on client/server version: 3.3.1, 3.3.2, 4.0.1, 4.0.2. We have not noticed such a problem on version 3.3.1.

Minimal reproduce step

  1. Create partitioned topic A with 100 partitions
  2. Connect 4 consumers to topic A (Subscription type does not matter, we noticed this problem for shared and key_shared subscriptions)
  3. Start sending data to topic A with a throughput of about 5,000 messages per second
  4. Restart bookies (in our case we had like 6 bookies and 6 brokers) - we restarted it by killing k8s pods
  5. After all bookies restarted and are up and running wait for 30 seconds and stop sending data
  6. Check pulsar admin topic stats that backlog is only growing. We were observing backlog for every partition in grafana so we knew for which partition check stats-internal. In our stats-internal we usually see such situation:
{
  "entriesAddedCounter" : 1843,
  "numberOfEntries" : 1867,
  "totalSize" : 14761813,
  "currentLedgerEntries" : 964,
  "currentLedgerSize" : 10095321,
  "lastLedgerCreatedTimestamp" : "2025-01-29T09:52:26.278Z",
  "lastLedgerCreationFailureTimestamp" : "2025-01-29T09:52:26.059Z",
  "waitingCursorsCount" : 1,
  "pendingAddEntriesCount" : 0,
  "lastConfirmedEntry" : "1755285:963",
  "state" : "LedgerOpened",
  "ledgers" : [ {
    "ledgerId" : 1742277,
    "entries" : 23,
    "size" : 26553,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 1742802,
    "entries" : 1,
    "size" : 4272,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 1743592,
    "entries" : 35,
    "size" : 21963,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 1744071,
    "entries" : 31,
    "size" : 19453,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 1744244,
    "entries" : 43,
    "size" : 26981,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 1744474,
    "entries" : 44,
    "size" : 27611,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 1744955,
    "entries" : 25,
    "size" : 15688,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 1745168,
    "entries" : 48,
    "size" : 30127,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 1745373,
    "entries" : 45,
    "size" : 28250,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 1745594,
    "entries" : 38,
    "size" : 23882,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 1746041,
    "entries" : 36,
    "size" : 22628,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 1746268,
    "entries" : 34,
    "size" : 21362,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 1746487,
    "entries" : 34,
    "size" : 21370,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 1746710,
    "entries" : 443,
    "size" : 4184621,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 1749679,
    "entries" : 23,
    "size" : 191731,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 1755285,
    "entries" : 0,
    "size" : 0,
    "offloaded" : false,
    "underReplicated" : false
  } ],
  "cursors" : {
    "realTimeDocStreamSubscription" : {
      "markDeletePosition" : "1739559:1480",
      "readPosition" : "1755285:964",
      "waitingReadOp" : true,
      "pendingReadOps" : 0,
      "messagesConsumedCounter" : 3097,
      "cursorLedger" : 1755643,
      "cursorLedgerLastEntry" : 72,
      "individuallyDeletedMessages" : "[(1739559:1481..1739559:1482],(1739559:1483..1739559:1488],(1739559:1489..1739559:1490],(1739559:1494..1739559:1498],(1739559:1501..1739559:1502],(1739559:1503..1739559:1504],(1739559:1507..1739559:1510],(1739559:1511..1739559:1513],(1739559:1518..1739559:1519],(1739559:1521..1739559:1528],(1739559:1530..1739559:1531],(1739559:1532..1739559:1533],(1739559:1534..1739559:1535],(1739559:1536..1739559:1541],(1739559:1542..1739559:1545],(1739559:1546..1739559:1549],(1739559:1550..1739559:1561],(1739559:1562..1739559:1563],(1739559:1564..1739559:1571],(1739559:1572..1739559:1574],(1739559:1575..1739559:1586],(1739559:1587..1739559:1588],(1739559:1589..1739559:1591],(1739559:1592..1739559:1594],(1739559:1598..1739559:1599],(1739559:1600..1739559:1604],(1739559:1605..1739559:1607],(1739559:1609..1739559:1611],(1739559:1612..1739559:1614],(1739559:1615..1739559:1621],(1739559:1622..1739559:1627],(1739559:1628..1739559:1630],(1739559:1631..1739559:1643],(1739559:1644..1739559:1647],(1739559:1648..1739559:1655],(1739559:1657..1739559:1658],(1739559:1659..1739559:1663],(1739559:1664..1739559:1665],(1739559:1668..1739559:1672],(1739559:1673..1739559:1674],(1739559:1675..1739559:1678],(1739559:1679..1739559:1681],(1739559:1682..1739559:1683],(1739559:1684..1739559:1687],(1739559:1688..1739559:1697],(1739559:1698..1739559:1704],(1739559:1705..1739559:1706],(1739559:1707..1739559:1711],(1739559:1712..1739559:1713],(1739559:1715..1739559:1717],(1739559:1718..1739559:1719],(1739559:1720..1739559:1725],(1739559:1726..1739559:1729],(1739559:1730..1739559:1738],(1739559:1739..1739559:1751],(1739559:1755..1739559:1765],(1739559:1766..1739559:1768],(1739559:1771..1739559:1773],(1739559:1776..1739559:1779],(1739559:1781..1739559:1788],(1739559:1789..1739559:1794],(1739559:1795..1739559:1796],(1739559:1797..1739559:1803],(1739559:1806..1739559:1808],(1739559:1809..1739559:1810],(1739559:1812..1739559:1817],(1739559:1818..1739559:1828],(1739559:1830..1739559:1832],(1739559:1833..1739559:1838],(1739559:1841..1739559:1843],(1739559:1844..1739559:1849],(1739559:1851..1739559:1857],(1739559:1858..1739559:1859],(1739559:1860..1739559:1864],(1739559:1865..1739559:1870],(1739559:1871..1739559:1873],(1739559:1875..1739559:1876],(1739559:1877..1739559:1879],(1739559:1881..1739559:1887],(1739559:1890..1739559:1893],(1739559:1894..1739559:1896],(1739559:1897..1739559:1899],(1739559:1901..1739559:1908],(1739559:1911..1739559:1913],(1739559:1914..1739559:1918],(1739559:1919..1739559:1922],(1739559:1923..1739559:1925],(1739559:1927..1739559:1939],(1739559:1940..1739559:1945],(1739559:1946..1739559:1956],(1739559:1958..1739559:1959],(1739559:1960..1739559:1968],(1739559:1969..1739559:1974],(1739559:1975..1739559:1977],(1739559:1978..1739559:1983],(1739559:1985..1739559:1992],(1739559:1993..1739559:1996],(1739559:1998..1739559:1999],(1739559:2001..1739559:2008],(1739559:2009..1739559:2010],(1739559:2011..1739559:2016],(1739559:2017..1739559:2021],(1739559:2022..1739559:2023],(1739559:2024..1739559:2025],(1739559:2026..1739559:2030],(1739559:2031..1739559:2032],(1739559:2033..1739559:2037],(1739559:2038..1739559:2042],(1739559:2043..1739559:2044],(1739559:2046..1739559:2048],(1739559:2050..1739559:2056],(1739559:2058..1739559:2060],(1739559:2062..1739559:2063],(1739559:2064..1739559:2066],(1739559:2068..1739559:2070],(1739559:2071..1739559:2072],(1739559:2073..1739559:2078],(1739559:2079..1739559:2082],(1739559:2084..1739559:2085],(1739559:2086..1739559:2088],(1739559:2089..1739559:2098],(1739559:2102..1739559:2104],(1739559:2105..1739559:2111],(1739559:2113..1739559:2115],(1739559:2116..1739559:2124],(1739559:2125..1739559:2129],(1739559:2130..1739559:2135],(1739559:2137..1739559:2142],(1739559:2143..1739559:2144],(1739559:2145..1739559:2147],(1739559:2148..1739559:2152],(1739559:2155..1739559:2157],(1739559:2158..1739559:2161],(1739559:2163..1739559:2166],(1739559:2167..1739559:2168],(1739559:2170..1739559:2172],(1739559:2173..1739559:2179],(1739559:2181..1739559:2185],(1739559:2187..1739559:2190],(1739559:2192..1739559:2199],(1739559:2201..1739559:2204],(1739559:2207..1739559:2212],(1739559:2213..1739559:2214],(1739559:2215..1739559:2226],(1739559:2230..1739559:2231],(1739559:2232..1739559:2233],(1739559:2234..1739559:2235],(1739559:2236..1739559:2239],(1739559:2240..1739559:2241],(1739559:2242..1739559:2243],(1739559:2244..1739559:2250],(1739559:2255..1739559:2256],(1739559:2258..1739559:2263],(1739559:2264..1739559:2265],(1739559:2266..1739559:2269],(1739559:2270..1739559:2271],(1739559:2272..1739559:2277],(1739559:2278..1739559:2282],(1739559:2283..1739559:2288],(1739559:2289..1739559:2293],(1739559:2294..1739559:2304],(1739559:2305..1739559:2306],(1739559:2308..1739559:2312],(1739559:2313..1739559:2314],(1739559:2315..1739559:2323],(1739559:2324..1739559:2327],(1739559:2328..1739559:2331],(1739559:2332..1739559:2335],(1739559:2336..1739559:2337],(1739559:2338..1739559:2341],(1739559:2346..1739559:2351],(1739559:2352..1739559:2354],(1739559:2355..1739559:2356],(1739559:2357..1739559:2361],(1739559:2363..1739559:2367],(1739559:2368..1739559:2371],(1739559:2372..1739559:2376],(1739559:2377..1739559:2383],(1739559:2384..1739559:2389],(1739559:2393..1739559:2396],(1739559:2397..1739559:2401],(1739559:2403..1739559:2410],(1739559:2411..1739559:2412],(1739559:2414..1739559:2418],(1739559:2419..1739559:2421],(1739559:2423..1739559:2424],(1739559:2425..1739559:2428],(1739559:2429..1739559:2431],(1739559:2433..1739559:2434],(1739559:2435..1739559:2440],(1739559:2441..1739559:2442],(1739559:2443..1739559:2445],(1739559:2446..1739559:2449],(1739559:2450..1739559:2451],(1739559:2452..1739559:2456],(1739559:2457..1739559:2462],(1739559:2464..1739559:2470],(1739559:2471..1739559:2475],(1739559:2476..1739559:2482],(1739559:2483..1739559:2489],(1739559:2492..1739559:2494],(1739559:2496..1739559:2499],(1739559:2502..1739559:2504],(1739559:2505..1739559:2506],(1739559:2508..1739559:2513],(1739559:2514..1739559:2519],(1739559:2520..1739559:2522],(1739559:2524..1739559:2525],(1739559:2526..1739559:2528],(1739559:2529..1739559:2530],(1739559:2532..1739559:2533],(1739559:2535..1739559:2536],(1739559:2538..1739559:2542],(1739559:2546..1739559:2548],(1739559:2550..1739559:2552],(1739559:2553..1739559:2554],(1739559:2555..1739559:2563],(1739559:2566..1739559:2569],(1739559:2570..1739559:2574],(1739559:2575..1739559:2577],(1739559:2578..1739559:2580],(1739559:2581..1739559:2590],(1739559:2591..1739559:2596],(1739559:2597..1739559:2598],(1739559:2604..1739559:2606],(1739559:2611..1739559:2613],(1739559:2616..1739559:2620],(1739559:2622..1739559:2625],(1739559:2626..1739559:2630],(1739559:2633..1739559:2635],(1739559:2637..1739559:2640],(1739559:2641..1739559:2642],(1739559:2644..1739559:2646],(1739559:2648..1739559:2650],(1739559:2655..1739559:2661],(1739559:2663..1739559:2669],(1739559:2670..1739559:2672],(1739559:2674..1739559:2675],(1739559:2676..1739559:2678],(1739559:2679..1739559:2682],(1739559:2683..1739559:2688],(1739559:2690..1739559:2692],(1739559:2694..1739559:2695],(1739559:2696..1739559:2699],(1739559:2701..1739559:2703],(1739559:2705..1739559:2706],(1739559:2707..1739559:2708],(1739559:2709..1739559:2717],(1739559:2718..1739559:2721],(1739559:2722..1739559:2728],(1739559:2729..1739559:2730],(1739559:2731..1739559:2740],(1739559:2743..1739559:2745],(1739559:2746..1739559:2747],(1739559:2749..1739559:2754],(1739559:2756..1739559:2758],(1739559:2762..1739559:2764],(1739559:2765..1739559:2768],(1739559:2772..1739559:2773],(1739559:2774..1739559:2775],(1739559:2779..1739559:2784],(1739559:2785..1739559:2787],(1739559:2789..1739559:2791],(1739559:2793..1739559:2794],(1739559:2796..1739559:2799],(1739559:2800..1739559:2801],(1739559:2802..1739559:2803],(1739559:2805..1739559:2807],(1739559:2809..1739559:2810],(1739559:2813..1739559:2814],(1739559:2817..1739559:2818],(1739559:2821..1739559:2822],(1739559:2823..1739559:2824],(1739559:2825..1739559:2831],(1739559:2832..1739559:2833],(1739559:2835..1739559:2838],(1739559:2840..1739559:2842],(1739559:2843..1739559:2844],(1739559:2846..1739559:2850],(1739559:2851..1739559:2852],(1739559:2853..1739559:2854],(1739559:2855..1739559:2861],(1739559:2862..1739559:2864],(1739559:2866..1739559:2867],(1739559:2871..1739559:2875],(1739559:2876..1739559:2879],(1739559:2881..1739559:2884],(1739559:2885..1739559:2886],(1739559:2887..1739559:2889],(1739559:2890..1739559:2891],(1739559:2892..1739559:2893],(1739559:2895..1739559:2899],(1739559:2904..1739559:2905],(1739559:2906..1739559:2909],(1739559:2911..1739559:2914],(1739559:2915..1739559:2917],(1739559:2919..1739559:2921],(1739559:2923..1739559:2928],(1739559:2929..1739559:2930],(1739559:2931..1739559:2934],(1739559:2935..1739559:2939],(1739559:2940..1739559:2944],(1739559:2947..1739559:2951],(1739559:2954..1739559:2955],(1739559:2960..1739559:2961],(1739559:2965..1739559:2966],(1739559:2969..1739559:2970],(1739559:2972..1739559:2974],(1739559:2975..1739559:2976],(1739559:2977..1739559:2985],(1739559:2987..1739559:2988],(1739559:2991..1739559:2994],(1739559:2995..1739559:2996],(1739559:2997..1739559:3000],(1739559:3001..1739559:3002],(1739559:3003..1739559:3006],(1739559:3012..1739559:3014],(1739559:3016..1739559:3019],(1739559:3022..1739559:3023],(1739559:3024..1739559:3025],(1739559:3032..1739559:3035],(1739559:3036..1739559:3038],(1739559:3040..1739559:3043],(1739559:3046..1739559:3047],(1739559:3048..1739559:3049],(1739559:3050..1739559:3051],(1739559:3052..1739559:3053],(1739559:3060..1739559:3061],(1739559:3062..1739559:3065],(1739559:3075..1739559:3076],(1739559:3077..1739559:3083],(1739559:3084..1739559:3085],(1739559:3086..1739559:3089],(1739559:3091..1739559:3093],(1739559:3094..1739559:3095],(1739559:3097..1739559:3098],(1739559:3103..1739559:3104],(1739559:3106..1739559:3107],(1739559:3113..1739559:3116],(1739559:3118..1739559:3119],(1739559:3122..1739559:3123],(1739559:3125..1739559:3126],(1739559:3127..1739559:3129],(1739559:3133..1739559:3136],(1739559:3138..1739559:3150],(1739559:3151..1739559:3153],(1739559:3154..1739559:3158],(1739559:3159..1739559:3166],(1739559:3167..1739559:3172],(1739559:3173..1739559:3175],(1739559:3176..1739559:3178],(1739559:3179..1739559:3180],(1739559:3181..1739559:3185],(1739559:3186..1739559:3187],(1739559:3188..1739559:3190],(1739559:3191..1739559:3193],(1739559:3195..1739559:3197],(1739559:3199..1739559:3200],(1739559:3205..1739559:3206],(1739559:3207..1739559:3208],(1739559:3211..1739559:3212],(1739559:3215..1739559:3216],(1739559:3217..1739559:3218],(1739559:3219..1739559:3224],(1739559:3225..1739559:3226],(1739559:3227..1739559:3229],(1739559:3231..1739559:3232],(1739559:3233..1739559:3234],(1739559:3235..1739559:3236],(1739559:3237..1739559:3238],(1739559:3241..1739559:3243],(1739559:3245..1739559:3246],(1739559:3247..1739559:3248],(1739559:3253..1739559:3258],(1739559:3260..1739559:3261],(1739559:3262..1739559:3263],(1739559:3265..1739559:3266],(1739559:3267..1739559:3269],(1739559:3274..1739559:3276],(1739559:3277..1739559:3279],(1739559:3281..1739559:3284],(1739559:3285..1739559:3286],(1739559:3287..1739559:3288],(1739559:3290..1739559:3292],(1739559:3293..1739559:3295],(1739559:3297..1739559:3298],(1739559:3300..1739559:3301],(1739559:3302..1739559:3303],(1739559:3307..1739559:3308],(1739559:3310..1739559:3312],(1739559:3313..1739559:3316],(1739559:3322..1739559:3323],(1739559:3327..1739559:3328],(1739559:3332..1739559:3334],(1739559:3336..1739559:3338],(1739559:3343..1739559:3344],(1739559:3345..1739559:3346],(1739559:3351..1739559:3352],(1739559:3353..1739559:3354],(1739559:3359..1739559:3360],(1739559:3361..1739559:3362],(1739559:3368..1739559:3369],(1739559:3380..1739559:3381],(1739559:3382..1739559:3383],(1739559:3386..1739559:3388],(1739559:3390..1739559:3392],(1739559:3395..1739559:3396],(1739559:3397..1739559:3398],(1739559:3400..1739559:3402],(1739559:3403..1739559:3404],(1739559:3406..1739559:3407],(1739559:3408..1739559:3409],(1739559:3413..1739559:3414],(1739559:3415..1739559:3416],(1739559:3421..1739559:3423],(1739559:3424..1739559:3426],(1739559:3429..1739559:3430],(1739559:3431..1739559:3432],(1739559:3433..1739559:3436],(1739559:3442..1739559:3444],(1739559:3447..1739559:3448],(1739559:3454..1739559:3455],(1739559:3461..1739559:3462],(1739559:3467..1739559:3468],(1739559:3473..1739559:3475],(1739559:3482..1739559:3486],(1739559:3487..1739559:3488],(1739559:3489..1739559:3490],(1739559:3491..1739559:3492],(1739559:3494..1739559:3495],(1739559:3496..1739559:3497],(1739559:3502..1739559:3503],(1739559:3505..1739559:3506],(1739559:3508..1739559:3509],(1739559:3515..1739559:3516],(1739559:3523..1739559:3524],(1739559:3528..1739559:3529],(1739559:3532..1739559:3533],(1739559:3539..1739559:3540],(1739559:3545..1739559:3546],(1739559:3555..1739559:3556],(1739559:3559..1739559:3560],(1739559:3562..1739559:3564],(1739559:3567..1739559:3568],(1739559:3570..1739559:3572],(1739559:3573..1739559:3574],(1739559:3579..1739559:3582],(1739559:3583..1739559:3584],(1739559:3585..1739559:3587],(1739559:3596..1739559:3598],(1739559:3599..1739559:3600],(1739559:3602..1739559:3603],(1739559:3606..1739559:3607],(1739559:3608..1739559:3609],(1739559:3614..1739559:3615],(1742277:-1..1742277:22],(1742802:-1..1742802:0],(1743592:-1..1743592:34],(1744071:-1..1744071:30],(1744244:-1..1744244:42],(1744474:-1..1744474:43],(1744955:-1..1744955:24],(1745168:-1..1745168:47],(1745373:-1..1745373:44],(1745594:-1..1745594:37],(1746041:-1..1746041:35],(1746268:-1..1746268:33],(1746487:-1..1746487:33],(1746710:-1..1746710:442],(1749679:-1..1749679:22],(1755285:-1..1755285:963]]",
      "lastLedgerSwitchTimestamp" : "2025-01-29T06:11:35.773Z",
      "state" : "Open",
      "active" : true,
      "numberOfEntriesSinceFirstNotAckedMessage" : 1868,
      "totalNonContiguousDeletedMessagesRange" : 471,
      "subscriptionHavePendingRead" : true,
      "subscriptionHavePendingReplayRead" : false,
      "properties" : { }
    }
  },
  "schemaLedgers" : [ ],
  "compactedLedger" : {
    "ledgerId" : -1,
    "entries" : -1,
    "size" : -1,
    "offloaded" : false,
    "underReplicated" : false
  }
}

so readPosition is moving but markDeletePosition stuck. Sometimes restarting consumers or brokers makes that markDeletePosition is moving forward and backlog goes down.

We also tried resetting the cursor to markDeletePosition to read the messages again, and that also helped. We did that by this command:

./pulsar-admin topics reset-cursor persistent://topic/prefix/topic-partition-48 -s subscription -m 1739559:1480

What did you expect to see?

No partition stuck and backlog does not grow

What did you see instead?

Backlog grows because markDeletePosition does not move

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@szkoludasebastian szkoludasebastian added the type/bug The PR fixed a bug or issue reported a bug label Jan 29, 2025
@lhotari
Copy link
Member

lhotari commented Jan 31, 2025

Thanks for reporting @szkoludasebastian . Would you be able to share also topic stats? It would be preferred to have this for a supported version (4.0.2) since that's what we are maintaining actively and there's also msgInReplay information since 3.0.7/3.3.2/4.0.0 with #23224 changes. The reason why msgInReplay would be relevant is that I'm interested to see if this is similar to #23845 .

  1. After all bookies restarted and are up and running wait for 30 seconds and stop sending data

It could be similar to Issue 23845 since I understood based on the above description that in your case you stop sending data and the backlog remains after that.
@szkoludasebastian Some questions that would help isolate the issue:

  • Does the backlog get consumed if you produce at least one message to each partition later?
  • Does the problem reproduce when you configure dispatcherDispatchMessagesInSubscriptionThread=false?

@szkoludasebastian
Copy link
Author

  1. New messages are consumed but the backlog is still growing then because markDeletePosition is not moving due to the fact that for some reason there is no continuity in acknowledgments

@szkoludasebastian
Copy link
Author

szkoludasebastian commented Jan 31, 2025

I will check that with dispatcherDispatchMessagesInSubscriptionThread=false. Where to configure that?

@lhotari
Copy link
Member

lhotari commented Jan 31, 2025

I will check that with dispatcherDispatchMessagesInSubscriptionThread=false. Where to configure that?

@szkoludasebastian In broker.conf (reference). In Apache Pulsar Helm chart k8s deployments, you add dispatcherDispatchMessagesInSubscriptionThread: "false" under the broker.configData key in values.yaml.

@lhotari
Copy link
Member

lhotari commented Jan 31, 2025

  1. New messages are consumed but the backlog is still growing then because markDeletePosition is not moving due to the fact that for some reason there is no continuity in acknowledgments

@szkoludasebastian That means that the broker hasn't received the acknowledgements or lost them. It's indeed very useful information if you can consistently reproduce the issue with 3.3.2 and not with 3.3.1 . In that case, it should be possible to find the commit that introduced the regression. Since I don't have the reproducer, I cannot do this on behalf of you. For finding the commit in v3.3.1...v3.3.2 (release notes), one possible solution would be to do interactive git-bisecting (or just manual bisecting) where you first pick the commit in the middle of v3.3.1...v3.3.2 and build a Pulsar binary and test that. If it fails, you know that the regression was introduced before that commit. You keep splitting the commits until you find the commit that introduced the regression. In Pulsar, the docker images are built with the commands documented here: https://pulsar.apache.org/contribute/release-process/#release-pulsar-30-and-later (prerequisite is https://pulsar.apache.org/contribute/release-process/#build-release-artifacts). Would you be able to perform this task so that we'd find the commit that introduced the regression?
I guess you have already checked whether the client version matters so that we could be sure that it's a broker side issue?

@lhotari
Copy link
Member

lhotari commented Jan 31, 2025

Is the issue that you are facing similar to #22709? What's different this time?

@szkoludasebastian
Copy link
Author

Is the issue that you are facing similar to #22709? What's different this time?

it appeared that it was problem on our side.

@lhotari
Copy link
Member

lhotari commented Jan 31, 2025

@szkoludasebastian Just confirming: do you happen to use Pulsar transactions?

@szkoludasebastian
Copy link
Author

Partition-4 stats-internal:

{
  "entriesAddedCounter" : 285,
  "numberOfEntries" : 409,
  "totalSize" : 4481595,
  "currentLedgerEntries" : 285,
  "currentLedgerSize" : 3521965,
  "lastLedgerCreatedTimestamp" : "2025-01-31T09:46:39.629Z",
  "waitingCursorsCount" : 1,
  "pendingAddEntriesCount" : 0,
  "lastConfirmedEntry" : "2044129:284",
  "state" : "LedgerOpened",
  "ledgers" : [ {
    "ledgerId" : 2043391,
    "entries" : 124,
    "size" : 959630,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 2044129,
    "entries" : 0,
    "size" : 0,
    "offloaded" : false,
    "underReplicated" : false
  } ],
  "cursors" : {
    "microbatcher" : {
      "markDeletePosition" : "2043391:112",
      "readPosition" : "2044129:285",
      "waitingReadOp" : true,
      "pendingReadOps" : 0,
      "messagesConsumedCounter" : 273,
      "cursorLedger" : 2044305,
      "cursorLedgerLastEntry" : 44,
      "individuallyDeletedMessages" : "[(2043391:122..2043391:123],(2044129:-1..2044129:282]]",
      "lastLedgerSwitchTimestamp" : "2025-01-31T09:46:39.675Z",
      "state" : "Open",
      "active" : true,
      "numberOfEntriesSinceFirstNotAckedMessage" : 297,
      "totalNonContiguousDeletedMessagesRange" : 2,
      "subscriptionHavePendingRead" : true,
      "subscriptionHavePendingReplayRead" : false,
      "properties" : { }
    }
  },
  "schemaLedgers" : [ ],
  "compactedLedger" : {
    "ledgerId" : -1,
    "entries" : -1,
    "size" : -1,
    "offloaded" : false,
    "underReplicated" : false
  }
}

Topic partition-4 stats:

{
  "msgRateIn" : 0.016666664188055923,
  "msgThroughputIn" : 16.849997494124537,
  "msgRateOut" : 0.016666664151389267,
  "msgThroughputOut" : 16.84999745705455,
  "bytesInCounter" : 3522976,
  "msgInCounter" : 832,
  "systemTopicBytesInCounter" : 0,
  "bytesOutCounter" : 3658127,
  "msgOutCounter" : 863,
  "bytesOutInternalCounter" : 0,
  "averageMsgSize" : 1011.0,
  "msgChunkPublished" : false,
  "storageSize" : 4482606,
  "backlogSize" : 3608212,
  "backlogQuotaLimitSize" : -1,
  "backlogQuotaLimitTime" : -1,
  "oldestBacklogMessageAgeSeconds" : 956,
  "oldestBacklogMessageSubscriptionName" : "microbatcher",
  "publishRateLimitedTimes" : 0,
  "earliestMsgPublishTimeInBacklogs" : 0,
  "offloadedStorageSize" : 0,
  "lastOffloadLedgerId" : 0,
  "lastOffloadSuccessTimeStamp" : 0,
  "lastOffloadFailureTimeStamp" : 0,
  "ongoingTxnCount" : 0,
  "abortedTxnCount" : 0,
  "committedTxnCount" : 0,
  "publishers" : [ {
    "accessMode" : "Shared",
    "msgRateIn" : 0.016666664188055923,
    "msgThroughputIn" : 16.849997494124537,
    "averageMsgSize" : 1011.0,
    "chunkedMessageRate" : 0.0,
    "producerId" : 5,
    "supportsPartialProducer" : false,
    "producerName" : "integ-pulsar-1162-18",
    "address" : "/10.10.178.13:38552",
    "connectedSince" : "2025-01-31T09:46:39.801991145Z",
    "clientVersion" : "Pulsar-Java-v4.0.2",
    "metadata" : { }
  }, {
    "accessMode" : "Shared",
    "msgRateIn" : 0.0,
    "msgThroughputIn" : 0.0,
    "averageMsgSize" : 0.0,
    "chunkedMessageRate" : 0.0,
    "producerId" : 5,
    "supportsPartialProducer" : false,
    "producerName" : "integ-pulsar-1162-21",
    "address" : "/10.10.178.13:38530",
    "connectedSince" : "2025-01-31T09:46:39.802282863Z",
    "clientVersion" : "Pulsar-Java-v4.0.2",
    "metadata" : { }
  } ],
  "waitingPublishers" : 0,
  "subscriptions" : {
    "microbatcher" : {
      "msgRateOut" : 0.016666664151389267,
      "msgThroughputOut" : 16.84999745705455,
      "bytesOutCounter" : 3658127,
      "msgOutCounter" : 863,
      "msgRateRedeliver" : 0.0,
      "messageAckRate" : 0.049999992325001186,
      "chunkedMessageRate" : 0.0,
      "msgBacklog" : 10,
      "backlogSize" : 3608212,
      "earliestMsgPublishTimeInBacklog" : 0,
      "msgBacklogNoDelayed" : 10,
      "blockedSubscriptionOnUnackedMsgs" : false,
      "msgDelayed" : 0,
      "msgInReplay" : 0,
      "unackedMessages" : 26,
      "type" : "Key_Shared",
      "msgRateExpired" : 0.0,
      "totalMsgExpired" : 0,
      "lastExpireTimestamp" : 0,
      "lastConsumedFlowTimestamp" : 1738316912547,
      "lastConsumedTimestamp" : 1738317704211,
      "lastAckedTimestamp" : 1738317707396,
      "lastMarkDeleteAdvancedTimestamp" : 0,
      "consumers" : [ {
        "msgRateOut" : 0.0,
        "msgThroughputOut" : 0.0,
        "bytesOutCounter" : 0,
        "msgOutCounter" : 0,
        "msgRateRedeliver" : 0.0,
        "messageAckRate" : 0.0,
        "chunkedMessageRate" : 0.0,
        "consumerName" : "microbatcher-consumer-27b750cf-9118-4585-9136-34ecb6a90ffe",
        "availablePermits" : 500,
        "unackedMessages" : 0,
        "avgMessagesPerEntry" : 0,
        "blockedConsumerOnUnackedMsgs" : false,
        "drainingHashesCount" : 0,
        "drainingHashesClearedTotal" : 0,
        "drainingHashesUnackedMessages" : 0,
        "drainingHashes" : [ ],
        "address" : "/10.10.179.78:52838",
        "connectedSince" : "2025-01-31T09:46:39.803306204Z",
        "clientVersion" : "Pulsar-Java-v4.0.2",
        "lastAckedTimestamp" : 0,
        "lastConsumedTimestamp" : 0,
        "lastConsumedFlowTimestamp" : 1738316799814,
        "keyHashRangeArrays" : [ [ 1515, 2038 ], [ 3849, 4299 ], [ 4464, 4577 ], [ 5188, 5234 ], [ 5449, 6708 ], [ 7236, 7239 ], [ 8129, 8361 ], [ 8914, 8926 ], [ 9655, 9829 ], [ 9959, 10040 ], [ 10788, 11077 ], [ 11782, 11963 ], [ 13985, 14384 ], [ 15106, 15200 ], [ 15782, 16093 ], [ 16122, 16315 ], [ 16335, 16759 ], [ 17059, 17076 ], [ 18068, 18300 ], [ 18488, 18578 ], [ 19599, 19681 ], [ 20131, 20189 ], [ 20514, 21012 ], [ 21891, 22294 ], [ 22450, 22468 ], [ 22776, 22891 ], [ 23152, 23237 ], [ 24151, 24397 ], [ 24642, 25135 ], [ 25570, 25656 ], [ 26289, 26293 ], [ 26423, 26843 ], [ 27974, 28209 ], [ 28860, 29208 ], [ 29779, 29816 ], [ 30007, 30317 ], [ 31403, 31610 ], [ 32887, 33005 ], [ 33401, 33717 ], [ 33891, 34068 ], [ 34205, 34262 ], [ 35101, 35158 ], [ 36566, 37028 ], [ 37881, 38931 ], [ 39256, 39782 ], [ 40111, 40142 ], [ 40267, 40604 ], [ 40715, 41186 ], [ 41403, 41667 ], [ 43168, 43394 ], [ 43405, 44313 ], [ 44795, 44960 ], [ 46331, 46365 ], [ 46706, 47175 ], [ 47590, 48125 ], [ 48549, 48675 ], [ 49501, 49517 ], [ 49831, 50064 ], [ 50422, 50460 ], [ 52018, 52338 ], [ 52529, 52991 ], [ 53814, 53849 ], [ 54740, 54763 ], [ 55544, 55616 ], [ 55732, 56176 ], [ 56501, 56621 ], [ 56727, 56815 ], [ 57158, 57334 ], [ 57860, 58229 ], [ 58242, 58476 ], [ 59275, 59361 ], [ 59970, 60019 ], [ 61151, 61210 ], [ 62030, 62291 ], [ 62590, 62610 ], [ 63450, 63735 ], [ 64945, 65168 ] ],
        "metadata" : { },
        "lastAckedTime" : "1970-01-01T00:00:00Z",
        "lastConsumedTime" : "1970-01-01T00:00:00Z"
      }, {
        "msgRateOut" : 0.016666664151389267,
        "msgThroughputOut" : 16.84999745705455,
        "bytesOutCounter" : 3649823,
        "msgOutCounter" : 851,
        "msgRateRedeliver" : 0.0,
        "messageAckRate" : 0.016666664146667048,
        "chunkedMessageRate" : 0.0,
        "consumerName" : "microbatcher-consumer-e02c3d33-2ff6-4fcb-9fdc-3e8f3773ae36",
        "availablePermits" : 399,
        "unackedMessages" : 26,
        "avgMessagesPerEntry" : 2,
        "blockedConsumerOnUnackedMsgs" : false,
        "drainingHashesCount" : 0,
        "drainingHashesClearedTotal" : 0,
        "drainingHashesUnackedMessages" : 0,
        "drainingHashes" : [ ],
        "address" : "/10.10.179.105:36012",
        "connectedSince" : "2025-01-31T09:46:39.803747416Z",
        "clientVersion" : "Pulsar-Java-v4.0.2",
        "lastAckedTimestamp" : 1738317707396,
        "lastConsumedTimestamp" : 1738317704211,
        "lastConsumedFlowTimestamp" : 1738316912547,
        "keyHashRangeArrays" : [ [ 406, 441 ], [ 661, 1185 ], [ 3706, 3848 ], [ 4300, 4463 ], [ 4578, 5029 ], [ 6709, 6722 ], [ 6833, 6877 ], [ 7301, 7326 ], [ 7506, 7595 ], [ 9503, 9572 ], [ 10041, 10427 ], [ 10536, 10583 ], [ 10605, 10787 ], [ 11078, 11101 ], [ 11115, 11166 ], [ 11448, 11781 ], [ 12319, 13683 ], [ 14385, 14534 ], [ 14706, 15105 ], [ 15201, 15651 ], [ 15659, 15781 ], [ 17077, 17309 ], [ 17568, 18019 ], [ 18805, 19188 ], [ 19256, 19353 ], [ 19484, 19598 ], [ 20072, 20078 ], [ 20190, 20213 ], [ 21106, 21458 ], [ 21676, 21731 ], [ 21753, 21890 ], [ 23238, 24150 ], [ 25136, 25343 ], [ 26294, 26422 ], [ 27172, 27880 ], [ 27944, 27973 ], [ 28653, 28859 ], [ 30373, 30610 ], [ 31110, 31402 ], [ 32104, 32194 ], [ 32350, 32436 ], [ 33006, 33097 ], [ 35454, 35941 ], [ 36289, 36430 ], [ 37029, 37034 ], [ 37596, 37679 ], [ 37806, 37880 ], [ 38946, 39181 ], [ 39783, 40110 ], [ 40605, 40714 ], [ 41187, 41232 ], [ 41314, 41402 ], [ 42094, 42417 ], [ 42549, 42795 ], [ 44462, 44609 ], [ 46240, 46322 ], [ 46366, 46422 ], [ 46581, 46705 ], [ 48893, 49220 ], [ 49518, 49601 ], [ 49621, 49830 ], [ 50749, 51596 ], [ 52339, 52528 ], [ 53770, 53813 ], [ 53850, 54079 ], [ 54222, 54320 ], [ 54764, 55360 ], [ 56177, 56219 ], [ 56622, 56650 ], [ 56867, 56896 ], [ 58230, 58241 ], [ 58491, 58830 ], [ 60644, 60926 ], [ 61211, 61299 ], [ 62611, 62962 ], [ 63344, 63449 ], [ 64566, 64944 ], [ 65169, 65177 ] ],
        "metadata" : { },
        "lastAckedTime" : "2025-01-31T10:01:47.396Z",
        "lastConsumedTime" : "2025-01-31T10:01:44.211Z"
      }, {
        "msgRateOut" : 0.0,
        "msgThroughputOut" : 0.0,
        "bytesOutCounter" : 0,
        "msgOutCounter" : 0,
        "msgRateRedeliver" : 0.0,
        "messageAckRate" : 0.0,
        "chunkedMessageRate" : 0.0,
        "consumerName" : "microbatcher-consumer-4ddf9389-86d0-42a0-9b99-eac0c9374d16",
        "availablePermits" : 500,
        "unackedMessages" : 0,
        "avgMessagesPerEntry" : 0,
        "blockedConsumerOnUnackedMsgs" : false,
        "drainingHashesCount" : 0,
        "drainingHashesClearedTotal" : 0,
        "drainingHashesUnackedMessages" : 0,
        "drainingHashes" : [ ],
        "address" : "/10.10.178.13:41010",
        "connectedSince" : "2025-01-31T09:46:39.803606989Z",
        "clientVersion" : "Pulsar-Java-v4.0.2",
        "lastAckedTimestamp" : 0,
        "lastConsumedTimestamp" : 0,
        "lastConsumedFlowTimestamp" : 1738316799808,
        "keyHashRangeArrays" : [ [ 1, 376 ], [ 501, 525 ], [ 1186, 1502 ], [ 2039, 2287 ], [ 2524, 2794 ], [ 3588, 3705 ], [ 6723, 6832 ], [ 6878, 7235 ], [ 7596, 8128 ], [ 8446, 8913 ], [ 8927, 9502 ], [ 9647, 9654 ], [ 9830, 9958 ], [ 10428, 10535 ], [ 10584, 10604 ], [ 11102, 11114 ], [ 15652, 15658 ], [ 16094, 16121 ], [ 17310, 17567 ], [ 18301, 18487 ], [ 20214, 20513 ], [ 22357, 22449 ], [ 22469, 22775 ], [ 22892, 22961 ], [ 24398, 24537 ], [ 25968, 26288 ], [ 26969, 27171 ], [ 27881, 27943 ], [ 28210, 28400 ], [ 29461, 29778 ], [ 29817, 30006 ], [ 30356, 30372 ], [ 30611, 30811 ], [ 31611, 32103 ], [ 32437, 32886 ], [ 33718, 33841 ], [ 34069, 34185 ], [ 34263, 35100 ], [ 35159, 35453 ], [ 37203, 37595 ], [ 37680, 37805 ], [ 38932, 38945 ], [ 39182, 39255 ], [ 40143, 40157 ], [ 41668, 42093 ], [ 42488, 42548 ], [ 44375, 44445 ], [ 44616, 44781 ], [ 44961, 45485 ], [ 45867, 45916 ], [ 46423, 46580 ], [ 47176, 47442 ], [ 50065, 50071 ], [ 50461, 50748 ], [ 51619, 51944 ], [ 52003, 52017 ], [ 54321, 54609 ], [ 55361, 55543 ], [ 55617, 55731 ], [ 56220, 56500 ], [ 56651, 56726 ], [ 56897, 57002 ], [ 57343, 57388 ], [ 59362, 59672 ], [ 60335, 60643 ], [ 60927, 61068 ], [ 61399, 61454 ], [ 61698, 62029 ], [ 62292, 62384 ], [ 62535, 62539 ], [ 62963, 63028 ], [ 63736, 63765 ], [ 64276, 64565 ], [ 65205, 65535 ] ],
        "metadata" : { },
        "lastAckedTime" : "1970-01-01T00:00:00Z",
        "lastConsumedTime" : "1970-01-01T00:00:00Z"
      }, {
        "msgRateOut" : 0.0,
        "msgThroughputOut" : 0.0,
        "bytesOutCounter" : 8304,
        "msgOutCounter" : 12,
        "msgRateRedeliver" : 0.0,
        "messageAckRate" : 0.033333328178334135,
        "chunkedMessageRate" : 0.0,
        "consumerName" : "microbatcher-consumer-7e5d4f42-6697-446e-bbb8-6eeb3349269c",
        "availablePermits" : 488,
        "unackedMessages" : 0,
        "avgMessagesPerEntry" : 1,
        "blockedConsumerOnUnackedMsgs" : false,
        "drainingHashesCount" : 0,
        "drainingHashesClearedTotal" : 0,
        "drainingHashesUnackedMessages" : 0,
        "drainingHashes" : [ ],
        "address" : "/10.10.179.78:36344",
        "connectedSince" : "2025-01-31T09:46:39.803620032Z",
        "clientVersion" : "Pulsar-Java-v4.0.2",
        "lastAckedTimestamp" : 1738317673406,
        "lastConsumedTimestamp" : 1738317657199,
        "lastConsumedFlowTimestamp" : 1738316799808,
        "keyHashRangeArrays" : [ [ 377, 405 ], [ 442, 500 ], [ 526, 660 ], [ 1503, 1514 ], [ 2288, 2523 ], [ 2795, 3587 ], [ 5030, 5187 ], [ 5235, 5448 ], [ 7240, 7300 ], [ 7327, 7505 ], [ 8362, 8445 ], [ 9573, 9646 ], [ 11167, 11447 ], [ 11964, 12318 ], [ 13684, 13984 ], [ 14535, 14705 ], [ 16316, 16334 ], [ 16760, 17058 ], [ 18020, 18067 ], [ 18579, 18804 ], [ 19189, 19255 ], [ 19354, 19483 ], [ 19682, 20071 ], [ 20079, 20130 ], [ 21013, 21105 ], [ 21459, 21675 ], [ 21732, 21752 ], [ 22295, 22356 ], [ 22962, 23151 ], [ 24538, 24641 ], [ 25344, 25569 ], [ 25657, 25967 ], [ 26844, 26968 ], [ 28401, 28652 ], [ 29209, 29460 ], [ 30318, 30355 ], [ 30812, 31109 ], [ 32195, 32349 ], [ 33098, 33400 ], [ 33842, 33890 ], [ 34186, 34204 ], [ 35942, 36288 ], [ 36431, 36565 ], [ 37035, 37202 ], [ 40158, 40266 ], [ 41233, 41313 ], [ 42418, 42487 ], [ 42796, 43167 ], [ 43395, 43404 ], [ 44314, 44374 ], [ 44446, 44461 ], [ 44610, 44615 ], [ 44782, 44794 ], [ 45486, 45866 ], [ 45917, 46239 ], [ 46323, 46330 ], [ 47443, 47589 ], [ 48126, 48548 ], [ 48676, 48892 ], [ 49221, 49500 ], [ 49602, 49620 ], [ 50072, 50421 ], [ 51597, 51618 ], [ 51945, 52002 ], [ 52992, 53769 ], [ 54080, 54221 ], [ 54610, 54739 ], [ 56816, 56866 ], [ 57003, 57157 ], [ 57335, 57342 ], [ 57389, 57859 ], [ 58477, 58490 ], [ 58831, 59274 ], [ 59673, 59969 ], [ 60020, 60334 ], [ 61069, 61150 ], [ 61300, 61398 ], [ 61455, 61697 ], [ 62385, 62534 ], [ 62540, 62589 ], [ 63029, 63343 ], [ 63766, 64275 ], [ 65178, 65204 ] ],
        "metadata" : { },
        "lastAckedTime" : "2025-01-31T10:01:13.406Z",
        "lastConsumedTime" : "2025-01-31T10:00:57.199Z"
      } ],
      "isDurable" : true,
      "isReplicated" : false,
      "allowOutOfOrderDelivery" : false,
      "keySharedMode" : "AUTO_SPLIT",
      "consumersAfterMarkDeletePosition" : { },
      "drainingHashesCount" : 0,
      "drainingHashesClearedTotal" : 0,
      "drainingHashesUnackedMessages" : 0,
      "nonContiguousDeletedMessagesRanges" : 2,
      "nonContiguousDeletedMessagesRangesSerializedSize" : 70,
      "delayedMessageIndexSizeInBytes" : 0,
      "subscriptionProperties" : { },
      "filterProcessedMsgCount" : 0,
      "filterAcceptedMsgCount" : 0,
      "filterRejectedMsgCount" : 0,
      "filterRescheduledMsgCount" : 0,
      "durable" : true,
      "replicated" : false
    }
  },
  "replication" : { },
  "deduplicationStatus" : "Disabled",
  "nonContiguousDeletedMessagesRanges" : 2,
  "nonContiguousDeletedMessagesRangesSerializedSize" : 70,
  "delayedMessageIndexSizeInBytes" : 0,
  "compaction" : {
    "lastCompactionRemovedEventCount" : 0,
    "lastCompactionSucceedTimestamp" : 0,
    "lastCompactionFailedTimestamp" : 0,
    "lastCompactionDurationTimeInMills" : 0
  },
  "ownerBroker" : "integ-pulsar-broker-5.integ-pulsar-broker.str-integ.svc.cluster.local:8080"
}

Topic partitioned-stats:

{
  "msgRateIn" : 2.0333408764084946,
  "msgThroughputIn" : 1704.8063034094396,
  "msgRateOut" : 2.033340898884361,
  "msgThroughputOut" : 1704.8063223167255,
  "bytesInCounter" : 330665095,
  "msgInCounter" : 78321,
  "systemTopicBytesInCounter" : 0,
  "bytesOutCounter" : 338307079,
  "msgOutCounter" : 80097,
  "bytesOutInternalCounter" : 0,
  "averageMsgSize" : 612.0249999878398,
  "msgChunkPublished" : false,
  "storageSize" : 339672957,
  "backlogSize" : 55819242,
  "backlogQuotaLimitSize" : -1,
  "backlogQuotaLimitTime" : -1,
  "oldestBacklogMessageAgeSeconds" : 1202,
  "oldestBacklogMessageSubscriptionName" : "microbatcher",
  "publishRateLimitedTimes" : 0,
  "earliestMsgPublishTimeInBacklogs" : 0,
  "offloadedStorageSize" : 0,
  "lastOffloadLedgerId" : 0,
  "lastOffloadSuccessTimeStamp" : 0,
  "lastOffloadFailureTimeStamp" : 0,
  "ongoingTxnCount" : 0,
  "abortedTxnCount" : 0,
  "committedTxnCount" : 0,
  "publishers" : [ {
    "msgRateIn" : 1.0333367134601146,
    "msgThroughputIn" : 874.5694844573518,
    "averageMsgSize" : 420.425,
    "chunkedMessageRate" : 0.0,
    "producerId" : 0,
    "supportsPartialProducer" : false
  }, {
    "msgRateIn" : 1.000004162948381,
    "msgThroughputIn" : 830.2368189520878,
    "averageMsgSize" : 387.9766666666666,
    "chunkedMessageRate" : 0.0,
    "producerId" : 0,
    "supportsPartialProducer" : false
  } ],
  "waitingPublishers" : 0,
  "subscriptions" : {
    "microbatcher" : {
      "msgRateOut" : 2.033340898884361,
      "msgThroughputOut" : 1704.8063223167255,
      "bytesOutCounter" : 338307079,
      "msgOutCounter" : 80097,
      "msgRateRedeliver" : 0.0,
      "messageAckRate" : 2.050007899466562,
      "chunkedMessageRate" : 0.0,
      "msgBacklog" : 131,
      "backlogSize" : 55819242,
      "earliestMsgPublishTimeInBacklog" : 0,
      "msgBacklogNoDelayed" : 131,
      "blockedSubscriptionOnUnackedMsgs" : false,
      "msgDelayed" : 0,
      "msgInReplay" : 0,
      "unackedMessages" : 214,
      "type" : "Key_Shared",
      "msgRateExpired" : 0.0,
      "totalMsgExpired" : 0,
      "lastExpireTimestamp" : 0,
      "lastConsumedFlowTimestamp" : 0,
      "lastConsumedTimestamp" : 0,
      "lastAckedTimestamp" : 0,
      "lastMarkDeleteAdvancedTimestamp" : 0,
      "consumers" : [ {
        "msgRateOut" : 0.600002054159385,
        "msgThroughputOut" : 473.685136454395,
        "bytesOutCounter" : 64861170,
        "msgOutCounter" : 15575,
        "msgRateRedeliver" : 0.0,
        "messageAckRate" : 0.6166689861685504,
        "chunkedMessageRate" : 0.0,
        "availablePermits" : 46675,
        "unackedMessages" : 95,
        "avgMessagesPerEntry" : 0,
        "blockedConsumerOnUnackedMsgs" : false,
        "drainingHashesCount" : 0,
        "drainingHashesClearedTotal" : 51,
        "drainingHashesUnackedMessages" : 0,
        "drainingHashes" : [ ],
        "lastAckedTimestamp" : 0,
        "lastConsumedTimestamp" : 0,
        "lastConsumedFlowTimestamp" : 0,
        "keyHashRangeArrays" : [ [ 1, 376 ], [ 501, 525 ], [ 1186, 1502 ], [ 2039, 2287 ], [ 2524, 2794 ], [ 3588, 3705 ], [ 6723, 6832 ], [ 6878, 7235 ], [ 7596, 8128 ], [ 8446, 8913 ], [ 8927, 9502 ], [ 9647, 9654 ], [ 9830, 9958 ], [ 10428, 10535 ], [ 10584, 10604 ], [ 11102, 11114 ], [ 15652, 15658 ], [ 16094, 16121 ], [ 17310, 17567 ], [ 18301, 18487 ], [ 20214, 20513 ], [ 22357, 22449 ], [ 22469, 22775 ], [ 22892, 22961 ], [ 24398, 24537 ], [ 25968, 26288 ], [ 26969, 27171 ], [ 27881, 27943 ], [ 28210, 28400 ], [ 29461, 29778 ], [ 29817, 30006 ], [ 30356, 30372 ], [ 30611, 30811 ], [ 31611, 32103 ], [ 32437, 32886 ], [ 33718, 33841 ], [ 34069, 34185 ], [ 34263, 35100 ], [ 35159, 35453 ], [ 37203, 37595 ], [ 37680, 37805 ], [ 38932, 38945 ], [ 39182, 39255 ], [ 40143, 40157 ], [ 41668, 42093 ], [ 42488, 42548 ], [ 44375, 44445 ], [ 44616, 44781 ], [ 44961, 45485 ], [ 45867, 45916 ], [ 46423, 46580 ], [ 47176, 47442 ], [ 50065, 50071 ], [ 50461, 50748 ], [ 51619, 51944 ], [ 52003, 52017 ], [ 54321, 54609 ], [ 55361, 55543 ], [ 55617, 55731 ], [ 56220, 56500 ], [ 56651, 56726 ], [ 56897, 57002 ], [ 57343, 57388 ], [ 59362, 59672 ], [ 60335, 60643 ], [ 60927, 61068 ], [ 61399, 61454 ], [ 61698, 62029 ], [ 62292, 62384 ], [ 62535, 62539 ], [ 62963, 63028 ], [ 63736, 63765 ], [ 64276, 64565 ], [ 65205, 65535 ] ],
        "lastAckedTime" : "1970-01-01T00:00:00Z",
        "lastConsumedTime" : "1970-01-01T00:00:00Z"
      }, {
        "msgRateOut" : 0.5166690860734111,
        "msgThroughputOut" : 405.385061886289,
        "bytesOutCounter" : 80591750,
        "msgOutCounter" : 19173,
        "msgRateRedeliver" : 0.0,
        "messageAckRate" : 0.5500023426171868,
        "chunkedMessageRate" : 0.0,
        "availablePermits" : 46577,
        "unackedMessages" : 61,
        "avgMessagesPerEntry" : 0,
        "blockedConsumerOnUnackedMsgs" : false,
        "drainingHashesCount" : 0,
        "drainingHashesClearedTotal" : 11,
        "drainingHashesUnackedMessages" : 0,
        "drainingHashes" : [ ],
        "lastAckedTimestamp" : 0,
        "lastConsumedTimestamp" : 0,
        "lastConsumedFlowTimestamp" : 0,
        "keyHashRangeArrays" : [ [ 377, 405 ], [ 442, 500 ], [ 526, 660 ], [ 1503, 1514 ], [ 2288, 2523 ], [ 2795, 3587 ], [ 5030, 5187 ], [ 5235, 5448 ], [ 7240, 7300 ], [ 7327, 7505 ], [ 8362, 8445 ], [ 9573, 9646 ], [ 11167, 11447 ], [ 11964, 12318 ], [ 13684, 13984 ], [ 14535, 14705 ], [ 16316, 16334 ], [ 16760, 17058 ], [ 18020, 18067 ], [ 18579, 18804 ], [ 19189, 19255 ], [ 19354, 19483 ], [ 19682, 20071 ], [ 20079, 20130 ], [ 21013, 21105 ], [ 21459, 21675 ], [ 21732, 21752 ], [ 22295, 22356 ], [ 22962, 23151 ], [ 24538, 24641 ], [ 25344, 25569 ], [ 25657, 25967 ], [ 26844, 26968 ], [ 28401, 28652 ], [ 29209, 29460 ], [ 30318, 30355 ], [ 30812, 31109 ], [ 32195, 32349 ], [ 33098, 33400 ], [ 33842, 33890 ], [ 34186, 34204 ], [ 35942, 36288 ], [ 36431, 36565 ], [ 37035, 37202 ], [ 40158, 40266 ], [ 41233, 41313 ], [ 42418, 42487 ], [ 42796, 43167 ], [ 43395, 43404 ], [ 44314, 44374 ], [ 44446, 44461 ], [ 44610, 44615 ], [ 44782, 44794 ], [ 45486, 45866 ], [ 45917, 46239 ], [ 46323, 46330 ], [ 47443, 47589 ], [ 48126, 48548 ], [ 48676, 48892 ], [ 49221, 49500 ], [ 49602, 49620 ], [ 50072, 50421 ], [ 51597, 51618 ], [ 51945, 52002 ], [ 52992, 53769 ], [ 54080, 54221 ], [ 54610, 54739 ], [ 56816, 56866 ], [ 57003, 57157 ], [ 57335, 57342 ], [ 57389, 57859 ], [ 58477, 58490 ], [ 58831, 59274 ], [ 59673, 59969 ], [ 60020, 60334 ], [ 61069, 61150 ], [ 61300, 61398 ], [ 61455, 61697 ], [ 62385, 62534 ], [ 62540, 62589 ], [ 63029, 63343 ], [ 63766, 64275 ], [ 65178, 65204 ] ],
        "lastAckedTime" : "1970-01-01T00:00:00Z",
        "lastConsumedTime" : "1970-01-01T00:00:00Z"
      }, {
        "msgRateOut" : 0.48333491160967085,
        "msgThroughputOut" : 424.8514676772335,
        "bytesOutCounter" : 78352903,
        "msgOutCounter" : 18496,
        "msgRateRedeliver" : 0.0,
        "messageAckRate" : 0.5166687138351619,
        "chunkedMessageRate" : 0.0,
        "availablePermits" : 46754,
        "unackedMessages" : 10,
        "avgMessagesPerEntry" : 0,
        "blockedConsumerOnUnackedMsgs" : false,
        "drainingHashesCount" : 0,
        "drainingHashesClearedTotal" : 3,
        "drainingHashesUnackedMessages" : 0,
        "drainingHashes" : [ ],
        "lastAckedTimestamp" : 0,
        "lastConsumedTimestamp" : 0,
        "lastConsumedFlowTimestamp" : 0,
        "keyHashRangeArrays" : [ [ 1515, 2038 ], [ 3849, 4299 ], [ 4464, 4577 ], [ 5188, 5234 ], [ 5449, 6708 ], [ 7236, 7239 ], [ 8129, 8361 ], [ 8914, 8926 ], [ 9655, 9829 ], [ 9959, 10040 ], [ 10788, 11077 ], [ 11782, 11963 ], [ 13985, 14384 ], [ 15106, 15200 ], [ 15782, 16093 ], [ 16122, 16315 ], [ 16335, 16759 ], [ 17059, 17076 ], [ 18068, 18300 ], [ 18488, 18578 ], [ 19599, 19681 ], [ 20131, 20189 ], [ 20514, 21012 ], [ 21891, 22294 ], [ 22450, 22468 ], [ 22776, 22891 ], [ 23152, 23237 ], [ 24151, 24397 ], [ 24642, 25135 ], [ 25570, 25656 ], [ 26289, 26293 ], [ 26423, 26843 ], [ 27974, 28209 ], [ 28860, 29208 ], [ 29779, 29816 ], [ 30007, 30317 ], [ 31403, 31610 ], [ 32887, 33005 ], [ 33401, 33717 ], [ 33891, 34068 ], [ 34205, 34262 ], [ 35101, 35158 ], [ 36566, 37028 ], [ 37881, 38931 ], [ 39256, 39782 ], [ 40111, 40142 ], [ 40267, 40604 ], [ 40715, 41186 ], [ 41403, 41667 ], [ 43168, 43394 ], [ 43405, 44313 ], [ 44795, 44960 ], [ 46331, 46365 ], [ 46706, 47175 ], [ 47590, 48125 ], [ 48549, 48675 ], [ 49501, 49517 ], [ 49831, 50064 ], [ 50422, 50460 ], [ 52018, 52338 ], [ 52529, 52991 ], [ 53814, 53849 ], [ 54740, 54763 ], [ 55544, 55616 ], [ 55732, 56176 ], [ 56501, 56621 ], [ 56727, 56815 ], [ 57158, 57334 ], [ 57860, 58229 ], [ 58242, 58476 ], [ 59275, 59361 ], [ 59970, 60019 ], [ 61151, 61210 ], [ 62030, 62291 ], [ 62590, 62610 ], [ 63450, 63735 ], [ 64945, 65168 ] ],
        "lastAckedTime" : "1970-01-01T00:00:00Z",
        "lastConsumedTime" : "1970-01-01T00:00:00Z"
      }, {
        "msgRateOut" : 0.43333484704189446,
        "msgThroughputOut" : 400.884656298808,
        "bytesOutCounter" : 114501256,
        "msgOutCounter" : 26853,
        "msgRateRedeliver" : 0.0,
        "messageAckRate" : 0.36666785684566255,
        "chunkedMessageRate" : 0.0,
        "availablePermits" : 46647,
        "unackedMessages" : 48,
        "avgMessagesPerEntry" : 0,
        "blockedConsumerOnUnackedMsgs" : false,
        "drainingHashesCount" : 0,
        "drainingHashesClearedTotal" : 0,
        "drainingHashesUnackedMessages" : 0,
        "drainingHashes" : [ ],
        "lastAckedTimestamp" : 0,
        "lastConsumedTimestamp" : 0,
        "lastConsumedFlowTimestamp" : 0,
        "keyHashRangeArrays" : [ [ 406, 441 ], [ 661, 1185 ], [ 3706, 3848 ], [ 4300, 4463 ], [ 4578, 5029 ], [ 6709, 6722 ], [ 6833, 6877 ], [ 7301, 7326 ], [ 7506, 7595 ], [ 9503, 9572 ], [ 10041, 10427 ], [ 10536, 10583 ], [ 10605, 10787 ], [ 11078, 11101 ], [ 11115, 11166 ], [ 11448, 11781 ], [ 12319, 13683 ], [ 14385, 14534 ], [ 14706, 15105 ], [ 15201, 15651 ], [ 15659, 15781 ], [ 17077, 17309 ], [ 17568, 18019 ], [ 18805, 19188 ], [ 19256, 19353 ], [ 19484, 19598 ], [ 20072, 20078 ], [ 20190, 20213 ], [ 21106, 21458 ], [ 21676, 21731 ], [ 21753, 21890 ], [ 23238, 24150 ], [ 25136, 25343 ], [ 26294, 26422 ], [ 27172, 27880 ], [ 27944, 27973 ], [ 28653, 28859 ], [ 30373, 30610 ], [ 31110, 31402 ], [ 32104, 32194 ], [ 32350, 32436 ], [ 33006, 33097 ], [ 35454, 35941 ], [ 36289, 36430 ], [ 37029, 37034 ], [ 37596, 37679 ], [ 37806, 37880 ], [ 38946, 39181 ], [ 39783, 40110 ], [ 40605, 40714 ], [ 41187, 41232 ], [ 41314, 41402 ], [ 42094, 42417 ], [ 42549, 42795 ], [ 44462, 44609 ], [ 46240, 46322 ], [ 46366, 46422 ], [ 46581, 46705 ], [ 48893, 49220 ], [ 49518, 49601 ], [ 49621, 49830 ], [ 50749, 51596 ], [ 52339, 52528 ], [ 53770, 53813 ], [ 53850, 54079 ], [ 54222, 54320 ], [ 54764, 55360 ], [ 56177, 56219 ], [ 56622, 56650 ], [ 56867, 56896 ], [ 58230, 58241 ], [ 58491, 58830 ], [ 60644, 60926 ], [ 61211, 61299 ], [ 62611, 62962 ], [ 63344, 63449 ], [ 64566, 64944 ], [ 65169, 65177 ] ],
        "lastAckedTime" : "1970-01-01T00:00:00Z",
        "lastConsumedTime" : "1970-01-01T00:00:00Z"
      } ],
      "isDurable" : true,
      "isReplicated" : false,
      "allowOutOfOrderDelivery" : false,
      "consumersAfterMarkDeletePosition" : { },
      "drainingHashesCount" : 0,
      "drainingHashesClearedTotal" : 65,
      "drainingHashesUnackedMessages" : 0,
      "nonContiguousDeletedMessagesRanges" : 29,
      "nonContiguousDeletedMessagesRangesSerializedSize" : 1001,
      "delayedMessageIndexSizeInBytes" : 0,
      "subscriptionProperties" : { },
      "filterProcessedMsgCount" : 0,
      "filterAcceptedMsgCount" : 0,
      "filterRejectedMsgCount" : 0,
      "filterRescheduledMsgCount" : 0,
      "replicated" : false,
      "durable" : true
    }
  },
  "replication" : { },
  "nonContiguousDeletedMessagesRanges" : 29,
  "nonContiguousDeletedMessagesRangesSerializedSize" : 1001,
  "delayedMessageIndexSizeInBytes" : 0,
  "compaction" : {
    "lastCompactionRemovedEventCount" : 0,
    "lastCompactionSucceedTimestamp" : 0,
    "lastCompactionFailedTimestamp" : 0,
    "lastCompactionDurationTimeInMills" : 0
  },
  "metadata" : {
    "partitions" : 100,
    "deleted" : false
  },
  "partitions" : { }
}

@szkoludasebastian
Copy link
Author

szkoludasebastian commented Jan 31, 2025

@szkoludasebastian Just confirming: do you happen to use Pulsar transactions?

Checked on broker: This Broker is not configured with transactionCoordinatorEnabled=true so we are not using pulsar transactions. Should I check it in some other way?

@lhotari
Copy link
Member

lhotari commented Jan 31, 2025

@szkoludasebastian Just confirming: do you happen to use Pulsar transactions?

Checked on broker: This Broker is not configured with transactionCoordinatorEnabled=true so we are not using pulsar transactions. Should I check it in some other way?

That's fine. No transactions involved. The application code would explicitly need to use transactions if it was used.
The reason why I asked is that acknowledgement handling is more complicated when using transactions.

@lhotari
Copy link
Member

lhotari commented Jan 31, 2025

In the v3.3.1...v3.3.2 diff, #23072 looks like the largest change related to acknowledgements. If you would be able to run the experiments mentioned in #23908 (comment), that would confirm many things. Building a fork off v3.3.2 with #23072 commit 1996c86 reverted would be one way to check if it's impacting your use case.
There are also a few other commits that could be causing the change in behavior such as #23189 & #23147.

@szkoludasebastian
Copy link
Author

I will check that with dispatcherDispatchMessagesInSubscriptionThread=false. Where to configure that?

@szkoludasebastian In broker.conf (reference). In Apache Pulsar Helm chart k8s deployments, you add dispatcherDispatchMessagesInSubscriptionThread: "false" under the broker.configData key in values.yaml.

Configured that but still able to reproduce issue

@szkoludasebastian
Copy link
Author

In the v3.3.1...v3.3.2 diff, #23072 looks like the largest change related to acknowledgements. If you would be able to run the experiments mentioned in #23908 (comment), that would confirm many things. Building a fork off v3.3.2 with #23072 commit 1996c86 reverted would be one way to check if it's impacting your use case. There are also a few other commits that could be causing the change in behavior such as #23189 & #23147.

Tried that but I have problem with building pulsar-all image for architecture linux/amd64. For linux/arm64 it works correctly but I need to build linux/amd64

@lhotari
Copy link
Member

lhotari commented Feb 3, 2025

Tried that but I have problem with building pulsar-all image for architecture linux/amd64. For linux/arm64 it works correctly but I need to build linux/amd64

What problem are you facing? The way Pulsar docker images are built during the release is documented at https://pulsar.apache.org/contribute/release-process/#release-pulsar-30-and-later . The prerequisite is to build artifacts before that.

@szkoludasebastian
Copy link
Author

I tried to build that locally with this instruction from Readme:

Image I've added this param: -Ddocker.platforms=linux/amd64 so it was like: mvn package -Pdocker,-main -am -pl docker/pulsar-all -DskipTests -Ddocker.platforms=linux/amd64 and the problem is: Image there is no such problem when I remove this param and image apachepulsar/pulsar and apachepulsar/pulsar-all is built locally, but I need linux/amd64 arch

@lhotari
Copy link
Member

lhotari commented Feb 3, 2025

I tried to build that locally with this instruction from Readme:

Unfortunately those instructions aren't up-to-date. Please check #23908 (comment) for the commands used while releasing.

@szkoludasebastian
Copy link
Author

still same issue when i use commands from this link

@lhotari
Copy link
Member

lhotari commented Feb 3, 2025

still same issue when i use commands from this link

It seems that you might be using the wrong commands. In the error message, you shouldn't see apachepulsar/pulsar at all if you are using this command:

mvn clean install -DskipTests
DOCKER_USER=my_docker_user
mvn install -pl docker/pulsar,docker/pulsar-all \
    -DskipTests \
    -Pmain,docker,docker-push \
    -Ddocker.platforms=linux/amd64,linux/arm64 \
    -Ddocker.organization=$DOCKER_USER \
    -Ddocker.noCache=true

@szkoludasebastian
Copy link
Author

still something wrong, on k8s I see such error:

Image

on my local docker images seems to be linux/amd64

Image

@lhotari
Copy link
Member

lhotari commented Feb 3, 2025

still something wrong, on k8s I see such error:

Image on my local docker images seems to be linux/amd64 Image

Did you use these commands to build a multi-platform image?

mvn clean install -DskipTests
DOCKER_USER=my_docker_user
mvn install -pl docker/pulsar,docker/pulsar-all \
    -DskipTests \
    -Pmain,docker,docker-push \
    -Ddocker.platforms=linux/amd64,linux/arm64 \
    -Ddocker.organization=$DOCKER_USER \
    -Ddocker.noCache=true

There are several gotchas in building and handling multi-platform images. I'm not exactly sure, but I think that unless you build a multi-platform image in one shot and push it directly to a docker repository from the build, you might not be able to push it separately with docker push. I have struggled with this issue in the past. You can use a custom docker repository as well, by prefixing the -Ddocker.organization= value with the docker repository.

You might not need a multi-platform image in your case, but if you'd like to build a similar image as the Pulsar images, it's better to stick to the same set of commands.

Another detail is that tagging multi-platform images contains gotchas. In Pulsar, we use the regctl image copy to achieve tagging directly in the repository. (examples). Using docker tag locally and then pushing with docker push could cause in unexpected results with multi-platform images. regctl is available at https://github.com/regclient/regclient .

@szkoludasebastian
Copy link
Author

szkoludasebastian commented Feb 4, 2025

In the v3.3.1...v3.3.2 diff, #23072 looks like the largest change related to acknowledgements. If you would be able to run the experiments mentioned in #23908 (comment), that would confirm many things. Building a fork off v3.3.2 with #23072 commit 1996c86 reverted would be one way to check if it's impacting your use case. There are also a few other commits that could be causing the change in behavior such as #23189 & #23147.

Finally I was able to build pulsar-all docker image. I reverted commits which you mentioned in above comment but still was able to reproduce issue. Do you know which other commit I can try?

@szkoludasebastian
Copy link
Author

I did some more tests and it turned out that it is most likely the version of pulsar client that is causing the problem. Previously when I was testing pulsar server all the time the client was version 4.0.1 and I managed to reproduce the issue. After all these tests of pulsar server I decided to check if it wasn't the client that was causing it so I reverted to client 3.3.2 and this time I wasn't able to reproduce the issue when restarting bookies, only restarting brokers caused stuck partitions (but it was really random). I have posted all the results of my tests below.

Server Client Result Notes
3.3.2 3.3.2 no issue on bookies restart. When I restart brokers then I see stuck partitions partitions backlog is not growing fast, it's rather really low
3.3.2 3.2.4 no issue on bookies restart. When I restart brokers then I see stuck partitions Image partitions backlog is not growing fast, it's rather really low
3.3.2 4.0.1 issue reproduced with bookies restart Image Immediately noticeable increase in partitions backlog
4.0.1 4.0.1 issue reproduced with bookies restart Image Immediately noticeable increase in partitions backlog

@lhotari
Copy link
Member

lhotari commented Feb 10, 2025

I did some more tests and it turned out that it is most likely the version of pulsar client that is causing the problem. Previously when I was testing pulsar server all the time the client was version 4.0.1 and I managed to reproduce the issue. After all these tests of pulsar server I decided to check if it wasn't the client that was causing it so I reverted to client 3.3.2 and this time I wasn't able to reproduce the issue when restarting bookies, only restarting brokers caused stuck partitions (but it was really random). I have posted all the results of my tests below.

Thank you @szkoludasebastian, this is useful in isolating the issue. Since there's also 3.3.3 and 3.3.4 releases available, would you be able to test whether problems reproduce in those releases? If possible, it would be great to find the commit that breaks your use case.

@szkoludasebastian
Copy link
Author

I performed more tests for pulsar client versions 3.3.3 and 3.3.4 to check which version is causing issue but failed to reproduce it. Then I checked again versions 4.0.0 and 4.0.1, however, it failed to reproduce for them as well. I noticed that right now I built images for our services from our latest master, so I checked git history and lately we reverted such change:

loadManagerClassName: "org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl"
loadBalancerLoadSheddingStrategy: "org.apache.pulsar.broker.loadbalance.extensions.scheduler.TransferShedder"

we had it added in broker configmap.
It turned out that when I add this to our broker configmap, I am able to reproduce the problem. When I remove it, I can't reproduce it. So this is probably the cause of the problem. I also tested this on the latest pulsar client and server (4.0.2) and it behaves the same way.

Here is our broker configmap (added these two configurations mentioned above):

apiVersion: v1
data:
  PULSAR_GC: |
    -XX:+UseG1GC -XX:MaxGCPauseMillis=10 -Dio.netty.leakDetectionLevel=disabled -Dio.netty.recycler.linkCapacity=1024 -XX:+ParallelRefProcEnabled -XX:+UnlockExperimentalVMOptions -XX:+DoEscapeAnalysis -XX:ParallelGCThreads=4 -XX:ConcGCThreads=4 -XX:G1NewSizePercent=50 -XX:+DisableExplicitGC -XX:-ResizePLAB -XX:+ExitOnOutOfMemoryError -XX:+PerfDisableSharedMem -Djute.maxbuffer=15485760
  PULSAR_MEM: |
    -Xms512m -Xmx5120m -XX:MaxDirectMemorySize=5120m
  acknowledgmentAtBatchIndexLevelEnabled: "true"
  brokerDeduplicationEnabled: "false"
  brokerDeleteInactiveTopicsEnabled: "false"
  brokerServicePort: "6650"
  clusterName: integ-pulsar
  configurationStoreServers: integ-pulsar-zookeeper:2181
  defaultNumberOfNamespaceBundles: "16"
  dispatcherDispatchMessagesInSubscriptionThread: "false"
  exposeTopicLevelMetricsInPrometheus: "true"
  functionsWorkerEnabled: "false"
  loadBalancerLoadSheddingStrategy: org.apache.pulsar.broker.loadbalance.extensions.scheduler.TransferShedder
  loadBalancerNamespaceMaximumBundles: "512"
  loadManagerClassName: org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl
  managedLedgerDefaultAckQuorum: "2"
  managedLedgerDefaultEnsembleSize: "3"
  managedLedgerDefaultWriteQuorum: "3"
  managedLedgerMaxLedgerRolloverTimeMinutes: "60"
  managedLedgerMinLedgerRolloverTimeMinutes: "1"
  managedLedgerOffloadDriver: aws-s3
  maxUnackedMessagesPerConsumer: "500000"
  maxUnackedMessagesPerSubscription: "2000000"
  numHttpServerThreads: "8"
  s3ManagedLedgerOffloadBucket: tiered-storage-us-east-1
  s3ManagedLedgerOffloadRegion: us-east-1
  s3ManagedLedgerOffloadRole: arn:aws:iam::xyz:role/str-integ-broker
  s3ManagedLedgerOffloadRoleSessionName: pulsar-tiered-storage-offload
  statusFilePath: /pulsar/status
  subscriptionExpirationTimeMinutes: "43800"
  subscriptionPatternMaxLength: "200"
  systemTopicEnabled: "true"
  topicLevelPoliciesEnabled: "true"
  webServicePort: "8080"
  zooKeeperSessionTimeoutMillis: "30000"
  zookeeperServers: integ-pulsar-zookeeper:2181

@lhotari
Copy link
Member

lhotari commented Feb 19, 2025

I performed more tests for pulsar client versions 3.3.3 and 3.3.4 to check which version is causing issue but failed to reproduce it. Then I checked again versions 4.0.0 and 4.0.1, however, it failed to reproduce for them as well. I noticed that right now I built images for our services from our latest master, so I checked git history and lately we reverted such change:

loadManagerClassName: "org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImpl"
loadBalancerLoadSheddingStrategy: "org.apache.pulsar.broker.loadbalance.extensions.scheduler.TransferShedder"
we had it added in broker configmap. It turned out that when I add this to our broker configmap, I am able to reproduce the problem. When I remove it, I can't reproduce it. So this is probably the cause of the problem. I also tested this on the latest pulsar client and server (4.0.2) and it behaves the same way.

Thank you @szkoludasebastian. This is very useful information in isolating the issue to be related to ExtensibleLoadManagerImpl. @heesung-sn recently provided this advice about using ExtensibleLoadManagerImpl: #23889 (comment)
Are you able to reproduce if you set loadManagerServiceUnitStateTableViewClassName=org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateMetadataStoreTableViewImpl?

@lhotari lhotari changed the title [Bug] Topic partition backlog grows after bookies restart [Bug] Topic partition backlog grows after bookies restart when using ExtensibleLoadManagerImpl Feb 19, 2025
@szkoludasebastian
Copy link
Author

Tested but still able to reproduce issue

@heesung-sn
Copy link
Contributor

heesung-sn commented Feb 19, 2025

Thank you for reporting this bug.
Seems like there are multiple things to check to isolate the problem.
Possible Theories:

  • consumers failed to reconnect (can we check the ownership of the topics by pulsar-admin topics lookup {topic-partition-name} Also, do we see any errors on the consumer side?)
  • brokers failed to dispatch( can we take a heapdump from the owner broker above)? by PID=$(ps aux | grep java | grep -v grep | awk '{print $2}'); jmap -dump:live,format=b,file=/tmp/${HOSTNAME}.hprof $PID ; tar -zcvf /tmp/${HOSTNAME}.tar.gz /tmp/${HOSTNAME}.hprof) Also, can you share the broker logs during this time? We can turn on the ExtensibleLoadBalancer debug logs by pulsar-admin brokers update-dynamic-config --config loadBalancerDebugModeEnabled --value true
  • I am speculating topic.isTransferring flag might have a bug and might be blocking the message dispatcher logic after the topic transfer. We can confirm this by heapdump. Also, can you try to unload the namespace and see if that can mitigate the issue, by pulsar-admin namespaces unload {tenant/namespace}

Meanwhile, I am trying to reproduce this issue on my end.

@heesung-sn
Copy link
Contributor

heesung-sn commented Feb 20, 2025

compose.yaml.txt

I tried to reproduce this issue by the above test setting, but I couldn't. Can you help to reproduce this issue by modifying the above setup script?

Also, for this repro step:

1. Create partitioned topic A with 100 partitions
2. Connect 4 consumers to topic A (Subscription type does not matter, we noticed this problem for shared and key_shared subscriptions)
3. Start sending data to topic A with a throughput of about 5,000 messages per second
4. Restart bookies (in our case we had like 6 bookies and 6 brokers) - we restarted it by killing k8s pods
5. After all bookies restarted and are up and running wait for 30 seconds and stop sending data
6. Check pulsar admin topic stats that backlog is only growing. We were observing backlog for every partition in grafana so we knew for which partition check stats-internal. In our stats-internal we usually see such situation:

it appears that the Step 5 kills the producer. After step 5, how do we see backlog increase without producer?

@heesung-sn
Copy link
Contributor

From this stats, #23908 (comment)

I noticed that "unackedMessages" is non-zero. Seems like either the broker or consumer missed msg acks during the bk restarts.

fyi, the ExtensibleLoadManager ignores message acks during topic.transferring state, by
https://github.com/apache/pulsar/blob/master/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1974-L1984

However, I think these unacked messages should be handled by Pulsar message re-delivery mechanism.

You can confirm this behavior by the debug logs in the above code or pulsar_lb_ignored_ack_total metric.

Partition-4 stats-internal:

{
  "entriesAddedCounter" : 285,
  "numberOfEntries" : 409,
  "totalSize" : 4481595,
  "currentLedgerEntries" : 285,
  "currentLedgerSize" : 3521965,
  "lastLedgerCreatedTimestamp" : "2025-01-31T09:46:39.629Z",
  "waitingCursorsCount" : 1,
  "pendingAddEntriesCount" : 0,
  "lastConfirmedEntry" : "2044129:284",
  "state" : "LedgerOpened",
  "ledgers" : [ {
    "ledgerId" : 2043391,
    "entries" : 124,
    "size" : 959630,
    "offloaded" : false,
    "underReplicated" : false
  }, {
    "ledgerId" : 2044129,
    "entries" : 0,
    "size" : 0,
    "offloaded" : false,
    "underReplicated" : false
  } ],
  "cursors" : {
    "microbatcher" : {
      "markDeletePosition" : "2043391:112",
      "readPosition" : "2044129:285",
      "waitingReadOp" : true,
      "pendingReadOps" : 0,
      "messagesConsumedCounter" : 273,
      "cursorLedger" : 2044305,
      "cursorLedgerLastEntry" : 44,
      "individuallyDeletedMessages" : "[(2043391:122..2043391:123],(2044129:-1..2044129:282]]",
      "lastLedgerSwitchTimestamp" : "2025-01-31T09:46:39.675Z",
      "state" : "Open",
      "active" : true,
      "numberOfEntriesSinceFirstNotAckedMessage" : 297,
      "totalNonContiguousDeletedMessagesRange" : 2,
      "subscriptionHavePendingRead" : true,
      "subscriptionHavePendingReplayRead" : false,
      "properties" : { }
    }
  },
  "schemaLedgers" : [ ],
  "compactedLedger" : {
    "ledgerId" : -1,
    "entries" : -1,
    "size" : -1,
    "offloaded" : false,
    "underReplicated" : false
  }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

No branches or pull requests

3 participants