Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: Collation buffer memory leak in unicode_loose_* vindexes in v18 and below #14587

Closed
brendar opened this issue Nov 22, 2023 · 5 comments

Comments

@brendar
Copy link
Contributor

brendar commented Nov 22, 2023

Overview of the Issue

We saw vtgates in one of our vitess clusters using over 4 GB of memory and being OOM killed, while vtgates in other clusters were generally using less than 1 GB of memory for a similar workload. A pprof heap profile (attached below) showed that nearly all of the memory was allocated from golang.org/x/text/collate.appendPrimary and a read through the call graph and code found that it was being retained by pooledCollator.buf. The use of sync.Pool made the issue challenging to debug, as the pool allows objects to be garbage collected under some (apparently non-deterministic) circumstances, so we had some outliers using tons of memory then suddenly dropping back to the norm.

I was able to reproduce the issue locally on v18.0.0 following the steps outlined below, and I was able to fix the issue by calling collator.buf.Reset() before returning the collator to the pool. e.g. Shopify#131

This issue does not exist in main as the vindex collation logic was replaced in #14395 and the new shared buffer is reset here.

I think this is worth fixing and backporting, but I'm not sure which branch to start with a PR against. release-18.0?


pprof heap profile call graph:

profile001

Reproduction Steps

Using local examples from v18

  1. Edit go/vt/vtgate/vindexes/unicode.go to not use a pool (just a single shared pooledCollator protected by a mutex)
    This ensures that reproduction isn't impacted by the nondeterministic nature of sync.Pool, i.e. https://pkg.go.dev/sync#Pool

Any item stored in the Pool may be removed automatically at any time without notification. If the Pool holds the only reference when this happens, the item might be deallocated.

  1. SKIP_VTADMIN=true ./101_initial_cluster.sh
  2. Create a new sharded "things" keyspace:
source ../common/env.sh

for i in 300 301 302; do
	CELL=zone1 TABLET_UID=$i ../common/scripts/mysqlctl-up.sh
	SHARD=-80 CELL=zone1 KEYSPACE=things TABLET_UID=$i ../common/scripts/vttablet-up.sh
done

for i in 400 401 402; do
	CELL=zone1 TABLET_UID=$i ../common/scripts/mysqlctl-up.sh
	SHARD=80- CELL=zone1 KEYSPACE=things TABLET_UID=$i ../common/scripts/vttablet-up.sh
done

# set the correct durability policy for the keyspace
vtctldclient --server localhost:15999 SetKeyspaceDurabilityPolicy --durability-policy=semi_sync things || fail "Failed to set keyspace durability policy on the things keyspace"

for shard in "-80" "80-"; do
	# Wait for all the tablets to be up and registered in the topology server
	# and for a primary tablet to be elected in the shard and become healthy/serving.
	wait_for_healthy_shard things "${shard}" || exit 1
done
  1. Create the VSchema:
vtctldclient ApplyVSchema --vschema '
    {
        "sharded": true,
        "vindexes": {
            "unicode_loose_md5": {
                "type": "unicode_loose_md5"
            }
        },
        "tables": {
            "things": {
                "column_vindexes": [
                    {
                        "column": "id",
                        "name": "unicode_loose_md5"
                    }
                ]
            }
        }
    }
    ' things || fail "Failed to create vschema in sharded things keyspace"
  1. Create the SQL schema:
vtctldclient ApplySchema --sql '
  create table if not exists things(
    id varchar(255) not null,
    primary key(id)
  ) ENGINE=InnoDB;
' things || fail "Failed to create tables in the things keyspace"
  1. Check vtgate memory use
cat /proc/$(cat /tmp/vtdataroot/tmp/vtgate.pid)/status | grep RSS
  1. Run lots of queries
go run run_queries.go
  1. Keep checking memory use

run_queries.go

package main

import (
	"context"
	"log"

	_ "vitess.io/vitess/go/vt/vtctl/grpcvtctlclient"
	_ "vitess.io/vitess/go/vt/vtgate/grpcvtgateconn"
	"vitess.io/vitess/go/vt/vtgate/vtgateconn"
)

func main() {
	ctx := context.Background()
	conn, err := vtgateconn.Dial(ctx, "localhost:15991")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	session := conn.Session("things", nil)
	for {
		_, err := session.Execute(ctx, "SELECT * FROM things WHERE id = 'fooBARfooBARfooBARfooBARfooBAR'", nil)
		if err != nil {
			log.Fatal(err)
		}
	}
}

Binary Version

vtgate version Version: 18.0.0 (Git revision 9a6f5262f7707ff80ce85c111d2ff686d85d29cc branch 'HEAD') built on Wed Nov 22 19:45:23 UTC 2023 by spin@localhost using go1.21.4 linux/amd64

Operating System and Environment details

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

$ uname -sr
Linux 5.15.0-87-generic

$ uname -m
x86_64

Log Fragments

No response

@brendar brendar added Needs Triage This issue needs to be correctly labelled and triaged Type: Bug labels Nov 22, 2023
@deepthi
Copy link
Member

deepthi commented Nov 22, 2023

Thank you for opening this. I believe that the change in #14395 is too big to be back ported to v18 and earlier, so if you could start with a PR against release-18.0, that will be appreciated. We can mark that to be back ported to older release branches.
cc @vmg

@deepthi deepthi added Component: Query Serving Type: Performance and removed Type: Bug Needs Triage This issue needs to be correctly labelled and triaged labels Nov 22, 2023
@vmg
Copy link
Collaborator

vmg commented Nov 23, 2023

The fix in Shopify#131 looks correct. Sorry you had to spend time debugging an issue we already fixed! Hopefully you can upgrade soon to a version that contains #14395, because besides not leaking memory, it is much much faster!

@brendar
Copy link
Contributor Author

brendar commented Nov 27, 2023

Hopefully you can upgrade soon to a version that contains #14395, because besides not leaking memory, it is much much faster!

For sure. We'll be pushing on upgrades in 2024. Lots of versions to catch up on. Vitess moves fast!

@brendar
Copy link
Contributor Author

brendar commented Nov 27, 2023

PR against release-18.0 here: #14621

@brendar
Copy link
Contributor Author

brendar commented Nov 28, 2023

Closing this as the fix has been merged.

We've also deployed the fix on our internal patched v15 release and memory use is now stable at a few hundred MB, with no outliers.

@brendar brendar closed this as completed Nov 28, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants