
Intermittent EOF Errors #1464

Open · 0xrelapse opened this issue Jan 4, 2025 · 5 comments
Labels: bug, cloud (ClickHouse Cloud related tests)

Comments

@0xrelapse

Observed

We're getting this error periodically via the Go ClickHouse driver when connecting to ClickHouse Cloud:

read:
    github.com/ClickHouse/ch-go/proto.(*Reader).ReadFull
        /go/pkg/mod/github.com/!click!house/[email protected]/proto/reader.go:62
  - EOF

Here is our config (with connection details omitted):

	// nolint:exhaustruct
	conn, err := clickhouse.Open(&clickhouse.Options{
		TLS:      ...,
		Protocol: clickhouse.Native,
		Addr:     ....,
		Auth: clickhouse.Auth{
			Username: ...,
			Password: ...,
			Database: ...,
		},
		ClientInfo: clickhouse.ClientInfo{
			Products: []struct {
				Name    string
				Version string
			}{
				{Name: "....", Version: "0.1"},
			},
		},
		Compression: &clickhouse.Compression{
			Method: clickhouse.CompressionLZ4,
		},
		BlockBufferSize: 10,
		MaxOpenConns:    70,
		MaxIdleConns:    50,
	})

After this error occurs, other queries (both reads and writes) also fail with the same EOF error.

Because this is intermittent, it's a little hard to reproduce.

I wonder if there's a race condition in the connection lifetime cleanup routine?
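If stale pooled connections are part of the problem, one commonly suggested mitigation (an assumption here, not a confirmed fix) is to cap connection lifetime so the pool retires connections before a server-side idle timeout can silently sever them. A sketch against the same Options struct as above; ConnMaxLifetime and DialTimeout are real clickhouse-go v2 fields, but the durations are illustrative guesses:

```go
// Sketch only: recycle pooled connections periodically so the pool never
// hands out a connection the server may have already closed. The
// durations below are illustrative assumptions, not tested recommendations.
conn, err := clickhouse.Open(&clickhouse.Options{
	// ... same connection details as above ...
	DialTimeout:     30 * time.Second, // leave headroom for a Cloud wake-up
	ConnMaxLifetime: 10 * time.Minute, // retire connections before idle cutoffs
	MaxOpenConns:    70,
	MaxIdleConns:    50,
})
```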


Environment

  • clickhouse-go version: v2.30.0
  • Interface: database/sql compatible driver
  • Go version: 1.22.10
  • Operating system: amazon-linux-2023
  • ClickHouse version: 24.8
  • Is it ClickHouse Cloud? Yes
  • ClickHouse Server non-default settings, if any: everything is default
@jkaflik (Contributor) commented Jan 7, 2025

@SpencerTorres could you take a look?

@SpencerTorres (Member)
@0xrelapse Could you add some details about how you're connecting and what types of queries you're running? How frequently, etc.? Also let me know if you have any special settings in the TLS config. This issue will be hard to reproduce.

@begelundmuller commented Feb 4, 2025

@SpencerTorres We're also seeing this issue. It seems to happen when connecting to a ClickHouse Cloud service that is idle, so the connection has to wait for the service to scale up from zero.

We are connecting with the HTTP protocol (not native protocol) and with TLS enabled, but apart from that there's no other custom connection config.

@SpencerTorres (Member)

@begelundmuller I appreciate the extra insight... This makes more sense as the root cause. I suppose this would need to be handled at the application level? Perhaps some kind of retry or health check to verify the server is ready for connections? If this is a production instance you could also disable the service's sleep timeout.

I'll ask around internally to see if we have any other suggestions for this. What can the client do if it's ultimately a networking issue? We could add some extra logic to verify the connection is ready, but either way the application should already be prepared to handle network outages.

@SpencerTorres added the cloud (ClickHouse Cloud related tests) label and removed the needs triage label on Feb 4, 2025
@begelundmuller commented Feb 5, 2025

@SpencerTorres Thank you for investigating! My impression from the docs is that the autoscaling works similarly to services like AWS Lambda, i.e. the proxy keeps the connection alive until the underlying cluster has scaled up and is ready to serve queries. So I wonder if this might just be a bug in the proxy implementation.

If retries are required in the application logic, it would be helpful if you could provide guidance on how to implement them. However, if retries are indeed needed, I think they might have to be implemented inside this driver, since the database/sql interface doesn't provide granular control over the connection pool.

4 participants