DB consumes lots of disk #159

jkralik · 2020-01-20T12:26:10Z

Subject of the issue

The server has been running since 5 December and the DB has lots of "*.vlog" files (314) that consume 309GB.

How can I reduce it, please?
Can I remove old "*.vlog" files?

Your environment

OS Ubuntu
Version 18.04
We are using step-ca as an ACME server, with 8 services as ACME clients.

Expected behavior

I expected it to consume at most 1GB of disk.

Actual behavior

It consumes 309GB and it's growing.

dopey · 2020-01-21T20:11:03Z

Hey, apologies for the delayed response.

First off, I assume you are using the default Badger DB? That size is definitely not expected. I'm curious about your usage patterns: ballpark how many certificates have you created? You have 8 services, how often are they regenerating certificates?

Are you using the revocation feature? If not, then you can probably just start the database over entirely (meaning move the old one and stop using it and just start a new database).

More importantly, we'd definitely like to understand what's happening here. Unfortunately, we're storing the data as nosql key-value which makes it difficult to analyze without writing specific code to do that. Would you be open to sending us the database so that we can attempt to analyze it on our end?

jkralik · 2020-01-23T13:27:27Z

Hi

We are using Badger DB. Every service has two ACME clients:

one for the listen socket
a second one used to connect to other services.

E.g., gateways are exposed to the world: they use Let's Encrypt for listening, and for internal communication with mutually authenticated TLS they use step-ca (connect). Some services are internal, and in that case both ACME clients have the same configuration, pointing to the same ACME server (step-ca). All services run in k8s.

We use an ACME cert manager that provides renewal: https://github.com/go-ocf/kit/blob/5cad919232f614458aaae356353192d6a0e89706/security/acme/certManager.go#L123
Renew is called when the certificate's age is more than 2/3 of its lifetime.
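
For readers following along, the renewal rule described here can be sketched in Go roughly as follows. This is only an illustration of the "age greater than 2/3 of the lifetime" check; the function names `renewAfter` and `shouldRenew` are hypothetical and are not taken from the linked cert manager.

```go
package main

import (
	"fmt"
	"time"
)

// renewAfter returns the certificate age past which a renewal is due,
// assuming the rule "renew once the age exceeds 2/3 of the lifetime".
// Hypothetical helper, not the linked cert manager's code.
func renewAfter(lifetime time.Duration) time.Duration {
	return lifetime * 2 / 3
}

// shouldRenew reports whether a certificate of the given age and
// total lifetime is past the 2/3 mark.
func shouldRenew(age, lifetime time.Duration) bool {
	return age > renewAfter(lifetime)
}

func main() {
	lifetime := 24 * time.Hour // assumed lifetime, for illustration only

	fmt.Println("renew after:", renewAfter(lifetime))
	fmt.Println("renew at 15h:", shouldRenew(15*time.Hour, lifetime))
	fmt.Println("renew at 17h:", shouldRenew(17*time.Hour, lifetime))
}
```

With a check like this, a healthy client contacts the CA about once per certificate lifetime, so the renewal rule alone would not be expected to produce heavy database traffic.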

We don't use revocation.
We are happy to provide the database. Do you have an endpoint (access) where we can upload the DB? It is now 345GB.

jkralik · 2020-01-24T06:08:40Z

I compressed the DB with 7z and it is now 7.6GB.

dopey · 2020-01-24T06:15:33Z

Hey @jkralik that's awesome. Sorry for the delayed response, I've been chugging away at a late deadline all day.

I was thinking of easy ways for you to upload that. If you send me an ssh pub key I can give you access to a test box and then you can scp it over there. Would that work for you?

Also, follow up question, would you mind sending a snippet of logs from the CA? The rate at which the db is growing makes me think that something is pummeling the CA with requests.

jkralik · 2020-01-24T06:32:24Z

Sure.
ssh pub key: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFg88ZxptKmiOJQhG6jK/96+psDz6joEpe4+/Bd2ZFR9

I restarted the step-ca pod and all logs are lost. But I think this issue may be related to #149. We are now using a patched version with #162, and after cleaning the DB it is only 2.7MB after 12 hours.

dopey · 2020-01-24T08:15:06Z

Whoa! That's super interesting. hmmm.

Ok, I'm gonna try and pm you the host address. Also, first time I've seen an ssh ed25519 key out in the wild. cool.

dopey · 2020-01-24T08:19:26Z

Actually I don't know how to do that with github 😬. My email is [email protected]. Wanna email me and then I'll email you the host address. Sorry, hate to make this complicated.

jkralik · 2020-01-27T10:38:16Z

I found that #162 is not related. Again it takes 3GB after three days of running... I will provide the smaller one for you.

jkralik · 2020-01-27T13:10:53Z

I found where the issue was. I expected the lego client to fill the resource with PrivateKey when Renew is called with a CSR, but it just sets the certificate and CSR without PrivateKey... Sorry, my fault.

dopey · 2020-01-27T19:50:28Z

@jkralik I don't know the Lego client well enough. But, did this cause some sort of loop that continuously hit the db? I guess I'm not understanding why this was causing the DB to expand so rapidly.

jkralik · 2020-01-28T07:10:49Z

I have a loop in my cert manager that renews the certificate, and when any call fails it tries again in 15 seconds. This means renew was called every 15 seconds. In my case the problem was in https://github.com/go-ocf/kit/blob/cbf12801499b2699b37d72c79f66d8c261d7767e/security/certManager/acme/certManager.go#L238 - this function fails because PrivateKey was empty. I then fixed it with commit plgd-dev/kit@cbf1280#diff-b1659f964b8384a232f0aec94303c811
-> it sets the private key from the previous certificate if it is not set in the new one.
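
The idea behind that fix can be sketched in Go as below. This is a minimal illustration, not the code from the linked commit: the `resource` struct is a hypothetical stand-in that only mirrors the three fields mentioned in this thread (PrivateKey, Certificate, CSR), and `withPrivateKey` is an invented helper name.

```go
package main

import "fmt"

// resource is a hypothetical stand-in for the certificate resource an
// ACME client hands back after a renewal. Only the fields discussed in
// this thread are modelled.
type resource struct {
	PrivateKey  []byte
	Certificate []byte
	CSR         []byte
}

// withPrivateKey returns the renewed resource, copying the private key
// from the previous resource when the renewed one arrives without it
// (as reported here for a renewal performed with a CSR).
func withPrivateKey(prev, renewed resource) resource {
	if len(renewed.PrivateKey) == 0 {
		renewed.PrivateKey = prev.PrivateKey
	}
	return renewed
}

func main() {
	prev := resource{PrivateKey: []byte("old-key"), Certificate: []byte("old-cert")}
	renewed := resource{Certificate: []byte("new-cert"), CSR: []byte("csr")}

	fixed := withPrivateKey(prev, renewed)
	fmt.Println("has private key:", len(fixed.PrivateKey) > 0)
}
```

Without a step like this, loading the renewed certificate fails for lack of a key, the failure is treated as a failed renewal, and the 15-second retry never stops.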

dopey · 2020-01-28T20:51:34Z

Interesting. Even if you were renewing every 15 seconds, it's hard for me to understand how you could possibly be generating that much data. If you still have access to the 3GB database, I'd love to take a look at it.
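
To put rough numbers on that question, here is a back-of-the-envelope sketch in Go. It only uses figures reported in this thread (a retry every 15 seconds, about 3GB after three days). The number of clients stuck in the retry loop is not stated anywhere in the thread, so the value 8 below is purely an assumption, as is reading 1GB as 10^9 bytes.

```go
package main

import "fmt"

const secondsPerDay = 24 * 60 * 60

// attemptsPerDay returns how many renew attempts one client makes per
// day when it retries at a fixed interval (in seconds).
func attemptsPerDay(retryIntervalSeconds int) int {
	return secondsPerDay / retryIntervalSeconds
}

// bytesPerAttempt spreads the observed growth evenly over all renew
// attempts made by the given number of clients during the given days.
func bytesPerAttempt(totalBytes int64, days, clients, retryIntervalSeconds int) int64 {
	attempts := int64(days) * int64(clients) * int64(attemptsPerDay(retryIntervalSeconds))
	return totalBytes / attempts
}

func main() {
	const (
		retrySeconds = 15            // from the thread
		days         = 3             // from the thread
		totalBytes   = 3_000_000_000 // "3GB", assuming 1GB = 10^9 bytes
		clients      = 8             // assumption: not stated in the thread
	)

	fmt.Println("attempts per client per day:", attemptsPerDay(retrySeconds))
	fmt.Println("approx. bytes written per attempt:",
		bytesPerAttempt(totalBytes, days, clients, retrySeconds))
}
```

Changing `clients` shows how sensitive the per-attempt figure is to that unknown, which is one reason the actual database contents are needed to explain the growth.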

jkralik · 2020-01-30T09:07:15Z

Sure. I uploaded a new archive at /newvol/smallstep-private.bck.1.7z. It contains the DB and the log, but I have extended logs about logging middleware and challenges.

ki-pete · 2021-09-15T11:53:08Z

Hi @dopey,
did you have a look at the attached DB file? I'm currently investigating a similar issue and it's hard for me to find the root cause.
BTW: Do you have a hint how to open the Badger DB and take a look into it?
BR Kim

dopey · 2021-09-15T18:11:33Z

Hey @ki-pete want to hop in our Discord? It might be easier to debug in real time. https://discord.gg/fX5VJZAc

Here is a script you can use to count the rows in each table in your DB: https://gist.github.com/dopey/8e9206073e2cb052b6f633c0b7d4d8df. We'll want that info to help with debugging.

jkralik closed this as completed Jan 27, 2020
