Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Principal=>subordinate coordination for logrotation (grafana-agent causing a lot of unreleased files and huge syslog) #153

Open
taurus-forever opened this issue Jul 16, 2024 · 16 comments

Comments

@taurus-forever
Copy link

Bug Description

Hi,

Please check the complete issue description in PostgreSQL repo:
canonical/postgresql-operator#524

TL;DR: PostgreSQL charm rotates logs but doesn't send any signals to subordinated grafana-agent causing a lot of unreleased files and huge syslog => downtime.

It is a cross-team ticket to build a solution here.

To Reproduce

See steps to reproduce in canonical/postgresql-operator#524

Environment

See Versions in canonical/postgresql-operator#524

Relevant log output

No relevant logs output.

Additional context

Proposals:

  • (preferred) subordinated charm(s) should bring/install logrotate config, so principal charm will execute all of them.
  • principal charm should (somehow) detects subordinates and inform them (somehow) using signal (SIGHUP?)
    Better ideas are welcome!
@taurus-forever
Copy link
Author

@simskij do you have good ideas how to proceed here?
DB charm should somehow aware about COS promtail to SIGHUP it properly.

@lucabello
Copy link
Contributor

lucabello commented Aug 30, 2024

@taurus-forever Do you think this would be solved by not getting all the files from /var/log/** and instead only get a few that you specify?

@taurus-forever
Copy link
Author

taurus-forever commented Sep 6, 2024

@lucabello AFAIK, no. The grafana-agent charm / promtail binary (did?) read some log files from disk (to send them to Loki). PostgreSQL charm rotates logs, but didn't send any signals to COS to close the current descriptor and reopen the file (as the old one moved to archive folder).

IMHO, we have two options:

  • send all logs through pebble and never read logs on disk. AFAIK it is a current COS plan. Pros: 1) VM has no Pebble. Journald has performance issues, etc. 2) some services (e.g. pgbackest) requires complex files structure and do not support syslog approach.
  • teach principal/PostgreSQL charm to send signals on logrorate to subordinated/grafa-agent.
  • ...
    Better ideas are welcome!

@sed-i
Copy link
Contributor

sed-i commented Sep 6, 2024

  • Grafana folks recommend using the "rename and create" rotation strategy over "copy and truncate". Do you know what rotation strategy is set up for postgresql? Is that configurable on your side?
  • Are you saying that slurping the archive patroni/patroni.log.* is incorrect, and we should only be reading the one file patroni/patroni.log? That could be related to the glob "/var/log/**/*log" we have in place. But I'm surprised that filenames such as patroni.log.6545 or patroni.log.10080 would match the glob. I confirmed that python's glob meets my expectation ([print(p.name) for p in Path(".").glob("**/*log")]) but not sure how it's impl'd in gagent.

I'd appreciate your input on the above @taurus-forever.

@dragomirp
Copy link
Contributor

  • Grafana folks recommend using the "rename and create" rotation strategy over "copy and truncate". Do you know what rotation strategy is set up for postgresql? Is that configurable on your side?

Patroni should be using an extended version of Python's RotatingFileHandler. Python's docs indicate “rename and create”. We can only configure size and amount of files to keep.

Postgresql is configured to keep a week's worth of per minute logs that are truncated each minute. This is behaviour configured by us, but it is spec behaviour so discussions would be necessary to change it.

Both Patroni and Postgresql try to keep about a week's worth of per minute logs, so that should be about 10k files for each.

  • Are you saying that slurping the archive patroni/patroni.log.* is incorrect, and we should only be reading the one file patroni/patroni.log? That could be related to the glob "/var/log/**/*log" we have in place. But I'm surprised that filenames such as patroni.log.6545 or patroni.log.10080 would match the glob. I confirmed that python's glob meets my expectation ([print(p.name) for p in Path(".").glob("**/*log")]) but not sure how it's impl'd in gagent.

I don't know how the agent detects log changes, but for Patroni only the last few logs should be relevant, the deeper backlog was likely already synced and shouldn't be changing. For Postgresql things are trickier, since the files are the same (postgresql-%w_%H%M.log) but will be overwritten when the same time comes.

@Vultaire
Copy link

I'm hoping this is still on peoples' radar - this interaction between postgresql and grafana-agent is creating truly excessive logging in some environments. I'm seeing roughly 8 gigs of log lines a day just related to events related to the patroni log files. This is quickly filling the disk of the hosts in one of the clouds I am working on. (10,080 log lines * roughly 300 characters per line * 2,800 times a day (twice a minute) = 8,467,200,000 bytes as a rough estimate)

@Vultaire
Copy link

Vultaire commented Dec 13, 2024

Adding additional context, per discussion with the obs team: here is the related job in /etc/grafana-agent.yaml:

    - job_name: charmed-postgresql-common-var-log-patroni
      pipeline_stages:
      - drop:
          expression: .*file is a directory.*
      relabel_configs:
      - replacement: /patroni
        source_labels:
        - __path__
        target_label: path
      static_configs:
      - labels:
          __path__: /snap/grafana-agent/51/shared-logs/patroni/**
          job: charmed-postgresql-common-var-log-patroni
          juju_application: postgresql
          juju_charm: grafana-agent
          juju_model: landscape-dev
          juju_model_uuid: $REDACTED
          juju_unit: postgresql/1
          snap_name: charmed-postgresql
        targets:
        - localhost

@sed-i
Copy link
Contributor

sed-i commented Dec 13, 2024

Summary of our call:

  • patroni.log.N files are rotated, and there are a lot of them, for some reason - from patroni.log.1 to patroni.log.10080.
  • grafana agent periodically logs to syslog a CREATE event for all of the patroni log files, including the old ones that were already rotated.
  • patroni rotates files every minute, so the stat of all files changes every minute.
  • Even when grafana-agent is down, the patroni files are still rotated every minute, so grafana-agent isn't "locking" anything.

Per comment by @dragomirp this is by design:

Postgresql is configured to keep a week's worth of per minute logs that are truncated each minute. This is behaviour configured by us, but it is spec behaviour so discussions would be necessary to change it.

We could see how ergonomic it would be to make the ** glob configurable so that grafana agent tracks only a subset of patroni files:

job = self._snap_plug_job(
endpoint.owner,
f"{path}/**",
topology.application,
str(topology.unit),
self._path_label(path),
)

That being said, the patroni log files seem to have little value:

root@dev:/var/snap/charmed-postgresql/common/var/log/patroni# cat patroni.log.7404
2024-12-08 13:25:52 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-12-08 13:26:02 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-12-08 13:26:12 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-12-08 13:26:22 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock
2024-12-08 13:26:32 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock 
2024-12-08 13:26:42 UTC [9279]: INFO: no action. I am (postgresql-1), the leader with the lock
root@dev:/var/snap/charmed-postgresql/common/var/log/patroni# cat patroni.log.282
2024-12-13 12:07:03 UTC [9279]: ERROR: Error communicating with DCS 
2024-12-13 12:07:04 UTC [9279]: INFO: demoting self because DCS is not accessible and I was a leader 
2024-12-13 12:07:04 UTC [9279]: INFO: Demoting self (offline) 
2024-12-13 12:07:04 UTC [9279]: WARNING: Loop time exceeded, rescheduling immediately. 
2024-12-13 12:07:04 UTC [9279]: INFO: Lock owner: postgresql-1; I am postgresql-1 
2024-12-13 12:07:04 UTC [9279]: INFO: updated leader lock during demoting self because DCS is not accessible and I was a leader

I wonder if:

  • the logging policy for patroni can be decoupled from postgres, i.e. have different retention and rotation settings.
  • reduce the rotation period to e.g. 1 day, so way have much less files to track.

Another option: see if we can have grafana agent not log CREATE events.

@verterok
Copy link

verterok commented Jan 22, 2025

Hi,
I just hit this excessive logging in a PostgreSQL cluster deployment, luckily it was not yet in use and the first node to go down was in standby mode.

My current workaround is to route grafana-agent logs to a dedicated log file and rotate it daily, attached quick script to tweak the rsyslog and logrotate configs in case it is useful for someone else.

EDIT, looks like attaching the file failed, reproduced below:

#!/bin/bash

[[ "$1" == "" ]] && exit 1

juju ssh $1 << 'EOF1'
sudo su

cat > /etc/logrotate.d/grafana-agent << EOF
/var/log/grafana-agent.log {
    daily
    rotate 7
    create 0644 syslog root
    missingok
    notifempty
    compress
    delaycompress
    dateext
    postrotate
        [ ! -x /usr/lib/rsyslog/rsyslog-rotate ] || /usr/lib/rsyslog/rsyslog-rotate
    endscript
}
EOF

cat > /etc/rsyslog.d/22-grafana-agent.conf << 'EOF'
if $programname == 'grafana-agent.grafana-agent' then {
   action(type="omfile" file="/var/log/grafana-agent.log" fileOwner="root" fileGroup="syslog")
   stop
}
EOF

systemctl restart rsyslog
EOF1

@sed-i
Copy link
Contributor

sed-i commented Jan 22, 2025

Another option is to tell grafana-agent to exclude old patroni files.
A drawback is that if grafana-agent incurs some down time, then we may have a gap in logs scraped from patroni (e.g. in loki).

@verterok
Copy link

Hi,
I was not suggesting my approach as a solution, as I mentioned is a workaround to avoid getting the instance disks filled in a couple of days and might be useful for cases were it is affecting production workloads.

My 2c, I don't know details of grafana-agent operator implementation, but purely from a user POV it would be nice if the default is not to fill the disk even if the log level is verbose like in this case (an option is to drop a rsyslog and logrotate file like in my workaround or some other way to achieve the same control of file size/rotation frequency)

Cheers

@sed-i
Copy link
Contributor

sed-i commented Jan 23, 2025

Hey @verterok,

Grafana agent logs op=CREATE with level=info:

Jun 30 00:00:03 landscapesql-new-1 grafana-agent.grafana-agent[416696]: ts=2024-06-30T00:00:03.940798064Z caller=filetargetmanager.go:181 level=info component=logs logs_config=log_file_scraper msg="received file watcher event" name=/snap/grafana-agent/16/shared-logs/patroni/patroni.log.6545 op=CREATE

You could raise log_level to warn:

juju config grafana-agent log_level=warn

@verterok
Copy link

verterok commented Jan 24, 2025

Hi,

There seems to be no such config in the latest/stable charm ( latest/stable, 332 )

$ juju config grafana-agent log_level=warn
{}
ERROR parsing settings for application: unknown option "log_level"

FWIW, my daily rotation config/setup was not enough and I got a dead postgresql cluster (luckily not a production one)

@marceloneppel
Copy link
Member

Hi,

There seems to be no such config in the latest/stable charm ( latest/stable, 332 )

$ juju config grafana-agent log_level=warn
{}
ERROR parsing settings for application: unknown option "log_level"

FWIW, my daily rotation config/setup was not enough and I got a dead postgresql cluster (luckily not a production one)

From what I see, that option (log_level) is only available from the candidate risk.

@marceloneppel
Copy link
Member

Another option: see if we can have grafana agent not log CREATE events.

Unfortunately, it seems that this is not possible as gagent isn't exposing any way to change the list of events it's logging.

@marceloneppel
Copy link
Member

marceloneppel commented Jan 24, 2025

Grafana agent logs op=CREATE with level=info:

Jun 30 00:00:03 landscapesql-new-1 grafana-agent.grafana-agent[416696]: ts=2024-06-30T00:00:03.940798064Z caller=filetargetmanager.go:181 level=info component=logs logs_config=log_file_scraper msg="received file watcher event" name=/snap/grafana-agent/16/shared-logs/patroni/patroni.log.6545 op=CREATE

You could raise log_level to warn:

juju config grafana-agent log_level=warn

@sed-i Regarding that config option (log_level), is there any other important info-level log message that should be shown by default for debugging purposes? If not, would it be okay to set the default to warn?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants