FRR sometimes starts with incomplete config #15799

Open
2 tasks done
hillsp opened this issue Apr 19, 2024 · 6 comments
Labels
triage Needs further investigation

hillsp commented Apr 19, 2024

Description

We are trying to use FRR/bgpd to originate a large number of prefixes and have run into issues where it occasionally starts up with incomplete config, especially when the system is loaded:

ubuntu@ip-10-2-0-11:~$ sudo vtysh -c 'sh run' | wc -l
11378
ubuntu@ip-10-2-0-11:~$ wc -l /etc/frr/frr.conf
24039 /etc/frr/frr.conf

On our test system, the issue happens occasionally during a normal restart but is 100% reproducible when the system is under CPU load (I run one instance of yes > /dev/null & for every CPU core).
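For anyone trying to recreate the load condition, here is a minimal sketch of the per-core `yes` load described above. The helper names `start_load` and `stop_load` are made up for illustration, and the commented usage assumes GNU `nproc` and a systemd unit named `frr`, neither of which is confirmed in this report.

```shell
# Hypothetical helpers: start N background "yes" processes and stop them later.
LOAD_PIDS=""

start_load() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        yes > /dev/null &
        LOAD_PIDS="$LOAD_PIDS $!"   # remember each PID so it can be killed later
        i=$((i + 1))
    done
}

stop_load() {
    for pid in $LOAD_PIDS; do
        kill "$pid" 2>/dev/null
    done
    LOAD_PIDS=""
}

# Example usage (assumptions: GNU nproc, systemd unit named "frr"):
#   start_load "$(nproc)"
#   sudo systemctl restart frr
#   sudo vtysh -c 'sh run' | wc -l
#   stop_load
```

Tracking the PIDs explicitly avoids killing unrelated `yes` processes that might be running on the same machine.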

While trying to debug this, I noticed that the problem goes away if I add --no-fork to the vtysh -b command in /usr/lib/frr/frrcommon.sh.

Version

Reproduced with both 9.1 and a fresh clone from master.

How to reproduce

Only bgpd is enabled in /etc/frr/daemons:

zebra_options="  -A 127.0.0.1 -s 90000000"
bgpd_options="   -A 127.0.0.1 --no_kernel"
frr_profile="traditional"

frr.conf I've been using to repro, with a lot of synthetic test /32s removed:

hostname ip-10-2-0-11
password zebra
service password-encryption
route-map prepend-1 permit 10
  set as-path prepend 65001 65101
route-map ALLOW-ALL permit 10
route-map out-map-1 permit 10
  match ip address prefix-list out-map-1
ip prefix-list out-map-1 seq 5 permit 101.101.1.1/32
[snip 11,998 lines]
ip prefix-list out-map-1 seq 60000 permit 101.101.95.99/32
router bgp 65002
  no bgp network import-check
  bgp router-id 169.254.29.10
  network 100.100.1.1/32
  network 101.101.72.93/32 route-map prepend-1
  [snip 11,998 lines]
  network 101.101.31.205/32 route-map prepend-1
  neighbor 169.254.29.9 remote-as 65103
  neighbor 169.254.29.9 activate
  neighbor 169.254.29.9 ebgp-multihop
  neighbor 169.254.29.9 soft-reconfiguration inbound
  neighbor 169.254.29.9 route-map ALLOW-ALL in
  neighbor 169.254.29.9 route-map out-map-1 out
log file /var/log/frr/combined.log

Expected behavior

FRR should start with a complete set of config (vtysh -c 'sh run' | wc -l should return > 24,000 lines)

Actual behavior

FRR starts without all config loaded.

Additional context

No response

Checklist

  • I have searched the open issues for this bug.
  • I have not included sensitive information in this report.

hillsp added the triage (Needs further investigation) label Apr 19, 2024

hillsp commented Apr 22, 2024

After debugging this some, I see watchfrr is giving up because it takes too long to read the config. From systemd journal:

Apr 22 17:09:03 ip-10-0-1-65 watchfrr[19556]: [ZE9RA-19PS5] restart all child process 19557 still running after 20 seconds, sending signal 15
Apr 22 17:09:03 ip-10-0-1-65 watchfrr[19556]: [SK7QP-A2GT9] restart all process 19557 terminated due to signal 15

Essentially, vtysh -b is being killed before config load completes, leaving a partially-configured system behind.
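To check whether a system is hitting the same timeout, a small filter over the log can help. This is only a sketch: the function name `find_restart_timeouts` is hypothetical, the pattern is derived from the two journal lines quoted above, and the commented `journalctl` invocation assumes a systemd unit named `frr`.

```shell
# Reads log lines on stdin and prints those where watchfrr reports that a
# child process was still running when its restart timeout expired.
find_restart_timeouts() {
    grep -E 'watchfrr\[[0-9]+\]: .*still running after [0-9]+ seconds'
}

# Example usage (assumption: systemd unit named "frr"):
#   journalctl -u frr --no-pager | find_restart_timeouts
```

If this prints anything around the time of a restart, the partial config is likely the result of `vtysh -b` being terminated early rather than a parse error in frr.conf.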

hillsp commented Apr 22, 2024

It looks like watchfrr has code that's supposed to handle this - reading_configuration is set to true once watchfrr has finished its config load. Unfortunately, now that vtysh -b forks, watchfrr may process its config long before the other daemons, causing the above timeout.

I have been able to work around this by setting watchfrr_options="--restart-timeout=60" in /etc/frr/daemons. Editing frrcommon.sh to pass --no-fork to vtysh -b also works, but is of course slower.
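A rough sketch of applying the first workaround in a repeatable way. The function name `set_watchfrr_timeout` is hypothetical. It assumes the daemons file holds at most one `watchfrr_options=` line starting at column 0, and it replaces that line outright, so any other watchfrr options already set there would need to be merged back in by hand.

```shell
# set_watchfrr_timeout DAEMONS_FILE SECONDS
# Rewrites DAEMONS_FILE so it contains exactly one watchfrr_options line
# with the given restart timeout. Other lines are kept unchanged.
set_watchfrr_timeout() {
    daemons_file=$1
    timeout=$2
    tmp=$(mktemp) || return 1
    # Copy everything except an existing watchfrr_options assignment.
    grep -v '^watchfrr_options=' "$daemons_file" > "$tmp"
    # Append the new setting, using the value from the workaround above.
    printf 'watchfrr_options="--restart-timeout=%s"\n' "$timeout" >> "$tmp"
    # Write back in place so the file keeps its owner and permissions.
    cat "$tmp" > "$daemons_file"
    rm -f "$tmp"
}

# Example usage (run as root, then restart FRR):
#   set_watchfrr_timeout /etc/frr/daemons 60
```

The 60-second value is simply what worked on this test system; a larger config or a more heavily loaded machine may need more.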

This issue is stale because it has been open 180 days with no activity. Comment or remove the autoclose label in order to avoid having this issue closed.

frrbot bot commented Oct 20, 2024

This issue will be automatically closed in the specified period unless there is further activity.

hillsp commented Oct 21, 2024

I've been able to work around this, but it is a real bug for people using large configurations. Ideally, there should be a positive acknowledgment that all config has been loaded, rather than a semi-fixed, hard-to-debug timeout that kicks in when the config is too large.
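Until something like that exists, one way to approximate a positive check from outside FRR is to poll the running config until it reaches an expected size. This is a sketch only: `wait_for_lines` and `POLL_INTERVAL` are hypothetical names, and the line threshold is something the operator has to choose, since the output of `sh run` is not guaranteed to match frr.conf line for line.

```shell
# wait_for_lines MIN_LINES MAX_TRIES COMMAND [ARGS...]
# Runs COMMAND up to MAX_TRIES times and succeeds as soon as its output has
# at least MIN_LINES lines. Returns non-zero if that never happens.
wait_for_lines() {
    min_lines=$1
    max_tries=$2
    shift 2
    try=0
    while [ "$try" -lt "$max_tries" ]; do
        # Strip whitespace because some wc implementations pad the count.
        lines=$("$@" | wc -l | tr -d '[:space:]')
        if [ "$lines" -ge "$min_lines" ]; then
            return 0
        fi
        try=$((try + 1))
        sleep "${POLL_INTERVAL:-1}"   # seconds between polls, default 1
    done
    return 1
}

# Example usage, with the threshold from "Expected behavior" above:
#   wait_for_lines 24000 60 sudo vtysh -c 'sh run' || echo "config incomplete"
```

This only detects an incomplete load after the fact. It does not stop watchfrr from terminating `vtysh -b`, so it complements the timeout workaround rather than replacing it.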

@frrbot frrbot bot removed the autoclose label Oct 21, 2024

frrbot bot commented Oct 21, 2024

This issue will no longer be automatically closed.
