RFE: enable changing the number of AVC hash buckets at runtime #34
At present the number of AVC hash buckets is hard-coded to 512; we should look into making this tunable at runtime. While 512 buckets tends to work well for most workloads, it is proving to be too small for systems with a large number of unique labels, such as container hosts using MCS/sVirt.

Comments
I'm not sure this calls for something like the kernel's lib/rhashtable.c implementation. Since the AVC is a cache, and I expect size adjustments to be rare, we can probably get away with throwing out the old table and replacing it with a new, empty table.
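If the replace-rather-than-rehash approach is taken, the core of it might look something like the userspace sketch below. The names and the mutex are illustrative only, not the kernel's; the real AVC uses RCU-protected hlists, so the in-kernel version would have to wait for readers (e.g. synchronize_rcu()) before freeing the old table.

```c
#include <stdlib.h>
#include <pthread.h>

/* Hypothetical cache node; the real AVC entry is struct avc_node. */
struct node {
	unsigned long key;
	struct node *next;
};

static struct node **buckets;		/* current bucket array */
static unsigned int nbuckets = 512;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Resize by replacement: allocate a new, empty bucket array, swap it
 * in, and free every old entry.  Because the AVC is a cache, discarded
 * entries are simply recomputed on the next miss; no rehashing needed.
 */
static int resize_cache(unsigned int new_nbuckets)
{
	struct node **nb = calloc(new_nbuckets, sizeof(*nb));
	struct node **old;
	unsigned int i, old_n;

	if (!nb)
		return -1;

	pthread_mutex_lock(&lock);
	old = buckets;
	old_n = nbuckets;
	buckets = nb;
	nbuckets = new_nbuckets;
	pthread_mutex_unlock(&lock);

	/* Drop the old table; misses will repopulate the new one. */
	for (i = 0; old && i < old_n; i++) {
		struct node *n = old[i];
		while (n) {
			struct node *next = n->next;
			free(n);
			n = next;
		}
	}
	free(old);
	return 0;
}
```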
Do you really need to change the number of buckets, or just the threshold/max number of cache entries? The latter can already be tuned via /sys/fs/selinux/avc/cache_threshold. Do we have some data, e.g. cat /sys/fs/selinux/avc/hash_stats, from these systems?
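For reference, both files live under the selinuxfs mount (cache_threshold is also writable, which is how the limit gets raised). A trivial reader, equivalent to cat, looks like:

```c
#include <stdio.h>

/* Dump one of the existing AVC tunables/statistics files. */
static void dump(const char *path)
{
	char line[256];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return;
	}
	printf("== %s ==\n", path);
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
}

int main(void)
{
	dump("/sys/fs/selinux/avc/cache_threshold");
	dump("/sys/fs/selinux/avc/hash_stats");
	return 0;
}
```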
I'm hearing of systems that have bumped the threshold up to ~65k and are hitting that limit; with only 512 buckets that works out to roughly 128 entries per bucket on average, and the resulting lengthy per-bucket chains are causing spikes in CPU usage in avc_has_perm().
Why would we end up with that many unique AVC entries? Most container accesses would be within the same category set (i.e. intra-container) and to a handful of types (mostly container or svirt types). So they shouldn't yield that many unique (source context, target context, target class) triples.
Imagine thousands of containers on a single system.
Even with thousands of containers, most accesses should be intra-container, so I wouldn't expect that many unique AVC entries; AVC entries are only ever created for actual permission checks, not potential ones. That said, given the number of unique security classes, I could see a definite multiplying factor just to represent a container's access to all file classes, many socket classes, etc. That's another area for possible improvement: allowing a single AVC entry and security server computation to represent multiple classes, so that if the same permissions are allowed to e.g. all file classes, we can store that once in the AVC.
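One way to picture the multi-class idea: instead of keying each entry on a single (ssid, tsid, tclass) triple, an entry could carry a class bitmap alongside a shared permission vector. A rough sketch, with hypothetical field names (the real cache entry is struct avc_node / struct avc_entry):

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_CLASSES 256

/*
 * Illustrative only: one cache entry covering every security class
 * that was granted the same permission bits for this SID pair.
 */
struct multiclass_avc_entry {
	uint32_t ssid;				/* source context SID */
	uint32_t tsid;				/* target context SID */
	uint64_t class_map[MAX_CLASSES / 64];	/* classes this entry covers */
	uint32_t allowed;			/* shared permission vector */
};

static bool entry_covers(const struct multiclass_avc_entry *e,
			 uint16_t tclass)
{
	return e->class_map[tclass / 64] & (1ULL << (tclass % 64));
}
```

A lookup that misses on the class bit would fall back to a full security server computation, just as a cache miss does today.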
Most people will never run more than 100 containers. Eventually we might scale beyond 100, but I think on OpenShift right now we are only handling ~50 containers, so this would be 50 process types and 50 object types (maybe a few more).
Agreed with the "most people" comment. OpenShift supports up to 250 containers per node right now, and we're going to try to double that by fall of this year. We're currently closing in on 100 per node in a variety of environments, though; 100 is pretty common.
Then I don't see why we'd be increasing the AVC cache threshold to 64k; that's just making the cache slow for no benefit.