operator-friendly quorum health reporting #4523
Comments
We have to be careful with this. We don't want to give the wrong impression that removing nodes from a quorum set would increase reliability (which we've already seen happen with archives).
I thought "tier1" was basically computed by the quorum intersection code? That being said, as the target audience here is people running watcher nodes, simplified algos, like running "pagerank" over the qsets of the validators in the node config and displaying entries that cross some threshold and are not in the config, could go a long way.
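To make the "pagerank over qsets" idea concrete, here is a minimal sketch in Python. It is not how stellar-core computes anything: the dictionary layout for qsets (`validators`, `inner_sets`), the function names, and the threshold value are all assumptions made for illustration. It treats each validator's qset as outgoing links, runs a plain power-iteration PageRank, and lists nodes that score above the threshold but are absent from the local config.

```python
def flatten_qset(qset):
    """Collect every validator name referenced in a (possibly nested) qset.

    The {"validators": [...], "inner_sets": [...]} layout is hypothetical.
    """
    names = set(qset.get("validators", []))
    for inner in qset.get("inner_sets", []):
        names |= flatten_qset(inner)
    return names


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict of node -> set of linked nodes."""
    nodes = set(edges)
    for targets in edges.values():
        nodes |= targets
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = edges.get(node, set())
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Node with unknown qset: spread its weight evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


def suggest_missing(configured, known_qsets, threshold):
    """Return nodes ranked at or above `threshold` that are not configured."""
    edges = {
        name: flatten_qset(qset) - {name}
        for name, qset in known_qsets.items()
    }
    ranks = pagerank(edges)
    candidates = [
        name for name, score in ranks.items()
        if score >= threshold and name not in configured
    ]
    return sorted(candidates, key=lambda name: -ranks[name])
```

A reasonable starting threshold might be the uniform share `1 / number_of_nodes`, so that only nodes referenced more often than average are surfaced. Any output of this kind would only be a hint for the operator to go check more advanced tools, not a recommendation to change the config.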
There is a SEP for this actually, so we could use ledger information. That being said, I don't know if this will continue to be true (state archival).
I think this is part of a bigger discussion around being able to access core data more easily for operators (and thus also users). Everyone should be able to easily access and process data (and theoretically they can). For instance, regarding quorum health, there should be workers that monitor whether validators are arbitrarily including transactions that favor their own activities when the network is congested (e.g. by monitoring the frequency of included txs that are not in the node's mempool), etc. I don't think that pushing data downstream and making aggregations and assumptions about how watcher nodes might want to do this is the correct way. My view is that watcher nodes should be able to easily work alongside core at a lower level and decide for themselves how they want to process health factors, changes, etc. (This also incentivizes better research and a better network economy imo). I've started working on a stellar-core fork that shares data at runtime with a Rust bridge using shm. Processing the data from the Rust service becomes much easier for an operator as it's running in parallel (I'm still figuring out how to make this thread safe) and you can plug in a much simpler codebase. I'm curious how the core team feels about this approach (also because I'm planning to push an MVP of this to Mercury testnet quite soon). On a separate note
I agree with this.
@heytdep great to hear you're also interested in better auditing infrastructure. It's an area where we didn't do a whole lot until very recently, so there are lots of opportunities. We've actually already run some good experiments by pushing data into the data pipeline for historical analysis (we've already started to use it to catch potential bias from validators). @sydneynotthecity probably has a lot more to say on the topic (and links to issues where discussions can happen). I'd say that if we're missing data, we should look into ways to push it into those data lakes to make it as easy and efficient as possible for data scientists (or anyone really) to analyze things and find interesting anomalies/trends without having to deal with the dark world of C++ :) As for this specific issue, I think the scope here is much simpler: without historical data (and related infrastructure), we're trying to give as much feedback as possible to node operators so that they at least know that they're supposed to go look at more advanced tools (stellarbeat or others) and confirm that their configuration is correct.
@MonsieurNicolas awesome! Would very much like to learn about the current experiments happening on that end. I do remember Garand mentioning something about making this kind of data more accessible through Hubble. I think I'll keep working on my experimental "rust bridge" atm, if only so there isn't just one SDF implementation, and to learn more about core's codebase (and to make the information bidirectional, i.e. also feeding instructions to core from Rust; still unsure about the safety here tho). But it would be awesome to be able to access more vote and set related data with the same ease as we currently access the transitions.
@heytdep this is something we've been thinking about on the data team as well. We are particularly interested in assessing how txset compositions change between validators if validators choose to enable cap-0042 (and the implications for fees). In general, it would be nice if some of the quorum set and validator data were more easily accessible. We've added some fields to Hubble to make it easier to audit bias at the validator level. See stellar-etl#249 for some additional context. We now publish the node id + signature that signed each ledger tx set. We discussed adding SCP info and validator messages to our dataset, but ultimately decided that the data was too fine-grained without a better understanding of what operators would do with that level of detail. The aggregate information @marta-lokhova proposes above is something we could include in our ETL pipeline and make publicly available, and we could expand to publishing full qset details and messages (if needed). Given the current data published in Hubble, you can actually do a bit of experimentation today. Anything saved in
@sydneynotthecity thanks for this. In general I'd say that any data in the What I was working on is a bit lower level: monitoring that leader validators are not pushing valid txs to the set without publishing them first. But yeah, it's great to have more data externalized about validator bias too. I'll try to squeeze in some time to build a zephyr version as well to increase awareness.
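As a rough illustration of the lower-level check discussed in this thread (how often a tx set contains transactions the local node never saw before the ledger closed), here is a minimal Python sketch. The input shapes are assumptions: `ledgers` is a hypothetical list of `(signer_node_id, tx_hashes)` pairs, and `seen_before_close` is a hypothetical set of hashes the watcher observed in its own mempool. Neither corresponds to an actual stellar-core or Hubble schema.

```python
from collections import defaultdict


def unseen_ratio_by_signer(ledgers, seen_before_close):
    """Fraction of included txs per signing node that were never seen locally.

    ledgers: iterable of (signer_node_id, list_of_tx_hashes) pairs (hypothetical shape).
    seen_before_close: set of tx hashes the local node saw before each close.
    """
    total = defaultdict(int)
    unseen = defaultdict(int)
    for signer, tx_hashes in ledgers:
        for tx_hash in tx_hashes:
            total[signer] += 1
            if tx_hash not in seen_before_close:
                unseen[signer] += 1
    # Signers that only produced empty tx sets are skipped to avoid dividing by zero.
    return {
        signer: unseen[signer] / count
        for signer, count in total.items()
        if count > 0
    }
```

A consistently higher ratio for one signer than for the others would not prove bias by itself (propagation delays also produce unseen transactions), but it could flag where to look more closely.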
We should signal to operators vital information about the health of their quorum set as well as the transitive quorum in general. Currently, we have info in `quorum`, quorum intersection checker warnings, plus in each consensus round we report missing/delayed nodes. Some ideas on how this can be improved from an operator point of view (specifically, it should be easy to understand that an action is needed):