Investigate the Need for the nodes_sync Table and Potential Alternatives #750

Open
VictorCavichioli opened this issue Oct 17, 2024 · 1 comment
Labels
enhancement New feature or request PoC/Agent Tasks related to new generation of ecchronos as an agent

Comments

@VictorCavichioli
Contributor

Story Description:

The current system utilizes a table called nodes_sync to track various details about Cassandra nodes, instances, and their connections. It maintains information such as node health, datacenter details, last and next scheduled job executions, and more. This table plays a critical role in managing and monitoring node connectivity and job scheduling within the system.

However, it is important to investigate whether the system truly needs this table to function, or if there is a more efficient way to track and manage these details without persisting this information in a database table. This task aims to explore potential alternatives, such as using in-memory data structures, distributed coordination mechanisms, or other approaches to reduce or eliminate reliance on the nodes_sync table.

The investigation will focus on identifying whether the functionality provided by the nodes_sync table can be achieved through alternative means, and what impact such a change would have on the system’s performance, scalability, and reliability.

Acceptance Criteria:

  1. A thorough investigation into the usage of the nodes_sync table, documenting all functionalities it provides in the current system.
  2. Identification of potential alternatives to using a persistent table, such as in-memory storage or distributed coordination solutions.
  3. A detailed analysis of the trade-offs between using a table versus alternatives, including performance, scalability, and maintainability implications.
  4. If the investigation reveals that the table can be avoided, a plan for transitioning to an alternative solution must be outlined, with considerations for backward compatibility and data migration if necessary.

Definition of Done:

  1. The investigation is completed, and a detailed report is produced, outlining the necessity of the nodes_sync table.
  2. If alternatives are feasible, the report should include recommendations and a high-level plan for implementation.
  3. The findings are presented to the team, and a decision is made on whether to proceed with removing or refactoring the nodes_sync table.
  4. Documentation is updated to reflect the investigation findings, including the rationale behind keeping or removing the table.

Notes:

  1. Consider the implications on job scheduling, node health tracking, and how the system might respond if a table is no longer used for these purposes.
  2. Review if any existing in-memory caching, distributed service coordination (e.g., Zookeeper, etcd), or other mechanisms can fulfill the same requirements without persistent storage.
  3. Be mindful of how removing the table might affect the system’s resilience and state persistence during node failures.
@VictorCavichioli VictorCavichioli added enhancement New feature or request PoC/Agent Tasks related to new generation of ecchronos as an agent labels Oct 17, 2024
@paulchandler paulchandler self-assigned this Nov 7, 2024
@paulchandler

Currently the nodes_sync table is not used to its full potential. When ecchronos starts up, it does not read any data from the table, so it does not know what the last and next connections should be. Instead, it creates a new record with the next connection reset to the default.

The only time the table is read is when the RetrySchedulerService looks for JMX connections that need to be reconnected.
So, in its current state, the nodes_sync table could easily be replaced by a simple in-memory data structure.
However, it would make more sense to fix this first issue, so that ecchronos reads the nodes_sync table on startup and then uses the original state as a basis going forward.
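
A minimal sketch of what "read on startup, then continue from the original state" could look like. The class names, fields, and default interval here are hypothetical and are not taken from the ecchronos code base; the real nodes_sync schema may differ.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical row shape for illustration only.
final class NodeSyncRow
{
    final String nodeId;
    final Instant lastConnection; // null if the node has never been connected
    final Instant nextConnection;

    NodeSyncRow(String nodeId, Instant lastConnection, Instant nextConnection)
    {
        this.nodeId = nodeId;
        this.lastConnection = lastConnection;
        this.nextConnection = nextConnection;
    }
}

// In-memory view of the table that is seeded from persisted rows at startup.
final class NodeSyncState
{
    private final Map<String, NodeSyncRow> rows = new ConcurrentHashMap<>();
    private final Duration defaultInterval;

    NodeSyncState(Duration defaultInterval)
    {
        this.defaultInterval = defaultInterval;
    }

    // Load whatever was read from the table before any new record is created.
    void restore(List<NodeSyncRow> persisted)
    {
        for (NodeSyncRow row : persisted)
        {
            rows.put(row.nodeId, row);
        }
    }

    // Reuse the persisted state when present; fall back to the default
    // next connection only for nodes that have no row yet.
    NodeSyncRow getOrCreate(String nodeId, Instant now)
    {
        return rows.computeIfAbsent(nodeId,
                id -> new NodeSyncRow(id, null, now.plus(defaultInterval)));
    }
}
```

With this shape, a restart keeps the previously stored next connection for known nodes instead of resetting it to the default.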

Secondly, there is an outstanding issue, #680 Create "status" Command on EccTool. The easiest way to implement this would be to have every running ecchronos instance store its data in the nodes_sync table; the ecctool command could then simply query this data.
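
As a rough illustration of the kind of query result a status command could present, the sketch below aggregates rows by ecchronos instance and node status. The row fields and the status values are assumptions made for the example, not the actual schema or the planned command output.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical projection of a nodes_sync row: which ecchronos instance
// reported which node, and in what state.
final class StatusRow
{
    final String instanceId;
    final String nodeId;
    final String nodeStatus;

    StatusRow(String instanceId, String nodeId, String nodeStatus)
    {
        this.instanceId = instanceId;
        this.nodeId = nodeId;
        this.nodeStatus = nodeStatus;
    }
}

final class StatusSummary
{
    // Count nodes per status for each ecchronos instance.
    static Map<String, Map<String, Integer>> summarize(List<StatusRow> rows)
    {
        Map<String, Map<String, Integer>> result = new TreeMap<>();
        for (StatusRow row : rows)
        {
            result.computeIfAbsent(row.instanceId, k -> new TreeMap<>())
                  .merge(row.nodeStatus, 1, Integer::sum);
        }
        return result;
    }
}
```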

Finally, storing the data in the table would make it possible to improve the robustness of a cluster of ecchronos instances. If an instance of ecchronos goes down, a good number of nodes would be left unrepaired, so another healthy instance could temporarily take responsibility for those nodes.
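
One possible shape for that takeover, sketched under assumptions: ownership is known per node (for example from the table), some other mechanism decides which instances are healthy, and orphaned nodes are spread round-robin. None of these names or policies come from ecchronos itself.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class Takeover
{
    // owners:  nodeId -> ecchronos instance normally responsible for it
    // healthy: instances currently considered alive (how this is detected
    //          is outside the scope of this sketch)
    // Returns nodeId -> temporary owner for every node whose owner is down.
    static Map<String, String> reassign(Map<String, String> owners, List<String> healthy)
    {
        Map<String, String> temporary = new LinkedHashMap<>();
        if (healthy.isEmpty())
        {
            return temporary; // nobody left to take over
        }
        int next = 0;
        for (Map.Entry<String, String> entry : owners.entrySet())
        {
            if (!healthy.contains(entry.getValue()))
            {
                // Spread orphaned nodes evenly over the healthy instances.
                temporary.put(entry.getKey(), healthy.get(next % healthy.size()));
                next++;
            }
        }
        return temporary;
    }
}
```

Nodes whose owner is still healthy are left out of the result, so the reassignment stays temporary and limited to the failed instance's nodes.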

For all these reasons, I would recommend keeping the nodes_sync table. It will be a small table, likely with only a few rows stored on each node.
