Architectural/System-level problems for enterprise-level Storage Providers #9686
Problem: Sealing tasks scheduling inefficiencies

In both large and smaller SP deployments, one can observe lags in what resources a lotus-worker reports as available. Lotus-Workers need to be allocated tasks more efficiently: there is a big discrepancy between the theoretical throughput a sealing worker can achieve and its actual throughput. These inefficiencies compound for storage providers wanting to scale fast, or that need a high deal/sealing throughput (i.e., many lotus-workers).

All scheduling is currently done in the Lotus-Miner process. This process does not have perfect information about each lotus-worker's state, which means it has to model the state of every Lotus-Worker. This makes the scheduling logic complex, and its model of the workers is inevitably imperfect. The scheduler also has an upper boundary on how many lotus-workers it is built for: the many network roundtrips currently involved in the scheduling logic of a single process lead to a theoretically supported upper boundary of 100 Lotus-Workers. We have seen Storage Providers run with more workers than this, but the issues observed grow exponentially from that point.

Potential solution 1:

Pros:
Cons:
Potential solution 2: We have been looking into ways to optimize the current scheduler by removing or re-architecting its current blockers (especially the network round trips between the Lotus-Miner and Lotus-Worker, and the O(m*n) scheduling). Moving to a scheduler with shallower queues and better awareness of each Lotus-Worker's storage could also resolve many of the issues observed with workers not being selected appropriately (see the sketch after this list).

Pros:
Cons:
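To make the direction concrete, here is a minimal Go sketch of a pull-based alternative, where each worker requests a task only when it locally knows it has free resources. All types and names here are assumptions for illustration, not the actual Lotus scheduler API; the point is that a pull model removes the miner's need to maintain a (possibly stale) model of every worker's state.

```go
// Hypothetical sketch: workers pull tasks instead of the miner pushing them.
package main

import "fmt"

// Task is a unit of sealing work (e.g. PC1, PC2, C2).
type Task struct {
	Sector int
	Kind   string
}

// Scheduler holds a single shared queue; workers pull from it.
type Scheduler struct {
	queue chan Task
}

func NewScheduler(depth int) *Scheduler {
	return &Scheduler{queue: make(chan Task, depth)}
}

// Submit enqueues work; the miner no longer decides *which* worker runs it.
func (s *Scheduler) Submit(t Task) { s.queue <- t }

// Pull is called by a worker only when it has verified, locally, that it has
// the CPU/RAM/storage for another task — so the scheduler never needs to
// model the worker's state or round-trip to query it.
func (s *Scheduler) Pull() Task { return <-s.queue }

func main() {
	s := NewScheduler(16)
	go func() {
		for i := 0; i < 3; i++ {
			s.Submit(Task{Sector: i, Kind: "PC1"})
		}
	}()
	for i := 0; i < 3; i++ {
		fmt.Printf("worker got: %+v\n", s.Pull())
	}
}
```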
Problem: Sector Storage Subsystem

Under pressure (or after crashes/abrupt shutdowns), the sector storage subsystem can fail to report what state a sector is actually in, which makes monitoring and troubleshooting hard. In some cases it can also lead to sectors being stuck in a state where the only solution is to remove them (if in the sealing pipeline) or potentially terminate them (if they made it on-chain and cannot be proven). The Filecoin roadmap also holds improvements like mutable sectors (re-snap), new Proofs-of-Replication, and Hierarchical Consensus that the sector storage subsystem is not prepared for. When SnapDeals (mutating CC sectors into deal sectors) was introduced, this restriction was circumvented by adding more sector file types and folders - which added more complexity for something that did not necessarily need to exist:
We risk ending up in a similar complexity trade-off down the line with the current Sector Storage Subsystem. Re-architecting to harden the stability and performance of this subsystem is needed to support large-scale SP deployments. The sector index is today backed by non-persistent in-memory maps inside the main Lotus-Miner process, which is not ideal for large deployments.

Potential solution/idea: One solution we have explored is moving the sector index out of the Lotus-Miner process into a shared, robust database (see the sketch after this list).

Pros:
Cons:
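For illustration, a minimal sketch of what a database-backed sector index could look like. The schema, table, and function names are made up for this example (this is not Lotus's actual index); the point is that the index becomes durable across restarts and shareable between processes.

```go
// Illustrative sketch only: a sector index persisted in a shared SQL
// database instead of in-memory maps inside the lotus-miner process.
package sectorindex

import (
	"database/sql"

	_ "github.com/lib/pq" // hypothetical choice of a Postgres driver
)

func openIndex(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	// One row per (miner, sector, file type) triple.
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS sector_index (
		miner_id   BIGINT NOT NULL,
		sector_num BIGINT NOT NULL,
		file_type  TEXT   NOT NULL, -- unsealed | sealed | cache | update ...
		storage_id TEXT   NOT NULL,
		PRIMARY KEY (miner_id, sector_num, file_type)
	)`)
	return db, err
}

// declareSector records where a sector file lives; any process sharing the
// database sees the same durable view of the index, even after a crash.
func declareSector(db *sql.DB, miner, sector int64, fileType, storageID string) error {
	_, err := db.Exec(
		`INSERT INTO sector_index (miner_id, sector_num, file_type, storage_id)
		 VALUES ($1, $2, $3, $4)
		 ON CONFLICT (miner_id, sector_num, file_type)
		 DO UPDATE SET storage_id = EXCLUDED.storage_id`,
		miner, sector, fileType, storageID)
	return err
}
```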
Problem: Lotus-miner doesn't support PoSt redundancy

Proving your storage is a critical component for storage providers, and failing to submit a WindowPoSt means significant loss of earnings and extra costs. Today you can connect multiple windowPoSt workers to a single Lotus-Miner instance, which allows storage providers to have many partitions in a single deadline, enabling one to scale up to large raw storage capacities. This is still only partially useful at enterprise scale, since there is no redundancy in windowPoSt. There is also no redundancy in winningPoSt.

Potential solution for windowPoSt redundancy: Real redundancy (duplicate work) for windowPoSt is hard to accomplish due to the nature of how windowPoSt is computed. Since windowPoSt requires reading random parts of each sector in a partition, doing 2x or more reads for every sector in a partition can lead to slower read speeds, which in turn can cause windowPoSt failures. Sending duplicate windowPoSt messages to the chain is also not ideal, since this wastes both FIL spent on gas fees and chain space. What we can do is add redundancy to the windowPoSt scheduling logic (watching the chain and asking workers to make a snark after the challenges have been read), so that failures in creating the snark or sending the message to the chain will be retried (see the sketch below).

Potential solution for winningPoSt redundancy: Adding redundancy (duplicate work) for winningPoSt is easier to accomplish, since it is only a Proof-of-Spacetime over a single random sector and therefore does not pose any storage-reading bottlenecks. Adding this mostly requires new slash-prevention logic before broadcasting the block.
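A minimal sketch of the retry idea for windowPoSt (all names here are hypothetical stand-ins, not the real Lotus interfaces): once the challenged sector data has been read, snark generation and message submission are retried on failure, rather than redundantly re-reading storage.

```go
// Sketch: retry the cheap-to-repeat steps (snark + submit) with backoff.
package wdpost

import (
	"context"
	"fmt"
	"time"
)

type Proof []byte

// generateSnark and pushMessage stand in for the expensive proving step and
// the chain-submission step, respectively.
func retryPoSt(
	ctx context.Context,
	generateSnark func(context.Context) (Proof, error),
	pushMessage func(context.Context, Proof) error,
	attempts int,
) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		proof, err := generateSnark(ctx)
		if err == nil {
			if err = pushMessage(ctx, proof); err == nil {
				return nil // proof landed; no duplicate gas spent
			}
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Duration(i+1) * 10 * time.Second): // linear backoff
		}
	}
	return fmt.Errorf("windowPoSt failed after %d attempts: %w", attempts, lastErr)
}
```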
Problem: Managing multiple MinerIDs

Enterprise Storage Providers might take on funding from multiple investors or different lending platforms. A common way of managing funds from these different entities (and of managing risk) is to separate funds/ownership between different MinerIDs. These storage providers usually have a fixed set of sealing workers and storage servers, which they then spread across the MinerIDs. Managing all these workers and reassigning them to different MinerIDs is both time-consuming and inflexible. The same issue arises for storage providers running test networks, as the MinerID for the test network needs a dedicated sealing pipeline and storage.

Potential solution: Decoupling these components would allow one to have a single pool of storage or sealing workers serving multiple MinerIDs, including the ability to have different sector sizes and to connect to multiple networks with that single pool of storage/workers. Being able to work on this problem relies on changes being made in the Sector Storage Subsystem, which is covered in a separate thread: Sector Storage Subsystem
Problem: Monitoring Systems

Enterprise-scale storage providers need good monitoring systems to observe the health and condition of their systems; this is especially true when managing multiple MinerIDs. There are currently some open-source monitoring systems that enterprise SPs can leverage or build upon, but they might not be suitable for enterprise requirements. Many SPs already have such monitoring systems developed in-house, but new enterprise SPs face the issue of having no easy way to deploy monitoring systems.

Potential solution:
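The potential solution above is left open in the original post. Purely as an illustration of the kind of thing an SP monitoring stack tends to scrape, here is a minimal Prometheus-style metrics endpoint in Go; the metric name and labels are invented for this example.

```go
// Illustrative only: expose per-MinerID sealing-state gauges for scraping.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var sectorsInState = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "sp_sectors_in_state", // hypothetical metric name
		Help: "Number of sectors per sealing state, per MinerID.",
	},
	[]string{"miner_id", "state"},
)

func main() {
	prometheus.MustRegister(sectorsInState)
	sectorsInState.WithLabelValues("f01234", "PreCommit1").Set(12)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```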
Problem: ??

Any high-level problem areas of the Lotus-Miner you feel have not been covered by the above threads? Then please elaborate in-depth here:
Problem: RPC connection to Lotus should be reliable

When the miner's load is high, the RPC connection to lotus sometimes gets stuck. Or, when the link between the miner and lotus is unstable, the connection is lost outright. If the connection is cleanly closed, the miner can retry and resume. But sometimes the connection stays stuck for a long time, and then the miner gets stuck too. For the sealing process, if the lotus/miner connection is lost, the only way out is to restart the miner - a lot of sectors' state is lost, and those sectors are never scheduled again.
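A common pattern that would address the "stuck forever" case is to bound every RPC call with a deadline, so a hung connection surfaces as an error that can trigger a reconnect instead of blocking the miner. A minimal Go sketch, using a stand-in function in place of the real lotus RPC client:

```go
// Sketch: per-call deadlines turn a hung RPC into a handleable error.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callChainHead stands in for any lotus RPC call.
func callChainHead(ctx context.Context) (string, error) {
	select {
	case <-time.After(5 * time.Second): // simulate a stuck connection
		return "head", nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	head, err := callChainHead(ctx)
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("RPC stuck: tear down the connection and reconnect")
		return
	}
	fmt.Println("head:", head, err)
}
```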
Problem: The miner itself should be an HA cluster

Currently there is only one miner generating proofs (there may be more than one PoSt worker, but the PoSt workers are effectively slaves - if the miner goes down, the PoSt workers are done too). I think the miner should be an HA cluster, master/slave or otherwise, so that when the master goes down, at least one slave can become the master and let the work go ahead.
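One well-known shape for this kind of failover, sketched here with a shared SQL lease (table and names are invented for illustration; a real design might use etcd or similar instead): each candidate miner tries to grab or renew a short-lived lease, and whichever node holds it acts as the active miner. When the lease expires, a standby takes over.

```go
// Sketch: lease-based leader election over a shared database.
package haminer

import (
	"database/sql"
	"time"
)

// tryAcquireLease returns true if this node now holds the leadership lease.
// It only steals the lease if the current one has expired (or it already
// owns it), so exactly one miner is active at a time.
func tryAcquireLease(db *sql.DB, nodeID string, ttl time.Duration) (bool, error) {
	res, err := db.Exec(`
		INSERT INTO miner_leader (id, holder, expires_at)
		VALUES (1, $1, $2)
		ON CONFLICT (id) DO UPDATE
		SET holder = $1, expires_at = $2
		WHERE miner_leader.expires_at < now() OR miner_leader.holder = $1`,
		nodeID, time.Now().Add(ttl))
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err
}
```

The active miner would call this in a loop to keep renewing its lease; standbys call it too, and start proving work the moment the call returns true.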
Problem: If a mountpoint gets stuck, the miner gets stuck too

The miner uses a LOTUS_MINER_REPO, which is normally a mountpoint from NFS, Ceph, or some other filesystem. When such a filesystem hangs, it hangs the whole miner process with it. It would be better to detect the stuck mountpoint and have the miner report an error, rather than stalling the entire process.
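A sketch of how such detection could work (an assumption about one possible approach, not existing Lotus behaviour): run the stat in a goroutine and time it out, since a stat against a hung NFS mount can otherwise block forever.

```go
// Sketch: bound filesystem probes so a hung mount can't hang the caller.
package fshealth

import (
	"fmt"
	"os"
	"time"
)

// CheckMount returns an error if stat-ing the path takes longer than timeout,
// which usually indicates a hung NFS/Ceph mount rather than a slow disk.
func CheckMount(path string, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		_, err := os.Stat(path)
		done <- err
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		// The goroutine itself stays blocked on the hung mount; we only
		// make sure the *caller* (the miner) does not.
		return fmt.Errorf("mountpoint %s unresponsive after %s", path, timeout)
	}
}
```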
Problem: With many sectors, the miner loads the sector list very slowly on start

I think the miner should have some cache/compensation mechanism for the sector list, to avoid such a slow loading process when restarting the miner.
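A sketch of the caching idea (the file format and names are made up): persist a snapshot of the sector list on shutdown, load it on start, and reconcile against the real state lazily in the background instead of blocking startup on a full scan.

```go
// Sketch: a JSON snapshot of the sector list to speed up miner restarts.
package sectorcache

import (
	"encoding/json"
	"os"
)

type SectorInfo struct {
	Number int64  `json:"number"`
	State  string `json:"state"`
}

// Save writes the snapshot; call it on clean shutdown or periodically.
func Save(path string, sectors []SectorInfo) error {
	data, err := json.Marshal(sectors)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

// Load reads the snapshot; on a cache miss, fall back to the slow full scan.
func Load(path string) ([]SectorInfo, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var sectors []SectorInfo
	err = json.Unmarshal(data, &sectors)
	return sectors, err
}
```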
Just a tip: I suggest we don't make this thread a battle between venus and lotus 😄.
Problem: Lotus has no meaningful security

All communication in lotus is done via plain HTTP or TCP. SPs should at least have the option to specify certificates and keys to use modern versions of TLS. This problem is very concerning due to the use of static JWTs as the primary form of API authZ. I would also like to see the option for mTLS. A good example of how this can work is how some k8s distributions do it - they auto-generate certs and default to mTLS among peers, while giving you the option to specify certs (see the sketch below).

API security should be improved to allow for third-party token issuance and validation (OAuth). Static JWTs that sit on the filesystem are a joke, at best. Ideally, I would be able to write or use a plugin to implement authN/authZ with any third-party solution I choose. An example of how to do this would be Hashicorp's go-plugin module: https://github.com/hashicorp/go-plugin

Tangential note - multiaddrs don't add value, are very annoying to work with, and actually inhibit SPs' ability to securely host lotus. I was originally building some proxy servers to both forward- and reverse-proxy all traffic over TLS internally (to give our lotus instances some meaningful security), but was unable to do so because of the use of multiaddrs. If lotus just used normal addressing schemes, I could have done all of this relatively easily. Please stick to modern standards that work with existing solutions. Boiling the ocean / reinventing the wheel only hurts the filecoin community and degrades the quality of the software you provide. As a small team, I urge you to leverage existing solutions as much as possible.
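For reference, the requested option is small in Go stdlib terms. A minimal sketch of an API server requiring mTLS (the file names and port are placeholders, and this is not an existing Lotus flag):

```go
// Sketch: serve an API over TLS 1.3 and require client certs signed by an
// SP-controlled CA (mTLS), using only the Go standard library.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	caPEM, err := os.ReadFile("ca.pem") // operator-provided CA certificate
	if err != nil {
		log.Fatal(err)
	}
	caPool := x509.NewCertPool()
	caPool.AppendCertsFromPEM(caPEM)

	srv := &http.Server{
		Addr: ":2345", // placeholder API port
		TLSConfig: &tls.Config{
			MinVersion: tls.VersionTLS13,               // modern TLS only
			ClientAuth: tls.RequireAndVerifyClientCert, // mTLS
			ClientCAs:  caPool,
		},
	}
	// cert.pem / key.pem: operator-specified server certificate and key.
	log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
}
```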
Problem: Remote / offline worker timeout issues

When using workers in a remote datacenter, a cloud provider, or any remote Sealing-as-a-Service, latency and network timeouts/reconnects become very problematic and cause issues with the scheduler. Some of this also applies to local workers that have gone offline for a variety of reasons.

Timeouts between the SP and workers should be increased and/or made configurable; I've briefly looked at the code, and they appear in many places. This should apply to all types of workers, including wdpost workers, as SP storage might be in a different location than the SP itself.

A more robust retry & reconnect to the workers should be implemented, and while a worker is in a disconnected/reconnecting state, the scheduler should do a better job of ignoring it. Currently, when a worker goes into a disconnected state, the scheduler will eventually show it as disabled, but it doesn't handle this very well and/or is slow to schedule other jobs. Similarly, the storage list routine should do a better job of ignoring disabled/disconnected workers and simply list them as offline, instead of trying to connect every time and getting stuck waiting for a return.

When a worker does disconnect or crash and then comes online again, the scheduler unnecessarily moves its sectors to other workers; it should just retry in place - there is no need to move them.
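A sketch of what the more robust reconnect could look like (the names are hypothetical): back off exponentially, cap the delay, and keep the worker marked as offline to the scheduler for the whole duration instead of letting calls to it hang.

```go
// Sketch: reconnect loop with capped exponential backoff.
package workerconn

import (
	"context"
	"time"
)

// Reconnect keeps calling dial until it succeeds or ctx is cancelled,
// doubling the delay between attempts up to a maximum.
func Reconnect(ctx context.Context, dial func(context.Context) error) error {
	delay := time.Second
	const maxDelay = 2 * time.Minute
	for {
		if err := dial(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
}
```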
Hey everyone! 👋

First of all, thank you all SO MUCH for providing feedback in this discussion. Your feedback has been invaluable for defining which direction we should go with what is currently called the […]. We have now published the 🗺 Mega […].

In essence, we are moving to a deployment with zero points of failure, where all components are easily scalable.

A discussion thread has also been opened for the project, which can be found here: #10341
Hey everyone! 👋
The Lotus-Miner team has lately been actively looking at ways we can support large-scale/enterprise-level SP deployments (both current and aspiring). We think it's time to highlight the major high-level problem areas we have identified, the potential solution(s) we have considered for each area, and the implications the different routes might have. We'd like to gather feedback from you: your potential concerns about these different solutions, the priority each area has for you, and other ideas you might have for these areas. Feedback here will help us determine the system design and implementation prioritization!
I have separated each high-level problem area into its own post so it's easier to discuss a single area in each thread. If you feel a major high-level problem area is missing in this discussion, there is also a separate post named:
Problem: ??
where you can elaborate on which high-level area you feel is missing to support enterprise-level SP deployments.

Note: the problems we are trying to look at here are architectural/system-level problems that are blocking SPs from scaling; individual bugs are out of scope for this discussion.
We plan to present and discuss these high-level areas in the SP working groups in the coming weeks (exact dates: 28.11.2022 & 05.12.2022). Reading and discussing async before these meetings would be highly beneficial for a constructive discussion and for moving project proposals forward.