Architectural/System-level problems for enterprise-level Storage Providers #9686
Problem: Sealing tasks scheduling inefficiencies

In both large and smaller SP deployments, one can observe lags in what resources a lotus-worker reports as available. Lotus-Workers need to be allocated tasks more efficiently: there is a big discrepancy between the theoretical throughput a sealing worker can achieve and its actual throughput. These inefficiencies compound for storage providers wanting to scale fast, or that need a high deal/sealing throughput (i.e., many lotus-workers).

All scheduling is currently done in the Lotus-Miner process. This process does not have perfect information about each lotus-worker's state, which means it has to model the state of every Lotus-Worker. This makes the scheduling logic complex, and its model of the workers is inevitably imperfect. The scheduler also has an upper boundary on how many lotus-workers it is built for: the many network roundtrips currently involved in the scheduling logic of a single process lead to a theoretically supported upper boundary of 100 Lotus-Workers. We have seen Storage Providers run with more workers than this, but the issues observed grow exponentially from that point.

Potential solution 1:

Pros:
Cons:
Potential solution 2: We have been looking into ways to optimize the current scheduler by removing or re-architecting its current blockers (especially the network round trips between the Lotus-Miner and Lotus-Worker, and the O(m*n) scheduling). Moving to a scheduler with shallower queues and better awareness of each Lotus-Worker's storage could also resolve many of the issues observed with workers not being selected appropriately (see the sketch after this list).

Pros:
Cons:
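To make the direction concrete, here is a minimal Go sketch of a pull-based alternative, where each worker requests a task only when it locally knows it has free resources. All types and names here are assumptions for illustration, not the actual Lotus scheduler API; the point is that a pull model removes the miner's need to maintain a (possibly stale) model of every worker's state.

```go
// Hypothetical sketch: workers pull tasks instead of the miner pushing them.
package main

import "fmt"

// Task is a unit of sealing work (e.g. PC1, PC2, C2).
type Task struct {
	Sector int
	Kind   string
}

// Scheduler holds a single shared queue; workers pull from it.
type Scheduler struct {
	queue chan Task
}

func NewScheduler(depth int) *Scheduler {
	return &Scheduler{queue: make(chan Task, depth)}
}

// Submit enqueues work; the miner no longer decides *which* worker runs it.
func (s *Scheduler) Submit(t Task) { s.queue <- t }

// Pull is called by a worker only when it has verified, locally, that it has
// the CPU/RAM/storage for another task — so the scheduler never needs to
// model the worker's state or round-trip to query it.
func (s *Scheduler) Pull() Task { return <-s.queue }

func main() {
	s := NewScheduler(16)
	go func() {
		for i := 0; i < 3; i++ {
			s.Submit(Task{Sector: i, Kind: "PC1"})
		}
	}()
	for i := 0; i < 3; i++ {
		fmt.Printf("worker got: %+v\n", s.Pull())
	}
}
```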
Problem: Sector Storage Subsystem

Under pressure (or after crashes/abrupt shutdowns), the sector storage subsystem can fail to report what state a sector is actually in, which makes monitoring and troubleshooting hard. In some cases it can also lead to sectors being stuck in a state where the only solution is to remove them (if in the sealing pipeline) or potentially terminate them (if they made it on-chain and cannot be proven). The Filecoin roadmap also holds improvements like mutable sectors (re-snap), new Proofs-of-Replication, and Hierarchical Consensus that the sector storage subsystem is not prepared for. When SnapDeals (mutating CC sectors into deal sectors) was introduced, this restriction was circumvented by adding more sector file types and folders - which added more complexity for something that did not necessarily need to exist:
We risk ending up in a similar complexity trade-off down the line with the current Sector Storage Subsystem. Re-architecting to harden the stability and performance of this subsystem is needed to support large-scale SP deployments. The sector index is today backed by non-persistent in-memory maps inside the main Lotus-Miner process, which is not ideal for large deployments.

Potential solution/idea: One solution we have explored is moving the sector index out of the Lotus-Miner process into a shared, robust database (see the sketch after this list).

Pros:
Cons:
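For illustration, a minimal sketch of what a database-backed sector index could look like. The schema, table, and function names are made up for this example (this is not Lotus's actual index); the point is that the index becomes durable across restarts and shareable between processes.

```go
// Illustrative sketch only: a sector index persisted in a shared SQL
// database instead of in-memory maps inside the lotus-miner process.
package sectorindex

import (
	"database/sql"

	_ "github.com/lib/pq" // hypothetical choice of a Postgres driver
)

func openIndex(dsn string) (*sql.DB, error) {
	db, err := sql.Open("postgres", dsn)
	if err != nil {
		return nil, err
	}
	// One row per (miner, sector, file type) triple.
	_, err = db.Exec(`CREATE TABLE IF NOT EXISTS sector_index (
		miner_id   BIGINT NOT NULL,
		sector_num BIGINT NOT NULL,
		file_type  TEXT   NOT NULL, -- unsealed | sealed | cache | update ...
		storage_id TEXT   NOT NULL,
		PRIMARY KEY (miner_id, sector_num, file_type)
	)`)
	return db, err
}

// declareSector records where a sector file lives; any process sharing the
// database sees the same durable view of the index, even after a crash.
func declareSector(db *sql.DB, miner, sector int64, fileType, storageID string) error {
	_, err := db.Exec(
		`INSERT INTO sector_index (miner_id, sector_num, file_type, storage_id)
		 VALUES ($1, $2, $3, $4)
		 ON CONFLICT (miner_id, sector_num, file_type)
		 DO UPDATE SET storage_id = EXCLUDED.storage_id`,
		miner, sector, fileType, storageID)
	return err
}
```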
Problem: Lotus-miner doesn't support PoSt redundancy

Proving your storage is a critical component for storage providers, and failing to submit a WindowPoSt means significant loss of earnings and extra costs. Today you can connect multiple windowPoSt workers to a single Lotus-Miner instance, which allows storage providers to have many partitions in a single deadline, enabling one to scale up to large raw storage capacities. This is still only partially useful at enterprise scale, since there is no redundancy in windowPoSt. There is also no redundancy in winningPoSt.

Potential solution for windowPoSt redundancy: Real redundancy (duplicate work) for windowPoSt is hard to accomplish due to the nature of how windowPoSt is computed. Since windowPoSt requires reading random parts of each sector in a partition, doing 2x or more reads for every sector in a partition can lead to slower read speeds, which in turn can cause windowPoSt failures. Sending duplicate windowPoSt messages to the chain is also not ideal, since this wastes both FIL spent on gas fees and chain space. What we can do is add redundancy to the windowPoSt scheduling logic (watching the chain and asking workers to make a snark after the challenges have been read), so that failures in creating the snark or sending the message to the chain will be retried (see the sketch below).

Potential solution for winningPoSt redundancy: Adding redundancy (duplicate work) for winningPoSt is easier to accomplish, since it is only a Proof-of-Spacetime over a single random sector and therefore does not pose any storage-reading bottlenecks. Adding this mostly requires new slash-prevention logic before broadcasting the block.
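A minimal sketch of the retry idea for windowPoSt (all names here are hypothetical stand-ins, not the real Lotus interfaces): once the challenged sector data has been read, snark generation and message submission are retried on failure, rather than redundantly re-reading storage.

```go
// Sketch: retry the cheap-to-repeat steps (snark + submit) with backoff.
package wdpost

import (
	"context"
	"fmt"
	"time"
)

type Proof []byte

// generateSnark and pushMessage stand in for the expensive proving step and
// the chain-submission step, respectively.
func retryPoSt(
	ctx context.Context,
	generateSnark func(context.Context) (Proof, error),
	pushMessage func(context.Context, Proof) error,
	attempts int,
) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		proof, err := generateSnark(ctx)
		if err == nil {
			if err = pushMessage(ctx, proof); err == nil {
				return nil // proof landed; no duplicate gas spent
			}
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Duration(i+1) * 10 * time.Second): // linear backoff
		}
	}
	return fmt.Errorf("windowPoSt failed after %d attempts: %w", attempts, lastErr)
}
```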
Problem: Managing multiple MinerIDs

Enterprise Storage Providers might take on funding from multiple investors or different lending platforms. A common way of managing funds from these different entities (and of managing risk) is to separate funds/ownership between different MinerIDs. These storage providers usually have a fixed set of sealing workers and storage servers, which they then spread across the MinerIDs. Managing all these workers and reassigning them to different MinerIDs is both time-consuming and inflexible. The same issue arises for storage providers running test networks, as the MinerID for the test network needs a dedicated sealing pipeline and storage.

Potential solution: Decoupling these components would allow one to have a single pool of storage or sealing workers serving multiple MinerIDs, including the ability to have different sector sizes and to connect to multiple networks with that single pool of storage/workers. Being able to work on this problem relies on changes being made in the Sector Storage Subsystem, which is covered in a separate thread: Sector Storage Subsystem
Problem: Monitoring Systems

Enterprise-scale storage providers need good monitoring systems to observe the health and condition of their systems; this is especially true when managing multiple MinerIDs. There are currently some open-source monitoring systems that enterprise SPs can leverage or build upon, but they might not be suitable for enterprise requirements. Many SPs already have such monitoring systems developed in-house, but new enterprise SPs face the issue of having no easy way to deploy monitoring systems.

Potential solution:
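The potential solution above is left open in the original post. Purely as an illustration of the kind of thing an SP monitoring stack tends to scrape, here is a minimal Prometheus-style metrics endpoint in Go; the metric name and labels are invented for this example.

```go
// Illustrative only: expose per-MinerID sealing-state gauges for scraping.
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var sectorsInState = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "sp_sectors_in_state", // hypothetical metric name
		Help: "Number of sectors per sealing state, per MinerID.",
	},
	[]string{"miner_id", "state"},
)

func main() {
	prometheus.MustRegister(sectorsInState)
	sectorsInState.WithLabelValues("f01234", "PreCommit1").Set(12)

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9100", nil))
}
```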
Problem: ??

Any high-level problem areas of the Lotus-Miner you feel have not been covered by the above threads? Then please elaborate in-depth here:
Problem: RPC connection to Lotus should be reliable

When the miner's load is high, the RPC connection to lotus sometimes gets stuck. Or, when the link between the miner and lotus is unstable, the connection is lost outright. If the connection is cleanly closed, the miner can retry and resume. But sometimes the connection stays stuck for a long time, and then the miner gets stuck too. For the sealing process, if the lotus/miner connection is lost, the only way out is to restart the miner - a lot of sectors' state is lost, and those sectors are never scheduled again.
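A common pattern that would address the "stuck forever" case is to bound every RPC call with a deadline, so a hung connection surfaces as an error that can trigger a reconnect instead of blocking the miner. A minimal Go sketch, using a stand-in function in place of the real lotus RPC client:

```go
// Sketch: per-call deadlines turn a hung RPC into a handleable error.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// callChainHead stands in for any lotus RPC call.
func callChainHead(ctx context.Context) (string, error) {
	select {
	case <-time.After(5 * time.Second): // simulate a stuck connection
		return "head", nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	head, err := callChainHead(ctx)
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("RPC stuck: tear down the connection and reconnect")
		return
	}
	fmt.Println("head:", head, err)
}
```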
Problem: The miner itself should be an HA cluster

Currently there is only one miner generating proofs (there may be more than one PoSt worker, but the PoSt workers are effectively slaves - if the miner goes down, the PoSt workers are done too). I think the miner should be an HA cluster, master/slave or otherwise, so that when the master goes down, at least one slave can become the master and let the work go ahead.
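One well-known shape for this kind of failover, sketched here with a shared SQL lease (table and names are invented for illustration; a real design might use etcd or similar instead): each candidate miner tries to grab or renew a short-lived lease, and whichever node holds it acts as the active miner. When the lease expires, a standby takes over.

```go
// Sketch: lease-based leader election over a shared database.
package haminer

import (
	"database/sql"
	"time"
)

// tryAcquireLease returns true if this node now holds the leadership lease.
// It only steals the lease if the current one has expired (or it already
// owns it), so exactly one miner is active at a time.
func tryAcquireLease(db *sql.DB, nodeID string, ttl time.Duration) (bool, error) {
	res, err := db.Exec(`
		INSERT INTO miner_leader (id, holder, expires_at)
		VALUES (1, $1, $2)
		ON CONFLICT (id) DO UPDATE
		SET holder = $1, expires_at = $2
		WHERE miner_leader.expires_at < now() OR miner_leader.holder = $1`,
		nodeID, time.Now().Add(ttl))
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err
}
```

The active miner would call this in a loop to keep renewing its lease; standbys call it too, and start proving work the moment the call returns true.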
Problem: If a mountpoint gets stuck, the miner gets stuck too

The miner uses a LOTUS_MINER_REPO, which is normally a mountpoint from NFS, Ceph, or some other filesystem. When such a filesystem hangs, it hangs the whole miner process with it. It would be better to detect the stuck mountpoint and have the miner report an error, rather than stalling the entire process.
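A sketch of how such detection could work (an assumption about one possible approach, not existing Lotus behaviour): run the stat in a goroutine and time it out, since a stat against a hung NFS mount can otherwise block forever.

```go
// Sketch: bound filesystem probes so a hung mount can't hang the caller.
package fshealth

import (
	"fmt"
	"os"
	"time"
)

// CheckMount returns an error if stat-ing the path takes longer than timeout,
// which usually indicates a hung NFS/Ceph mount rather than a slow disk.
func CheckMount(path string, timeout time.Duration) error {
	done := make(chan error, 1)
	go func() {
		_, err := os.Stat(path)
		done <- err
	}()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		// The goroutine itself stays blocked on the hung mount; we only
		// make sure the *caller* (the miner) does not.
		return fmt.Errorf("mountpoint %s unresponsive after %s", path, timeout)
	}
}
```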
Problem: With many sectors, the miner loads the sector list very slowly on start

I think the miner should have some cache/compensation mechanism for the sector list, to avoid such a slow loading process when restarting the miner.
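A sketch of the caching idea (the file format and names are made up): persist a snapshot of the sector list on shutdown, load it on start, and reconcile against the real state lazily in the background instead of blocking startup on a full scan.

```go
// Sketch: a JSON snapshot of the sector list to speed up miner restarts.
package sectorcache

import (
	"encoding/json"
	"os"
)

type SectorInfo struct {
	Number int64  `json:"number"`
	State  string `json:"state"`
}

// Save writes the snapshot; call it on clean shutdown or periodically.
func Save(path string, sectors []SectorInfo) error {
	data, err := json.Marshal(sectors)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

// Load reads the snapshot; on a cache miss, fall back to the slow full scan.
func Load(path string) ([]SectorInfo, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var sectors []SectorInfo
	err = json.Unmarshal(data, &sectors)
	return sectors, err
}
```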
Just a tip: I suggest we don't make this thread a battle between venus and lotus 😄.
Problem: Lotus has no meaningful security

All communication in lotus is done via plain HTTP or TCP. SPs should at least have the option to specify certificates and keys to use modern versions of TLS. This problem is very concerning due to the use of static JWTs as the primary form of API authZ. I would also like to see the option for mTLS. A good example of how this can work is how some k8s distributions do it - they auto-generate certs and default to mTLS among peers, while giving you the option to specify certs (see the sketch below).

API security should be improved to allow for third-party token issuance and validation (OAuth). Static JWTs that sit on the filesystem are a joke, at best. Ideally, I would be able to write or use a plugin to implement authN/authZ with any third-party solution I choose. An example of how to do this would be Hashicorp's go-plugin module: https://github.com/hashicorp/go-plugin

Tangential note - multiaddrs don't add value, are very annoying to work with, and actually inhibit SPs' ability to securely host lotus. I was originally building some proxy servers to both forward- and reverse-proxy all traffic over TLS internally (to give our lotus instances some meaningful security), but was unable to do so because of the use of multiaddrs. If lotus just used normal addressing schemes, I could have done all of this relatively easily. Please stick to modern standards that work with existing solutions. Boiling the ocean / reinventing the wheel only hurts the filecoin community and degrades the quality of the software you provide. As a small team, I urge you to leverage existing solutions as much as possible.
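For reference, the requested option is small in Go stdlib terms. A minimal sketch of an API server requiring mTLS (the file names and port are placeholders, and this is not an existing Lotus flag):

```go
// Sketch: serve an API over TLS 1.3 and require client certs signed by an
// SP-controlled CA (mTLS), using only the Go standard library.
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	caPEM, err := os.ReadFile("ca.pem") // operator-provided CA certificate
	if err != nil {
		log.Fatal(err)
	}
	caPool := x509.NewCertPool()
	caPool.AppendCertsFromPEM(caPEM)

	srv := &http.Server{
		Addr: ":2345", // placeholder API port
		TLSConfig: &tls.Config{
			MinVersion: tls.VersionTLS13,               // modern TLS only
			ClientAuth: tls.RequireAndVerifyClientCert, // mTLS
			ClientCAs:  caPool,
		},
	}
	// cert.pem / key.pem: operator-specified server certificate and key.
	log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem"))
}
```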
Problem: Remote / offline worker timeout issues

When using workers in a remote datacenter, a cloud provider, or any remote Sealing-as-a-Service, latency and network timeouts/reconnects become very problematic and cause issues with the scheduler. Some of this also applies to local workers that have gone offline for a variety of reasons.

Timeouts between the SP and workers should be increased and/or made configurable; I've briefly looked at the code, and they appear in many places. This should apply to all types of workers, including wdpost workers, as SP storage might be in a different location than the SP itself.

A more robust retry & reconnect to the workers should be implemented, and while a worker is in a disconnected/reconnecting state, the scheduler should do a better job of ignoring it. Currently, when a worker goes into a disconnected state, the scheduler will eventually show it as disabled, but it doesn't handle this very well and/or is slow to schedule other jobs. Similarly, the storage list routine should do a better job of ignoring disabled/disconnected workers and simply list them as offline, instead of trying to connect every time and getting stuck waiting for a return.

When a worker does disconnect or crash and then comes online again, the scheduler unnecessarily moves its sectors to other workers; it should just retry in place - there is no need to move them.
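A sketch of what the more robust reconnect could look like (the names are hypothetical): back off exponentially, cap the delay, and keep the worker marked as offline to the scheduler for the whole duration instead of letting calls to it hang.

```go
// Sketch: reconnect loop with capped exponential backoff.
package workerconn

import (
	"context"
	"time"
)

// Reconnect keeps calling dial until it succeeds or ctx is cancelled,
// doubling the delay between attempts up to a maximum.
func Reconnect(ctx context.Context, dial func(context.Context) error) error {
	delay := time.Second
	const maxDelay = 2 * time.Minute
	for {
		if err := dial(ctx); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
		}
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
}
```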
Hey everyone! 👋

First of all, thank you all SO MUCH for providing feedback in this discussion. Your feedback has been invaluable for defining which direction we should go with what is currently called the […]. We have now published the 🗺 Mega […].

In essence, we are moving to a deployment with zero points of failure, where all components are easily scalable.

A discussion thread has also been opened for the project, which can be found here: #10341
Hey everyone! 👋
The Lotus-Miner team has lately been actively looking at ways we can support large-scale/enterprise-level SP deployments (both current and aspiring). We think it's time to highlight the major high-level problem areas we have identified, the potential solution(s) we have considered for each area, and the implications the different routes might have. We'd like to gather feedback from you: your potential concerns about these different solutions, the priority each area has for you, and other ideas you might have for these areas. Feedback here will help us determine the system design and implementation prioritization!
I have separated each high-level problem area into its own post so it's easier to discuss a single area in each thread. If you feel a major high-level problem area is missing in this discussion, there is also a separate post named:
Problem: ??
where you can elaborate on which high-level area you feel is missing to support enterprise-level SP deployments.

Note: the problems we are trying to look at here are architectural/system-level problems that are blocking SPs from scaling; individual bugs are out of scope for this discussion.
We plan to present and discuss these high-level areas in the SP working groups in the coming weeks (exact dates: 28.11.2022 & 05.12.2022). Reading and discussing async before these meetings would be highly beneficial for a constructive discussion and for moving project proposals forward.