Awesome video Annie! 🤩 You are killing it!
This is so interesting. Using auction theory in this way is something I would never have thought of, but it's ingenious. And we know it scales, because other giant problems such as ads do something similar.
This is a really cool approach! The bidding system is really clever.
The quality of your videos puts most of the "big operators" to shame; excellent presentation!!
Nice speed run through the problem space! 🏁
I love how you guys are innovating in cloud engineering
Shouldn't a decentralized orchestrator be called a choreographer?
What
This was simply beautiful 👏
Who orchestrates the orchestrators?
Thanks for watching? Thanks for explaining :)
So you wanted to perform bin-packing, but treat the machines' resources (vcores, RAM, etc.) as the "capacity" of a container? When you were talking about how it's not ideal to try and fit as many applications as possible on a single worker, did "fit" refer to the size of the Fly Machine images on disk, or the expected resource utilization of an application? (Unsure how you could even predict resource utilization without relying on probabilistic methods anyway, so I'm assuming it was the former.)
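To be concrete, by "resources as capacity" I'm picturing something like this first-fit sketch over *requested* sizes, which is how most schedulers sidestep predicting actual utilization (purely hypothetical numbers and names, not flyd's logic):

```python
# Illustrative only: treat a worker's free (vcpu, ram) as bin capacity and
# place machines by their requested sizes with first-fit. Not flyd's code.

workers = [
    {"name": "worker-a", "free_vcpu": 8, "free_ram_gb": 32},
    {"name": "worker-b", "free_vcpu": 4, "free_ram_gb": 16},
]

def first_fit(request_vcpu, request_ram_gb):
    """Place a machine on the first worker with enough free capacity."""
    for w in workers:
        if w["free_vcpu"] >= request_vcpu and w["free_ram_gb"] >= request_ram_gb:
            w["free_vcpu"] -= request_vcpu
            w["free_ram_gb"] -= request_ram_gb
            return w["name"]
    return None  # no capacity anywhere: scale out or reject

print(first_fit(2, 4))   # worker-a
print(first_fit(6, 28))  # worker-a again: packed tight, which may not be ideal
print(first_fit(2, 8))   # worker-b
```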
Very interesting use case. Isn't this basically the same as the inner engine of GitLab / GitHub / ADO when they run pipelines with a pull instead of a push? It's exactly how runners / agents listen to broadcasted events from the main "API" service.
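To be concrete, the pull model I mean is roughly this (endpoint names and payloads are made up, not GitLab's, GitHub's, or Fly's actual API):

```python
# Minimal sketch of the pull model: the runner/agent polls a central queue and
# claims work itself, instead of the server pushing jobs to it.
import time
import requests

QUEUE_URL = "https://ci.example.com/api/jobs"  # hypothetical

def run(job):
    print("running job", job["id"])

while True:
    resp = requests.post(f"{QUEUE_URL}/claim", json={"runner": "runner-1"})
    if resp.status_code == 200:
        run(resp.json())   # we won the claim, so we do the work
    else:
        time.sleep(5)      # nothing available (or lost the race); back off and poll again
```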
Make a video on how you implemented SERF to ensure HA
I know very little about orchestrators and how they work, but I do know a bit about quadratic programming (QP) and optimal resource utilization. It seems like most of these orchestrators use some form of heuristic for scheduling, but is that actually the best approach? Would orchestrators not be better served using QP, where you add the resources and constraints and then schedule the work continuously? That way you can constrain the resources used, yet still optimize the system as a whole?
(Disclaimer: I work for Fly.) It's an interesting idea, but in practice QP has some suboptimal tradeoffs. With QP you're trying to find some vector x that minimizes a quadratic function f(x). Framed as a scheduling problem, you can imagine minimizing something like the total resource imbalance across workers. There are a few problems I can imagine running into here. One is that solving that for a large-scale system can get computationally expensive, whereas most schedulers value fast decisions. Additionally, in a high-throughput, high-turnover system, technically suboptimal heuristic-based decisions smooth out over time. You also run into issues where you have to quantify and solve for every constraint you might want to express in the system, which gets thorny. That said, a large batch scheduling system could maybe benefit from some kind of QP minimization solver. But for our use case (and the majority of scheduling use cases), Annie explained the general theory that informs modern scheduling in most places.
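To make the tradeoff concrete, here's a toy comparison (not Fly's code, and a brute-force search standing in for a real QP solver): it minimizes a quadratic imbalance objective exactly, versus a greedy least-loaded heuristic that decides each job instantly.

```python
# Toy version of the tradeoff: assign jobs to workers either by exhaustively
# minimizing a quadratic imbalance objective, or by a greedy "least-loaded
# worker" heuristic that decides per job in O(workers). Not Fly's code.
from itertools import product

jobs = [4, 3, 3, 2, 2, 1]   # e.g. vcpus requested per job
workers = 3

def imbalance(assign):
    """Quadratic objective: sum of squared per-worker load (lower = more even)."""
    loads = [0] * workers
    for job, w in zip(jobs, assign):
        loads[w] += job
    return sum(l * l for l in loads)

# "QP-style": search the whole assignment space -- optimal, but exponential in jobs.
best = min(product(range(workers), repeat=len(jobs)), key=imbalance)

# Heuristic: place each job on the currently least-loaded worker -- instant decisions.
loads = [0] * workers
greedy = []
for job in jobs:
    w = loads.index(min(loads))
    greedy.append(w)
    loads[w] += job

print(imbalance(best), imbalance(greedy))  # optimal vs heuristic objective values
```

On this tiny input the heuristic happens to match the optimum; at fleet scale the exhaustive side blows up while the greedy side keeps making fast, good-enough decisions that even out over time.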
it isn't distributed though? All the workers still talk to a central scheduler.
If I'm understanding this right, it's not really a decentralized orchestrator, just decentralized state? When a job comes in, the orchestrator polls (asks for bids from) the workers, but then the orchestrator still decides which worker gets the job.
Also still not clear how that resolves bin packing. Do the workers try to maintain some percentage of available resources and not take on more jobs? If so, the same can be accomplished from a centralized orchestrator, and if not, you'll end up bin-packed anyway unless you intentionally overprovision the number of workers in some way.
There must be more going on in flyd to get the bin-packing resolution that isn't explained here.
By having workers bid on jobs, each worker can have a different utilisation strategy, opaque to the rest of the system.
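Roughly, the shape is something like this (a minimal sketch with made-up names, not flyd's actual code): each worker prices a job from its own local state, and the scheduler just takes the best bid, without knowing how any worker computed it.

```python
# Minimal sketch of the bidding idea: pricing policy can differ per worker and
# stays opaque to the scheduler. Names and formulas are hypothetical.

def packing_bid(worker, job):
    """Prefer to fill up: bid lower (better) the fuller this worker already is."""
    return worker["free_vcpu"] - job["vcpu"]

def spreading_bid(worker, job):
    """Prefer headroom: bid lower the emptier this worker is."""
    return -(worker["free_vcpu"] - job["vcpu"])

workers = [
    {"name": "a", "free_vcpu": 6, "bid": packing_bid},
    {"name": "b", "free_vcpu": 14, "bid": spreading_bid},
]

def schedule(job):
    # Each worker bids from local state; workers that can't fit the job abstain.
    bids = [(w["bid"](w, job), w["name"]) for w in workers if w["free_vcpu"] >= job["vcpu"]]
    return min(bids)[1] if bids else None  # lowest bid wins

print(schedule({"vcpu": 2}))  # "b" under these made-up bid functions
```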
Kubernetes can run any workload: containers, VMs, and whatever comes next 🙈
Did you just reinvent VMTurbo? Auction-style VMware DRS tech from the early 2010s.
It's still centralized tho? 😅 The central API server makes all the final decisions.
You should call it "mildly centralized"
nice - out of interest have you open-sourced flyd?
#Miami mentioned!!! 🎉🎉🎉
this was cool
So does flyctl keep track of empty workers to kill, or create more workers in a loaded region?
Because "reasons" 👻
More of a Choreographer than Orchestrator! Brilliant!
😂😂😂😂 I did not know whether to cry or to laugh. You know that real system engineers will not believe you?
"Real system engineer" here. I believe Annie, not you.