Open Questions in Resource Allocation and Scheduling without Costs

24 May 2023

Imagine a shared resource, such as a compute server. Users can submit jobs to the server. The resource is “free”, in the sense that no costs are imposed on the users. The question is how to best assign and prioritize jobs when multiple users submit jobs simultaneously.

I have encountered similar problems in the past, and I still don’t have a truly satisfactory answer. Worse, I don’t even know how to phrase or approach this kind of problem properly.

Here are a few more details on the specific situation I have in mind:

jobs typically run for a finite amount of time: they do not complete “instantaneously”
jobs are not preemptible: while running, they can not be interrupted and must run to completion (or be killed, losing the result)
different users may have very different needs for the service: some use it heavily, others rarely or not at all

It is an explicit requirement to avoid the need for manual prioritization of users or their jobs, mostly to avoid the political debates about which users are more important than others and therefore might deserve priority treatment.

Given these constraints, how should jobs be sequenced?

Two Unsatisfactory Answers

Two extreme possibilities present themselves; unfortunately, both fail to deliver satisfactory behavior if different users place very different demands on the system — something that is explicitly expected in the intended scenario.

A single queue starves light users

One possibility is to place all jobs into a single queue: essentially a FIFO (first-in, first-out) policy. The problem is that with this queue discipline, a heavy user may effectively crowd out lighter users. Imagine that there are only two users. One submits (say) 1000 jobs in any time period, the other user only one job. The long list of jobs by the first user will effectively prevent the second user’s job from running in a timely fashion. That seems hardly “fair”.

One queue per user starves heavy users

The alternative is a configuration where every user is given their own queue, and the system selects jobs from currently occupied queues, either using a round-robin or random-selection scheme. But this situation is prone to starvation problems, too. Imagine that there is one user who submits 10 to 100 jobs, but there are several thousand users who submit a single job, each. In this case, the long queue will essentially not make progress until all the light queues have been drained. Again, this hardly seems “fair”.

What is “Fair”?

Of course, one can come up with some sort of ad-hoc scheme that might work in practice. (Group users into categories based on their usage patterns, then give each such group a dedicated queue. Possibly re-balance work among queues periodically.)

But such an ad-hoc solution simply feels unsatisfactory. Worse, it’s not even clear what criteria one should use to determine whether one solution is better than another one. As discussed earlier, the two proposed solutions can lead to behavior that seems unfair. But what is fair?

Obviously, we try to optimize something. But what? The number of users with pending jobs? The average (or maximum) wait time per job? Per user? The predictability of the time to wait? The fraction of pending jobs per user? Its variation across all users?

Curiously, it doesn’t help to push the question back to the customer. The requirements did not specify a prioritization scheme, but it is also not clear what the requirements should say. What behaviors would be desirable and how can they be measured? All this is quite unclear. (Keep in mind that the system in question is essentially a batch system, hence “responsiveness”, as in a graphical user interface, is not required.)

Still: the two schemes sketched earlier seem intuitively “unfair”. But why?

One might think that Game Theory might have something to offer in answering this question. I have studied the literature (to some extent), but did not seem to find anything applicable.

Blocking Others as Potential Cost

Interestingly, the fact that the system is free to use does not make the problem easier. The absence of costs and a currency to pay them in makes it impossible to create a “secondary market”, where heavy and light users might trade their access rights with each other.

In the absence of an external currency, one may search for a naturally arising, intrinsic “cost” that one user (or their jobs) impart on others. One possible candidate is blocking another job from running. If the machine is uncontested, then its use is indeed free. But if one user’s job is preventing another user’s job from running, then the first user is incurring a “cost”. In subsequent scheduling rounds, users who have incurred such debits are penalized compared to users who have not.

This approach seems promising, although it raises many questions itself. For example, if a light users job is running, then it is now blocking the heavy user’s jobs, leading to the light user being penalized in return. It is also not clear how to balance accounts between naturally heavy and naturally light users in the long run.

How to Discourage Waste

Finally, the existence of a “free” resource invites a problem of its own: it encourages waste. Some users may fill up the server with low-value jobs, even if they don’t expect to use the results. This is fine as it goes, but these low-value jobs do tend to crowd out the (presumptively high-value) jobs of light users.

One response is to make the resource not free, but require some form of payment (“tokens”) from the users. However, this leads to the question how tokens should initially be assigned to users — in particular, if some users have legitimately much higher need for the resource in question than others. One approach that does not work is to assign tokens based on previous usage patterns, because this does, in fact, merely reward and hence encourage wasteful usage patterns!

In Conclusion: Many Questions, No Answers

The questions sketched in this post seem natural and relevant. (And I have encountered comparable problems in the real world.)

Hence it is frustrating to concede that they don’t seem to have satisfactory answers. What is worse is that it is not even clear how to think about these problems properly. As long as we can’t even state what the desired behavior should be, we won’t be able to build a system that exhibits it!