                QNX-style Scheduling v1.21 for Linux 2.0/2.1

                                    by

                                Adam McKee


INTRODUCTION

This patch provides QNX-style scheduling for Linux 2.0.  The intent is to
provide more flexible and powerful scheduling, and to provide improved
interactive performance under heavy CPU load.  This scheduler does not provide
increased throughput, however; in fact, there is a small price to pay in
terms of throughput in order to achieve the aforementioned goals.  So that you
can appreciate exactly what the patch does, I will give an (oversimplified)
explanation of how scheduling is normally done under Linux 2.0, followed by a
description of the QNX scheduler (as implemented).  I have also included a
user-space utility 'qsched' that allows tuning of scheduling parameters.  I
provide a brief description of this utility and the parameters it allows you
to modify.


NORMAL SCHEDULING UNDER LINUX 2.0

There is a single run-queue.  Each task may be assigned a priority or
"niceness" which will determine how long a timeslice it gets.  For example, if
you give a task a niceness of -20, the kernel will not directly use the -20,
but will instead use this number to determine how long/often the task should
be allowed to run.  The "counter" attribute of each task determines how much
time it has left in its timeslice.  After every task on the run-queue has used
up its timeslice (counter = 0), the counter for each task is reset to the
original priority.  This scheduler has some nice properties:

	o It obeys the KISS principle (Keep It Simple, Stupid!).  There is
          always a danger that trying to get "too clever" will introduce
          unexpected problems.

	o It guarantees that no task will starve.

	o It allows the user to have some control over the scheduling.

Some drawbacks:

	o Interactive performance under heavy CPU load is not good.

	o Limited control over scheduling.
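The counter mechanism described above can be sketched in a few lines of C.
This is an illustrative simulation, not the kernel's actual code; the struct
and function names are my own, and the refill rule follows the simplified
description above (counter reset to the original priority):

```c
#include <stddef.h>

/* Illustrative sketch of Linux 2.0's counter-based scheduling.
 * Each task carries a 'priority' derived from its niceness and a
 * 'counter' of ticks remaining in its current timeslice. */
struct task {
    int priority;   /* ticks granted per epoch, derived from niceness */
    int counter;    /* ticks left in the current timeslice */
    int runnable;   /* non-zero if the task wants the CPU */
};

/* Pick the runnable task with the most timeslice left. */
struct task *pick_next(struct task *tasks, size_t n)
{
    struct task *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (tasks[i].runnable &&
            (best == NULL || tasks[i].counter > best->counter))
            best = &tasks[i];
    }
    return best;
}

/* Once every runnable task has exhausted its counter, start a new
 * epoch: refill each counter from the task's priority. */
void refill_counters(struct task *tasks, size_t n)
{
    for (size_t i = 0; i < n; i++)
        tasks[i].counter = tasks[i].priority;
}
```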


THE QNX SCHEDULER

My understanding of the QNX scheduler is based entirely on a short blurb I
read about it, so I would not be surprised to find that the following
discussion contains errors and/or omissions.  However, please do read
on... :-)

There are 32 separate run-queues, numbered 0-31.  When the scheduler is
looking for a task to run, it will select a task from the lowest-numbered
run-queue that has a runnable task on it.  This means that, for example, a
task on run-queue 1 will *not* run until there are no tasks on run-queue 0
that want to run.  The init task has a minimum run-queue of 15.  Newly
created tasks inherit their minimum run-queue from their parent.
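The queue-selection rule is simple to express in C.  A minimal sketch (the
bitmask representation is an assumption of mine, chosen because 32 queues fit
exactly in one 32-bit word -- the patch itself may do this differently):

```c
#include <stdint.h>

#define NUM_QUEUES 32

/* Bit q of 'ready_mask' is set when run-queue q holds at least one
 * runnable task.  The scheduler always serves the lowest-numbered
 * (i.e. highest-priority) non-empty queue. */
int lowest_ready_queue(uint32_t ready_mask)
{
    if (ready_mask == 0)
        return -1;                  /* nothing runnable at all */
    for (int q = 0; q < NUM_QUEUES; q++)
        if (ready_mask & (1u << q))
            return q;
    return -1;                      /* unreachable */
}
```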

Three scheduling policies are supported:

--- FIFO

The selected task will run until:

	o it blocks
	  -OR-
	o a task on a lower-numbered run-queue wants to run

--- Round-Robin

The selected task will run until:

	o it blocks
	  -OR-
	o a task on a lower-numbered run-queue wants to run
	  -OR-
	o 200ms have passed
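The FIFO and Round-Robin rules differ only in the timeslice check, so the
"must this task yield?" decision can be sketched as one function.  The names
and the exact form of the test are illustrative assumptions:

```c
/* Illustrative sketch of the FIFO/Round-Robin preemption rule. */
enum policy { POLICY_FIFO, POLICY_RR };

#define RR_TIMESLICE_MS 200

/* cur_queue    - run-queue of the running task
 * lowest_ready - lowest-numbered queue holding another runnable task,
 *                or -1 if no other task wants to run
 * ran_ms       - time the task has run since it was last scheduled */
int must_yield(enum policy p, int cur_queue, int lowest_ready, int ran_ms)
{
    /* Both policies yield to a task on a lower-numbered run-queue. */
    if (lowest_ready >= 0 && lowest_ready < cur_queue)
        return 1;
    /* Round-Robin additionally yields when the 200ms timeslice is up. */
    if (p == POLICY_RR && ran_ms >= RR_TIMESLICE_MS)
        return 1;
    return 0;
}
```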

--- Adaptive

This is the default policy, and the most interesting one.  Adaptive
scheduling is like Round-Robin, but it also tries to "intelligently" move
tasks between run-queues in order to provide good interactive response.
Here are the rules for adaptive scheduling:

	o Each task has a 'minimum run-queue' attribute that tells the
	  scheduler the lowest-numbered run-queue the task can be on.
	  "Normal" tasks have minimum run-queue = 15.

	o When a task is initially started, it inherits its current and
          minimum run-queue from its parent.  Something I have added: if the
	  parent is fork()'ing a lot (at least once per 15 seconds), its
	  children will start out demoted 1 run-queue (i.e. same min-run-queue
	  as the parent, but current run-queue = min-run-queue + 1).

	o If a task blocks (does not use all of its timeslice), it will
	  be placed on its minimum run-queue when it becomes runnable again.
	  I have implemented a slight variation on this: the task will be
	  promoted at most one run-queue for every 100ms that it blocks.  I
	  hope that this makes it effectively pointless for a task to do
	  system calls just to regain its CPU priority.  It should also result
	  in generally fairer scheduling.

	o If a task uses up all of its timeslice, and there is at least one
	  other task on the same run-queue that wants to run, its run-queue
	  will be incremented ("demotion").

	o If a task has been starving for one second, and its current run-queue
	  is greater than its minimum run-queue, its run-queue will be
	  decremented ("promotion").

The result of applying these rules is that tasks with heavy CPU requirements
will tend to migrate to higher-numbered run-queues, whereas tasks with light
CPU requirements will tend to stay on lower-numbered run-queues.  This is
*good* for interactive performance!
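The adaptive rules above can be sketched as a handful of C helpers.  This is
a simulation of the described behavior, not code from the patch; the struct
and all names are my own assumptions:

```c
/* Illustrative sketch of the adaptive-policy queue adjustments. */
struct atask {
    int cur_queue;   /* current run-queue (cur_queue >= min_queue) */
    int min_queue;   /* lowest queue the task may ever occupy */
    int max_queue;   /* highest queue the task may be demoted to */
};

/* Task used its whole timeslice while another task on the same queue
 * was runnable: demote it one queue (increment the queue number). */
void demote(struct atask *t)
{
    if (t->cur_queue < t->max_queue)
        t->cur_queue++;
}

/* Task has been starving (runnable but unscheduled) past the
 * threshold: promote it one queue (decrement the queue number). */
void promote(struct atask *t)
{
    if (t->cur_queue > t->min_queue)
        t->cur_queue--;
}

/* Task wakes after blocking for 'blocked_ms': promote at most one
 * queue per 100ms blocked, never past its minimum run-queue. */
void wake_after_block(struct atask *t, int blocked_ms)
{
    int steps = blocked_ms / 100;
    while (steps-- > 0 && t->cur_queue > t->min_queue)
        t->cur_queue--;
}
```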


USING 'qsched' TO TUNE SCHEDULING PARAMETERS

qsched allows you to change scheduling parameters for a running process, or
start a new process with modified scheduling parameters.  It accepts the
following arguments:

  -p : Scheduling policy. Only root may change the scheduling policy.

  -t : Timeslice length (in milliseconds).  This is the amount of CPU time
       a process may be allowed to use before it will be required to
       relinquish control.  Only root may increase the timeslice length.

  -f : Fork penalty (in seconds).  If the process fork()s an average of at
       least once every x seconds, its children will start out demoted a
       single run-queue.  -1 disables, 0 enables unconditionally.  Only root
       may decrease this parameter, but any user can set it to 0.

  -s : Starvation threshold (in milliseconds).  If a process is runnable but
       has not run for this long, it will be promoted 1 run-queue so that it
       will (temporarily) receive more favorable scheduling.  Only root may
       decrease this parameter.

  -m : Consecutive timeslices before demotion.  If a process has not blocked
       for x timeslices, it will become eligible for demotion.  As described
       above, it will not actually be demoted unless there is at least 1 other
       process on the same run-queue that wants to run.  Only root may
       increase this parameter.

  -l : Minimum run-queue.  This is the "base" run-queue for the process -- it
       will never be promoted beyond this run-queue.  Only root may decrease
       this parameter.

  -u : Maximum run-queue.  No matter how naughty the process is, it will never
       be demoted beyond this run-queue.  Only root may decrease this
       parameter.

  -n : Ignore SIGHUP.  This is intended for situations where a background job
       is being started from the shell.


GENERAL SCHEDULING TIPS

You must take care when changing the minimum run-queue.  Unlike the normal
Linux scheduler, this scheduler does not guarantee tasks will not starve.
When you change the minimum run-queue, you may shoot yourself in the foot in
a couple of different ways:

	o The task hogs the CPU and starves out other tasks (if you give it
	  a minimum run-queue below 15).  This can be enough to force you to
	  press the Panic Button[tm].

	o The task is starved out by higher-priority tasks (if you give it a
	  minimum run-queue above 15).

Here are a few general tips for setting scheduling parameters:

	o On a machine whose primary function is web-serving or news-serving,
	  you may want to put the httpd or innd task on a run-queue < 15.
	  Other tasks would then only be allowed to consume "left-over" CPU
	  time.

	o You should be able to achieve better responsiveness by placing tasks
	  like 'gpm' and your X-server on a run-queue < 15.

	o When starting a non-time-critical CPU-intensive job that may take
	  awhile to complete, you may want to put it on a run-queue > 15 to
	  ensure the absolute minimum impact on interactive performance.  When
	  compiling a kernel, you might do 'qsched -l 16 make zlilo' -- users
	  of the system might not even notice any slowdown!

        o It's almost never a good idea to put a totally CPU-intensive task
          on a run-queue less than 15.  This will most likely result in an
	  unusable system.

In general, don't set scheduling parameters unless you understand how the
scheduler works, and you can really convince yourself that it's a good idea.
Like most tools, 'qsched' can be used for Good or Evil.


NOTE: Don't be surprised if you find that your load average tends to be higher
than it was with the normal Linux scheduler -- this is a perfectly normal
effect.  If it tends to be a *lot* higher, then you may want to reconsider
how you schedule tasks.


CONCLUSION

Please let me know how well this patch works for you.  If you have criticism
and/or ideas to make the patch better and more robust, I would be particularly
interested to hear them.

Happy task-switching :-)

	-- Adam McKee		<amckee@iname.com>
