New FreeBSD process scheduler released

Luigi Rizzo has implemented a new, weight-based process scheduler for FreeBSD-stable; porting the scheduler to FreeBSD-current is still in progress. The "Proportional Share scheduler" is available at the following link:

http://info.iet.unipi.it/~luigi/ps_sched.20020719a.diff

The patch targets recent -STABLE systems and will likely work on any release newer than 4.4. Three files are modified in total: kern_synch.c, kern_switch.c, and proc.h, plus a one-line change in kern_exit.c.

Rizzo tested the patch on a diskless machine, where it survived a full X session with Netscape, xterm, and the like. Even so, it is NOT recommended for production systems; it is still in the testing phase.

The announcement:

**************************************************

Date: Thu, 18 Jul 2002 16:31:57 -0700

From: Luigi Rizzo

To: arch@FreeBSD.ORG

Subject: NEW SCHEDULER available (was Re: proposed changes to kern_switch.c and kern_synch.c)

Hi,

as promised, a first version of the Proportional Share scheduler

that we developed is available at

http://info.iet.unipi.it/~luigi/ps_sched.20020719a.diff

These are for a recent -STABLE (i think any version from 4.4 should

work; the only 3 files modified are kern_synch.c, kern_switch.c and

proc.h, plus a one-line change to kern_exit.c).

I have tested it a little bit on a diskless system, and it seems

to survive running a full X session with the usual set of xterm,

netscape etc. while i do a "renice" of the processes and even switch

back and forth between schedulers. But do not trust this yet for a

production system!

The sysctl variable kern.scheduler controls the scheduler in use.

kern.scheduler=1 (default) selects Proportional Share

kern.scheduler=0 selects the Feedback Priority ("classic BSD")

You can switch between the two schedulers by changing the value of

the sysctl variable.
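
[Editor's note: the switch can also be done programmatically. Below is a minimal userland sketch assuming only the kern.scheduler sysctl described above; it is not part of the patch.]

/* schedsw.c -- flip kern.scheduler and report the previous value */
#include <sys/types.h>
#include <sys/sysctl.h>

#include <stdio.h>
#include <stdlib.h>

int
main(int argc, char *argv[])
{
	int new, old;
	size_t oldlen = sizeof(old);

	if (argc != 2) {
		fprintf(stderr, "usage: schedsw 0|1\n");
		return (1);
	}
	/* 0 = Feedback Priority ("classic BSD"), 1 = Proportional Share */
	new = atoi(argv[1]);
	if (sysctlbyname("kern.scheduler", &old, &oldlen,
	    &new, sizeof(new)) == -1) {
		perror("sysctlbyname");
		return (1);
	}
	printf("scheduler: %d -> %d\n", old, new);
	return (0);
}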

To change the "weight" associated to a process, use the "nice" and

"renice" commands. As usual, positive values (+1..+20) mean the

process will get a smaller share of the CPU, negative values (-1..-20)

will get a larger share. The actual weights (which control the

relative fraction of CPU cycles given to each process) are assigned

through a lookup table which maps the value to the range

1 ... 100 ... 1000 (min, normal, max weight).
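
[Editor's note: the announcement does not show the table itself. A hypothetical mapping in the same spirit follows; the values below are invented, the patch defines its own.]

/* Map nice (-20..+20) to a weight in 1..100..1000; a larger weight
 * means a larger share of the CPU. Illustrative values only. */
static const int nice2weight[41] = {
	/* nice -20 .. -1: larger-than-normal shares */
	1000, 900, 810, 730, 660, 590, 530, 480, 430, 390,
	 350, 310, 280, 250, 230, 200, 180, 160, 140, 120,
	/* nice 0: the normal weight */
	 100,
	/* nice +1 .. +20: smaller-than-normal shares */
	  90,  80,  72,  65,  58,  52,  47,  42,  38,  34,
	  30,  25,  20,  16,  12,   9,   6,   4,   2,   1,
};

#define	NICE_TO_WEIGHT(nice)	(nice2weight[(nice) + 20])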

The "old" scheduler (Feedback Priority) should be as robust and

stable as always, whereas there are still a few things to cleanup

in the Proportional Share scheduler, namely:

* I have not looked in detail at the SMP case, so do not run

it on an SMP kernel;

* given that there are no priorities, processes woken up by a

tsleep() are not guaranteed to run before all processes

with p_priority >= PUSER; however, they are given a shorter

deadline so they are likely to run first.

* RTPRI and IDLEPRI processes do not have yet any special treatment

(they all end up together with normal processes; this is trivial

to fix, i just haven't had time to look at that).
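
[Editor's note: the second point above mentions deadlines. Conceptually, a proportional share scheduler of this kind runs the process with the earliest virtual deadline, and a process's deadline advances more slowly the larger its weight. A rough, hypothetical sketch, not the patch's actual code:]

/* After a process consumes a time slice, push its deadline out by an
 * amount inversely proportional to its weight: heavier processes come
 * back sooner and so receive a larger share of the CPU. A freshly
 * woken process can be handed a deadline near the current virtual
 * clock so it tends to run first. */
struct ps_proc {
	unsigned long long	deadline;	/* virtual finish time */
	int			weight;		/* 1..1000 via the nice table */
};

static unsigned long long	virtual_clock;

static void
ps_charge(struct ps_proc *p, unsigned long long slice)
{
	p->deadline = virtual_clock + slice * 1000ULL / p->weight;
}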

Have fun, and please if you use it let me know of any bugs and

possibly suggestions to adapt it to -current.

cheers

luigi

On Tue, Jul 16, 2002 at 11:52:16PM -0700, Luigi Rizzo wrote:

> Hi,

> we have implemented a weight-based process scheduler

> for FreeBSD-stable, and are trying to port it to -current (in the

> description below replace "process" with "thread/kse" as appropriate

> for the -current case).

>

> In order to make this work, it is convenient to have all

> scheduler-specific functions and data structures in a

> single file (kern_switch*.c), and generic support in

> another one (kern_synch.c). I believe this was also the original

> BSD design in partitioning the code between the two files.

> However, in our current code, there are some functions which

> are scheduler-specific (see below) which are misplaced.

>

> So I would like to make the following, purely cosmetic, change

> identified as step #1 below, both for consistency with what i

> believe to be the original design, and to simplify further work

> in this area. These would go to both -current and -stable; I

> repeat, they are only cosmetic changes.

>

> Comments ? I already spoke to julian who has no objections.

>

> For those interested, a patch for -current is attached, and the one

> for -stable is very similar (for the records, in -stable we have a

> fully functional weight-based process scheduler, where you still

> control the weights using "nice", and you can switch between the

> old and the new scheduler at any time using a sysctl variable).

>

> --- Re. the multi-scheduler architecture ---

>

> The general idea is to make the process/thread/kse scheduler

> a replaceable piece of the kernel, requiring no modifications

> to the "struct proc", and with the ability of switching from

> one scheduler to another one at runtime (this both for testing

> purposes and for whatever need may arise).

>

>

> The way to achieve this would be the following:

>

> 1. identify all scheduler-specific functions, put all of them

> in one file (e.g. kern_switch.c for the default scheduler),

> and define function pointers for all of them;

>

> 2. use one field in "struct proc" as a link to whatever

> scheduler-specific data structures are necessary.

> In -stable, we have the p_pad3 field that can be used

> for this purpose, much like the if_index field in struct ifnet.

>

> 3. implement a function which, under control of a sysctl call,

> activate a new scheduler (by initialising all the function

> pointers to appropriate values) and moves all processes from

> the old to the new one.
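
[Editor's note: a toy userland model of steps 1-3, with invented names. Two stand-in schedulers are swapped at runtime through a function-pointer table, the way the proposed sysctl switch would swap the real ones.]

#include <stdio.h>

struct sched_ops {
	const char *name;
	int (*pick_next)(const int *weights, int n);	/* returns index */
};

/* "classic" stand-in: always picks the first runnable entry */
static int
pick_first(const int *w, int n)
{
	(void)w; (void)n;
	return (0);
}

/* proportional-share stand-in: picks the largest weight */
static int
pick_heaviest(const int *w, int n)
{
	int i, best = 0;

	for (i = 1; i < n; i++)
		if (w[i] > w[best])
			best = i;
	return (best);
}

static struct sched_ops schedulers[] = {
	{ "feedback-priority", pick_first },
	{ "proportional-share", pick_heaviest },
};
static struct sched_ops *active = &schedulers[1];	/* default = 1, as above */

int
main(void)
{
	int weights[] = { 100, 1000, 1 };	/* three "processes" */

	printf("%s picks process %d\n", active->name,
	    active->pick_next(weights, 3));
	active = &schedulers[0];		/* the sysctl flip, modeled */
	printf("%s picks process %d\n", active->name,
	    active->pick_next(weights, 3));
	return (0);
}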

>

> Step #1 is basically a cosmetic change, requiring mostly a move of

> some functions from kern_synch.c to kern_switch.c. These are

>

> roundrobin();

>

> curpriority_cmp();

>

> resetpriority();

>

> parts of schedcpu() and schedclock();

>

> cheers

> luigi

> ----------------------------------

>

> Index: kern_switch.c

> ===================================================================

> RCS file: /home/ncvs/src/sys/kern/kern_switch.c,v

> retrieving revision 1.33

> diff -u -r1.33 kern_switch.c

> --- kern_switch.c	14 Jul 2002 03:43:33 -0000	1.33

> +++ kern_switch.c	16 Jul 2002 22:15:06 -0000

> @@ -97,6 +97,7 @@

> #include

> #include

> #include

> +#include

> #include

>

> CTASSERT((RQB_BPW * RQB_LEN) == RQ_NQS);

> @@ -107,6 +108,9 @@

> static struct runq runq;

> SYSINIT(runq, SI_SUB_RUN_QUEUE, SI_ORDER_FIRST, runq_init, &runq)

>

> +static struct callout roundrobin_callout;

> +

> +static void	roundrobin(void *arg);

> static void runq_readjust(struct runq *rq, struct kse *ke);

> /************************************************************************

>  * Functions that manipulate runnability from a thread perspective.	*

> @@ -442,9 +446,13 @@

> {

> int i;

>

> +callout_init(&roundrobin_callout, 0);

> +

> bzero(rq, sizeof *rq);

> for (i = 0; i < RQ_NQS; i++)

> 	TAILQ_INIT(&rq->rq_queues[i]);

> +

> +roundrobin(NULL);

> }

>

> /*

> @@ -719,3 +727,207 @@

> }

> #endif

>

> +/*

> + * -- formerly in kern_synch.c

> + */

> +

> +int curpriority_cmp(struct thread *td);

> +

> +int

> +curpriority_cmp(struct thread *td)

> +{

> +return (td->td_priority - curthread->td_priority);

> +}

> +

> +/*

> + * Force switch among equal priority processes every 100ms.

> + * We don't actually need to force a context switch of the current process.

> + * The act of firing the event triggers a context switch to softclock() and

> + * then switching back out again which is equivalent to a preemption, thus

> + * no further work is needed on the local CPU.

> + */

> +/* ARGSUSED */

> +static void

> +roundrobin(arg)

> +void *arg;

> +{

> +

> +#ifdef SMP

> +mtx_lock_spin(&sched_lock);

> +forward_roundrobin();

> +mtx_unlock_spin(&sched_lock);

> +#endif

> +

> +callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);

> +}

> +

> +/* calculations for digital decay to forget 90% of usage in 5*loadav sec */

> +#define loadfactor(loadav) (2 * (loadav))

> +#define decay_cpu(loadfac, cpu) (((loadfac) * (cpu)) / ((loadfac) + FSCALE))

> +extern fixpt_t ccpu;

> +#define CCPU_SHIFT	11	/* XXX dup from kern_synch.c! */

> +

> +/*

> + * Recompute process priorities, every hz ticks.

> + * MP-safe, called without the Giant mutex.

> + */

> +void schedcpu1(struct ksegrp *kg);

> +/* ARGSUSED */

> +void

> +schedcpu1(struct ksegrp *kg)

> +{

> +register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);

> +struct thread *td;

> +struct kse *ke;

> +int realstathz;

> +int awake;

> +

> + realstathz = stathz ? stathz : hz;

> +

> +awake = 0;

> +FOREACH_KSE_IN_GROUP(kg, ke) {

> +/*

> + * Increment time in/out of memory and sleep

> + * time (if sleeping). We ignore overflow;

> + * with 16-bit int's (remember them?)

> + * overflow takes 45 days.

> + */

> +/* XXXKSE **WRONG***/

> +/*

> + * the kse slptimes are not touched in wakeup

> + * because the thread may not HAVE a KSE

> + */

> +if ((ke->ke_state == KES_ONRUNQ) ||

> + ((ke->ke_state == KES_THREAD) &&

> + (ke->ke_thread->td_state == TDS_RUNNING))) {

> +ke->ke_slptime++;

> +} else {

> +ke->ke_slptime = 0;

> +awake = 1;

> +}

> +

> +/*

> + * pctcpu is only for ps?

> + * Do it per kse.. and add them up at the end?

> + * XXXKSE

> + */

> +ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;

> +/*

> + * If the kse has been idle the entire second,

> + * stop recalculating its priority until

> + * it wakes up.

> + */

> +if (ke->ke_slptime > 1) {

> +continue;

> +}

> +

> +#if	(FSHIFT >= CCPU_SHIFT)

> +ke->ke_pctcpu += (realstathz == 100) ?

> +	((fixpt_t) ke->ke_cpticks) << (FSHIFT - CCPU_SHIFT) :

> +	100 * (((fixpt_t) ke->ke_cpticks) << (FSHIFT - CCPU_SHIFT)) / realstathz;

> +#else

> +ke->ke_pctcpu += ((FSCALE - ccpu) *

> + (ke->ke_cpticks * FSCALE / realstathz)) >>

> + FSHIFT;

> +#endif

> +ke->ke_cpticks = 0;

> +} /* end of kse loop */

> +if (awake == 0) {

> +kg->kg_slptime++;

> +} else {

> +kg->kg_slptime = 0;

> +}

> +kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);

> +resetpriority(kg);

> +FOREACH_THREAD_IN_GROUP(kg, td) {

> +int changedqueue;

> +if (td->td_priority >= PUSER) {

> +/*

> + * Only change the priority

> + * of threads that are still at their

> + * user priority.

> + * XXXKSE This is problematic

> + * as we may need to re-order

> + * the threads on the KSEG list.

> + */

> +changedqueue =

> + ((td->td_priority / RQ_PPQ) !=

> + (kg->kg_user_pri / RQ_PPQ));

> +

> +td->td_priority = kg->kg_user_pri;

> +if (changedqueue &&

> + td->td_state == TDS_RUNQ) {

> +/* this could be optimised */

> +remrunqueue(td);

> +td->td_priority =

> + kg->kg_user_pri;

> +setrunqueue(td);

> +} else {

> +td->td_priority = kg->kg_user_pri;

> +}

> +}

> +}

> +}

> +

> +/*

> + * Compute the priority of a process when running in user mode.

> + * Arrange to reschedule if the resulting priority is better

> + * than that of the current process.

> + */

> +void

> +resetpriority(kg)

> +register struct ksegrp *kg;

> +{

> +register unsigned int newpriority;

> +struct thread *td;

> +

> +mtx_lock_spin(&sched_lock);

> +if (kg->kg_pri_class == PRI_TIMESHARE) {

> +newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +

> + NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);

> +newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),

> + PRI_MAX_TIMESHARE);

> +kg->kg_user_pri = newpriority;

> +}

> +FOREACH_THREAD_IN_GROUP(kg, td) {

> +maybe_resched(td);	/* XXXKSE silly */

> +}

> +mtx_unlock_spin(&sched_lock);

> +}

> +

> +#if 0

> +/*

> + * We adjust the priority of the current process. The priority of

> + * a process gets worse as it accumulates CPU time. The cpu usage

> + * estimator (p_estcpu) is increased here. resetpriority() will

> + * compute a different priority each time p_estcpu increases by

> + * INVERSE_ESTCPU_WEIGHT

> + * (until MAXPRI is reached). The cpu usage estimator ramps up

> + * quite quickly when the process is running (linearly), and decays

> + * away exponentially, at a rate which is proportionally slower when

> + * the system is busy. The basic principle is that the system will

> + * 90% forget that the process used a lot of CPU time in 5 * loadav

> + * seconds. This causes the system to favor processes which haven't

> + * run much recently, and to round-robin among other processes.

> + */

> +void

> +schedclock(td)

> +struct thread *td;

> +{

> +struct kse *ke;

> +struct ksegrp *kg;

> +

> +KASSERT((td != NULL), ("schedlock: null thread pointer"));

> +ke = td->td_kse;

> +kg = td->td_ksegrp;

> +ke->ke_cpticks++;

> +kg->kg_estcpu = ESTCPULIM(kg->kg_estcpu + 1);

> +if ((kg->kg_estcpu % INVERSE_ESTCPU_WEIGHT) == 0) {

> +resetpriority(kg);

> +if (td->td_priority >= PUSER)

> +td->td_priority = kg->kg_user_pri;

> +}

> +}

> +#endif

> Index: kern_synch.c

> ===================================================================

> RCS file: /home/ncvs/src/sys/kern/kern_synch.c,v

> retrieving revision 1.185

> diff -u -r1.185 kern_synch.c

> --- kern_synch.c	14 Jul 2002 03:43:33 -0000	1.185

> +++ kern_synch.c	16 Jul 2002 22:16:46 -0000

> @@ -67,6 +67,8 @@

>

> #include

>

> +void	schedcpu1(struct ksegrp *kg);	/* XXX in kern_switch.c */

> +

> static void sched_setup(void *dummy);

> SYSINIT(sched_setup, SI_SUB_KICK_SCHEDULER, SI_ORDER_FIRST, sched_setup, NULL)

>

> @@ -76,7 +78,6 @@

>

> static struct callout loadav_callout;

> static struct callout schedcpu_callout;

> -static struct callout roundrobin_callout;

>

> struct loadavg averunnable =

> { {0, 0, 0}, FSCALE };	/* load average, of runnable procs */

> @@ -92,7 +93,6 @@

>

> static void	endtsleep(void *);

> static void	loadav(void *arg);

> -static void	roundrobin(void *arg);

> static void	schedcpu(void *arg);

>

> static int

> @@ -135,28 +135,6 @@

> }

>

> /*

> - * Force switch among equal priority processes every 100ms.

> - * We don't actually need to force a context switch of the current process.

> - * The act of firing the event triggers a context switch to softclock() and

> - * then switching back out again which is equivalent to a preemption, thus

> - * no further work is needed on the local CPU.

> - */

> -/* ARGSUSED */

> -static void

> -roundrobin(arg)

> -void *arg;

> -{

> -

> -#ifdef SMP

> -mtx_lock_spin(&sched_lock);

> -forward_roundrobin();

> -mtx_unlock_spin(&sched_lock);

> -#endif

> -

> -callout_reset(&roundrobin_callout, sched_quantum, roundrobin, NULL);

> -}

> -

> -/*

> * Constants for digital decay and forget:

> * 90% of (p_estcpu) usage in 5 * loadav time

> * 95% of (p_pctcpu) usage in 60 seconds (load insensitive)

> @@ -225,7 +203,7 @@

> #define	decay_cpu(loadfac, cpu)	(((loadfac) * (cpu)) / ((loadfac) + FSCALE))

>

> /* decay 95% of `p_pctcpu' in 60 seconds; see CCPU_SHIFT before changing */

> -static fixpt_t	ccpu = 0.95122942450071400909 * FSCALE;	/* exp(-1/20) */

> +fixpt_t	ccpu = 0.95122942450071400909 * FSCALE;	/* exp(-1/20) */

> SYSCTL_INT(_kern, OID_AUTO, ccpu, CTLFLAG_RD, &ccpu, 0, "");

>

> /* kernel uses `FSCALE', userland (SHOULD) use kern.fscale */

> @@ -255,105 +233,15 @@

> schedcpu(arg)

> void *arg;

> {

> -register fixpt_t loadfac = loadfactor(averunnable.ldavg[0]);

> -struct thread *td;

> struct proc *p;

> -struct kse *ke;

> struct ksegrp *kg;

> -int realstathz;

> -int awake;

>

> -realstathz = stathz ? stathz : hz;

> sx_slock(&allproc_lock);

> FOREACH_PROC_IN_SYSTEM(p) {

> mtx_lock_spin(&sched_lock);

> p->p_swtime++;

> FOREACH_KSEGRP_IN_PROC(p, kg) {

> -awake = 0;

> -FOREACH_KSE_IN_GROUP(kg, ke) {

> -/*

> - * Increment time in/out of memory and sleep

> - * time (if sleeping). We ignore overflow;

> - * with 16-bit int's (remember them?)

> - * overflow takes 45 days.

> - */

> -/* XXXKSE **WRONG***/

> -/*

> - * the kse slptimes are not touched in wakeup

> - * because the thread may not HAVE a KSE

> - */

> -if ((ke->ke_state == KES_ONRUNQ) ||

> - ((ke->ke_state == KES_THREAD) &&

> - (ke->ke_thread->td_state == TDS_RUNNING))) {

> -ke->ke_slptime++;

> -} else {

> -ke->ke_slptime = 0;

> -awake = 1;

> -}

> -

> -/*

> - * pctcpu is only for ps?

> - * Do it per kse.. and add them up at the end?

> - * XXXKSE

> - */

> -ke->ke_pctcpu = (ke->ke_pctcpu * ccpu) >> FSHIFT;

> -/*

> - * If the kse has been idle the entire second,

> - * stop recalculating its priority until

> - * it wakes up.

> - */

> -if (ke->ke_slptime > 1) {

> -continue;

> -}

> -

> -#if	(FSHIFT >= CCPU_SHIFT)

> -ke->ke_pctcpu += (realstathz == 100) ?

> -	((fixpt_t) ke->ke_cpticks) << (FSHIFT - CCPU_SHIFT) :

> -	100 * (((fixpt_t) ke->ke_cpticks) << (FSHIFT - CCPU_SHIFT)) / realstathz;

> -#else

> -ke->ke_pctcpu += ((FSCALE - ccpu) *

> - (ke->ke_cpticks * FSCALE / realstathz)) >>

> - FSHIFT;

> -#endif

> -ke->ke_cpticks = 0;

> -} /* end of kse loop */

> -if (awake == 0) {

> -kg->kg_slptime++;

> -} else {

> -kg->kg_slptime = 0;

> -}

> -kg->kg_estcpu = decay_cpu(loadfac, kg->kg_estcpu);

> - resetpriority(kg);

> -FOREACH_THREAD_IN_GROUP(kg, td) {

> -int changedqueue;

> -if (td->td_priority >= PUSER) {

> -/*

> - * Only change the priority

> - * of threads that are still at their

> - * user priority.

> - * XXXKSE This is problematic

> - * as we may need to re-order

> - * the threads on the KSEG list.

> - */

> -changedqueue =

> - ((td->td_priority / RQ_PPQ) !=

> - (kg->kg_user_pri / RQ_PPQ));

> -

> -td->td_priority = kg->kg_user_pri;

> -if (changedqueue &&

> - td->td_state == TDS_RUNQ) {

> -/* this could be optimised */

> -remrunqueue(td);

> -td->td_priority =

> - kg->kg_user_pri;

> -setrunqueue(td);

> -} else {

> -td->td_priority = kg->kg_user_pri;

> -}

> -}

> -}

> +schedcpu1(kg);

> } /* end of ksegrp loop */

> mtx_unlock_spin(&sched_lock);

> } /* end of process loop */

> @@ -944,32 +832,6 @@

> }

>

> /*

> - * Compute the priority of a process when running in user mode.

> - * Arrange to reschedule if the resulting priority is better

> - * than that of the current process.

> - */

> -void

> -resetpriority(kg)

> -register struct ksegrp *kg;

> -{

> -register unsigned int newpriority;

> -struct thread *td;

> -

> -mtx_lock_spin(&sched_lock);

> -if (kg->kg_pri_class == PRI_TIMESHARE) {

> -newpriority = PUSER + kg->kg_estcpu / INVERSE_ESTCPU_WEIGHT +

> - NICE_WEIGHT * (kg->kg_nice - PRIO_MIN);

> -newpriority = min(max(newpriority, PRI_MIN_TIMESHARE),

> - PRI_MAX_TIMESHARE);

> -kg->kg_user_pri = newpriority;

> -}

> -FOREACH_THREAD_IN_GROUP(kg, td) {

> -maybe_resched(td);	/* XXXKSE silly */

> -}

> -mtx_unlock_spin(&sched_lock);

> -}

> -

> -/*

> * Compute a tenex style load average of a quantity on

> * 1, 5 and 15 minute intervals.

> * XXXKSE Needs complete rewrite when correct info is available.

> @@ -1022,11 +884,9 @@

> {

>

> callout_init(&schedcpu_callout, 1);

> -callout_init(&roundrobin_callout, 0);

> callout_init(&loadav_callout, 0);

>

> /* Kick off timeout driven events by calling first time. */

> -roundrobin(NULL);

> schedcpu(NULL);

> loadav(NULL);

> }

>

> ----- End forwarded message -----
