Ingo Molnár: a Hyper-Threading-aware scheduler

On Wednesday I wrote about Intel's Hyper-Threading (HT) technology. Its essence is that a single physical CPU can "pretend" to be several (typically two) logical CPUs, and can run the threads of a multithreaded application in parallel on that one physical processor. This can yield performance gains of up to 40% for some applications. Ingo Molnár, author of the O(1) scheduler, has written a patch that lets the O(1) scheduler take advantage of the HT technology as well.

"A Hyper-Threading technológia egy érdekes koncepció, amely teljes támogatást érdemel szerintem." - írja Ingó.

Some numbers from the tests:

Compiling floppy.c in an infinite loop takes 2.55 seconds per iteration. Starting two such loops in parallel on a P4 HT box with two physical CPUs of two logical CPUs each (four logical CPUs in total) gave the following results:

2.5.31-BK-curr: - fluctuates between 2.60 secs and 4.6 seconds.

BK-curr + sched-F3: - stable 2.60 sec results.

Tests were also run with kernel compilation:

Kernel compilation with "make -j2":

2.5.31-BK-curr: 45.3 sec

BK-curr + sched-F3: 41.3 sec

That is roughly a 10% improvement. Some of it may be attributable to this being the first version of the patch, but in any case a gain somewhere between 10% and 40% seems realistic.

Ingo's full letter:

From: Ingo Molnar

To: linux-kernel

Subject: [patch] "fully HT-aware scheduler" support, 2.5.31-BK-curr

Date: Tue, 27 Aug 2002 03:44:23 +0200 (CEST)

symmetric multithreading (hyperthreading) is an interesting new concept that IMO deserves full scheduler support. Physical CPUs can have multiple (typically 2) logical CPUs embedded, and can run multiple tasks 'in parallel' by utilizing fast hardware-based context-switching between the two register sets upon things like cache-misses or special instructions. To the OSs the logical CPUs are almost indistinguishable from physical CPUs. In fact the current scheduler treats each logical CPU as a separate physical CPU - which works but does not maximize multiprocessing performance on SMT/HT boxes.

The following properties have to be provided by a scheduler that wants to be 'fully HT-aware':

* HT-aware passive load-balancing: the irq-driven balancing has to be per-physical-CPU, not per-logical-CPU.

Otherwise it might happen that one physical CPU runs two tasks while another physical CPU runs none. The stock scheduler does not recognize this condition as an 'imbalance' - to the scheduler it appears as if the first two logical CPUs each had one task running and the second two had zero. The stock scheduler does not realize that the two logical CPUs belong to the same physical CPU.
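(Editorial illustration, not code from the patch: a minimal userspace sketch of per-physical-CPU load accounting. The pairwise CPU numbering and all names here are assumptions.)

    #include <stdio.h>

    #define NR_LOGICAL 4
    #define SIBLINGS   2            /* logical CPUs per physical package */

    /* tasks running on each logical CPU: the "1 and 1 vs 0 and 0" case */
    static int nr_running[NR_LOGICAL] = { 1, 1, 0, 0 };

    static int package_of(int cpu) { return cpu / SIBLINGS; }

    /* load as a per-physical-CPU balancer sees it: the sum over all
       sibling logical CPUs of a package */
    static int package_load(int pkg)
    {
        int load = 0;
        for (int cpu = 0; cpu < NR_LOGICAL; cpu++)
            if (package_of(cpu) == pkg)
                load += nr_running[cpu];
        return load;
    }

    int main(void)
    {
        /* per-logical view: no CPU runs more than one task, looks balanced;
           per-physical view: package 0 carries 2 tasks, package 1 none */
        for (int pkg = 0; pkg < NR_LOGICAL / SIBLINGS; pkg++)
            printf("package %d: load %d\n", pkg, package_load(pkg));
        return 0;
    }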

* 'active' load-balancing when a logical CPU goes idle and thus causes a physical CPU imbalance.

This is a mechanism that simply does not exist in the stock 1:1 scheduler - there, the imbalance caused by an idle CPU can be solved via the normal load-balancer. In the HT case the situation is special, because the source physical CPU might have just two tasks running, both runnable - a situation the stock load-balancer is unable to handle, since running tasks are hard to migrate away. But it's essential to do this - otherwise a physical CPU can get stuck running two tasks while another physical CPU stays idle.
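(Again an editorial sketch, not the patch: a toy model of active balancing at the physical-package level. The package counts and function names are invented for illustration.)

    #include <stdio.h>

    #define NR_PACKAGES 2

    /* runnable tasks per physical package: 2 on package 0, none on 1 */
    static int pkg_running[NR_PACKAGES] = { 2, 0 };

    /* if a whole package is idle while another runs two runnable tasks,
       forcibly move one over - the passive balancer never does this,
       because both source tasks are currently running */
    static void active_balance(void)
    {
        for (int dst = 0; dst < NR_PACKAGES; dst++) {
            if (pkg_running[dst] != 0)
                continue;                    /* destination must be idle */
            for (int src = 0; src < NR_PACKAGES; src++) {
                if (pkg_running[src] >= 2) {
                    pkg_running[src]--;      /* "migrate" one running task */
                    pkg_running[dst]++;
                    return;
                }
            }
        }
    }

    int main(void)
    {
        active_balance();
        for (int p = 0; p < NR_PACKAGES; p++)
            printf("package %d: %d task(s)\n", p, pkg_running[p]);
        return 0;
    }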

* HT-aware task pickup.

When the scheduler picks a new task, it should prefer tasks that are on the same physical CPU before trying to pull in tasks from other CPUs. The stock scheduler only picks tasks that were scheduled to that particular logical CPU.
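(Editorial sketch of this pickup order under an assumed per-physical-CPU queue; nothing here comes from the patch itself.)

    #include <stdio.h>

    #define NR_PACKAGES 2

    /* runnable tasks queued per physical package */
    static int queued[NR_PACKAGES] = { 0, 3 };

    /* a logical CPU first takes work from its own package's queue -
       which also holds tasks last run on its sibling - and only then
       pulls from other packages */
    static int pick_next(int my_pkg)
    {
        if (queued[my_pkg] > 0) {
            queued[my_pkg]--;
            return my_pkg;              /* preferred: local physical CPU */
        }
        for (int p = 0; p < NR_PACKAGES; p++)
            if (queued[p] > 0) {
                queued[p]--;
                return p;               /* fallback: pull from elsewhere */
            }
        return -1;                      /* nothing runnable anywhere */
    }

    int main(void)
    {
        printf("a CPU in package 0 pulled work from queue %d\n",
               pick_next(0));
        return 0;
    }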

* HT-aware affinity.

Tasks should attempt to 'stick' to physical CPUs, not logical CPUs.

* HT-aware wakeup.

again this is something completely new - the stock scheduler only knows about the 'current' CPU, it does not know about any sibling [== logical CPUs on the same physical CPU] logical CPUs. On HT, if a thread is woken up on a logical CPU that is already executing a task, and if a sibling CPU is idle, then the sibling CPU has to be woken up and has to execute the newly woken up task immediately.
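(One more editorial sketch: the sibling-wakeup rule above, under an assumed pairwise sibling numbering. None of these names appear in the patch.)

    #include <stdio.h>

    #define NR_LOGICAL 4

    /* tasks running per logical CPU: CPU 0 busy, its sibling CPU 1 idle */
    static int nr_running[NR_LOGICAL] = { 1, 0, 1, 1 };

    /* assumed sibling pairs: (0,1), (2,3), ... */
    static int sibling_of(int cpu) { return cpu ^ 1; }

    /* if the waking CPU is busy but its sibling is idle, the sibling
       is woken and executes the new task immediately */
    static int wake_target(int this_cpu)
    {
        int sib = sibling_of(this_cpu);
        if (nr_running[this_cpu] > 0 && nr_running[sib] == 0)
            return sib;
        return this_cpu;
    }

    int main(void)
    {
        printf("a task woken on CPU 0 runs on CPU %d\n", wake_target(0));
        return 0;
    }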

the attached patch (against 2.5.31-BK-curr) implements all the above HT-scheduling needs by introducing the concept of a shared runqueue: multiple CPUs can share the same runqueue. A shared, per-physical-CPU runqueue magically fulfills all the above HT-scheduling needs. Obviously this complicates scheduling and load-balancing somewhat (see the patch for details), so great care has been taken to not impact the non-HT schedulers (SMP, UP). In fact the SMP scheduler is a compile-time special case of the HT scheduler. (and the UP scheduler is a compile-time special case of the SMP scheduler)
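(Editorial sketch of the shared-runqueue idea itself: each logical CPU resolves its runqueue through a pointer, and HT siblings simply point at the same object - plain SMP is then the special case of one queue per CPU. The struct layout is invented for illustration.)

    #include <stdio.h>

    struct runqueue {
        int nr_running;
        /* the real scheduler also keeps priority arrays, a lock, ... */
    };

    static struct runqueue rq[2];           /* one queue per physical CPU */

    static struct runqueue *cpu_rq[4] = {
        &rq[0], &rq[0],   /* logical CPUs 0 and 1 share package 0's queue */
        &rq[1], &rq[1],   /* logical CPUs 2 and 3 share package 1's queue */
    };

    int main(void)
    {
        cpu_rq[0]->nr_running++;            /* enqueue via logical CPU 0 */
        printf("CPU 1 sees %d runnable task(s)\n",
               cpu_rq[1]->nr_running);      /* its sibling sees the task */
        return 0;
    }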

the patch is based on Jun Nakajima's prototyping work - the lowlevel x86/Intel bits are still those from Jun, the sched.c bits are newly implemented and generalized.

There's a single flexible interface for lowlevel boot code to set up physical CPUs: sched_map_runqueue(cpu1, cpu2) maps cpu2 into cpu1's runqueue. The patch also implements the lowlevel bits for P4 HT boxes for the 2/package case.
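(A hypothetical boot-time use of that interface: only the sched_map_runqueue(cpu1, cpu2) name and argument order come from the mail above; the helper, the loop and the pairwise sibling numbering are assumptions.)

    /* provided by the patch, per the mail: maps cpu2 into cpu1's runqueue */
    extern void sched_map_runqueue(int cpu1, int cpu2);

    /* hypothetical setup for the 2-siblings-per-package case */
    static void map_ht_siblings(int nr_logical_cpus)
    {
        for (int cpu = 0; cpu + 1 < nr_logical_cpus; cpu += 2)
            sched_map_runqueue(cpu, cpu + 1);
    }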

(NUMA systems whose tightly coupled CPUs have smaller caches protected by a large shared L3 cache might benefit from sharing a runqueue as well - but the target of this concept is SMT.)

some numbers:

compiling a standalone floppy.c in an infinite loop takes 2.55 seconds per iteration. Starting up two such loops in parallel, on a 2-physical, 2-logical (total of 4 logical CPUs) P4 HT box gives the following numbers:

2.5.31-BK-curr: - fluctuates between 2.60 secs and 4.6 seconds.

BK-curr + sched-F3: - stable 2.60 sec results.

the results under the stock scheduler depend on pure luck: on which CPUs the tasks get scheduled. In the HT-aware case each task gets scheduled on a separate physical CPU, all the time.

compiling the kernel source via "make -j2" [under-utilizes CPUs]:

2.5.31-BK-curr: 45.3 sec

BK-curr + sched-F3: 41.3 sec

ie. a ~10% improvement. The tests were the best results picked from lots of (>10) runs. The no-HT numbers fluctuate much more (again the randomness effect), so the average compilation time in the no-HT case is higher.

saturated compilation "make -j5" results are roughly equivalent, as expected - the one-runqueue-per-CPU concept works adequately when the number of tasks is larger than the number of logical CPUs. The stock scheduler works well on HT boxes in the boundary conditions: when there's 1 task running, and when there are more than nr_cpus tasks running.

the patch also unifies some of the other code and removes a few more #ifdef CONFIG_SMP branches from the scheduler proper.

(the patch compiles/boots/works just fine on UP and SMP as well, on the P4 box and on another PIII SMP box as well.)

Testreports, comments, suggestions welcome,

Ingo