Judging by the comments in the code and my own experience, the conventional wisdom is to hold the lock for the entire bulk operation in order to maximize throughput. Releasing and re-acquiring the lock may increase the number of cache misses: in the non-contended case, the lock itself may be evicted, and in the contended case, the entire data structure may ping-pong between CPUs. Thus, from the perspective of a given critical section, throughput is maximized by completing the largest possible amount of work per lock acquisition.
It turns out, however, that this may not be the best approach for minimizing the execution time of parallel applications. For the set of benchmarks below, there is generally a marginal performance improvement when the zone lock is released after each iteration. My intuition is that this is because smaller allocations and frees spend less time waiting on larger ones, allowing more threads to make progress sooner.
Essentially, then, optimizing for latency at a slight cost to throughput within the zone allocator decreases execution time for these workloads.
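To make the trade-off concrete, here is a small userspace analogue (not kernel code; all names here are invented for illustration): a consumer drains a batch of items from a shared free list either while holding the lock across the whole batch, or by re-acquiring the lock for each item so that other threads' short operations are not stuck behind the entire batch.

#include <pthread.h>
#include <stdlib.h>

struct item {
    struct item *next;
};

static struct item *free_list;
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Pattern 1: hold the lock across the whole batch.  Maximum work per
 * lock acquisition, but any other thread waits for the entire batch. */
static void drain_bulk_hold(int count)
{
    pthread_mutex_lock(&list_lock);
    while (count-- && free_list) {
        struct item *it = free_list;
        free_list = it->next;
        free(it);
    }
    pthread_mutex_unlock(&list_lock);
}

/* Pattern 2: re-acquire the lock for each item.  Slightly more locking
 * overhead, but a waiting thread can slip in between iterations. */
static void drain_bulk_release(int count)
{
    while (count--) {
        struct item *it;

        pthread_mutex_lock(&list_lock);
        it = free_list;
        if (it)
            free_list = it->next;
        pthread_mutex_unlock(&list_lock);

        if (!it)
            break;
        free(it);
    }
}

int main(void)
{
    /* Populate the list with a few items, then drain with each pattern. */
    for (int i = 0; i < 16; i++) {
        struct item *it = malloc(sizeof(*it));
        it->next = free_list;
        free_list = it;
    }
    drain_bulk_hold(8);
    drain_bulk_release(8);
    return 0;
}

The patch against mm/page_alloc.c in Linux 2.6.23.1 follows.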
diff -uprN linux-2.6.23.1/mm/page_alloc.c linux-2.6.23.1-opt/mm/page_alloc.c
--- linux-2.6.23.1/mm/page_alloc.c 2007-10-12 11:43:44.000000000 -0500
+++ linux-2.6.23.1-opt/mm/page_alloc.c 2007-10-29 18:29:05.000000000 -0500
@@ -477,19 +477,19 @@ static inline int free_pages_check(struc
static void free_pages_bulk(struct zone *zone, int count,
struct list_head *list, int order)
{
- spin_lock(&zone->lock);
zone->all_unreclaimable = 0;
zone->pages_scanned = 0;
while (count--) {
struct page *page;
+ spin_lock(&zone->lock);
VM_BUG_ON(list_empty(list));
page = list_entry(list->prev, struct page, lru);
/* have to delete it as __free_one_page list manipulates */
list_del(&page->lru);
__free_one_page(page, zone, order);
+ spin_unlock(&zone->lock);
}
- spin_unlock(&zone->lock);
}
static void free_one_page(struct zone *zone, struct page *page, int order)
@@ -665,14 +665,17 @@ static int rmqueue_bulk(struct zone *zon
{
int i;
- spin_lock(&zone->lock);
for (i = 0; i < count; ++i) {
- struct page *page = __rmqueue(zone, order);
+ struct page *page;
+ spin_lock(&zone->lock);
+
+ page = __rmqueue(zone, order);
if (unlikely(page == NULL))
break;
list_add_tail(&page->lru, list);
+ spin_unlock(&zone->lock);
}
- spin_unlock(&zone->lock);
+
return i;
}
It turns out that the optimization helps performance both with locks and with transactions, hence the motivation for submitting it back to the kernel developers.
Simulation parameters: I simulated 8-, 16-, and 32-CPU SMP (Pentium 4) systems with a clock speed of 1 GHz and an IPC of 1. Each CPU has a private L1 and L2 cache. Each L1 cache has 256 64-byte lines with an associativity of 4 and a 0-cycle access time. Each L2 cache has 64k 64-byte lines with an associativity of 8 and an access time of 16 cycles. Main memory is 1 GB and has a 200-cycle access latency. The hard drive is modeled with a fixed 5.5 ms latency. I realize that the fixed IPC and disk latency are particularly unsatisfying limitations of the simulator.
These comparisons were made using kernel version 2.6.16.1. To exercise the kernel, I ran the following parallel workloads:
Name | Description |
---|---|
find | 32 instances of the find command, each in a different directory, searching files from the Linux 2.6.16 kernel for a text string that is not found. Each directory is 4.6--5.0MB and contains 333--751 files and 144--254 directories. |
pmake | Runs make -j with twice the number of processors to compile 27 source files totaling 6,031 lines of code from the libFLAC 1.1.2 source tree in parallel. |
dpunish | A locally developed micro-benchmark to stress synchronization in the VFS directory entry cache. Parallel lookups and renames across multiple, memory-based file systems. |
bonnie++ | Simulates file system bottleneck activity on Squid and INN servers, stressing create/stat/unlink. 32 instances of: bonnie++ -d /var/local -n 1. Runs with disk delay disabled. |
MAB | Runs one instance per processor of the Modified Andrew Benchmark, without the compile phase. |
Config | Run several parallel instances of the configure script for a large software package, one for each processor. |
The mean execution time (in seconds), standard deviation, speedup, and 95% confidence intervals (for 4 runs using pseudo-random L2 cache miss perturbation as described by Alameldeen and Wood) are presented below. Data from asterisked experiments (*) are available but too noisy to be helpful.
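Speedup here is the unmodified kernel's mean time divided by the optimized kernel's mean time (e.g., 3.291 / 3.351 ≈ .982 for pmake on 8 CPUs), and the reported 95% confidence intervals are consistent with a normal-approximation half-width of 1.96·s/√n. A minimal sketch of that calculation (the sample values are made up):

#include <math.h>
#include <stdio.h>

/* Mean, sample standard deviation, and 95% confidence half-width
 * (normal approximation, 1.96 * s / sqrt(n)) for a set of
 * execution-time samples. */
static void summarize(const double *x, int n)
{
    double sum = 0.0, sq = 0.0, mean, stdev, ci95;
    int i;

    for (i = 0; i < n; i++)
        sum += x[i];
    mean = sum / n;

    for (i = 0; i < n; i++)
        sq += (x[i] - mean) * (x[i] - mean);
    stdev = sqrt(sq / (n - 1));
    ci95 = 1.96 * stdev / sqrt(n);

    printf("mean %.3f  stdev %.3f  95%% CI +/- %.3f\n", mean, stdev, ci95);
}

int main(void)
{
    double runs[4] = { 5.77, 5.81, 5.74, 5.77 };   /* four hypothetical runs */

    summarize(runs, 4);
    return 0;
}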
8 CPU
Metric | find* | | pmake | | dpunish | | bonnie++* | | MAB | | config | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
 | unmod | opt | unmod | opt | unmod | opt | unmod | opt | unmod | opt | unmod | opt |
Mean | | | 3.291 | 3.351 | 5.773 | 5.793 | | | 5.934 | 5.916 | 12.898 | 12.833 |
Standard Deviation | | | .05 | .096 | .044 | .082 | | | .009 | .014 | .096 | .466 |
95% Confidence Interval | | | .049 | .094 | .044 | .080 | | | .009 | .013 | .096 | .466 |
Speedup | | | | .982 | | .997 | | | | 1.003 | | 1.005 |
16 CPU
Metric | find | | pmake | | dpunish | | bonnie++* | | MAB | | config | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
 | unmod | opt | unmod | opt | unmod | opt | unmod | opt | unmod | opt | unmod | opt |
Mean | 1.647 | 1.704 | 2.606 | 2.518 | 4.436 | 4.450 | | | 3.499 | 3.546 | 9.526 | 9.361 |
Standard Deviation | .213 | .310 | .110 | .058 | .031 | .017 | | | .044 | .130 | .339 | .154 |
95% Confidence Interval | .209 | .304 | .108 | .057 | .030 | .017 | | | .043 | .127 | .332 | .151 |
Speedup | | .967 | | 1.035 | | .997 | | | | .987 | | 1.018 |
32 CPU
Metric | find | | pmake | | dpunish | | bonnie++* | | MAB | | config | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
 | unmod | opt | unmod | opt | unmod | opt | unmod | opt | unmod | opt | unmod | opt |
Mean | 1.059 | 1.108 | 2.640 | 2.524 | 3.630 | 3.609 | | | 3.400 | 3.379 | 8.985 | 8.937 |
Standard Deviation | .086 | .102 | .174 | .077 | .046 | .017 | | | .047 | .020 | .043 | .089 |
95% Confidence Interval | .085 | .100 | .171 | .075 | .045 | .017 | | | .046 | .020 | .042 | .088 |
Speedup | | .956 | | 1.046 | | 1.006 | | | | 1.006 | | 1.005 |
I used the following benchmarks to exercise the kernel on the real hardware.
Name | Description |
---|---|
MAB | Same as above |
Config | Same as above |
Objdump | For 2.6.23.1 kernel image, execute: objdump -d -l vmlinux | grep mov > /dev/null |
Kernel Compile | Execute 'make -j 16' on the 2.6.23.1 kernel source that I used for these experiments. |
Kernel Compile (Fast) | Execute 'make -j 16' on the 2.6.16.1 kernel source that I used for the simulation experiments. It has many fewer Kconfig options selected. |
hdparm | Execute 'hdparm -t /dev/sda1' to check for adverse effects on the IO processing path. Unlike the others, which report seconds, this reports MB/s. |
lmbench | Ran the default lmbench suite three times for each kernel, listed here (the first three are unmod). |
The performance data are presented below, for 15 samples on each kernel. All measurements are seconds, except for hdparm, which is MB per second.
Metric | MAB | | config | | objdump | | Kernel compile | | Kernel compile fast | | hdparm | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
 | unmod | opt | unmod | opt | unmod | opt | unmod | opt | unmod | opt | unmod | opt |
Mean | 2.916 | 2.872 | 2.350 | 2.337 | 23.234 | 23.057 | 255.870 | 258.027 | 37.764 | 36.030 | 87.543 | 87.377 |
Standard Deviation | .103 | .090 | .159 | .178 | .255 | .393 | 5.861 | 4.903 | 3.397 | 2.985 | .307 | 1.066 |
95% Confidence Interval | .052 | .046 | .080 | .090 | .180 | .199 | 2.966 | 2.481 | 1.719 | 1.511 | .155 | .539 |
Speedup | | 1.015* | | 1.006 | | 1.008* | | 0.992* | | 1.048* | | 1.002 |
The starred speedups are statistically significant by the Wilcoxon signed-rank test.
Speaking broadly, the spinlocks in the Linux kernel are already highly tuned, leaving developers on the wrong side of the law of diminishing returns. It should therefore be unsurprising that tuning a given lock yields only a small performance delta.
Subsequent Data
Feedback from the kernel mailing list led me to collect more data
points to further understand this patch's performance.
Average size and order of allocation/deallocation
Within our simulation environment, I instrumented the bulk allocation and deallocation routines in the kernel to determine the average number of pages requested (count) and the order of those requests. The means and standard deviations are listed below; a sketch of the kind of instrumentation involved follows the table.
Metric | config | | MAB | | dpunish | | pmake | |
---|---|---|---|---|---|---|---|---|
 | mean | stdev | mean | stdev | mean | stdev | mean | stdev |
free_pages_bulk() - count | 7.01 | 0.74 | 6.82 | 1.01 | 7.00 | 0.14 | 7.00 | 0.00 |
free_pages_bulk() - order | 0.00 | 0.05 | 0.03 | 0.17 | 0.00 | 0.06 | 0.00 | 0.00 |
rmqueue_bulk() - count | 7.03 | 1.70 | 7.21 | 2.42 | 7.07 | 1.29 | 6.28 | 1.77 |
rmqueue_bulk() - order | 0.21 | 0.65 | 0.31 | 0.87 | 0.13 | 0.14 | 0.57 | 1.07 |
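For concreteness, here is a sketch of the kind of instrumentation this implies: per-routine counters that accumulate the batch size and order of each call. The names (bulk_stat, record_bulk) are illustrative, not taken from the actual kernel, and the standard deviations additionally require a running sum of squares.

#include <stdio.h>

/* Accumulate the batch size ("count") and allocation order seen by a
 * bulk routine so that the means above can be derived. */
struct bulk_stat {
    unsigned long calls;
    unsigned long long count_sum;
    unsigned long long order_sum;
};

static struct bulk_stat free_pages_bulk_stat;
static struct bulk_stat rmqueue_bulk_stat;

/* Would be called at the top of free_pages_bulk()/rmqueue_bulk(). */
static void record_bulk(struct bulk_stat *s, int count, int order)
{
    s->calls++;
    s->count_sum += (unsigned long long)count;
    s->order_sum += (unsigned long long)order;
}

int main(void)
{
    /* e.g., one batch free of 8 order-0 pages and one bulk
     * allocation of 6 order-1 pages */
    record_bulk(&free_pages_bulk_stat, 8, 0);
    record_bulk(&rmqueue_bulk_stat, 6, 1);

    printf("free_pages_bulk mean count: %.2f\n",
           (double)free_pages_bulk_stat.count_sum / free_pages_bulk_stat.calls);
    printf("rmqueue_bulk    mean order: %.2f\n",
           (double)rmqueue_bulk_stat.order_sum / rmqueue_bulk_stat.calls);
    return 0;
}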
Integration with Ticket Spinlocks
I measured the performance effects of this patch when integrated with the ticket spinlocks patch, where the per-iteration release is performed only when the lock is contended, sending the holder to the back of the ticket queue, like so:

if (lock_is_contended(lock)) {
        spin_unlock(lock);
        spin_lock(lock); /* To the back of the queue */
}

I used kernel version 2.6.24-rc7 as a baseline (unmod). I compared this to ticket spinlocks alone (ts), ticket spins + my optimization (ts+opt), and ticket spins + my optimization + putting the zone lock in its own cache line (ts+opt+cl).
Per Andrew Morton's request, I added a dd regression test that executes this command:

time (dd if=/dev/zero of=foo bs=16k count=2048 ; rm foo)

The performance data are presented below, for 15 samples on each kernel (except the kernel compiles, which had 5). All measurements are seconds, except for hdparm, which is MB per second.
Metric | MAB | | | | config | | | | objdump | | | | Kernel compile | | | | Kernel compile fast | | | | hdparm | | | | dd | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
 | unmod | ts | ts+opt | ts+opt+cl | unmod | ts | ts+opt | ts+opt+cl | unmod | ts | ts+opt | ts+opt+cl | unmod | ts | ts+opt | ts+opt+cl | unmod | ts | ts+opt | ts+opt+cl | unmod | ts | ts+opt | ts+opt+cl | unmod | ts | ts+opt | ts+opt+cl |
Mean | 3.139 | 3.239 | 3.190 | 3.291 | 2.520 | 2.562 | 2.498 | 2.528 | 22.744 | 22.921 | 22.476 | 23.363 | 239.501 | 239.754 | 241.821 | 240.022 | 37.204 | 38.795 | 39.715 | 38.872 | 87.496 | 87.510 | 87.493 | 87.513 | 0.079 | 0.078 | 0.078 | 0.079 |
Standard Deviation | .174 | .175 | .188 | .261 | .202 | .256 | .205 | .230 | .447 | .583 | .500 | .512 | 6.180 | 2.459 | 4.489 | 4.489 | 2.861 | 4.416 | 3.081 | 5.167 | .127 | .062 | .104 | .057 | .001 | .002 | .002 | .003 |
95% Confidence Interval | .088 | .088 | .095 | .132 | .102 | .130 | .104 | .116 | .226 | .295 | .253 | .259 | 5.417 | 2.155 | 3.934 | 3.963 | 2.508 | 3.871 | 2.701 | 4.529 | .064 | .032 | .053 | .029 | .000 | .001 | .001 | .001 |
Speedup | | 0.969* | 1.015* | 0.969* | | 0.983* | 1.026* | 0.988 | | 0.992* | 1.020* | 0.962* | | 0.999 | 0.991 | 1.007 | | 0.959 | 0.977 | 1.022 | | 1.000 | 1.000 | 1.000 | | 1.002 | 1.012* | 0.984* |
The starred speedups are statistically significant by the Wilcoxon signed-rank test. The speedups are relative to the previous column; in other words, the speedup of ts is over unmod, the speedup of ts+opt is over ts, and so on.
These data indicate that there is a small performance penalty (1-2%) incurred when adding ticket spinlocks alone, probably because they are not exercised enough in the kernel for the performance benefits to offset the implementation cost. By allowing rescheduling in just these two places, the 1-2% overhead of ticket spinlocks is reclaimed in most benchmarks, making overall performance comparable to the baseline kernel.
The only data inconsistent with these results are the kernel compilation benchmarks, which indicate that the best performance comes from the baseline kernel. It is not clear to me what property of kernel compilation causes it to have this performance profile. Placing the zone spin lock in its own cache line hurts performance. The dd regression test actually shows a similar trend to the other benchmarks; the ts+opt kernel performs best.