Kernel Interfaces for Goal Oriented Workload Management
The eWLM project[1]
is motivated by IBM's autonomic computing vision[2].
IBM has considerable experience with managing heterogeneous workloads
(e.g. web services, database services, transaction coordination and
management as well as traditional batch) running on homogeneous
clusters of large servers[3].
Resources (CPU, I/O priority, etc.) are dynamically allocated to
important work (e.g. stock trades in excess of some monetary threshold)
that is not meeting goals (sub-second response time) in preference
to work which is meeting goals or is "less important" (e.g. balance
queries for small accounts).
We wish to apply this experience to similar workloads on a distributed
collection of heterogeneous servers.
As a first step we needed to provide instrumentation for measuring
internal kernel activity that is analogous, if not completely equivalent,
to what is available for mainframes.
Data is collected from applications that are instrumented with
libraries implementing the Application Response Measurement (ARM)
standards[4].
Different types of work have different performance goals, and any data
collected must be attributable to a particular class of work.
On each server, a local agent collects data about application response time,
resource usage and reasons for delay (waiting for CPU, I/O, network, etc.).
This data is collected from the local agents by an eWLM manager which
uses the data to form a view of the topology of the configuration
(i.e. how individual "transactions" flow between servers),
where bottlenecks can arise and what can be done to meet
desired performance goals.
The ARM 4.0 standard[5] defines APIs to
- register applications
- register transactions
  - "classes" of transactions as opposed to individual transactions
    which are instances of the transaction class.
- report the start of a transaction
- report the completion of a transaction
- associate/disassociate a transaction with/from an individual thread
  - i.e. a pthread in the case of a UNIX-like OS
- report that a transaction is "blocked"
as well as others.
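The lifecycle implied by the list above can be sketched as a small state model. This is an illustration only: every name below (demo_register, demo_start, struct tran_instance, and so on) is a hypothetical stand-in invented for this sketch, not the ARM 4.0 C binding, whose actual function names and signatures are defined in [5].

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for illustration; NOT the ARM 4.0 C binding. */

#define NO_THREAD (-1L)

enum tran_state { TRAN_RUNNING, TRAN_BLOCKED, TRAN_DONE };

struct tran_class {              /* a registered "class" of transactions */
    char app_name[32];
    char class_name[32];
};

struct tran_instance {           /* one instance of a transaction class */
    const struct tran_class *cls;
    enum tran_state state;
    long bound_thread;           /* NO_THREAD when not bound to a thread */
};

/* Register an application together with one of its transaction classes. */
void demo_register(struct tran_class *c, const char *app, const char *cls)
{
    snprintf(c->app_name, sizeof c->app_name, "%s", app);
    snprintf(c->class_name, sizeof c->class_name, "%s", cls);
}

/* Report the start of a transaction instance of class c. */
void demo_start(struct tran_instance *t, const struct tran_class *c)
{
    t->cls = c;
    t->state = TRAN_RUNNING;
    t->bound_thread = NO_THREAD;
}

/* Associate the transaction with a thread (e.g. a pthread). */
void demo_bind_thread(struct tran_instance *t, long tid)
{
    assert(t->state != TRAN_DONE);
    t->bound_thread = tid;
}

/* Disassociate the transaction from its thread. */
void demo_unbind_thread(struct tran_instance *t)
{
    t->bound_thread = NO_THREAD;
}

/* Report that the transaction is "blocked" on something external. */
void demo_block(struct tran_instance *t)
{
    assert(t->state == TRAN_RUNNING);
    t->state = TRAN_BLOCKED;
}

/* Report that the transaction is no longer blocked. */
void demo_unblock(struct tran_instance *t)
{
    assert(t->state == TRAN_BLOCKED);
    t->state = TRAN_RUNNING;
}

/* Report the completion of the transaction. */
void demo_stop(struct tran_instance *t)
{
    t->state = TRAN_DONE;
    t->bound_thread = NO_THREAD;
}
```

The point of the thread binding in this sketch is that, once a transaction is tied to a kernel-visible entity, any resource usage or delay observed for that thread can be attributed to the transaction's class.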
In addition to the implementation of the ARM APIs which associate
transactions/classes of work with kernel entities, interfaces
are required for the local agent to collect "sampling" data
from the kernel.
Information on resources used and delays experienced by individual
transactions and/or classes of transactions and applications is collected
by the local agents from the kernel and forwarded to the eWLM manager.
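One way to picture the "sampling" data is as per-class counters that are incremented on each sampling tick. The sketch below is an assumption made for illustration, not the prototype's data structures: the state names, the fixed-size table, and the functions sample_tick and delay_fraction are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sampling states. The text names CPU, I/O and network
   waits as reasons for delay but does not define this enumeration. */
enum sample_state {
    S_USING_CPU, S_WAIT_CPU, S_WAIT_IO, S_WAIT_NET, S_NSTATES
};

#define MAX_CLASSES 8            /* arbitrary limit for the sketch */

struct class_samples {           /* per-class sample counters */
    unsigned long count[S_NSTATES];
};

struct thread_snapshot {         /* what one sample sees for one thread */
    int class_id;                /* class bound to the thread, -1 if none */
    enum sample_state state;
};

/* One sampling tick: attribute each bound thread's state to its class.
   Returns the number of threads that could be attributed to a class. */
size_t sample_tick(struct class_samples table[MAX_CLASSES],
                   const struct thread_snapshot *threads, size_t n)
{
    size_t attributed = 0;
    for (size_t i = 0; i < n; i++) {
        int id = threads[i].class_id;
        if (id < 0 || id >= MAX_CLASSES)
            continue;            /* thread is not doing classified work */
        assert((int)threads[i].state >= 0 && threads[i].state < S_NSTATES);
        table[id].count[threads[i].state]++;
        attributed++;
    }
    return attributed;
}

/* Fraction of a class's samples that were observed in the given state. */
double delay_fraction(const struct class_samples *c, enum sample_state state)
{
    unsigned long total = 0;
    for (int s = 0; s < S_NSTATES; s++)
        total += c->count[s];
    return total ? (double)c->count[state] / (double)total : 0.0;
}
```

In such a scheme a local agent would forward the counters, or fractions derived from them, to the eWLM manager; the sampling interval then sets the trade-off between granularity and overhead.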
We describe the data structures and APIs designed for a prototype kernel
implementation of the ARM APIs for two different UNIX kernels, and compare
(qualitatively) the data collected by the UNIX-style instrumentation
with that collected by the more mature mainframe workload manager[3].
We restrict our attention mainly to the requirements of local
agents on individual servers, but comment on some issues about
collection and assimilation of data by the eWLM manager.
A key issue is providing adequate data collection granularity
(sub-second) without imposing unacceptable overhead
(actual or perceived).
Investigators:
- Matt Thoennes
- Donna Dillenberger
- Josh Knight
All of IBM T.J. Watson Research.
This poster session reports on work done in collaboration with
many others in various product divisions.
The emphasis is on the exploratory work done in the Research Division,
as distinct from the product division design and development work
that it supported.
Notes:
[1] http://www.research.ibm.com/thinkresearch/pages/2002/20020529_ewlm.shtml
[2] http://www.research.ibm.com/autonomic/research/
[3] "Adaptive Algorithms for Managing a Distributed Data Processing
    Workload," J. Aman, C.K. Eilert, D. Emmes, P. Yocom and D. Dillenberger,
    IBM Systems Journal, Vol. 36, No. 2, p. 242, 1997,
    http://www.research.ibm.com/journal/sj/362/aman.html
[4] Application Response Measurement - ARM, The Open Group,
    http://www.opengroup.org/tech/management/arm/
[5] ARM 4.0 C Binding - Final Ballot Draft (PDF),
    http://www.opengroup.org/tech/management/arm/doc.tpl?CALLER=index.tpl&gdid=3600