
A Dollar from 15 Cents:

Cross-Platform Management for Internet Services

Christopher Stewart†, Terence Kelly∗, Alex Zhang∗, Kai Shen†

†University of Rochester    ∗Hewlett-Packard Laboratories

Abstract

As Internet services become ubiquitous, the selection

and management of diverse server platforms now af-

fects the bottom line of almost every firm in every in-

dustry. Ideally, such cross-platform management would

yield high performance at low cost, but in practice, the

performance consequences of such decisions are often

hard to predict. In this paper, we present an approach

to guide cross-platform management for real-world In-

ternet services. Our approach is driven by a novel per-

formance model that predicts application-level perfor-

mance across changes in platform parameters, such as

processor cache sizes, processor speeds, etc., and can be

calibrated with data commonly available in today’s pro-

duction environments. Our model is structured as a com-

position of several empirically observed, parsimonious

sub-models. These sub-models have few free parameters

and can be calibrated with lightweight passive observa-

tions on a current production platform. We demonstrate

the usefulness of our cross-platform model in two man-

agement problems. First, our model provides accurate

performance predictions when selecting the next gener-

ation of processors to enter a server farm. Second, our

model can guide platform-aware load balancing across

heterogeneous server farms.

1 Introduction

In recent years, Internet services have become an in-

dispensable component of customer-facing websites and

enterprise applications. Their increased popularity has

prompted a surge in the size and heterogeneity of the

server clusters that support them. Nowadays, the man-

agement of heterogeneous server platforms affects the

bottom line of almost every firm in every industry. For

example, purchasing the right server makes and models

can improve application-level performance while reduc-

ing cluster-wide power consumption. Such management

decisions often span many server platforms that, in prac-

tice, cannot be tested exhaustively. Consequently, cross-

platform management for Internet services has histori-

cally been ad-hoc and unprincipled.

Recent research [9,14,31,37,40,41,44] has shown that

performance models can aid the management of Internet

services by predicting the performance consequences of

contemplated actions. However, past models for Inter-

net services have not considered platform configurations

such as processor cache sizes, the number of processors,

and processor speed. The effects of such parameters

are fundamentally hard to predict, even when data can

be collected by any means. The effects are even harder

to predict in real-world production environments, where

data collection is restricted to passive measurements of

the running system.

This paper presents a cross-platform performance

model for Internet services, and demonstrates its use

in making management decisions. Our model predicts

application-level response times and throughput from a

composition of several sub-models, each of which describes
a measure of the processor’s performance (henceforth, a
processor metric) as a function of a system parameter.
For example, one of our sub-models relates

cache misses (a processor metric) to cache size (a system

parameter). The functional forms of our sub-models are

determined from empirical observations across several

Internet services and are justified by reasoning about the

underlying design of Internet services. Our knowledge-

lean sub-models are called trait models because, like hu-

man personality traits, they stem from empirical obser-

vations of system behaviors and they characterize only

one aspect of a complex system. Figure 1 illustrates the

design of our cross-platform model.

The applicability of our model in real-world produc-

tion environments was an important design considera-

tion. We embrace the philosophy of George Box, “All

models are wrong, but some [hopefully ours] are use-

ful.” [10] To reach a broad user base, our model targets

third-party consultants. Consultants are often expected

to propose good management decisions without touch-

ing their clients’ production applications. Such inconve-

nient but realistic restrictions forbid source code instru-

mentation and controlled benchmarking. In many ways

the challenge in such data-impoverished environments

fits the words of the old adage, “trying to make a dollar
from 15 cents.” Typically, consultants must make do
with data available from standard monitoring utilities and
application-level logs. The simplicity of our sub-models
allows us to calibrate our model using only such readily
available data.

Figure 1: The design of our cross-platform performance
model. The variable x represents a system parameter, e.g.,
number of processors or L1 cache size. The variable y
represents a processor metric (i.e., a measure of processor
performance), such as instruction count or L1 cache misses.
Application-level performance refers to response time and
throughput, collectively.

We demonstrate the usefulness of our model in the

selection of high-performance, low-power server plat-

forms. Specifically, we used our model (and proces-

sor power specifications) to identify platforms with high

performance-per-watt ratios. Our model outperforms al-

ternative techniques that are commonly used to guide

platform selection in practice today. We also show that

model-driven load balancing for heterogeneous clusters

can improve response times. Under this policy, request

types are routed to the hardware platform that our model

predicts is best suited to their resource requirements.

The contributions of this paper are:

1. We observe and justify trait models across several

Internet services.

2. We integrate our trait models into a cross-platform

model of application-level performance.

3. We demonstrate the usefulness of our cross-

platform model for platform selection and load bal-

ancing in a heterogeneous server farm.

The remainder of this paper is organized as follows.

Section 2 overviews the software architecture, process-

ing patterns, and deployment environment of the realistic

Internet services that we target. Section 3 presents sev-

eral trait models. Section 4 shows how we compose trait

models into an established framework to achieve an ac-

curate cross-platform performance prediction and com-

pares our approach with several alternatives. Section 5

shows how trait-based performance models can guide the

selection of server platforms and guide load balancing in

a heterogeneous server cluster. Section 6 reviews related

work and Section 7 concludes.

2 Background

Internet services are often designed according to a

three-tier software architecture. A response to an end-

user request may traverse software on all three tiers. The

first tier translates end-user markup languages into and

out of business data structures. The second tier (a.k.a. the

business-logic tier) performs computation on business

data structures. Our work focuses on this tier, so we will

provide a detailed example of its operation below. The

third tier provides read/write storage for abstract business

objects. Requests traverse tiers via synchronous com-

munication over local area networks (rather than shared

memory) and a single request may revisit tiers many

times [26]. Previous studies provide more information

on multi-tier software architectures [29,44].

Business-logic computations are often the bottleneck

for Internet services. As an example, consider the busi-

ness logic of an auction site: computing the list of cur-

rently winning bids can require complex considerations

of bidder histories, seller preferences, and shipping dis-

tances between bidders and sellers. Such workloads

are processor intensive, and their effect on application-

level performance depends on the underlying platform’s

configuration in terms of cache size, on-chip cores, hy-

perthreading, etc. Therefore, our model, which spans

a wide range of platform configurations, naturally tar-

gets the business-logic tier. Admittedly, application-level

performance for Internet services can also be affected

by disk and network workloads at other tiers. Previous

works [14,41] have addressed some of these issues, and

we believe our model can be integrated with such works.

Internet services keep response times low to satisfy

end-users. However, as requests arrive concurrently,
response times increase due to queuing delays. In
production environments, resources are adequately provisioned

to limit the impact of queuing. Previous analyses of sev-

eral real enterprise applications [40] showed max CPU

utilizations below 60% and average utilizations below

25%. Similar observations were made in [8]. Services

are qualitatively different when resources are adequately

provisioned compared to overload conditions. For exam-

ple, contention for shared resources is more pronounced

under overload conditions [29].

                          Aggregate counts (x10^…)
Time stamp   Type ...   Instr. count   L1 miss   L2 miss
2:00pm       ...        280322         31026     2072
2:01pm       ...        311641         33375     2700

Table 1: Example of data available as input to our model.
Oprofile [2] collects instruction counts and cache misses.
Apache logs [1] supply frequencies of request types.

2.1 Nonstationarity

End-user requests can typically be grouped into a

small number of types. For example, an auction site

may support request types such as bid for item, sell item,

and browse items. Requests of the same type often fol-

low similar code paths for processing, and as a result,

requests of the same type are likely to place similar de-

mands on the processor. A request mix describes the pro-

portion of end-user requests of each type.

Request mix nonstationarity describes a common phe-

nomenon in real Internet services: the relative frequen-

cies of request types fluctuate over long and short inter-

vals [40]. Over a long period, an Internet service will

see a wide and highly variable range of request mixes.

On the downside, nonstationarity requires performance

models to generalize to previously unseen transaction

mixes. On the upside, nonstationarity ensures that ob-

servations over a long period will include more unique

request mixes than request types. This diversity over-

constrains linear models that consider the effects of each

request type on a corresponding output metric (e.g., in-

struction count), which enables parameter estimation us-

ing regression techniques like Ordinary Least Squares.

2.2 Data Availability

In production environments, system managers must

cope with the practical (and often political) issues of trust

and risk during data collection. For example, third-party

consultants—a major constituent of our work—are typ-

ically seen as semi-trusted decision makers, so they are

not allowed to perform controlled experiments that could

pose availability or security risks to business-critical pro-

duction systems. Similarly, managers of shared host-

ing centers are bound by business agreements that pre-

vent them from accessing or modifying a service’s source

code. Even trusted developers of the service often re-

linquish their ability to perform invasive operations that

could cause data corruption.

Our model uses data commonly available in most

production environments, even in consulting scenarios.

Specifically, we restrict our model inputs to logs of re-

quest arrivals and CPU performance counters. Table 1

provides an example of our model’s inputs. Our model

also uses information available from standard platform

specification sheets such as the number of processors and

on-chip cache sizes [6].

3 Trait Models

Trait models characterize the relationship between a

system parameter and a processor metric. Like personal-

ity traits, they reflect one aspect of system behavior (e.g.,

sensitivity to small cache sizes or reaction to changes in

request mix). The intentional simplicity of trait models

has two benefits for our cross-platform model. First, we

can extract parsimonious yet general functional forms

from empirical observations of the parameter and out-

put metric. Second, we can automatically calibrate trait

models with data commonly available in production en-

vironments.

Our trait models take the simplest functional form that

yields low prediction error on the targeted system param-

eter and processor metric. We employ two sanity checks

to ensure that our traits reflect authentic relationships—

not just peculiarities in one particular application. First,

we empirically validate our trait models across several

applications. Second, we justify the functional form of

our trait models by reasoning about the underlying struc-

ture of Internet services.

In this section, we observe two traits in the busi-

ness logic of Internet services. First, we observe that a

power law characterizes the miss rate for on-chip pro-

cessor caches. Specifically, data-cache misses plotted

against cache size on a log-log scale are well fit by a

linear model. We justify such a heavy-tail relationship

by reasoning about the memory access patterns of back-

ground system activities. Compared to alternative func-

tional forms, a power law relationship achieves excellent

prediction accuracy with few free model parameters.

Second, we observe that a linear request-mix model

describes instruction count and aggregate cache misses.

This trait captures the intuition that request type and vol-

ume are the primary determinants of runtime code paths.

Our experiments demonstrate the resiliency of request

mix models under a variety of processor configurations.

Specifically, request-mix models remain accurate under

SMP, multi-core, and hyperthreading processors.

3.1 Error Metric

The normalized residual error is the metric that we use

to evaluate trait models. We also use it in the validation

of our full cross-platform model. Let $Y$ and $\hat{Y}$ represent
observed and predicted values of the targeted output metric,
respectively. The residual error, $E = Y - \hat{Y}$, tends
toward zero for good models. The normalized residual
error, $|E| / Y$, accounts for differences in the magnitude of Y.
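The error metric can be sketched in a few lines of Python. The function name and the sample values below are illustrative assumptions rather than part of our tooling or measurements.

```python
def normalized_residual_error(observed, predicted):
    """Return |observed - predicted| / |observed| for a single observation."""
    if observed == 0:
        raise ValueError("observed value must be non-zero")
    # The absolute value in the denominator only matters for negative
    # observations; counts such as instructions are always positive.
    return abs(observed - predicted) / abs(observed)


# Hypothetical example: an observed value of 200 and a prediction of 190.
example_error = normalized_residual_error(200.0, 190.0)
```

Because the error is divided by the observed value, the same threshold can be applied to metrics of very different magnitudes, such as instruction counts and L2 misses.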

[Plot: request arrival rate, instructions, and the frequencies of the most popular and 2nd most popular request types over 2 hours.]

Figure 2: Nonstationarity during a RUBiS experiment.
Request arrivals fluctuate in a sinusoidal fashion, which
correspondingly affects the aggregate instructions executed. The
ratio of the most popular request type to the second most popular
type (i.e., $freq_{most} / freq_{2nd}$) ranges from 0.13 to 12.5.
Throughout this paper, a request mix captures per-type
frequencies over a 30 second interval.

3.2 Testbed Applications

We study the business logic of three benchmark In-

ternet Services. RUBiS is a J2EE application that cap-

tures the core functionalities of an online auction site [3].

The site supports 22 request types including browsing

for items, placing bids, and viewing a user’s bid history.

The software architecture follows a three-tier model con-

taining a front-end web server, a back-end database, and

Java business-logic components. The StockOnline stock

trading application [4] supports six request types. End

users can buy and sell stocks, view prices, view hold-

ings, update account information, and create new ac-

counts. StockOnline also follows a three-tier software

architecture with Java business-logic components. TPC-

W simulates the activities of a transactional e-commerce

bookstore. It supports 13 request types including search-

ing for books, customer registration, and administrative

price updates. Applications run on the JBoss 4.0.2 appli-

cation server. The database back-end is MySQL 4.0. All

applications run on the Linux 2.6.18 kernel.

The workload generators bundled with RUBiS, Stock-

Online, and TPC-W produce synthetic end-user requests

according to fixed long-term probabilities for each re-

quest type. Our previous work showed that the resulting

request mixes are qualitatively unlike the nonstationary

workloads found in real production environments [40].

In this work, we used a nonstationary sequence of in-

tegers to produce a nonstationary request trace for each

benchmark application. We replayed the trace in an

open-arrival fashion in which the aggregate arrival rate

fluctuated. Figure 2 depicts fluctuations in the aggregate

arrival rate and in the relative frequencies of transaction

types during a RUBiS experiment. Our nonstationary se-

quence of integers is publicly available [5] and can be

used to produce nonstationary mixes for any application

with well-defined request types.

[Plot: three log-log panels (RUBiS, StockOnline, TPC-W); x-axis: Cache Size (KB), with ticks at 256 and 16384.]

Figure 3: Cache misses (per 10k instructions) plotted against
cache size on a log-log plot. Measurements were taken from
real Intel servers using the Oprofile monitoring tool [2]. The
same nonstationary workload was issued for each test. Cache
lines were 64 bytes.

3.3 Trait Model of Cache Size on Cache Misses

Figure 3 plots data-cache miss rates relative to cache

size on a log-log scale. Using least squares regres-

sion, we calibrated linear models of the form ln(Y) =

Bln(X) + A. We observe low residual errors for each

application in our testbed. Specifically, the largest nor-

malized residual error observed for RUBiS, StockOnline,

and TPCW is 0.08, 0.03, and 0.09 respectively. The cali-

brated B parameters for RUBiS, StockOnline, and TPCW

are -0.83, -0.77, and -0.89 respectively. Log-log linear
models with slopes in the range (-2, 0) are known as power
law distributions [19,33,43].
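The calibration step can be sketched as a closed-form least squares fit on log-transformed data. The function name, the cache sizes, and the synthetic miss rates below are assumptions made for illustration; the data are generated from a known power law rather than taken from Figure 3.

```python
import math


def fit_power_law(cache_sizes, miss_rates):
    """Least squares fit of ln(Y) = B*ln(X) + A; returns (A, B)."""
    xs = [math.log(x) for x in cache_sizes]
    ys = [math.log(y) for y in miss_rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                    # B
    intercept = mean_y - slope * mean_x  # A
    return intercept, slope


# Synthetic data from Y = e^A * X^B with A = 5.0 and B = -0.8
# (made-up parameters, not measurements).
sizes_kb = [256, 512, 1024, 2048, 4096, 8192, 16384]
synthetic_misses = [math.exp(5.0) * s ** -0.8 for s in sizes_kb]
fitted_a, fitted_b = fit_power_law(sizes_kb, synthetic_misses)
```

On noise-free synthetic data the fit recovers the generating parameters; on real measurements the residuals indicate how well a power law describes the application.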

We justify our power-law trait model by making ob-

servations on its rate of change, shown in Equation 1.

$$\frac{dY}{dX} = B e^{A} X^{B-1} \qquad (1)$$

For small values of X, the rate of change is steep, but

as X tends toward infinity the rate of change decreases

and slowly (i.e., with a heavy tail) approaches zero.

The heavy-tail aspect of a power law means the rate of

change decreases more slowly than can be described us-

ing an exponential model. In terms of cache misses,

this means a power-law cache model predicts significant

miss rate reductions when small caches are made larger,

but almost no reductions when large caches are made

larger. The business logic tier for Internet services ex-

hibits such behavior by design. Specifically, per-request

misses due to lack of capacity are significantly reduced

by larger L1 caches. However, garbage collection and

other infrequent-yet-intensive operations will likely in-

cur misses even under large cache sizes.
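For reference, the rate of change follows directly from the log-log linear form, using only the symbols A, B, X, and Y defined above. The last line is a worked consequence: doubling the cache size scales the miss rate by a constant factor that depends only on B.

```latex
\begin{align*}
\ln(Y) &= B \ln(X) + A \\
Y &= e^{A} X^{B} \\
\frac{dY}{dX} &= B e^{A} X^{B-1} \\
\frac{Y(2X)}{Y(X)} &= 2^{B}
\end{align*}
```

For example, with the calibrated RUBiS slope B = -0.83, doubling the cache size multiplies the miss rate by 2^{-0.83}, or roughly 0.56. The relative reduction is the same at every size, but the absolute reduction shrinks as X grows because it is proportional to X^{B-1}.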

A power law relationship requires the calibration of

only two free parameters (i.e., A, and B), which makes it

practical for real-world production environments. How-

ever, there are many other functional forms that have

only two free parameters; how does our trait model

compare to alternatives? Table 2 compares logarithmic,
exponential, and power law functional forms. Power
law models have lower median normalized residual error
than logarithmic and exponential models for each
application in our testbed. Also, we compare against
a generalized (quadratic) log-normal model,
$\ln(Y) = B_2 \ln^2(X) + B_1 \ln(X) + A$. This model allows for an
additional free parameter ($B_2$) in calibration, and is expected
to provide a better fit, though it cannot be calibrated from
observations on one machine. Our results show that the
additional parameter does not substantially reduce residual
error. For instance, the median residual error for the
power law distribution is approximately equal to that of
the generalized log-normal distribution. We note that
other complex combinations of these models may provide
better fits, such as Weibull or power law with exponential
cut-off. However, such models are hard to calibrate
with the limited data available in production environments.

                   Log     Exp.    Power law   Log-normal
RUBiS   lowest     0.011   0.105   0.005       0.001
        median     0.094   0.141   0.028       0.027
        highest    0.168   0.254   0.080       0.072
Stock   lowest     0.013, 0.010, 0.012 (one of the four entries is missing)
        median     0.075   0.099   0.023       0.024
        highest    0.026   0.142   0.034       0.042
TPCW    lowest     0.046   0.044   0.011       0.007
        median     0.109   0.084   0.059       0.060
        highest    0.312   0.146   0.099       0.101

Table 2: Normalized residual error of cache models that have
fewer than three free parameters. The lowest, median, and
highest normalized residuals are reported from observations on
seven different cache sizes.

3.4 Trait Models of Request Mix on Instructions and Cache Misses

Figure 4 plots a linear combination of request type fre-

quencies against the instruction count, L1 misses, and L2

misses for RUBiS. Our parsimonious linear combination

has only one model parameter for each request type, as

shown below.

$$C_k = \sum_{\text{types } j} \alpha_{kj} N_j \qquad (2)$$

where $C_k$ represents one of the targeted processor metrics
and $N_j$ represents the frequency of requests of type
j. The model parameter $\alpha_{kj}$ transforms request-type
frequencies into demand for processor resources. Intuitively,
$\alpha_{kj}$ represents the typical demand for processor
resource k by a request of type j. We refer to this as a
request-mix model, and we observe excellent prediction accuracy.
The 90th percentile normalized errors for instructions, L1
misses, and L2 misses were 0.09, 0.10, and 0.06, respectively.

[Plot: actual values against predictions on logarithmic axes (100 to 1000000) for instructions, L1 misses, and L2 misses; reference lines: Actual, y = 1.15x, y = .85x.]

Figure 4: RUBiS instruction count and aggregate cache misses
(L1 and L2) plotted against a linear request mix model’s
predictions. Measurements were taken from a single processor Intel
Pentium D server. Lines indicate 15% error from the actual
value. Collected with OProfile [2] at a sampling rate of 24000.

Request-mix models are justifiable, because business-

logic requests of the same type typically follow similar

code paths. The number of instructions required by a

request will likely depend on its code path. Similarly,

cold-start compulsory misses for each request will de-

pend on code path, as will capacity misses due to a re-

quest’s working set size. However, cache misses due to

sharing between requests are not captured in a request-

mix model. Such misses are not present in the single

processor tests in Figure 4.

Table 3 evaluates request-mix models under platform

configurations that allow for shared misses. The first

three columns report low normalized error when re-

sources are adequately provisioned (below 0.13 for all

applications), as they would be in production environments.
However, under maximum throughput conditions,

accuracy suffers. Specifically, we increased the volume

of the tested request mixes by a factor of 3. Approx-

imately 50% of the test request mixes achieved 100%

processor utilization. Normalized error increased for the

L1 miss metric to 0.22–0.34. These results are consis-

tent with past work [29] that attributed shared misses in

Internet services to contention for software resources. In

the realistic, adequately provisioned environments that

we target, such contention is rare. We conclude that

request-mix models are most appropriate in realistic en-

vironments when resources are not saturated.

Request-mix models can be calibrated with times-

tamped observations of the targeted processor metric and

logs of request arrivals, both of which are commonly

available in practice. Past work [40] demonstrated that

unrealistic stationary workloads are not sufficient for cal-

ibration. Request-mix models can require many observa-

tions before converging to good parameters. For the real

and benchmark applications that we have seen, request

mix models based on ten hours of log files are typically

sufficient.

         2-way SMP   Dual-core   Hyperthreading   Max tput
         (L1 miss)   (L2 miss)   (L1 miss)        (L1 miss)
RUBiS    0.044       0.030       0.045            0.349
Stock    0.060       0.077       0.096            0.276
TPCW     0.035       0.084       0.121            0.223

Table 3: Median normalized residual error of a request-mix
model under different environments. Evaluation spans 583
request mixes from a nonstationary trace. The target architectural
metric is shown in parentheses. All tests were performed on
Pentium D processors. Hyperthreading was enabled in the
system BIOS. 2-way SMP and dual-cores were enabled/disabled
by the operating system scheduler. The “max tput” test was
performed on a configuration with 2 processors and 4 cores
enabled with hyperthreading on.

4 Cross-Platform Performance Predictions

Section 3 described trait models, which are parsimo-

nious characterizations of only one aspect of a complex

system. In this section, we will show that trait mod-

els can be composed to predict application-level perfor-

mance for the whole system. Our novel composition is

formed from expert knowledge about the structure of In-

ternet services. In particular, we note that instruction and

memory-access latencies are key components of the pro-

cessing time of individual requests, and that Internet ser-

vices fit many of the assumptions of a queuing system.

Our model predicts application-level performance across

workload changes and several platform parameters in-

cluding: processor speed, number of processors (e.g., on-

chip cores), cache size, cache latencies, and instruction

latencies. Further, our model can be calibrated from the

logs described in Table 1. Source code access, instru-

mentation, and controlled experiments are not required.

We attribute the low prediction error of our model to ac-

curate trait models (described in Section 3) and a princi-

pled composition based on the structure of Internet ser-

vices.

The remainder of this section describes the composi-

tion of our cross-platform model, then presents the test

platforms that we use to validate our model. Finally, we

present results by comparing against alternative model-

ing techniques. In Section 5, we complete the challenge

of turning 15-cent production data into a dollar by using

our model to guide management decisions like platform

selection and load balancing.

4.1 Composition of Traits

The amount of time spent processing an end-user re-

quest, called service time, depends on the instructions

and memory accesses necessary to complete the request.

Average CPU service time can be expressed as

$$s = \frac{I \times (CPI + (H_1 C_1 + M_1 (H_2 C_2 + M_2 C_{mem})))}{CPUspeed \times \text{number of requests}}$$

where I is the aggregate number of instructions required
by a request mix, CPI is the average number of cycles per
instruction (not including memory access delays), $H_k$ is
the percentage of hits in the L$k$ cache per instruction, $M_k$
is the percentage of misses in the L$k$ cache per instruction,
and $C_k$ is the typical cost in cycles of accesses to the
L$k$ cache [22]. CPI, CPUspeed, and $C_k$ are typically
released in processor spec sheets [6]. Recent work [16] that
more accurately approximates CPI and $C_k$ could
transparently improve the prediction accuracy of our model.

This model is the basis for our cross-platform perfor-

mance prediction. Subsequent subsections will extend

this base to handle changes in application-level workload

and cache size parameters, and to predict the application-

level performance.
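A minimal sketch of the base service time calculation follows. It assumes the hit and miss terms nest as CPI + H1·C1 + M1·(H2·C2 + M2·Cmem); the function name, argument names, and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def average_service_time(instructions, cpi, h1, c1, m1, h2, c2, m2, c_mem,
                         cpu_speed_hz, num_requests):
    """Average CPU service time in seconds per request.

    h1/h2 and m1/m2 are per-instruction hit and miss fractions for the L1
    and L2 caches; c1, c2, and c_mem are access costs in cycles.
    """
    # Effective cycles per instruction, including memory access delays.
    effective_cpi = cpi + h1 * c1 + m1 * (h2 * c2 + m2 * c_mem)
    total_cycles = instructions * effective_cpi
    return total_cycles / (cpu_speed_hz * num_requests)


# Hypothetical inputs: 2e9 instructions serving 100 requests on a 2 GHz CPU.
example_service_time = average_service_time(
    instructions=2e9, cpi=1.0,
    h1=0.3, c1=3, m1=0.01,
    h2=0.8, c2=14, m2=0.2, c_mem=200,
    cpu_speed_hz=2e9, num_requests=100,
)
```

With these inputs the effective CPI is 1 + 0.9 + 0.512 = 2.412, giving about 24 ms of CPU time per request. Changing `cpu_speed_hz`, the cache costs, or the miss fractions shows how each platform parameter enters the prediction.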

4.1.1 Request Mix Adjustments

In Section 3, we observed that both instruction counts

and cache misses, at both L1 and L2, are well modeled

as a linear combination of request type frequencies:

$$I = \sum_{\text{types } j} \alpha_{Ij} N_j \quad \text{and} \quad \#M_k = \sum_{\text{types } j} \alpha_{M_k j} N_j$$

where I is the number of instructions for a given volume
and mix of requests, $N_j$ is the volume of requests of type
j, and $\#M_k$ is the number of misses at cache level k. The
intuition behind these models is straightforward: $\alpha_{Ij}$, for
example, represents the typical number of instructions
required to serve a request of type j. We apply ordinary
least squares regression to a 10-hour trace of nonstationary
request mixes to calibrate values for the $\alpha$ parameters.
After calibration, the acquired $\alpha$ parameters can be
used to predict performance under future request mixes.

Specifically, we can predict both instruction count and

aggregate cache misses for an unseen workload repre-

sented by a new vector N of request type frequencies.
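The calibration and prediction steps can be sketched with a small ordinary least squares solver that uses only the standard library. The helper names, the coefficient name `alpha`, and the five synthetic request mixes are assumptions for illustration; a real calibration would use hundreds of observed mixes and would more likely call a numerical library's least squares routine.

```python
def solve_linear_system(matrix, rhs):
    """Solve a small dense system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][c] * solution[c] for c in range(i + 1, n))
        solution[i] = (aug[i][n] - tail) / aug[i][i]
    return solution


def calibrate_request_mix_model(mixes, metric_values):
    """OLS without intercept: solve (N^T N) alpha = N^T y for per-type weights."""
    num_types = len(mixes[0])
    gram = [[sum(m[a] * m[b] for m in mixes) for b in range(num_types)]
            for a in range(num_types)]
    moment = [sum(m[a] * y for m, y in zip(mixes, metric_values))
              for a in range(num_types)]
    return solve_linear_system(gram, moment)


def predict_metric(alpha, mix):
    """Predict a processor metric as the weighted sum of request counts."""
    return sum(a * n for a, n in zip(alpha, mix))


# Synthetic trace: three request types with made-up per-request costs.
true_alpha = [5.0, 2.0, 9.0]
observed_mixes = [[10, 3, 1], [2, 8, 4], [6, 6, 6], [1, 0, 9], [7, 2, 5]]
observed_metric = [predict_metric(true_alpha, m) for m in observed_mixes]

alpha = calibrate_request_mix_model(observed_mixes, observed_metric)
unseen_prediction = predict_metric(alpha, [4, 4, 2])
```

The sketch needs more distinct mixes than request types so that the normal equations have a unique solution, which is the property that nonstationary workloads provide.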

4.1.2 Cache Size Adjustments

Given L1 and L2 cache miss rates observed on the cur-

rent hardware platform, we predict miss rates for the

cache sizes on a new hardware platform using the power-law
cache model: $M_k = e^{A} S_k^{B}$, where $S_k$ is the size of the
level-k cache.

We calibrate the power law under the most strenuous

test possible: using cache-miss observations from only

an L1 and L2 cache. This is the constraint under which

many consultants must work: they can measure an application
running in production on only a single hardware
platform. Theoretically, the stable calibration of
power law models calls for observations of cache misses
on 5 cache sizes [19]. Calibration from only two
observations skews the model’s predictions for smaller cache
sizes [33]. However, in terms of service time prediction,
the penalty of such inaccuracies (L1 latency) is low.
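With only two observations, the two power-law parameters are determined exactly rather than fitted. The sketch below illustrates this two-point calibration; the function names and the miss rates are hypothetical, and the cache sizes are simply example values in KB.

```python
import math


def calibrate_from_two_caches(size_l1, miss_l1, size_l2, miss_l2):
    """Solve ln(M) = B*ln(S) + A exactly from two (size, miss rate) points."""
    b = (math.log(miss_l2) - math.log(miss_l1)) / (
        math.log(size_l2) - math.log(size_l1))
    a = math.log(miss_l1) - b * math.log(size_l1)
    return a, b


def predict_miss_rate(a, b, cache_size):
    """Evaluate M = e^A * S^B for a cache size on the target platform."""
    return math.exp(a) * cache_size ** b


# Hypothetical observations: a 32 KB L1 cache with miss rate 0.04 and a
# 128 KB L2 cache with miss rate 0.01 on the current platform.
cal_a, cal_b = calibrate_from_two_caches(32, 0.04, 128, 0.01)
predicted_512kb = predict_miss_rate(cal_a, cal_b, 512)
```

Because the line passes through both points exactly, any measurement noise in either observation carries directly into A and B, which is why more cache sizes give a more stable calibration.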

4.1.3 Additional Service Time Adjustments

Most modern processors support manual and/or dynamic

frequency adjustments. Administrators can manually

throttle CPU frequencies to change the operation of the

processor under idle and busy periods. Such manual policies
override the CPUspeed parameter in processor spec
sheets. However, power-saving approaches in which the
frequency is automatically adjusted only during idle
periods are not considered in our model. Such dynamic
techniques should not affect performance during the busy
times in which the system is processing end-user requests.

We consider multi-processors as one large virtual pro-

cessor. Specifically, the CPUspeed parameter is the sum

of cycles per second across all available processors. We

do not distinguish between the physical implementations

of logical processors seen by the OS (e.g., SMT, multi-

core, or SMP). We note however that our model accuracy

could be improved by distinguishing between cores and

hyperthreads.

4.1.4 Predicting Response Times

Service time is not the only component of a request’s

total response time; often the request must wait for re-

sources to become available. This aspect of response

time, called queuing delay, increases non-linearly as the

demand for resources increases. Queuing models [27]

attempt to characterize queuing delay and response time

as a function of service time, request arrival rate, and

the availability of system resources. Our past work pre-

sented a queuing model that achieves accurate response

time prediction on real Internet services [40]. That par-

ticular model has two key advantages: 1) it considers the

weighted impact of request mix on service times and 2) it

can be easily calibrated in the production environments

that we target. It is a model of aggregate response time y

for a given request mix and is shown below:

$$y = \sum_{j=1}^{n} s_j N_j + \sum_{r} \left( \frac{1}{\lambda} \cdot \frac{U_r^2}{1 - U_r} \right) \cdot \sum_{j=1}^{n} N_j$$

where n is the number of request types, and $\lambda$ and $U_r$
denote the aggregate count of requests in the given mix (i.e.,
arrival rate) and the average utilization of resource r,
respectively. Utilization is the product of average service time
and arrival rate. The first term reflects the contribution of
service times to aggregate response time, and the second
considers queuing delays. For average response time, divide y by
$\lambda$. The parameter $s_j$ captures average service time for
type j and can be estimated via regression procedures using
observations of request response times and resource
utilizations [40]. Note that y reflects aggregate response time
for the whole system; $s_j$ includes delays caused by other
resources, not just processing time at the business-logic
tier. Our service time predictions target the portion of $s_j$
attributed to processing at the business-logic tier only.
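The structure of the response time model can be sketched as below. This is one reading of the model, assuming a per-request queuing delay of (1/λ)·U²/(1−U) for each resource with λ expressed in requests per second; the function name, the `interval_seconds` argument, and the sample values are all assumptions for illustration.

```python
def aggregate_response_time(service_times, counts, utilizations,
                            interval_seconds):
    """Aggregate response time in seconds for one request mix.

    service_times: per-type average service time in seconds (s_j)
    counts: number of requests of each type in the interval (N_j)
    utilizations: average utilization of each resource, each below 1 (U_r)
    """
    total_requests = sum(counts)
    arrival_rate = total_requests / interval_seconds  # requests per second
    # First term: time spent in service, summed over request types.
    service_term = sum(s * n for s, n in zip(service_times, counts))
    # Second term: per-request queuing delay, scaled by the request count.
    per_request_delay = sum(u * u / (1.0 - u) for u in utilizations) / arrival_rate
    queuing_term = per_request_delay * total_requests
    return service_term + queuing_term


# Hypothetical 30-second interval with two request types and one resource.
example_y = aggregate_response_time(
    service_times=[0.01, 0.02], counts=[300, 150],
    utilizations=[0.2], interval_seconds=30.0,
)
example_average = example_y / (300 + 150)
```

In this example the service term contributes 6 seconds and the queuing term 1.5 seconds. Raising the utilization toward 1 makes the queuing term grow non-linearly, which is the behavior the model is meant to capture.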

4.2 Evaluation Setup

We empirically evaluate our service and response time

predictions for RUBiS, StockOnline, and TPC-W. First,

we compare against alternative methods commonly used

in practice. Such comparisons suggest that our model

is immediately applicable for use in real world prob-

lems. Then we compare against a model recently pro-

posed in the research literature [25]. This comparison

suggests that our principled modeling methodology pro-

vides some benefits over state-of-the-art models.

4.2.1 Test Platforms

We experiment with 4 server machines that allow for a

total of 11 different platform configurations. The various

servers are listed below:

PIII: Dual-processor PIII Coppermine with 1100 MHz
clock rate, 32 KB L1 cache, and 128 KB L2 cache.

PRES: Dual-processor P4 Prescott with 2.2 GHz clock
rate, 16 KB L1 cache, and 512 KB L2 cache.

PD4: Four-processor dual-core Pentium D with 3.4 GHz
clock rate, 32 KB L1 cache, 4 MB L2 cache, and
16 MB L3 cache.

XEON: Dual-processor dual-core Pentium D Xeon that
supports hyperthreading. The processor runs at
2.8 GHz and has a 32 KB L1 cache and 2 MB L2
cache.

We used a combination of BIOS features and OS

scheduling mechanisms to selectively enable/disable hy-

perthreading, multiple cores, and multiple processors on

the XEON machine, for a total of eight configurations.

We describe configurations of the XEON using nota-

tion of the form “#H/#C/#P.” For example, 1H/2C/1P

means hyperthreading disabled, multiple cores enabled,

and multiple processors disabled. Except where other-

wise noted, we calibrate our models using log files from

half of a 10-hour trace on the PIII machine. Our log files

contain aggregate response times, request mix informa-

tion, instruction count, aggregate L1 cache misses, and

aggregate L2 cache misses. The trace contains over 500

mixes cumulatively. Request mixes from the remaining
half of the trace were used to evaluate the normalized
residual error, $|\text{predicted} - \text{actual}| / \text{actual}$, on the
remaining ten platforms/configurations. Most mixes in the second half

of the trace constitute extrapolation from the calibration

data set; specifically, mixes in the second half lie outside

the convex hull defined by the first half.

4.2.2 Alternative Models Used In Practice

Because complex models are hard to calibrate in data-

constrained production environments, the decision sup-

port tools used in practice have historically preferred

simplicity to accuracy. Commonly used tools involve

simple reasoning about the linear consequences of plat-

form parameters on CPU utilization and service times.

Processor Cycle Adjustments. Processor upgrades

usually mean more clock cycles per second. Although

the rate of increase in clock speeds of individual proces-

sor cores has recently slowed, the number of cores per

system is increasing. Highly-concurrent multi-threaded

software such as business-logic servers processing large

numbers of simultaneous requests can exploit the in-

creasing number of available cycles, so the net effect is

a transparent performance improvement. A common ap-

proach to predicting the performance impact of a hard-

ware upgrade is simply to assume that CPU service times

will decrease in proportion to the increase in clock speed.

For service time prediction, this implies

new