4. Measuring Parallelism

There are fundamental issues with throwing more cores at a problem.

The remaining sequential parts will come to dominate the runtime.
Data structures have to be chosen to suit parallel access.
Synchronisation and switching overheads add to the runtime.

This results in non-linear speedup.

Cache

There are multiple levels of cache in a CPU. The closer a level is to the core (and usually the smaller it is), the faster its access time.

This is where locality counts in programs (row vs. column major access). In a program, summing over an array with a stride of 64 will consistently produce cache misses, because consecutive accesses land in different cache lines.
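The effect of the stride can be sketched with a very simple model that only counts how many distinct cache lines a strided walk touches. The function name `cache_lines_touched`, the 4-byte element size and the 64-byte line size are assumptions made for this illustration; the model ignores prefetching, associativity and cache capacity, so it is not a measurement.

```python
def cache_lines_touched(n_accesses, stride, elem_size=4, line_size=64):
    """Count the distinct cache lines touched by n_accesses strided reads.

    Assumed model: the array starts at a line boundary, elements are
    elem_size bytes wide and a cache line holds line_size bytes.
    """
    lines = set()
    for i in range(n_accesses):
        byte_offset = i * stride * elem_size
        lines.add(byte_offset // line_size)
    return len(lines)


if __name__ == "__main__":
    for stride in (1, 16, 64):
        touched = cache_lines_touched(1024, stride)
        print(f"stride {stride:>2}: {touched} lines for 1024 accesses")
```

Under these assumptions a stride of 1 reuses each line 16 times, while a stride of 16 or more fetches a new line on every access.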

Measuring Performance

Parallel Performance

Sequential execution time $T_{1}$, execution time on $p$ cores $T_{p}$

$T_{p} = T_{1} / p$ (perfection)

$T_{p} > T_{1} / p$ (performance loss, normal situation)

$T_{p} < T_{1} / p$ (magic)

Parallel Speedup

The speedup $S_{p}$ for $p$ cores is $S_{p} = T_{1} / T_{p}$

$S_{p} = p$ linear speedup (perfection)

$S_{p} < p$ sub-linear speedup

$S_{p} > p$ super-linear speedup

Efficiency

$E = S_{p} / p$, how well utilised each core is.
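The definitions above translate directly into two small helpers. This is a minimal sketch; the function names and the timings in the example are hypothetical and not taken from any real measurement.

```python
def speedup(t1, tp):
    """Speedup S_p = T_1 / T_p."""
    return t1 / tp


def efficiency(t1, tp, p):
    """Efficiency E = S_p / p, the utilisation of each core."""
    return speedup(t1, tp) / p


if __name__ == "__main__":
    # Hypothetical timings: 12 s sequentially, 4 s on 4 cores.
    t1, tp, p = 12.0, 4.0, 4
    s = speedup(t1, tp)
    kind = "linear" if s == p else "sub-linear" if s < p else "super-linear"
    print(f"S_p = {s:.2f} ({kind}), E = {efficiency(t1, tp, p):.2f}")
```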

Speedup Calculations

Calculate the speedup of a program with the following distribution:

Identify the sequential and parallel fractions of your program

$T_{seq} = 0.2 \cdot T_{1}$
$T_{par} = 0.8 \cdot T_{1}$
Calculate the parallel execution time on $p$ cores: $T_{p} = T_{seq} + \frac{T_{par}}{p} = 0.2 \cdot T_{1} + \frac{0.8 \cdot T_{1}}{p}$. The speedup is then $S_{p} = \frac{T_{1}}{T_{p}}$, and the efficiency is $E = \frac{S_{p}}{p}$.
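As a worked instance of the steps above, assume $p = 4$ cores (a value chosen here purely for illustration, not given in the exercise):

$T_{4} = 0.2 \cdot T_{1} + \frac{0.8 \cdot T_{1}}{4} = 0.4 \cdot T_{1}$
$S_{4} = \frac{T_{1}}{0.4 \cdot T_{1}} = 2.5$
$E = \frac{2.5}{4} = 0.625$

So four cores give a speedup of 2.5, with each core utilised at roughly 63%.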

Laws of Parallel Execution Time

Amdahl’s Law

The execution time of a sequential program $T_{1}$ is made up of two kinds of work:

non-parallelisable serial work: $W_{ser}$
parallelisable work: $W_{par}$

Given $P$ workers, the times for sequential and parallel execution are:

$T_{1} = W_{ser} + W_{par}$

and this bounds the parallel execution time to

$T_{P} \geq W_{ser} + \frac{W_{par}}{P}$

Amdahl's Law

$S_{P} \leq \frac{W_{ser} + W_{par}}{W_{ser} + \frac{W_{par}}{P}}$

Parallelisable fraction

If $f$ is the fraction of non-parallelisable work, then $W_{ser} = f \cdot T_{1}$ and $W_{par} = (1 - f) \cdot T_{1}$, which gives
$S_{P} \leq \frac{1}{f + \frac{1 - f}{P}}$

Proof:

The speedup is $T_{1} / T_{P}$ .
$T_{1}$ is simply the time it takes to do the serial fraction, then the parallel one
- $T_{1} = f \cdot T_{1} + (1 - f) \cdot T_{1}$
$T_{P}$ has the same amount of time for the serial fraction, but we divide the parallel fraction by $P$
- $T_{P} \geq T_{1} \cdot f + T_{1} (1 - f) / P$
Dividing $T_{1}$ by $T_{P}$ cancels the $T_{1}$ and gives the speedup bound.

Note, if we have infinite workers, $S_{\infty} \leq \frac{1}{f}$ .
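The bound and its limit can be evaluated numerically. The following is a minimal sketch; the function names `amdahl_speedup` and `amdahl_limit` and the chosen serial fraction $f = 0.2$ are assumptions for illustration.

```python
def amdahl_speedup(f, p):
    """Upper bound on speedup for serial fraction f on p workers."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (f + (1.0 - f) / p)


def amdahl_limit(f):
    """Bound for infinitely many workers: 1 / f (unbounded if f == 0)."""
    return float("inf") if f == 0 else 1.0 / f


if __name__ == "__main__":
    f = 0.2  # assumed serial fraction
    for p in (1, 2, 4, 16, 256, 4096):
        s = amdahl_speedup(f, p)
        print(f"P = {p:>4}: S <= {s:.3f}, E <= {s / p:.3f}")
    print(f"limit for P -> infinity: {amdahl_limit(f):.3f}")
```

Printing the bound for growing $P$ shows the speedup approaching $1 / f$ while the efficiency falls towards zero.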

If we compare the speed-up and efficiency of a workload, depending on the serial fraction, we can see gains quickly flatten off.

Amdahl’s Law is a pessimistic view: it puts an upper limit on scalability. Every non-parallel part of a program limits the achievable speedup.

The effort to reduce the fraction of non-parallel code pays off in large performance gains!

One has to weigh the trade-off between adding resources and the actual performance gain.

Gustafson’s Law

Gustafson’s Law is the optimistic version. It holds the run-time constant and lets the problem size grow: more processors allow larger problems to be solved in the same time.

Gustafson's Law

$W = P (1 - f) \cdot T_{wall} + f \cdot T_{wall}$
and
$S_{P} = f + P (1 - f) = P - f (P - 1)$

Proof:

For Gustafson’s Law we take the sequential time $T_{1}$ to be the total work of the parallel run, done sequentially
- $T_{1} = T \cdot f + T \cdot (1 - f) \cdot P$ - we basically stack up the work that could be done in parallel
Then the parallel time is just $T_{P} = T \cdot f + T \cdot (1 - f) \cdot P / P = T$
Dividing $T_{1}$ by $T_{P}$ cancels the $T$ and gives $S_{P} = f + (1 - f) \cdot P$: the sequential fraction is done once, and the $(1 - f)$ fraction is done once per processor.
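The scaled speedup can be evaluated in the same way. This is a minimal sketch; the function name `gustafson_speedup` and the serial fraction $f = 0.2$ are assumptions for illustration.

```python
def gustafson_speedup(f, p):
    """Scaled speedup S_P = f + (1 - f) * P for serial fraction f."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    return f + (1.0 - f) * p


if __name__ == "__main__":
    f = 0.2  # assumed serial fraction of the fixed wall time
    for p in (1, 2, 4, 16, 256):
        s = gustafson_speedup(f, p)
        # Both forms of the law agree: f + P(1 - f) = P - f(P - 1).
        assert abs(s - (p - f * (p - 1))) < 1e-9
        print(f"P = {p:>3}: S = {s:.1f}")
```

Unlike the fixed-work bound, this value keeps growing linearly in $P$, because the amount of parallel work grows with the number of processors.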

Summary

Amdahl’s and Gustafson’s laws aren’t different views on the same problem. They make different assumptions (fixed work for Amdahl’s, fixed time for Gustafson’s).

Niklas @ ETHZ
