5.1 Fork/Join (also called Cilk-Style)

Example: summing all the values of an array. In a naive parallel program, we could divide the array into distinct pieces beforehand and then run one thread to compute the “sub-sum” of each piece.

This has a few issues:

  • Processor utilisation: what if the number of available cores changes during execution? We can’t just create more threads.
  • Load imbalance: what if some parts of the array take longer to sum up?

This is where fork/join-style programming comes in. We simply create far more “threads” (which in reality are tasks handled by a pool of worker threads), divide-and-conquer style, which solves both issues.

5.1.2 Manual Fixes

We can fix the issue of creating too many threads (two per recursive step in the case of summing) by creating only one and doing the other half of the work in the current thread directly.

This second version halves the number of threads needed.
We can also set the cutoff higher, which reduces the overhead by giving each thread more work.

However, this is still not optimal: we still create real threads, which incurs overhead and is very wasteful on deep recursion trees.

Note that each task should perform roughly 1,000 to 100,000 sequential operations, so that it is guaranteed to have enough work to amortise the task-creation overhead.
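The manual version described above can be sketched with raw threads (the class name and cutoff value are illustrative):

```java
// Manual divide-and-conquer sum: fork ONE new thread for the left half and
// compute the right half in the current thread, halving the thread count.
class SumThread extends Thread {
    static final int CUTOFF = 1000; // assumed sequential cutoff

    final int[] arr;
    final int lo, hi;
    long result;

    SumThread(int[] arr, int lo, int hi) { this.arr = arr; this.lo = lo; this.hi = hi; }

    @Override public void run() {
        if (hi - lo <= CUTOFF) {             // small enough: sum sequentially
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += arr[i];
            result = sum;
            return;
        }
        int mid = lo + (hi - lo) / 2;
        SumThread left  = new SumThread(arr, lo, mid);
        SumThread right = new SumThread(arr, mid, hi);
        left.start();                        // only the left half gets a new thread
        right.run();                         // right half runs in this thread
        try {
            left.join();                     // wait for the left half
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        result = left.result + right.result;
    }
}
```

Calling `run()` on the top-level task executes it in the caller’s thread, so each split creates one new thread instead of two.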

5.1.3 Usage

ForkJoinPool pool = new ForkJoinPool(); // or ForkJoinPool.commonPool()

RecursiveTask<V> (returns result)

class SumTask extends RecursiveTask<Long> {
    private final int[] arr;
    private final int lo, hi;
    private static final int THRESHOLD = 1000;
 
    SumTask(int[] arr, int lo, int hi) { this.arr = arr; this.lo = lo; this.hi = hi; }
 
    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += arr[i];
            return sum;
        }
        int mid = (lo + hi) / 2;
        SumTask left  = new SumTask(arr, lo, mid);
        SumTask right = new SumTask(arr, mid, hi);
        left.fork();              // async execute left
        long rResult = right.compute(); // compute right in current thread
        long lResult = left.join();     // wait for left
        return lResult + rResult;
    }
}
 
// Invoke
Long result = pool.invoke(new SumTask(arr, 0, arr.length));

RecursiveAction (no return value)

class SortAction extends RecursiveAction {
    @Override
    protected void compute() {
        if (small) { /* base case */ }
        else { invokeAll(leftTask, rightTask); }
    }
}
| Method | Description |
|---|---|
| `fork()` | Submit subtask asynchronously |
| `join()` | Wait for and get result |
| `compute()` | Execute task logic |
| `invoke(task)` | Submit + join (blocking) |
| `invokeAll(a, b)` | Fork all, join all |

5.2 Executor Service

The executor service has the issue that with a deep recursion tree that requires a lot of waiting (dependent tasks), we might eventually run out of threads.

It is therefore more suited for flat structures (independent requests, standalone tasks).

Thread Starvation

All threads in the pool are blocked on tasks that are waiting for subtasks which are themselves stuck in the internal queue, waiting for a free thread; no progress is possible.
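A minimal sketch of thread starvation with a fixed pool of one thread (the pool size and timeout are chosen purely for illustration):

```java
import java.util.concurrent.*;

public class StarvationDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(1);
        // The outer task occupies the only thread and then waits for the inner
        // task, which sits in the queue waiting for that same thread: deadlock.
        Future<Integer> outer = exec.submit(() -> {
            Future<Integer> inner = exec.submit(() -> 42);
            return inner.get();             // blocks forever: thread starvation
        });
        try {
            outer.get(1, TimeUnit.SECONDS); // give up after one second
        } catch (TimeoutException e) {
            System.out.println("starved");  // expected: no progress was possible
        }
        exec.shutdownNow();                 // interrupt the stuck task
    }
}
```

A `ForkJoinPool` avoids this for its own tasks because a worker blocked in `join()` can execute other queued subtasks in the meantime.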

5.2.1 Usage

Setup

// Fixed pool
ExecutorService exec = Executors.newFixedThreadPool(4);
// Single thread
ExecutorService exec = Executors.newSingleThreadExecutor();
// Cached (elastic)
ExecutorService exec = Executors.newCachedThreadPool();
// Scheduled
ScheduledExecutorService exec = Executors.newScheduledThreadPool(2);

Submit Tasks

// Runnable (no result)
exec.execute(() -> doWork());
 
// Callable<V> (returns Future)
Future<String> future = exec.submit(() -> {
    return "result";
});
String val = future.get();            // blocking
String val = future.get(5, TimeUnit.SECONDS); // with timeout
 
// Batch
List<Callable<String>> tasks = List.of(task1, task2, task3);
List<Future<String>> futures = exec.invokeAll(tasks);   // wait for all
String first = exec.invokeAny(tasks);                   // first completed

Shutdown

exec.shutdown();                      // no new tasks, finish existing
exec.shutdownNow();                   // interrupt running tasks
exec.awaitTermination(10, TimeUnit.SECONDS);

CompletableFuture (modern alternative)

CompletableFuture.supplyAsync(() -> fetchData(), exec)
    .thenApply(data -> process(data))
    .thenAccept(result -> save(result))
    .exceptionally(ex -> { log(ex); return null; });

Key Methods

| Method | Description |
|---|---|
| `execute(Runnable)` | Fire-and-forget |
| `submit(Callable)` | Returns `Future<V>` |
| `invokeAll(tasks)` | Run all, block until done |
| `invokeAny(tasks)` | Return first completed |
| `shutdown()` | Graceful stop |
| `shutdownNow()` | Force stop |

5.3 Trade-offs

5.4 Measuring Parallelism with the new Frameworks

We can model the execution of a Fork/Join program using a DAG.

For a classic divide-and-conquer approach, the DAG will look as follows:

5.4.1 Task Graphs

Task Graph

A task graph is a DAG (the only requirement; it may look weird).
The edges represent dependencies.
The nodes contain the runtime of the task (or sometimes parameters).

Example: Task Graphs?

All of these graphs are valid task graphs.

Complete Task Graph

A complete task graph also contains the “joins” at the end; a simplified task graph does not.

Span in the Complete / Simplified Graph

For the span, we count the nodes on the critical path.
Since we always count only the nodes, the span differs between the complete and the simplified graph.

Note that this task graph is dynamic: it unfolds as execution proceeds.
Independent nodes can, but don’t have to, be executed in parallel.

Example for Fibonacci
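The Fibonacci task graph comes from code like the following sketch (deliberately naive; each `compute()` adds nodes and edges as it forks):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Naive fork/join Fibonacci: every call forks one subtask and computes the
// other, so the unfolding of compute() calls IS the dynamic task graph.
class FibTask extends RecursiveTask<Long> {
    private final int n;

    FibTask(int n) { this.n = n; }

    @Override protected Long compute() {
        if (n <= 1) return (long) n;     // leaf node in the task graph
        FibTask f1 = new FibTask(n - 1);
        FibTask f2 = new FibTask(n - 2);
        f1.fork();                       // new edge: this node -> f1
        return f2.compute() + f1.join(); // the join edges close the DAG
    }
}
```

For example, `new ForkJoinPool().invoke(new FibTask(10))` returns 55.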

Work in a DAG

  1. T1 (work): the total work, i.e. the sum of all task execution times; the runtime on a single processor.
  2. TP (real-world parallel performance): the runtime with P threads, i.e. how much speedup we actually get.
  3. T∞ (span): the longest dependency chain, which cannot be parallelised; the best possible runtime even with infinitely many processors.

We cannot control TP; it depends on the scheduler, work distribution, thread contention, OS load, etc.

Example: here the scheduler could give us more or less parallel speedup:

We can calculate the speedup for this deep example graph and see the limits.

On this graph, we have T1 / T∞ = 2, thus even with infinite threads the maximum speedup is 2. The bottleneck is the span.

Note that the span plays the role of the sequential fraction of work in Amdahl’s Law. We want to reduce the span in order to increase the speed-up.

5.4.2 Extract Runtime from Graph

Given this example task-graph, how do we compute:

  1. Maximal Speedup
  2. Number of processors needed to achieve it

This graph has some work T1 (the sum of all node weights) and span T∞ (the sum of the weights on the critical path).

Thus the max speedup is T1 / T∞.

For this graph, 2 processors suffice; we just figure out which workloads each of them can take. With 2 processors we get a total time of 260, because of the chain on the right that takes 260.

5.4.3 Concrete Runtime Bounds

We have the following bounds on the parallel execution time TP.

Parallel Time Bounds

Lower bound:

  • Work law: TP ≥ T1 / P
  • Span law: TP ≥ T∞
  • TP ≥ max(T1 / P, T∞); no scheduler can do better than this

Upper bound:

  • Parallelizable work: (T1 − T∞) / P
  • DAG dependencies: T∞
  • A greedy scheduler achieves TP ≤ (T1 − T∞) / P + T∞
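To make the bounds concrete, a small numeric sketch (the values T1 = 1000, T∞ = 100, P = 4 are invented for illustration):

```java
public class Bounds {
    public static void main(String[] args) {
        long t1 = 1000, tInf = 100, p = 4;   // invented example values
        long lower = Math.max(t1 / p, tInf); // max(work law, span law)
        long upper = (t1 - tInf) / p + tInf; // greedy-scheduler upper bound
        System.out.println(lower + " <= TP <= " + upper); // prints "250 <= TP <= 325"
    }
}
```

Any scheduler therefore lands between 250 and 325 time units here, regardless of how clever it is.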

We get the following guarantees in Java.

THEOREM

The Fork/Join framework gives us an expected-time guarantee that is asymptotically optimal:

TP = T1 / P + O(T∞)

using the FJ work-stealing scheduler.