Java Fork/Join framework performance: when is it worth it?

In the previous post we saw how to use the Fork/Join framework introduced in Java 7 to take advantage of multithreading when solving problems that lend themselves to a divide-and-conquer strategy. In this article we will see if and when its use actually improves performance compared to a traditional sequential approach. We start again from the problem used to illustrate how the framework works: counting the number of occurrences of a certain value x within an array of integers. With the fork/join framework we divided the problem into smaller subproblems, counted the occurrences in each sub-problem with a different thread (based on the number of processors available on the machine), and then computed the final result by combining the partial results of the individual sub-problems.

What we do now is compare the performance of this solution with that of the classic one, which simply iterates over all the elements of the array and updates the occurrence counter, without using any multithreading mechanism.
We will perform this test on different sizes of the input array, executing 5 runs for each size taken into consideration, in order to analyze how the effectiveness of multithreading varies with the size of the input to be processed (a minimal harness sketch for organizing these runs follows the list). The sizes of the input array that we'll test are:

  • 1 Million elements
  • 5 Million elements
  • 10 Million elements
  • 20 Million elements
  • 35 Million elements

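The article does not show how the 5 runs per input size were orchestrated, so the following is only a minimal sketch of one possible way to time and average the runs; the class and helper names (TimingHarness, timeRunMillis, averageTimeMillis) are illustrative assumptions and are not part of the original program:

import java.util.function.IntSupplier;

public class TimingHarness {

    // Hypothetical helper: times a single run of the given computation, in milliseconds.
    static double timeRunMillis(IntSupplier computation) {
        long start = System.nanoTime();
        computation.getAsInt();
        long end = System.nanoTime();
        return (end - start) / 1_000_000.0;
    }

    // Hypothetical helper: averages the execution time over a fixed number of runs.
    static double averageTimeMillis(IntSupplier computation, int runs) {
        double total = 0.0;
        for (int i = 0; i < runs; i++) {
            total += timeRunMillis(computation);
        }
        return total / runs;
    }
}

Each of the two approaches can then be passed as a lambda, for example averageTimeMillis(() -> pool.invoke(new RecursiveCountOfX(0, n - 1, dim)), 5), once the classes shown below are in place.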
First we have to change the code of the previous article, adding an array loading phase in which we populate it with randomly generated elements. To do this we use a static initialization block:

static {
	Random random = new Random();
	for( int i = 0; i< SEARCH_ARRAY.length; i++) {
		SEARCH_ARRAY[i] = random.nextInt(100);
	}
}
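Since random.nextInt(100) returns a uniformly distributed value between 0 and 99, the value we search for (TO_FIND = 1) appears with probability 1/100, so we can expect roughly 10,000 occurrences per million elements; the counts reported in the runs below are consistent with this.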

Furthermore, compared to the old code, we also have to capture the start and end times of the computation, both for the parallel procedure based on the fork/join framework and for the classic loop-based approach:

// we start the execution with the fork/join framework
long startTime = System.nanoTime();
int numOfOcc = pool.invoke(new RecursiveCountOfX(0, n - 1, dim));

System.out.printf("Total number of occurrences of the element " + TO_FIND + ": " + numOfOcc + "\n");
long endTime = System.nanoTime();
System.out.println("Solution with Fork/Join:  " + ((endTime - startTime)/1000000.0) + " milliseconds");

So, the full code of the program used for testing becomes the following:

import java.util.Random;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ForkJoinFindXOccurrences {

    public static int   numOfThreads;
    public static final int[] SEARCH_ARRAY = new int[35_000_000];
    public static final int   TO_FIND      = 1;

    static {
        Random random = new Random();
        for( int i = 0; i< SEARCH_ARRAY.length; i++) {
            SEARCH_ARRAY[i] = random.nextInt(100);
        }
    }

    public static void main(final String[] args) {

        System.out.println("Number of processors: " + Runtime.getRuntime().availableProcessors());
        numOfThreads = Runtime.getRuntime().availableProcessors();

        System.out.println("Number of array elements: " + SEARCH_ARRAY.length);

        int dim = SEARCH_ARRAY.length / numOfThreads;

        final ForkJoinPool pool = new ForkJoinPool(numOfThreads);

        final int n = SEARCH_ARRAY.length;

        // we start the execution with the fork/join framework
        long startTime = System.nanoTime();
        int numOfOcc = pool.invoke(new RecursiveCountOfX(0, n - 1, dim));

        System.out.printf("Total number of occurrences of the element " + TO_FIND + ": " + numOfOcc + "\n");
        long endTime = System.nanoTime();
        System.out.println("Solution with Fork/Join:  " + ((endTime - startTime)/1000000.0) + " milliseconds");

        // we calculate the result with a "classic" loop
        numOfOcc = 0;
        long startTime2 = System.nanoTime();
        for (int k :SEARCH_ARRAY) {
            if (k == TO_FIND) {
                numOfOcc++;
            }
        }

        System.out.printf("Total number of occurrences of the element "
                          + TO_FIND + ": " + numOfOcc + "\n");
        long endTime2 = System.nanoTime();
        System.out.println("Solution with a classic loop:  " + ((endTime2 - startTime2)/1000000.0) + " milliseconds");
    }


    static class RecursiveCountOfX extends RecursiveTask<Integer> {

        int start, end, dim;

        public RecursiveCountOfX(final int start, final int end, final int dim) {
            this.start = start;
            this.end = end;
            this.dim = dim;
        }

        @Override
        public Integer compute() {

            if ((this.end - this.start) < this.dim) {
                int ris = 0;
                for (int i = this.start; i <= this.end; i++) {
                    if (SEARCH_ARRAY[i] == TO_FIND) {
                        ris++;
                    }
                }

                return ris;
            }

            // recursive case: split the segment in half
            final int mid = (this.start + this.end) / 2;

            final RecursiveCountOfX subTask1 = new RecursiveCountOfX(this.start, mid, this.dim);

            // process the first half asynchronously...
            subTask1.fork();

            final RecursiveCountOfX subTask2 = new RecursiveCountOfX(mid + 1, this.end, this.dim);

            // ...compute the second half in the current thread and combine the partial results
            return subTask2.compute() + subTask1.join();
        }
    }
}
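It is worth noting how the sequential cutoff is chosen: dim is set to SEARCH_ARRAY.length / numOfThreads, and each task splits its segment in half until its length drops below dim. With 4 available processors and 35 million elements, for example, dim is about 8.75 million, so the recursion bottoms out after two levels of splitting and produces roughly one leaf task per processor, each scanning about dim elements sequentially. A possible variant, not used for the measurements in this article, is a fixed cutoff independent of the number of processors, leaving the work-stealing pool to balance a larger number of smaller tasks; a minimal sketch, where the threshold value is purely illustrative:

// hypothetical variant: fixed sequential cutoff instead of length / numOfThreads
int dim = 100_000;
int numOfOcc = pool.invoke(new RecursiveCountOfX(0, n - 1, dim));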

Running the program with an initial input array size of 1 million elements, we get the following result:

Number of processors: 4
Number of array elements: 1000000
Total number of occurrences of the element 1: 9962
Solution with Fork/Join:  12.847316 milliseconds
Total number of occurrences of the element 1: 9962
Solution with a classic loop:  3.085275 milliseconds

From this single run there seems to be a significant difference in favor of the simple execution based on a normal loop that iterates through all the elements and compares each one with the value whose occurrences we need to count.
Before proceeding with the analysis of the results obtained with a series of 5 runs for each input size, we can try to increase the size of the input array to 5 million elements and see if we notice any difference:

Number of processors: 4
Number of array elements: 5000000
Total number of occurrences of the element 1: 49886
Solution with Fork/Join:  11.151486 milliseconds
Total number of occurrences of the element 1: 49886
Solution with a classic loop:  8.995929 milliseconds

We can immediately see that the gap between the running times of the two solutions has been significantly reduced, although the approach based on a classic loop remains faster even at this input size.

The following figure shows the results of executing the two procedures, each of which was run 5 times on input array sizes of 1, 5, 10, 20 and 35 million elements.

Java 7 Fork/Join framework performance

By running these tests and computing the average execution time for each input size, we can see that the performance of the fork/join framework becomes better than that of the standard iterative approach as the size of the input increases. In the figure, for each size of the input array, the solution that requires the least time is highlighted in green, while the slower one is highlighted in red when it turns out to be significantly slower, or in yellow when it is only slightly slower than the other.
On an array of one million elements, using the fork/join framework is inefficient and significantly slower than a simple iteration over the elements of the array. This is because the cost of dividing the problem into subproblems and of instantiating the various threads weighs too much compared to the cost of solving each individual sub-problem, so approaching the original problem iteratively with a single thread is still better. Increasing the size of the input array to 5 million elements, as we saw earlier, reduces the performance gap, which however remains, even if only slightly, to the disadvantage of the fork/join framework (it is in yellow in the table). Increasing the input further to 10 million elements, we notice instead that the use of the fork/join framework begins to produce some advantage and turns out to be better than iterating sequentially over all the elements of the array. In this case the cost of creating the threads and aggregating the partial results has a smaller impact compared to the cost of executing the tasks on the subproblems and, consequently, parallelizing the computation gives a better result than the iterative approach. The initialization of the fork/join framework and the creation of its threads are amortized more and more as the input size increases; by reducing their impact relative to the effort required to solve the subproblem assigned to each thread, parallelizing the computation across multiple threads becomes significantly more advantageous than having a single thread iterate over the entire array.
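A rough way to reason about this break-even point: if a sequential scan processes N elements at a throughput of r elements per unit of time, the plain loop costs about N / r, while the fork/join solution costs roughly C + N / (p * r), where p is the number of worker threads and C is the fixed overhead of creating the tasks, scheduling them on the pool and joining the partial results. The parallel version wins once the time saved, N * (1 - 1/p) / r, exceeds C; this is why the advantage only shows up beyond a certain input size, which on the machine used here lies somewhere between 5 and 10 million elements.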

Trying to further increase the size of the input up to the limit allowed by the machine used for these tests, just before getting an OutOfMemoryError, we get an additional confirmation of the analysis just performed.

Number of array elements: 44000000
Total number of occurrences of the element 1: 438809
Solution with Fork/Join:  28.117031 milliseconds
Total number of occurrences of the element 1: 438809
Solution with a classic loop:  67.360796 milliseconds
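As a side note, the point at which the OutOfMemoryError appears depends on the JVM heap settings rather than on the framework itself: the array of 44 million ints alone occupies about 176 MB, and the maximum heap size can be raised with the standard -Xmx option (for example java -Xmx1g ForkJoinFindXOccurrences) if larger inputs need to be tested.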
