BufferedReader and FileInputStream VS Scanner: performance comparison in reading and parsing a 200k lines text file

In this post we do a performance comparison between two different mechanisms for reading a file in Java: the first one consists in the construct of chained class BufferedReader, InputStreamReader and FileInputStream and the other one is represented by the simple Scanner class. The combined use of BufferedReader and InputStreamReader is indicated by java documentation as the most efficient method for reading text data. This solution allows you to read a file line by line using the readline() method, which returns a String. If operations are necessary to interpret the contents of the rows (separation into tokens, type conversions, etc ..), they are entirely in charge of the developer.
The Scanner class instead provides a flexible solution for this last case, since it allows to divide the contents of a line into tokens according to a given separator and it provides methods to automatically convert each token in the primitive type we would expect it represents, possibly after a test to verify that it is actually of that type. For example Scanner provides the method

boolean hasNextInt()

which returns true if the next token can be interpreted as an integer. In case of positive outcome we can then invoke the method

int nextInt()

that reads the token from the file and converts it to an integer.

The goal now is to execute a test to see which of two methods provides the best performance in reading and interpretation of a text file containing 200,000 lines, formed by different tokens that have to be converted to numeric formats. The need arose after creating a sample text file to test my solution for a Spotify programming puzzle.

The files that we need to process has the following structure:

  • An integer N indicating the number of following rows that represent the intervals
  • N lines formed by 3 tokens of the following types: long, long, int
  • Another integer M indicating the number of following rows that represent the queries
  • M lines formed by 2 tokens of the following types: long, long

For our tests we use a value of 100,000 for the integers N and M representing the number of rows for each of the two types. Here’s an example of how the file we’re going to read looks like:

100000
1325338338022 320412 160
1325338361231 441201 320
1325338341231 474 123 96
1325338312302 234123 312
1325338331141 623132 147
. . .
. . .
100000
1325300000000 1325400000000
1325338300000 1325338500000
1325338336412 1325338339612
1325338312302 1325338341231
1325338320000 1325338340000
. . .
. . .

The two strategies that we will compare are:

  • Read the lines with a BufferedReader, split the obtained String and convert each part to the correct datatype with the parsing methods provided by primitive types wrapper classes (Long.parseLong, Integer.parseInt, etc ..).
  • Use a scanner and read every token directly with the method for the type that we expect to acquire (nextLong(), nextInt(), etc ..)

With the values read we create instances of two classes that we can call StatsRecord for the first type of line and Interval for the second one and we insert these instances into two ArrayList. After execution, if everything goes as expected, both ArrayList must have a size of 100,000 items. For simplicity we assume that all values are in fact of the type we expect, so there are no parsing or conversion errors.

Let’s start with the first solution, the one based on the BufferedReader. What we do is to “concatenate” a BufferedReader with an InputStreamReader and then a FileInputStream in order to read the file line by line. In the first line we have only one value, so we read it as a string and we parse it to int. At this point we have to read the N subsequent lines that, as we have seen before, are formed by three parts (token) each. Now we execute on each line the split() method, and we get an array of String with the various string pieces, obtained by dividing the original string at any occurrence of the ” ” (space) character. The token obtained with the split() are then converted into two long and an int using the parse methods provided by the relative wrapper classes of the primitive types and we use them to create objects of the StatsRecord class. After reading these N lines we find another integer, which indicates that there will be after it the M rows of the second type. On these lines we apply the same procedure, so we split them, we convert the token and we create objects of Interval class.
To measure performance, we use the currentTimeMillis() provided by System in order to get the start and end instants of the procedure of reading and parsing the data contained in the file and then we’ll subtract them in order to calculate the number of milliseconds taken.
Here’s the code of the first solution:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

class Interval {

    private long start;
    private long end;

    public Interval(long start, long end) {

        this.start = start;
        this.end = end;
    }
}

class StatsRecord {

    private long   end;
    private long   duration;
    private int rate;

    public StatsRecord(long end, long duration, int rate) {

        this.end = end;
        this.duration = duration;
        this.rate = rate;
    }
}

public class ReadInputTest {

    List<StatsRecord> stats   = new ArrayList<>();
    List<Interval>    queries = new ArrayList<>();

    public static void main(String[] args) {

        ReadInputTest streams = new ReadInputTest();

        long startTime = System.currentTimeMillis();

        int nLines = 0, nQueries = 0;
        int size = 0;
        String input = null;
        String[] tmp = null;
        try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("C:\\Users\\MolinariD\\Desktop\\input2.txt")))) {

            input = br.readLine();
            nLines = Integer.parseInt(input);

            if ((nLines >= 1) && (nLines <= 100_000)) {

                for (int i = 0; i < nLines; i++) {
                    input = br.readLine();
                    tmp = input.split(" ");
                    StatsRecord sr = new StatsRecord(Long.parseLong(tmp[0]), Long.parseLong(tmp[1]), Integer.parseInt(tmp[2]));
                    streams.stats.add(sr);
                }

            }

            input = br.readLine();
            nQueries = Integer.parseInt(input);

            if ((nQueries >= 1) && (nQueries <= 100_000)) {

                tmp = null;
                for (int i = 0; i < nQueries; i++) {
                    input = br.readLine();
                    tmp = input.split(" ");
                    Interval in = new Interval(Long.parseLong(tmp[0]), Long.parseLong(tmp[1]));
                    streams.queries.add(in);
                }

            }

            System.out.println("Streams: " + streams.stats.size());
            System.out.println("Queries: " + streams.queries.size());

            long endTime = System.currentTimeMillis();
            System.out.println("Execution Time: " + ((endTime - startTime)) + " milliseconds");
        }
        catch (IOException | NumberFormatException e) {
            System.out.println("Error processing the file!");
        }
    }
}

Executing the program we get an output like this:

Streams: 100000
Queries: 100000
Execution Time: 159 milliseconds

First we printed the size of the two ArrayList used to hold objects, in order to check that all lines have actually been read and the values have been interpreted correctly. Then we printed out the time the program takes to read the entire file.
By running it for 5 times the results obtained were the following:

159 milliseconds
159 milliseconds
157 milliseconds
157 milliseconds
158 milliseconds

Let’s look now at the solution 2, the one based on using the Scanner class. Let’s open the Scanner, directly within the Java7 try-with-resources construct, with the instruction

Scanner sc = new Scanner (new File ("C:\\Users\\Desktop\\MolinariD\\input2.txt"))

and invoke the nextInt() method to acquire the integer in the first row. At this point, for each of the N following rows, we acquire the values directly using the next methods relating to the types that we want to read. For each line of the first type we saw that we expect two long and an int, so we read them invoking on the Scanner instance twice the nextLong() method and once the nextInt() method. We use these values directly in the call to the StatsRecord class constructor to instantiate objects. At the end of N rows we read the new integer value and repeat the acquisition of M rows. In this case, as we said earlier, we assume that the values we read are not subject to errors and therefore we always find the data type that we expect when we make an acquisition with the next methods of Scanner. The correct use, actually, would be to test that the token we are going to read are actually of that type, with methods hasNextInt() , hasNextLong() , etc..
In the case where a token we read was of a different type from that of the “next method” used, for example, it was an alphanumeric string and we’re reading it with nextInt() method, a InputMismatchException would be thrown.
Here we see the code of the solution 2, where again we insert instructions for printing out the number of objects created and the time taken.

import java.io.File;
import java.io.FileNotFoundException;
import java.util.ArrayList;
import java.util.InputMismatchException;
import java.util.List;
import java.util.Scanner;

class Interval {

	private long start;
	private long end;

	public Interval(long start, long end) {

		this.start = start;
		this.end = end;
	}
}

class StatsRecord {

	private long end;
	private long duration;
	private int rate;

	public StatsRecord(long end, long duration, int rate) {

		this.end = end;
		this.duration = duration;
		this.rate = rate;
	}
}

public class ReadInputTest {

	List<StatsRecord> stats = new ArrayList<>();
	List<Interval> queries = new ArrayList<>();

	public static void main(String[] args) {

		ReadInputTest streams = new ReadInputTest();

		long startTime = System.currentTimeMillis();

		int nLines = 0, nQueries = 0;
		int size = 0;
		String input = null;
		String[] tmp = null;

		try (Scanner sc = new Scanner(new File("C:\\Users\\MolinariD\\Desktop\\input2.txt"))){
			
			nLines = sc.nextInt();

			if ((nLines >= 1) && (nLines <= 100_000)) {

				for (int i = 0; i < nLines; i++) {
					StatsRecord sr = new StatsRecord(sc.nextLong(), sc.nextLong(), sc.nextInt());
					streams.stats.add(sr);
				}
			}

			nQueries = sc.nextInt();

			if ((nQueries >= 1) && (nQueries <= 100_000)) {

				tmp = null;
				for (int i = 0; i < nQueries; i++) {
					Interval in = new Interval(sc.nextLong(), sc.nextLong());
					streams.queries.add(in);
				}
			}

			System.out.println("Streams: " + streams.stats.size());
			System.out.println("Queries: " + streams.queries.size());

			long endTime = System.currentTimeMillis();
			System.out.println("Execution Time: " + ((endTime - startTime)) + " milliseconds");
			
		} 
		catch (InputMismatchException | FileNotFoundException e) {
                        System.out.println("Error processing the file!");
		}
	}
}

Executing the program we get the following result:

Streams: 100000
Queries: 100000
Execution Time: 801 milliseconds

Also this time the number of objects created is correct, which means that reading of all the rows is successful and there were no problems about the type of token read and type conversions.
The time taken by this solution with the Scanner is 801 milliseconds, compared with 159 milliseconds taken by the solution based on the BufferedReader.
By running even in this case the program 5 times, we obtain the following results:

801 milliseconds
788 milliseconds
789 milliseconds
800 milliseconds
794 milliseconds

From this test on reading a text file with 200,000 lines, consisting of 2 or 3 token each and with token of different datatypes, it appears that solution based on BufferedReader with strings split and type conversions performed “by hand” is significantly more efficient than the solution based on the Scanner with the automatic management of data parsing and conversion.

Classes and the 200k lines sample file used for these tests can be downloaded here:

Leave a Reply

Your email address will not be published. Required fields are marked *