Package com.google.code.externalsorting
Class ExternalSort
- java.lang.Object
-
- com.google.code.externalsorting.ExternalSort
-
public class ExternalSort extends java.lang.Object
Goal: offer a generic external-memory sorting program in Java. It must be : - hackable (easy to adapt) - scalable to large files - sensibly efficient. This software is in the public domain. Usage: java com/google/code/externalsorting/ExternalSort somefile.txt out.txt You can change the default maximal number of temporary files with the -t flag: java com/google/code/externalsorting/ExternalSort somefile.txt out.txt -t 3 For very large files, you might want to use an appropriate flag to allocate more memory to the Java VM: java -Xms2G com/google/code/externalsorting/ExternalSort somefile.txt out.txt By (in alphabetical order) Philippe Beaudoin, Eleftherios Chetzakis, Jon Elsas, Christan Grant, Daniel Haran, Daniel Lemire, Sugumaran Harikrishnan, Amit Jain, Thomas Mueller, Jerry Yang, First published: April 2010 originally posted at http://lemire.me/blog/archives/2010/04/01/external-memory-sorting-in-java/
-
-
Field Summary
Fields Modifier and Type Field Description static java.util.Comparator<java.lang.String>
defaultcomparator
default comparator between strings.static int
DEFAULTMAXTEMPFILES
Default maximal number of temporary files allowed.
-
Constructor Summary
Constructors Constructor Description ExternalSort()
-
Method Summary
All Methods Static Methods Concrete Methods Modifier and Type Method Description static long
estimateAvailableMemory()
This method calls the garbage collector and then returns the free memory.static long
estimateBestSizeOfBlocks(long sizeoffile, int maxtmpfiles, long maxMemory)
we divide the file into small blocks.static void
main(java.lang.String[] args)
static long
mergeSortedFiles(java.io.BufferedWriter fbw, java.util.Comparator<java.lang.String> cmp, boolean distinct, java.util.List<com.google.code.externalsorting.BinaryFileBuffer> buffers)
This merges several BinaryFileBuffer to an output writer.static long
mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile)
This merges a bunch of temporary flat filesstatic long
mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp)
This merges a bunch of temporary flat filesstatic long
mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, boolean distinct)
This merges a bunch of temporary flat filesstatic long
mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs)
This merges a bunch of temporary flat filesstatic long
mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, boolean distinct)
This merges a bunch of temporary flat filesstatic long
mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, boolean distinct, boolean append, boolean usegzip)
This merges a bunch of temporary flat filesstatic void
sort(java.io.File input, java.io.File output)
This sorts a file (input) to an output file (output) using default parametersstatic void
sort(java.io.File input, java.io.File output, java.util.Comparator<java.lang.String> cmp)
This sorts a file (input) to an output file (output) using customized comparatorstatic java.io.File
sortAndSave(java.util.List<java.lang.String> tmplist, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory)
Sort a list and save it to a temporary filestatic java.io.File
sortAndSave(java.util.List<java.lang.String> tmplist, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, boolean usegzip, boolean parallel)
Sort a list and save it to a temporary filestatic java.util.List<java.io.File>
sortInBatch(java.io.BufferedReader fbr, long datalength)
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>
sortInBatch(java.io.BufferedReader fbr, long datalength, java.util.Comparator<java.lang.String> cmp, boolean distinct)
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>
sortInBatch(java.io.BufferedReader fbr, long datalength, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel)
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>
sortInBatch(java.io.File file)
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>
sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp)
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>
sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, boolean distinct)
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>
sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct)
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>
sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader)
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>
sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip)
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>
sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel)
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>
sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, java.io.File tmpdirectory, boolean distinct, int numHeader)
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.static java.util.List<java.io.File>
sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader)
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.
-
-
-
Field Detail
-
defaultcomparator
public static java.util.Comparator<java.lang.String> defaultcomparator
default comparator between strings.
-
DEFAULTMAXTEMPFILES
public static final int DEFAULTMAXTEMPFILES
Default maximal number of temporary files allowed.- See Also:
- Constant Field Values
-
-
Method Detail
-
estimateAvailableMemory
public static long estimateAvailableMemory()
This method calls the garbage collector and then returns the free memory. This avoids problems with applications where the GC hasn't reclaimed memory and reports no available memory.- Returns:
- available memory
-
estimateBestSizeOfBlocks
public static long estimateBestSizeOfBlocks(long sizeoffile, int maxtmpfiles, long maxMemory)
we divide the file into small blocks. If the blocks are too small, we shall create too many temporary files. If they are too big, we shall be using too much memory.- Parameters:
sizeoffile
- how much data (in bytes) can we expectmaxtmpfiles
- how many temporary files can we create (e.g., 1024)maxMemory
- Maximum memory to use (in bytes)- Returns:
- the estimate
-
main
public static void main(java.lang.String[] args) throws java.io.IOException
- Parameters:
args
- command line argument- Throws:
java.io.IOException
- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(java.io.BufferedWriter fbw, java.util.Comparator<java.lang.String> cmp, boolean distinct, java.util.List<com.google.code.externalsorting.BinaryFileBuffer> buffers) throws java.io.IOException
This merges several BinaryFileBuffer to an output writer.- Parameters:
fbw
- A buffer where we write the data.cmp
- A comparator object that tells us how to sort the lines.distinct
- Passtrue
if duplicate lines should be discarded.buffers
- Where the data should be read.- Returns:
- The number of lines sorted.
- Throws:
java.io.IOException
- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile) throws java.io.IOException
This merges a bunch of temporary flat files- Parameters:
files
- TheList
of sortedFile
s to be merged.outputfile
- The outputFile
to merge the results to.- Returns:
- The number of lines sorted.
- Throws:
java.io.IOException
- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp) throws java.io.IOException
This merges a bunch of temporary flat files- Parameters:
files
- TheList
of sortedFile
s to be merged.outputfile
- The outputFile
to merge the results to.cmp
- TheComparator
to use to compareString
s.- Returns:
- The number of lines sorted.
- Throws:
java.io.IOException
- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, boolean distinct) throws java.io.IOException
This merges a bunch of temporary flat files- Parameters:
files
- TheList
of sortedFile
s to be merged.outputfile
- The outputFile
to merge the results to.cmp
- TheComparator
to use to compareString
s.distinct
- Passtrue
if duplicate lines should be discarded.- Returns:
- The number of lines sorted.
- Throws:
java.io.IOException
- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs) throws java.io.IOException
This merges a bunch of temporary flat files- Parameters:
files
- TheList
of sortedFile
s to be merged.outputfile
- The outputFile
to merge the results to.cmp
- TheComparator
to use to compareString
s.cs
- TheCharset
to be used for the byte to character conversion.- Returns:
- The number of lines sorted.
- Throws:
java.io.IOException
- generic IO exception
-
mergeSortedFiles
public static long mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, boolean distinct) throws java.io.IOException
This merges a bunch of temporary flat files- Parameters:
files
- TheList
of sortedFile
s to be merged.distinct
- Passtrue
if duplicate lines should be discarded.outputfile
- The outputFile
to merge the results to.cmp
- TheComparator
to use to compareString
s.cs
- TheCharset
to be used for the byte to character conversion.- Returns:
- The number of lines sorted.
- Throws:
java.io.IOException
- generic IO exception- Since:
- v0.1.2
-
mergeSortedFiles
public static long mergeSortedFiles(java.util.List<java.io.File> files, java.io.File outputfile, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, boolean distinct, boolean append, boolean usegzip) throws java.io.IOException
This merges a bunch of temporary flat files- Parameters:
files
- TheList
of sortedFile
s to be merged.distinct
- Passtrue
if duplicate lines should be discarded.outputfile
- The outputFile
to merge the results to.cmp
- TheComparator
to use to compareString
s.cs
- TheCharset
to be used for the byte to character conversion.append
- Passtrue
if result should append toFile
instead of overwrite. Default to be false for overloading methods.usegzip
- assumes we used gzip compression for temporary files- Returns:
- The number of lines sorted.
- Throws:
java.io.IOException
- generic IO exception- Since:
- v0.1.4
-
sort
public static void sort(java.io.File input, java.io.File output) throws java.io.IOException
This sorts a file (input) to an output file (output) using default parameters- Parameters:
input
- source fileoutput
- output file- Throws:
java.io.IOException
- generic IO exception
-
sort
public static void sort(java.io.File input, java.io.File output, java.util.Comparator<java.lang.String> cmp) throws java.io.IOException
This sorts a file (input) to an output file (output) using customized comparator- Parameters:
input
- source fileoutput
- output filecmp
- TheComparator
to use to compareString
s.- Throws:
java.io.IOException
- generic IO exception
-
sortAndSave
public static java.io.File sortAndSave(java.util.List<java.lang.String> tmplist, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory) throws java.io.IOException
Sort a list and save it to a temporary file- Parameters:
tmplist
- data to be sortedcmp
- string comparatorcs
- charset to use for output (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)- Returns:
- the file containing the sorted data
- Throws:
java.io.IOException
- generic IO exception
-
sortAndSave
public static java.io.File sortAndSave(java.util.List<java.lang.String> tmplist, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, boolean usegzip, boolean parallel) throws java.io.IOException
Sort a list and save it to a temporary file- Parameters:
tmplist
- data to be sortedcmp
- string comparatorcs
- charset to use for output (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.usegzip
- set totrue
if you are using gzip compression for the temporary filesparallel
- set totrue
when sorting in parallel- Returns:
- the file containing the sorted data
- Throws:
java.io.IOException
- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.BufferedReader fbr, long datalength) throws java.io.IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
fbr
- data sourcedatalength
- estimated data volume (in bytes)- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException
- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.BufferedReader fbr, long datalength, java.util.Comparator<java.lang.String> cmp, boolean distinct) throws java.io.IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
fbr
- data sourcedatalength
- estimated data volume (in bytes)cmp
- string comparatordistinct
- Passtrue
if duplicate lines should be discarded.- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException
- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.BufferedReader fbr, long datalength, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, long maxMemory, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel) throws java.io.IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
fbr
- data sourcedatalength
- estimated data volume (in bytes)cmp
- string comparatormaxtmpfiles
- maximal number of temporary filesmaxMemory
- maximum amount of memory to use (in bytes)cs
- character set to use (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.numHeader
- number of lines to preclude before sorting startsusegzip
- use gzip compression for the temporary filesparallel
- sort in parallel- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException
- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file) throws java.io.IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
file
- some flat file- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException
- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp) throws java.io.IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
file
- some flat filecmp
- string comparator- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException
- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, boolean distinct) throws java.io.IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later.- Parameters:
file
- some flat filecmp
- string comparatordistinct
- Passtrue
if duplicate lines should be discarded.- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException
- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, java.io.File tmpdirectory, boolean distinct, int numHeader) throws java.io.IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file
- some flat filecmp
- string comparatortmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.numHeader
- number of lines to preclude before sorting starts- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException
- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct) throws java.io.IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file
- some flat filecmp
- string comparatormaxtmpfiles
- maximal number of temporary filescs
- character set to use (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException
- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader) throws java.io.IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file
- some flat filecmp
- string comparatorcs
- character set to use (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.numHeader
- number of lines to preclude before sorting starts- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException
- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader) throws java.io.IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file
- some flat filecmp
- string comparatormaxtmpfiles
- maximal number of temporary filescs
- character set to use (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.numHeader
- number of lines to preclude before sorting starts- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException
- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip) throws java.io.IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file
- some flat filecmp
- string comparatormaxtmpfiles
- maximal number of temporary filescs
- character set to use (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.numHeader
- number of lines to preclude before sorting startsusegzip
- use gzip compression for the temporary files- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException
- generic IO exception
-
sortInBatch
public static java.util.List<java.io.File> sortInBatch(java.io.File file, java.util.Comparator<java.lang.String> cmp, int maxtmpfiles, java.nio.charset.Charset cs, java.io.File tmpdirectory, boolean distinct, int numHeader, boolean usegzip, boolean parallel) throws java.io.IOException
This will simply load the file by blocks of lines, then sort them in-memory, and write the result to temporary files that have to be merged later. You can specify a bound on the number of temporary files that will be created.- Parameters:
file
- some flat filecmp
- string comparatormaxtmpfiles
- maximal number of temporary filescs
- character set to use (can use Charset.defaultCharset())tmpdirectory
- location of the temporary files (set to null for default location)distinct
- Passtrue
if duplicate lines should be discarded.numHeader
- number of lines to preclude before sorting startsusegzip
- use gzip compression for the temporary filesparallel
- whether to sort in parallel- Returns:
- a list of temporary flat files
- Throws:
java.io.IOException
- generic IO exception
-
-