Java Store and compare huge key value pair maps on disk

I have a scenario in which i have to compare 2 Maps. I have to take a keys of the source Map and iterate through the target Map for the key for equals and then compare their values. The problems is that these Maps are supposed to hold very high(>=10,000,000) volume of records. So i cannot keep these maps in-memory.

Possible solutions:

Storing both as text files as "key=value"

Problems:

We have to iterate through the target Map text file for every key value in the source which is ineffective and time consuming.

Possible solutions: Storing both as text files as "key=value" and creating an index of id=>line number for target

Problems:

No effective method to directly read a line based on the line number from a big text file. Some methods use api's for Java 1.8 and they again need the file to be loaded in-memory

Possible solutions:

Storing both in a database

Problems:

In this case, we have to query the databases for every single key value. If we have 1 million keys in source and target, We have to query 1 million times. Ineffective and time consuming

Possible solutions:

Using mapDB

Problems:

Tried this, but it failed after 260,000 records. It gave an Writer thread failed exception mainly because i was using a 32bit JVM. So i want to write the implementation myself rather than relying on MapDB.

How do i effectively store/retrieve and compare the key value maps so that it wont hit the performance much when i am doing the comparison. At any time i cannot bring any thing in-memory because it will give Out of Memory exception. The solution should read and writes to the disk, not in-memory - I don't have 12 GB of RAM. Also the solution should work for 32/64 bit systems

Jon Skeet
people
quotationmark

One reasonably simple option:

  • Put the maps on disk
  • Sort them (in any of a number of ways)
  • Open a reader for each map, and read the first line
  • Iterate over the lines of each map, deciding which map to read from based on a comparison of "current" keys.

That way you only need two key/value pairs in memory at any one time.

For example, suppose we have a "source" map with keys A, B, H, I and a "target" map with keys B, C, I, J. The process would look like this:

  • Read source=A, target=B
  • Compare: source before target, so read another line from source (B)
  • Compare: source and target keys are equal, so process the entry for B, and read another line from each (source=H, target=C)
  • Compare: target before source, so read another line from target (I)
  • Compare: source before target, so read another line from source (I)
  • Compare: source and target keys are equal, so process the entry for I, and read another line from each (source=empty, target=J)
  • As we've run out of data from source, we're done

people

See more on this question at Stackoverflow