Simple JSON Aggregator for Java
jaggr is on Bintray and Maven Central (Soon):
<dependency>
<groupId>com.caffinc</groupId>
<artifactId>jaggr</artifactId>
<version>0.5.0</version>
</dependency>
<dependency>
<groupId>com.caffinc</groupId>
<artifactId>jaggr-utils</artifactId>
<version>0.5.0</version>
</dependency>
Assume the following JSON documents are stored in a file called raw.json
:
{"_id": 1, "f": "a", "test": {"f": 3}}
{"_id": 2, "f": "a", "test": {"f": 2}}
{"_id": 3, "f": "a", "test": {"f": 1}}
{"_id": 4, "f": "a", "test": {"f": 5}}
{"_id": 5, "f": "a", "test": {"f": -1}}
{"_id": 6, "f": "b", "test": {"f": 1}}
{"_id": 7, "f": "b", "test": {"f": 1}}
{"_id": 8, "f": "b", "test": {"f": 1}}
{"_id": 9, "f": "b", "test": {"f": 1}}
{"_id": 10, "f": "b", "test": {"f": 1}}
Read it in using the JsonFileReader
in the jaggr-utils
module using:
List<Map<String, Object>> jsonList = JsonFileReader.readJsonFromFile("raw.json");
Now various aggregations can be defined using the AggregationBuilder
:
Aggregation aggregation = new AggregationBuilder()
.setGroupBy(field)
.addOperation("avg", new AverageOperation(avgField))
.addOperation("sum", new SumOperation(sumField))
.addOperation("min", new MinOperation(minField))
.addOperation("max", new MaxOperation(maxField))
.addOperation("count", new CountOperation())
.getAggregation();
Aggregation can now be performed using the aggregate()
method:
List<Map<String, Object>> result = aggregation.aggregate(jsonList);
Aggregation also supports Iterators:
List<Map<String, Object>> result = aggregation.aggregate(jsonList.iterator());
Aggregation actually works with any Iterable<Map<String, Object>>
too.
The result of the above aggregation would look as follows:
{"_id": "a", "avg": 2.0, "sum": 10, "min": -1, "max": 5, "count": 5}
{"_id": "b", "avg": 1.0, "sum": 5, "min": 1, "max": 1, "count": 5}
While aggregating files or Lists of JSON documents might be good for some use cases, not all data fits this paradigm.
There are three utilities in the jaggr-utils
library which can be used to aggregate other sources of data.
The JsonFileReader
class exposes the readJsonFromFile
and readJsonFromResource
methods which can be used to read in all the JSON objects from the file into memory for aggregation.
It is generally not a good idea to read in large files due to obvious reasons.
List<Map<String, Object>> jsonData = JsonFileReader.readJsonFromFile("afile.json");
List<Map<String, Object>> jsonData = JsonFileReader.readJsonFromResource("aFileInResources.json");
List<Map<String, Object>> result = aggregation.aggregate(iterator);
The JsonStringIterator
class provides constructors to iterate through a JSON file or a Reader
object pointing to an underlying JSON String source without loading all the data into memory.
Iterator<Map<String, Object>> iterator = new JsonStringIterator("afile.json");
Iterator<Map<String, Object>> iterator = new JsonStringIterator(new BufferedReader(new FileReader("afile.json")));
List<Map<String, Object>> result = aggregation.aggregate(iterator);
The JsonIterator
abstract class provides a way to convert an Iterator
from any type to JSON. This can be used to iterate through data coming from arbitrary databases. For example, MongoDB
provides Iterable
interfaces to the data. You could aggregate an entire collection as follows:
Iterator<Map<String, Object>> iterator = new JsonIterator<DBObject>(mongoCollection.find().iterator()) {
@Override
public Map<String, Object> toJson(DBObject element) {
return element.toMap();
}
};
List<Map<String, Object>> result = aggregation.aggregate(iterator);
Starting with version 0.4.0
, jaggr
supports aggregation of batches of data in a new class called BatchAggregation
. The following example shows BatchAggregation
in action:
Input Data:
{"_id": 1, "f": "a"}
{"_id": 2, "f": "a"}
{"_id": 3, "f": "a"}
{"_id": 4, "f": "a"}
{"_id": 5, "f": "a"}
{"_id": 6, "f": "b"}
{"_id": 7, "f": "b"}
{"_id": 8, "f": "b"}
{"_id": 9, "f": "b"}
{"_id": 10, "f": "b"}
Aggregation:
BatchAggregation aggregation = new AggregationBuilder()
.setGroupBy("f")
.addOperation("count", new CountOperation())
.getBatchAggregation();
aggregation.aggregateBatch(jsonData);
List<Map<String, Object>> result = aggregation.getFinalResult();
Result:
[
{"_id":"b","count":5},
{"_id":"a","count":5}
]
The aggregateBatch()
method can be called several times with more data. It can also be chained.
result = aggregation
.aggregateBatch(batch1)
.aggregateBatch(batch2)
.getFinalResult();
However the getFinalResult()
method must be called just once to get the final result of the aggregation, after which the BatchAggregation
object is reset. It can then be used to aggregate fresh batches of data.
jaggr
provides the following aggregations:
- Count
- Sum
- Minimum
- Maximum
- Average
- Collect as List
- Collect as Set
- First Object
- Last Object
- Standard Deviation (Population)
- Top N Objects
There are extensive tests for each of the aggregations which can be checked out in the https://github.com/caffinc/jaggr/blob/master/jaggr/jaggr/src/test file.
There are tests for the jaggr-utils module in https://github.com/caffinc/jaggr/blob/master/jaggr/jaggr-utils/src/test
These are not absolute, but are current (probably) as of 26th November, 2016. It should be trivial to upgrade or downgrade versions as required.
Both jaggr
and jaggr-utils
depend on junit
for tests:
<dependencies>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
</dependencies>
jaggr
does not have any other external dependencies, but has a test dependency on jaggr-utils
.
jaggr-utils
has the following dependencies:
<dependencies>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.6.2</version>
</dependency>
</dependencies>
If you face any issues trying to get this to work for you, shoot me an email: [email protected].
Good luck!