This repository implements krowkee, a toolkit for scalably and efficiently
summarizing many data streams in distributed memory.
krowkee is intended for applications where one needs to summarize huge,
loosely structured data, such as matrices or graphs, whose individual
components (rows/columns or vertex adjacency information) are impractical
to store and directly inspect.
krowkee ingests these objects as data streams - unstructured, arbitrarily
ordered lists of updates - and accumulates summaries thereof in the form of data
sketches.
Although there are many types of data sketches, the practical varieties encompassed by krowkee have many advantages compared to directly observing data:
- approximation guarantees on some stream statistic
- merge operator
- worst-case logarithmic memory usage
- ... and many even support sparse storage
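To make the list above concrete, here is a minimal sketch of a Count-Min sketch, a classic mergeable frequency summary. It is an illustration only: the class and its method names are hypothetical and are not part of krowkee, which does not currently ship this sketch type. It shows a one-sided approximation guarantee (estimates never undercount) and a merge operator that gives the same result as sketching the concatenated stream.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical illustration only: this is not a krowkee type.
class CountMinSketch {
 public:
  CountMinSketch(std::size_t depth, std::size_t width, std::uint64_t seed)
      : depth_(depth), width_(width), seed_(seed), table_(depth * width, 0) {}

  // Apply one stream update: `key` was observed `count` more times.
  void update(std::uint64_t key, std::uint64_t count = 1) {
    for (std::size_t row = 0; row < depth_; ++row) {
      table_[row * width_ + bucket(row, key)] += count;
    }
  }

  // Minimum over rows. Hash collisions can only inflate a counter, so the
  // estimate is never smaller than the true count.
  std::uint64_t estimate(std::uint64_t key) const {
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
    for (std::size_t row = 0; row < depth_; ++row) {
      best = std::min(best, table_[row * width_ + bucket(row, key)]);
    }
    return best;
  }

  // Merge operator: element-wise addition. Both sketches must share the
  // same shape and seed so that they use identical hash functions.
  void merge(const CountMinSketch& other) {
    assert(depth_ == other.depth_ && width_ == other.width_ &&
           seed_ == other.seed_);
    for (std::size_t i = 0; i < table_.size(); ++i) {
      table_[i] += other.table_[i];
    }
  }

  bool operator==(const CountMinSketch& other) const {
    return depth_ == other.depth_ && width_ == other.width_ &&
           seed_ == other.seed_ && table_ == other.table_;
  }

 private:
  // splitmix64 finalizer, used here as a simple mixing hash.
  static std::uint64_t mix(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
  }

  std::size_t bucket(std::size_t row, std::uint64_t key) const {
    return static_cast<std::size_t>(mix(key ^ mix(seed_ + row)) % width_);
  }

  std::size_t depth_;
  std::size_t width_;
  std::uint64_t seed_;
  std::vector<std::uint64_t> table_;
};
```

Because updates are simple additions, the order in which they arrive does not matter, and two workers can sketch disjoint parts of a stream independently and merge afterwards. Memory is fixed at depth x width counters regardless of stream length.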
krowkee currently supports only sparse subspace embeddings to perform
randomized and fast dimensionality reduction.
Future sketch support is planned, principally including cardinality sketches
such as the HyperLogLog.
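As a rough illustration of the idea, the following sketch implements one common sparse embedding construction (the CountSketch transform), in which each input coordinate is hashed to a single output bucket with a random sign. This is an assumption made for the example: all names below are hypothetical, and krowkee's actual transforms and API may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only; these names are not krowkee's API.
// One stream update: x[index] += delta.
struct Update {
  std::uint64_t index;
  double delta;
};

class SparseEmbedding {
 public:
  SparseEmbedding(std::size_t sketch_dim, std::uint64_t seed)
      : dim_(sketch_dim), seed_(seed) {}

  // Fold one update into a sketch vector of length dim_. Each input
  // coordinate touches exactly one bucket, with a pseudo-random sign.
  void apply(std::vector<double>& sketch, const Update& u) const {
    const std::uint64_t h = mix(u.index ^ mix(seed_));
    const std::size_t bucket = static_cast<std::size_t>((h >> 1) % dim_);
    const double sign = (h & 1u) ? 1.0 : -1.0;
    sketch[bucket] += sign * u.delta;
  }

  // Embed a whole stream of updates into dim_ dimensions.
  std::vector<double> embed(const std::vector<Update>& stream) const {
    std::vector<double> sketch(dim_, 0.0);
    for (const Update& u : stream) {
      apply(sketch, u);
    }
    return sketch;
  }

 private:
  // splitmix64 finalizer, used here as a simple mixing hash.
  static std::uint64_t mix(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
  }

  std::size_t dim_;
  std::uint64_t seed_;
};

// The embedding is linear, so sketches built with the same SparseEmbedding
// merge by element-wise addition.
std::vector<double> merge_sketches(const std::vector<double>& a,
                                   const std::vector<double>& b) {
  assert(a.size() == b.size());
  std::vector<double> out(a);
  for (std::size_t i = 0; i < out.size(); ++i) {
    out[i] += b[i];
  }
  return out;
}
```

The input dimension never appears in the code: only the indices that actually occur in the stream are touched, which is what makes this kind of transform fast on sparse data. With truly random hashes and signs, the squared norm of the sketch equals the squared norm of the input in expectation.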
- C++17 - GCC versions 8-11 are tested. Your mileage may vary with other compilers.
- Optional dependencies:
  - YGM 0.3 or greater for distributed memory communication. Toggle with CMake option KROWKEE_USE_YGM. Default ON. If package ygm is not installed, krowkee will attempt to clone and install via FetchContent. Includes additional dependencies:
    - Cereal - C++ serialization library. If package cereal is not installed, krowkee will attempt FetchContent.
    - MPI
  - Boost 1.48 or greater for boost::container::flat_map. Toggle with CMake option KROWKEE_USE_BOOST. Default ON.
krowkee is a header-only library that is simple to incorporate into dependent
projects using CMake.
Add the following to your CMakeLists.txt to "find-else-fetch" krowkee,
cloning it and its dependencies and preparing their headers for installation as
a part of your project.
include(FetchContent)  # required before any FetchContent_* call

set(DESIRED_KROWKEE_VERSION 0.1)
find_package(krowkee ${DESIRED_KROWKEE_VERSION} CONFIG)
if (NOT krowkee_FOUND)
    FetchContent_Declare(
        krowkee
        GIT_REPOSITORY https://github.com/LLNL/krowkee
        GIT_TAG v${DESIRED_KROWKEE_VERSION}
    )
    FetchContent_GetProperties(krowkee)
    if (krowkee_POPULATED)
        message(STATUS "Found already populated krowkee dependency: "
                ${krowkee_SOURCE_DIR}
        )
    else ()
        set(JUST_INSTALL_KROWKEE ON)
        set(KROWKEE_INSTALL ON)
        set(KROWKEE_USE_YGM ON) # or OFF if local-only
        FetchContent_Populate(krowkee)
        add_subdirectory(${krowkee_SOURCE_DIR} ${krowkee_BINARY_DIR})
        message(STATUS "Cloned krowkee dependency " ${krowkee_SOURCE_DIR})
    endif ()
else ()
    message(STATUS "Found installed krowkee dependency " ${krowkee_DIR})
endif ()
These instructions assume that you have a relatively modern C++ compiler
(C++17 required, only tested using GCC).
If included, krowkee's CMake build makes use of find-else-fetch semantics for
its ygm and cereal dependencies.
krowkee will try to find local installations of the libraries, and will clone
and link the repositories internally if none are found.
One can build a local-only version of krowkee by passing
-DKROWKEE_USE_YGM=OFF to cmake.
spack is a convenient means to include cereal and manage compilers, but
is not required to build krowkee.
Clone the project and make the build directory
$ git clone https://github.com/LLNL/krowkee.git
$ mkdir krowkee/build
$ cd krowkee/build
Option 1: use spack
$ spack load gcc # tested at >=8.3.1.
$ spack load boost # optional, at least version 1.75
$ spack load cereal # optional, at least version 1.3.0
Option 2: use module
$ module load gcc/8.3.1 # or desired gcc version
Build krowkee.
$ cmake ..
$ make
Alternately, we can build the local-only version of krowkee, which will not
use ygm, cereal, or MPI.
$ cmake .. -DKROWKEE_USE_YGM=OFF
$ make
krowkee's build system supports local installation of itself and its
header-only dependencies (ygm and cereal) via make install.
If operating on a system without write permissions to CMake's default
installation locations, set the CMAKE_INSTALL_PREFIX:
$ export CMAKE_USER_INSTALL_PREFIX='/desired/install/path'
$ cmake .. -DCMAKE_INSTALL_PREFIX=${CMAKE_USER_INSTALL_PREFIX}
$ make
$ make install
It is most likely worth setting this CMAKE_USER_INSTALL_PREFIX and adding it
to your CMAKE_PREFIX_PATH in your .profile, .bashrc, or equivalent.
Once krowkee is built, run all test cases with
$ make test
Alternately, one can directly run individual test cases with more options and verbose outputs, e.g.
$ ./test/SEQ_local_linearsketch_test
All tests support an -h flag listing options.
- Benjamin W. Priest (priest2 at llnl dot gov)
- Alec Dunton (dunton1 at llnl dot gov)
krowkee is distributed under the MIT license.
All new contributions must be made under the MIT license.
See LICENSE-MIT, NOTICE, and COPYRIGHT for details.
SPDX-License-Identifier: MIT
LLNL-CODE-827987