Pilosa Dev Kit - implementation tooling and use case examples are here!
Documentation is here: https://www.pilosa.com/docs/pdk/
- A running instance of Pilosa. See: https://www.pilosa.com/docs/getting-started/
- A recent version of Go.
- libpcap with development headers. On Ubuntu/Debian, you can install it using sudo apt install libpcap-dev.
We assume you are on a UNIX-like operating system. Otherwise, adapt the following instructions for your platform.
go get github.com/pilosa/pdk
cd $GOPATH/src/github.com/pilosa/pdk
make install
To get started immediately with the taxi use case, run this:
pdk taxi
This will create and fill an index called taxi, using the short URL list in usecase/taxi/urls-short.txt.
If you want to try out the full data set, run this:
pdk taxi -i taxi-big -f usecase/taxi/greenAndYellowUrls.txt
There are a number of other options you can tweak to affect the speed and memory usage of the import, or to point it at a remote Pilosa instance. Use pdk taxi -help to see all the options.
Note that this URL file represents more than a billion columns of data. Depending on your hardware, the import will probably take well over 3 hours and consume quite a bit of memory (and CPU). You can make a file with fewer URLs if you just want to get a sample.
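One way to build such a smaller URL file is to keep only the first few lines of the full list. The Go program below is a minimal sketch of that idea, not part of the PDK: the helper firstLines, the output name urls-sample.txt, and the sample size of 10 are all hypothetical choices made for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// firstLines returns at most n non-empty lines from text.
// It is a hypothetical helper for this example, not a PDK function.
func firstLines(text string, n int) []string {
	var out []string
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue // skip blank lines
		}
		if len(out) >= n {
			break
		}
		out = append(out, line)
	}
	return out
}

func main() {
	const src = "usecase/taxi/greenAndYellowUrls.txt" // full URL list named above
	const dst = "urls-sample.txt"                     // hypothetical output file name
	const keep = 10                                   // arbitrary sample size

	data, err := os.ReadFile(src)
	if err != nil {
		fmt.Fprintln(os.Stderr, "could not read URL list:", err)
		return
	}
	sample := firstLines(string(data), keep)
	content := strings.Join(sample, "\n") + "\n"
	if err := os.WriteFile(dst, []byte(content), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "could not write sample:", err)
		return
	}
	fmt.Printf("wrote %d URLs to %s\n", len(sample), dst)
}
```

Run from the root of the pdk repository, this writes the sample file, which can then be passed to the importer with the same flags shown above, for example pdk taxi -i taxi-sample -f urls-sample.txt (the index name taxi-sample is also just an example).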
After importing, you can try a few example queries at https://github.com/alanbernstein/pilosa-notebooks/blob/master/taxi-use-case.ipynb .
To get started immediately with network traffic capture, run this:
pdk net -i en0
which will capture traffic on the interface en0 (see available interfaces with ifconfig).
The Star Schema Benchmark (SSB) is a benchmark based on TPC-H but tweaked for a somewhat different use case. It has been implemented by some big data projects such as https://hortonworks.com/blog/sub-second-analytics-hive-druid/ .
To execute the Star Schema Benchmark with Pilosa, you must:
- Generate the SSB data at a particular scale factor.
- Import the data into Pilosa.
- Run the demo-ssb application, which has all of the SSB queries pre-written for convenience.
Use https://github.com/electrum/ssb-dbgen.git to generate the raw SSB data. This can be a bit finicky to work with - hit up @tgruben for tips (or maybe he'll update this section 😉).
When generating the data, you have to select a particular scale factor (SF). The size of the generated data will be about 600 MB * SF, so SF=100 will generate about 60 GB of data.
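To plan disk space before generating data, the rule of thumb above can be turned into a quick calculation. The following Go sketch only applies the approximate 600 MB per scale factor figure quoted in the text; the function name estimateSizeMB is made up for this example, and real output sizes will vary.

```go
package main

import "fmt"

// approxMBPerSF is the rough size per unit of scale factor quoted in the text.
const approxMBPerSF = 600

// estimateSizeMB returns the approximate raw SSB data size in MB
// for a given scale factor. Hypothetical helper, not part of the PDK.
func estimateSizeMB(sf int) int {
	return approxMBPerSF * sf
}

func main() {
	for _, sf := range []int{1, 10, 100} {
		mb := estimateSizeMB(sf)
		fmt.Printf("SF=%d: about %d MB (%.1f GB)\n", sf, mb, float64(mb)/1000)
	}
}
```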
Use pdk ssb to import the data into Pilosa. You must specify the directory containing the .tbl files generated in the first step, as well as the location of your Pilosa cluster. There are a few other options that may help import performance. See pdk ssb -h for more information.
The repository at https://github.com/pilosa/demo-ssb.git contains a small Go program that packages up the queries that make up the benchmark. Running demo-ssb starts a web server which executes queries against Pilosa on your behalf. For example, you can run curl localhost:8000/query/1.1 to run an SSB query.
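If you would rather script the queries than call curl by hand, a small Go client can do the same request and time it. This is a sketch under assumptions: it relies only on the /query/1.1 path and the localhost:8000 address from the curl example, treats the response as opaque text because its format is not described here, and uses a 10 minute timeout chosen arbitrarily. The helpers queryURL and runQuery are hypothetical names, and any other query IDs you pass on the command line are assumed to follow the same URL pattern.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

// queryURL builds the URL for one SSB query, e.g. base + "/query/1.1".
func queryURL(base, id string) string {
	return strings.TrimRight(base, "/") + "/query/" + id
}

// runQuery issues a GET request and returns the raw response body
// together with the time the request took.
func runQuery(client *http.Client, url string) (string, time.Duration, error) {
	start := time.Now()
	resp, err := client.Get(url)
	if err != nil {
		return "", 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", 0, err
	}
	if resp.StatusCode != http.StatusOK {
		return "", 0, fmt.Errorf("%s: unexpected status %s", url, resp.Status)
	}
	return string(body), time.Since(start), nil
}

func main() {
	const base = "http://localhost:8000" // address from the curl example

	// Query IDs come from the command line; default to the one shown above.
	ids := os.Args[1:]
	if len(ids) == 0 {
		ids = []string{"1.1"}
	}

	client := &http.Client{Timeout: 10 * time.Minute} // arbitrary timeout
	for _, id := range ids {
		body, elapsed, err := runQuery(client, queryURL(base, id))
		if err != nil {
			fmt.Fprintln(os.Stderr, "query", id, "failed:", err)
			continue
		}
		fmt.Printf("query %s took %s:\n%s\n", id, elapsed, body)
	}
}
```

With demo-ssb running, go run main.go 1.1 behaves like the curl command above and additionally reports how long the request took.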