-
Notifications
You must be signed in to change notification settings - Fork 65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Persist node and edge indexes in partition header #613
base: master
Are you sure you want to change the base?
Changes from 6 commits
44d14d0
28bd5ec
c848444
a6fcbe6
733d8d0
d62edf7
a85f245
ff38a7e
679d473
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -219,6 +219,8 @@ katana::PropertyGraph::Make( | |
katana::GraphTopology topo = | ||
KATANA_CHECKED(MapTopology(rdg.topology_file_storage())); | ||
|
||
std::unique_ptr<katana::PropertyGraph> property_graph; | ||
|
||
if (rdg.IsEntityTypeIDsOutsideProperties()) { | ||
KATANA_LOG_DEBUG("loading EntityType data from outside properties"); | ||
|
||
|
@@ -236,24 +238,26 @@ katana::PropertyGraph::Make( | |
EntityTypeManager edge_type_manager = | ||
KATANA_CHECKED(rdg.edge_entity_type_manager()); | ||
|
||
return std::make_unique<PropertyGraph>( | ||
property_graph = std::make_unique<PropertyGraph>( | ||
std::move(rdg_file), std::move(rdg), std::move(topo), | ||
std::move(node_type_ids), std::move(edge_type_ids), | ||
std::move(node_type_manager), std::move(edge_type_manager)); | ||
|
||
} else { | ||
// we must construct id_arrays and managers from properties | ||
|
||
auto pg = std::make_unique<PropertyGraph>( | ||
property_graph = std::make_unique<PropertyGraph>( | ||
std::move(rdg_file), std::move(rdg), std::move(topo), | ||
MakeDefaultEntityTypeIDArray(topo.num_nodes()), | ||
MakeDefaultEntityTypeIDArray(topo.num_edges()), EntityTypeManager{}, | ||
EntityTypeManager{}); | ||
|
||
KATANA_CHECKED(pg->ConstructEntityTypeIDs()); | ||
|
||
return MakeResult(std::move(pg)); | ||
KATANA_CHECKED(property_graph->ConstructEntityTypeIDs()); | ||
} | ||
|
||
KATANA_CHECKED(property_graph->RecreatePropertyIndexes()); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: Do we want to put this behind a config option - might make it easier to debug/test. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That sounds like a great idea - what do we use for that kind of config? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thats a great question - currently we have at least a couple of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The two main option structures are I did intend to ask @ddn0 if there is a reason to avoid a per-thread ambient options structure, like we have for the ProgressTracer object. I think Galois used to have this for stats, but something about it was bad. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I actually prefer explicitly passing things around over global variables (or thread-local variables). (old) Galois stats are bad because it predates and does not interop with tracing (and future metrics). I think tracing, logging and metrics is a little different than configuration for loading files. In the former, the policy and behavior is supposed to orthogonal to the code being instrumented. |
||
|
||
return MakeResult(std::move(property_graph)); | ||
} | ||
|
||
katana::Result<std::unique_ptr<katana::PropertyGraph>> | ||
|
@@ -468,6 +472,19 @@ katana::PropertyGraph::DoWrite( | |
? KATANA_CHECKED(WriteEntityTypeIDsArray(edge_entity_type_ids_)) | ||
: nullptr; | ||
|
||
// Update lists of node and edge index columns. | ||
std::vector<std::string> node_index_columns(node_indexes_.size()); | ||
std::transform( | ||
node_indexes_.begin(), node_indexes_.end(), node_index_columns.begin(), | ||
[](const auto& index) { return index->column_name(); }); | ||
rdg_.set_node_property_index_columns(std::move(node_index_columns)); | ||
|
||
std::vector<std::string> edge_index_columns(edge_indexes_.size()); | ||
std::transform( | ||
edge_indexes_.begin(), edge_indexes_.end(), edge_index_columns.begin(), | ||
[](const auto& index) { return index->column_name(); }); | ||
rdg_.set_edge_property_index_columns(std::move(edge_index_columns)); | ||
|
||
return rdg_.Store( | ||
handle, command_line, versioning_action, std::move(topology_res), | ||
std::move(node_entity_type_id_array_res), | ||
|
@@ -1289,3 +1306,16 @@ katana::PropertyGraph::GetNodePropertyIndex( | |
} | ||
return KATANA_ERROR(katana::ErrorCode::NotFound, "node index not found"); | ||
} | ||
|
||
katana::Result<void> | ||
katana::PropertyGraph::RecreatePropertyIndexes() { | ||
for (const std::string& column_name : rdg_.node_property_index_columns()) { | ||
KATANA_CHECKED(MakeNodeIndex(column_name)); | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Are we creating property indices for all properties, or only properties loaded into memory? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Currently a request to make an index only looks at the PropertyGraph, so will fail unless the property is loaded into memory. I had intended to defer any questions of in/out-of-core properties, however it would be simple here to only try to make the index if the column is loaded (otherwise this will fail the entire graph load) - I'll make that change. There will still exist the bugaboo which sounds like something you were describing a while ago - we load indexes only for in-core properties, then only persist those, losing indexes for out-of-core properties. That I would like to defer for this change - having a basic index flow in place, as well as me understanding more, will help do that correctly in the future. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, the in-core/out-of-core logic does get tricky. I like your plan--only make the index if the column is loaded. For the grand and glorious future, you can look at |
||
|
||
for (const std::string& column_name : rdg_.edge_property_index_columns()) { | ||
KATANA_CHECKED(MakeEdgeIndex(column_name)); | ||
} | ||
|
||
return katana::ResultSuccess(); | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we want to check the property type to ensure that we create indexes for types supported. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think that MakeNodeIndex should do the right thing and return an error for an unsupported type (if not it's a bug). The larger question is what to do when that happens, or some other error in index-building. Should we try to create all of the indexes possible and proceed, or treat any failure as a failure in Make? At this point I think I'd prefer the former since a problem will manifest as a clear failure to load the graph, rather than just future slowness. Thoughts? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I encountered some issues when i was testing the code. For instance, a column of type There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Crash or return an error? If it's crashing then that's bad, and I'd like to fix that. But returning an error here seems like the right thing to do when asking to index a column for an as-yet-unsupported data type, or a column that does not exist. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, we can leave it as return an error for now, given that we should be checking for validity when the user creates an index (or we do). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,7 @@ | ||
#include <arrow/api.h> | ||
#include <arrow/type.h> | ||
#include <arrow/type_traits.h> | ||
#include <boost/filesystem.hpp> | ||
|
||
#include "TestTypedPropertyGraph.h" | ||
#include "katana/Logging.h" | ||
|
@@ -11,8 +12,12 @@ template <typename node_or_edge> | |
struct NodeOrEdge { | ||
static katana::Result<katana::PropertyIndex<node_or_edge>*> MakeIndex( | ||
katana::PropertyGraph* pg, const std::string& column_name); | ||
static katana::Result<katana::PropertyIndex<node_or_edge>*> GetIndex( | ||
katana::PropertyGraph* pg, const std::string& column_name); | ||
static katana::Result<void> AddProperties( | ||
katana::PropertyGraph* pg, std::shared_ptr<arrow::Table> properties); | ||
static std::shared_ptr<arrow::Array> GetProperty( | ||
katana::PropertyGraph* pg, const std::string& column_name); | ||
static size_t num_entities(katana::PropertyGraph* pg); | ||
}; | ||
|
||
|
@@ -21,12 +26,7 @@ using Edge = NodeOrEdge<katana::GraphTopology::Edge>; | |
|
||
template <> | ||
katana::Result<katana::PropertyIndex<katana::GraphTopology::Node>*> | ||
Comment on lines
27
to
28
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You need these empty template declarations to make the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think these are required for template specializations like these, unless I'm mistaken. I think it would be hard to do this with normal overloading, especially since it's.just the return type. I'm totally open to something else, this was just the limit of my imagination. |
||
Node::MakeIndex(katana::PropertyGraph* pg, const std::string& column_name) { | ||
auto result = pg->MakeNodeIndex(column_name); | ||
if (!result) { | ||
return result.error(); | ||
} | ||
|
||
Node::GetIndex(katana::PropertyGraph* pg, const std::string& column_name) { | ||
for (const auto& index : pg->node_indexes()) { | ||
if (index->column_name() == column_name) { | ||
return index.get(); | ||
|
@@ -37,13 +37,15 @@ Node::MakeIndex(katana::PropertyGraph* pg, const std::string& column_name) { | |
} | ||
|
||
template <> | ||
katana::Result<katana::PropertyIndex<katana::GraphTopology::Edge>*> | ||
Edge::MakeIndex(katana::PropertyGraph* pg, const std::string& column_name) { | ||
auto result = pg->MakeEdgeIndex(column_name); | ||
if (!result) { | ||
return result.error(); | ||
} | ||
katana::Result<katana::PropertyIndex<katana::GraphTopology::Node>*> | ||
Node::MakeIndex(katana::PropertyGraph* pg, const std::string& column_name) { | ||
KATANA_CHECKED(pg->MakeNodeIndex(column_name)); | ||
return Node::GetIndex(pg, column_name); | ||
} | ||
|
||
template <> | ||
katana::Result<katana::PropertyIndex<katana::GraphTopology::Edge>*> | ||
Edge::GetIndex(katana::PropertyGraph* pg, const std::string& column_name) { | ||
for (const auto& index : pg->edge_indexes()) { | ||
if (index->column_name() == column_name) { | ||
return index.get(); | ||
|
@@ -53,6 +55,13 @@ Edge::MakeIndex(katana::PropertyGraph* pg, const std::string& column_name) { | |
return KATANA_ERROR(katana::ErrorCode::NotFound, "Created index not found"); | ||
} | ||
|
||
template <> | ||
katana::Result<katana::PropertyIndex<katana::GraphTopology::Edge>*> | ||
Edge::MakeIndex(katana::PropertyGraph* pg, const std::string& column_name) { | ||
KATANA_CHECKED(pg->MakeEdgeIndex(column_name)); | ||
return Edge::GetIndex(pg, column_name); | ||
} | ||
|
||
template <> | ||
size_t | ||
Node::num_entities(katana::PropertyGraph* pg) { | ||
|
@@ -79,6 +88,22 @@ Edge::AddProperties( | |
return pg->AddEdgeProperties(properties); | ||
} | ||
|
||
template <> | ||
std::shared_ptr<arrow::Array> | ||
Node::GetProperty(katana::PropertyGraph* pg, const std::string& column_name) { | ||
auto prop_result = pg->GetNodeProperty(column_name); | ||
KATANA_LOG_ASSERT(prop_result); | ||
return prop_result.value()->chunk(0); | ||
} | ||
|
||
template <> | ||
std::shared_ptr<arrow::Array> | ||
Edge::GetProperty(katana::PropertyGraph* pg, const std::string& column_name) { | ||
auto prop_result = pg->GetEdgeProperty(column_name); | ||
KATANA_LOG_ASSERT(prop_result); | ||
return prop_result.value()->chunk(0); | ||
} | ||
|
||
template <typename c_type> | ||
std::shared_ptr<arrow::Table> | ||
CreatePrimitiveProperty( | ||
|
@@ -200,11 +225,8 @@ TestPrimitiveIndex(size_t num_nodes, size_t line_width) { | |
} | ||
|
||
template <typename node_or_edge> | ||
void | ||
TestStringIndex(size_t num_nodes, size_t line_width) { | ||
using IndexType = katana::StringPropertyIndex<node_or_edge>; | ||
using ArrayType = arrow::LargeStringArray; | ||
|
||
std::unique_ptr<katana::PropertyGraph> | ||
MakeStringGraph(size_t num_nodes, size_t line_width) { | ||
LinePolicy policy{line_width}; | ||
|
||
std::unique_ptr<katana::PropertyGraph> g = | ||
|
@@ -230,6 +252,32 @@ TestStringIndex(size_t num_nodes, size_t line_width) { | |
nonuniform_index_result, "Could not create index: {}", | ||
nonuniform_index_result.error()); | ||
|
||
return g; | ||
} | ||
|
||
template <typename node_or_edge> | ||
std::unique_ptr<katana::PropertyGraph> | ||
TestStringIndex( | ||
std::unique_ptr<katana::PropertyGraph> g, size_t num_nodes, | ||
size_t line_width) { | ||
using IndexType = katana::StringPropertyIndex<node_or_edge>; | ||
using ArrayType = arrow::LargeStringArray; | ||
|
||
if (!g) { | ||
g = MakeStringGraph<node_or_edge>(num_nodes, line_width); | ||
} | ||
|
||
auto uniform_index_result = | ||
NodeOrEdge<node_or_edge>::GetIndex(g.get(), "uniform"); | ||
KATANA_LOG_VASSERT( | ||
uniform_index_result, "Could not get index: {}", | ||
uniform_index_result.error()); | ||
auto nonuniform_index_result = | ||
NodeOrEdge<node_or_edge>::GetIndex(g.get(), "nonuniform"); | ||
KATANA_LOG_VASSERT( | ||
nonuniform_index_result, "Could not get index: {}", | ||
nonuniform_index_result.error()); | ||
|
||
auto* uniform_index = static_cast<IndexType*>(uniform_index_result.value()); | ||
auto* nonuniform_index = | ||
static_cast<IndexType*>(nonuniform_index_result.value()); | ||
|
@@ -253,8 +301,8 @@ TestStringIndex(size_t num_nodes, size_t line_width) { | |
} | ||
|
||
// The non-uniform index starts at "aaaa" and increases by 2. | ||
auto typed_prop = | ||
std::static_pointer_cast<ArrayType>(nonuniform_prop->column(0)->chunk(0)); | ||
auto typed_prop = std::static_pointer_cast<ArrayType>( | ||
NodeOrEdge<node_or_edge>::GetProperty(g.get(), "nonuniform")); | ||
it = nonuniform_index->Find("aaaj"); | ||
KATANA_LOG_ASSERT(it == nonuniform_index->end()); | ||
it = nonuniform_index->LowerBound("aaaj"); | ||
|
@@ -263,6 +311,31 @@ TestStringIndex(size_t num_nodes, size_t line_width) { | |
it = nonuniform_index->UpperBound("aaak"); | ||
KATANA_LOG_ASSERT(it != nonuniform_index->end()); | ||
KATANA_LOG_ASSERT(typed_prop->GetView(*it) == "aaam"); | ||
|
||
return g; | ||
} | ||
|
||
std::unique_ptr<katana::PropertyGraph> | ||
ReloadGraph(std::unique_ptr<katana::PropertyGraph> g) { | ||
auto uri_res = katana::Uri::MakeRand("/tmp/propertyfilegraph"); | ||
KATANA_LOG_ASSERT(uri_res); | ||
std::string rdg_dir(uri_res.value().path()); | ||
|
||
auto write_result = g->Write(rdg_dir, "test command line"); | ||
|
||
if (!write_result) { | ||
boost::filesystem::remove_all(rdg_dir); | ||
KATANA_LOG_FATAL("writing result: {}", write_result.error()); | ||
} | ||
|
||
katana::Result<std::unique_ptr<katana::PropertyGraph>> make_result = | ||
katana::PropertyGraph::Make(rdg_dir, tsuba::RDGLoadOptions()); | ||
boost::filesystem::remove_all(rdg_dir); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not crazy about our dependence on There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fair, I'll also say that in tests I tend to be much more permissive for my own coding style and grabbing arbitrary dependencies to just get the job done. |
||
if (!make_result) { | ||
KATANA_LOG_FATAL("making result: {}", make_result.error()); | ||
} | ||
|
||
return std::move(make_result.value()); | ||
} | ||
|
||
int | ||
|
@@ -274,8 +347,14 @@ main() { | |
TestPrimitiveIndex<katana::GraphTopology::Node, double_t>(10, 3); | ||
TestPrimitiveIndex<katana::GraphTopology::Edge, double_t>(10, 3); | ||
|
||
TestStringIndex<katana::GraphTopology::Node>(10, 3); | ||
TestStringIndex<katana::GraphTopology::Edge>(10, 3); | ||
auto node_g = TestStringIndex<katana::GraphTopology::Node>(nullptr, 10, 3); | ||
auto edge_g = TestStringIndex<katana::GraphTopology::Edge>(nullptr, 10, 3); | ||
|
||
node_g = ReloadGraph(std::move(node_g)); | ||
edge_g = ReloadGraph(std::move(edge_g)); | ||
|
||
TestStringIndex<katana::GraphTopology::Node>(std::move(node_g), 10, 3); | ||
TestStringIndex<katana::GraphTopology::Edge>(std::move(edge_g), 10, 3); | ||
|
||
return 0; | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -33,6 +33,9 @@ const char* kEdgeEntityTypeIDDictionaryKey = | |
const char* kNodeEntityTypeIDNameKey = "kg.v1.node_entity_type_id_name"; | ||
// Name maps from Atomic Edge Entity Type ID to set of string names for the Edge Entity Type ID | ||
const char* kEdgeEntityTypeIDNameKey = "kg.v1.edge_entity_type_id_name"; | ||
// List of node and edge indexed columns | ||
const char* kNodePropertyIndexColumnsKey = "kg.v1.node_property_index_columns"; | ||
const char* kEdgePropertyIndexColumnsKey = "kg.v1.edge_property_index_columns"; | ||
Comment on lines
+37
to
+38
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hey @tylershunt, given that these key names seem to have no bearing on the RDG version, is it wise to keep calling them v1? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. These are versions of the key namespace. I don't see a reason to change them at this time, we're just adding keys. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Everything is an IDspace to you these days. It makes me feel a bit better about it though. |
||
|
||
// | ||
//constexpr std::string_view mirror_nodes_prop_name = "mirror_nodes"; | ||
|
@@ -288,6 +291,8 @@ tsuba::to_json(json& j, const tsuba::RDGPartHeader& header) { | |
{kEdgeEntityTypeIDDictionaryKey, header.edge_entity_type_id_dictionary_}, | ||
{kNodeEntityTypeIDNameKey, header.node_entity_type_id_name_}, | ||
{kEdgeEntityTypeIDNameKey, header.edge_entity_type_id_name_}, | ||
{kNodePropertyIndexColumnsKey, header.node_property_index_columns_}, | ||
{kEdgePropertyIndexColumnsKey, header.edge_property_index_columns_}, | ||
}; | ||
} | ||
|
||
|
@@ -319,6 +324,16 @@ tsuba::from_json(const json& j, tsuba::RDGPartHeader& header) { | |
j.at(kNodeEntityTypeIDNameKey).get_to(header.node_entity_type_id_name_); | ||
j.at(kEdgeEntityTypeIDNameKey).get_to(header.edge_entity_type_id_name_); | ||
} | ||
|
||
header.node_property_index_columns_ = {}; | ||
if (auto it = j.find(kNodePropertyIndexColumnsKey); it != j.end()) { | ||
it->get_to(header.node_property_index_columns_); | ||
} | ||
|
||
header.edge_property_index_columns_ = {}; | ||
if (auto it = j.find(kEdgePropertyIndexColumnsKey); it != j.end()) { | ||
it->get_to(header.edge_property_index_columns_); | ||
} | ||
} | ||
|
||
void | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indexes or indices?
More substantively, our serialization format for indexes is json??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been using indexes, both are grammatically correct. The top search result suggests 'indices' are used in the mathematical sense, while 'indexes' for e.g. written documents, and I think this use is from the idea of a written index of a document. As an added bonus, text search for 'index' will find 'indexes'.
Currently we're not persisting anything about the contents of the index, just the metadata - list of columns indexed. If we have an index which has some representation in storage that helps build it faster than just rebuilding from the property, then yes json would be silly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd maybe say index names from json in the comment.