-
Notifications
You must be signed in to change notification settings - Fork 183
[WIP] Lighter graphs with edge iterators #163
Conversation
Current coverage is
|
end | ||
|
||
edge_it(g::Graph) = EdgeIt(ne(g), g.fadjlist, EdgeItState(1,1,1), false) | ||
edge_it(g::DiGraph) = EdgeIt(ne(g), g.fadjlist, EdgeItState(1,1,1), true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't this allocate a ton of memory from a copy of g.fadjlist
? Test memory allocation on some graph with 10^9 edges and see what happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not copying, adj points to the same memory of g.fadjlist
This looks really interesting. My go-to benchmark is
|
Edited: I'm not sure how close we are to merging sparsemx; something like this might make more sense to incorporate since it appears to have a significant memory benefit. |
n = length(eit.adj) | ||
!(1 <= s <= n) && return false | ||
!(1 <= t <= n) && return false | ||
return t in eit.adj[s] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is O(n) and will be a performance bottleneck. The reason we went with edge sets is so we can do membership tests in O(1).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is O(deg(s)), that is O(1) in sparse graphs and O(n) in dense graphs. Since I usually work with sparse graphs I was not considering the latter case.
Here there is a trade-off beetween saving 2*m*64bit
space ( 3*m*64bit
if we also remove g.badjlist in Graph) and going from O(1) to O(n) in edge existence check in dense graphs. How common is the in
operation? which of the algorithms uses it?
In case maybe we can add this modified Graph as another type in LightGraphs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're quite right about the tradeoffs. It's one of the reasons there are so many different ways to represent graphs. :)
Most of my work is in sparse graphs as well, so it won't affect ME too much, but I don't want it to be a surprise for anyone else. If you find calls to has_edge()
that's probably where you'll get tripped up.
Using master, the test
in ten minutes ate all my memory (8Gb) and I had to kill it. Is this normal or there is some memory leak? If it's normal, which less memory expensive other test can I try? |
Oh, that's normal. Sorry... betweenness is a memory hog. I'd try with a subgraph. try We want a test that runs for at least 5 minutes. though, to really detect a performance difference. |
A small test looks promising.
after this commit
I'll try with a larger graph |
Wow - that DOES look promising. I wonder if we need to tweak the readgraph. Could you also check out lots of |
@CarloLucibello one thing to benchmark if you haven't already is induced subgraph:
especially with the latest master, where we've made significant improvements (see #173). |
From a design perspective:I think it is a good idea to remove the If some algorithm actually needs We can also use binary search to reduce From a performance perspective:ConstructorsWe need to make sure that constructing the graph isn't slower. IIRC, the majority of time in graph construction goes to inserting the edges into the Equality testing.There is this warning in the patch.
Can this be replaced by a short circuiting operation? function ==(g::Digraph, h::Digraph)
vertices(g) == vertices(h) || return false
ne(g) == ne(h) || return false
hedges = edge_it(h)
gedges = edge_it(g)
hstate = start(hedges)
gstate = start(gedges)
for i in 1:ne(g)
a, gstate = next(gedges, gstate)
b, hstate = next(hedges, hstate)
a == b || return false
end
return true
end Something like that should work. That way if two graphs are different we find out without having to look at all the edges? |
@jpfairbanks the Set{Edge} was originally used for O(1) testing of edge membership. Any call to From a design perspective, sure - it's cleaner not to have edges specified in two places. However, there are always tradeoffs, and the underlying storage format is abstracted so that the user doesn't need to worry about it so much. Specifically for LightGraphs, I disagree with the assertion that design trumps performance: LG was created precisely because the nice clean design of the alternative(s) resulted in subpar real-world results. We sacrifice cleanness for some smoking-hot speed. I'm ok with that. As for performance, my use case might be different, but I'm constructing a graph once (or maybe twice) so edge insertion performance isn't such a big deal. Operations on the graphs, however are a huge deal, so anything that decreases performance there will be problematic. ETA: going back to design, look at your comments re: equality testing. Certainly the one liner for equality comparing edges and vertices is cleaner than the short-circuiting function you created. My guess is that similar design tradeoffs will be required should we remove edge sets, which makes the "it's a cleaner design!" argument less compelling. ETA2: The previous ETA wasn't intended to be critical of your code or ideas; apologies if it came across that way. I think that "clean design" should include both the structures and the functions that operate on the structures, which is why I'm having second thoughts about the move to sparse matrices: while it makes the structure cleaner, the amount of non-obvious manipulation you need to do to derive any performance gains makes the overall result much more complex. IMO, a simple test of graph equality should not require custom iteration states to be performant. The test of cleanliness to me is the ability to write performant functions that are simple enough to understand at a quick glance. The sparse matrix stuff and the equality test above don't meet that threshold to me. |
Hi @sbromberger, I think my comments might be poorly expressed. Here are some clarifications. My comments on design aren't about aesthetics, they are about how our design decisions affect the code that our users write. We should give the user functions that let them write fast code. If the functions we give them cause them to write slow code, then we should reconsider the functions of our library. Perhaps providing the slow version that is easy to use and a faster version for when they need more speed when necessary.
I think these functions can avoid
I agree with this although I think you are using design to mean "elegent/clean design" where I use it just to mean "the choices we make about Interfaces and Implementations."
A problem is that edge insertion comes up often in algorithms. When you want BFS to give you a DiGraph it needs to do edge insertion. When collecting connected components into subgraphs such as
It isn't about cleanliness. It is about performance. Short circuiting is faster because you don't have to look at all the edges (unless g==h, but no way around that).
In haskell I would write equals g h = (foldl && ((zipWith ==) (edges g) (edges h))) and by lazy functional compiler magic we would get a short circuiting implementation on one line with elegance and panache |
Aside: It would be very cool (in the future) to use traits to dispatch on different complexities (and so raise/whatever if it's going to be too slow).... |
@hayd I was looking at Traits.jl a while back and it seemed like a significant addition of complexity for dubious gain. Granted, I wasn't looking to use traits as a method of warning on performance issues. I wonder how that would work. |
@sbromberger This is very long-term future, when most likely traits will end up in base (in some form). Honestly I'm not yet sure how/whether this would even work... :) |
@jpfairbanks thank you for the comments and clarification. I agree with you regarding giving the users ability to write fast code (or at least not unduly slowing them down) but I think there's a tradeoff between general usefulness and specific use cases. That is, if you're a spectral graph person, then matrices are great. If, however, you're more of a structural graph person, then matrices aren't the ideal representation of the data. In any case, while I agree that
, I also think we should not make the library so cumbersome or complex that users can't write simple, easily-understood code. In some cases, this results in a tradeoff between performance and simplicity (see below).
I'm pretty sure I disagree with this approach (even though I've perhaps not stuck to that principle given recent commits), at least as it applies to incorporating multiple functionally-equivalent methods inside LG. LG is not designed to be a Swiss Army knife; we have that already in Graphs.jl which is more flexible than anything I've seen. It comes at a cost, though. My goal with LG is (has been) to err on the side of simplicity and performance while sacrificing flexibility (the reason we don't allow multigraphs, hypergraphs, or metadata). Here's the current landscape, as I see it: So, in short: ignoring flexibility, if there's a way to improve performance without sacrificing too much simplicity, I'm all in favor. However, at some point we have diminishing returns on modest increases in performance (relative to simplicity). I think we might be at that point with respect to the change contemplated here, though in reverse. The proposed change does not make the code faster; in fact, it demonstrably reduces performance with existing functions (equality and operators) while improving performance with just one ( Even if we were to optimize the regressions, they still would not be as performant as what we have now, and it would also result in tremendous complexities to the codebase (see equality). At this point, my preference would be to see whether, instead, we could optimize graph creation instead of optimizing several other functions. That seems like a reasonable tradeoff. |
PS: profiling some graph creation, it looks like the bulk of the time spent in
The set insertion represents 21% of the time spent in PPS: @jpfairbanks - That haskell code does look elegant :) |
this won't work, since it is sensible to the order of edge insertion. One should compare two sorted versions of the adjlist of each node for equality test. That is also easy to implement, and still efficient for sparse graph. It is not for dense graph but on the other hand I imagine that graph equality check it is not an operation one ever does in real life. |
Hi, I'm sorry I didn't reply for a while, I've been a little busy. I did some more test. CompleteGraph generation after patch it is really slow. In fact it should scale as O(n^3), while on master it should be O(n^2 log n). The problem disappears if one uses unsafe_add_edge instead of add_edge, that is eliminating edge existence checks. In truth this should be done for every standard graph constructor in LighGraphs, whether this patch gets merged or not, since constructors are appropriately tested and it is useless to waste user's time. Betweenness centrality benchmaks for comple graphs from master and patch gave identical times. Regarding the discussion on whether to eleminate or not the edge set, since there is no ground truth and the right thing to do depends strongly on the use case scenario, I will put on the table another user story: mine. The reason i commited this patch is that I was looking for a julia graph library which was either performant, easy to use and allowed for rich graph structures. Graphs.jl ,aspiring to high generality through decoupling, was too painfull to use and not actively developd, while LightGraphs was easy enough to learn and to understand, but lacked some feature that i needed, that is graph/edge/vertex properties. Then I thought that I could create another library, FatGraphs.jl or Networks.jl, with an heavier data structure, but the could produce a LightGraphs.jl graph in notime and fallback to LightGraphs.jl functions. At this point the new library would have all the LightGraphs functionality for free. This can be done only eliminating the edgeset, since an hypothetical FatGraphs library would not have an edge set but most probaby a dict or a vector. In the end, eliminating the edgeset and mantaining a minimal data structure has a strong appeal (at least to me), in front of some performance loss in some corner cases (edge existence test in dense graphs) and some (minor) performance gain in others. |
@CarloLucibello Thank you for describing your use case. I don't claim a full understanding of what you want to do, but this:
is puzzling to me. If you're using LG as a backend to this FatGraphs library, then why would you not use the LG API to construct your LightGraphs? That is, if you're proposing to manually create LG from FG data, this is bound to fail you at some point down the road if the underlying data structures change. If there's an issue with the API (as you've found with construction of smallgraphs - we'll get that change in ASAP, and thanks!), let's identify it. Use of LG as a backend or companion library should not rely on the underlying data structures. If there were such a capability in Julia, these structural attributes would be declared private. As it appears now, though, while this change may appeal to your specific use case, it will degrade - perhaps significantly - performance for others. Without an appreciation of why the current structure won't provide you with what you need (again, maybe with some API additions/changes), I don't see a compelling reason to merge this. I feel I'm missing an appreciation of the larger issue - help me understand. |
Though we haven't mapped it out yet, I imagine graph equality checks will be a key component of isomorphism. |
Use of LG as a backend would be costless only if I wouldn't need to generate an edge set to contruct an LG graph for each LG function call. After this patch it should be possible to do something like betweenness_centrality(g::FatGraph) = betweenness_centrality(LightGraph(g.fadjlist,g.badjlist)) with very little overhead. On the other hand considerations on LG design should not rely on other (non exisisting) libraries, so if no consesus is reached over the elimination of the edge set then let's stay with it. I have two loosely related questions:
Vertex List interface that is every one except for the Edge Map interface. Sorry if these questions have been previously discussed. |
But here's the thing: you should NOT be constructing LightGraphs this way. There is no "approved" constructor that allows you to specify the internal data structures (edge set,
That's a good question that deserves more thought. Sets are effectively Dicts with void values (see my post to julia-dev, and Stefan's response here: https://groups.google.com/d/msg/julia-dev/_2bF02Kp1fQ/Sy7jV3iiBOUJ). Therefore, a Dict{Edge, Int} might be just as performant space wise, and when testing membership. The problem is that finding an edge given an index will be an inefficient operation.
There are a variety of reasons, some probably more valid than others. The original reason was because there wasn't a Graphs.jl structure that allowed me to optimize a bunch of things very generally, and I didn't want to create specific methods that only worked on my objects. It seemed wrong, for some reason. I gave up after trying to get GraphCentrality working with good performance across all possible Graphs.jl types. It was a daunting task and I still don't think I'm up for it. As time went on, we found issues such as JuliaAttic/OldGraphs.jl#131, which led to JuliaAttic/OldGraphs.jl#143, and then there was this: JuliaAttic/OldGraphs.jl#163 and finally JuliaAttic/OldGraphs.jl#171. Bottom line, it was easier just to start fresh with known limitations rather than struggle to figure out limitations through parameterization and other things. There was no guarantee that an algorithm that worked on one type of graph (with some set of traits) would perform well on any other, and for me it was a struggle to get it working at all. It was faster for my work to start from scratch. |
Well, I would like to rely on LG underlying storage, and I don't expect it to change often, in order to have a O(1) constructor. All members of a LG would be members of a FG, and that would be feasible only if LG remains as lean as formed by two adjlists.
Sure, I was not meaning to remodel the api to accomodate and edge index, I was wondering if assigning an unique id to each edge could be useful for users, maybe to store edge properties in vectors. |
Here it is an interesting example, the python library (with C++ internals) graph-tool It seems to be composed as a class GraphInterface which has a member of class AdjacencyList. All the topological work is done by AdjList, while GraphInterface keeps track of graph properties. https://git.skewed.de/count0/graph-tool/blob/master/src/graph/graph.hh This is sort of the relation I was looking for between LG and other libraries with an heavier data stucture. graph-tool's adjacency list allows also for multiedges |
As the creator of the package, I am stating as emphatically as I can: you should not rely on LG's underlying storage to remain unchanged. There have been (and are currently) several efforts to change elements of how we represent graphs internally. LG will never guarantee that accessing internal fields directly will be 1) performant, or 2) non-breaking across even minor updates, and while we will make every effort to notify users of structural changes, users should not rely on that notification in order to correct their code. The APIs are there for a reason. If we need another constructor, we can write it.
Honestly (and I hate to say this because this is the first time I've had to), it sounds like LightGraphs isn't ideal for your use case. Given what I understand your needs to be, I'd recommend you look at Graphs.jl. You'll be able to do everything you need with it, and it should support all the features you're missing in LG. I can pretty definitely say that LG will never support metadata on nodes and edges. This is an explicit design decision that we call out in the README and the docs. (Where I've needed labels or other data, I've used Redis or in-memory structures (vectors or dicts) outside of the graph structure.) I appreciate the discussion and am very interested in what you're doing with respect to graph modeling, but it sounds as if you're looking for LG to do something that it simply wasn't designed to do (and in fact was explicitly NOT designed to do). |
Per #221, I want to incorporate / test some of this code. @CarloLucibello, do you still have the branch? Would you mind filing another PR? (Don't worry about merge conflicts; we'll sort those out later.) |
I thought that we can signficantly reduce memory comsumption without sensible loss in performance if we remove the g.edges member and make edges(g) return an edge iterator.
The other necessary ingredient was the overlad of Base.in
This is my attempt at it. All test pass, but I'm still not sure if it has performance issues though.
TODO
-[ ] documentation
-[ ] more tests
-[ ] benchmarking
I would appreciate comments and a careful review, since merging this would be an important design decision.
Bye,
Carlo