RFC: connecting the dots, wiring ngdi algo with NebulaGraph UDF #18

Open · wey-gu opened this issue Mar 2, 2023 · 13 comments

wey-gu commented Mar 2, 2023

Simplifying things in surprising ways.

A native query experience to leverage ngdi.

API

execution engine/mode

  • networkx
  • spark

Call Syntax

  • Scan-based (read mode):
RETURN ngdi("pagerank", ["follow"], ["degree"]) // calls the parallel execution mode: spark
RETURN ngdi("pagerank", ["follow"], ["degree"], "compact") // default execution mode: calls the single-process version in NetworkX
  • Query-based:

Option 0:

// nGQL pipe form:
MATCH ()-[:follow]->() RETURN e LIMIT 10000 | YIELD collect($.e) AS graph |
RETURN ngdi.query("pagerank", $-.graph)

// Cypher form:
MATCH ()-[:follow]->() RETURN e LIMIT 10000
WITH collect(e) AS graph
RETURN ngdi.query("pagerank", graph)

Option 1:

YIELD "MATCH ()-[:follow]->() RETURN e LIMIT 10000" AS query |
YIELD ngdi("pagerank", $-.query)

Write Mode

  • return mode: the function returns the records (ideally in a streaming way)
  • update mode: the result is written to the calculated vertices as prop(s), in an update way
  • insert mode: the result is written to the calculated vertices as prop(s), in an insert way (see the sketch below for how update vs. insert differ)
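
To make the write modes concrete, here is a minimal, illustrative Python sketch of how an update vs. insert write-back could translate into nGQL statements. The tag and property names ("pagerank_result", "pagerank") and the result shape are assumptions for illustration, not part of the RFC.

```python
# Illustrative only: how the "update" vs. "insert" write modes could translate
# into nGQL statements run against NebulaGraph. Tag/prop names and the result
# shape are hypothetical, not part of the RFC.

def build_write_statements(results, mode, tag="pagerank_result", prop="pagerank"):
    """results maps vertex id -> algorithm score."""
    stmts = []
    for vid, score in results.items():
        if mode == "update":
            # update mode assumes the vertex already carries the tag
            stmts.append(f'UPDATE VERTEX ON {tag} "{vid}" SET {prop} = {score};')
        elif mode == "insert":
            # insert mode attaches the tag/prop regardless of whether it existed
            stmts.append(f'INSERT VERTEX {tag}({prop}) VALUES "{vid}":({score});')
        else:
            raise ValueError(f"unsupported write mode: {mode}")
    return stmts

# build_write_statements({"player100": 0.85}, mode="insert")
# -> ['INSERT VERTEX pagerank_result(pagerank) VALUES "player100":(0.85);']
```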

Design

  • Set up ngdi-api-server listening on 9999 (thrift) or 19999 (http)
  • Call ngdi-api-server from the UDF
  • Support calling the compact (networkx) or parallel (spark) mode on demand via a hint (a rough sketch of the HTTP side follows)
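
As a rough sketch (not the actual ngdi-api-server code), the HTTP flavor on 19999 could look roughly like the following; the route, payload fields, and response shape here are assumptions for illustration only.

```python
# Hypothetical sketch of the HTTP flavor of ngdi-api-server on 19999.
# Route, payload fields, and response shape are assumptions, not the real API.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/v0/algo", methods=["POST"])
def run_algo():
    body = request.get_json(force=True)
    algo = body.get("algo")              # e.g. "pagerank"
    mode = body.get("mode", "spark")     # execution engine: "spark" or "networkx"
    read = body.get("read", {})          # scan spec or an nGQL query string
    write = body.get("write", "insert")  # "return" / "update" / "insert"
    # Here the server would dispatch to ngdi with the chosen engine, run the
    # algorithm, and either return the records or write them back to NebulaGraph.
    return jsonify({"ok": True, "algo": algo, "mode": mode, "write": write, "read": read})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=19999)
```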

ref: vesoft-inc/nebula#4804

wey-gu changed the title from "RFC: wiring ngdi algo with NebulaGraph UDF" to "RFC: connecting the dots, wiring ngdi algo with NebulaGraph UDF" on Mar 2, 2023

wey-gu commented Mar 6, 2023

The minimal PoC implementation will be:

  • execution mode: parallel (spark)
  • read mode: scan && option-1 query-based (a sketch of the scan flow follows this list)
  • write mode: insert
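
For reference, the scan-based read plus pagerank flow in ngdi looks roughly like the following, adapted from the reader example in ngdi's README; exact parameters may differ.

```python
# Rough sketch of the scan-based read + pagerank flow the PoC targets,
# adapted from ngdi's README; exact parameters may differ.
from ngdi import NebulaReader

reader = NebulaReader(engine="spark")       # parallel (spark) execution mode
reader.scan(edge="follow", props="degree")  # scan-based read of the follow edges
df = reader.read()

pr_result = df.algo.pagerank(reset_prob=0.15, max_iter=10)
```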

WIP on:

wey-gu added a commit that referenced this issue Mar 7, 2023

whitewum commented Mar 7, 2023

let's choose the cypher way, not the ngql way.

MATCH ()-[:follow]->() RETURN e LIMIT 10000
WITH collect(e) AS graph
RETURN ngdi.query("pagerank", graph)


whitewum commented Mar 7, 2023

I think it's not necessary to call networkx "compact" and spark "parallel". The names "networkx" and "spark" are fine. We can probably introduce more graph engines in the future.


wey-gu commented Mar 7, 2023

> let's choose the cypher way, not the ngql way.
>
> MATCH ()-[:follow]->() RETURN e LIMIT 10000
> WITH collect(e) AS graph
> RETURN ngdi.query("pagerank", graph)

Sure, this is much better. It's a bit harder to implement, but we will eventually implement it this way.


wey-gu commented Mar 7, 2023

> I think it's not necessary to call networkx "compact" and spark "parallel". The names "networkx" and "spark" are fine. We can probably introduce more graph engines in the future.

Makes sense; then we don't have to introduce a separate "compact" option, just another mode.


whitewum commented Mar 7, 2023

> let's choose the cypher way, not the ngql way.
>
> MATCH ()-[:follow]->() RETURN e LIMIT 10000
> WITH collect(e) AS graph
> RETURN ngdi.query("pagerank", graph)
>
> Sure, this is much better. It's a bit harder to implement, but we will eventually implement it this way.

OK, for now let's choose the easiest way. We can change the DSL later; it is not set in stone yet.


whitewum commented Mar 7, 2023

I don't get it. Is this UDF implemented in C++ or Python in Nebula?

> Call ngdi-api-server from the UDF

The UDF seems like a C++ client of ngdi-api-server?


wey-gu commented Mar 7, 2023

> I don't get it. Is this UDF implemented in C++ or Python in Nebula?
>
> Call ngdi-api-server from the UDF
>
> The UDF seems like a C++ client of ngdi-api-server?

Exactly. Thus, for the query-based reader in spark mode, passing the query string through rather than evaluating it is much easier to implement for the initial, fast PoC version.

The UDF (C++) makes calls from graphd to the ngdi api server (which runs either in a spark cluster or as a single Python process).
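
For illustration (in Python rather than the actual C++ UDF code), the call from graphd's UDF to the ngdi api server could be little more than one HTTP request; the URL path and payload fields below are assumptions matching the design sketch above, not a finalized protocol.

```python
# Hypothetical view of what the C++ UDF would send to ngdi-api-server,
# shown in Python for brevity. Path and payload fields are assumptions.
import json
from urllib import request

payload = {
    "algo": "pagerank",
    "mode": "spark",  # or "networkx" for the single-process version
    "read": {"query": "MATCH ()-[e:follow]->() RETURN e LIMIT 10000"},
    "write": "insert",
}

req = request.Request(
    "http://127.0.0.1:19999/api/v0/algo",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
with request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```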


whitewum commented Mar 7, 2023

Why not add a Python UDF in Nebula instead?


wey-gu commented Mar 7, 2023

> Why not add a Python UDF in Nebula instead?

Because it's the merged implementation (easier) for now 😭


wey-gu commented Mar 7, 2023

> Why not add a Python UDF in Nebula instead?

We could try adding FFI to the UDF to call Python code directly from graphd, but that would benefit the non-spark version only.

Also, there would be some dirty work binding this to the current UDF infra (basically just the existing function manager, pure C++ by nature).


whitewum commented Mar 7, 2023

OK, the C++ client of ngdi-api-server is the easiest way so far.

But if the UDF call is introduced in the DSL, the syntax check is not easy.

For example, is graph a correct graph structure in this page_udf(graph) call? Who will check the correctness, graphd or spark?


wey-gu commented Mar 7, 2023

> the syntax check is not easy.
>
> For example, is graph a correct graph structure in this page_udf(graph) call? Who will check the correctness, graphd or spark?

Indeed. For now, I put all the validation that the UDF can do into the UDF itself, because it should fail early and hint explicitly at where things went wrong as much as possible (you can see there are a lot of checks in the current PoC UDF in that branch).

On the ngdi_gateway side, there should also be as much early checking (when needed) and exception handling as possible, so as not to confuse users.

For instance, for the MATCH-query read mode, in the fast-track implementation (option 2) graphd only treats the query as a string, which has to be evaluated by the spark connector's nGQL reader, but I will do my best to make it smooth/clear/lovely to use.

In the production/future delivery version of the UDF for ngdi calling, it will be quite heavy to do enough checks before calling the remote ngdi.
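
To illustrate that pass-through: a rough Python sketch of the ngdi-side handling of the query string, loosely following the reader/algorithm interfaces in ngdi's README (the exact method names and parameters are assumptions and may differ).

```python
# Hypothetical gateway-side handling of a query-based read: graphd passes the
# MATCH query through as an opaque string, and the spark-engine reader
# evaluates it. The ngdi calls loosely follow the project's README; exact
# signatures may differ.
from ngdi import NebulaReader

def run_pagerank_from_query(ngql_query: str):
    reader = NebulaReader(engine="spark")
    reader.query(query=ngql_query, edge="follow", props="degree")
    df = reader.read()
    # run the algorithm on the resulting graph dataframe
    return df.algo.pagerank(reset_prob=0.15, max_iter=10)

# result = run_pagerank_from_query("MATCH ()-[e:follow]->() RETURN e LIMIT 10000")
```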
