BVGraph graph file format introduction tutorial CLI hello world #198

cirosantilli opened this issue Mar 24, 2025 · 0 comments
cirosantilli commented Mar 24, 2025

Best/more standard graph representation file format? (GraphSON, Gexf, GraphML? )

https://stackoverflow.com/questions/31321009/best-more-standard-graph-representation-file-format-graphson-gexf-graphml

BVGraph

For large graphs where storage efficiency matters, this binary format could be another option to consider: https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/BVGraph.html

It is, for example, the format in which Common Crawl currently publishes its web graphs.

The format has open source implementations in Java and Rust as part of the "WebGraph Framework": https://webgraph.di.unimi.it/

The implementations are at:

  • Java: https://github.com/vigna/webgraph
  • Rust: https://github.com/vigna/webgraph-rs

This page gives a reasonable introduction to the file format itself: https://zom.wtf/posts/webgraph-bvgraph/

A more complete description can be found in this paper: https://vigna.di.unimi.it/ftp/papers/WebGraphI.pdf

Playing with Common Crawl web graph with the WebGraph Rust implementation

Being a forward-looking millennial, I had to give the Rust implementation a quick try. I downloaded cc-main-2024-25-dec-jan-feb-domain.graph and cc-main-2024-25-dec-jan-feb-domain.properties from https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2024-25-dec-jan-feb/index.html:

wget https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2024-25-dec-jan-feb/domain/cc-main-2024-25-dec-jan-feb-domain.properties
wget https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2024-25-dec-jan-feb/domain/cc-main-2024-25-dec-jan-feb-domain.graph

The format seems to be split across two files each time:

  • one .properties file which describes the exact data format via some parameters
  • one .graph file which contains the bulk of the data

The format therefore offers multiple ways to encode any given graph, and a good choice of encoding parameters, based on your knowledge of the graph structure, can lead to much more efficient compression.
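To get a feel for what the .properties file holds, here is a minimal Python sketch that parses Java-style key=value lines into a dict. The sample text and its values are made up for illustration and are not copied from the Common Crawl file, so check the real file for the actual keys and values.

```python
def parse_properties(text):
    """Parse simple Java-style .properties text into a dict of strings.

    Handles only `key=value` lines, blank lines and `#`/`!` comments.
    Line continuations and escapes are deliberately not supported.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


# Hypothetical sample, NOT the contents of the real Common Crawl file.
SAMPLE = """\
# illustrative values only
nodes=4
arcs=4
windowsize=7
"""

props = parse_properties(SAMPLE)
print(props)
```

To try it on the downloaded file, pass the contents of cc-main-2024-25-dec-jan-feb-domain.properties to parse_properties instead of SAMPLE.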

I then built the software with:

git clone https://github.com/vigna/webgraph-rs
cd webgraph-rs
git checkout c77b4c9970996f4d2c8037e2ea48f151389b72e0
cargo build

and now one simple thing we can do is convert the BVGraph to CSV with:

cargo run to csv cc-main-2024-25-dec-jan-feb-domain.graph

The first 10 lines contain:

44,31272884
46,89485870
51,85584515
53,10820780
53,11294987
53,11295261
53,12765779
53,18600892
53,31280705
53,42941755

so these must be the edges of the graph, each going from one node ID to another. To see which domain corresponds to each node ID we also have to download cc-main-2024-25-dec-jan-feb-domain-vertices.txt.gz:

wget https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2024-25-dec-jan-feb/domain/cc-main-2024-25-dec-jan-feb-domain-vertices.txt.gz
gunzip --keep cc-main-2024-25-dec-jan-feb-domain-vertices.txt.gz

and that contains the number-to-domain map, with lines such as:

44      aarp.nic        2
46      aarp.takeontoday        1
51      abb.ability     7
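Joining the two files can be sketched in a few lines of Python. This is only an illustration under some assumptions: that the vertices file is whitespace-separated with the node ID first and the name second, and that the whole map fits in memory (for the full Common Crawl file that may need a lot of RAM). The function names are made up for this example. The sample data is the three vertex lines and the first three edges shown above, so the target IDs, which are not in that tiny sample, stay numeric.

```python
def load_vertices(lines):
    """Map node ID -> name from whitespace-separated 'id name ...' lines."""
    names = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            names[int(fields[0])] = fields[1]
    return names


def label_edges(csv_lines, names):
    """Yield (source, target) with IDs replaced by names where known."""
    for line in csv_lines:
        line = line.strip()
        if not line:
            continue
        src, dst = (int(x) for x in line.split(","))
        # Fall back to the numeric ID when the name is not in the map.
        yield names.get(src, src), names.get(dst, dst)


# The three vertex lines and the first three edges quoted in this post.
VERTICES = [
    "44\taarp.nic\t2",
    "46\taarp.takeontoday\t1",
    "51\tabb.ability\t7",
]
EDGES = ["44,31272884", "46,89485870", "51,85584515"]

names = load_vertices(VERTICES)
labeled = list(label_edges(EDGES, names))
for src, dst in labeled:
    print(src, "->", dst)
```

With the real files, open the gunzipped vertices file and the CSV output and pass the file objects in place of the two lists.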

Creating a BVGraph from the CLI

To create a BVGraph with the Rust CLI, we need the graph as a CSV with one "from node,to node" pair per line, e.g. given:

in.csv

a,b
b,c
c,d
d,b

which represents the graph with the following .dot source:

digraph {
  a -> b;
  b -> c;
  c -> d;
  d -> b;
}

we can convert it to out.graph with:

cargo run from arcs --num-nodes 4 out < in.csv

We can then verify that the generated graph is correct with:

cargo run to csv out

which gives the graph with our arbitrary labels converted to numbers:

0,1
1,2
2,3
3,1

and the number-to-label mapping can be found in out.nodes, which contains:

a
b
c
d
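The relabeling seen above can be reproduced with a short Python sketch. It assumes that IDs are handed out in order of first appearance in the CSV, which matches the output shown here. Sorted label order would give the same result on this tiny input, so this is a guess at the behavior rather than a description of what the tool actually does. The function name is made up for this example.

```python
def relabel(csv_lines):
    """Assign 0-based IDs to labels in order of first appearance.

    Returns (edges, labels) where labels[i] is the label of node i.
    This mimics the observed output, not the tool's actual algorithm.
    """
    ids = {}
    edges = []
    for line in csv_lines:
        line = line.strip()
        if not line:
            continue
        src, dst = line.split(",")
        # setdefault keeps an existing ID or assigns the next free one.
        edge = tuple(ids.setdefault(label, len(ids)) for label in (src, dst))
        edges.append(edge)
    labels = list(ids)  # dicts preserve insertion order
    return edges, labels


IN_CSV = ["a,b", "b,c", "c,d", "d,b"]

edges, labels = relabel(IN_CSV)
for src, dst in edges:
    print(f"{src},{dst}")
print(labels)
```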

Now we just have to wait for them to expose PageRank on the CLI!

Tested on Ubuntu 24.10, Rust 1.80.1.
