cirosantilli changed the title from "Best/more standard graph representation file format? (GraphSON, Gexf, GraphML?" to "BVGraph graph file format introduction tutorial CLI hello world" on Mar 24, 2025
Best/more standard graph representation file format? (GraphSON, Gexf, GraphML? )
https://stackoverflow.com/questions/31321009/best-more-standard-graph-representation-file-format-graphson-gexf-graphml
BVGraph
For large graphs where storage efficiency matters, this binary format could be another option to consider: https://webgraph.di.unimi.it/docs/it/unimi/dsi/webgraph/BVGraph.html
It is for example the format in which Common Crawl currently dumps its graphs.
The format has open source implementations in Java and Rust as part of the "WebGraph Framework": https://webgraph.di.unimi.it/
The implementations are at:
Java: https://github.com/vigna/webgraph
Rust: https://github.com/vigna/webgraph-rs
This page gives a reasonable introduction to the file format itself: https://zom.wtf/posts/webgraph-bvgraph/
A more complete description can be found in this paper: https://vigna.di.unimi.it/ftp/papers/WebGraphI.pdf
Playing with Common Crawl web graph with the WebGraph Rust implementation
Being a forward looking millennial I had to give the Rust implementation a quick try. I downloaded cc-main-2024-25-dec-jan-feb-domain.graph and cc-main-2024-25-dec-jan-feb-domain.properties from https://data.commoncrawl.org/projects/hyperlinkgraph/cc-main-2024-25-dec-jan-feb/index.html
The format seems to be split across two files each time:
one .properties file which describes the exact data format via some parameters
one .graph file which contains the bulk of the data
The format thus offers multiple ways to encode any given graph, so that a good choice of encoding parameters based on your knowledge of the graph structure could lead to much more efficient compression.
I then built the software with:
    git clone https://github.com/vigna/webgraph-rs
    cd webgraph-rs
    git checkout c77b4c9970996f4d2c8037e2ea48f151389b72e0
    cargo build
and now one simple thing we can do is to convert the BVGraph to CSV with:
    cargo run to csv cc-main-2024-25-dec-jan-feb-domain.graph
The first 10 lines are pairs of node numbers, so these must be edges of the graph going from one node to the other. To see which domain corresponds to each node number we also have to download cc-main-2024-25-dec-jan-feb-domain-vertices.txt.gz, which contains the number to domain map.
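The lookup described above can be sketched with awk. This is only an illustration under stated assumptions: the two sample files and the helper name edges_to_names are hypothetical, the vertices file is assumed to be tab-separated with the node number in the first column and the name in the second, and the edge list is assumed to hold one "from,to" pair of node numbers per line.

```shell
# Hypothetical stand-ins for the real (much larger) files.
# Assumption: vertices file is tab-separated, node number then name.
printf '0\taaa.example\n1\tbbb.example\n2\tccc.example\n' > vertices-sample.txt
# Assumption: edge list has one "from,to" pair of node numbers per line.
printf '0,1\n0,2\n2,1\n' > edges-sample.csv

# Hypothetical helper: replace node numbers in an edge list by names.
# $1 is the vertices file, $2 is the edge list.
edges_to_names() {
    awk '
        # First file: remember the name of each node number.
        NR == FNR { split($0, f, "\t"); name[f[1]] = f[2]; next }
        # Second file: print each edge with names instead of numbers.
        { split($0, e, ","); print name[e[1]] " -> " name[e[2]] }
    ' "$1" "$2"
}

edges_to_names vertices-sample.txt edges-sample.csv
```

For the real dataset the vertices file would first have to be decompressed, and holding the whole number to name map in memory as done here may not be practical at that size, so treat this as a way to understand the two files rather than as a production tool.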
Creating a BVGraph from the CLI
In order to create a BVGraph with the Rust CLI, we need the graph in CSV format, with one from node -> to node edge per line, e.g. given in.csv:
which represents the graph:
.dot source:
we can convert it to out.graph with:
We can then verify that the generated graph is correct with:
which gives the graph with our arbitrary labels converted to numbers:
and the number to label mapping can be found at out.nodes, which contains:
Now we just have to wait for them to expose PageRank on the CLI!
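To make the label to number conversion mentioned above more concrete, here is a minimal awk sketch of the idea. It is not the webgraph-rs implementation: the sample file and the helper name relabel are hypothetical, and handing out numbers in order of first appearance is an assumption, so the real tool may number the nodes differently.

```shell
# Hypothetical input: one "from,to" edge per line, arbitrary labels.
printf 'a,b\nb,c\na,c\n' > labels-sample.csv

# Hypothetical helper: rewrite a labelled edge list with node numbers.
# Assumption: numbers are handed out in order of first appearance;
# the real tool may choose a different numbering.
relabel() {
    awk -F ',' '
        # Return the number of a label, assigning a new one if needed.
        function id(label) {
            if (!(label in ids)) ids[label] = n++
            return ids[label]
        }
        # Number the source first, then the target, then print the edge.
        { src = id($1); dst = id($2); print src "," dst }
    ' "$1"
}

relabel labels-sample.csv
```

On the sample input this prints 0,1 then 1,2 then 0,2, since a, b and c are first seen in that order.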
Tested on Ubuntu 24.10, rust 1.80.1.