This repository tackles link prediction in an investor-startup bipartite graph.
I began with Zheng's work, which I illustrate and compare to my results. In essence, Zheng's approach predicts links with Preferential Attachment, i.e. Degree(venture) x Degree(investor).
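As a rough illustration of the Preferential Attachment baseline, here is a minimal sketch that scores candidate (investor, venture) pairs by the product of their degrees. The function name, the edge-list format and the toy data are assumptions made for the example; they are not taken from Zheng's paper or from the notebooks.

```python
from collections import Counter


def preferential_attachment_scores(edges, candidates):
    """Score each candidate pair by degree(investor) * degree(venture).

    edges: iterable of observed (investor, venture) links.
    candidates: iterable of (investor, venture) pairs to score.
    """
    investor_degree = Counter(investor for investor, _ in edges)
    venture_degree = Counter(venture for _, venture in edges)
    # Counter returns 0 for unseen nodes, so unknown nodes score 0.
    return {
        (investor, venture): investor_degree[investor] * venture_degree[venture]
        for investor, venture in candidates
    }


# Hypothetical toy data, for illustration only.
edges = [("inv_a", "v1"), ("inv_a", "v2"), ("inv_b", "v2"), ("inv_c", "v3")]
candidates = [("inv_a", "v3"), ("inv_b", "v1"), ("inv_c", "v2")]

scores = preferential_attachment_scores(edges, candidates)
# Highest score first: the pairs considered most likely to link.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Pairs made of two high-degree nodes come first in `ranked`, which is the whole idea of the baseline: well-connected investors and well-funded ventures are assumed to attract further links.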
I wanted to leverage the graph structure, which I believe represents, to a certain extent, the social interactions that drive venture capitalists: their bounded rationality and bounded geographical influence should push them towards closer ventures. To measure this closeness, I computed the LENGTH feature, which represents the geodesic distance in the graph between the two nodes. In a second part, I also computed the textual closeness of a start-up to an investor's portfolio using cosine distance on W2V features.
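To make the LENGTH feature concrete, here is a minimal sketch of a geodesic (shortest-path) distance computed with a breadth-first search on an unweighted bipartite graph. The helper names and the toy edges are hypothetical, and the notebooks may well compute this differently (for instance with a graph library). I also assume that the graph passed in does not already contain the link being predicted, otherwise the distance would trivially be 1.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (investor, venture) edges."""
    adjacency = defaultdict(set)
    for investor, venture in edges:
        adjacency[investor].add(venture)
        adjacency[venture].add(investor)
    return adjacency


def geodesic_length(adjacency, source, target):
    """Return the number of edges on a shortest path, or None if unreachable."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == target:
                return dist + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Hypothetical toy graph, for illustration only.
edges = [
    ("inv_a", "v1"), ("inv_a", "v2"),
    ("inv_b", "v2"), ("inv_b", "v3"),
    ("inv_c", "v4"),
]
adjacency = build_adjacency(edges)

length_close = geodesic_length(adjacency, "inv_a", "v3")  # inv_a - v2 - inv_b - v3
length_none = geodesic_length(adjacency, "inv_a", "v4")   # different component
```

In a bipartite graph, the distance between an investor and a venture is always odd, so a value of 3 means the two are connected through one co-investor and one shared venture.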
As expected, the LENGTH feature strongly pushes the model towards predicting a match between venture and investor. We show that the model with LENGTH performs better overall than the model without it on PR curves, and that LENGTH is the most important feature according to SHAP (see B.2).
Alongside this, and largely uncorrelated with it, we added information from venture descriptions. These descriptions were converted to vectors with W2V and then compared using cosine distance. We see that the start-ups an investor invests in during YEAR+1 are on average semantically closer to that investor's portfolio from the previous YEAR. This hints that investors specialize in certain areas (which is of course well known). This analysis is inspired by Basole's work.
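The textual closeness described above can be sketched as follows: given one vector per description, take the mean cosine distance between a candidate start-up and every start-up already in the investor's portfolio. This is only a sketch under assumptions: the function names and the 2-dimensional toy vectors are hypothetical, real W2V vectors would have many more dimensions, and how the notebooks aggregate distances over a portfolio is not stated here, so the mean is my guess.

```python
import math


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_distance_to_portfolio(candidate, portfolio):
    """Average cosine distance from a candidate vector to each portfolio vector."""
    return sum(cosine_distance(candidate, p) for p in portfolio) / len(portfolio)


# Hypothetical description vectors, for illustration only.
portfolio = [[1.0, 0.0], [1.0, 1.0]]
similar_startup = [1.0, 0.1]
different_startup = [0.0, 1.0]

dist_similar = mean_distance_to_portfolio(similar_startup, portfolio)
dist_different = mean_distance_to_portfolio(different_startup, portfolio)
```

A smaller value means the candidate's description points in a direction closer to what the investor already holds.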
■ Illustrating paper illustrates Zheng's paper
■ Preparing Real Data prepares the data and saves it in a utils.p file for later use.
■ Generate DFA prepares the DataFrame to be used in later steps.
■ Predictions tests a fine-tuned RandomForestClassifier and gives Precision-Recall curves for it with different feature sets. The LENGTH and W2V features bring a clear improvement.
Investors tend to invest in firms whose descriptions resemble those of the firms they invested in before. Comparing DIST_MEAN to a RANDOM distance shows that the start-ups in a portfolio in the second half of the dataset are correlated with those in the first half (before and after 2016). If an investor were investing in random firms, DIST_MEAN would be close to RANDOM, which was computed by randomly picking 10,000 pairs (Di, Dj) of start-ups i and j in the dataset.
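The RANDOM reference value can be sketched like this: draw random pairs of distinct start-ups and average the cosine distance between their description vectors. The function names, the seed, and the synthetic Gaussian vectors are assumptions for the example; the real notebook works on W2V vectors of the actual descriptions, and I am assuming the 10,000 distances are averaged.

```python
import math
import random


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def random_baseline(vectors, n_pairs=10_000, seed=0):
    """Mean cosine distance over n_pairs randomly drawn pairs of distinct items."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(vectors)), 2)  # two distinct indices
        total += cosine_distance(vectors[i], vectors[j])
    return total / n_pairs


# Synthetic stand-in for description vectors, for illustration only.
data_rng = random.Random(42)
vectors = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]

baseline = random_baseline(vectors, n_pairs=1_000)
```

An investor whose DIST_MEAN sits clearly below this baseline is picking start-ups that look more like its existing portfolio than a random draw would.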
See: Investors profile from year to another
Investors profile from year to another shows that LENGTH is the most impactful feature according to SHAP feature importance.
This work builds on Zheng's paper, which you can find in /base_papers