Link Prediction Investor-Startup

This repository performs link prediction on an investor-startup bipartite graph.

I began with the work of Zheng, which I illustrate and compare to my results. In essence, Zheng's approach predicts links with Preferential Attachment: Degree(venture) x Degree(investor).
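As a rough illustration of the Preferential Attachment score, here is a minimal sketch in plain Python. The function name `preferential_attachment_scores`, the edge-list layout, and the toy data are assumptions made for this example; they are not taken from the notebooks or from Zheng's paper.

```python
from collections import Counter


def preferential_attachment_scores(edges, candidates):
    """Score each (investor, venture) candidate pair by the product of degrees.

    `edges` is a list of observed (investor, venture) investments.
    `candidates` is a list of (investor, venture) pairs to score.
    """
    investor_degree = Counter(investor for investor, _ in edges)
    venture_degree = Counter(venture for _, venture in edges)
    # A Counter returns 0 for unseen nodes, so unknown nodes score 0.
    return {
        (investor, venture): investor_degree[investor] * venture_degree[venture]
        for investor, venture in candidates
    }


# Toy data, invented for illustration only.
edges = [("inv_a", "s1"), ("inv_a", "s2"), ("inv_b", "s2"), ("inv_c", "s3")]
candidates = [("inv_a", "s3"), ("inv_b", "s1"), ("inv_c", "s2")]

pa_scores = preferential_attachment_scores(edges, candidates)
# Higher score = more likely link under this baseline.
ranked = sorted(pa_scores, key=pa_scores.get, reverse=True)
```

The score only looks at how connected each endpoint is, which is why it ignores how close the two nodes are in the graph.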

I wanted to leverage the graph structure, which I believe represents, to a certain extent, the social interactions that drive venture capitalists: their bounded rationality and bounded geographical influence should push them towards closer ventures. To measure this closeness, I computed the LENGTH feature, which represents the geodesic distance in the graph between the two nodes. In a second part, I also computed the textual closeness of a start-up to an investor's portfolio, using cosine distance on W2V features.
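A geodesic distance like LENGTH can be sketched with a breadth-first search. The helper names (`build_adjacency`, `geodesic_length`) and the `ignore_direct_edge` option are assumptions for this example. In particular, skipping the direct investor-venture edge when scoring a pair that is already linked is one plausible way to avoid leaking the label, not necessarily what the notebooks do.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (investor, venture) edges."""
    adjacency = defaultdict(set)
    for investor, venture in edges:
        adjacency[investor].add(venture)
        adjacency[venture].add(investor)
    return adjacency


def geodesic_length(adjacency, source, target, ignore_direct_edge=True):
    """Return the shortest-path length from source to target, or None if unreachable.

    With ignore_direct_edge=True the edge source-target itself is skipped,
    so an existing link does not trivially yield a distance of 1.
    """
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if ignore_direct_edge and node == source and neighbour == target:
                continue
            if neighbour == target:
                return dist + 1
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Toy data, invented for illustration only.
edges = [("inv_a", "s1"), ("inv_a", "s2"), ("inv_b", "s2"),
         ("inv_b", "s3"), ("inv_c", "s4")]
adjacency = build_adjacency(edges)

length_a_s3 = geodesic_length(adjacency, "inv_a", "s3")  # inv_a - s2 - inv_b - s3
length_a_s4 = geodesic_length(adjacency, "inv_a", "s4")  # different component
```

Because the graph is bipartite, any path between an investor and a venture has odd length, so the smallest informative value for an unlinked pair is 3: the investor shares a venture with a co-investor who also backed the target.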

As expected, the LENGTH feature shows a strong tendency to make the model predict a match between a venture and an investor. The PR curves show that the model with LENGTH is stronger overall than the model without it, and LENGTH is also the most important feature according to SHAP (see B.2).

Alongside this, and largely uncorrelated with it, we added information from the venture descriptions. The descriptions were converted to vectors with W2V and then compared using cosine distance. We see that the startups an investor funds in YEAR+1 are on average semantically closer to that same investor's portfolio from the previous YEAR. This hints that investors specialize in certain areas (which is of course well known). This part is inspired by Basole's work.
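The comparison between a start-up and a portfolio can be sketched as follows. This example assumes each description has already been turned into a fixed-length vector, and it aggregates by averaging the pairwise cosine distances. Both the aggregation choice and the names `cosine_distance` and `mean_distance_to_portfolio` are assumptions for illustration.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def mean_distance_to_portfolio(candidate_vec, portfolio_vecs):
    """Average cosine distance between one start-up and every portfolio start-up."""
    if not portfolio_vecs:
        raise ValueError("portfolio is empty")
    distances = [cosine_distance(candidate_vec, p) for p in portfolio_vecs]
    return sum(distances) / len(distances)


# Toy 2-dimensional "description vectors", invented for illustration only.
portfolio = [[1.0, 0.0], [0.8, 0.6]]
similar_startup = [1.0, 0.0]
different_startup = [0.0, 1.0]

dist_similar = mean_distance_to_portfolio(similar_startup, portfolio)
dist_different = mean_distance_to_portfolio(different_startup, portfolio)
```

A smaller value means the candidate's description is closer to what the investor already holds.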

■ Illustrating paper illustrates Zheng's paper

■ Preparing Real Data prepares the data and saves it to a utils.p file for later use.

■ Generate DFA prepares the DataFrame to be used in later steps.

■ Predictions tests a fine-tuned RandomForestClassifier and plots its Precision-Recall curves for different feature sets. Adding the LENGTH and W2V features gives a good improvement.
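For readers unfamiliar with Precision-Recall curves, the sketch below shows how the points of such a curve can be derived from predicted scores and true labels. It is written in plain Python for clarity and is not the notebook's code; in practice scikit-learn's `precision_recall_curve` provides this. The function name `precision_recall_points` and the toy labels and scores are assumptions.

```python
def precision_recall_points(y_true, scores):
    """Return (recall, precision) pairs, one per distinct score threshold.

    Candidates are scanned from highest to lowest score; after each
    threshold, everything scanned so far counts as a predicted link.
    """
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = []
    true_pos = false_pos = 0
    for rank, i in enumerate(order):
        if y_true[i]:
            true_pos += 1
        else:
            false_pos += 1
        is_last = rank == len(order) - 1
        # Tied scores share one threshold, so emit a point only when the score changes.
        if is_last or scores[order[rank + 1]] != scores[i]:
            recall = true_pos / total_positives
            precision = true_pos / (true_pos + false_pos)
            points.append((recall, precision))
    return points


# Toy labels and scores, invented for illustration only.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
pr_points = precision_recall_points(y_true, scores)
```

Comparing two feature sets then amounts to computing these points for each model's scores on the same test pairs and checking which curve sits higher.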

B - Insights

■ 1 Investors knowledge

Investors tend to invest in firms whose descriptions resemble those of the firms they invested in before. Comparing DIST_MEAN to a RANDOM distance shows that the start-ups in a portfolio in the second half of the dataset are related to those in the first half (the split is before and after 2016). If an investor picked firms at random, DIST_MEAN would be close to RANDOM, which was computed by randomly drawing 10,000 pairs (Di, Dj) of start-ups i and j from the dataset.
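A RANDOM baseline of this kind can be sketched as below. The sketch assumes the description vectors are available as a list, that the two start-ups in a pair are distinct, and that a fixed seed is used for reproducibility. None of these details are confirmed by the notebooks, and the name `random_baseline` is hypothetical.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def random_baseline(vectors, n_pairs=10_000, seed=0):
    """Average cosine distance over n_pairs randomly drawn pairs (Di, Dj), i != j."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    total = 0.0
    for _ in range(n_pairs):
        d_i, d_j = rng.sample(vectors, 2)  # two distinct start-ups
        total += cosine_distance(d_i, d_j)
    return total / n_pairs


# Toy description vectors, invented for illustration only.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
baseline = random_baseline(vectors)
```

An investor's DIST_MEAN falling clearly below this baseline is what supports the claim that portfolios are thematically consistent over time.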

See: Investors profile from year to another

■ 2 Feature importance

Investors profile from year to another shows that the most impactful feature is LENGTH according to SHAP feature importance.

This work builds on Zheng's paper, which you can find in /base_papers.

