Contextual proximity: #6
Replies: 1 comment 1 reply
-
I must preface that I have no real background in LLMs, graph theory, etc. I'm currently in grad school for data science, trying to learn all I can.

Contextual proximity: I'm not sure what direction this will take, but I have some thoughts I want to get out there; again, I speak with no authority, I'm just interested in this stuff. Contextual proximity certainly does a great job connecting ideas, but it also strips some of the nuance that the LLM has given us. graph.csv is the output from an input of the first 6 chapters of the book History of Economic Thought, and for reference, dfg_merged.csv is the dataframe after it has gone through the contextual proximity function. The graph G produced from it has the following centrality measures; these are the top 10 nodes from each measure.
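For concreteness, here is a minimal networkx sketch of how top-10 lists per centrality measure can be pulled out of the merged dataframe. The column names `node_1`, `node_2`, and `count` are my assumption about the schema of dfg_merged.csv, not something taken from the repo:

```python
import pandas as pd
import networkx as nx

# Assumed schema for the merged dataframe: node_1, node_2, count (edge weight).
dfg = pd.read_csv("dfg_merged.csv")

G = nx.Graph()
for _, row in dfg.iterrows():
    G.add_edge(row["node_1"], row["node_2"], weight=row["count"])

measures = {
    "degree": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
}

for name, scores in measures.items():
    top10 = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:10]
    print(f"\nTop 10 nodes by {name} centrality:")
    for node, score in top10:
        print(f"  {node}: {score:.3f}")
```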
Some of these are very good and, in my opinion (having read the book), ARE the most central concepts of this text corpus. Economic thought, state, usury, money, Aristotle, and government are good and relevant. Some are less so: Aquinas (St. Thomas), pirate, plague, Cicero's parable. These are definitely relevant, but I feel like they are being over-weighted somewhere. This same analysis on dfg1 yields much worse results, with less relevant central nodes, which somewhat proves the usefulness of contextual proximity. A possible next step could be to do some preprocessing of dfg_merged.csv. Combining similar terms is an easy one (a rough sketch of what I mean is at the end of this post); I'm not sure what else.

Visualization: I am really trying to find a justification for the graph beyond its artistic satisfaction. There is a lot of value here for quickly viewing complex relationships, but I think it has to be refined. Not sure how; I will continue thinking about this. I made an ontology of this same text with webProtege; here is a small snippet. I like this because it quickly shows the flow of ideas and contributions throughout history. It's much more expansive in total and can be seen at the link above. Watching these ideas evolve from the point of view of their contributors is very interesting to me, and I think it has applications beyond learning/teaching; I will have to do some thinking about what those applications are. I think we are way off from having an LLM do this, but maybe finding a way to incorporate hierarchical relationships could make the visualization more useful and usable as a teaching tool. Some of these same relationships are being caught by the model, which is exciting. On the knowledge graph, most of the edges are just contextual proximity, but some maintain their unique edge, and some of those are quite insightful and interesting. Could restricting the choices of relationships the model can pick from be possible? Only let it connect ideas with certain phrases/ideas, like the ones in the ontology. I have not done much prompt engineering; most of my experience with LLMs so far has been with pretraining and fine-tuning much smaller models for text classification, so I'm not sure how feasible this would be. I will try to work on this. Somehow combining contextual proximity and hierarchical relationships could open up some new paths for the visualization.
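To sketch the "combine similar terms" preprocessing I mentioned: a simple pass that merges node labels whose normalised forms are near-duplicates. A crude matcher like this only catches spelling/plural variants; pairs like "Aquinas" vs. "St. Thomas Aquinas" would still need an embedding or LLM pass. The column names and the threshold here are assumptions on my part:

```python
from difflib import SequenceMatcher

import pandas as pd

def canonicalise(labels, threshold=0.85):
    """Map each label to the first earlier label it closely matches."""
    canon, seen = {}, []
    for label in labels:
        key = label.strip().lower()
        match = next(
            (s for s in seen if SequenceMatcher(None, key, s).ratio() >= threshold),
            None,
        )
        canon[label] = match if match is not None else key
        if match is None:
            seen.append(key)
    return canon

# Assumed columns in dfg_merged.csv: node_1, node_2, edge, count.
dfg = pd.read_csv("dfg_merged.csv")
mapping = canonicalise(pd.concat([dfg["node_1"], dfg["node_2"]]).unique())
dfg["node_1"] = dfg["node_1"].map(mapping)
dfg["node_2"] = dfg["node_2"].map(mapping)

# Re-aggregate edges that now collapse onto the same canonical pair.
dfg = dfg.groupby(["node_1", "node_2", "edge"], as_index=False)["count"].sum()
```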
-
Background:
Q:
Could you explain your rationale on contextual proximity? As I understand it, contextual proximity weighs topics that occur in the same text chunk more heavily than topics not in the same chunk. Does this have a negative impact on identifying concepts that are spread out over the whole corpus and pop up throughout? Is there another way to weigh the connections between nodes that doesn't account for proximity? Would it make sense to take dfg1, feed it back through an LLM, have it group similar nodes/edges together, and go from there?
Also, what do you think is the main purpose of the visualization? In your example about healthcare in India, the largest node is "doctors" and it has a strong connection to "India"; this is on the nose and does not provide any more insight than simply reading the title of the article. Would dropping the largest 1–5% of nodes leave room for more subtle connections? The same issue happens in my graph as well, and it seems like the more interesting things happen on the edges of the graph that represent non-obvious connections between concepts. Am I missing a different interpretation of the graph?
A:
Hey Luke, you have raised several great questions and ideas here.
This idea is somewhat of a guess. Concepts that appear in the same text chunk may be related, and when we implement a RAG-based approach on the KG, it may be beneficial to map as many relations to a text chunk as possible.
Contextual proximity actually has a positive impact on identifying concepts that are spread across the text, because every chunk a concept appears in increases the degree of that concept, sometimes undesirably so.
But I too feel it is not the best way to do it.
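For reference, the rough shape of the contextual-proximity step is a self-join of the per-chunk concept dataframe on the chunk id. A minimal sketch (illustrative only; the column names `node` and `chunk_id` are stand-ins, not necessarily the exact ones in the code):

```python
import pandas as pd

def contextual_proximity(df: pd.DataFrame) -> pd.DataFrame:
    """Connect every pair of concepts that share a text chunk.

    Assumes df has one row per (node, chunk_id) occurrence.
    """
    pairs = df.merge(df, on="chunk_id", suffixes=("_1", "_2"))
    # Keep each unordered pair once and drop self-pairs.
    pairs = pairs[pairs["node_1"] < pairs["node_2"]]
    edges = pairs.groupby(["node_1", "node_2"], as_index=False).agg(
        count=("chunk_id", "nunique")
    )
    edges["edge"] = "contextual proximity"
    return edges
```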
Your idea of feeding dfg1 back to the LLM to identify non-proximity connections is excellent. We should experiment with this.
Your idea of dropping the top 1% of nodes is also very good. Some concepts, like 'India' and 'Doctors', are bound to be ubiquitous in the body of work. This can be easily implemented by identifying the outlier nodes based on their degree and removing them from the KG.
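A minimal sketch of that pruning step with networkx (the 95th-percentile cutoff is an arbitrary stand-in for "drop the top few percent of hubs"):

```python
import numpy as np
import networkx as nx

def drop_hub_nodes(G: nx.Graph, percentile: float = 95) -> nx.Graph:
    """Return a copy of G with nodes above the given degree percentile removed."""
    degrees = dict(G.degree())
    cutoff = np.percentile(list(degrees.values()), percentile)
    hubs = [n for n, d in degrees.items() if d > cutoff]
    H = G.copy()
    H.remove_nodes_from(hubs)
    return H

# e.g. drop roughly the top 1% of highest-degree nodes:
# G_pruned = drop_hub_nodes(G, percentile=99)
```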
Would you mind if we take this discussion to the GitHub repo?
Also what do you think is the main purpose of the visualization?
I feel the main purpose of visualisation is artistic gratification.
Visualisation is used here just to demonstrate the possibilities. A good use of a KG or concept graph would be to improve upon RAG or recursive RAG and create a better AI-based agent.
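As one hedged sketch of what that could look like: match the query against concept nodes, expand to their graph neighbours, and retrieve the chunks those concepts were extracted from. The `node_chunks` mapping and the naive substring matching are illustrative assumptions, not anything implemented in the repo:

```python
import networkx as nx

def graph_retrieve(G: nx.Graph, node_chunks: dict, query_terms: list, hops: int = 1) -> set:
    """Return chunk ids linked to query-matched concepts and their neighbours.

    node_chunks is an assumed mapping {concept: set(chunk_ids)} built while
    extracting concepts chunk by chunk.
    """
    frontier = {n for n in G.nodes if any(t.lower() in str(n).lower() for t in query_terms)}
    for _ in range(hops):
        frontier |= {nbr for n in frontier for nbr in G.neighbors(n)}
    chunk_ids = set()
    for n in frontier:
        chunk_ids |= set(node_chunks.get(n, ()))
    return chunk_ids

# The retrieved chunks can then be passed to the LLM as grounding context, e.g.
# chunks = graph_retrieve(G, node_chunks, ["usury", "interest"])
```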