🎉 Similar charts #3708

Marigold · 2024-12-09T09:48:39Z

A Streamlit prototype for finding charts similar to a selected one. Initially, it searched for similar charts based solely on the semantic similarity of chart titles and subtitles; later, several more "features" were added and combined into a single similarity score. It was created to provide an intuition for how well automatic chart recommendations could work. However, the selection will likely evolve into a semi-automatic process (this app could then be leveraged for labeling).

It's not yet clear what "similar chart" means. Should we show very similar charts ("narrow"), a "wide" selection, or charts that are not so similar but perhaps more interesting?

Scoring

Similarity is calculated as a weighted average of the following sub-scores (all between 0 and 1):

- title: semantic similarity between the two charts' titles
- subtitle: semantic similarity between the two charts' subtitles
- tags: 1 if the charts share at least one tag, 0 otherwise
- share_indicator: 1 if the charts share at least one indicator, 0 otherwise
- pageviews: log of the chart's 365-day pageviews, normalized to between 0 and 1
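The non-semantic sub-scores above could be computed roughly as follows. This is a minimal sketch, not the code in this PR: the function names are hypothetical, and the use of log(1 + views) with min-max normalization is an assumption (the description only says "normalized log").

```python
import math


def tags_score(tags_a, tags_b):
    """Return 1.0 if the two charts share at least one tag, else 0.0."""
    return 1.0 if set(tags_a) & set(tags_b) else 0.0


def share_indicator_score(indicators_a, indicators_b):
    """Return 1.0 if the two charts share at least one indicator, else 0.0."""
    return 1.0 if set(indicators_a) & set(indicators_b) else 0.0


def pageviews_scores(views_365d):
    """Map {slug: 365-day pageviews} to {slug: score in [0, 1]}.

    Assumption: log1p (so zero views are allowed) followed by min-max
    normalization across all charts.
    """
    logs = {slug: math.log1p(v) for slug, v in views_365d.items()}
    lo, hi = min(logs.values()), max(logs.values())
    if hi == lo:
        # All charts have the same pageviews: no signal, score everything 0.
        return {slug: 0.0 for slug in logs}
    return {slug: (x - lo) / (hi - lo) for slug, x in logs.items()}
```

The log keeps a handful of very popular charts from dominating the pageviews sub-score.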

Weights were chosen intuitively. Analytical data to make this less subjective would be helpful, but first, we need to agree on what "similar" actually means.
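For reference, the weighted average could look like the sketch below. The weight values are hypothetical placeholders chosen for illustration; the actual weights used in the app are not listed in this description.

```python
# Hypothetical weights, for illustration only (not the values used in the app).
WEIGHTS = {
    "title": 0.4,
    "subtitle": 0.1,
    "tags": 0.2,
    "share_indicator": 0.2,
    "pageviews": 0.1,
}


def similarity(sub_scores, weights=WEIGHTS):
    """Weighted average of sub-scores, each expected to be in [0, 1]."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Dividing by the total keeps the result in [0, 1] even if the
    # weights do not sum to exactly 1.
    return sum(w * sub_scores[name] for name, w in weights.items()) / total
```

Because the result is divided by the sum of the weights, the weights can be tuned independently without renormalizing them by hand.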

Diversity

Some recommendations, such as those for political-regime, show many charts that are essentially the same chart, just from different providers. While it could be useful to inform users about different providers, it would be better to present more diverse charts. Similarly, some recommended charts, such as for armed-forces-personnel, often include the same indicator, presented as a share of the total population or a similar derivation.

To address this, I asked GPT to find a set of 5 diverse charts among the top 30 similar charts. These charts are marked with a 🤖 symbol, and the reason for their selection is displayed alongside them. This approach helps improve diversity and is relatively inexpensive ($0.01 per query, so doing it for 5000 charts wouldn't be prohibitively costly).
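The diversity step can be sketched as below. This is an illustration under assumptions, not the code in this PR: the prompt wording, the JSON reply format, and the names `build_prompt`, `pick_diverse`, and `ask_llm` are all hypothetical. `ask_llm` stands for any callable that sends a prompt to the model and returns its text reply.

```python
import json


def build_prompt(selected_title, candidates, k=5):
    """Build a prompt listing the candidate charts by index."""
    lines = [f"{i}. {c['title']}" for i, c in enumerate(candidates)]
    return (
        f"The user is viewing the chart: {selected_title!r}.\n"
        f"From the numbered charts below, pick {k} that are relevant but as "
        "diverse as possible (avoid near-duplicates, such as the same "
        "indicator from different providers).\n"
        'Reply with JSON only: [{"index": <int>, "reason": <str>}, ...]\n\n'
        + "\n".join(lines)
    )


def pick_diverse(selected_title, ranked, ask_llm, k=5, pool=30):
    """Ask the model for k diverse charts among the top `pool` ranked ones."""
    candidates = ranked[:pool]
    reply = ask_llm(build_prompt(selected_title, candidates, k))
    picks, seen = [], set()
    for item in json.loads(reply):
        i = item.get("index")
        # Ignore duplicate or out-of-range indices in the model's reply.
        if isinstance(i, int) and 0 <= i < len(candidates) and i not in seen:
            seen.add(i)
            picks.append({**candidates[i], "reason": item.get("reason", "")})
    return picks[:k]
```

Passing the model call in as a function keeps the selection logic testable without network access. At the quoted $0.01 per query, running this once for each of 5000 charts would come to about $50.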

How to review

Check out a few random charts with the "Diversity with GPT" option turned on. Are these recommendations reasonable? Is GPT providing additional value? Do you have any ideas for improving either the scoring system or the UI?

Other use cases

This tool could be used to find "duplicate charts" or modified to assist with reviewing the least viewed charts.

owidbot · 2024-12-09T09:50:08Z

Quick links (staging server):

Site Dev	Site Preview	Admin	Wizard	Docs

Login: ssh owid@staging-site-similar-charts

chart-diff: ❌

0/1 reviewed charts
Modified: 0/1
New: 0/0
Rejected: 0

data-diff: ✅ No differences found

Legend: +New  ~Modified  -Removed  =Identical  Details
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet

Edited: 2024-12-23 10:05:40 UTC
Execution time: 6.97 seconds

lucasrodes · 2024-12-19T11:37:42Z

This looks very interesting. I find that the "diversity" option can be quite useful in the future, nice one!

In the future, I'd like to understand the prototype a bit more to help with the UI.

Some questions off the top of my head are listed below. No need to address them now, just for the record.

- Is it mostly for development purposes (e.g. for the engineering team) or also for folks in D&R? I may weigh in more on the UI front depending on this (which, FWIW, I'm keen to help with).
- What's the difference between the chart at the top (without borders) and those that follow? Is it like the "top match"? I'm a bit confused, since it is shown even when the search query is just free text (i.e. doesn't match a chart slug), and it doesn't show the summary table with analytics.
- I guess the above point relates to a more general question: what is this app trying to facilitate? It looks to me as if it is tackling two things now: (i) free search to find charts similar to the queried text; (ii) selecting a specific chart by slug and presenting similar charts.
  - For (i), I would not show the current top chart, since there is no actual chart "selected".
  - For (ii), I would use a dropdown selectbox with just a limited list of options (not free text).
  - Maybe we can define two modes?

Having said this, I think it's good to go with this and experiment, too.

Marigold · 2024-12-20T08:09:10Z

We'll discuss the future of this app in January. This was an initial spike intended to prove that it's feasible, not overly complex, and potentially valuable. Now the hard part will be figuring out how much we can get out of it.

> What's the difference between the chart at the top (without borders) and those that follow?

Great UI, right? 😁 The top chart is the one found by the text search on the right. If the input contains a valid slug or ID, we match on that, otherwise we match using semantic search on the title.
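The matching logic described here can be sketched as follows. It is a simplified illustration, not the app's code: `resolve_chart` and `semantic_search` are hypothetical names, and the sketch matches the whole input exactly against slugs and IDs rather than looking for a slug or ID inside it.

```python
def resolve_chart(query, charts, semantic_search):
    """Pick the chart for a query: exact slug or ID first, then semantic search.

    charts: list of dicts with "id" (int) and "slug" (str).
    semantic_search: callable taking the query text and returning the chart
    whose title is semantically closest (hypothetical helper).
    """
    q = query.strip()
    by_slug = {c["slug"]: c for c in charts}
    by_id = {c["id"]: c for c in charts}
    if q in by_slug:
        return by_slug[q]
    if q.isdecimal() and int(q) in by_id:
        return by_id[int(q)]
    # No slug or ID matched: fall back to semantic search on the title.
    return semantic_search(q)
```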

> I guess the above point relates to a more general question: what is this app trying to facilitate? It looks to me as if it is attacking two things now: (i) Free-search to find charts similar to the queried thing. (ii) Select a specific chart from a slug and present similar charts.
> For (i) I would not show the current top chart, since there is no actual chart "selected".
> For (ii), I would use a dropdown selectbox with just a limited list of options (not free-text).
> Maybe we can define two modes?

That's a good point. I'm going to remove mode (i) to make it less confusing. Free-search for charts wasn't in the scope and is not that useful (as far as I know).

(If you think this is a good start, could you approve the PR? I wouldn't want to keep it hanging as a PR for too long, and we can always delete it.)

lucasrodes

This is totally good to go, Mojmir!

I read your comment and it makes sense to me.

Thanks for pushing for this.

* 🎉 Similar charts

github-actions bot assigned Marigold Dec 9, 2024

Marigold force-pushed the similar-charts branch 7 times, most recently from 8889d92 to 6813c7b December 19, 2024 06:19

Marigold changed the base branch from master to indicator-search December 19, 2024 06:19

Marigold force-pushed the similar-charts branch 2 times, most recently from 5e560fd to 2ea4df5 December 19, 2024 06:23

Marigold requested a review from lucasrodes December 19, 2024 08:30

lucasrodes approved these changes Dec 20, 2024

Base automatically changed from indicator-search to master December 27, 2024 08:43

Marigold added 4 commits December 27, 2024 13:15

- 4bbaab9 🎉 Similar charts
- 834c34b wip
- b69bb4f wip
- 68680c7 ✨ Update chart selection to use dropdown instead of text input

Marigold force-pushed the similar-charts branch from f8fcc1b to 68680c7 December 27, 2024 11:15

Marigold merged commit 2fd4c92 into master Dec 27, 2024
6 of 8 checks passed

Marigold deleted the similar-charts branch December 27, 2024 11:33

antea04 pushed a commit that referenced this pull request Feb 5, 2025

🎉 Similar charts (#3708)

0e511a5

* 🎉 Similar charts
