🎉 Similar charts #3708

Merged
merged 4 commits into from
Dec 27, 2024


Marigold (Collaborator) commented Dec 9, 2024

A Streamlit prototype for finding charts similar to a selected one. Initially, it searched for similar charts based solely on the semantic similarity of chart titles and subtitles; later, many more "features" were added and combined into a single similarity score. It was created to provide an intuition for how well automatic chart recommendations could work. However, the selection will likely evolve into a semi-automatic process (this app could then be used for labeling).

It's not yet clear what "similar chart" means. Should we show very similar charts ("narrow"), a "wide" selection, or charts that are not so similar but perhaps more interesting?

Scoring

Similarity is calculated as a weighted average of the following sub-scores (all between 0 and 1):

  • title: semantic similarity of the title to other charts
  • subtitle: semantic similarity of the subtitle to other charts
  • tags: 1 if a chart shares at least one tag, 0 otherwise
  • share_indicator: 1 if a chart shares an indicator, 0 otherwise
  • pageviews: normalized log(365d pageviews) between 0 and 1

Weights were chosen intuitively. Analytical data to make this less subjective would be helpful, but first, we need to agree on what "similar" actually means.
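For illustration, the weighted average described above can be sketched as follows. This is a minimal sketch, not the code from this PR: the weight values and the function names (`normalize_log_pageviews`, `similarity`) are assumptions, and the pageview normalization (dividing by the log of the maximum) is only one plausible way to map log pageviews to the 0 to 1 range.

```python
import math

# Hypothetical weights; the PR only says they were "chosen intuitively".
WEIGHTS = {
    "title": 0.4,
    "subtitle": 0.1,
    "tags": 0.2,
    "share_indicator": 0.2,
    "pageviews": 0.1,
}


def normalize_log_pageviews(views, max_views):
    """Map 365-day pageviews to [0, 1] on a log scale (assumed normalization)."""
    if max_views <= 0:
        return 0.0
    return math.log1p(views) / math.log1p(max_views)


def similarity(sub_scores, weights=WEIGHTS):
    """Weighted average of sub-scores, each expected to lie in [0, 1]."""
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * sub_scores[name] for name in weights)
    return weighted_sum / total_weight


# Example: same tag, no shared indicator, moderately similar title.
score = similarity(
    {
        "title": 0.8,
        "subtitle": 0.5,
        "tags": 1.0,
        "share_indicator": 0.0,
        "pageviews": normalize_log_pageviews(1_000, 1_000_000),
    }
)
```

Dividing by the sum of the weights keeps the result between 0 and 1 even if the weights are later changed and no longer add up to 1.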

Diversity

Some recommendations, such as those for political-regime, show many charts that are essentially the same chart, just from different providers. While it could be useful to inform users about different providers, it would be better to present more diverse charts. Similarly, recommendations for some charts, such as armed-forces-personnel, often include the same indicator presented as a share of the total population or as a similar derivation.

To address this, I asked GPT to pick a set of 5 diverse charts from the top 30 similar charts. These charts are marked with a 🤖 symbol, and the reason for their selection is displayed alongside them. This approach improves diversity and is relatively inexpensive ($0.01 per query, so roughly $50 for 5000 charts, which is not prohibitively costly).
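The selection step above could look roughly like the sketch below. It is an assumption-laden illustration, not the PR's implementation: the prompt wording, the JSON response format, the dict keys (`slug`, `title`, `score`), and all function names are hypothetical, and the actual call to the model is deliberately left out.

```python
import json


def build_diversity_prompt(selected_title, candidates, pool_size=30, n_pick=5):
    """Return (prompt, pool) for the top `pool_size` candidates by score.

    `candidates` is a list of dicts with hypothetical keys
    'slug', 'title' and 'score'.
    """
    pool = sorted(candidates, key=lambda c: c["score"], reverse=True)[:pool_size]
    listing = "\n".join(f"- {c['slug']}: {c['title']}" for c in pool)
    prompt = (
        f"The selected chart is: {selected_title}\n"
        f"From the charts below, pick {n_pick} that are related to it but "
        "diverse among themselves (avoid near-duplicates from different "
        "providers or simple derivations of the same indicator).\n"
        'Answer with a JSON list of objects: {"slug": ..., "reason": ...}\n\n'
        f"{listing}"
    )
    return prompt, pool


def parse_picks(response_text, pool, n_pick=5):
    """Keep only picks whose slug is in the pool, dropping duplicates."""
    allowed = {c["slug"] for c in pool}
    picks, seen = [], set()
    for item in json.loads(response_text):
        slug = item.get("slug")
        if slug in allowed and slug not in seen:
            seen.add(slug)
            picks.append({"slug": slug, "reason": item.get("reason", "")})
    return picks[:n_pick]


def estimated_cost(n_charts, cost_per_query=0.01):
    """Total cost in dollars at the per-query price quoted in the PR."""
    return n_charts * cost_per_query
```

Validating the returned slugs against the pool guards against the model naming a chart that was never offered to it.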

How to review

Check out a few random charts with the "Diversity with GPT" option turned on. Are these recommendations reasonable? Is GPT providing additional value? Do you have any ideas for improving either the scoring system or the UI?

Other use cases

This tool could be used to find "duplicate charts" or modified to assist with reviewing the least viewed charts.

owidbot (Contributor) commented Dec 9, 2024

Quick links (staging server):

Site Dev Site Preview Admin Wizard Docs

Login: ssh owid@staging-site-similar-charts

chart-diff: ❌
  • 0/1 reviewed charts
  • Modified: 0/1
  • New: 0/0
  • Rejected: 0
data-diff: ✅ No differences found
Legend: +New  ~Modified  -Removed  =Identical  Details
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet

Automatically updated datasets matching weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included

Edited: 2024-12-23 10:05:40 UTC
Execution time: 6.97 seconds

Marigold force-pushed the similar-charts branch 7 times, most recently from 8889d92 to 6813c7b (December 19, 2024 06:19)
Marigold changed the base branch from master to indicator-search (December 19, 2024 06:19)
Marigold force-pushed the similar-charts branch 2 times, most recently from 5e560fd to 2ea4df5 (December 19, 2024 06:23)
Marigold requested a review from lucasrodes (December 19, 2024 08:30)
lucasrodes (Member) commented Dec 19, 2024

This looks very interesting. I think the "diversity" option could be quite useful in the future, nice one!

In the future, I'd like to understand the prototype a bit more to help with the UI.

Some questions off the top of my head are listed below. No need to address them now, just for the record.

  • Is it mostly for development purposes (e.g. for the engineering team) or also for folks in D&R? I may weigh in more on the UI front depending on this (which, FWIW, I'm keen to help with).
  • What's the difference between the chart at the top (without borders) and those that follow? Is it like the "top match"? I'm a bit confused, since it is shown even when the search query is just text (i.e. doesn't match a chart slug), and it doesn't show the summary table with analytics.
  • I guess the above point relates to a more general question: what is this app trying to facilitate? It looks to me as if it is attacking two things now: (i) Free-search to find charts similar to the queried thing. (ii) Select a specific chart from a slug and present similar charts.
    • For (i) I would not show the current top chart, since there is no actual chart "selected".
    • For (ii), I would use a dropdown selectbox with just a limited list of options (not free-text).
    • Maybe we can define two modes?

Having said this, I think it's good to go with this and experiment, too.

Marigold (Collaborator, Author) commented

We'll discuss the future of this app in January. This was an initial spike intended to prove that it's feasible, not overly complex, and potentially valuable. Now the hard part will be figuring out how much we can get out of it.

What's the difference between the chart at the top (without borders) and those that follow?

Great UI, right? 😁 The top chart is the one found by the text search on the right. If the input contains a valid slug or ID, we match on that; otherwise, we match using semantic search on the title.
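As a rough sketch of that lookup order (exact slug or ID first, title search as the fallback): the function names and chart fields below are hypothetical, the chart titles are made up for the example, and `toy_title_search` is a plain string-similarity stand-in, not the semantic search the app uses.

```python
from difflib import SequenceMatcher


def toy_title_search(query, charts):
    """Stand-in for semantic search: plain string similarity on titles."""
    return max(
        charts,
        key=lambda c: SequenceMatcher(None, query.lower(), c["title"].lower()).ratio(),
    )


def resolve_chart(query, charts, title_search=toy_title_search):
    """Return the chart matching a slug or ID, else the best title match."""
    q = query.strip()
    for chart in charts:
        # Assumption: only an exact slug or ID counts as a match.
        if q == chart["slug"] or q == str(chart["id"]):
            return chart
    return title_search(q, charts)


charts = [
    {"id": 1, "slug": "political-regime", "title": "Political regime"},
    {"id": 2, "slug": "armed-forces-personnel", "title": "Armed forces personnel"},
]
by_slug = resolve_chart("armed-forces-personnel", charts)
by_id = resolve_chart("1", charts)
by_title = resolve_chart("political regimes", charts)
```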

I guess the above point relates to a more general question: what is this app trying to facilitate? It looks to me as if it is attacking two things now: (i) Free-search to find charts similar to the queried thing. (ii) Select a specific chart from a slug and present similar charts.
For (i) I would not show the current top chart, since there is no actual chart "selected".
For (ii), I would use a dropdown selectbox with just a limited list of options (not free-text).
Maybe we can define two modes?

That's a good point. I'm going to remove mode (i) to make it less confusing. Free-search for charts wasn't in the scope and is not that useful (as far as I know).

(If you think this is a good start, could you approve the PR? I wouldn't want to keep it hanging as an open PR for too long, and we can always delete it.)

lucasrodes (Member) left a comment


This is totally good to go, Mojmir!

I read your comment and it makes sense to me.

Thanks for pushing for this.

Base automatically changed from indicator-search to master (December 27, 2024 08:43)
Marigold merged commit 2fd4c92 into master (Dec 27, 2024); 6 of 8 checks passed
Marigold deleted the similar-charts branch (December 27, 2024 11:33)
antea04 pushed a commit that referenced this pull request Feb 5, 2025
* 🎉 Similar charts