🎉 Similar charts #3708
Quick links (staging server):
Login: chart-diff: ❌
data-diff: ✅ No differences found
Legend: +New ~Modified -Removed =Identical
Hint: Run this locally with etl diff REMOTE data/ --include yourdataset --verbose --snippet
Automatically updated datasets matching weekly_wildfires|excess_mortality|covid|fluid|flunet|country_profile|garden/ihme_gbd/2019/gbd_risk are not included.
Edited: 2024-12-23 10:05:40 UTC
8889d92 to 6813c7b
5e560fd to 2ea4df5
This looks very interesting. I think the "diversity" option could be quite useful, nice one! Later on, I'd like to understand the prototype a bit better so I can help with the UI. Some questions off the top of my head are listed below. No need to address them now; they're just for the record.
Having said this, I think it's good to go with this and experiment, too.
We'll discuss the future of this app in January. This was an initial spike intended to prove that it's feasible, not overly complex, and potentially valuable. Now the hard part will be figuring out how much we can get out of it.
Great UI, right? 😁 The top chart is the one found by the text search on the right. If the input contains a valid slug or ID, we match on that, otherwise we match using semantic search on the title.
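A minimal sketch of that lookup order (exact slug or ID first, title search as the fallback). It is not the prototype's code: the chart records, IDs, and titles are made up for illustration, and a simple token-overlap score stands in for the semantic search, which in the app would use embeddings.

```python
# Hypothetical chart records; the IDs and titles are invented for illustration.
CHARTS = [
    {"id": 1, "slug": "political-regime", "title": "Political regime"},
    {"id": 2, "slug": "armed-forces-personnel", "title": "Armed forces personnel"},
]


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercased words; a stand-in for semantic similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def find_chart(query: str, charts=CHARTS):
    """Match on slug or ID if the input is one; otherwise search by title."""
    q = query.strip()
    for chart in charts:
        if q == chart["slug"] or q == str(chart["id"]):
            return chart
    # Fallback: the chart whose title scores highest against the query.
    return max(charts, key=lambda c: token_overlap(q, c["title"]), default=None)
```

With these records, find_chart("political-regime") and find_chart("2") resolve by slug and ID, while find_chart("armed forces") falls through to the title search.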
That's a good point. I'm going to remove mode (i) to make it less confusing. Free search for charts wasn't in scope and is not that useful (as far as I know). (If you think this is a good start, could you approve the PR? I wouldn't want to keep it hanging as an open PR for too long, and we can always delete it.)
This is totally good to go, Mojmir!
I read your comment and it makes sense to me.
Thanks for pushing for this.
f8fcc1b to 68680c7
A Streamlit prototype for finding charts similar to a selected one. Initially, it searched for similar charts based solely on the semantic similarity of chart titles and subtitles, but many more "features" were added later and combined into a similarity score. It was created to provide an intuition for how well automatic chart recommendations could work. However, the selection will likely evolve into a semi-automatic process (this app could then be leveraged for labeling).
It's not yet clear what "similar chart" means. Should we show very similar charts ("narrow"), a "wide" selection, or charts that are not so similar but perhaps more interesting?
Scoring
Similarity is calculated as a weighted average of the following sub-scores (all between 0 and 1):
log(365d pageviews), between 0 and 1
Weights were chosen intuitively. Analytical data to make this less subjective would be helpful, but first, we need to agree on what "similar" actually means.
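As a rough illustration of the weighted average, here is a minimal sketch. The sub-score names and weights are hypothetical (the actual weights were picked by hand and are not listed here), and dividing log1p of the pageviews by log1p of the maximum is just one assumed way to bring log(365d pageviews) into the 0 to 1 range.

```python
import math

# Hypothetical weights; the real ones were chosen intuitively and are not shown here.
WEIGHTS = {"title": 0.4, "subtitle": 0.2, "pageviews": 0.4}


def pageviews_score(views_365d: float, max_views_365d: float) -> float:
    """Scale log pageviews into [0, 1] (assumed normalization: divide by the log of the max)."""
    if max_views_365d <= 0:
        return 0.0
    score = math.log1p(max(views_365d, 0.0)) / math.log1p(max_views_365d)
    return min(score, 1.0)


def similarity(sub_scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of the given sub-scores, each expected to be in [0, 1]."""
    total_weight = sum(weights[name] for name in sub_scores)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(weights[name] * score for name, score in sub_scores.items())
    return weighted_sum / total_weight
```

Because the result is divided by the total weight, it stays between 0 and 1 whenever every sub-score does, so weights can be tuned without renormalizing them by hand.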
Diversity
Some recommendations, such as those for political-regime, show many charts that are essentially the same chart, just from different providers. While it could be useful to inform users about different providers, it would be better to present more diverse charts. Similarly, some recommendations, such as those for armed-forces-personnel, often include the same indicator presented as a share of the total population or a similar derivation.
To address this, I asked GPT to find a set of 5 diverse charts among the top 30 similar charts. These charts are marked with a 🤖 symbol, and the reason for their selection is displayed alongside them. This approach helps improve diversity and is relatively inexpensive ($0.01 per query, so doing it for 5000 charts wouldn't be prohibitively costly).
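A sketch of how such a request could be assembled, plus the cost arithmetic. The prompt wording and function names are hypothetical, not what the app sends, and the actual model call is left out on purpose.

```python
def build_diversity_prompt(selected_title, candidate_titles, n_pick=5, pool_size=30):
    """Build a prompt asking for n_pick diverse charts among the top pool_size candidates.

    The wording is illustrative only; candidate_titles is assumed to be sorted by similarity.
    """
    pool = candidate_titles[:pool_size]
    numbered = [f"{i}. {title}" for i, title in enumerate(pool, start=1)]
    return (
        f'The user is viewing the chart "{selected_title}".\n'
        f"From the {len(pool)} similar charts below, pick {n_pick} that are as diverse "
        "as possible. Avoid the same chart from different providers and derived "
        "versions of the same indicator. Give a one-sentence reason for each pick.\n\n"
        + "\n".join(numbered)
    )


def estimated_cost_usd(n_charts, cost_per_query=0.01):
    """Total cost assuming one query per chart at roughly $0.01 each."""
    return round(n_charts * cost_per_query, 2)
```

At about $0.01 per query, estimated_cost_usd(5000) comes to roughly $50 for one pass over 5000 charts.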
How to review
Check out a few random charts with "Diversity with GPT" turned on. Are these recommendations reasonable? Is GPT providing additional value? Do you have any ideas for improving either the scoring system or the UI?
Other use cases
This tool could be used to find "duplicate charts" or modified to assist with reviewing the least viewed charts.