Mercury

Mercury is a data enrichment service for Analogue. It is primarily used to extract rich data and images for use on Analogue (people, topics, information, etc.).

Endpoint

The live endpoint is at https://analogue-mercury.herokuapp.com/get

Pass in the parameter url with a valid URL to get data.

GET https://analogue-mercury.herokuapp.com/get?url=https://www.youtube.com/watch?v=dzqpfu5izjE
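
The target URL in the request above carries its own query string. A minimal sketch of building the request URL in Python, assuming the service accepts a percent-encoded `url` parameter (the README does not say whether encoding is required). The helper name `build_get_url` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint taken from this README.
MERCURY_GET = "https://analogue-mercury.herokuapp.com/get"


def build_get_url(target_url: str, endpoint: str = MERCURY_GET) -> str:
    """Return the request URL with target_url percent-encoded in the `url` parameter."""
    # urlencode escapes ':', '/', '?', '=' and '&' inside the value, so a
    # target URL with several query parameters is not split apart.
    return f"{endpoint}?{urlencode({'url': target_url})}"


if __name__ == "__main__":
    print(build_get_url("https://www.youtube.com/watch?v=dzqpfu5izjE"))
```

Encoding matters most when the target URL contains `&`, which would otherwise be read as the start of a second parameter on the Mercury request.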

Running locally

Install Python3 (setup guide) and follow the Flask installation guide.

  1. Create a virtualenv
python3 -m venv mercury
  2. Activate the virtualenv
source mercury/bin/activate
  3. Install requirements
pip install -r requirements.txt
  4. Add the following to your local .env file to run in debug mode
FLASK_ENV=development
  5. Run app.py from the root to start Flask
python3 app.py
  6. Copy and paste these keys into your .env file

Project Scope

From a UX perspective, the ideal solution is to return data as fast as possible when someone adds a URL. The service would return the simple data first (url, image, description, medium type) and, if the URL is new and needs to be enriched, enrich it in the background by hitting the appropriate APIs.

So maybe there are two endpoints, one with a quick response (no enriching) and one that does the full enrichment. We can discuss and figure out the best solution together.

Supports the following URLs and APIs. Example URLs linked.

| Medium | URLs | APIs |
| --- | --- | --- |
| Book | https://goodreads.com, https://amazon.com | Google Books API for data, authors, topics; OpenLibrary for image covers; Amazon solution TBD |
| Music, Podcast | Spotify (song, album), Apple (show, episode) | Spotify API; Apple TBD |
| Film, TV | IMDB (film, show, episode) | OMDB API for data; TMDB API for people, trailers, etc. |
| Art | WikiArt, Artsy | Artsy API; WikiArt API |

Quick response endpoint: /get

This endpoint will be used to get the initial data as quickly as possible. Ideally it doesn't hit any APIs, to save time for the user. It might still have to hit APIs to determine the specific medium and form type (e.g. for IMDB links, films vs. TV shows).

JSON response:

{
  title: 'url title from og or twitter or <title> tag',
  url: 'CANONICAL_URL_NORMALIZED', // shouldn't have query params, except for youtube (e.g. ?v=afdsafxxx)
  medium: 'one of the medium types mapped below',
  form: 'one of the form types mapped below',
  image: 'url to image from og or twitter tags or first image in html',
  description: 'short description from og or twitter or meta tags or first paragraph of html'
}
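
The `url` field above is described as canonical and free of query parameters, except for YouTube's `v`. A minimal sketch of that normalization using only the standard library. The `KEEP_PARAMS` allow-list and the function name `normalize_url` are assumptions for illustration, not Mercury's actual code, and the sketch does not follow redirects or read `<link rel="canonical">`.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Hypothetical allow-list: bare host -> query parameters worth keeping.
KEEP_PARAMS = {"youtube.com": ["v"]}


def normalize_url(url: str) -> str:
    """Drop the fragment and all query parameters except allow-listed ones."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Match "www.youtube.com" against the "youtube.com" entry.
    bare_host = host[4:] if host.startswith("www.") else host
    query = parse_qs(parts.query)
    kept = [
        (name, query[name][0])
        for name in KEEP_PARAMS.get(bare_host, [])
        if name in query
    ]
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


if __name__ == "__main__":
    print(normalize_url("https://www.youtube.com/watch?v=dzqpfu5izjE&t=42s"))
    print(normalize_url("https://example.com/post?utm_source=x#top"))
```

Keeping the allow-list as data makes it easy to add other hosts whose identity lives in the query string.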

Medium mapping

| Form | Medium | URLs |
| --- | --- | --- |
| video | video_link | youtube.com, vimeo.com, ted.com |
| video | film | imdb.com film url (example) |
| video | tv | imdb.com show url (example) |
| video | tv_episode | imdb.com episode url (example) |
| audio | song | spotify.com song url (example) |
| audio | album | spotify.com album url (example) |
| audio | playlist | spotify.com playlist url (example) |
| audio | podcast | spotify.com podcast show url (example) |
| audio | podcast_episode | spotify.com podcast episode url (example) |
| audio | audio_link | soundcloud.com |
| text | book | amazon.com |
| text | link | default form and medium (most urls) |
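
The mapping above could be sketched as a lookup on the URL's host and first path segment. This is an illustration under stated assumptions, not Mercury's implementation: the names `HOST_RULES`, `SPOTIFY_MEDIUMS`, and `classify` are hypothetical, and the Spotify path segments (`track`, `album`, `playlist`, `show`, `episode`) are assumed from Spotify's public URL layout. IMDB is left out on purpose, since telling films from shows and episodes may require an API call.

```python
from urllib.parse import urlsplit

# Host -> (form, medium), taken from the mapping table.
HOST_RULES = {
    "youtube.com": ("video", "video_link"),
    "vimeo.com": ("video", "video_link"),
    "ted.com": ("video", "video_link"),
    "soundcloud.com": ("audio", "audio_link"),
    "amazon.com": ("text", "book"),
}

# Assumed first path segment of a Spotify URL -> medium.
SPOTIFY_MEDIUMS = {
    "track": "song",
    "album": "album",
    "playlist": "playlist",
    "show": "podcast",
    "episode": "podcast_episode",
}

DEFAULT = ("text", "link")  # default form and medium for most URLs


def matches(host: str, domain: str) -> bool:
    """True if host is the domain itself or one of its subdomains."""
    return host == domain or host.endswith("." + domain)


def classify(url: str) -> tuple[str, str]:
    """Return (form, medium) for a URL, falling back to the default pair."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    segments = [s for s in parts.path.split("/") if s]
    if matches(host, "spotify.com"):
        if segments and segments[0] in SPOTIFY_MEDIUMS:
            return ("audio", SPOTIFY_MEDIUMS[segments[0]])
        return DEFAULT
    for domain, result in HOST_RULES.items():
        if matches(host, domain):
            return result
    return DEFAULT


if __name__ == "__main__":
    print(classify("https://www.youtube.com/watch?v=dzqpfu5izjE"))
```

Because this lookup needs no network access, it fits the goal of a fast /get response.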

Rich response endpoint: /enrich

This endpoint will be used to enrich the data (through a background job in Rails). It will provide full rich responses, including related data (e.g. authors for books from Google, director for films from IMDB).

Additional Notes

  • Leverages Open Graph and Twitter meta tags
  • The scraper will do selective parsing: it builds a parse tree only for specific tags, not for every tag in the HTML document.
  • The link URL will be sent using the GET method only, as it is faster than POST and PUT.
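
The first two notes can be illustrated with a small extractor that reacts only to `<meta>` and `<title>` tags and ignores everything else. This sketch uses the standard library's event-based `html.parser` as a stand-in for selective parsing. It builds no tree at all, and it is not necessarily the parser Mercury uses. The names `MetaExtractor`, `first_of`, and `extract` are hypothetical, and the "first image" and "first paragraph" fallbacks from the JSON response are omitted.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect <meta> property/name values and the <title> text only."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr = dict(attrs)
            key = attr.get("property") or attr.get("name")
            content = attr.get("content")
            if key and content:
                # Keep the first occurrence of each key.
                self.meta.setdefault(key.lower(), content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def first_of(meta, *keys):
    """Return the value of the first key present in meta, else None."""
    for key in keys:
        if key in meta:
            return meta[key]
    return None


def extract(html: str) -> dict:
    """Apply the og -> twitter -> plain tag fallback order to an HTML string."""
    parser = MetaExtractor()
    parser.feed(html)
    parser.close()
    m = parser.meta
    return {
        "title": first_of(m, "og:title", "twitter:title") or parser.title.strip() or None,
        "image": first_of(m, "og:image", "twitter:image"),
        "description": first_of(m, "og:description", "twitter:description", "description"),
    }


if __name__ == "__main__":
    page = '<head><title>Fallback</title><meta property="og:title" content="OG Title"></head>'
    print(extract(page))
```

Open Graph tags are checked before Twitter tags, matching the order given for the `title`, `image`, and `description` fields.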