This is the API and database for the exploratory project Parla. It is not production ready. We are currently exploring whether we can make the parliamentary documentation provided by the Abgeordnetenhaus of Berlin as open data (https://www.parlament-berlin.de/dokumente/open-data) more accessible by embedding all the data and searching it using vector similarity search. The project is heavily based on this example from the Supabase community. Built with Fastify and deployed to render.com using Docker.
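The core idea behind the search can be illustrated with a small sketch. This is not the actual implementation (the real ranking happens inside Postgres via pgvector, and embeddings come from the OpenAI API); it only shows what "vector similarity search" means: documents and queries are embedded as vectors, and results are ranked by cosine similarity.

```typescript
// Sketch only: pgvector does this ranking inside Postgres in the real project.
// The embeddings below are tiny made-up vectors for illustration.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const queryEmbedding = [0.1, 0.9, 0.2];
const chunks = [
  { id: 1, embedding: [0.1, 0.8, 0.3] },
  { id: 2, embedding: [0.9, 0.1, 0.0] },
];

// Rank document chunks by similarity to the query, most similar first.
const ranked = chunks
  .map((c) => ({ id: c.id, score: cosineSimilarity(queryEmbedding, c.embedding) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked[0].id); // the chunk most similar to the query
```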
- Docker
- vercel.com account
- supabase.com account
- openai.com account
- a running instance of the related frontend: https://github.com/technologiestiftung/parla-frontend
- a running instance of the database, defined in ./supabase
- a populated database, created using these tools: https://github.com/technologiestiftung/parla-document-processor
See .envrc.sample for the required environment variables. Hint: we use direnv (https://direnv.net/) for development environment variables.
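A minimal `.envrc` might look like the sketch below. The variable names here are illustrative assumptions only; the authoritative list is in `.envrc.sample`.

```shell
# Hypothetical example — check .envrc.sample for the real variable names.
export OPENAI_API_KEY="sk-..."                 # OpenAI account used for embeddings
export SUPABASE_URL="http://127.0.0.1:54321"   # local Supabase instance
export SUPABASE_SERVICE_ROLE_KEY="..."         # printed by `npx supabase start`
```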
Install dependencies:

```bash
npm ci
```

Set up the environment variables:

```bash
cp .envrc.sample .envrc
```

Change the variables in .envrc according to your needs and load the environment:

```bash
direnv allow
```

Start a local Supabase database:

```bash
npx supabase start
```

Run the API:

```bash
npm run dev
```

The API is now running (by default on http://127.0.0.1:8080).
Currently we deploy using Docker on render.com:

- Go to render.com
- Allow Render to access your GitHub repository
- Create a new web service (the type should be Docker)
- Populate the environment variables
- Deploy
The indices on the processed_document_chunks and processed_document_summaries tables need to be regenerated when new data arrives, because the lists parameter should be changed according to https://github.com/pgvector/pgvector. To do this, we use the pg_cron extension: https://github.com/citusdata/pg_cron. To schedule the regeneration of the indices, we create two jobs which use functions defined in the API and database definition: https://github.com/technologiestiftung/parla-api. As those jobs run for quite a long time, we have to execute them in a session wrapped in BEGIN and COMMIT, with statement_timeout set to a high value (in our case 600,000 ms = 10 min).
```sql
select cron.schedule(
  'regenerate_embedding_indices_for_summaries',
  '30 5 * * *',
  $$ BEGIN; SET statement_timeout = '600000'; select * from regenerate_embedding_indices_for_summaries(); COMMIT; $$
);

select cron.schedule(
  'regenerate_embedding_indices_for_chunks',
  '30 4 * * *',
  $$ BEGIN; SET statement_timeout = '600000'; select * from regenerate_embedding_indices_for_chunks(); COMMIT; $$
);
```
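For context on why the lists parameter must track the row count: the pgvector README recommends lists ≈ rows / 1000 for up to about one million rows, and ≈ sqrt(rows) beyond that. The sketch below illustrates that sizing rule; the actual logic lives in the database functions called by the cron jobs above, so this is an assumption about intent, not their implementation.

```typescript
// Recommended IVFFlat `lists` value per the pgvector README:
// rows / 1000 up to ~1M rows, sqrt(rows) above that.
function recommendedLists(rowCount: number): number {
  if (rowCount <= 1_000_000) {
    return Math.max(1, Math.round(rowCount / 1000));
  }
  return Math.round(Math.sqrt(rowCount));
}

console.log(recommendedLists(50_000));    // 50
console.log(recommendedLists(4_000_000)); // 2000
```

Because the recommended value grows with the table, an index built when the table was small becomes suboptimal after new documents arrive, which is why the indices are rebuilt on a schedule.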
To have feedback types and tags in the initial version, you can use this snippet (it is also present in supabase/seed.sql):

```sql
INSERT INTO feedbacks (kind, tag)
VALUES
  ('positive', NULL),
  ('negative', 'Antwort inhaltlich falsch oder missverständlich'),
  ('negative', 'Es gab einen Fehler'),
  ('negative', 'Antwort nicht ausführlich genug'),
  ('negative', 'Dokumente unpassend');
```
Run the tests:

```bash
npm t
```
Before you create a pull request, write an issue so we can discuss your changes.
Thanks goes to these wonderful people (emoji key):
- Fabian Morón Zirfas 💻 🚇 🎨 📖
- Jonas Jaszkowic 💻 🤔 📖
- Ingo Hinterding 📆 💻 🤔
This project follows the all-contributors specification. Contributions of any kind welcome!
Made by | A project by | Supported by