docsbox is a standalone service that lets you convert different types of files to PDF/A formats, including document, presentation, spreadsheet and image formats.
Set the required configuration by copying the .env.example to .env and setting up the parameters accordingly.
docker build -t oikeusministerio/common-conversion:test docsbox
docker-compose up -d
The service can be configured through the YAML file docsbox/config/config.yml. These settings can be overridden with environment variables. Some examples:
REDIS_URL - Redis Server url (default: redis://redis:6379/0)
VIA_URL - VIA service url
VIA_CERT_PATH - Certificate path for VIA connection
VIA_ALLOWED_USERS - Allowed users for VIA
GRAYLOG_HOST - Graylog server (default: localhost)
GRAYLOG_PORT - Graylog server input port (default: 12001)
GRAYLOG_PATH - Graylog server input path (default: '/gelf')
GRAYLOG_SOURCE - Graylog name for the logger Host
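
As a rough illustration of the override behaviour described above, the following sketch layers environment variables over default values. The defaults are the ones listed above; the `load_settings` helper and the `DEFAULTS` table are hypothetical names used only for this example and are not the service's actual loader, which reads docsbox/config/config.yml.

```python
import os

# Defaults copied from the list above. Variables without a documented
# default are left as None in this sketch.
DEFAULTS = {
    "REDIS_URL": "redis://redis:6379/0",
    "VIA_URL": None,
    "VIA_CERT_PATH": None,
    "VIA_ALLOWED_USERS": None,
    "GRAYLOG_HOST": "localhost",
    "GRAYLOG_PORT": "12001",
    "GRAYLOG_PATH": "/gelf",
    "GRAYLOG_SOURCE": None,
}


def load_settings(environ=None):
    """Return the defaults with any matching environment variables applied."""
    environ = os.environ if environ is None else environ
    settings = dict(DEFAULTS)
    for name in DEFAULTS:
        if name in environ:
            # An environment variable always wins over the default value.
            settings[name] = environ[name]
    return settings


if __name__ == "__main__":
    for name, value in load_settings().items():
        print(f"{name}={value}")
```

Passing a plain dictionary instead of `os.environ` makes it easy to check which values would be overridden without touching the real environment.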
For local development we use dev containers, since the LibreOffice features that we use for document conversion are very Linux dependent. You can find the dev container definition in the .devcontainer folder.
To prepare local development:
- Install the "Dev Containers" extension for VS Code or something similar.
- Build the image to use for local development, if not already done, with
docker build -t oikeusministerio/common-conversion:test docsbox
- Reopen the folder in a dev container using the .devcontainer/devcontainer.json file. It will automatically start a Redis container for the dev container to use.
- Open two consoles. In one, start the rq worker with
rq worker -c docsbox
and in the other, start the Flask app with
gunicorn --config=docsbox/config/gunicorn.conf docsbox:app
- Any modifications you make to files inside the container are automatically reflected in your local project folder.
- If packages are added or removed, or their versions are changed, run
pip3 install --break-system-packages -r docsbox/requirements.txt
to apply the changes.
You can repeat the seventh step after any modification to the code to apply the changes and test them.
A conversion can be made using VIA or by sending the file attached to the request.
If no file is attached, the conversion service connects to the VIA file service and requests the file with the given id. Testing this requires a valid VIA file id.
The service will read the file and return information about it.
If used with VIA, it will respect the Content-Type header provided by VIA.
If that header is not provided, the service will scan the file.
POST /conversion-service/get-file-type/{file_id}
If used without VIA, set the file_id to 0 and send the file in the request body.
{
convertable: true,
fileType: "Microsoft Word 2007/2010 XML",
mimeType: "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
pdfVersion: ""
}
Field | Description |
---|---|
convertable | boolean whether or not the service is able to convert the file. Will also be false if the file is already in PDF/A format |
fileType | string a human-readable representation of the file's mimetype. Will be returned only if the file is convertable by the service, otherwise will return "Unknown/Corrupted" |
mimeType | string the file's mimetype |
pdfVersion | string PDF version, if the mimetype is application/pdf, otherwise empty. Will only be returned if the service scanned the file. |
Status | Description |
---|---|
200 | OK |
400 | No file or valid VIA file id was received |
404 | File with the specified VIA file id was not found |
500 | Unhandled server error |
504 | Downloading file from VIA timed out |
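
The request described above could be issued from a client roughly as follows. This is a minimal sketch using only the Python standard library. The base URL, the helper names, and the choice to send the file as the raw request body are assumptions, as is the expectation that the response is strict JSON; adjust them to your deployment.

```python
import json
import urllib.request

# Assumption: the service is reachable at this address.
BASE_URL = "http://localhost:8000"


def build_file_type_request(file_id="0", file_bytes=None, base_url=BASE_URL):
    """Build (but do not send) a POST request to the get-file-type endpoint.

    Pass a VIA file id, or keep file_id as "0" and supply file_bytes to
    send the file itself in the request body.
    """
    url = f"{base_url}/conversion-service/get-file-type/{file_id}"
    return urllib.request.Request(url, data=file_bytes or b"", method="POST")


def get_file_type(file_id="0", file_bytes=None, base_url=BASE_URL):
    """Send the request and return the decoded response as a dict."""
    request = build_file_type_request(file_id, file_bytes, base_url)
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

A caller would typically check the `convertable` field of the returned dictionary before queueing a conversion.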
The service will queue the specified file to be converted.
If used with VIA, it will respect the Content-Type header provided by VIA. If that header is not provided and the file has not previously been scanned with the get-file-type API, the file will be scanned.
POST /conversion-service/v2/convert/{file_id}
If used without VIA, set the file_id to 0 and send the file in the request body.
You may provide additional settings through headers, all of which are optional.
Header | Description | Possible values | Default |
---|---|---|---|
Conversion-Format | One of the supported file types | ["pdf", "docx", "xlsx", "pptx", "jpeg", "png"] | pdf |
Output-Pdf-Version | The PDF version you wish to receive | [1, 2, 3] | 1 |
Via-Allowed-Users | If VIA is used, this will be the allowed users provided to VIA for a conversion result. This should be the CN of your client certificate | example.com | |
Content-Disposition | You may provide a filename for conversion service through this header | example.pdf | |
{
taskId: "123e4567-e89b-12d3-a456-426614174000",
status: "queued"
}
Field | Description |
---|---|
taskId | string UUID specific for this task, use this when polling the status of the conversion |
status | string queued or started |
Status | Description |
---|---|
200 | OK |
400 | No file or valid VIA file id was received |
404 | File with the specified VIA file id was not found |
500 | Unhandled server error |
504 | Downloading file from VIA timed out |
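
The optional headers for the convert endpoint can be assembled and validated on the client side before the request is sent. The sketch below is one way to do that. The helper names are hypothetical, the allowed values are taken from the header table above, and passing the bare filename as the Content-Disposition value follows the example value in that table rather than any confirmed server behaviour.

```python
ALLOWED_FORMATS = ("pdf", "docx", "xlsx", "pptx", "jpeg", "png")
ALLOWED_PDF_VERSIONS = (1, 2, 3)


def convert_url(base_url, file_id="0"):
    """Return the URL of the convert endpoint for a VIA file id (or "0")."""
    return f"{base_url}/conversion-service/v2/convert/{file_id}"


def build_convert_headers(conversion_format="pdf", pdf_version=1,
                          via_allowed_users=None, filename=None):
    """Return the optional request headers as a dict, validating the values."""
    if conversion_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported Conversion-Format: {conversion_format!r}")
    if pdf_version not in ALLOWED_PDF_VERSIONS:
        raise ValueError(f"unsupported Output-Pdf-Version: {pdf_version!r}")
    headers = {
        "Conversion-Format": conversion_format,
        "Output-Pdf-Version": str(pdf_version),
    }
    # The remaining headers have no default, so they are sent only when given.
    if via_allowed_users:
        headers["Via-Allowed-Users"] = via_allowed_users
    if filename:
        headers["Content-Disposition"] = filename
    return headers
```

Rejecting unsupported values locally avoids queueing a task that the service would not be able to process as requested.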
Check the status of a conversion task with the task id.
GET /conversion-service/status/{task_id}
Successful
{
taskId: "123e4567-e89b-12d3-a456-426614174000",
status: "finished",
fileType: "PDF/A",
mimeType: "application/pdf",
pdfVersion: "1A",
}
Queued
{
taskId: "123e4567-e89b-12d3-a456-426614174000",
status: "queued",
}
Field | Description |
---|---|
taskId | string UUID specific for this task, use this when polling the status of the conversion |
status | string the status of the conversion, can be queued, started, finished, failed, corrupted, non-convertable |
fileType | string a human-readable representation of the file's mimetype. Will be returned only if the file is convertable by the service, otherwise will return "Unknown/Corrupted" |
mimeType | string the file's mimetype |
pdfVersion | string PDF version, if the mimetype is application/pdf, otherwise empty. Will only be returned if the service scanned the file. |
Status | Description |
---|---|
200 | OK |
404 | No task with the specified task id was found |
500 | Unhandled server error |
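
Because a conversion is queued, a client usually polls the status endpoint until the task leaves the queued and started states. The following sketch shows one possible polling loop. The `wait_for_task` helper and its parameters are hypothetical, and treating finished, failed, corrupted and non-convertable as final states is an assumption based on the status values listed above. The `fetch_status` argument is any callable that takes a task id and returns the parsed status response as a dictionary.

```python
import time

# Assumption: these statuses are final and will not change on later polls.
TERMINAL_STATUSES = {"finished", "failed", "corrupted", "non-convertable"}


def wait_for_task(fetch_status, task_id, interval=1.0, max_attempts=60,
                  sleep=time.sleep):
    """Poll fetch_status(task_id) until the task reaches a terminal status.

    Returns the last status response, or raises TimeoutError if the task
    is still queued or started after max_attempts polls.
    """
    for _ in range(max_attempts):
        result = fetch_status(task_id)
        if result.get("status") in TERMINAL_STATUSES:
            return result
        sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} polls")
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any particular HTTP client and makes it straightforward to exercise with canned responses.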
If the conversion service is used with VIA, the converted file will also be saved in VIA and the service will return the VIA file id of the converted file. If the file was sent directly to the conversion service, the converted file is returned when it is requested.
GET /conversion-service/get-converted-file/{task_id}
{
convertable: true,
fileId: "0297b05c-5a8e-4c88-a6f2-649e3a971597",
fileName: "example.pdf",
mimeType: "application/pdf",
fileType: "PDF/A",
pdfVersion: "1A",
status: "finished",
taskId: "123e4567-e89b-12d3-a456-426614174000",
fileSize: 123456
}
Content-Type: application/pdf
Content-Disposition: attachment; filename=example.pdf
Body: file bytes
Field | Description |
---|---|
convertable | boolean always true |
fileType | string a human-readable representation of the file's mimetype. Will be returned only if the file is convertable by the service, otherwise will return "Unknown/Corrupted" |
mimeType | string the file's mimetype |
pdfVersion | string PDF version, if the mimetype is application/pdf , otherwise empty. Will only be returned if the service scanned the file, meaning that |
fileId | string the file id with which the converted file can be downloaded from VIA |
taskId | string UUID specific for this task, use this when polling the status of the conversion |
fileName | string the file name |
pdfVersion | string the PDF version of the converted file, if it was converted into a PDF |
fileSize | number the file size in bytes |
status | string always finished |
Status | Description |
---|---|
200 | OK |
404 | No task with the specified task id was found |
500 | Unhandled server error |
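
Since this endpoint answers in two different shapes, a client has to tell them apart. The sketch below distinguishes the VIA metadata response from the direct file download. The helper name is hypothetical, and it assumes that the metadata response is served with Content-Type application/json, which the description above does not state.

```python
import json
from email.message import Message


def parse_converted_file_response(content_type, content_disposition, body):
    """Classify a get-converted-file response.

    Returns {"source": "via", "metadata": {...}} for the VIA metadata shape,
    or {"source": "direct", "filename": ..., "content": ...} for file bytes.
    """
    media_type = content_type.split(";")[0].strip().lower()
    if media_type == "application/json":
        return {"source": "via", "metadata": json.loads(body)}
    # Reuse the standard library's header parser to extract the filename.
    header = Message()
    if content_disposition:
        header["Content-Disposition"] = content_disposition
    return {"source": "direct", "filename": header.get_filename(), "content": body}
```

In the VIA case the caller would then use the `fileId` from the metadata to download the converted file from VIA. In the direct case the bytes in `content` are the converted file itself.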
Tests can be run with or without VIA. If a connection to VIA is not possible, TEST_VIA must be set to False when running the tests.
The input files are stored in the /docsbox/docs/tests/inputs directory and the conversion outputs will be saved to the /docsbox/docs/tests/inputs directory.
TEST_VIA=False docker-compose -f docker-compose.yml -f docker-compose.test.yml up --exit-code-from test
Type | Format |
---|---|
Document | .docx .doc .pages .rtf .pdf .sxw .odt |
Presentation | .pptx .ppt .key .sxi .odp |
Spreadsheet | .xlsx .xls .numbers .sxc .ods |
Images | .jpg .png .tiff .webp .heif .heic |
Others | .sxd .sxg .odg |
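
A client may want to check a file name against the table above before uploading. The sketch below does this by extension only. The `input_category` helper is hypothetical, and the check is merely a convenience: the service itself decides convertibility from the Content-Type header or by scanning the file, not from the file name.

```python
from pathlib import PurePath

# Extensions copied from the table of supported input formats above.
SUPPORTED_EXTENSIONS = {
    "Document": {".docx", ".doc", ".pages", ".rtf", ".pdf", ".sxw", ".odt"},
    "Presentation": {".pptx", ".ppt", ".key", ".sxi", ".odp"},
    "Spreadsheet": {".xlsx", ".xls", ".numbers", ".sxc", ".ods"},
    "Images": {".jpg", ".png", ".tiff", ".webp", ".heif", ".heic"},
    "Others": {".sxd", ".sxg", ".odg"},
}


def input_category(filename):
    """Return the table category for a file name, or None if it is not listed."""
    suffix = PurePath(filename).suffix.lower()
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```

The comparison is case-insensitive, so a name such as Report.DOCX is matched against the .docx entry.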