PowerPoint Video Extractor #1

Open · wants to merge 26 commits into base: main

Commits (26)
a8b7487
docs: reword challenge for video extractor tool
kgoedecke Nov 22, 2024
0ed764c
chore: Dockerize bun application
pulgamecanica Nov 22, 2024
ecc96bc
docs: Update frontend README.md to add Docker instructions
pulgamecanica Nov 22, 2024
864615a
chore: Created FastAPI application directory and main
pulgamecanica Nov 22, 2024
eba5fa6
chore: Dockerized FastAPI server, Redis server + unoserver
pulgamecanica Nov 22, 2024
c7ff14d
docs: Added to README -> pulgamecanica walkthrough part 1
pulgamecanica Nov 22, 2024
325b595
chore: extract_videos.py tool is working
pulgamecanica Nov 22, 2024
d1f107d
fix: fixed typo on frontend README
pulgamecanica Nov 22, 2024
11a9c86
chore: Added __pycache__ to gitignore
pulgamecanica Nov 22, 2024
ef06331
chore: Enable celery
pulgamecanica Nov 22, 2024
ad72e1d
fix: Remove unused files
pulgamecanica Nov 22, 2024
15e3609
chore: Added frontend functionalities
pulgamecanica Nov 23, 2024
c9328f2
chore: Merge branch 'front' into pulga-challenge
pulgamecanica Nov 23, 2024
a1c3626
chore: Added CORS settings to allow frontend endpoint call
pulgamecanica Nov 23, 2024
5305d08
chore: Fixed typo and improoved videos list style
pulgamecanica Nov 23, 2024
5924827
chore: Replace PDF icon by VideoIcon
pulgamecanica Nov 23, 2024
07fbdbe
docs: Added CORS section to README
pulgamecanica Nov 23, 2024
d217cc6
feat(SLI-91): add celery worker to docker compose (#8)
pulgamecanica Dec 3, 2024
c829037
fix(sli-90): add celery queuing (#9)
pulgamecanica Dec 3, 2024
7b43885
feat(SLI-89): add robust try-except blocks (#10)
pulgamecanica Dec 3, 2024
7093ff7
feat(SLI-88) (#11)
pulgamecanica Dec 3, 2024
f367d8b
chore: implement uuid4() for unique file names (#12)
pulgamecanica Dec 3, 2024
3f132bf
chore: add versions to requirements.txt pip packages
pulgamecanica Dec 3, 2024
c29fec0
feat(SLI-85): conversion step with loading and disable states
pulgamecanica Dec 3, 2024
cbcbb13
chore: update python base image for docker
pulgamecanica Dec 3, 2024
0fdca9b
chore: add .env.local.example and implement backend-url as env var
pulgamecanica Dec 3, 2024
35 changes: 13 additions & 22 deletions README.md
@@ -1,41 +1,32 @@
# SlideSpeak coding challenge: Build a PowerPoint to PDF marketing tool
# SlideSpeak coding challenge: Build a PowerPoint Video Extractor Tool

## The challenge!

Build a front-end implementation as well as a back-end service to convert PowerPoint documents to PDF format. This
should be done by implementing a simple **Next.js** front-end that posts a file to a **Python** server. You don’t have
to do the converting logic yourself as you can use unoconv or unoserver to do this (you can see more about this in the
acceptance criteria). The front-end is also already implemented in the /frontend folder. You only need to add the
Build a front-end implementation as well as a back-end service to extract videos from PowerPoint documents. This
should be done by implementing a simple **Next.js** front-end that posts a file to a **Python** server.
The front-end is also already implemented in the /frontend folder. You only need to add the
necessary logic to switch between the steps and convert the file via the API that you're going to build.

- Webpage for the
tool: [https://slidespeak.co/free-tools/convert-powerpoint-to-pdf/](https://slidespeak.co/free-tools/convert-powerpoint-to-pdf/)
- Design: [https://www.figma.com/file/CRfT0MVMqIV8rAK6HgSnKA/SlideSpeak-Coding-Challenge?type=design&t=6m2fFDaRs72CowZH-6](https://www.figma.com/file/CRfT0MVMqIV8rAK6HgSnKA/SlideSpeak-Coding-Challenge?type=design&t=6m2fFDaRs72CowZH-6)
- The tool will be on a webpage similar to: [https://slidespeak.co/free-tools/convert-powerpoint-to-pdf/](https://slidespeak.co/free-tools/convert-powerpoint-to-pdf/)
- Figma Design: [https://www.figma.com/design/CRfT0MVMqIV8rAK6HgSnKA/SlideSpeak-Coding-Challenge?node-id=798-61](https://www.figma.com/design/CRfT0MVMqIV8rAK6HgSnKA/SlideSpeak-Coding-Challenge?node-id=798-61)

## Acceptance criteria

### Back-end API

- Should be implemented in Python.
- Converting PowerPoints to PDF can be done with `unoconv` or `unoserver` via Docker if you want to be fancy 😀. You
don’t need to implement the converting logic yourself.
- [Documentation on how to use unoconv and spawn a process](https://pypi.org/project/unoconv/)
- Note: `unoconv` is deprecated but thats ok for this challenge
- [How to use unoserver via docker](https://gist.github.com/kgoedecke/44955d0b0b1ed4112bcfd3e237e135c0), this will
create an API that you can use based on [this](https://github.com/libreofficedocker/unoserver-rest-api)
documentation.
- Using unoserver is nice-to-have (but the preferred way), if you find unoconv easier use it instead
- The API should consist of one endpoint (POST /convert), which should do the following:
1. Converts the attached file to PDF
2. Uploads the PowerPoint and PDF file to Amazon S3
- Extracting Videos from PowerPoint using a zip utility. This should support multiple processes in parallel. Preferably with a queue.
- The API should consist of one endpoint (POST /extract), which should do the following:
1. Extracts the videos from the PowerPoint
2. Uploads the videos to Amazon S3
via [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)
3. Creates a presigned URL for the user to download

[https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-presigned-urls.html](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-presigned-urls.html)

[https://medium.com/@aidan.hallett/securing-aws-s3-uploads-using-presigned-urls-aa821c13ae8d](https://medium.com/@aidan.hallett/securing-aws-s3-uploads-using-presigned-urls-aa821c13ae8d)

4. Returns the presigned S3 url to the client which allows the user to download the file (by opening the url in new
4. Returns the presigned S3 url/urls to the client which allows the user to download the file (by opening the url in new
tab)

### Front-end app
@@ -45,11 +36,11 @@

## Nice to haves / tips

- Uses unoserver to convert PowerPoint to PDF via docker compose
- Uses a queuing system like Celery and Redis
- The logic of the front-end ideally should not rely on useEffect too much since it can be difficult to track what is
happening
- Tests
- Use conventional commit message style: https://www.conventionalcommits.org/en/v1.0.0/
- Lint your code
- Keep commits clean
- If you want to be really fancy you can add queuing with Celery
- Setup with Docker Compose
6 changes: 6 additions & 0 deletions backend/.env.example
@@ -0,0 +1,6 @@
AWS_ACCESS_KEY=<your-aws-access-key>
AWS_SECRET_KEY=<your-aws-secret-key>
S3_BUCKET_NAME=<your-bucket-name>
AWS_REGION=<your-aws-bucket-region>
CELERY_BROKER_URL=redis://redis:6379/0
CELERY_BACKEND_URL=redis://redis:6379/0
3 changes: 3 additions & 0 deletions backend/.gitignore
@@ -0,0 +1,3 @@
.env
app/__pycache__
app/__pycache__/*
Member: Can you add common folders like .idea and .vscode to this?

Author: Good idea! Will implement this.

16 changes: 16 additions & 0 deletions backend/Dev.Dockerfile
@@ -0,0 +1,16 @@
FROM python:3.12-slim-bookworm

WORKDIR /app

# Install system dependencies (unzip)
RUN apt-get update && apt-get install -y unzip

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY ./app /app

# Run the FastAPI app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]
24 changes: 24 additions & 0 deletions backend/Dev.Dockerfile.celery
@@ -0,0 +1,24 @@
FROM python:3.12-slim-bookworm

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
unzip \
gcc \
libffi-dev \
musl-dev \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

# Copy & Install Python dependencies
COPY requirements.txt requirements.txt
COPY requirements-dev.txt requirements-dev.txt
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --no-cache-dir -r requirements-dev.txt

# Copy application code
COPY ./app /app

# Run Celery worker with watchdog
CMD ["watchmedo", "auto-restart", "-d", ".", "-R", "-p", "*.py", "--debug-force-polling", "--", "celery", "-A", "video_extractor.celery", "worker", "--loglevel=info", "-c", "16"]
16 changes: 16 additions & 0 deletions backend/Dockerfile
@@ -0,0 +1,16 @@
FROM python:3.12-slim-bookworm

WORKDIR /app

# Install system dependencies (unzip)
RUN apt-get update && apt-get install -y unzip

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY ./app /app

# Run the FastAPI app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--proxy-headers"]
22 changes: 22 additions & 0 deletions backend/Dockerfile.celery
@@ -0,0 +1,22 @@
FROM python:3.12-slim-bookworm

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
unzip \
gcc \
libffi-dev \
musl-dev \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY ./app /app

# Set the default command to run Celery worker
CMD ["celery", "-A", "video_extractor.celery", "worker", "--loglevel=info", "-c", "16"]
201 changes: 197 additions & 4 deletions backend/README.md
@@ -2,8 +2,201 @@

- Having Docker installed on your system.

## Running the docker container
## Running the docker containers

```bash
docker compose up --build
```
The project includes a `Makefile` to simplify common Docker Compose tasks. You can use the following commands to manage your development and production environments:

### Available Commands

| Command | Description |
|---------------|---------------------------------------------------------------------------------------------|
| `make start` | Starts the development environment using `docker-compose.yml`. |
| `make build` | Builds the Docker images and starts the development environment. |
| `make detach` | Builds the Docker images and starts the development environment in detached mode (background). |
| `make down` | Stops and removes all containers, networks, and volumes defined in `docker-compose.yml`. |
| `make prod` | Starts the production environment using `docker-compose.prod.yml`. |
| `make prod-down` | Stops and removes all containers, networks, and volumes defined in `docker-compose.prod.yml`. |

---

### How to Use

1. **Start Development Environment**:
- Run the following command to start your development environment:
```bash
make start
```
- This command will spin up the containers as defined in `docker-compose.yml`.

2. **Build and Start Containers**:
- If you’ve made changes to your Dockerfile or dependencies, rebuild the containers with:
```bash
make build
```

3. **Run in Detached Mode**:
- To run the containers in the background, use:
```bash
make detach
```

4. **Shut Down the Environment**:
- To stop and clean up all containers, networks, and volumes, run:
```bash
make down
```

5. **Start Production Environment**:
- Use this command to start the production environment defined in `docker-compose.prod.yml`:
```bash
make prod
```

6. **Shut Down Production Environment**:
- To stop and clean up the production containers, use:
```bash
make prod-down
```

***

## Pulgamecanica walkthrough


### How Are Videos Stored in PowerPoint Files?

PowerPoint files with the .pptx extension are essentially ZIP archives that follow the Office Open XML standard.

They contain various directories and XML files for slides, images, audio, and video. Videos embedded in a PowerPoint are usually stored as media files within the ppt/media folder inside the ZIP archive.
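
A quick way to see this for yourself using only the standard library (`deck.pptx` is a placeholder file name):

```python
import zipfile

# A .pptx is a ZIP archive: list everything stored under ppt/media/
with zipfile.ZipFile("deck.pptx") as pptx:
    media_files = [name for name in pptx.namelist() if name.startswith("ppt/media/")]
    print(media_files)  # e.g. ['ppt/media/image1.png', 'ppt/media/media1.mp4']
```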


***

### How SlideSpeak Extracts Videos from .pptx Files

Treat the .pptx as a ZIP archive and use the `unzip` tool to extract its contents:

```py
import subprocess

# Extract all files under ppt/media/ from the .pptx (a ZIP archive) into output_path
subprocess.run(
    ["unzip", "-j", pptx_file_path, "ppt/media/*", "-d", output_path],
    check=True,
    shell=False,
)
```

As seen here: https://github.com/SlideSpeak/image-extractor-cli/blob/30c5ad96ffbc3aaea63b01928630b3efe87e62e9/image_extractor.py#L110C9-L114C10


- `unzip`: unzips the file.
- `-j`: junks the directory structure (extracts files without retaining the folder hierarchy).
- `pptx_file_path`: path to the .pptx file.
- `ppt/media/*`: extracts only files from the `ppt/media/` directory.
- `-d output_path`: specifies the directory to extract the files into.


Any extracted file with a video extension such as .mp4, .mov, or .avi is likely a video.

Extract the relevant files: collect the video files into a temporary directory for further processing or upload, as sketched below.
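
A minimal sketch of that filtering step, assuming the archive has already been unzipped into `media_dir`. The extension list and the `find_videos` helper are illustrative, not the project's exact code:

```python
from pathlib import Path

# Extensions we treat as video files (illustrative list)
VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".wmv", ".m4v", ".mpg", ".mpeg", ".webm"}

def find_videos(media_dir: str) -> list[Path]:
    """Return paths of extracted files that look like videos."""
    return [
        path
        for path in Path(media_dir).iterdir()
        if path.suffix.lower() in VIDEO_EXTENSIONS
    ]
```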

***

### Celery

Celery runs in its own Docker container.
To use Celery, apply the task decorator to the function you wish to queue.
Call `.delay()` on the task to enqueue it; this returns an object carrying the task id.
You can then query the result backend for the task's status and result, if any, as sketched below.
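
A minimal sketch of that pattern, using the Redis broker/backend defaults from `config.py`. The app name and the `extract_videos_task` function are illustrative:

```python
from celery import Celery
from celery.result import AsyncResult

# Broker and backend match the defaults in config.py / .env.example
celery = Celery(
    "video_extractor",
    broker="redis://redis:6379/0",
    backend="redis://redis:6379/0",
)

@celery.task
def extract_videos_task(pptx_path: str) -> str:
    # ... unzip ppt/media/*, upload to S3, return a presigned URL ...
    return "https://example-bucket.s3.amazonaws.com/presigned-url"  # placeholder

# Enqueue the task; .delay() returns immediately with a task id
result = extract_videos_task.delay("shared_tmp/deck.pptx")
task_id = result.id

# Later, ask the result backend for the outcome
status = AsyncResult(task_id, app=celery).status  # e.g. PENDING, SUCCESS
```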

### Coding

Now we know what the Python service should look like.

FastAPI route:
**POST** _/extract_

- @params:
  - file: file.pptx
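
A minimal sketch of this endpoint, assuming the `extract_videos_task` from the Celery sketch above lives in `tasks.py`. The temporary path naming is illustrative:

```python
import shutil
import uuid

from fastapi import FastAPI, File, UploadFile

from tasks import extract_videos_task  # assumption: the Celery task sketched above

app = FastAPI()

@app.post("/extract")
async def extract(file: UploadFile = File(...)):
    # Save the upload under a unique name so parallel requests don't collide
    # (assumption: "shared_tmp" mirrors LOCAL_DOCUMENTS_DIR in config.py)
    tmp_path = f"shared_tmp/{uuid.uuid4()}_{file.filename}"
    with open(tmp_path, "wb") as out:
        shutil.copyfileobj(file.file, out)

    # Hand the heavy lifting to the Celery queue and return the task id
    task = extract_videos_task.delay(tmp_path)
    return {"task_id": task.id}
```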

##### NOTE:
How do you test this in Postman when the POST requires a named file parameter?
Set the request method to POST.
In the Body tab, choose form-data.
Hover over the `Key` field and select "File" from the dropdown.
Enter `file` as the key and choose the file as the value.
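
The same upload can also be exercised from Python with the `requests` library (not one of the project's dependencies; shown purely for illustration, with the server assumed to be listening on port 8000):

```python
import requests  # assumption: installed separately, only for testing the endpoint

with open("presentation.pptx", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/extract",
        files={"file": ("presentation.pptx", f)},
    )
print(resp.json())  # e.g. {"task_id": "..."}
```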

### Structure

We will use FastAPI to create an endpoint where we can `POST` .pptx files and get a response appropriate for the desired output (likely a presigned reference to the S3 bucket where we will store the videos).

```
backend/
├── app/
│ ├── __init__.py
│ ├── main.py # Entry point for FastAPI app
│ ├── tasks.py # Logic for video extraction and S3 upload
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
├── README.md
└── .env
```

### Dependencies

- fastapi: Framework for building the backend API.
- uvicorn: ASGI server for running the FastAPI application.
- boto3: AWS SDK for Python to interact with Amazon S3.
- python-multipart: Required by FastAPI to handle file uploads.
- celery: Task queue system for parallel processing.
- redis: Python client for Redis, used as the Celery message broker and result backend.
- aiofiles: Asynchronous file I/O for FastAPI when saving uploaded files.


## Zipping Videos for S3 Upload

When multiple video files are extracted from the PowerPoint presentation, the project uses Python's `shutil.make_archive()` to create a compressed ZIP file for efficient storage and upload to S3.

### How It Works

1. **Single Video**:
- If only one video is found, it is uploaded directly to S3 without compression.
- The S3 link for the video is returned.

2. **Multiple Videos**:
- If multiple videos are found, they are compressed into a single ZIP file using `shutil.make_archive()` before being uploaded to S3.
- The S3 link for the ZIP file is returned.

3. **ZIP Creation**:
- The `shutil.make_archive()` function creates a standard ZIP file that is fully compatible with ZIP tools (e.g., Windows File Explorer, macOS Finder, `unzip` command).
- Compression ensures reduced file size for faster uploads and downloads.

### Code Example

Here’s how the ZIP archive is created in the project:

```python
shutil.make_archive(zip_path.replace(".zip", ""), "zip", output_dir)
```

- **`zip_path.replace(".zip", "")`**: Defines the name of the ZIP file without the `.zip` extension (automatically added by the function).
- **`"zip"`**: Specifies the archive format (ZIP in this case).
- **`output_dir`**: The directory containing the extracted video files to include in the ZIP.
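
A minimal sketch of how the single-versus-multiple branch and the S3 upload might fit together. The bucket name, key naming, and the `upload_and_presign` helper are illustrative, not the project's exact code:

```python
import shutil
from pathlib import Path

import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # assumption: configured via S3_BUCKET_NAME in .env

def upload_and_presign(local_path: str, key: str, expires: int = 3600) -> str:
    """Upload a file to S3 and return a presigned download URL."""
    s3.upload_file(local_path, BUCKET, key)
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=expires,
    )

def publish_videos(videos: list[Path], output_dir: str, zip_path: str) -> str:
    if len(videos) == 1:
        # Single video: upload it directly, no compression
        return upload_and_presign(str(videos[0]), videos[0].name)
    # Multiple videos: bundle them into one ZIP first
    shutil.make_archive(zip_path.replace(".zip", ""), "zip", output_dir)
    return upload_and_presign(zip_path, Path(zip_path).name)
```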


### CORS

This is the current CORS setup:

```py
# main.py

app.add_middleware(
CORSMiddleware,
allow_origins=[<List of allowed origins>], # Allowed origins
allow_credentials=True,
allow_methods=["*"], # Allowed HTTP methods
allow_headers=["*"], # Allowed headers
)
```

If you want to run in production or test it on your local network, you will need to change this configuration accordingly.
Empty file added backend/app/__init__.py
25 changes: 25 additions & 0 deletions backend/app/config.py
@@ -0,0 +1,25 @@
import os

# S3 Configuration
AWS_ACCESS_KEY = os.getenv("AWS_ACCESS_KEY")
AWS_SECRET_KEY = os.getenv("AWS_SECRET_KEY")
S3_BUCKET_NAME = os.getenv("S3_BUCKET_NAME")
AWS_REGION = os.getenv("AWS_REGION")

# Celery Configuration
CELERY_BROKER_URL = os.getenv("CELERY_BROKER_URL", "redis://redis:6379/0")
CELERY_BACKEND_URL = os.getenv("CELERY_BACKEND_URL", "redis://redis:6379/0")

# Local Directories
LOCAL_DOCUMENTS_DIR = "shared_tmp"

# Task Settings
MAX_CONVERT_TRIES = 5
SOFT_TIME_LIMIT = 120
TIME_LIMIT = 300

# CORS Settings
ALLOWED_ORIGINS = [
"https://slidespeak.co",
"http://localhost:3000",
]