PowerPoint Video Extractor #1

Open · wants to merge 26 commits into base: main

Commits (26)
a8b7487
docs: reword challenge for video extractor tool
kgoedecke Nov 22, 2024
0ed764c
chore: Dockerize bun application
pulgamecanica Nov 22, 2024
ecc96bc
docs: Update frontend README.md to add Docker instructions
pulgamecanica Nov 22, 2024
864615a
chore: Created FastAPI application directory and main
pulgamecanica Nov 22, 2024
eba5fa6
chore: Dockerized FastAPI server, Redis server + unoserver
pulgamecanica Nov 22, 2024
c7ff14d
docs: Added to README -> pulgamecanica walkthrough part 1
pulgamecanica Nov 22, 2024
325b595
chore: extract_videos.py tool is working
pulgamecanica Nov 22, 2024
d1f107d
fix: fixed typo on frontend README
pulgamecanica Nov 22, 2024
11a9c86
chore: Added __pycache__ to gitignore
pulgamecanica Nov 22, 2024
ef06331
chore: Enable celery
pulgamecanica Nov 22, 2024
ad72e1d
fix: Remove unused files
pulgamecanica Nov 22, 2024
15e3609
chore: Added frontend functionalities
pulgamecanica Nov 23, 2024
c9328f2
chore: Merge branch 'front' into pulga-challenge
pulgamecanica Nov 23, 2024
a1c3626
chore: Added CORS settings to allow frontend endpoint call
pulgamecanica Nov 23, 2024
5305d08
chore: Fixed typo and improoved videos list style
pulgamecanica Nov 23, 2024
5924827
chore: Replace PDF icon by VideoIcon
pulgamecanica Nov 23, 2024
07fbdbe
docs: Added CORS section to README
pulgamecanica Nov 23, 2024
d217cc6
feat(SLI-91): add celery worker to docker compose (#8)
pulgamecanica Dec 3, 2024
c829037
fix(sli-90): add celery queuing (#9)
pulgamecanica Dec 3, 2024
7b43885
feat(SLI-89): add robust try-except blocks (#10)
pulgamecanica Dec 3, 2024
7093ff7
feat(SLI-88) (#11)
pulgamecanica Dec 3, 2024
f367d8b
chore: implement uuid4() for unique file names (#12)
pulgamecanica Dec 3, 2024
3f132bf
chore: add versions to requirements.txt pip packages
pulgamecanica Dec 3, 2024
c29fec0
feat(SLI-85): conversion step with loading and disable states
pulgamecanica Dec 3, 2024
cbcbb13
chore: update python base image for docker
pulgamecanica Dec 3, 2024
0fdca9b
chore: add .env.local.example and implement backend-url as env var
pulgamecanica Dec 3, 2024
35 changes: 13 additions & 22 deletions README.md
@@ -1,41 +1,32 @@
# SlideSpeak coding challenge: Build a PowerPoint to PDF marketing tool
# SlideSpeak coding challenge: Build a PowerPoint Video Extractor Tool

## The challenge!

Build a front-end implementation as well as a back-end service to convert PowerPoint documents to PDF format. This
should be done by implementing a simple **Next.js** front-end that posts a file to a **Python** server. You don’t have
to do the converting logic yourself as you can use unoconv or unoserver to do this (you can see more about this in the
acceptance criteria). The front-end is also already implemented in the /frontend folder. You only need to add the
Build a front-end implementation as well as a back-end service to extract videos from PowerPoint documents. This
should be done by implementing a simple **Next.js** front-end that posts a file to a **Python** server.
The front-end is also already implemented in the /frontend folder. You only need to add the
necessary logic to switch between the steps and convert the file via the API that you're going to build.

- Webpage for the
tool: [https://slidespeak.co/free-tools/convert-powerpoint-to-pdf/](https://slidespeak.co/free-tools/convert-powerpoint-to-pdf/)
- Design: [https://www.figma.com/file/CRfT0MVMqIV8rAK6HgSnKA/SlideSpeak-Coding-Challenge?type=design&t=6m2fFDaRs72CowZH-6](https://www.figma.com/file/CRfT0MVMqIV8rAK6HgSnKA/SlideSpeak-Coding-Challenge?type=design&t=6m2fFDaRs72CowZH-6)
- The tool will be on a webpage similar to: [https://slidespeak.co/free-tools/convert-powerpoint-to-pdf/](https://slidespeak.co/free-tools/convert-powerpoint-to-pdf/)
- Figma Design: [https://www.figma.com/design/CRfT0MVMqIV8rAK6HgSnKA/SlideSpeak-Coding-Challenge?node-id=798-61](https://www.figma.com/design/CRfT0MVMqIV8rAK6HgSnKA/SlideSpeak-Coding-Challenge?node-id=798-61)

## Acceptance criteria

### Back-end API

- Should be implemented in Python.
- Converting PowerPoints to PDF can be done with `unoconv` or `unoserver` via Docker if you want to be fancy 😀. You
don’t need to implement the converting logic yourself.
- [Documentation on how to use unoconv and spawn a process](https://pypi.org/project/unoconv/)
- Note: `unoconv` is deprecated but thats ok for this challenge
- [How to use unoserver via docker](https://gist.github.com/kgoedecke/44955d0b0b1ed4112bcfd3e237e135c0), this will
create an API that you can use based on [this](https://github.com/libreofficedocker/unoserver-rest-api)
documentation.
- Using unoserver is nice-to-have (but the preferred way), if you find unoconv easier use it instead
- The API should consist of one endpoint (POST /convert), which should do the following:
1. Converts the attached file to PDF
2. Uploads the PowerPoint and PDF file to Amazon S3
- Extracting Videos from PowerPoint using a zip utility. This should support multiple processes in parallel. Preferably with a queue.
- The API should consist of one endpoint (POST /extract), which should do the following:
1. Extracts the videos from the PowerPoint
2. Uploads the videos to Amazon S3
via [boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)
3. Creates a presigned URL for the user to download

[https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-presigned-urls.html](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-presigned-urls.html)

[https://medium.com/@aidan.hallett/securing-aws-s3-uploads-using-presigned-urls-aa821c13ae8d](https://medium.com/@aidan.hallett/securing-aws-s3-uploads-using-presigned-urls-aa821c13ae8d)

4. Returns the presigned S3 url to the client which allows the user to download the file (by opening the url in new
4. Returns the presigned S3 url/urls to the client which allows the user to download the file (by opening the url in new
tab)

### Front-end app
@@ -45,11 +36,11 @@

## Nice to haves / tips

- Uses unoserver to convert PowerPoint to PDF via docker compose
- Uses a queuing system like Celery and Redis
- The logic of the front-end ideally should not rely on useEffect too much since it can be difficult to track what is
happening
- Tests
- Use conventional commit message style: https://www.conventionalcommits.org/en/v1.0.0/
- Lint your code
- Keep commits clean
- If you want to be really fancy you can add queuing with Celery
- Setup with Docker Compose
6 changes: 6 additions & 0 deletions backend/.env.example
@@ -0,0 +1,6 @@
AWS_ACCESS_KEY=<your-aws-access-key>
AWS_SECRET_KEY=<your-aws-secret-key>
S3_BUCKET_NAME=<your-bucket-name>
AWS_REGION=<your-aws-bucket-region>
CELERY_BROKER_URL=redis://redis:6379/0
CELERY_BACKEND_URL=redis://redis:6379/0
3 changes: 3 additions & 0 deletions backend/.gitignore
@@ -0,0 +1,3 @@
.env
app/__pycache__
app/__pycache__/*
Member: Can you add common folders like .idea and .vscode to this?

Author: Good idea! Will implement this.

16 changes: 16 additions & 0 deletions backend/Dev.Dockerfile
@@ -0,0 +1,16 @@
FROM python:3.12-slim-bookworm

WORKDIR /app

# Install system dependencies (unzip)
RUN apt-get update && apt-get install -y unzip

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY ./app /app

# Run the FastAPI app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]
24 changes: 24 additions & 0 deletions backend/Dev.Dockerfile.celery
@@ -0,0 +1,24 @@
FROM python:3.12-slim-bookworm

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
unzip \
gcc \
libffi-dev \
musl-dev \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

# Copy & Install Python dependencies
COPY requirements.txt requirements.txt
COPY requirements-dev.txt requirements-dev.txt
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install --no-cache-dir -r requirements-dev.txt

# Copy application code
COPY ./app /app

# Run Celery worker with watchdog
CMD ["watchmedo", "auto-restart", "-d", ".", "-R", "-p", "*.py", "--debug-force-polling", "--", "celery", "-A", "video_extractor.celery", "worker", "--loglevel=info", "-c", "16"]
16 changes: 16 additions & 0 deletions backend/Dockerfile
@@ -0,0 +1,16 @@
FROM python:3.12-slim-bookworm

WORKDIR /app

# Install system dependencies (unzip)
RUN apt-get update && apt-get install -y unzip

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY ./app /app

# Run the FastAPI app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--proxy-headers"]
22 changes: 22 additions & 0 deletions backend/Dockerfile.celery
@@ -0,0 +1,22 @@
FROM python:3.12-slim-bookworm

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
unzip \
gcc \
libffi-dev \
musl-dev \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY ./app /app

# Set the default command to run Celery worker
CMD ["celery", "-A", "video_extractor.celery", "worker", "--loglevel=info", "-c", "16"]
201 changes: 197 additions & 4 deletions backend/README.md
@@ -2,8 +2,201 @@

- Having Docker installed on your system.

## Running the docker container
## Running the docker containers

```bash
docker compose up --build
```
The project includes a `Makefile` to simplify common Docker Compose tasks. You can use the following commands to manage your development and production environments:

### Available Commands

| Command | Description |
|---------------|---------------------------------------------------------------------------------------------|
| `make start` | Starts the development environment using `docker-compose.yml`. |
| `make build` | Builds the Docker images and starts the development environment. |
| `make detach` | Builds the Docker images and starts the development environment in detached mode (background). |
| `make down` | Stops and removes all containers, networks, and volumes defined in `docker-compose.yml`. |
| `make prod` | Starts the production environment using `docker-compose.prod.yml`. |
| `make prod-down` | Stops and removes all containers, networks, and volumes defined in `docker-compose.prod.yml`. |

---

### How to Use

1. **Start Development Environment**:
- Run the following command to start your development environment:
```bash
make start
```
- This command will spin up the containers as defined in `docker-compose.yml`.

2. **Build and Start Containers**:
- If you’ve made changes to your Dockerfile or dependencies, rebuild the containers with:
```bash
make build
```

3. **Run in Detached Mode**:
- To run the containers in the background, use:
```bash
make detach
```

4. **Shut Down the Environment**:
- To stop and clean up all containers, networks, and volumes, run:
```bash
make down
```

5. **Start Production Environment**:
- Use this command to start the production environment defined in `docker-compose.prod.yml`:
```bash
make prod
```

6. **Shut Down Production Environment**:
- To stop and clean up the production containers, use:
```bash
make prod-down
```

***

## Pulgamecanica walkthrough


### How Are Videos Stored in PowerPoint Files?

PowerPoint files with the .pptx extension are essentially ZIP archives that follow the Office Open XML standard.

They contain various directories and XML files for slides, images, audio, and video. Videos embedded in a PowerPoint are usually stored as media files within the ppt/media folder inside the ZIP archive.
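
A quick way to see this for yourself using only the standard library (`deck.pptx` is a placeholder file name):

```python
import zipfile

# A .pptx is a ZIP archive: list everything stored under ppt/media/
with zipfile.ZipFile("deck.pptx") as pptx:
    media_files = [name for name in pptx.namelist() if name.startswith("ppt/media/")]
    print(media_files)  # e.g. ['ppt/media/image1.png', 'ppt/media/media1.mp4']
```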


***

### How SlideSpeak Extracts Videos from .pptx Files

Treat the .pptx as a ZIP archive and use the `unzip` tool to extract its contents:

```py
import subprocess

# Extract all files under ppt/media/ from the .pptx (a ZIP archive) into output_path
subprocess.run(
    ["unzip", "-j", pptx_file_path, "ppt/media/*", "-d", output_path],
    check=True,
    shell=False,
)
```

As seen here: https://github.com/SlideSpeak/image-extractor-cli/blob/30c5ad96ffbc3aaea63b01928630b3efe87e62e9/image_extractor.py#L110C9-L114C10


- `unzip`: unzips the file.
- `-j`: junks the directory structure (extracts files without retaining the folder hierarchy).
- `pptx_file_path`: path to the .pptx file.
- `ppt/media/*`: extracts only files from the `ppt/media/` directory.
- `-d output_path`: specifies the directory to extract the files into.


Any extracted file with a video extension such as .mp4, .mov, or .avi is likely a video.

Extract the relevant files: collect the video files into a temporary directory for further processing or upload, as sketched below.
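
A minimal sketch of that filtering step, assuming the archive has already been unzipped into `media_dir`. The extension list and the `find_videos` helper are illustrative, not the project's exact code:

```python
from pathlib import Path

# Extensions we treat as video files (illustrative list)
VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".wmv", ".m4v", ".mpg", ".mpeg", ".webm"}

def find_videos(media_dir: str) -> list[Path]:
    """Return paths of extracted files that look like videos."""
    return [
        path
        for path in Path(media_dir).iterdir()
        if path.suffix.lower() in VIDEO_EXTENSIONS
    ]
```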

***

### Celery

Celery runs in its own Docker container.
To use Celery, apply the task decorator to the function you wish to queue.
Call `.delay()` on the task to enqueue it; this returns an object carrying the task id.
You can then query the result backend for the task's status and result, if any, as sketched below.
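
A minimal sketch of that pattern, using the Redis broker/backend defaults from `config.py`. The app name and the `extract_videos_task` function are illustrative:

```python
from celery import Celery
from celery.result import AsyncResult

# Broker and backend match the defaults in config.py / .env.example
celery = Celery(
    "video_extractor",
    broker="redis://redis:6379/0",
    backend="redis://redis:6379/0",
)

@celery.task
def extract_videos_task(pptx_path: str) -> str:
    # ... unzip ppt/media/*, upload to S3, return a presigned URL ...
    return "https://example-bucket.s3.amazonaws.com/presigned-url"  # placeholder

# Enqueue the task; .delay() returns immediately with a task id
result = extract_videos_task.delay("shared_tmp/deck.pptx")
task_id = result.id

# Later, ask the result backend for the outcome
status = AsyncResult(task_id, app=celery).status  # e.g. PENDING, SUCCESS
```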

### Coding

Now we know what the Python service should look like.

FastAPI route:
**POST** _/extract_

- @params:
  - file: file.pptx
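
A minimal sketch of this endpoint, assuming the `extract_videos_task` from the Celery sketch above lives in `tasks.py`. The temporary path naming is illustrative:

```python
import shutil
import uuid

from fastapi import FastAPI, File, UploadFile

from tasks import extract_videos_task  # assumption: the Celery task sketched above

app = FastAPI()

@app.post("/extract")
async def extract(file: UploadFile = File(...)):
    # Save the upload under a unique name so parallel requests don't collide
    # (assumption: "shared_tmp" mirrors LOCAL_DOCUMENTS_DIR in config.py)
    tmp_path = f"shared_tmp/{uuid.uuid4()}_{file.filename}"
    with open(tmp_path, "wb") as out:
        shutil.copyfileobj(file.file, out)

    # Hand the heavy lifting to the Celery queue and return the task id
    task = extract_videos_task.delay(tmp_path)
    return {"task_id": task.id}
```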

##### NOTE:
How do you test this in Postman when the POST requires a named file parameter?
Set the request method to POST.
In the Body tab, choose form-data.
Hover over the `Key` field and select "File" from the dropdown.
Enter `file` as the key and choose the file as the value.
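
The same upload can also be exercised from Python with the `requests` library (not one of the project's dependencies; shown purely for illustration, with the server assumed to be listening on port 8000):

```python
import requests  # assumption: installed separately, only for testing the endpoint

with open("presentation.pptx", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/extract",
        files={"file": ("presentation.pptx", f)},
    )
print(resp.json())  # e.g. {"task_id": "..."}
```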

### Structure

We will use FastAPI to create an endpoint where we can `POST` .pptx files and get a response appropriate for the desired output (likely a presigned reference to the S3 bucket where we will store the videos).

```
backend/
├── app/
│ ├── __init__.py
│ ├── main.py # Entry point for FastAPI app
│ ├── tasks.py # Logic for video extraction and S3 upload
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
├── README.md
└── .env
```

### Dependencies

- fastapi: Framework for building the backend API.
- uvicorn: ASGI server for running the FastAPI application.
- boto3: AWS SDK for Python to interact with Amazon S3.
- python-multipart: Required by FastAPI to handle file uploads.
- celery: Task queue system for parallel processing.
- redis: Python client for Redis, used as the Celery message broker and result backend.
- aiofiles: Asynchronous file I/O for FastAPI when saving uploaded files.


## Zipping Videos for S3 Upload

When multiple video files are extracted from the PowerPoint presentation, the project uses Python's `shutil.make_archive()` to create a compressed ZIP file for efficient storage and upload to S3.

### How It Works

1. **Single Video**:
- If only one video is found, it is uploaded directly to S3 without compression.
- The S3 link for the video is returned.

2. **Multiple Videos**:
- If multiple videos are found, they are compressed into a single ZIP file using `shutil.make_archive()` before being uploaded to S3.
- The S3 link for the ZIP file is returned.

3. **ZIP Creation**:
- The `shutil.make_archive()` function creates a standard ZIP file that is fully compatible with ZIP tools (e.g., Windows File Explorer, macOS Finder, `unzip` command).
- Compression ensures reduced file size for faster uploads and downloads.

### Code Example

Here’s how the ZIP archive is created in the project:

```python
shutil.make_archive(zip_path.replace(".zip", ""), "zip", output_dir)
```

- **`zip_path.replace(".zip", "")`**: Defines the name of the ZIP file without the `.zip` extension (automatically added by the function).
- **`"zip"`**: Specifies the archive format (ZIP in this case).
- **`output_dir`**: The directory containing the extracted video files to include in the ZIP.
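
A minimal sketch of how the single-versus-multiple branch and the S3 upload might fit together. The bucket name, key naming, and the `upload_and_presign` helper are illustrative, not the project's exact code:

```python
import shutil
from pathlib import Path

import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # assumption: configured via S3_BUCKET_NAME in .env

def upload_and_presign(local_path: str, key: str, expires: int = 3600) -> str:
    """Upload a file to S3 and return a presigned download URL."""
    s3.upload_file(local_path, BUCKET, key)
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=expires,
    )

def publish_videos(videos: list[Path], output_dir: str, zip_path: str) -> str:
    if len(videos) == 1:
        # Single video: upload it directly, no compression
        return upload_and_presign(str(videos[0]), videos[0].name)
    # Multiple videos: bundle them into one ZIP first
    shutil.make_archive(zip_path.replace(".zip", ""), "zip", output_dir)
    return upload_and_presign(zip_path, Path(zip_path).name)
```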


### CORS

This is the current CORS setup:

```py
# main.py

app.add_middleware(
CORSMiddleware,
allow_origins=[<List of allowed origins>], # Allowed origins
allow_credentials=True,
allow_methods=["*"], # Allowed HTTP methods
allow_headers=["*"], # Allowed headers
)
```

If you want to run in production or test it on your local network, you will need to change this configuration accordingly.
Empty file added backend/app/__init__.py
25 changes: 25 additions & 0 deletions backend/app/config.py
@@ -0,0 +1,25 @@
import os

# S3 Configuration
AWS_ACCESS_KEY = os.getenv("AWS_ACCESS_KEY")
AWS_SECRET_KEY = os.getenv("AWS_SECRET_KEY")
S3_BUCKET_NAME = os.getenv("S3_BUCKET_NAME")
AWS_REGION = os.getenv("AWS_REGION")

# Celery Configuration
CELERY_BROKER_URL = os.getenv("CELERY_BROKER_URL", "redis://redis:6379/0")
CELERY_BACKEND_URL = os.getenv("CELERY_BACKEND_URL", "redis://redis:6379/0")

# Local Directories
LOCAL_DOCUMENTS_DIR = "shared_tmp"

# Task Settings
MAX_CONVERT_TRIES = 5
SOFT_TIME_LIMIT = 120
TIME_LIMIT = 300

# CORS Settings
ALLOWED_ORIGINS = [
"https://slidespeak.co",
"http://localhost:3000",
]