feat: Chrome PDF Generator #399

maharshivpatel · 2025-02-03T09:00:06Z

This PR adds Chrome as an option for generating PDFs instead of wkhtmltopdf.

Features such as PDF header / footer (pypdf) and page numbers (chrome / js) are handled manually.

Production Test - 100 PDFs

Frappe Cloud Site - 25$

1 worker / 3-4 threads
request sent every 2 seconds

| Method | CPU Time | Average PDF Generation Time |
| --- | --- | --- |
| wkhtmltopdf | 102.58s | 1.06s |
| Chrome (Non-Persistent) | 45.18s (2.27x less CPU usage) | 0.48s (2.21x faster) |
| Chrome (Persistent) | 37.58s (2.73x less CPU usage) | 0.38s (2.79x faster) |
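The ratios in parentheses can be re-derived from the raw figures. The snippet below is only a sanity check on the arithmetic; the dictionary layout and the function name are made up for illustration, and the numbers are copied from the table.

```python
# Figures copied from the benchmark table (100 PDFs per run).
results = {
    "wkhtmltopdf": {"cpu_s": 102.58, "avg_s": 1.06},
    "Chrome (Non-Persistent)": {"cpu_s": 45.18, "avg_s": 0.48},
    "Chrome (Persistent)": {"cpu_s": 37.58, "avg_s": 0.38},
}


def ratio_vs_baseline(metric, method, baseline="wkhtmltopdf"):
    """Return baseline / method for the given metric, rounded to 2 decimals."""
    return round(results[baseline][metric] / results[method][metric], 2)
```

For example, `ratio_vs_baseline("cpu_s", "Chrome (Persistent)")` gives the CPU ratio for the persistent setup.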

(Not sure how CPU usage is calculated on Frappe Cloud.)

Note: Multi-page PDFs should further increase the efficiency gain in Chrome, as observed in local tests (not tested in production).

PDF Used - Standard, Single Page, No external resources, 1 local public file (SVG Image)

[Images not captured: CPU Usage and Average PDF Generation Time for WKHTMLTOPDF, Chrome Normal (Non Persistent), and Chrome Normal (Persistent)]

Architecture

pdf_generator/architecture.md

Chrome LifeCycle

The formats below were tested locally, and Chrome seems to be a drop-in replacement in most cases.
(It is highly recommended to test your formats with multiple documents.)

old default builder formats ( without repeating header/footer )
old default builder formats ( with repeating header/footer )
old default builder formats ( without repeating header/footer and letter head )
old default builder formats ( with repeating header/footer and letter head )
print designer formats
custom formats

1. Init with WebSocket URL
   - sets up the WebSocket URL and initializes the required attributes.
   - creates and sets the asyncio event loop.
2. Connect to the CDP WebSocket using connect().
   - opens the WebSocket connection and starts listening by running _listen in a loop until disconnected.
   - 2a. _listen() is an asyncio task that keeps listening for messages from the WebSocket connection.
     - Waits for incoming messages and passes them to _handle_message()
     - Extracts message details (method, sessionId, targetId, frameId).
     - Matches responses to pending requests using:
       - Message ID (for direct responses).
       - Composite key (method, sessionId, targetId, frameId) (for events).
     - Calls registered event listeners (registered via start_listener()) if a matching event is found.
3. Send a command using send().
   - Constructs a message with ID, method, and params
   - Stores a future (promise) in pending_messages to track the response (id and composite key).
   - Sends the message via WebSocket
   - If return_future is true, returns the future; otherwise waits for a response.
   - Handles timeouts and errors.
   - Destructures the message and returns the result and/or error.
4. Event Handling (start_listener, wait_for_event, remove_listener)
   - start_listener(): Registers a listener for CDP events (look at reference link).
   - wait_for_event(): Waits for a specific event's future to be set/fulfilled, with a timeout.
   - remove_listener(): Unregisters a listener.

In short:
- Wait for a response (handled via pending_messages).
- Listen for events (e.g., start_listener("Network.responseReceived", callback)).
- Disconnect cleanly using disconnect().
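The request/response matching described above can be sketched without a real WebSocket. This is a toy dispatcher, not the CDPSocketClient from this PR: the class name `MiniCDPClient`, the `transport_send` callable, and `demo()` are hypothetical, it matches only on message ID (the composite key for events is left out), and the "browser" is a fake that echoes the method name back.

```python
import asyncio
import itertools
import json


class MiniCDPClient:
    """Toy dispatcher: resolves a pending future when a response with the same id arrives."""

    def __init__(self, transport_send):
        self._send_raw = transport_send  # hypothetical async callable that takes a JSON string
        self._ids = itertools.count(1)
        self.pending = {}    # message id -> Future
        self.listeners = {}  # event method -> list of callbacks

    async def send(self, method, params=None, timeout=5.0):
        msg_id = next(self._ids)
        future = asyncio.get_running_loop().create_future()
        self.pending[msg_id] = future
        await self._send_raw(json.dumps({"id": msg_id, "method": method, "params": params or {}}))
        try:
            message = await asyncio.wait_for(future, timeout)
        finally:
            self.pending.pop(msg_id, None)
        return message.get("result"), message.get("error")

    def start_listener(self, method, callback):
        self.listeners.setdefault(method, []).append(callback)

    def handle_message(self, raw):
        message = json.loads(raw)
        if "id" in message:  # direct response to a command
            future = self.pending.get(message["id"])
            if future is not None and not future.done():
                future.set_result(message)
        else:  # event pushed by the browser
            for callback in self.listeners.get(message.get("method"), []):
                callback(message.get("params", {}))


async def demo():
    events = []
    client = None

    async def fake_send(raw):
        # Fake browser: answer on the next loop iteration with an echo of the method.
        request = json.loads(raw)
        reply = json.dumps({"id": request["id"], "result": {"echo": request["method"]}})
        asyncio.get_running_loop().call_soon(client.handle_message, reply)

    client = MiniCDPClient(fake_send)
    client.start_listener("Page.loadEventFired", events.append)
    client.handle_message(json.dumps({"method": "Page.loadEventFired", "params": {"timestamp": 1}}))
    result, error = await client.send("Page.enable")
    return result, error, events
```

Running `asyncio.run(demo())` exercises one event and one command round trip. In a real client, `handle_message` would be fed by the listening task rather than called directly.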

print_designer/install.py
- Added logic to install chrome headless in bench's directory.
- Added `bench setup-new-pdf-backend` command

print_designer/pdf_generator/generator.py
- Added logic to manage chrome headless.

How It Works
1. Initialize the generator → FrappePDFGenerator()
2. Find the Chromium binary → _find_chromium_executable()
3. Verify the installation → _verify_chromium_installation()
4. Start the headless Chromium process → start_chromium_process()
5. Extract the DevTools WebSocket URL → _set_devtools_url()
6. Use the WebSocket URL for communication (CDPSocketClient in cdp_connection.py)
7. Shut down Chromium when done → _close_browser()

Added a __new__ method to create only one running instance of Chromium. If _instance already exists, it returns that instead of creating a new one.

__init__ calls _initialize_chromium() if not initialized already.
- Fetches configuration settings from frappe.get_common_site_config()
- Determines the Chromium executable path using _find_chromium_executable().
- Verifies the installation with _verify_chromium_installation().
- Starts the Chromium process if devtools_url is not set.
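The single-instance behaviour described above is the usual `__new__` singleton pattern. A minimal sketch follows; `SingletonGenerator` is a hypothetical stand-in (not the PR's FrappePDFGenerator), the lock is an assumption added because the instance is shared across threads, and `_initialize_chromium` here only counts calls instead of starting a browser.

```python
import threading


class SingletonGenerator:
    """Hypothetical stand-in showing the __new__ pattern; it starts no real process."""

    _instance = None
    _lock = threading.Lock()  # assumption: guards creation when threads race

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._initialized = False
        return cls._instance

    def __init__(self):
        # __init__ runs on every SingletonGenerator() call, so guard the one-time setup.
        if self._initialized:
            return
        self.start_count = 0
        self._initialize_chromium()
        self._initialized = True

    def _initialize_chromium(self):
        # Placeholder for: read config, find binary, verify, start process.
        self.start_count += 1
```

Calling `SingletonGenerator()` twice returns the same object, and the placeholder setup runs once.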

print_designer/pdf_generator/browser.py
- The Browser class manages the entire PDF generation process.
- Initializing Browser runs the entire process.

print_designer/pdf_generator/page.py
This Page class manages one headless Chrome tab using CDP to run and generate PDFs.
- Initialize & Attach to Target.
- Enables page lifecycle events (load, DOMContentLoaded, networkIdle etc.) and sets print media emulation.
- Frame/Tab & Navigation Handling
- Retrieves the frame ID
- Logic to intercept requests and fulfill responses.
- TODO: Implement request interception for local resources.
- Navigates to a URL and waits for the page to load.
- Injects HTML content and waits for it to load.
- Runs JavaScript evaluations inside the page.
- Fetches element height for headers/footers.
- Generates a PDF and reads the output in chunks.

print_designer/print_designer/page/print_designer/update_page_no.js
- This script is injected into the page.
- It makes multiple copies of the header / footer elements and updates the page number in each header/footer.
- If no page numbers are used, it just makes one header/footer element, which is merged onto the PDF.
- TODO: maybe we should generate header / footer elements for all pages if a header / footer script is added. ref: frappe/frappe#23263
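Reading the PDF output in chunks maps onto CDP's `IO.read` / `IO.close` commands when `Page.printToPDF` is asked to return a stream. The sketch below assumes a `send(method, params)` coroutine that returns the command's result dict; that signature, `read_pdf_stream`, and the fake `make_fake_send` helper are illustrative and not taken from page.py.

```python
import asyncio
import base64


async def read_pdf_stream(send, stream_id, chunk_size=1024 * 1024):
    """Collect a CDP stream into bytes using IO.read until eof, then IO.close."""
    chunks = []
    while True:
        result = await send("IO.read", {"handle": stream_id, "size": chunk_size})
        data = result.get("data", "")
        chunks.append(base64.b64decode(data) if result.get("base64Encoded") else data.encode())
        if result.get("eof"):
            break
    await send("IO.close", {"handle": stream_id})
    return b"".join(chunks)


def make_fake_send(payload, piece=4):
    """Hypothetical stand-in for a CDP client: serves `payload` in small base64 pieces."""
    state = {"offset": 0, "closed": False}

    async def send(method, params):
        if method == "IO.close":
            state["closed"] = True
            return {}
        start = state["offset"]
        state["offset"] = start + piece
        part = payload[start:start + piece]
        return {
            "data": base64.b64encode(part).decode("ascii"),
            "base64Encoded": True,
            "eof": state["offset"] >= len(payload),
        }

    return send, state
```

Reading until `eof` and then closing the handle keeps memory use bounded per chunk and releases the stream on the browser side.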

print_designer/new_backend.py
- Function that is called from the framework to generate PDFs.

print_designer/pdf_generator/pdf_merge.py
This PDFTransformer class is responsible for merging headers and footers into a body PDF while ensuring correct positioning. It uses pypdf for PDF manipulations.
- Retrieves header, body, and footer PDFs from the browser object.
- Determines whether the header/footer are dynamic (varying per page).

Transform & Merge PDFs (transform_pdf)
- Adjusts page heights for correct positioning.
- Applies header/footer dynamically:
  - If dynamic, picks the appropriate header/footer per page.
  - If print designer mode, handles first, last, and alternating pages differently.
- Merges the transformed header/footer with body pages.

Transform Page Heights (_transform)
- Uses coordinate transformations to reposition content.

Generate Final PDF Output
- Writes the modified PDF to a byte stream.
- Returns the final PDF file data.
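The per-page header/footer choice can be illustrated with a small selection function. This is a sketch under assumptions, not the logic in pdf_merge.py: the function name, the order of pages inside a static print designer header PDF (first, odd, even, last), and the precedence of "first" and "last" over odd/even are all guesses made for the example.

```python
# Assumed page order inside a static print designer header/footer PDF.
FIRST, ODD, EVEN, LAST = range(4)


def header_page_index(page_no, total_pages, is_dynamic, is_print_designer):
    """Return which page of the header PDF goes with the 1-based body page `page_no`."""
    if is_dynamic:
        # Dynamic headers have one unique page per body page.
        return page_no - 1
    if not is_print_designer:
        # A static header in a regular format is a single page reused everywhere.
        return 0
    if page_no == 1:
        return FIRST
    if page_no == total_pages:
        return LAST
    return EVEN if page_no % 2 == 0 else ODD
```

Note that the even check uses `page_no % 2`, not a modulus by the page count.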

For local performance monitoring of subprocesses (mostly used to monitor the chrome process). This is not production code and has not been tested, but it can be used to check the resource usage of subprocesses on a local setup / machine.

The chrome browser is shared per worker, so we should not close it while any of the threads is still using it. A browserId is added when PDF generation starts and removed at the end. The browser is closed only if no browserId is present, meaning no thread is using it. The browser instance is not saved (it may contain sensitive information), as the FrappePDFGenerator instance is shared across threads in a worker.
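The "close only when no browserId is left" rule is a small reference-tracking pattern. A minimal sketch, assuming a lock-protected set of IDs; `SharedBrowserRegistry`, `acquire`, `release`, and the `close_browser` callback are hypothetical names, not the ones used in generator.py.

```python
import threading
import uuid


class SharedBrowserRegistry:
    """Tracks which generations are using the shared browser; closes it when none are left."""

    def __init__(self, close_browser):
        self._close_browser = close_browser  # hypothetical callback that shuts Chromium down
        self._active = set()
        self._lock = threading.Lock()

    def acquire(self):
        """Register one PDF generation and return its browserId."""
        browser_id = uuid.uuid4().hex
        with self._lock:
            self._active.add(browser_id)
        return browser_id

    def release(self, browser_id):
        """Unregister a generation; close the browser if it was the last user."""
        with self._lock:
            self._active.discard(browser_id)
            if not self._active:
                self._close_browser()
```

Each thread would call `acquire()` before generating and `release()` in a `finally` block, so a failed generation does not keep the browser alive.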

The current logic tries to merge the first, odd, even, and last header/footer pages (print designer) of the PDF even if they are dynamic. This is incorrect: in the case of dynamic pages, we generate unique header/footer PDFs for each page and then merge them.

We are making requests from the server to nginx to get local resources. This is not needed: we can load the resources directly from the file system, which reduces the number of requests to nginx. We intercept the request and, if the request path is under "assets/" or "files/", load the resource from the file system.
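The path check for intercepted requests could look roughly like the sketch below. The directory layout (`<sites>/assets` and `<sites>/<site>/public/files`), the function name, and the parameters are assumptions for illustration. `normpath` is used instead of resolving symlinks so that symlinked asset folders still pass the containment check.

```python
import os
from urllib.parse import unquote, urlparse


def resolve_local_resource(url, sites_dir, site_name):
    """Map an /assets/ or /files/ URL to a local path, or return None to let the request through."""
    path = unquote(urlparse(url).path)
    if path.startswith("/assets/"):
        root = os.path.join(sites_dir, "assets")
        relative = path[len("/assets/"):]
    elif path.startswith("/files/"):
        root = os.path.join(sites_dir, site_name, "public", "files")
        relative = path[len("/files/"):]
    else:
        return None
    root = os.path.normpath(root)
    # normpath collapses ".." segments without following symlinks.
    candidate = os.path.normpath(os.path.join(root, relative))
    if os.path.commonpath([root, candidate]) != root:
        return None  # refuse paths that escape the allowed directory
    return candidate
```

Anything that does not resolve to a path inside one of the two directories returns `None`, in which case the request would continue to the network as before.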

print_designer/pdf_generator/browser.py
- if header / footer are static (meaning they don't have page no or total page no)
  - start header/footer pdf generation (send printToPDF command) before setting body content.
  - after the body pdf is generated, use the stream_id of the header/footer to get the pdf data and merge it with the body pdf.
- misc:
  - added frompage class to is_page_no_used
  - split static header/footer html cloning logic in case format is print_designer (first, last, even, odd)
  - created update_header_footer_page_pd and modified update_header_footer_page

print_designer/pdf_generator/cdp_connection.py
- wait_for_event now works with a future object as well

print_designer/pdf_generator/page.py
- updated wait_for_navigate to use wait_for_event
- split get_pdf_from_stream logic from generate_pdf
- generate_pdf now returns an asyncio task that will have a future object containing stream_id if wait_for_pdf is False
- stream_id can be used to get pdf data using get_pdf_from_stream

print_designer/pdf_generator/pdf_merge.py
- only transform once if header pdf is static
- fix typo in checking if page is even (modulus should be 2 instead of no_of_pages)

print_designer/print_designer/page/print_designer/update_page_no.js
- updated to work with early header/footer generation

print_designer/pdf_generator/generator.py
- _find_chromium_executable is not used by install.py, so it doesn't need to be a class method

print_designer/install.py
- using shutil to remove chromium directory
- removed frappe methods and updated logic
- updated chromium version
- fixed linux arm url for chromium

docker/init.sh
- added setup-new-pdf-backend command

pyproject.toml
- added missing dependency websockets

- chromium shell doesn't honor preferCSSPageSize option in printToPDF
- misc:
  - added link to cdp commands documentation
  - removed check for "page-size-found" in wkhtmltopdf options
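Since the page size has to come from CSS in this case, the `@page` rule can be generated from the format's dimensions. A minimal sketch, assuming millimetre units and a helper name (`page_css`) chosen for this example; how the PR actually builds and injects the rule is not shown here.

```python
def page_css(width_mm, height_mm, margins_mm=(0, 0, 0, 0)):
    """Build an @page rule with explicit size and margins (top, right, bottom, left)."""
    top, right, bottom, left = margins_mm
    return (
        "@page { "
        f"size: {width_mm}mm {height_mm}mm; "
        f"margin: {top}mm {right}mm {bottom}mm {left}mm; "
        "}"
    )
```

For an A4 page with 15mm vertical and 10mm horizontal margins, `page_css(210, 297, (15, 10, 15, 10))` produces a single-line rule that can be placed in a `<style>` tag.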

- reverted debug=True flag in generator.py
- remove zip before finding directory to rename.
- on failure, raise an error instead of just printing a message.

print_designer used to render an empty div with the header/footer id, which caused chrome to error out when generating the pdf. Added logic to not render the header/footer div if the header/footer is empty or has a height of 0.

SVG wouldn't load properly in the browser because the Content-Type response header was not set to 'image/svg+xml'. This commit fixes that issue. Types such as jpg and png seem to work fine without setting the Content-Type response header.
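When a request is fulfilled through CDP's `Fetch.fulfillRequest`, the Content-Type travels in `responseHeaders` and the body is base64-encoded. The sketch below builds such a parameter dict; the helper names and the explicit `.svg` override table are assumptions for illustration, not code from this commit.

```python
import base64
import mimetypes
import os

# Explicit override so the SVG result does not depend on the host's MIME database.
EXPLICIT_TYPES = {".svg": "image/svg+xml"}


def guess_content_type(file_name):
    """Return a Content-Type for the file, preferring the explicit overrides."""
    extension = os.path.splitext(file_name)[1].lower()
    if extension in EXPLICIT_TYPES:
        return EXPLICIT_TYPES[extension]
    return mimetypes.guess_type(file_name)[0] or "application/octet-stream"


def build_fulfill_params(request_id, file_name, body):
    """Build the params dict for a Fetch.fulfillRequest command."""
    return {
        "requestId": request_id,
        "responseCode": 200,
        "responseHeaders": [
            {"name": "Content-Type", "value": guess_content_type(file_name)},
        ],
        "body": base64.b64encode(body).decode("ascii"),
    }
```

Setting the header for every type, rather than only for SVG, costs little and avoids relying on content sniffing.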

- GPU is not available in the production environment, so added the flag --disable-gpu
- removed the 0.1 sleep from _set_devtools_url, as it was causing a timeout in production because of the many font warnings being printed
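Chromium started with remote debugging prints a line of the form `DevTools listening on ws://...` on stderr, possibly after many unrelated warnings. A sketch of picking that line out without sleeping between reads is shown below; `find_devtools_url` is a hypothetical helper, not the PR's `_set_devtools_url`, and the sample warning text in the usage note is made up.

```python
import re

DEVTOOLS_RE = re.compile(r"DevTools listening on (ws://\S+)")


def find_devtools_url(lines):
    """Return the first DevTools WebSocket URL found in an iterable of output lines."""
    for line in lines:
        match = DEVTOOLS_RE.search(line)
        if match:
            return match.group(1)
    return None
```

With a real process, `lines` could be the text-mode `stderr` pipe of a `subprocess.Popen` object: iterating it blocks until the next line arrives, so warning lines are skipped as fast as they are printed. A caller would still want an overall timeout in case the line never appears.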

- renamed new_pdf_backend to chrome_pdf_generator
- removed duplicate create custom field patch and modified the last patch's comment to rerun it
- renamed new_backend.py to pdf_generator/pdf.py
- updated hooks
- removed unused args
- only check for chrome_pdf_generator if request path is download_pdf

createTarget (the command to create a tab in chrome) was in browser; however, a tab is represented as a page class instance in the code. So, moved it to the page class init method.

Overall flow, class diagram, and sequence diagram are added. Note: haven't added a sequence diagram for the PDFTransformer class, as the logic is simple.

Added the field in frappe framework, as some of the fw logic requires the field.

added cookies using cdp's Network.setCookie command to allow private images to be loaded in the browser
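A `Network.setCookie` command takes the cookie's name, value, and scope as params. The sketch below builds such a message; the cookie name `sid`, the helper name, and the choice of `domain`/`path`/`httpOnly` values are assumptions for illustration and may differ from what the commit sends.

```python
from urllib.parse import urlparse


def build_set_cookie_command(message_id, site_url, session_id):
    """Build a CDP Network.setCookie message for the session cookie of `site_url`."""
    return {
        "id": message_id,
        "method": "Network.setCookie",
        "params": {
            "name": "sid",  # assumed session cookie name
            "value": session_id,
            "domain": urlparse(site_url).hostname,
            "path": "/",
            "httpOnly": True,
        },
    }
```

The cookie has to be set before the page content is loaded so that image requests for private files carry the session.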

- if output (PDFWriter) is provided in args, use it
- if password is available in options, use it to encrypt the pdf

add proper command after all prerequisites are setup.

- we check whether to use the new chrome_pdf_generator in before_request.
- however, we also need the same value later in fw, so instead of running the check twice, it is added to form_dict.
- also updated fw code to only look up the chrome_pdf_generator value if it is not present in form_dict.

Initially I added scaling to match wkhtmltopdf formats. This is not correct, as wkhtmltopdf scales content if it is larger than the page size (weird). So, commented out the scaling code; if we face issues, we can make it user configurable from the print format.
misc: removed incorrect comment

- renamed hook to pdf_generator
- added chrome format for framework's default pdf formats
- updated pdf_header_footer_html function to send the correct format path (chrome) to the function
- fix as per pdf_generator field being a string

maharshivpatel · 2025-02-17T09:14:21Z

The dependent frappe PR frappe/frappe#31069 is not merged yet, which causes CI to fail.

maharshivpatel · 2025-02-19T07:00:59Z

Ideally frappe/press#2487 should be finished before merging.
However, it will only break on FC when someone chooses chrome in the pdf_generator field on a print format.

Merging this after the FW PR so people can try it locally and find bugs early.

maharshivpatel added 9 commits January 2, 2025 18:57

chore: add New PDF Backend checkbox field

50ef653

chore: Merge branch 'develop' into pdf-backend

8c94eb0

fix: updated print templates and updated utils functions for new_backend

75c4fcc

chore: added monitor_subprocess.py

0d5f092

chore: Merge branch 'develop' into pdf-backend

6b31147

maharshivpatel changed the title ~~feat"~~ feat: New PDF Backend Feb 3, 2025

fix(minor): invalid click.echo

9e58bbe

maharshivpatel added 8 commits February 4, 2025 20:09

chore: make new_pdf_backend field visible.

a2795c0

fix: pdf merge logic for dynamic pages

3600c00

chore: removed total time from measure_time

6ca22d3

chore: code formatting and cleanup

c77ae12

maharshivpatel force-pushed the pdf-backend branch 2 times, most recently from 6588a54 to 225c3c0 Compare February 5, 2025 08:53

maharshivpatel added 3 commits February 6, 2025 15:36

fix: add @page with size and margin to page

e2ef580

chore: added documenation link for chrome headless binary switches

964afb3

fix: install.py and typo in generator.py

ff3d3aa

maharshivpatel force-pushed the pdf-backend branch 2 times, most recently from 3e5b99a to 6657ff0 Compare February 10, 2025 09:58

maharshivpatel added 3 commits February 10, 2025 19:06

fix: don't render div if header/footer is empty

256257b

fix: chrome in production

d380da0

maharshivpatel changed the title ~~feat: New PDF Backend~~ feat: Chrome PDF Generator Feb 11, 2025

maharshivpatel added 2 commits February 13, 2025 17:25

chore: moved createTarget inside page class

1889a9b

chore: added architecture.md file

1777629

maharshivpatel marked this pull request as ready for review February 13, 2025 12:03

maharshivpatel added 10 commits February 13, 2025 18:40

chore: remove custom field chrome_pdf_generator

bdd72f2

chore: renamed bench command

9680cc8

chore: minor code cleanup

2736cda

fix: added cookies for private images

520b229

fix: use output (PDFWriter) and password encryption

b1a6d7b

chore: remove command from docker file.

38fa02c

chore: add chrome option in pdf_generator select feild.

cd1397a

fix: changes as per pdf_generator field

1662c4b

maharshivpatel force-pushed the pdf-backend branch from bc89265 to 1662c4b Compare February 17, 2025 09:09

maharshivpatel mentioned this pull request Feb 19, 2025

feat: Add support for Chrome Pdf Generator frappe/press#2487

Open
