feat: Chrome PDF Generator #399

Open
wants to merge 37 commits into
base: develop
from

Conversation

maharshivpatel
Collaborator

@maharshivpatel maharshivpatel commented Feb 3, 2025

This PR adds Chrome as an option for generating PDFs instead of wkhtmltopdf.

Features such as PDF header / footer (pypdf) and page numbers (chrome / js) are handled manually.

Production Test - 100 PDFs

Frappe Cloud Site - $25

  • 1 worker / 3-4 threads
  • request sent every 2 seconds
| Method | CPU Time | Average PDF Generation Time |
| --- | --- | --- |
| wkhtmltopdf | 102.58s | 1.06s |
| Chrome (Non-Persistent) | 45.18s (2.27x less CPU usage) | 0.48s (2.21x faster) |
| Chrome (Persistent) | 37.58s (2.73x less CPU usage) | 0.38s (2.79x faster) |

( I am not sure how CPU usage is calculated on Frappe Cloud. )

Note: Multi-page PDFs should make Chrome even more efficient, as observed in local tests ( not tested in production ).

PDF Used - Standard, Single Page, No external resources, 1 local public file ( SVG Image )

[Image: PDF used]

WKHTMLTOPDF

CPU Usage: [image: wkhtmltopdf cpu]

Average PDF Generation Time: [image: wkhtmltopdf avg request]

Chrome ( Non-Persistent )

CPU Usage: [image: normal chrome cpu]

Average PDF Generation Time: [image: normal chrome avg requests]

Chrome ( Persistent )

CPU Usage: [image: persistent chrome cpu]

Average PDF Generation Time: [image: persistent chrome avg requests]

Architecture

  • pdf_generator/architecture.md

Chrome LifeCycle

[Image: chrome lifecycle]

The following formats were tested locally, and Chrome seems to be a drop-in replacement in most cases
( It is highly recommended to test your formats with multiple documents )

  • old default builder formats ( without repeating header/footer )
  • old default builder formats ( with repeating header/footer )
  • old default builder formats ( without repeating header/footer and letter head )
  • old default builder formats ( with repeating header/footer and letter head )
  • print designer formats
  • custom formats

1. Init with WebSocket URL
- Sets up the WebSocket URL and initializes the required attributes.
- Creates and sets the asyncio event loop.

2. Connect to the CDP WebSocket using connect().
- Opens the WebSocket connection and starts listening by running _listen in a loop until disconnected.

    2a. _listen() is an asyncio task that keeps listening for messages from the WebSocket connection.
    - Waits for incoming messages and passes them to _handle_message().
    - Extracts message details (method, sessionId, targetId, frameId).
    - Matches responses to pending requests using:
        - Message ID (for direct responses).
        - Composite key (method, sessionId, targetId, frameId) (for events).
    - Calls registered event listeners (no. 6) if a matching event is found.

3. Send a command using send().
- Constructs a message with ID, method, and params.
- Stores a future (promise) in pending_messages to track the response (by ID and composite key).
- Sends the message via WebSocket.
    - If return_future is true, returns the future; otherwise waits for a response.
- Handles timeouts and errors.
- Destructures the message and returns the result and/or error.

4. Event handling (start_listener, wait_for_event, remove_listener)
- start_listener(): Registers a listener for CDP events (see the reference link).
- wait_for_event(): Waits, with a timeout, for a specific event's future to be fulfilled.
- remove_listener(): Unregisters a listener.

5. Wait for a response (handled via pending_messages).
6. Listen for events (e.g., start_listener("Network.responseReceived", callback)).
7. Disconnect cleanly using disconnect().
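
A minimal sketch of the request/response matching described above, assuming a hypothetical `transport_send` coroutine in place of the real WebSocket. `MiniCDPClient` is illustrative only and is not the PR's `CDPSocketClient`; it matches replies by message ID and dispatches events by method name, leaving out the composite key.

```python
import asyncio
import itertools
import json


class MiniCDPClient:
    """Illustrative client: matches CDP replies to requests by message id."""

    def __init__(self, transport_send):
        # transport_send is a hypothetical coroutine that writes one string
        # to the WebSocket; the real client owns the connection itself.
        self._transport_send = transport_send
        self._ids = itertools.count(1)
        self.pending_messages = {}  # message id -> Future
        self.listeners = {}         # event method -> list of callbacks

    async def send(self, method, params=None, timeout=5.0):
        message_id = next(self._ids)
        future = asyncio.get_running_loop().create_future()
        # Register the future before sending so a fast reply is never missed.
        self.pending_messages[message_id] = future
        payload = {"id": message_id, "method": method, "params": params or {}}
        await self._transport_send(json.dumps(payload))
        try:
            reply = await asyncio.wait_for(future, timeout)
        finally:
            self.pending_messages.pop(message_id, None)
        if "error" in reply:
            raise RuntimeError(reply["error"])
        return reply.get("result", {})

    def start_listener(self, method, callback):
        self.listeners.setdefault(method, []).append(callback)

    def handle_message(self, raw):
        message = json.loads(raw)
        if "id" in message:
            # Direct response: resolve the matching pending future.
            future = self.pending_messages.get(message["id"])
            if future is not None and not future.done():
                future.set_result(message)
        else:
            # Event: call every listener registered for this method.
            for callback in self.listeners.get(message.get("method"), []):
                callback(message.get("params", {}))
```

Registering the future before the message is written is the important ordering here: a reply that arrives immediately still finds its entry in `pending_messages`.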
print_designer/install.py
- Added logic to install chrome headless in bench's directory.

- Added `bench setup-new-pdf-backend` command

print_designer/pdf_generator/generator.py
- Added logic to manage chrome headless.

How It Works
1. Initialize the generator → FrappePDFGenerator()
2. Find the Chromium binary → _find_chromium_executable()
3. Verify the installation → _verify_chromium_installation()
4. Start the headless Chromium process → start_chromium_process()
5. Extract the DevTools WebSocket URL → _set_devtools_url()
6. Use the WebSocket url for communication (CDPSocketClient in cdp_connection.py)
7. Shut down Chromium when done → _close_browser()
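
Step 5 can be sketched as follows, assuming Chromium announces its endpoint on stderr with a line of the form `DevTools listening on ws://...` when remote debugging is enabled. `extract_devtools_url` is a hypothetical helper for illustration, not the PR's `_set_devtools_url()`.

```python
import re

# Chromium prints this line to stderr once the debugging endpoint is ready.
DEVTOOLS_LINE = re.compile(r"DevTools listening on (ws://\S+)")


def extract_devtools_url(stderr_lines):
    """Return the first DevTools WebSocket URL found in the output, or None."""
    for line in stderr_lines:
        match = DEVTOOLS_LINE.search(line)
        if match:
            return match.group(1)
    return None
```

Scanning line by line, rather than sleeping and reading once, keeps the lookup working even when many unrelated warnings are printed first.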

Added a __new__ method so that only one running instance of Chromium is created.
If _instance already exists, it is returned instead of creating a new one.
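
A minimal sketch of this singleton pattern. `SingletonGenerator`, the lock, and the `start_count` counter are assumptions made for illustration; the counter stands in for starting Chromium.

```python
import threading


class SingletonGenerator:
    """Illustrative singleton: every call returns the same instance."""

    _instance = None
    _lock = threading.Lock()  # assumption: guards creation across threads

    def __new__(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._initialized = False
        return cls._instance

    def __init__(self):
        # __init__ runs on every call, so skip the setup after the first time.
        if self._initialized:
            return
        self.start_count = 1  # stand-in for starting the Chromium process
        self._initialized = True
```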

__init__ calls _initialize_chromium() if not initialized already.
    - Fetches configuration settings from frappe.get_common_site_config()
    - Determines the Chromium executable path using _find_chromium_executable().
    - Verifies the installation with _verify_chromium_installation().
    - Starts chromium process if devtools_url is not set.
print_designer/pdf_generator/browser.py
- The Browser class manages the entire PDF generation process.
- Initializing Browser runs entire process.

print_designer/pdf_generator/page.py
The Page class manages a single headless Chrome tab using CDP to run and generate PDFs.
- Initializes and attaches to the target.
- Enables page lifecycle events ( load, DOMContentLoaded, networkIdle etc. ) and sets print media emulation.
- Frame/Tab & Navigation Handling
- Retrieves the frame ID.
- Intercepts requests and fulfills responses.
    - TODO: Implement request interception for local resources.
- Navigates to a URL and waits for the page to load.
- Injects HTML content and waits for it to load.
- Runs JavaScript evaluations inside the page.
- Fetches element heights for headers/footers.
- Generates a PDF and reads the output in chunks.
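
Reading the PDF output in chunks can be sketched roughly as below, assuming the PDF was requested as a stream and that `send(method, params)` is a coroutine returning the CDP result dict. `read_pdf_stream` and `send` are hypothetical names; `IO.read` and `IO.close` are the CDP commands for stream handles.

```python
import asyncio
import base64


async def read_pdf_stream(send, stream_handle, chunk_size=1 << 20):
    """Read a CDP stream to the end and return its bytes."""
    chunks = []
    while True:
        reply = await send("IO.read", {"handle": stream_handle, "size": chunk_size})
        data = reply.get("data", "")
        # Binary streams such as PDFs arrive base64 encoded.
        if reply.get("base64Encoded"):
            chunks.append(base64.b64decode(data))
        else:
            chunks.append(data.encode())
        if reply.get("eof"):
            break
    # Release the stream handle once everything has been read.
    await send("IO.close", {"handle": stream_handle})
    return b"".join(chunks)
```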

print_designer/print_designer/page/print_designer/update_page_no.js
- This script is injected into the page.
- It makes multiple copies of the header / footer elements and updates the page number in each copy.
- If no page numbers are used, it makes just one header/footer element, which is then merged onto the PDF.
    - TODO: maybe we should generate header / footer elements for all pages if header / footer script is added
        ref: frappe/frappe#23263
print_designer/new_backend.py
Function that the framework calls to generate PDFs.

print_designer/pdf_generator/pdf_merge.py
This PDFTransformer class is responsible for merging headers and footers into a body PDF while ensuring correct positioning.
It uses pypdf for PDF manipulations.

- Retrieves header, body, and footer PDFs from the browser object.
- Determines whether the header/footer are dynamic (varying per page).

Transform & Merge PDFs (transform_pdf)
- Adjusts page heights for correct positioning.
- Applies header/footer dynamically:
    - If dynamic, picks the appropriate header/footer per page.
    - If print designer mode, handles first, last, and alternating pages differently.
- Merges the transformed header/footer with body pages.

Transform Page Heights (_transform)
- Uses coordinate transformations to reposition content.

Generate Final PDF Output
- Writes the modified PDF to a byte stream.
- Returns the final PDF file data.
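
For the static print designer case, picking the header/footer variant per page might look like the sketch below. `header_variant` is a hypothetical helper, and giving "first" priority over "last" on a single-page document is an assumption, not something the PR states.

```python
def header_variant(page_no, total_pages):
    """Return which static header/footer variant applies to a 1-based page number."""
    if page_no == 1:
        return "first"  # assumption: "first" wins on a single-page document
    if page_no == total_pages:
        return "last"
    # Alternating pages: parity of the page number itself.
    return "even" if page_no % 2 == 0 else "odd"
```

In the dynamic case no such lookup is needed, since each body page has its own header/footer page at the same index.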
For local performance monitoring of subprocesses ( mostly used to monitor the Chrome process ).
This is not production code and has not been tested, but it can be used to check subprocess usage on a local setup / machine.
@maharshivpatel maharshivpatel changed the title feat" feat: New PDF Backend Feb 3, 2025

The Chrome browser is shared per worker, so it must not be closed while any thread is still using it.
A browserId is added when PDF generation starts and removed when it ends.
The browser is closed only when no browserId remains, meaning no thread is using it.
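
A minimal sketch of this bookkeeping, assuming a hypothetical `close_browser` callable and a lock around the id set. `SharedBrowserTracker` is an illustrative name, not a class from the PR.

```python
import threading
import uuid


class SharedBrowserTracker:
    """Closes the shared browser only when no thread holds a browserId."""

    def __init__(self, close_browser):
        self._close_browser = close_browser  # hypothetical shutdown callable
        self._active_ids = set()
        self._lock = threading.Lock()

    def acquire(self):
        """Register one PDF generation and return its browserId."""
        browser_id = uuid.uuid4().hex
        with self._lock:
            self._active_ids.add(browser_id)
        return browser_id

    def release(self, browser_id):
        """Remove the id; close the browser if it was the last one."""
        with self._lock:
            self._active_ids.discard(browser_id)
            if self._active_ids:
                return False
            # Closing inside the lock prevents a new acquire() racing the shutdown.
            self._close_browser()
            return True
```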

The browser instance is not saved on FrappePDFGenerator, because that instance is shared across threads in a worker and the browser may contain sensitive information.
The current logic tries to merge the first, odd, even, and last footer pages (print designer) of the PDF even when they are dynamic.

This is incorrect: in the dynamic case we generate a unique header/footer PDF for each page and then merge them.
We were making requests from the server to nginx to get local resources.
This is not needed; we can load the resources directly from the file system,
which reduces the number of requests to nginx.

We intercept each request and, if the request path is under "assets/" or "files/", load the resource from the file system.
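
The path check could be sketched like this. `local_resource_path` and the `public_root` layout are assumptions for illustration; the real directory mapping depends on the bench and site setup.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit

LOCAL_PREFIXES = ("assets", "files")


def local_resource_path(url, public_root):
    """Map an /assets/... or /files/... URL to a path under public_root, else None."""
    request_path = unquote(urlsplit(url).path)
    segments = [part for part in PurePosixPath(request_path).parts if part != "/"]
    if not segments or segments[0] not in LOCAL_PREFIXES:
        return None  # not a local resource: let the request go to the network
    if ".." in segments:
        return None  # refuse path traversal outside public_root
    return str(PurePosixPath(public_root).joinpath(*segments))
```

Rejecting `..` segments matters once files are read straight from disk, because nginx is no longer in the path to normalize the request.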
print_designer/pdf_generator/browser.py
    - if header / footer are static (meaning they don't have page no or total page no )
        - start header/footer pdf generation (send printToPDF command) before setting body content.
        - after body pdf is generated using stream_id of header/footer get pdf data, merge it with body pdf.
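
The ordering for static headers/footers can be sketched with asyncio tasks. `render` is a stand-in for sending printToPDF and reading the stream; the names and delays are illustrative only.

```python
import asyncio


async def render(name, delay):
    """Stand-in for generating one PDF part; returns a label instead of bytes."""
    await asyncio.sleep(delay)
    return f"{name}-pdf"


async def generate(static_header_footer):
    if static_header_footer:
        # Static parts do not depend on the body, so start them first
        # and let them run while the body is being generated.
        header_task = asyncio.create_task(render("header", 0.01))
        footer_task = asyncio.create_task(render("footer", 0.01))
        body = await render("body", 0.02)
        header, footer = await asyncio.gather(header_task, footer_task)
    else:
        # Dynamic parts need page numbers, so the body must come first.
        body = await render("body", 0.02)
        header = await render("header", 0.01)
        footer = await render("footer", 0.01)
    return header, body, footer
```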

misc:
    - added frompage class to is_page_no_used
    - split static header/footer html cloning logic in case format is print_designer (first, last, even, odd)
        - created update_header_footer_page_pd and modified update_header_footer_page

print_designer/pdf_generator/cdp_connection.py
    - wait_for_event now works by future object as well

print_designer/pdf_generator/page.py
    - updated wait_for_navigate to use wait_for_event
    - split get_pdf_from_stream logic from generate_pdf
    - generate_pdf now returns asyncio task that will have future object containing stream_id if wait_for_pdf is False
    - stream_id can be used to get pdf data using get_pdf_from_stream

print_designer/pdf_generator/pdf_merge.py
    - only transform once if header pdf is static
    - fix typo in checking if page is even ( modulus should be 2 instead of no_of_pages )

print_designer/print_designer/page/print_designer/update_page_no.js
    - updated to work with early header/footer generation
print_designer/pdf_generator/generator.py
    - _find_chromium_executable is not used by install.py so doesn't need to be class method

print_designer/install.py
    - using shutil to remove chromium directory
    - removed frappe methods and updated logic
    - updated chromium version
    - fixed linux arm url for chromium

docker/init.sh
    - added setup-new-pdf-backend command

pyproject.toml
    - added missing dependency: websockets
@maharshivpatel maharshivpatel force-pushed the pdf-backend branch 2 times, most recently from 6588a54 to 225c3c0 Compare February 5, 2025 08:53
- chromium shell doesn't honor preferCSSPageSize option in printToPDF

misc:
- added link to cdp commands documentation
- removed check for "page-size-found" in wkhtmltopdf options
reverted debug=True flag in generator.py

remove zip before finding directory to rename.

on failure, raise error instead of just printing message.
@maharshivpatel maharshivpatel force-pushed the pdf-backend branch 2 times, most recently from 3e5b99a to 6657ff0 Compare February 10, 2025 09:58
print_designer used to render an empty div with the header/footer id, which caused Chrome to error out when generating the PDF.
Added logic to skip rendering the header/footer div if the header/footer is empty or has a height of 0.
SVGs wouldn't load properly in the browser because the Content-Type response header was not set to 'image/svg+xml'.
This commit fixes that issue. Types such as jpg and png seem to work fine without the Content-Type response header being set.
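
A sketch of building the response for an intercepted file with an explicit Content-Type. The parameter names follow CDP's `Fetch.fulfillRequest`; `build_fulfill_params` is a hypothetical helper, and the fallback type is an assumption.

```python
import base64
import mimetypes


def build_fulfill_params(request_id, file_name, body):
    """Build Fetch.fulfillRequest params for a file served from disk."""
    content_type = mimetypes.guess_type(file_name)[0] or "application/octet-stream"
    if file_name.lower().endswith(".svg"):
        # Set explicitly: SVG did not render without this header.
        content_type = "image/svg+xml"
    return {
        "requestId": request_id,
        "responseCode": 200,
        "responseHeaders": [{"name": "Content-Type", "value": content_type}],
        # CDP expects the response body as a base64 string.
        "body": base64.b64encode(body).decode("ascii"),
    }
```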
- GPU is not available in the production environment, so added the --disable-gpu flag
- removed the 0.1 sleep from _set_devtools_url, as it caused timeouts in production because of the many font warnings being printed
@maharshivpatel maharshivpatel changed the title feat: New PDF Backend feat: Chrome PDF Generator Feb 11, 2025
- renamed new_pdf_backend to chrome_pdf_generator
- removed duplicate create custom field patch and modified the last patch's comment so that it reruns
- renamed new_backend.py to pdf_generator/pdf.py
- updated hooks
- removed unused args
- only check for chrome_pdf_generator if request path is download_pdf
createTarget ( the command that creates a tab in Chrome ) was in the browser class, but a tab is represented by a Page class instance in the code. So, moved it to the Page class init method.
overall flow, class diagram, and sequence diagram are added.

Note: haven't added sequence diagram for PDFTransformer class as logic is simple.
@maharshivpatel maharshivpatel marked this pull request as ready for review February 13, 2025 12:03
added field in frappe framework as some of the fw logic requires the field.
added cookies using cdp's Network.setCookie command to allow private images to be loaded in the browser
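
A sketch of turning the session cookies into CDP commands. `Network.setCookie` accepts `name`, `value`, and `url` among its parameters; `build_set_cookie_commands` and the input shape are assumptions for illustration.

```python
def build_set_cookie_commands(cookies, site_url):
    """Return one Network.setCookie command per cookie, scoped to site_url."""
    return [
        {
            "method": "Network.setCookie",
            "params": {"name": name, "value": value, "url": site_url},
        }
        for name, value in cookies.items()
    ]
```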
- if output (PDFWriter) is provided in args, use it
- if password is available in options, use it to encrypt the pdf
add proper command after all prerequisites are setup.
- we are checking if we should use new chrome_pdf_generator in before_request.
- however, we also need same value later in fw so instead of running it twice added to the form_dict.
- also, updated fw code to only look for chrome_pdf_generator value if not present in form_dict.
Initially I added scaling to match wkhtmltopdf formats.
This is not correct, as wkhtmltopdf scales content if it is larger than the page size ( weird ).
So, commented out the scaling code; if we face issues we can make it user configurable from the print format.

misc: removed incorrect comment
- renamed hook to pdf_generator
- added chrome format for framework's default pdf formats
- updated pdf_header_footer_html function to send correct format path ( chrome ) to function
- fix as per pdf_generator field being string
@maharshivpatel
Collaborator Author

The dependent frappe PR frappe/frappe#31069 is not merged yet, which causes CI to fail.

@maharshivpatel
Collaborator Author

Ideally frappe/press#2487 should be finished before merging.
However, it will only break on FC when someone chooses chrome in the pdf_generator field on a print format.

Merging this after the FW PR so people can try it locally and find bugs early.
