feat: Chrome PDF Generator #399
Open
maharshivpatel wants to merge 37 commits into frappe:develop from maharshivpatel:pdf-backend
1. Init with WebSocket URL
   - Sets up the WebSocket URL and initializes the required attributes.
   - Creates and sets the asyncio event loop.
2. Connect to the CDP WebSocket using connect().
   - Opens the WebSocket connection and starts listening by running _listen in a loop until disconnected.
2a. _listen() is an asyncio task that keeps listening for messages from the WebSocket connection.
   - Waits for incoming messages and passes them to _handle_message().
   - Extracts message details (method, sessionId, targetId, frameId).
   - Matches responses to pending requests using:
     - Message ID (for direct responses).
     - Composite key (method, sessionId, targetId, frameId) (for events).
   - Calls registered event listeners (see Event Handling) if a matching event is found.
3. Send a command using send().
   - Constructs a message with ID, method, and params.
   - Stores a future (promise) in pending_messages to track the response (by ID and composite key).
   - Sends the message via the WebSocket.
   - If return_future is true, returns the future; otherwise waits for a response.
   - Handles timeouts and errors.
   - Destructures the message and returns the result and/or error.
4. Event Handling (start_listener, wait_for_event, remove_listener)
   - start_listener(): Registers a listener for CDP events (look at reference link).
   - wait_for_event(): Waits, with a timeout, for a specific event's future to be set/fulfilled.
   - remove_listener(): Unregisters a listener.

Wait for a response (handled via pending_messages).
Listen for events (e.g., start_listener("Network.responseReceived", callback)).
Disconnect cleanly using disconnect().
print_designer/install.py
- Added logic to install Chrome headless in the bench's directory.
- Added `bench setup-new-pdf-backend` command.

print_designer/pdf_generator/generator.py
- Added logic to manage Chrome headless.

How It Works
1. Initialize the generator → FrappePDFGenerator()
2. Find the Chromium binary → _find_chromium_executable()
3. Verify the installation → _verify_chromium_installation()
4. Start the headless Chromium process → start_chromium_process()
5. Extract the DevTools WebSocket URL → _set_devtools_url()
6. Use the WebSocket URL for communication (CDPSocketClient in cdp_connection.py)
7. Shut down Chromium when done → _close_browser()

Added a __new__ method so that only one running instance of Chromium is created. If _instance already exists, it returns that instead of creating a new one.

init calls _initialize_chromium() if not already initialized.
- Fetches configuration settings from frappe.get_common_site_config().
- Determines the Chromium executable path using _find_chromium_executable().
- Verifies the installation with _verify_chromium_installation().
- Starts the Chromium process if devtools_url is not set.
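The single-instance behaviour of `__new__` can be illustrated with a small sketch. The class name `SingletonGenerator`, the lock, and the `start_count` counter are assumptions made for the example; the counter merely stands in for launching Chromium.

```python
import threading


class SingletonGenerator:
    """Hypothetical sketch of a process-wide singleton created via __new__."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls):
        # Reuse the existing instance instead of creating a second one.
        with cls._lock:
            if cls._instance is None:
                cls._instance = super().__new__(cls)
                cls._instance._initialized = False
        return cls._instance

    def __init__(self):
        # __init__ runs on every SingletonGenerator() call, so guard the setup.
        if self._initialized:
            return
        self.start_count = getattr(self, "start_count", 0) + 1  # stands in for starting Chromium
        self._initialized = True
```

Calling `SingletonGenerator()` twice returns the same object, and the guarded setup runs only once.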
print_designer/pdf_generator/browser.py
- The Browser class manages the entire PDF generation process.
- Initializing Browser runs the entire process.

print_designer/pdf_generator/page.py
The Page class manages one headless Chrome tab using CDP to run and generate PDFs.
- Initialize & attach to target.
  - Enables page lifecycle events (load, DOMContentLoaded, networkIdle, etc.) and sets print media emulation.
- Frame/tab & navigation handling
  - Retrieves the frame ID.
  - Logic to intercept requests and fulfill responses.
  - TODO: Implement request interception for local resources.
  - Navigates to a URL and waits for the page to load.
  - Injects HTML content and waits for it to load.
- Runs JavaScript evaluations inside the page.
  - Fetches element height for headers/footers.
- Generates a PDF and reads the output in chunks.

print_designer/print_designer/page/print_designer/update_page_no.js
- This script is injected into the page.
- It makes multiple copies of the header/footer elements and updates the page number in each header/footer.
- If no page numbers are used, it makes just one header/footer element, which is merged onto the PDF.
- TODO: maybe we should generate header/footer elements for all pages if a header/footer script is added. ref: frappe/frappe#23263
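Reading the PDF output in chunks might look like the sketch below. It assumes Page.printToPDF was issued with transferMode "ReturnAsStream" so that the result carries a stream handle, and it uses the CDP IO.read and IO.close commands. The `send` argument is a hypothetical coroutine standing in for the client's send(); `read_pdf_stream` and `chunk_size` are illustrative names, not the PR's.

```python
import asyncio
import base64


async def read_pdf_stream(send, stream_handle, chunk_size=1024 * 1024):
    """Drain a CDP IO stream into bytes, one chunk at a time."""
    chunks = []
    while True:
        result = await send("IO.read", {"handle": stream_handle, "size": chunk_size})
        data = result["data"]
        # Binary payloads such as PDFs arrive base64-encoded.
        chunks.append(base64.b64decode(data) if result.get("base64Encoded") else data.encode())
        if result.get("eof"):
            break
    # Release the stream on the browser side once everything is read.
    await send("IO.close", {"handle": stream_handle})
    return b"".join(chunks)
```

Keeping the chunk size bounded avoids pushing one very large WebSocket message for long documents.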
print_designer/new_backend.py
Function that is called from the framework to generate PDFs.

print_designer/pdf_generator/pdf_merge.py
The PDFTransformer class is responsible for merging headers and footers into a body PDF while ensuring correct positioning. It uses pypdf for PDF manipulations.
- Retrieves header, body, and footer PDFs from the browser object.
- Determines whether the header/footer are dynamic (varying per page).
- Transform & Merge PDFs (transform_pdf)
  - Adjusts page heights for correct positioning.
  - Applies header/footer dynamically:
    - If dynamic, picks the appropriate header/footer per page.
    - If print designer mode, handles first, last, and alternating pages differently.
  - Merges the transformed header/footer with body pages.
- Transform Page Heights (_transform)
  - Uses coordinate transformations to reposition content.
- Generate Final PDF Output
  - Writes the modified PDF to a byte stream.
  - Returns the final PDF file data.
For local performance monitoring of subprocesses (mostly used to monitor the Chrome process). This is not production code and has not been tested, but it can be used to check subprocess usage on a local setup/machine.
The Chrome browser is shared per worker, so we should not close it while any thread is still using it. Add a browserId on PDF generation and remove it at the end; close the browser only if no browserId is present, i.e. no thread is using it. The browser instance is not saved (it may contain sensitive information), as the FrappePDFGenerator instance is shared across threads in the worker.
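A minimal sketch of this bookkeeping, assuming a set of ids guarded by a lock. `BrowserUsageTracker`, `acquire`, `release`, and `close_if_idle` are hypothetical names, and the injected `close_browser` callback stands in for whatever actually shuts Chromium down.

```python
import threading
import uuid


class BrowserUsageTracker:
    """Hypothetical sketch: only close the shared browser when no id is registered."""

    def __init__(self, close_browser):
        self._close_browser = close_browser
        self._active_ids = set()
        self._lock = threading.Lock()

    def acquire(self):
        # Register one PDF generation run and hand back its id.
        browser_id = uuid.uuid4().hex
        with self._lock:
            self._active_ids.add(browser_id)
        return browser_id

    def release(self, browser_id):
        with self._lock:
            self._active_ids.discard(browser_id)

    def close_if_idle(self):
        # Hold the lock while closing so no thread can register in between.
        with self._lock:
            if self._active_ids:
                return False
            self._close_browser()
            return True
```

Only the ids are stored here, not the browser objects, which matches the concern about keeping sensitive state off the shared instance.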
The current logic tries to merge the first, odd, even, and last footer pages (print designer) of the PDF even if they are dynamic. This is incorrect: in the case of dynamic pages, we generate a unique header/footer PDF for each page and then merge them.
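The intended selection rule could be expressed as in the sketch below. The function name and the static overlay ordering `[first, odd, even, last]` are assumptions made for illustration, as is letting "first" win over "last" for a single-page document; the PR may order or prioritise them differently.

```python
def pick_overlay_index(page_no, total_pages, is_dynamic):
    """Return the index of the header/footer PDF page to merge onto body page `page_no` (1-based)."""
    if is_dynamic:
        # Dynamic: one unique header/footer page was generated per body page.
        return page_no - 1
    # Static print designer layout, assumed order: [first, odd, even, last].
    if page_no == 1:
        return 0
    if page_no == total_pages:
        return 3
    return 2 if page_no % 2 == 0 else 1
```

For a five-page static document this yields first, even, odd, even, last; for a dynamic one it is simply pages 0 to 4 in order.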
We are making requests from the server to nginx to get local resources. This is not needed: we can load the resources directly from the file system, which reduces the number of requests hitting nginx. We intercept the request and, if the request path starts with "assets/" or "files/", load the resource from the file system.
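Mapping an intercepted URL to a file on disk might be sketched like this. The function name and the `assets_root` / `files_root` parameters are hypothetical, and the traversal check is an addition for the example rather than something the PR text states. `Path.is_relative_to` needs Python 3.9 or newer.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit


def resolve_local_resource(url, assets_root, files_root):
    """Return the local file path for an intercepted URL, or None to let the request continue."""
    path = unquote(urlsplit(url).path).lstrip("/")
    for prefix, root in (("assets/", assets_root), ("files/", files_root)):
        if path.startswith(prefix):
            root = Path(root).resolve()
            candidate = (root / path[len(prefix):]).resolve()
            # Refuse anything that escapes the root, e.g. "assets/../../etc/passwd".
            if candidate.is_relative_to(root):
                return candidate
            return None
    return None
```

A `None` result would mean the request is not a local resource and should go out over the network as before.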
print_designer/pdf_generator/browser.py
- If the header/footer are static (meaning they don't have a page number or total page count):
  - Start header/footer PDF generation (send the printToPDF command) before setting the body content.
  - After the body PDF is generated, get the header/footer PDF data using its stream_id and merge it with the body PDF.

misc:
- Added frompage class to is_page_no_used.
- Split the static header/footer HTML cloning logic in case the format is print_designer (first, last, even, odd).
- Created update_header_footer_page_pd and modified update_header_footer_page.

print_designer/pdf_generator/cdp_connection.py
- wait_for_event now works with a future object as well.

print_designer/pdf_generator/page.py
- Updated wait_for_navigate to use wait_for_event.
- Split get_pdf_from_stream logic from generate_pdf.
- generate_pdf now returns an asyncio task whose future will contain the stream_id if wait_for_pdf is False.
- stream_id can be used to get PDF data using get_pdf_from_stream.

print_designer/pdf_generator/pdf_merge.py
- Only transform once if the header PDF is static.
- Fix typo in checking if the page is even (the modulus should be 2 instead of no_of_pages).

print_designer/print_designer/page/print_designer/update_page_no.js
- Updated to work with early header/footer generation.
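The early-generation idea can be sketched with plain asyncio tasks. The function and its three callables (`render_header`, `render_body`, `merge`) are hypothetical stand-ins for the Page and PDFTransformer calls, not the PR's signatures.

```python
import asyncio


async def generate_with_early_header(render_header, render_body, merge):
    """Kick off the static header render, then overlap it with the body render."""
    # Scheduled now; it starts running as soon as the body coroutine first awaits.
    header_task = asyncio.create_task(render_header())
    body_pdf = await render_body()
    # By the time the body is done the header is usually ready as well.
    header_pdf = await header_task
    return merge(header_pdf, body_pdf)
```

This only helps when the header does not depend on page numbers, since the total page count is unknown until the body has been rendered.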
print_designer/pdf_generator/generator.py
- _find_chromium_executable is not used by install.py, so it doesn't need to be a class method.

print_designer/install.py
- Using shutil to remove the chromium directory.
- Removed frappe methods and updated logic.
- Updated chromium version.
- Fixed Linux ARM URL for chromium.

docker/init.sh
- Added setup-new-pdf-backend command.

pyproject.toml
- Added missing dependency: websockets.
6588a54
to
225c3c0
Compare
- chromium shell doesn't honor the preferCSSPageSize option in printToPDF

misc:
- added link to CDP commands documentation
- removed check for "page-size-found" in wkhtmltopdf options
Reverted the debug=True flag in generator.py. Remove the zip before finding the directory to rename. On failure, raise an error instead of just printing a message.
3e5b99a
to
6657ff0
Compare
print_designer used to render an empty div with the header/footer id, which caused Chrome to error out when generating the PDF. Added logic to not render the header/footer div if the header/footer is empty or has a height of 0.
SVGs wouldn't load properly in the browser because the Content-Type response header was not set to 'image/svg+xml'. This commit fixes that issue. Types such as jpg and png seem to work fine without setting the Content-Type response header.
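Building the parameters for CDP's Fetch.fulfillRequest with an explicit Content-Type could look like the sketch below. The helper name, the `EXPLICIT_TYPES` override table, and the fallback type are assumptions for the example; the parameter names (requestId, responseCode, responseHeaders, body) are those of the CDP command, whose body is base64-encoded.

```python
import base64
import mimetypes
from pathlib import Path

# Explicit overrides so SVG never depends on the host's MIME database.
EXPLICIT_TYPES = {".svg": "image/svg+xml"}


def build_fulfill_params(request_id, file_name, content):
    """Return Fetch.fulfillRequest params that serve `content` with a Content-Type header."""
    suffix = Path(file_name).suffix.lower()
    content_type = EXPLICIT_TYPES.get(suffix) or mimetypes.guess_type(file_name)[0]
    return {
        "requestId": request_id,
        "responseCode": 200,
        "responseHeaders": [
            {"name": "Content-Type", "value": content_type or "application/octet-stream"},
        ],
        "body": base64.b64encode(content).decode("ascii"),
    }
```

Setting the header for every type, not just SVG, keeps the behaviour uniform even though jpg and png were observed to work without it.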
- GPU is not available in the production environment, so added the --disable-gpu flag.
- Removed the 0.1 sleep from _set_devtools_url, as it was causing timeouts in production because of the many font warnings being printed.
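One way to pick the DevTools URL out of noisy startup output without sleeping is to scan each line for the "DevTools listening on ws://..." message that Chromium prints when remote debugging is enabled. The function name and the idea of passing in an iterable of stderr lines are assumptions for this sketch, not the PR's _set_devtools_url implementation.

```python
import re

# Chromium announces the endpoint as: DevTools listening on ws://host:port/devtools/browser/<id>
DEVTOOLS_LINE = re.compile(r"DevTools listening on (ws://\S+)")


def find_devtools_url(stderr_lines):
    """Return the first DevTools WebSocket URL found in the given lines, or None."""
    for line in stderr_lines:
        match = DEVTOOLS_LINE.search(line)
        if match:
            return match.group(1)
    return None
```

Because unrelated lines such as font warnings are simply skipped, the amount of noise before the URL only affects how many lines are read, not whether the URL is found.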
- renamed new_pdf_backend to chrome_pdf_generator
- removed duplicate create custom field patch and modified the last patch's comment to rerun
- renamed new_backend.py to pdf_generator/pdf.py
- updated hooks
- removed unused args
- only check for chrome_pdf_generator if request path is download_pdf
This was referenced Feb 12, 2025
createTarget (the command that creates a tab in Chrome) was in Browser; however, a tab is represented by a Page class instance in the code. So, moved it to the Page class init method.
overall flow, class diagram, and sequence diagram are added. Note: haven't added sequence diagram for PDFTransformer class as logic is simple.
Added the field in Frappe framework, as some of the FW logic requires the field.
added cookies using cdp's Network.setCookie command to allow private images to be loaded in the browser
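Forwarding the request's cookies to the tab might be sketched as below. The helper name and the use of a plain dict of cookies are assumptions; `name`, `value`, and `url` are parameters of the CDP Network.setCookie command, and the cookie name in the usage note is only an example.

```python
def build_set_cookie_commands(cookies, site_url):
    """Return one (method, params) pair per cookie, ready to pass to a CDP send()."""
    return [
        ("Network.setCookie", {"name": name, "value": value, "url": site_url})
        for name, value in cookies.items()
    ]
```

For instance, `build_set_cookie_commands({"sid": "abc"}, "https://site.local")` produces a single command that scopes the cookie to that site URL, so private file requests from the page carry the session.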
- if output (PDFWriter) is provided in args, use it
- if a password is available in options, use it to encrypt the pdf
add proper command after all prerequisites are setup.
- We check whether to use the new chrome_pdf_generator in before_request.
- However, we also need the same value later in FW, so instead of running the check twice, it is added to form_dict.
- Also updated the FW code to only look up the chrome_pdf_generator value if it is not present in form_dict.
Initially I added scaling to match wkhtmltopdf formats. This is not correct, as wkhtmltopdf scales content only if it is larger than the page size (weird). So, commented out the scaling code; if we face issues, we can make it user-configurable from the print format.

misc: removed incorrect comment
- renamed hook to pdf_generator
- added chrome format for the framework's default pdf formats
- updated pdf_header_footer_html function to send the correct format path (chrome) to the function
- fix as per pdf_generator field being a string
bc89265
to
1662c4b
Compare
The dependent Frappe PR frappe/frappe#31069 is not merged, which causes CI to fail.
Ideally frappe/press#2487 should be finished before merging. Merging this after the FW PR so people can try it locally and find bugs early.
This PR adds Chrome as an option for generating PDFs instead of wkhtmltopdf.
Features such as PDF header / footer (pypdf) and page numbers (chrome / js) are handled manually.
Production Test - 100 PDFs
Frappe Cloud Site - 25$
( Not sure how cpu usage is calculated on frappe cloud ).
Note: Multi page pdfs should further increase efficiency in chrome as observed in local tests ( not tested in production )
PDF Used - Standard, Single Page, No external resources, 1 local public file ( SVG Image )
[Screenshot: WKHTMLTOPDF, CPU Usage and Average PDF Generation Time]
[Screenshot: Chrome Normal ( Non Persistent ), CPU Usage and Average PDF Generation Time]
[Screenshot: Chrome Normal ( Persistent ), CPU Usage and Average PDF Generation Time]
Architecture
Chrome LifeCycle
Formats were tested locally, and Chrome seems to be a drop-in replacement in most cases
( It is highly recommended to test your formats with multiple documents )