
Playwright OS processes remain active after finishing parsing #25

Open
gllona opened this issue Nov 11, 2024 · 4 comments

gllona commented Nov 11, 2024

Hi!

Today I am running a process that parses ~500 web pages. The parser runs sequentially; no two page parses run concurrently. Parsera runs as a Docker container on my local machine, on Ubuntu 24.04 (amd64).

I am noticing that after each web page is parsed, 5 new Firefox processes remain active in my local OS (at the host level, not inside the Docker container). For example:

$ ps uax | grep firefox | wc -l
123

Each process has this ps signature:

$ ps uax | grep firefox | head -1
997      1965790  1.7  0.8 2796868 272632 ?      Ssl  15:05   0:11 /home/tools/.cache/ms-playwright/firefox-1463/firefox/firefox -no-remote -headless -profile /tmp/playwright_firefoxdev_profile-lrmtlv -juggler-pipe -silent

And the Parsera process is running in Docker:

$ docker ps | head -2
CONTAINER ID   IMAGE                              COMMAND                   CREATED       STATUS                  PORTS          NAMES
1c62a310fb3b   tools                "./run.sh"                4 hours ago   Up 4 hours                            tools

My question is: could it be that Parsera is not closing the Playwright processes after the parse finishes? Or should I close the Playwright instance explicitly? If I should, how can I do it?
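
For reference, this is how explicit cleanup looks with plain Playwright, independent of Parsera's API (a minimal sketch; both the browser and the driver have to be shut down, or the firefox processes stay alive on the host):

import asyncio
from playwright.async_api import async_playwright

async def main():
    pw = await async_playwright().start()
    browser = await pw.firefox.launch(headless=True)
    try:
        page = await browser.new_page()
        await page.goto("https://example.com")
    finally:
        await browser.close()  # terminates the firefox child processes
        await pw.stop()        # shuts down the Playwright driver

asyncio.run(main())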

Notes: using parsera==0.1.8 with python 3.12

Thank you,

Gorka Llona

raznem (Owner) commented Nov 12, 2024

Hi @gllona .
Could you share the code you use to run Parsera?

gllona (Author) commented Nov 12, 2024

This is the Dockerfile:

FROM python:3.10-slim

RUN apt update -y && apt install libpq-dev gcc -y

# As root: install Playwright and its system dependencies
RUN python -m pip install --upgrade pip
RUN python -m pip install playwright==1.47.0
RUN python -m playwright install-deps

RUN groupadd -r tools && useradd -r -m -g tools tools
USER tools

WORKDIR /home/tools
ENV PATH="${PATH}:/home/tools/.local/bin"

# As the tools user: install Playwright again and download the browser binaries
RUN python -m pip install --upgrade pip
RUN python -m pip install playwright==1.47.0
RUN python -m playwright install
COPY requirements.txt .
RUN python -m pip install -r requirements.txt

ADD *.py *.ini /home/tools/
ADD routes /home/tools/routes
ADD migrations /home/tools/migrations
ADD bin/run.sh /home/tools/run.sh

EXPOSE 5000

ENTRYPOINT [ "./run.sh" ]

This is run.sh:

#!/bin/bash

export PYTHONPATH=/home/tools/.local/lib/python3.10/site-packages
python -m alembic upgrade head
if [ $? -ne 0 ]
then
  echo "alembic: failed to run migrations"
  exit 1
fi
unset PYTHONPATH

gunicorn -k uvicorn.workers.UvicornWorker --config gunicorn_config.py app:app

This is the function that does the actual scraping using a custom GPT4oModel:

from typing import Optional

async def scrape(site: str, page: str, page_number: Optional[int] = None, url: Optional[str] = None, these_sites: Optional[dict] = None):
    # A fresh ParseraScript is created on every call; nothing here closes its Playwright resources
    parsera = ParseraScript(
        model=GPT4oModel(),
        initial_script=initial_script,
        extractor=get_extractor_type(site, page, these_sites)
    )
    params = None if page_number is None else {"page": page_number}
    result = await parsera.arun(
        url=build_url(get_url(site, page, url, these_sites)),
        elements=get_elements(site, page, these_sites),
        playwright_script=build_repeating_script(site, page, params, these_sites),
    )
    return result

Other functions:

from playwright.async_api import Page

async def initial_script(page: Page) -> Page:
    # await page.wait_for_load_state("networkidle")
    return page

def build_repeating_script(site: str, this_page: str, params: Optional[dict] = None, these_sites: Optional[dict] = None):
    # noinspection PyBroadException
    async def repeating_script(page: Page) -> Page:
        # await page.wait_for_timeout(3000)
        # await page.wait_for_load_state("networkidle")
        # await page.wait_for_selector("eui-menu")
        wait_seconds = get_wait_seconds(site, this_page, these_sites)
        wait_for_element = get_wait_for_element(site, this_page, these_sites)
        wait_for_network_idle = get_wait_for_network_idle(site, this_page, these_sites)
        if wait_seconds is not None:
            try:
                await page.wait_for_timeout(wait_seconds * 1000)
            except Exception as e:
                print("Error waiting for timeout: " + str(e))
        if wait_for_element is not None:
            try:
                await page.wait_for_selector(wait_for_element)
            except Exception as e:
                print(f"Error waiting for selector '{wait_for_element}': " + str(e))
        if wait_for_network_idle is not None:
            try:
                await page.wait_for_load_state("networkidle")
            except Exception as e:
                print("Error waiting for network idle: " + str(e))
        return page

Note that the scraper is working fine; it is just that Playwright processes remain on the host (outside the Docker container) after the scraping finishes.

Thanks!

danyathecoder (Collaborator) commented

Hey @gllona, try using the PageLoader.close() function after you finish your request. It should release those leftover browser processes and help with the workload.
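
For illustration, a hedged sketch of how that could slot into the scrape() function shared above. The loader attribute name is a guess, and whether close() needs an await is an assumption; see PR #26 for the actual API:

async def scrape(site: str, page: str, page_number=None, url=None, these_sites=None):
    parsera = ParseraScript(
        model=GPT4oModel(),
        initial_script=initial_script,
        extractor=get_extractor_type(site, page, these_sites),
    )
    params = None if page_number is None else {"page": page_number}
    try:
        return await parsera.arun(
            url=build_url(get_url(site, page, url, these_sites)),
            elements=get_elements(site, page, these_sites),
            playwright_script=build_repeating_script(site, page, params, these_sites),
        )
    finally:
        # Assumption: ParseraScript exposes its PageLoader as `loader`; only
        # PageLoader.close() itself comes from the comment above. Drop the
        # `await` if close() turns out to be synchronous.
        await parsera.loader.close()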

danyathecoder (Collaborator) commented

I added this feature in that PR: #26
