
Playwright OS processes remain active after finishing parsing #25

Open
gllona opened this issue Nov 11, 2024 · 4 comments

gllona commented Nov 11, 2024

Hi!

Today I am running a process that parses ~500 web pages. The parser runs sequentially; no two page parses run concurrently. Parsera runs as a Docker container on my local machine, on Ubuntu 24.04 (amd64).

I am noticing that after each web page is parsed, 5 new Firefox processes remain active in my local OS (at the host level, not inside the Docker container). For example:

$ ps uax | grep firefox | wc -l
123

Each process has this ps signature:

$ ps uax | grep firefox | head -1
997      1965790  1.7  0.8 2796868 272632 ?      Ssl  15:05   0:11 /home/tools/.cache/ms-playwright/firefox-1463/firefox/firefox -no-remote -headless -profile /tmp/playwright_firefoxdev_profile-lrmtlv -juggler-pipe -silent

And the Parsera process is running in Docker:

$ docker ps | head -2
CONTAINER ID   IMAGE                              COMMAND                   CREATED       STATUS                  PORTS          NAMES
1c62a310fb3b   tools                "./run.sh"                4 hours ago   Up 4 hours                            tools

My question is: could it be that Parsera is not closing the Playwright processes after the parse finishes? Or should I close the Playwright instance explicitly? If I should, how can I do it?
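
For reference, this is how explicit cleanup looks with plain Playwright, independent of Parsera's API (a minimal sketch; both the browser and the driver have to be shut down, or the firefox processes stay alive on the host):

import asyncio
from playwright.async_api import async_playwright

async def main():
    pw = await async_playwright().start()
    browser = await pw.firefox.launch(headless=True)
    try:
        page = await browser.new_page()
        await page.goto("https://example.com")
    finally:
        await browser.close()  # terminates the firefox child processes
        await pw.stop()        # shuts down the Playwright driver

asyncio.run(main())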

Notes: using parsera==0.1.8 with python 3.12

Thank you,

Gorka Llona

raznem (Owner) commented Nov 12, 2024

Hi @gllona .
Could you share the code you use to run Parsera?

gllona (Author) commented Nov 12, 2024

This is the Dockerfile:

FROM python:3.10-slim

RUN apt update -y && apt install libpq-dev gcc -y

# As root: install Playwright and its system dependencies
RUN python -m pip install --upgrade pip
RUN python -m pip install playwright==1.47.0
RUN python -m playwright install-deps

RUN groupadd -r tools && useradd -r -m -g tools tools
USER tools

WORKDIR /home/tools
ENV PATH="${PATH}:/home/tools/.local/bin"

# As the tools user: install Playwright again and download the browser binaries
RUN python -m pip install --upgrade pip
RUN python -m pip install playwright==1.47.0
RUN python -m playwright install
COPY requirements.txt .
RUN python -m pip install -r requirements.txt

ADD *.py *.ini /home/tools/
ADD routes /home/tools/routes
ADD migrations /home/tools/migrations
ADD bin/run.sh /home/tools/run.sh

EXPOSE 5000

ENTRYPOINT [ "./run.sh" ]

This is run.sh:

#!/bin/bash

export PYTHONPATH=/home/tools/.local/lib/python3.10/site-packages
python -m alembic upgrade head
if [ $? -ne 0 ]
then
  echo "alembic: failed to run migrations"
  exit 1
fi
unset PYTHONPATH

gunicorn -k uvicorn.workers.UvicornWorker --config gunicorn_config.py app:app

This is the function that does the actual scraping using a custom GPT4oModel:

from typing import Optional

async def scrape(site: str, page: str, page_number: Optional[int] = None, url: Optional[str] = None, these_sites: Optional[dict] = None):
    # A fresh ParseraScript is created on every call; nothing here closes its Playwright resources
    parsera = ParseraScript(
        model=GPT4oModel(),
        initial_script=initial_script,
        extractor=get_extractor_type(site, page, these_sites)
    )
    params = None if page_number is None else {"page": page_number}
    result = await parsera.arun(
        url=build_url(get_url(site, page, url, these_sites)),
        elements=get_elements(site, page, these_sites),
        playwright_script=build_repeating_script(site, page, params, these_sites),
    )
    return result

Other functions:

from playwright.async_api import Page

async def initial_script(page: Page) -> Page:
    # await page.wait_for_load_state("networkidle")
    return page

def build_repeating_script(site: str, this_page: str, params: Optional[dict] = None, these_sites: Optional[dict] = None):
    # noinspection PyBroadException
    async def repeating_script(page: Page) -> Page:
        # await page.wait_for_timeout(3000)
        # await page.wait_for_load_state("networkidle")
        # await page.wait_for_selector("eui-menu")
        wait_seconds = get_wait_seconds(site, this_page, these_sites)
        wait_for_element = get_wait_for_element(site, this_page, these_sites)
        wait_for_network_idle = get_wait_for_network_idle(site, this_page, these_sites)
        if wait_seconds is not None:
            try:
                await page.wait_for_timeout(wait_seconds * 1000)
            except Exception as e:
                print("Error waiting for timeout: " + str(e))
        if wait_for_element is not None:
            try:
                await page.wait_for_selector(wait_for_element)
            except Exception as e:
                print(f"Error waiting for selector '{wait_for_element}': " + str(e))
        if wait_for_network_idle is not None:
            try:
                await page.wait_for_load_state("networkidle")
            except Exception as e:
                print("Error waiting for network idle: " + str(e))
        return page

Note that the scraper is working fine; it is just that Playwright processes remain on the host (outside the Docker container) after the scraping finishes.

Thanks!

danyathecoder (Collaborator) commented

Hey @gllona, try using the PageLoader.close() function after you finish your request. It should release those leftover browser processes and help with the workload.
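
For illustration, a hedged sketch of how that could slot into the scrape() function shared above. The loader attribute name is a guess, and whether close() needs an await is an assumption; see PR #26 for the actual API:

async def scrape(site: str, page: str, page_number=None, url=None, these_sites=None):
    parsera = ParseraScript(
        model=GPT4oModel(),
        initial_script=initial_script,
        extractor=get_extractor_type(site, page, these_sites),
    )
    params = None if page_number is None else {"page": page_number}
    try:
        return await parsera.arun(
            url=build_url(get_url(site, page, url, these_sites)),
            elements=get_elements(site, page, these_sites),
            playwright_script=build_repeating_script(site, page, params, these_sites),
        )
    finally:
        # Assumption: ParseraScript exposes its PageLoader as `loader`; only
        # PageLoader.close() itself comes from the comment above. Drop the
        # `await` if close() turns out to be synchronous.
        await parsera.loader.close()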

danyathecoder (Collaborator) commented

I added this feature in that PR: #26
