How can I send a request to fetch data from an external API rather than launching a browser locally? #2853

ardhrubo · 2025-02-20T02:46:35Z

Hi, I am working with Crawlee and want to retrieve data from an external API without launching a browser.

I want to send a request to the external source, fetch the data, and then process it using the Cheerio crawler. Using Crawlee’s BasicCrawler, I can fetch data from external sources, but I am facing an issue with the built-in URL strategy.

While BasicCrawler allows me to fetch data, the default enqueuing strategy doesn't seem to allow crawling the same hostname or subdomains the way Cheerio and Puppeteer crawlers do.

This is causing a problem when trying to process the fetched URLs that belong to the same or different subdomains.

Is there a way to enable BasicCrawler to crawl the same hostname or subdomain, similar to the behavior provided by Cheerio or Puppeteer crawlers?

Any advice or workaround to overcome this limitation would be very helpful.

janbuchar · 2025-02-24T10:54:23Z

Maintainer

Hi @ardhrubo and thanks for your interest in Crawlee. Could you send a code snippet that illustrates the problem you're facing? I have a hard time understanding how enqueueLinks fits in with the rest of your issue.

4 replies

ardhrubo Mar 1, 2025
Author

Hi, thanks. I was trying to use your built-in method: I fetch the data and then try to use Crawlee's BasicCrawler to crawl all the subdomains.

The function looks roughly like this:

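import { CheerioCrawler, Dataset, EnqueueStrategy } from 'crawlee';

// Context and formatResponse come from the surrounding server code (not shown).
const crawlSubdomain = async (c: Context) => {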
  const { url, format = 'json' } = await c.req.json();
  if (!url) {
    return c.json({ error: 'url is required' }, 400);
  }

  const crawler = new CheerioCrawler({
    async requestHandler({ request, log, enqueueLinks, $, pushData }) {
      const { url } = request;
      log.info(`Processing ${url}...`);

      const html = $.html();
      await pushData({ url, html });
      
      await enqueueLinks({
        strategy: EnqueueStrategy.SameHostname,
      });

      log.info(`Enqueued links with strategy SameHostname for ${url}`);
    },
    maxRequestsPerCrawl: 10,
    maxConcurrency: 10,
  });

  await crawler.addRequests([url]);
  await crawler.run();

  const dataset = await Dataset.getData();
  return formatResponse(c, dataset.items, format);
};

Here I want to crawl the subdomains of a particular link, but I couldn't get that to work.

Crawling via the sitemap works perfectly, but the subdomain and hostname strategies are not working with the basic crawler.

janbuchar Mar 4, 2025
Maintainer

I see that you are writing an HTTP API server that receives a URL and crawls the whole website, storing the HTML of each page in the dataset. Is that correct?

Could you give me an example of a website where you noticed that links to subdomains are not correctly enqueued?

ardhrubo Mar 4, 2025
Author

Yes, you are correct.

I used crawlee.dev for testing, but it wasn't working. Then I also tried example.com, a website set up locally, and a Python HTTP server.

It works fine with the Cheerio crawler, but I cannot use it with the basic crawler. Here is an example function that I could not get to work:

const crawlHostname = async (c: Context) => {
  const { url, format = 'json' } = await c.req.json();
  if (!url) {
    return c.json({ error: 'url is required' }, 400);
  }

  const crawler = new BasicCrawler({
    async requestHandler({ request, log, pushData }) {
      const { url } = request;
      log.info(`Processing ${url}...`);

      try {
        const data = await fetchdata(url);
        await pushData({ url, data });
        // enqueue all links on the same hostname
        await enqueueLinks({ strategy: EnqueueStrategy.SameHostname });

        log.info(`Successfully processed ${url}`);
      } catch (error) {
        log.error(`Failed to process ${url}: ${error.message}`);
      }
    },
    maxRequestsPerCrawl: 10,
    maxConcurrency: 10,
  });

  await crawler.addRequests([url]);
  await crawler.run();

  const dataset = await Dataset.getData();
  return formatResponse(c, dataset.items, format);
};

janbuchar Mar 6, 2025
Maintainer

I see two potential issues here.

1. async requestHandler({ request, log, pushData }) { - on this line, you should also take enqueueLinks from the crawling context via destructuring. In your snippet, it is undefined, but I suspect that's due to you trimming down the example, correct?
2. await enqueueLinks({ strategy: EnqueueStrategy.SameHostname }); - in BasicCrawler, the enqueueLinks handler requires you to pass the URLs to enqueue, because it does not assume anything about the structure of the page. CheerioCrawler expects the page to be HTML, so it looks for <a> elements itself instead of relying on the caller to pass the URLs in. So the function call in BasicCrawler should look something like this: await enqueueLinks({ urls: ["https://1.com", "https://2.com"], strategy: EnqueueStrategy.SameHostname });
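Putting the two points above together, a request handler for BasicCrawler could look roughly like the sketch below. It is only an illustration under some assumptions: MinimalContext is a hypothetical stand-in for the parts of the crawling context the handler uses (request, sendRequest, pushData, enqueueLinks), extractUrls is a hypothetical helper that pulls href values out of the fetched body with a regex, and the fetched body is assumed to be HTML text. In a real project the context is supplied by Crawlee, and a proper HTML or JSON parser would be a better fit than a regex.

```typescript
// Hypothetical stand-in for the parts of the BasicCrawler context used below.
// In a real project the crawler passes in the actual context object.
interface MinimalContext {
  request: { url: string };
  sendRequest: () => Promise<{ body: string }>;
  pushData: (item: Record<string, unknown>) => Promise<void>;
  enqueueLinks: (options: { urls: string[]; strategy: 'same-hostname' }) => Promise<unknown>;
}

// Hypothetical helper: collect unique absolute http(s) URLs from href attributes.
function extractUrls(body: string, baseUrl: string): string[] {
  const hrefPattern = /href\s*=\s*["']([^"']+)["']/gi;
  const found: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(body)) !== null) {
    try {
      // Resolve relative links against the URL of the page they came from.
      const resolved = new URL(match[1], baseUrl);
      resolved.hash = '';
      const isHttp = resolved.protocol === 'http:' || resolved.protocol === 'https:';
      if (isHttp && found.indexOf(resolved.href) === -1) {
        found.push(resolved.href);
      }
    } catch (_err) {
      // Ignore href values that cannot be parsed as a URL.
    }
  }
  return found;
}

// enqueueLinks is destructured from the context (point 1) and receives the
// URLs explicitly (point 2); the strategy then filters them by hostname.
async function requestHandler(
  { request, sendRequest, pushData, enqueueLinks }: MinimalContext,
): Promise<void> {
  const { body } = await sendRequest();
  await pushData({ url: request.url, data: body });

  const urls = extractUrls(body, request.url);
  if (urls.length > 0) {
    await enqueueLinks({ urls, strategy: 'same-hostname' });
  }
}
```

The handler would then be passed to the crawler as new BasicCrawler({ requestHandler, maxRequestsPerCrawl: 10 }), with BasicCrawler imported from 'crawlee'; the hand-written context type may need adjusting to match Crawlee's own types. Note that the same-hostname strategy only matches URLs with exactly the same hostname. If links to other subdomains should be followed as well, EnqueueStrategy.SameDomain is the strategy to try instead.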
