
extraction-worker should parallelize different root domains #19

Open
TimDaub opened this issue Jun 27, 2022 · 7 comments
Labels
enhancement New feature or request

Comments

@TimDaub
Collaborator

TimDaub commented Jun 27, 2022

  • There can be HTTPS endpoints that don't allow us to download at maximum concurrency and maximum rate.
  • But generally, at least with neume's current music NFT crawl, many different HTTPS endpoints are queried over the life cycle of a crawl.
  • Rate limiting, however, typically only kicks in when a single endpoint is hammered hard.
  • So it'd be great if the extraction-worker implemented an algorithm that maximizes the diversity of requests (e.g. by root domain), such that even while rate limiting is in progress, as many requests as possible complete in parallel (see the sketch below).
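
A minimal sketch of that idea, assuming tasks carry a plain url field (the task shape and helper names here are hypothetical, not the extraction-worker's actual message format): bucket pending requests by root domain and dispatch them round-robin, so no single host absorbs the whole concurrency budget.

// Minimal sketch, not the extraction-worker's real message format.
function rootDomain(url) {
  // Naive heuristic: treat the last two hostname labels as the root domain.
  const { hostname } = new URL(url);
  return hostname.split(".").slice(-2).join(".");
}

function interleaveByDomain(tasks) {
  // Bucket tasks per root domain, preserving per-domain order.
  const buckets = new Map();
  for (const task of tasks) {
    const key = rootDomain(task.url);
    if (!buckets.has(key)) buckets.set(key, []);
    buckets.get(key).push(task);
  }

  // One pass per round: take one task from every domain before taking a
  // second task from any of them.
  const ordered = [];
  while (buckets.size > 0) {
    for (const [key, bucket] of buckets) {
      ordered.push(bucket.shift());
      if (bucket.length === 0) buckets.delete(key);
    }
  }
  return ordered;
}

// Example: the two ipfs.io requests end up separated by the arweave.net one.
interleaveByDomain([
  { url: "https://ipfs.io/a" },
  { url: "https://ipfs.io/b" },
  { url: "https://arweave.net/c" },
]);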
TimDaub added the enhancement (New feature or request) label on Jun 27, 2022
@TimDaub
Collaborator Author

TimDaub commented Jun 30, 2022

better-queue has a priority function that can inform order: https://github.com/diamondio/better-queue#filtering-validation-and-priority
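
A hedged sketch of how such a priority function could be keyed on the request's hostname instead of a random number, so repeat requests to an already-busy host get deprioritized. The message shape (message.options.url) and processMessage are assumptions, not the worker's real interface; in the extraction-worker the process function is messages.route, and this assumes better-queue runs higher priority values first.

import Queue from "better-queue";

// Stand-in process function; the real worker passes messages.route here.
const processMessage = (message, cb) => cb(null, message);

// Count how many tasks we have already queued per hostname.
const queuedPerHost = new Map();

const queue = new Queue(processMessage, {
  priority: (message, cb) => {
    let priority = 0;
    try {
      const { hostname } = new URL(message.options.url);
      const seen = queuedPerHost.get(hostname) ?? 0;
      queuedPerHost.set(hostname, seen + 1);
      // Every additional task for the same host gets a lower priority and
      // therefore yields to tasks targeting other hosts.
      priority = -seen;
    } catch {
      // Messages without a URL keep the default priority of 0.
    }
    cb(null, priority);
  },
});

queue.push({ options: { url: "https://ipfs.io/a" } });
queue.push({ options: { url: "https://ipfs.io/b" } });
queue.push({ options: { url: "https://arweave.net/c" } });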

@il3ven
Collaborator

il3ven commented Jul 3, 2022

Thinking out loud: We can give each task a random priority out of 5.

@TimDaub
Collaborator Author

TimDaub commented Jul 4, 2022

> Thinking out loud: We can give each task a random priority out of 5.

Yeah, randomizing all incoming requests could be a good parallelization strategy for now that doesn't require much other effort. Good idea, let's test this.

@il3ven
Collaborator

il3ven commented Jul 6, 2022

I tried this out and it didn't work; for some reason the order didn't change. This statement is based on the fact that the order in which the files were written to data stayed the same. If the messages were executed in random order, I'd expect the data to be written in random order.

Also, if the priority range was large, e.g. 50, I started getting a strange error:
FetchError: request to https://node.rugpullindex.com/ failed, reason: Client network socket disconnected before secure TLS connection was established

I will try it again later.
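
One way to check whether the priority actually changes execution order, rather than inferring it from the order files are written to data, might be to log each message the moment the queue starts processing it. A sketch that reuses the identifiers from extraction-worker/src/worker.mjs (log, messages, workerData, Queue):

// Wrap the process function so every task logs when it actually starts,
// which makes the effective execution order directly visible.
let started = 0;
const routeWithOrderLog = (message, cb) => {
  started += 1;
  log(`task #${started} started: ${JSON.stringify(message).slice(0, 120)}`);
  return messages.route(message, cb);
};

const queue = new Queue(routeWithOrderLog, {
  ...workerData.queue.options,
  priority: (message, cb) => cb(null, Math.floor(Math.random() * 1000)),
});

Also worth ruling out: if messages are pushed and processed roughly one at a time (better-queue's concurrent option defaults to 1, if I recall the docs correctly), there is rarely more than one task waiting in the store, in which case the priority never gets a chance to reorder anything.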

@TimDaub
Collaborator Author

TimDaub commented Jul 6, 2022

Weird. But I'd have to see the code to comment meaningfully.

@il3ven
Collaborator

il3ven commented Jul 6, 2022

Here's the code. The only change I made was in extraction-worker/src/worker.mjs.

export function run() {
  log(
    `Starting as worker thread with queue options: "${JSON.stringify(
      workerData.queue.options
    )}"`
  );
  const queue = new Queue(messages.route, {
    ...workerData.queue.options,
    // Give every task a random priority so messages are pulled from the
    // queue in a shuffled order.
    priority: function (message, cb) {
      const pr = Math.floor(Math.random() * 1000);
      cb(null, pr);
    },
  });
  queue.on("task_finish", loggingProxy(queue, reply));
  queue.on("task_failed", loggingProxy(queue, panic));
  parentPort.on("message", messageHandler(queue));
  return queue;
}

@TimDaub
Collaborator Author

TimDaub commented Jul 7, 2022

I've cross-checked this with the better-queue docs and to me it looks correct.
