Thanks for sharing the programming example - https://github.com/cocrawler/cdx_toolkit#programming-example
I wanted to ask if there is a way to feed in a list of URLs and retrieve their objects. In the example above we feed URLs one by one, and looping over a few thousand (or even a few hundred) is quite time-consuming.
Thanks.
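For reference, a minimal sketch of the one-by-one loop being described, adapted from the README's programming example; the URL list and parameter values here are illustrative:

```python
import cdx_toolkit

cdx = cdx_toolkit.CDXFetcher(source='cc')  # query the Common Crawl indexes

# hypothetical list of pages to look up
urls = ['commoncrawl.org', 'example.com']

for url in urls:
    # each iter() call issues its own set of index queries,
    # which is why this loop gets slow for thousands of URLs
    for obj in cdx.iter(url, limit=10, filter=['status:200']):
        print(obj['url'], obj['status'], obj['timestamp'])
```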
@wumpus - thanks for the response :D
I am trying to retrieve metadata for nearly 10k webpages, feeding the URL of each webpage one by one to cdx.iter. I've been timing retrieval in sets of 20 webpages: some sets take nearly 30 minutes, while other sets of the same size finish within 5 minutes.
I read your explanation on another issue in this repo (#8). Is the retrieval time dependent on how many requests Common Crawl is handling at a given time? Any suggestions for speeding up retrieval would be helpful.
Turn up the verbose level and you'll see what's going on -- if you don't limit your time span, the cdx code has to query every Common Crawl index individually, whereas for the Internet Archive there's just one query.
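A hedged sketch of that suggestion in code: restricting from_ts/to to a narrow window so only the indexes covering that window are queried, and raising the logging level to watch the per-index requests (cdx_toolkit logs through the standard logging module; the timestamps and limit below are illustrative):

```python
import logging
import cdx_toolkit

# raising the log level shows which Common Crawl indexes are queried
logging.basicConfig(level=logging.INFO)

cdx = cdx_toolkit.CDXFetcher(source='cc')

# a narrow from_ts/to window avoids querying every index individually
for obj in cdx.iter('commoncrawl.org', from_ts='202004', to='202005', limit=10):
    print(obj['url'], obj['status'], obj['timestamp'])
```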