This repository has been archived by the owner on Sep 5, 2024. It is now read-only.

client.download() not working. Gives error (Not a gzipped file (b'<?')). #40

Open
hamzashah47 opened this issue Mar 31, 2022 · 8 comments · May be fixed by #41

Comments

@hamzashah47

from comcrawl import IndexClient

client = IndexClient(["2019-51", "2019-47"])
client.search("reddit.com/r/MachineLearning/*")
client.download()

I'm trying to download HTML pages, but it is not working. It gives the error: Not a gzipped file (b'<?').
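
For context, the value in parentheses is what Python's gzip module reports as the first two bytes it read. b'<?' therefore suggests the response body started like an XML or HTML document (for example an error page) rather than gzip data. A minimal sketch of checking for that before decompressing; `decode_record` is a hypothetical helper for illustration, not part of comcrawl:

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def decode_record(raw: bytes) -> str:
    """Decompress a gzip payload, or fail with a readable message (hypothetical helper)."""
    if not raw.startswith(GZIP_MAGIC):
        # Include the start of the body so an error page is easy to recognise.
        raise ValueError(f"response is not gzip data, starts with {raw[:60]!r}")
    return gzip.decompress(raw).decode("utf-8", errors="replace")
```

Printing the first bytes of the body this way usually shows whether the server returned an error document instead of the requested record.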

@rokasramas rokasramas linked a pull request Apr 16, 2022 that will close this issue
@kouohhashi

Hi, I have the same issue.
Have you found a workaround?

@sufyanel

Yes, I fixed it. I created a fork and put this chunk of code in a try/except block. It worked for me.

@kouohhashi

I installed comcrawl today with pip install comcrawl.

Then I ran:

from comcrawl import IndexClient
client = IndexClient(["2019-51", "2019-47"], verbose=True)
client.download()

Is there anything I can dig into?

@sufyanel

Create a fork of the original repo and add a try/except in the client.download() method. Or I can send you my module if you share your email.
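
The fork itself is not shown in this thread, so the following is only a rough sketch of the try/except idea, not the actual patch. `download_all` and `fetch_one` are hypothetical names; `fetch_one` stands in for whatever fetches and decompresses a single record.

```python
from typing import Callable, Dict, Iterable, List


def download_all(results: Iterable[Dict], fetch_one: Callable[[Dict], str]) -> List[Dict]:
    """Fetch every result, skipping records that fail instead of aborting the run."""
    downloaded = []
    for result in results:
        try:
            html = fetch_one(result)
        except (OSError, ValueError) as err:
            # gzip.BadGzipFile and requests' exceptions both derive from OSError.
            print(f"skipping {result.get('url')}: {err}")
            continue
        downloaded.append({**result, "html": html})
    return downloaded
```

Note that this only hides the failing records. If every download fails for the same reason, the result is simply an empty list.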

@kouohhashi

kouohhashi commented Aug 27, 2022

When I use "CC-MAIN-2022-33" as the index, as below,

from comcrawl import IndexClient
client = IndexClient(["CC-MAIN-2022-33"])
client.search("reddit.com/r/MachineLearning/*")
client.download()
client.results

I did not get an error, but client.results is [].

When I use "2022-33" as the index, as below,

from comcrawl import IndexClient
client = IndexClient(["2022-33"])
client.search("reddit.com/r/MachineLearning/*")
client.download()

I got an error.

I'm not sure how to set the index correctly.

Thanks in advance.

@customer101

customer101 commented Jan 5, 2023

I'm facing the same problem. I changed the URL_TEMPLATE here to URL_TEMPLATE = "https://data.commoncrawl.org/{filename}", following this announcement.

EDIT: It seems PR #41 did the same; I don't know why it hasn't been merged.
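
As a rough illustration of what the template change affects, here is a sketch of building the download URL and a byte-range header for one search result. It assumes each result carries "filename", "offset" and "length" fields, as Common Crawl index records do. `build_request` is a hypothetical helper, and the filename in the usage note is made up.

```python
from typing import Dict, Tuple

URL_TEMPLATE = "https://data.commoncrawl.org/{filename}"


def build_request(result: Dict) -> Tuple[str, Dict[str, str]]:
    """Return the URL and Range header for one index record (hypothetical helper)."""
    offset = int(result["offset"])
    length = int(result["length"])
    url = URL_TEMPLATE.format(filename=result["filename"])
    # HTTP byte ranges are inclusive, hence the -1 on the end position.
    headers = {"Range": f"bytes={offset}-{offset + length - 1}"}
    return url, headers
```

For example, build_request({"filename": "crawl-data/example.warc.gz", "offset": "100", "length": "50"}) yields the URL https://data.commoncrawl.org/crawl-data/example.warc.gz with the header Range: bytes=100-149.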

@jaceaser

I'm having issues and get this error. Is there something I'm missing in my setup?

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/requests/adapters.py", line 486, in send
    resp = conn.urlopen(
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 844, in urlopen
    retries = retries.increment(
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/util/retry.py", line 470, in increment
    raise reraise(type(error), error, _stacktrace)
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/util/util.py", line 38, in reraise
    raise value.with_traceback(tb)
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 790, in urlopen
    response = self._make_request(
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/connectionpool.py", line 536, in _make_request
    response = conn.getresponse()
  File "/Users/joshuaceaser/Innoventage/Ingestion/.venv/lib/python3.10/site-packages/urllib3/connection.py", line 454, in getresponse
    httplib_response = super().getresponse()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 1368, in getresponse
    response.begin()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 317, in begin
    version, status, reason = self._read_status()
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py", line 286, in _read_status
    raise RemoteDisconnected("Remote end closed connection without"
urllib3.exceptions.ProtocolError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

@jaceaser

Never mind, my issue was VPN-related.
