I've noticed some index.html files were missing after scraping a site with your script.
The problem seems to be that if wget first downloads binary files into a directory, an HTML page served at that directory's path can't be saved as index.html. See the example below.
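To illustrate the collision outside of wget, here is a minimal Python sketch. The helper `try_save` is hypothetical and only mimics a naive mirroring tool that maps URL paths straight onto the filesystem; it is not how wget is implemented.

```python
import tempfile
from pathlib import Path


def try_save(root: Path, url_path: str, body: bytes) -> str:
    """Save body under root at url_path, the way a naive mirroring tool might."""
    target = root / url_path.lstrip("/")
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        target.write_bytes(body)
    except OSError as exc:
        # IsADirectoryError on Linux/macOS, PermissionError on Windows
        return f"cannot write {url_path}: {type(exc).__name__}"
    return f"saved {url_path}"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    results = [
        try_save(root, "/index.html", b"<!DOCTYPE html>"),
        # Saving the image creates the directory main/ as a side effect.
        try_save(root, "/main/logo.png", b"\x89PNG"),
        # The page at /main now collides with that directory.
        try_save(root, "/main", b"<!DOCTYPE html>"),
    ]
    for line in results:
        print(line)
```

The third call fails because `main` already exists as a directory, which matches the "Is a directory" error in the example.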
I suggest adding the --trust-server-names option to the wget call, but I haven't had enough time to test it yet.
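As I understand the wget manual, --trust-server-names makes wget take the local file name from the redirect target instead of from the original URL. The sketch below is a simplified model of that choice, assuming the server redirects `/main` to `/main/`; the function `local_path` is hypothetical and not taken from wget.

```python
from urllib.parse import urlsplit


def local_path(original_url: str, final_url: str, trust_server_names: bool) -> str:
    """Simplified model of picking a local file name for a downloaded URL."""
    # With the option on, the URL after the redirect decides the name.
    chosen = final_url if trust_server_names else original_url
    parts = urlsplit(chosen)
    path = parts.path or "/"
    # A path ending in "/" is treated as a directory and gets the default page name.
    if path.endswith("/"):
        path += "index.html"
    return parts.netloc + path


original = "http://localhost:8000/main"
final = "http://localhost:8000/main/"  # assumed redirect target

without_option = local_path(original, final, trust_server_names=False)
with_option = local_path(original, final, trust_server_names=True)
print(without_option)
print(with_option)
```

In this model the default behaviour picks `localhost:8000/main`, which clashes with the directory, while the option yields `localhost:8000/main/index.html`.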
$ tree
example.com
├── index.html
└── main
    ├── index.html
    └── logo.png
$ cat example.com/index.html
<!DOCTYPE html>
<a href="./main/logo.png">MAIN LOGO</a>
<a href="./main">MAIN PAGE</a>
$ cd example.com && python3 -m http.server
$ wget -r http://localhost:8000
‘localhost:8000/index.html’ saved
‘localhost:8000/main/logo.png’ saved
Cannot write to ‘localhost:8000/main’ (Is a directory).
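A possible explanation, which I have not verified against wget's source: Python's http.server answers a request for a directory path without a trailing slash with a redirect to the same path with the slash. The sketch below rebuilds a copy of the example site in a temporary directory (the file contents are placeholders) and checks what the server returns for `/main`. The names `QuietHandler` and `probe` are made up for this illustration.

```python
import functools
import http.client
import tempfile
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path


class QuietHandler(SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        pass  # keep the request log out of the output


def probe(path):
    """Serve a copy of the example site and return (status, Location) for path."""
    with tempfile.TemporaryDirectory() as tmp:
        site = Path(tmp)
        (site / "main").mkdir()
        (site / "index.html").write_text('<a href="./main">MAIN PAGE</a>')
        (site / "main" / "index.html").write_text("<!DOCTYPE html>")
        (site / "main" / "logo.png").write_bytes(b"\x89PNG")

        handler = functools.partial(QuietHandler, directory=str(site))
        server = ThreadingHTTPServer(("127.0.0.1", 0), handler)  # port 0: any free port
        threading.Thread(target=server.serve_forever, daemon=True).start()
        try:
            port = server.server_address[1]
            conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
            conn.request("GET", path)  # http.client does not follow redirects
            resp = conn.getresponse()
            resp.read()
            result = (resp.status, resp.getheader("Location"))
            conn.close()
            return result
        finally:
            server.shutdown()
            server.server_close()


print(probe("/main"))
print(probe("/main/"))
```

If wget names the file after the original URL (`/main`) rather than the redirect target (`/main/`), that would explain why it tries to write a file called `main` where a directory already exists.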
example.com.zip