warc: Fix CDX fields missing with multi-line HTTP headers
The regex used to find the end of the HTTP headers would not match if
the header block spanned more than one line, because "." does not match
newlines unless re.DOTALL is set. As a result, get_http_header would
return nothing, and the status code and MIME type fields would be empty
in the CDX record.
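
The behaviour described above can be reproduced in isolation. The following is a minimal sketch, not code from wpull: the pattern is taken from the diff, while `find_header` and the two sample byte strings are hypothetical names and data introduced only for illustration.

```python
import re

# Pattern taken from the diff: lazily capture everything up to and
# including the first blank line (CRLF or bare LF line endings).
HEADER_END = br'(.*?\r?\n\r?\n)'

# Hypothetical sample data, not taken from a real WARC file.
single_line = b'HTTP/1.1 200 OK\r\n\r\nbody'
multi_line = b'HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\nbody'


def find_header(data, flags=0):
    """Return the header block including its terminating blank line, or None.

    Hypothetical helper that mirrors the re.match call shown in the diff.
    """
    match = re.match(HEADER_END, data, flags)
    return match.group(1) if match else None


# Without re.DOTALL, "." cannot cross the newline after the status line,
# so only a header block consisting of a single line is found.
print(find_header(single_line))
print(find_header(multi_line))

# With re.DOTALL, "." also matches newlines, so the lazy match can extend
# across header fields until it reaches the first blank line.
print(find_header(multi_line, re.DOTALL))
```

Because the quantifier is lazy, adding re.DOTALL should still stop the match at the first blank line rather than consuming the body.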
Frogging101 committed Apr 6, 2019
1 parent 9187761 commit 451cd2e
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion wpull/warc/format.py
@@ -163,7 +163,7 @@ def get_http_header(self) -> Response:
         with wpull.util.reset_file_offset(self.block_file):
             data = self.block_file.read(4096)

-        match = re.match(br'(.*?\r?\n\r?\n)', data)
+        match = re.match(br'(.*?\r?\n\r?\n)', data, re.DOTALL)

         if not match:
             return