Implement XML parsing using stdlib #155

HarshNarayanJha · 2024-11-02T04:51:28Z

This PR replaces github.com/clbanning/mxj/v2 and uses encoding/xml xml.Decoder to parse xml and extract urls within.

EDIT: All tests pass now. The PR is complete

Co-Author: @yzqzss

Originally only two tests were failing (log below), and I was trying to fix those.

Tests

go: downloading git.archive.org/wb/gocrawlhq v1.2.13
internal/pkg/crawl/config.go:11:2: unrecognized import path "git.archive.org/wb/gocrawlhq": https fetch: Get "https://git.archive.org/wb/gocrawlhq?go-get=1": dial tcp 207.241.235.124:443: i/o timeout
=== RUN   TestJSON
=== RUN   TestJSON/Valid_JSON_with_URLs
=== RUN   TestJSON/Invalid_JSON
=== RUN   TestJSON/JSON_with_no_URLs
=== RUN   TestJSON/JSON_with_URLs_in_various_fields
=== RUN   TestJSON/JSON_with_array_of_URLs
--- PASS: TestJSON (0.00s)
    --- PASS: TestJSON/Valid_JSON_with_URLs (0.00s)
    --- PASS: TestJSON/Invalid_JSON (0.00s)
    --- PASS: TestJSON/JSON_with_no_URLs (0.00s)
    --- PASS: TestJSON/JSON_with_URLs_in_various_fields (0.00s)
    --- PASS: TestJSON/JSON_with_array_of_URLs (0.00s)
=== RUN   TestXML
=== RUN   TestXML/Valid_XML_with_URLs
=== RUN   TestXML/Empty_XML
=== RUN   TestXML/Invalid_XML
=== RUN   TestXML/XML_with_invalid_URL
=== RUN   TestXML/Huge_sitemap
    xml_test.go:88: XML() gotURLs count = 10000, want 100002
--- FAIL: TestXML (0.76s)
    --- PASS: TestXML/Valid_XML_with_URLs (0.00s)
    --- PASS: TestXML/Empty_XML (0.00s)
    --- PASS: TestXML/Invalid_XML (0.00s)
    --- PASS: TestXML/XML_with_invalid_URL (0.00s)
    --- FAIL: TestXML/Huge_sitemap (0.66s)
=== RUN   TestXMLBodyReadError
    xml_test.go:127: XML() expected error, got nil
--- FAIL: TestXMLBodyReadError (0.00s)
FAIL
FAIL	github.com/internetarchive/Zeno/internal/pkg/crawl/extractor	0.780s
FAIL

Closes #84

CorentinB · 2024-11-06T21:27:09Z

Hey, thank you, how is it going?

HarshNarayanJha · 2024-11-08T02:30:40Z

Not going well. I am not sure what I am failing to capture in that big XML file. Any ideas, @CorentinB?

yzqzss · 2024-11-08T04:49:53Z

internal/pkg/crawl/extractor/xml.go

-			if strings.HasPrefix(value.(string), "http") {
-				URL, err := url.Parse(value.(string))
+		switch tok := tok.(type) {
+		case xml.StartElement:


case xml.StartElement:
	startElement = tok
	currentNode = &LeafNode{Path: startElement.Name.Local}
	for _, attr := range tok.Attr {
		if strings.HasPrefix(attr.Value, "http") {
			parsedURL, err := url.Parse(attr.Value)
			if err == nil {
				URLs = append(URLs, parsedURL)
			}
		}
	}

This fixed the Huge sitemap test by also extracting URLs from XML attributes.
Now the number and content of the extracted URLs match the previous tests. :)

Right, opening tags also have urls in some cases.

Now the only one left is TestXMLBodyReadError; I'll look into it.

I think the TestXMLBodyReadError test is invalid(?) since NopCloser certainly won't return an EOF error on xmlBody, err := io.ReadAll(resp.Body)

Possible, but it did pass previously. Sure, it does not return an EOF on read. For the test to pass, I had to decode one token up front and catch the error (and then seek back for the loop).

_, err = decoder.Token()
if err != nil {
	return nil, sitemap, err
}
// seek back to 0 if we are still here
reader.Seek(0, 0)
decoder = xml.NewDecoder(reader)

Catching this in the loop won't work cleanly, since I want to know whether the EOF occurred somewhere in the middle of the file (invalid XML) or right at the start (this error).

I will push these changes

yzqzss

LGTM :)

CorentinB · 2024-11-11T15:55:16Z

Thanks guys!

HarshNarayanJha · 2024-11-11T15:58:56Z

You're welcome

feat: remove import from github.com/clbanning/mxj/v2 and use encoding/xml to parse xml

02d8c6a

yzqzss reviewed Nov 8, 2024

HarshNarayanJha and others added 2 commits November 8, 2024 13:16

fix: also parse urls from the attrs of the opening tags of xml

8d67026

Co-Authored-By: yzqzss <[email protected]>

fix: check if token decoding is valid or EOF

0898a6c

HarshNarayanJha requested a review from yzqzss November 8, 2024 07:57

HarshNarayanJha marked this pull request as ready for review November 8, 2024 07:57

yzqzss reviewed Nov 8, 2024

fix: directly use bytes reading for xmlBody

bd06065

yzqzss mentioned this pull request Nov 9, 2024

Better xml parsing #157

Merged

CorentinB assigned HarshNarayanJha Nov 10, 2024

CorentinB added the enhancement label Nov 10, 2024

CorentinB requested a review from yzqzss November 10, 2024 09:12

chore: remove unused leafnodes array and element tracking

f1f9575

yzqzss approved these changes Nov 10, 2024

CorentinB merged commit 803fe6a into internetarchive:main Nov 11, 2024
1 check passed
