Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement XML parsing using stdlib #155

Merged

Conversation

HarshNarayanJha
Copy link
Contributor

@HarshNarayanJha HarshNarayanJha commented Nov 2, 2024

This PR replaces github.com/clbanning/mxj/v2 and uses encoding/xml xml.Decoder to parse xml and extract urls within.

EDIT: All tests pass now. The PR is complete

Co-Author: @yzqzss

Only two tests fail, I am trying to fix those

Tests
go: downloading git.archive.org/wb/gocrawlhq v1.2.13
internal/pkg/crawl/config.go:11:2: unrecognized import path "git.archive.org/wb/gocrawlhq": https fetch: Get "https://git.archive.org/wb/gocrawlhq?go-get=1": dial tcp 207.241.235.124:443: i/o timeout
=== RUN   TestJSON
=== RUN   TestJSON/Valid_JSON_with_URLs
=== RUN   TestJSON/Invalid_JSON
=== RUN   TestJSON/JSON_with_no_URLs
=== RUN   TestJSON/JSON_with_URLs_in_various_fields
=== RUN   TestJSON/JSON_with_array_of_URLs
--- PASS: TestJSON (0.00s)
    --- PASS: TestJSON/Valid_JSON_with_URLs (0.00s)
    --- PASS: TestJSON/Invalid_JSON (0.00s)
    --- PASS: TestJSON/JSON_with_no_URLs (0.00s)
    --- PASS: TestJSON/JSON_with_URLs_in_various_fields (0.00s)
    --- PASS: TestJSON/JSON_with_array_of_URLs (0.00s)
=== RUN   TestXML
=== RUN   TestXML/Valid_XML_with_URLs
=== RUN   TestXML/Empty_XML
=== RUN   TestXML/Invalid_XML
=== RUN   TestXML/XML_with_invalid_URL
=== RUN   TestXML/Huge_sitemap
    xml_test.go:88: XML() gotURLs count = 10000, want 100002
--- FAIL: TestXML (0.76s)
    --- PASS: TestXML/Valid_XML_with_URLs (0.00s)
    --- PASS: TestXML/Empty_XML (0.00s)
    --- PASS: TestXML/Invalid_XML (0.00s)
    --- PASS: TestXML/XML_with_invalid_URL (0.00s)
    --- FAIL: TestXML/Huge_sitemap (0.66s)
=== RUN   TestXMLBodyReadError
    xml_test.go:127: XML() expected error, got nil
--- FAIL: TestXMLBodyReadError (0.00s)
FAIL
FAIL	github.com/internetarchive/Zeno/internal/pkg/crawl/extractor	0.780s
FAIL

Closes #84

@CorentinB
Copy link
Collaborator

Hey, thank you, how is it going?

@HarshNarayanJha
Copy link
Contributor Author

HarshNarayanJha commented Nov 8, 2024

Not going well. I am not sure what am I failing to capture in that big XML file. Any ideas @CorentinB ?

if strings.HasPrefix(value.(string), "http") {
URL, err := url.Parse(value.(string))
switch tok := tok.(type) {
case xml.StartElement:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

		case xml.StartElement:
			startElement = tok
			currentNode = &LeafNode{Path: startElement.Name.Local}
			for _, attr := range tok.Attr {
				if strings.HasPrefix(attr.Value, "http") {
					parsedURL, err := url.Parse(attr.Value)
					if err == nil {
						URLs = append(URLs, parsedURL)
					}
				}
			}

this fixed the Huge sitemap test by extracting XML attributes.
Now, the URLs' size and content match the previous tests. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, opening tags also have urls in some cases.

Now the only left is TestXMLBodyReadError, gotta look at it

Copy link
Contributor

@yzqzss yzqzss Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the TestXMLBodyReadError test is invalid(?) since NopCloser certainly won't return an EOF error on xmlBody, err := io.ReadAll(resp.Body)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible, but they did pass previously. Sure, it does not return an EOF on read. For the test to pass, I had to decode the Token once and catch the error. (and seek back for the loop)

_, err = decoder.Token()
if err != nil {
	return nil, sitemap, err
}

// seek back to 0 if we are still here
reader.Seek(0, 0)
decoder = xml.NewDecoder(reader)

Catching this in the loop won't work cleanly, since I want to know if EOF was somewhere in-between the file (invalid XML), or at the start (this error)

I will push these changes

@HarshNarayanJha HarshNarayanJha marked this pull request as ready for review November 8, 2024 07:57
internal/pkg/crawl/extractor/xml.go Outdated Show resolved Hide resolved
internal/pkg/crawl/extractor/xml.go Outdated Show resolved Hide resolved
@yzqzss yzqzss mentioned this pull request Nov 9, 2024
@CorentinB CorentinB added the enhancement New feature or request label Nov 10, 2024
@CorentinB CorentinB requested a review from yzqzss November 10, 2024 09:12
Copy link
Contributor

@yzqzss yzqzss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM :)

@CorentinB CorentinB merged commit 803fe6a into internetarchive:main Nov 11, 2024
1 check passed
@CorentinB
Copy link
Collaborator

Thanks guys!

@HarshNarayanJha
Copy link
Contributor Author

You're welcome

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace github.com/clbanning/mxj/v2 with stdlib
3 participants