-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement XML parsing using stdlib #155
Implement XML parsing using stdlib #155
Conversation
…/xml to parse xml
Hey, thank you, how is it going? |
Not going well. I am not sure what am I failing to capture in that big XML file. Any ideas @CorentinB ? |
if strings.HasPrefix(value.(string), "http") { | ||
URL, err := url.Parse(value.(string)) | ||
switch tok := tok.(type) { | ||
case xml.StartElement: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case xml.StartElement:
startElement = tok
currentNode = &LeafNode{Path: startElement.Name.Local}
for _, attr := range tok.Attr {
if strings.HasPrefix(attr.Value, "http") {
parsedURL, err := url.Parse(attr.Value)
if err == nil {
URLs = append(URLs, parsedURL)
}
}
}
this fixed the Huge sitemap test by extracting XML attributes.
Now, the URLs' size and content match the previous tests. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, opening tags also have urls in some cases.
Now the only left is TestXMLBodyReadError
, gotta look at it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the TestXMLBodyReadError
test is invalid(?) since NopCloser
certainly won't return an EOF error on xmlBody, err := io.ReadAll(resp.Body)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possible, but they did pass previously. Sure, it does not return an EOF on read. For the test to pass, I had to decode the Token once and catch the error. (and seek back for the loop)
_, err = decoder.Token()
if err != nil {
return nil, sitemap, err
}
// seek back to 0 if we are still here
reader.Seek(0, 0)
decoder = xml.NewDecoder(reader)
Catching this in the loop won't work cleanly, since I want to know if EOF was somewhere in-between the file (invalid XML), or at the start (this error)
I will push these changes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM :)
Thanks guys! |
You're welcome |
This PR replaces
github.com/clbanning/mxj/v2
and usesencoding/xml
xml.Decoder
to parse xml and extract urls within.EDIT: All tests pass now. The PR is complete
Co-Author: @yzqzss
Only two tests fail, I am trying to fix those
Tests
Closes #84