Implement XML parsing using stdlib #155

Merged
44 changes: 31 additions & 13 deletions internal/pkg/crawl/extractor/xml.go
@@ -1,12 +1,12 @@
 package extractor
 
 import (
+	"bytes"
+	"encoding/xml"
+	"io"
 	"net/http"
 	"net/url"
 	"strings"
-
-	"github.com/clbanning/mxj/v2"
 )
 
 func XML(resp *http.Response) (URLs []*url.URL, sitemap bool, err error) {
@@ -19,25 +19,43 @@ func XML(resp *http.Response) (URLs []*url.URL, sitemap bool, err error) {
 		sitemap = true
 	}
 
-	mv, err := mxj.NewMapXml(xmlBody)
+	reader := bytes.NewReader(xmlBody)
+	decoder := xml.NewDecoder(reader)
+
+	// try to decode one token to see if stream is open
+	_, err = decoder.Token()
 	if err != nil {
 		return nil, sitemap, err
 	}
 
-	// Try to find if it's a sitemap
-	for _, node := range mv.LeafNodes() {
-		if strings.Contains(node.Path, "sitemap") {
-			sitemap = true
-		}
-	}
-
-	for _, value := range mv.LeafValues() {
-		if _, ok := value.(string); ok {
-			if strings.HasPrefix(value.(string), "http") {
-				URL, err := url.Parse(value.(string))
+	// seek back to 0 if we are still here
+	reader.Seek(0, 0)
+	decoder = xml.NewDecoder(reader)
+
+	for {
+		tok, err := decoder.Token()
+		if err == io.EOF {
+			break
+		}
+		if err != nil {
+			return nil, sitemap, err
+		}
+
+		switch tok := tok.(type) {
+		case xml.StartElement:
Contributor:
		case xml.StartElement:
			startElement = tok
			currentNode = &LeafNode{Path: startElement.Name.Local}
			for _, attr := range tok.Attr {
				if strings.HasPrefix(attr.Value, "http") {
					parsedURL, err := url.Parse(attr.Value)
					if err == nil {
						URLs = append(URLs, parsedURL)
					}
				}
			}

This fixed the huge sitemap test by extracting URLs from XML attributes.
Now the extracted URLs' count and content match the previous tests. :)
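The attribute-extraction idea from this comment can be sketched in isolation. This is a minimal, self-contained version (the helper name `attrURLs` is hypothetical, not the PR's actual code): it walks the token stream with `encoding/xml` and collects every attribute value that looks like an absolute http(s) URL.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"net/url"
	"strings"
)

// attrURLs is a hypothetical helper: it scans an XML token stream and
// collects every attribute value that starts with "http".
func attrURLs(body string) []*url.URL {
	var urls []*url.URL
	decoder := xml.NewDecoder(strings.NewReader(body))
	for {
		tok, err := decoder.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return urls // stop on malformed XML, keep what we have
		}
		if start, ok := tok.(xml.StartElement); ok {
			for _, attr := range start.Attr {
				if strings.HasPrefix(attr.Value, "http") {
					if u, err := url.Parse(attr.Value); err == nil {
						urls = append(urls, u)
					}
				}
			}
		}
	}
	return urls
}

func main() {
	// Sitemap-style entries sometimes carry URLs in attributes rather than text.
	body := `<links><link href="https://example.com/a"/><link href="https://example.com/b"/></links>`
	for _, u := range attrURLs(body) {
		fmt.Println(u) // prints both example.com URLs
	}
}
```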

Contributor Author:
Right, opening tags also carry URLs in some cases.

Now the only failure left is TestXMLBodyReadError; I'll look at it.

Contributor (@yzqzss, Nov 8, 2024):
I think the TestXMLBodyReadError test is invalid(?), since NopCloser certainly won't return an EOF error on `xmlBody, err := io.ReadAll(resp.Body)`.

Contributor Author:
Possible, but they did pass previously. Sure, it does not return an EOF on read. For the test to pass, I had to decode one token and catch the error (and seek back for the loop):

_, err = decoder.Token()
if err != nil {
	return nil, sitemap, err
}

// seek back to 0 if we are still here
reader.Seek(0, 0)
decoder = xml.NewDecoder(reader)

Catching this in the loop won't work cleanly, since I want to know whether the EOF occurred somewhere in the middle of the file (invalid XML) or right at the start (this error).

I will push these changes
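The probe-then-rewind pattern described above can be demonstrated on its own. This is a hedged sketch (the function name `probeXML` is hypothetical): decode a single token to surface an error at the very start of the body, then seek back to offset 0 and create a fresh decoder for the real loop, where `io.EOF` now means a clean end of document.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
)

// probeXML sketches the pattern from the comment above: decode one token
// to catch start-of-stream errors, then rewind and re-decode everything.
func probeXML(body []byte) error {
	reader := bytes.NewReader(body)
	decoder := xml.NewDecoder(reader)

	// Probe: an empty or unreadable body fails here, before the main loop.
	if _, err := decoder.Token(); err != nil {
		return fmt.Errorf("probe failed: %w", err)
	}

	// Seek back to 0 and start over with a fresh decoder.
	reader.Seek(0, io.SeekStart)
	decoder = xml.NewDecoder(reader)

	for {
		_, err := decoder.Token()
		if err == io.EOF {
			return nil // clean end of document
		}
		if err != nil {
			return fmt.Errorf("mid-document error: %w", err)
		}
	}
}

func main() {
	fmt.Println(probeXML([]byte{}))             // probe fails: EOF at the very start
	fmt.Println(probeXML([]byte(`<a></a>`)))    // nil: well-formed document
	fmt.Println(probeXML([]byte(`<a><b></a>`))) // mid-document error: mismatched tags
}
```

This makes the distinction the author wants observable: an empty body errors in the probe, while truncated or mismatched XML errors inside the loop.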

+			for _, attr := range tok.Attr {
+				if strings.HasPrefix(attr.Value, "http") {
+					parsedURL, err := url.Parse(attr.Value)
+					if err == nil {
+						URLs = append(URLs, parsedURL)
+					}
+				}
+			}
+		case xml.CharData:
+			if strings.HasPrefix(string(tok), "http") {
+				parsedURL, err := url.Parse(string(tok))
 				if err == nil {
-					URLs = append(URLs, URL)
+					URLs = append(URLs, parsedURL)
 				}
 			}
 		}
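Putting both cases together, here is a self-contained sketch of the final token loop. It mirrors the diff's logic (URLs from both element attributes and character data) but uses a hypothetical standalone name, `extractURLs`, instead of the repo's `XML` function:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"net/url"
	"strings"
)

// extractURLs mirrors the diff's token loop: it collects http(s) URLs
// from both element attributes and character data.
func extractURLs(body string) ([]*url.URL, error) {
	var urls []*url.URL
	decoder := xml.NewDecoder(strings.NewReader(body))
	for {
		tok, err := decoder.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		switch tok := tok.(type) {
		case xml.StartElement:
			for _, attr := range tok.Attr {
				if strings.HasPrefix(attr.Value, "http") {
					if u, err := url.Parse(attr.Value); err == nil {
						urls = append(urls, u)
					}
				}
			}
		case xml.CharData:
			if strings.HasPrefix(string(tok), "http") {
				if u, err := url.Parse(string(tok)); err == nil {
					urls = append(urls, u)
				}
			}
		}
	}
	return urls, nil
}

func main() {
	body := `<urlset><url><loc>https://example.com/page</loc></url><link href="https://example.com/feed"/></urlset>`
	urls, _ := extractURLs(body)
	for _, u := range urls {
		fmt.Println(u) // prints the <loc> text URL, then the href attribute URL
	}
}
```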