can't iterate over images with xslt elements split_text_from_file function in #27870

viclizki · 2024-11-03T18:25:57Z

Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Example Code

../python3.11/site-packages/langchain_text_splitters/html.py

    def split_text_from_file(self, file: Any) -> List[Document]:
        """Split HTML file

        Args:
            file: HTML file
        """
        try:
            from lxml import etree
        except ImportError as e:
            raise ImportError(
                "Unable to import lxml, please install with `pip install lxml`."
            ) from e

        # Use lxml library to parse HTML document and return XML ElementTree
        parser = etree.HTMLParser(encoding="utf-8")
        tree = etree.parse(file, parser)

        # Document transformation for "structure-aware" chunking is handled with XSLT.
        xslt_path = pathlib.Path(__file__).parent / "xsl/html_chunks_with_headers.xslt"
        xslt_tree = etree.parse(xslt_path)
        transform = etree.XSLT(xslt_tree)
        result = transform(tree)
        result_dom = etree.fromstring(str(result))

../langchain_text_splitters/xsl/html_chunks_with_headers.xslt:
	<xsl:template match="*">
		<xsl:choose>
			<!-- tags of interest get serialized into the filtered tree (and recurse down child elements) -->
			<xsl:when test="contains(
				concat('|', $tags, '|'),
				concat('|', local-name(), '|'))">
			
				<xsl:variable name="xpath">
					<xsl:apply-templates mode="xpath" select="." />
				</xsl:variable>
				<xsl:variable name="txt">
					<!-- recurse down child text-nodes and elements -->
					<xsl:apply-templates mode="text" />
				</xsl:variable>
				<xsl:variable name="txt-norm" select="normalize-space($txt)" />
				
				<div title="{$xpath}">
					
					<small class="xpath">
						<xsl:value-of select="$xpath" />
					</small>

                      
                    <xsl:for-each select="img">
                        <span class="img-src">
                            <xsl:value-of select="@src" />
                        </span>
                    </xsl:for-each>
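
A note on the `select="img"` expression above: a bare element name in XPath is a relative path on the child axis, so it only picks up `<img>` elements that are direct children of the element the template matched. Images nested deeper (for example inside an `<a>` or `<figure>`) would need `.//img`. This may be one reason the loop yields nothing, though that is not confirmed here. The sketch below shows the difference using the standard library's `xml.etree.ElementTree` on a made-up fragment; the fragment and file names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one <img> is a direct child of <div>,
# the other is nested one level deeper inside <p>.
fragment = ET.fromstring(
    '<div><img src="a.png"/><p>text <img src="b.png"/></p></div>'
)

# "img" is a relative path: direct children of the context node only.
direct = [img.get("src") for img in fragment.findall("img")]

# ".//img" walks every descendant of the context node.
nested = [img.get("src") for img in fragment.findall(".//img")]

print(direct)  # ['a.png']
print(nested)  # ['a.png', 'b.png']
```

If the same holds in the stylesheet, changing the loop to `<xsl:for-each select=".//img">` would be the thing to try.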

Description

I added this to ../langchain_text_splitters/xsl/html_chunks_with_headers.xslt:

<xsl:for-each select="img">
    <xsl:value-of select="@src" />
</xsl:for-each>

But I still can't get any image sources out of split_text_from_file:

    # Build list of elements from DOM
    elements = []
    for element in result_dom.findall("*//*", ns_map):
        img_srcs = [img.text for img in element.findall(".//span[@class='img-src']", ns_map)]
        if img_srcs:
            print(img_srcs)
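
One thing worth checking in the lookup above is namespaces. If the stylesheet emits its elements in a default namespace (assumed here to be XHTML, `http://www.w3.org/1999/xhtml`; the excerpt does not show the stylesheet root), then an unprefixed `span` in `findall` looks for a `span` in no namespace and matches nothing, even when `ns_map` is passed. The sketch below reproduces that with the standard library's `xml.etree.ElementTree` in place of lxml. The `{"h": ...}` mapping and the sample output document are assumptions, and lxml's `findall` should be verified separately.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
ns_map = {"h": XHTML}  # assumed prefix mapping, not taken from the post

# Hypothetical transform output: every element sits in the XHTML namespace.
result_dom = ET.fromstring(
    f'<html xmlns="{XHTML}"><body><div>'
    '<span class="img-src">a.png</span>'
    "</div></body></html>"
)

# Unprefixed name: looks for <span> in no namespace, so nothing matches.
unprefixed = result_dom.findall(".//span[@class='img-src']", ns_map)

# Prefixed name: resolved through ns_map to the XHTML namespace.
prefixed = result_dom.findall(".//h:span[@class='img-src']", ns_map)

# Wildcard: matches any element name, filtering on the class attribute only.
wildcard = result_dom.findall(".//*[@class='img-src']")

print(len(unprefixed))               # 0
print([s.text for s in prefixed])    # ['a.png']
print([s.text for s in wildcard])    # ['a.png']
```

Under these assumptions, either `.//h:span[@class='img-src']` or `.//*[@class='img-src']` would find the spans that the unprefixed path misses.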

System Info

name = "langchain-text-splitters"
version = "0.3.0"

Replies: 0 comments
