Checked other resources
I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.
Commit to Help
I commit to help with one of those options 👆
Example Code
../python3.11/site-packages/langchain_text_splitters/html.py

def split_text_from_file(self, file: Any) -> List[Document]:
    """Split HTML file

    Args:
        file: HTML file
    """
    try:
        from lxml import etree
    except ImportError as e:
        raise ImportError(
            "Unable to import lxml, please install with `pip install lxml`."
        ) from e
    # Use lxml library to parse HTML document and return XML ElementTree
    parser = etree.HTMLParser(encoding="utf-8")
    tree = etree.parse(file, parser)

    # Document transformation for "structure-aware" chunking is handled with XSLT.
    xslt_path = pathlib.Path(__file__).parent / "xsl/html_chunks_with_headers.xslt"
    xslt_tree = etree.parse(xslt_path)
    transform = etree.XSLT(xslt_tree)
    result = transform(tree)
    result_dom = etree.fromstring(str(result))
../langchain_text_splitters/xsl/html_chunks_with_headers.xslt:

<xsl:template match="*">
  <xsl:choose>
    <!-- tags of interest get serialized into the filtered tree (and recurse down child elements) -->
    <xsl:when test="contains(
        concat('|', $tags, '|'),
        concat('|', local-name(), '|'))">
      <xsl:variable name="xpath">
        <xsl:apply-templates mode="xpath" select="." />
      </xsl:variable>
      <xsl:variable name="txt">
        <!-- recurse down child text-nodes and elements -->
        <xsl:apply-templates mode="text" />
      </xsl:variable>
      <xsl:variable name="txt-norm" select="normalize-space($txt)" />
      <div title="{$xpath}">
        <small class="xpath"><xsl:value-of select="$xpath" /></small>
        <xsl:for-each select="img">
          <span class="img-src"><xsl:value-of select="@src" /></span>
        </xsl:for-each>
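One thing that may be worth checking here: in XPath, `select="img"` is shorthand for `child::img`, so the `xsl:for-each` above only visits `img` elements that are direct children of the matched tag. The sketch below illustrates the same child-versus-descendant distinction with the standard library's `xml.etree.ElementTree` instead of lxml/XSLT. The HTML fragment and file names are hypothetical, and whether this is the actual cause in the splitter is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: one image is a direct child of <div>,
# the other is nested inside <p><a>...</a></p>.
fragment = ET.fromstring(
    "<div>"
    "<img src='direct.png'/>"
    "<p><a href='#'><img src='nested.png'/></a></p>"
    "</div>"
)

# "img" means child::img, so only direct children of <div> match.
direct_only = [img.get("src") for img in fragment.findall("img")]

# ".//img" walks every descendant of the context node.
all_descendants = [img.get("src") for img in fragment.findall(".//img")]

print(direct_only)      # ['direct.png']
print(all_descendants)  # ['direct.png', 'nested.png']
```

If the images in the source HTML sit inside nested inline elements, a descendant path such as `select=".//img"` in the stylesheet would be the analogous change to try.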
Description
I added this to ../langchain_text_splitters/xsl/html_chunks_with_headers.xslt:
<xsl:for-each select="img">
<xsl:value-of select="@src" />
</xsl:for-each>
but I still can't get any images in split_text_from_file:

# Build list of elements from DOM
elements = []
for element in result_dom.findall("*//*", ns_map):
    img_srcs = [img.text for img in element.findall(".//span[@class='img-src']", ns_map)]
    if img_srcs:
        print(img_srcs)
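A second thing to compare is the markup the lookup expects against what the stylesheet emits. The loop searches for `span` elements with `class='img-src'`, while the XSLT snippet in this description writes the `@src` value without a wrapping `span`. Below is a minimal stdlib sketch of that lookup against two hypothetical transform outputs. The element contents are made up for illustration, `img_srcs_for` is a hypothetical helper name, and the real result DOM may also carry a namespace that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# Hypothetical transform outputs: one wraps the src in the span that the
# Python lookup expects, the other emits the bare text.
wrapped = ET.fromstring("<div><span class='img-src'>a.png</span></div>")
bare = ET.fromstring("<div>a.png</div>")


def img_srcs_for(element):
    """Collect the text of every descendant <span class='img-src'>."""
    return [span.text for span in element.findall(".//span[@class='img-src']")]


print(img_srcs_for(wrapped))  # ['a.png']
print(img_srcs_for(bare))     # []
```

With the bare form the list is empty, so the `if img_srcs:` branch never prints anything even though the image path is present in the output text.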
System Info
name = "langchain-text-splitters"
version = "0.3.0"