Add Xapian Omega solution to haystack backend to fix long term issues #181

Closed
wants to merge 2 commits into from
Changes from 1 commit
31 changes: 30 additions & 1 deletion xapian_backend.py
@@ -23,6 +23,10 @@
 NGRAM_MIN_LENGTH = 2
 NGRAM_MAX_LENGTH = 15

+LONG_TERM = re.compile(br'[^\s]{239,}')
+LONG_TERM_METHOD = getattr(settings, 'XAPIAN_LONG_TERM_METHOD', 'truncate')
+LONG_TERM_LENGTH = getattr(settings, 'XAPIAN_LONG_TERM_LENGTH', 240)
+
 try:
     import xapian
 except ImportError:
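For orientation, the two settings read above could be set in a Django project roughly as follows. This is an illustrative fragment, not part of the PR: the setting names come from the diff, while the chosen values are only examples. Because the patched `_to_xapian_term` checks `if LONG_TERM_METHOD:`, a falsy value such as `None` appears to switch the behaviour off.

```python
# settings.py (illustrative fragment; values are examples, not recommendations)

# 'truncate' is the default in the diff; 'hash' replaces each over-long run
# with its SHA-224 hex digest; a falsy value such as None skips the step.
XAPIAN_LONG_TERM_METHOD = 'hash'

# Byte length kept by the 'truncate' method (default 240 in the diff).
# The 'hash' method does not use this value.
XAPIAN_LONG_TERM_LENGTH = 240
```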
@@ -1627,8 +1631,33 @@ def _to_xapian_term(term):
     Converts a Python type to a
     Xapian term that can be indexed.
     """
-    return force_text(term).lower()
+    value = force_text(term).lower()
+    if LONG_TERM_METHOD:
+        value = _ensure_term_length(value)
+    return value

+def _ensure_term_length(text):
+    """
+    Ensures that terms are not too long. This helps protect against long URLs
+    and CJK terms, which Xapian does not tokenise (and so are unsupported).
+    """
+    # We must operate on bytes, not unicode, because Xapian's term limit is
+    # a byte length limit, not a character count.
+    text = text.encode('utf8')
+
+    for match in reversed(list(LONG_TERM.finditer(text))):
+        hole = text[match.start():match.end()]
+        # There are two options available in Xapian's Omega project. We re-create
+        # these two options here using Python code.
+        if LONG_TERM_METHOD == 'truncate':
+            hole = hole[:LONG_TERM_LENGTH]
+        elif LONG_TERM_METHOD == 'hash':
+            from hashlib import sha224
+            hole = sha224(hole).hexdigest().encode('ascii')  # hole is already bytes
+        text = text[:match.start()] + hole + text[match.end():]
+
+    # Ignore decode errors: truncation may have cut a multi-byte character in half.
+    return text.decode('utf8', 'ignore')

 def _from_xapian_value(value, field_type):
     """
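The byte-level handling in `_ensure_term_length` is easier to follow in isolation. Below is a minimal standalone sketch of the same idea, assuming Python 3 and taking the method and length as function arguments instead of Django settings. The name `ensure_term_length` and its parameters are hypothetical and not part of the PR. The regex and the two methods mirror the diff.

```python
import re
from hashlib import sha224

# Same pattern as the diff: runs of 239 or more non-whitespace bytes.
LONG_TERM = re.compile(br'[^\s]{239,}')


def ensure_term_length(text, method='truncate', length=240):
    """Shorten over-long whitespace-delimited runs in `text` (a str).

    Hypothetical helper for illustration; not the PR's function.
    """
    data = text.encode('utf8')
    # Walk matches right to left so earlier offsets stay valid after edits.
    for match in reversed(list(LONG_TERM.finditer(data))):
        hole = match.group(0)
        if method == 'truncate':
            hole = hole[:length]
        elif method == 'hash':
            # sha224 takes bytes; the hex digest is 56 ASCII characters.
            hole = sha224(hole).hexdigest().encode('ascii')
        data = data[:match.start()] + hole + data[match.end():]
    # A multi-byte character cut by truncation is dropped instead of raising.
    return data.decode('utf8', 'ignore')


if __name__ == '__main__':
    url = 'http://example.com/' + 'a' * 300
    cjk = '\u4e2d' * 100  # 100 characters, 300 bytes in UTF-8
    print(len(ensure_term_length(url)))                 # 240
    print(len(ensure_term_length(cjk, length=241)))     # 80
    print(len(ensure_term_length(url, method='hash')))  # 56
```

The CJK case shows why the limit is applied to bytes: 241 bytes hold 80 complete three-byte characters plus one stray byte, and `decode('utf8', 'ignore')` drops that stray byte.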