Levenshtein distance in fulltext search with letters ø, å etc.


#1

Enonic version: 6.14.3
OS: Linux

Hi!

I have a question about Levenshtein distance in fulltext search. We have a query like:

fulltext('data.*', 'XXX~2', 'AND')

where XXX is the search word. For example, we have an object with the title (data.title) ‘Enonic’, and we want to return this object when a user searches for ‘Enon’. Everything works fine for words in basic Latin characters. However, it doesn’t work for the word ‘Grønnsaker’ when the user searches for ‘Grønnsak’. Is this a limitation of Elasticsearch - https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#fuzziness ?

Thanks!
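
As a sanity check on the two examples, both search words are within two edits of the indexed word, so the ~2 distance on its own should not rule out a match. A minimal sketch of the plain edit-distance calculation follows; the function name `levenshtein` is hypothetical, and this is not how Enonic or Elasticsearch computes fuzziness internally.

```javascript
// Plain Levenshtein distance: the minimum number of single-character
// insertions, deletions, or substitutions needed to turn a into b.
function levenshtein(a, b) {
  var prev = [];
  var j;
  // Distance from the empty prefix of a to each prefix of b.
  for (j = 0; j <= b.length; j++) {
    prev[j] = j;
  }
  for (var i = 1; i <= a.length; i++) {
    var curr = [i];
    for (j = 1; j <= b.length; j++) {
      var cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

levenshtein('Enon', 'Enonic');         // 2
levenshtein('Grønnsak', 'Grønnsaker'); // 2
```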


#2

Hi.

It seems there is a bug when the fuzzy operator is applied to the phrase. All special characters should be ASCII-folded at both index time and query time, meaning that grønnsaker should be indexed and queried as “gronnsaker”.

I'll create an issue for this. In the meantime, you could either use the ngram query in addition to fulltext (to match on word beginnings, though it obviously won't cover everything the fuzzy operator does), or, as a workaround, do the ASCII folding yourself on the search phrase (e.g. https://github.com/mplatt/fold-to-ascii).
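
The second workaround could be sketched roughly as below. This is only an illustration and not the fold-to-ascii library: `FOLD_MAP`, `foldToAscii`, and `buildFuzzyQuery` are hypothetical names, the table covers just a handful of Nordic letters, and it assumes the index folds ø to o, å to a, and æ to ae, in line with the grønnsaker/gronnsaker example above.

```javascript
// Hypothetical, hand-written folding table for a few Nordic letters.
// Assumption: these mappings match what the index-time folding produces.
var FOLD_MAP = {
  'ø': 'o', 'Ø': 'O',
  'å': 'a', 'Å': 'A',
  'æ': 'ae', 'Æ': 'AE',
  'ö': 'o', 'Ö': 'O',
  'ä': 'a', 'Ä': 'A'
};

// Replaces every character found in FOLD_MAP and leaves the rest untouched.
function foldToAscii(text) {
  var out = '';
  for (var i = 0; i < text.length; i++) {
    var ch = text.charAt(i);
    out += Object.prototype.hasOwnProperty.call(FOLD_MAP, ch) ? FOLD_MAP[ch] : ch;
  }
  return out;
}

// Builds the fulltext expression from the first post with a folded term.
// Note: quotes inside the term are not escaped in this sketch.
function buildFuzzyQuery(term, distance) {
  return "fulltext('data.*', '" + foldToAscii(term) + '~' + distance + "', 'AND')";
}

buildFuzzyQuery('Grønnsak', 2); // "fulltext('data.*', 'Gronnsak~2', 'AND')"
```

A complete folding table, such as the one in the linked library, would be the safer choice for real input, since any character missing from the map passes through unchanged.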

Issue here: https://github.com/enonic/xp/issues/6496


#3

Thanks! We will wait for a release with the fix :slight_smile:


#4

Hi!

I have also discovered that when my query contains an underscore “_”, the Levenshtein distance does not work. For example, if I have the query

fulltext('data.*', 'XXX_YYY~2', 'AND')

and an object has a field with the value “XXX_YYY”, my search returns nothing. If I remove the Levenshtein distance ("~2") from the query, it works.

Am I doing something wrong, or is this also related to the bug you described earlier?

Thanks!