Levenshtein distance in fulltext search with letters ø, å etc.

Enonic version: 6.14.3
OS: Linux

Hi !

I have a question about Levenshtein distance in fulltext search. We have a query like:

fulltext('data.*', 'XXX~2', 'AND')

where XXX is the search word. For example, we have an object with the title (data.title) ‘Enonic’, and we want to return this object when the user searches for ‘Enon’. Everything works fine for words in plain Latin characters. However, it doesn’t work for the word ‘Grønnsaker’ when the user searches for ‘Grønnsak’. Is this a limitation of Elasticsearch (https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#fuzziness)?
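For reference, here is a minimal JavaScript sketch of the edit distance between the two pairs of words. It is not Enonic or Elasticsearch code, and `levenshtein` is only an illustrative helper that compares strings by Unicode code point. Both pairs come out two edits apart, so the distance itself should not explain the different behaviour.

```javascript
// Classic dynamic-programming Levenshtein distance (illustrative helper,
// not part of Enonic XP or Elasticsearch). Strings are compared by code point.
function levenshtein(a, b) {
  const s = Array.from(a);
  const t = Array.from(b);
  // prev[j] = distance between the first (i - 1) characters of s
  // and the first j characters of t.
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution or match
      );
    }
    prev = curr;
  }
  return prev[t.length];
}

console.log(levenshtein('Enon', 'Enonic'));         // 2
console.log(levenshtein('Grønnsak', 'Grønnsaker')); // 2
```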

Thanks !

Hi.

Seems like there is a bug when applying the fuzzy operator to the phrase. All special characters should be ASCII-folded at both index time and query time, meaning that grønnsaker should be indexed and queried as “gronnsaker”.

I'll create an issue for this. In the meantime, you could either use the ngram query in addition to the fulltext query (to get a match on word beginnings, though it obviously won't cover everything the fuzzy operator does), or, as a workaround, do the ASCII folding yourself on the search phrase (e.g. https://github.com/mplatt/fold-to-ascii).
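A rough sketch of both workarounds, in plain JavaScript. `foldToAscii` and `EXTRA_FOLDS` are hypothetical names, and the small mapping table is an assumption that may not match exactly what the index does. The code relies on `String.prototype.normalize`, which may not be available in every server-side JavaScript runtime, in which case a library such as the one linked above would be needed. The search phrase is inserted into the query without any escaping.

```javascript
// Assumed mapping for letters that Unicode normalization does not decompose.
// Deliberately small; not a complete ASCII-folding table.
const EXTRA_FOLDS = { 'ø': 'o', 'Ø': 'O', 'æ': 'ae', 'Æ': 'AE' };

// Hypothetical helper: fold a search phrase to ASCII.
function foldToAscii(text) {
  return Array.from(
    text
      .normalize('NFD')                 // split e.g. å into a + combining ring
      .replace(/[\u0300-\u036f]/g, '')  // drop the combining marks
  )
    .map((ch) => EXTRA_FOLDS[ch] || ch) // handle ø, æ and friends explicitly
    .join('');
}

const phrase = 'Grønnsak';

// Workaround 1: fold the phrase yourself before applying the fuzzy operator.
const fuzzyQuery = "fulltext('data.*', '" + foldToAscii(phrase) + "~2', 'AND')";

// Workaround 2: combine fulltext with ngram to match on word beginnings.
const ngramQuery =
  "fulltext('data.*', '" + phrase + "', 'AND') OR " +
  "ngram('data.*', '" + phrase + "', 'AND')";

console.log(fuzzyQuery); // fulltext('data.*', 'Gronnsak~2', 'AND')
console.log(ngramQuery);
```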

Issue here: https://github.com/enonic/xp/issues/6496

Thanks! We will wait for a release with the fix :)

Hi!

I have also discovered that when my query contains an underscore “_”, the Levenshtein distance does not work. For example, if I have the query

fulltext('data.*', 'XXX_YYY~2', 'AND')

and an object has a field with the value “XXX_YYY”, my search returns nothing. If I remove the Levenshtein distance (“~2”) from the query, it works.

Am I doing something wrong, or is this also related to the bug you described earlier?

Thanks!

Any chance you could prioritize this issue? There also seems to be a similar problem when appending an asterisk * to invoke a prefix query.

(With your 7.1 release with highlighting I revisited a few of my search functions, hoping that the fulltext() function now was behaving better …)

We will prioritise this issue in the next sprint.

A bit more info on this one. The issue with fuzzy operators is a known bug in Elasticsearch; see the discussion on their forum and their internal discussion in a GitHub issue, where they first deprecated and then undeprecated the fuzzy operator. We could rewrite our analyzer function to handle such queries, but it’s quite a bit of work. We are not dropping it, just saying that a fix is not coming soon.
The good news is that the second part of the issue (wildcards preventing analysis of search results) seems to be easier to fix. I created a separate issue for that: https://github.com/enonic/xp/issues/7569.