ModeShape / MODE-865

Using 'jcr:contains' with a hyphen and wildcard in the full-text expression doesn't always work

      Description

      This started out as a StackOverflow question: http://stackoverflow.com/questions/3572258/

      When a 'jcr:contains' clause (in either XPath or JCR-SQL2) contains a hyphen, the query works in some situations but not others. For example, consider content that contains "4-speed" and "5-speed". A query containing this clause works:

      ... jcr:contains(//*,'"5-speed"') ...

      Even these queries that use a wildcard work:

      ... jcr:contains(//*,'"5-sp*"') ...

      or

      ... jcr:contains(//*,'"*-sp*"') ...

      However, consider content that contains "Sophie-Anne" and "Sophie-Allen". This query does work:

      ... jcr:contains(//*,'"sophia-anne"') ...

      while any queries that include the hyphen and a wildcard do not work:

      ... jcr:contains(//*,'"sophia-anne"') ...

      or

      ... jcr:contains(//*,'"sophia-a*"') ...

        Activity

        Randall Hauch
        added a comment -

        The problem appears to be in how the Lucene StandardAnalyzer handles hyphens. According to the JavaDoc [1], the StandardAnalyzer uses the StandardTokenizer, which will handle hyphens as follows: the tokenizer "Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split." [2]

        In the above examples, "5-speed" and "4-speed" are not split into separate tokens, while "Sophie-Allen" and "Sophie-Anne" are split into two tokens each. In the case where the query does not contain a wildcard, we're transforming the 'jcr:contains' full-text expression into a PhraseQuery using the tokenized form of the expression. Thus, the query works whether or not the full-text search expression contains the hyphen (since the same tokenizer is used for the content and the expression, the same split logic applies, resulting in the expected behavior).
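The hyphen rule quoted above can be illustrated with a small model. This is a simplified sketch in Python, not Lucene's StandardTokenizer: the names `tokenize` and `phrase_matches` are hypothetical, and the model assumes only letters, digits, and hyphens form tokens and that everything is lowercased.

```python
import re

def tokenize(text):
    """Simplified model of the quoted rule: a hyphenated token that
    contains a digit is kept whole, otherwise it is split at hyphens."""
    tokens = []
    for word in re.findall(r"[A-Za-z0-9-]+", text.lower()):
        if any(ch.isdigit() for ch in word):
            tokens.append(word)  # treated like a "product number"
        else:
            tokens.extend(part for part in word.split("-") if part)
    return tokens

def phrase_matches(content, expression):
    """Model of a phrase query: the tokenized expression must appear
    as consecutive tokens in the tokenized content."""
    doc, phrase = tokenize(content), tokenize(expression)
    if not phrase:
        return False
    return any(doc[i:i + len(phrase)] == phrase
               for i in range(len(doc) - len(phrase) + 1))
```

Because content and expression go through the same `tokenize` step, `phrase_matches("Queen Sophie-Anne arrived", "sophie-anne")` holds even though the hyphen never reaches the index.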

        However, when the full-text search expression does contain a wildcard, we're transforming this expression into a wildcard search (using a RegexQuery), and here we're not properly handling the occurrence of the hyphen. We're currently escaping it so that the hyphen remains in the Term passed into the RegexQuery. When the content contains a number in the token with the hyphen, Lucene does not split it and keeps the hyphen in the token, and thus the queries work. However, when there is no number in the token with the hyphen, the tokenizer removes the hyphen, so the RegexQuery fails to match.

        [1] http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/analysis/standard/StandardAnalyzer.html
        [2] http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/analysis/standard/StandardTokenizer.html
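The wildcard failure described in this comment can be reproduced with a rough model. The sketch below is an illustration under assumptions, not ModeShape's code: `wildcard_to_regex` and `matching_terms` are hypothetical names, and the two term lists are hand-written stand-ins for what the index would hold after tokenization.

```python
import re

def wildcard_to_regex(expression):
    """Translate '*' and '?' to regex and escape everything else,
    so a hyphen in the expression stays a literal hyphen."""
    parts = []
    for ch in expression:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def matching_terms(expression, terms):
    """Return the index terms that the whole pattern matches."""
    pattern = wildcard_to_regex(expression.lower())
    return [term for term in terms if pattern.fullmatch(term)]

# Assumed index terms: tokens with a digit keep their hyphen,
# tokens without a digit were split at the hyphen.
numeric_terms = ["4-speed", "5-speed"]
name_terms = ["sophie", "anne", "allen"]

print(matching_terms("5-sp*", numeric_terms))   # ['5-speed']
print(matching_terms("sophie-a*", name_terms))  # []
```

The second call finds nothing because no single term contains a hyphen, which mirrors the reported behavior.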

        Randall Hauch
        added a comment -

        Unfortunately, handling complex phrase queries with wildcards is not very easy because Lucene's PhraseQuery doesn't handle wildcards. However, there is a ComplexPhraseQueryParser (CPQP) in 'lucene-misc' (aka, Lucene Contrib) that does appear to work, although some modification is needed. Firstly, the wildcard query support in CPQP uses WildcardQuery even if the term begins with a wildcard, and in such cases this results in an exception. This was solved by overriding a method and using RegexQuery.

        Secondly, the tokenizer passed into the CPQP doesn't seem to be handling hyphens, so we first replace all hyphens (not preceded by or followed by a digit or '*' or '?'). Plus, since our 'jcr:contains' supports both the SQL wildcards (e.g., '%' and '_') and glob or Lucene wildcards (e.g., '*' and '?'), but the CPQP only understands Lucene wildcards, we need to replace all unescaped SQL wildcards with the appropriate Lucene wildcard.
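The two preprocessing steps might look roughly like the following sketch. Several details are assumptions rather than facts from this issue: the hypothetical `normalize` function replaces hyphens with a space, treats a backslash as the escape character, reads the hyphen rule as "neither preceded nor followed by a digit or wildcard", and converts SQL wildcards before handling hyphens.

```python
import re

# A hyphen with no digit, '*' or '?' immediately before or after it.
_HYPHEN = re.compile(r"(?<![\d*?])-(?![\d*?])")
# '%' or '_' not preceded by a backslash (assumed escape character;
# doubled backslashes are not handled in this sketch).
_SQL_WILDCARD = re.compile(r"(?<!\\)[%_]")

def normalize(expression):
    """Convert unescaped SQL wildcards to Lucene wildcards, then
    replace qualifying hyphens with a space (assumed replacement)."""
    expression = _SQL_WILDCARD.sub(
        lambda m: "*" if m.group() == "%" else "?", expression)
    return _HYPHEN.sub(" ", expression)

print(normalize("sophie-a%"))  # sophie a*
print(normalize("5-sp%"))      # 5-sp*
```

Converting the wildcards first means a hyphen next to '%' or '_' is judged against the resulting '*' or '?', so it is left in place.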

        As using the CPQP is quite a bit more complicated than a simple PhraseQuery with multiple terms (that have no wildcards), the CPQP option is only used when the full-text expression does have wildcards.
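The choice between the two paths can be sketched as a simple dispatch. The function names and returned labels below are hypothetical and only mirror the description in this comment; a backslash is again assumed to be the escape character.

```python
import re

# Unescaped '*', '?', '%' or '_' (backslash assumed as the escape).
_WILDCARD = re.compile(r"(?<!\\)[*?%_]")

def choose_strategy(expression):
    """Return 'complex-phrase' when the full-text expression has a
    wildcard, otherwise the simpler 'phrase' path."""
    return "complex-phrase" if _WILDCARD.search(expression) else "phrase"

def term_query_kind(term):
    """Within the complex-phrase path: a term with a leading wildcard
    is handled as a regex query instead of a wildcard query."""
    if term[:1] in ("*", "?"):
        return "regex"
    if _WILDCARD.search(term):
        return "wildcard"
    return "term"
```

For example, `choose_strategy("sophie-anne")` gives `"phrase"`, while `term_query_kind("*-sp*")` gives `"regex"` because the term starts with a wildcard.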

        Randall Hauch
        added a comment -

        Attached a patch file with the fix. All unit and integration tests pass with this change, so it was applied to 'trunk' and the '2.2.x' branch.

        Note: this change bumps the Lucene version from 3.0.0 to 3.0.2. The 3.0.1 and 3.0.2 patch releases add quite a few bug fixes, optimizations, and a few API additions (the 2 backward compatibility issues do not affect ModeShape's usage). This change also updates the documentation.

        Randall Hauch
        added a comment -

        Marking as resolved, as both branches were updated and all unit and integration tests (including the new ones) pass.


          People

          • Assignee: Randall Hauch
          • Reporter: Randall Hauch