ModeShape / MODE-865

Using 'jcr:contains' with a hyphen and wildcard in the full-text expression doesn't always work

    Details

      Description

      This started out as a StackOverflow question: http://stackoverflow.com/questions/3572258/

      When a 'jcr:contains' clause (in either XPath or JCR-SQL2) contains a hyphen, the query works in some situations but not in others. For example, consider content that contains "4-speed" and "5-speed". A query containing this clause works:

      ... jcr:contains(//*,'"5-speed"') ...

      Even these queries that use a wildcard work:

      ... jcr:contains(//*,'"5-sp*"') ...

      or

      ... jcr:contains(//*,'"*-sp*"') ...

      However, consider content that contains "Sophie-Anne" and "Sophie-Allen". This query does work:

      ... jcr:contains(//*,'"sophie-anne"') ...

      while any queries that include the hyphen and a wildcard do not work:

      ... jcr:contains(//*,'"sophie-anne"') ...

      or

      ... jcr:contains(//*,'"sophie-a*"') ...

          Activity

          rhauch Randall Hauch added a comment -

          The problem appears to be in how the Lucene StandardAnalyzer handles hyphens. According to the JavaDoc [1], the StandardAnalyzer uses the StandardTokenizer, which will handle hyphens as follows: the tokenizer "Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split." [2]

          In the above examples, "5-speed" and "4-speed" are not split into separate tokens, while "Sophie-Allen" and "Sophie-Anne" are each split into two tokens. When the query does not contain a wildcard, we transform the 'jcr:contains' full-text expression into a PhraseQuery using the tokenized form of the expression. Because the same tokenizer (and therefore the same split logic) is applied to both the content and the expression, the query works whether or not the full-text search expression contains the hyphen.
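The split rule quoted above can be illustrated with a small stand-alone sketch. This is not Lucene code: `HyphenRuleSketch` and `tokenize` are hypothetical names, and the logic is a deliberately simplified model of the sentence quoted from the JavaDoc (plus the lower-casing that StandardAnalyzer applies), handling one word at a time.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

/** Simplified model of the hyphen rule quoted from the StandardTokenizer JavaDoc; not Lucene code. */
final class HyphenRuleSketch {

    /** Splits one word at hyphens unless it contains a digit; the result is lower-cased. */
    static List<String> tokenize(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        boolean hasDigit = lower.chars().anyMatch(Character::isDigit);
        if (hasDigit) {
            // "Product number" case: the whole token, hyphen included, is kept.
            return Collections.singletonList(lower);
        }
        List<String> tokens = new ArrayList<>();
        for (String part : lower.split("-")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("5-speed"));     // [5-speed]
        System.out.println(tokenize("Sophie-Anne")); // [sophie, anne]
    }
}
```

Under this model the indexed terms for "Sophie-Anne" never contain a hyphen, while the term for "5-speed" does.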

          However, when the full-text search expression does contain a wildcard, we're transforming this expression into a wildcard search (using a RegexQuery), and here we're not properly handling the occurrence of the hyphen. We're currently escaping it so that the hyphen remains in the Term passed into the RegexQuery. When the content contains a number in the token with the hyphen, Lucene does not split it and keeps the hyphen in the token, and thus the queries work. However, when there is no number in the token with the hyphen, the tokenizer removes the hyphen, so the RegexQuery fails to match.
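The failure mode described above can be reproduced without Lucene by matching a wildcard expression against a list of indexed terms. The sketch below is only an illustration: `TermMatchSketch`, `globToRegex`, and `matchesAnyTerm` are hypothetical names, the term lists are hard-coded to what the tokenizer is described as producing, and whole-term matching is assumed as the behavior of a term-level regex query.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Illustrates why a hyphenated wildcard expression can miss terms that were split at the hyphen. */
final class TermMatchSketch {

    /** Converts '*' and '?' into regex equivalents and quotes every other character, the hyphen included. */
    static Pattern globToRegex(String glob) {
        StringBuilder sb = new StringBuilder();
        for (char c : glob.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    /** True if at least one single indexed term matches the whole pattern (assumed semantics). */
    static boolean matchesAnyTerm(String glob, List<String> indexedTerms) {
        Pattern pattern = globToRegex(glob);
        return indexedTerms.stream().anyMatch(term -> pattern.matcher(term).matches());
    }

    public static void main(String[] args) {
        // Terms as described in the comment: tokens with a digit keep their hyphen.
        List<String> speedTerms = Arrays.asList("4-speed", "5-speed");
        // Tokens without a digit were split, so no indexed term contains a hyphen.
        List<String> nameTerms = Arrays.asList("sophie", "anne", "allen");

        System.out.println(matchesAnyTerm("5-sp*", speedTerms));
        System.out.println(matchesAnyTerm("sophie-a*", nameTerms));
    }
}
```

Because the hyphen stays in the pattern, it can only match terms that still contain a hyphen, which is the case for "5-speed" but not for the tokens produced from "Sophie-Anne".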

          [1] http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/analysis/standard/StandardAnalyzer.html
          [2] http://lucene.apache.org/java/3_0_0/api/all/org/apache/lucene/analysis/standard/StandardTokenizer.html

          rhauch Randall Hauch added a comment -

          Unfortunately, handling complex phrase queries with wildcards is not easy because Lucene's PhraseQuery doesn't handle wildcards. However, there is a ComplexPhraseQueryParser (CPQP) in 'lucene-misc' (aka Lucene Contrib) that does appear to work, although some modification is needed. Firstly, the wildcard query support in CPQP uses WildcardQuery even if the term begins with a wildcard, and in such cases this results in an exception. This was solved by overriding a method and using RegexQuery instead.

          Secondly, the tokenizer passed into the CPQP doesn't seem to handle hyphens, so we first replace all hyphens that are neither preceded nor followed by a digit, '*', or '?'. Also, since our 'jcr:contains' supports both the SQL wildcards (e.g., '%' and '_') and the glob or Lucene wildcards (e.g., '*' and '?'), but the CPQP only understands Lucene wildcards, we need to replace all unescaped SQL wildcards with the corresponding Lucene wildcard.
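The two preprocessing steps described above might look roughly like the following. This is a sketch, not the code from the patch: `FullTextNormalizerSketch` and `normalize` are hypothetical names, and three details are assumptions because the comment does not state them: a backslash is treated as the escape character, hyphens are replaced with a single space, and the wildcard conversion runs before the hyphen replacement so that a hyphen next to a converted wildcard is preserved.

```java
import java.util.regex.Pattern;

/** Hypothetical preprocessing of a full-text expression before handing it to a phrase parser. */
final class FullTextNormalizerSketch {

    // SQL wildcards that are not preceded by a backslash (assumed escape character).
    private static final Pattern UNESCAPED_PERCENT = Pattern.compile("(?<!\\\\)%");
    private static final Pattern UNESCAPED_UNDERSCORE = Pattern.compile("(?<!\\\\)_");

    // Hyphens that are neither preceded nor followed by a digit, '*', or '?'.
    private static final Pattern PLAIN_HYPHEN = Pattern.compile("(?<![\\d*?])-(?![\\d*?])");

    static String normalize(String expression) {
        // Step 1: map SQL wildcards to their Lucene equivalents.
        String result = UNESCAPED_PERCENT.matcher(expression).replaceAll("*");
        result = UNESCAPED_UNDERSCORE.matcher(result).replaceAll("?");
        // Step 2: replace the remaining "plain" hyphens with a space (assumed replacement).
        return PLAIN_HYPHEN.matcher(result).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("sophie-a%")); // sophie a*
        System.out.println(normalize("5-sp_"));     // 5-sp?
    }
}
```

With this normalization, "sophie-a%" becomes two words, matching the way the content was tokenized, while "5-sp_" keeps its hyphen because a digit precedes it.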

          As using the CPQP is quite a bit more complicated than a simple PhraseQuery with multiple terms (that have no wildcards), the CPQP option is only used when the full-text expression does have wildcards.
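The choice between the two code paths comes down to detecting whether the expression contains any unescaped wildcard. A minimal sketch of that decision follows; `QueryPathSketch`, `Strategy`, and `choose` are hypothetical names, and a backslash is again assumed to be the escape character.

```java
import java.util.regex.Pattern;

/** Hypothetical selection between the simple phrase path and the complex phrase parser path. */
final class QueryPathSketch {

    enum Strategy { PHRASE_QUERY, COMPLEX_PHRASE_PARSER }

    // Any Lucene ('*', '?') or SQL ('%', '_') wildcard not preceded by a backslash.
    private static final Pattern UNESCAPED_WILDCARD = Pattern.compile("(?<!\\\\)[*?%_]");

    static Strategy choose(String expression) {
        return UNESCAPED_WILDCARD.matcher(expression).find()
                ? Strategy.COMPLEX_PHRASE_PARSER
                : Strategy.PHRASE_QUERY;
    }

    public static void main(String[] args) {
        System.out.println(choose("5-speed"));   // PHRASE_QUERY
        System.out.println(choose("sophie-a*")); // COMPLEX_PHRASE_PARSER
    }
}
```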

          rhauch Randall Hauch added a comment -

          Attached a patch file with the fix. All unit and integration tests pass with this change, so it was applied to 'trunk' and the '2.2.x' branch.

          Note: this change bumps the Lucene version from 3.0.0 to 3.0.2. The 3.0.1 and 3.0.2 patch releases add quite a few bug fixes, optimizations, and a few API additions (the 2 backward compatibility issues do not affect ModeShape's usage). This change also updates the documentation.

          rhauch Randall Hauch added a comment -

          Marking as resolved, as both branches were updated and all unit and integration tests (including the new ones) pass.


            People

            • Assignee:
              rhauch Randall Hauch
              Reporter:
              rhauch Randall Hauch