Details
- Feature Request
- Resolution: Unresolved
- Minor
- High
Description
We should implement our own Unicode normalizer, because the JDK one performs poorly in both speed and memory use, and it does not integrate well with the authentication mechanisms that require normalization.
It would be exposed on CodePointIterator as a few methods:
- decomposeCanonical
- decomposeCompatibility
- composeCanonical
These methods could be chained in various ways to achieve the standard defined normalization types:
- NFD = decomposeCanonical
- NFC = decomposeCanonical + composeCanonical
- NFKD = decomposeCompatibility
- NFKC = decomposeCompatibility + composeCanonical
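For reference, these chaining identities can be checked today against the JDK's `java.text.Normalizer` (the implementation this ticket proposes to replace). The class name below is illustrative only; the point is that canonically composing a decomposed form yields the corresponding composed form:

```java
import java.text.Normalizer;

public class NormalizationIdentities {
    public static void main(String[] args) {
        // "fi" ligature + "e" + combining acute accent
        String s = "\uFB01e\u0301";

        // NFC = canonical decomposition + canonical composition,
        // so re-normalizing the NFD form as NFC equals NFC of the original.
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        System.out.println(Normalizer.normalize(nfd, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(s, Normalizer.Form.NFC))); // true

        // NFKC = compatibility decomposition + canonical composition,
        // so canonically composing the NFKD form equals NFKC.
        String nfkd = Normalizer.normalize(s, Normalizer.Form.NFKD);
        System.out.println(Normalizer.normalize(nfkd, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(s, Normalizer.Form.NFKC))); // true
    }
}
```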
The types and behaviors are defined here: http://www.unicode.org/reports/tr15/tr15-43.html
The implementations should be lazy. If possible, they should be implemented in code rather than data tables, possibly with one class per operation type per Unicode version, so that only the necessary transformations are loaded and initialized. The code could potentially be generated from the Unicode tables and rules by a Maven plugin or annotation processor (see https://github.com/jdeparser/jdeparser2 for one option).
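As a rough illustration of the "lazy, in code" idea (the class and its structure are hypothetical; the real methods would live on CodePointIterator and cover all decomposable characters), a decomposition step can emit code points only on demand. Hangul syllables are a useful case here because the Unicode standard defines their decomposition algorithmically, so no data table is needed at all:

```java
import java.util.ArrayDeque;
import java.util.PrimitiveIterator;

// Hypothetical sketch: a lazily decomposing code point iterator.
// Only the algorithmic Hangul case is handled; a real implementation would
// dispatch other code points to (possibly generated) decomposition routines.
final class LazyDecomposingIterator implements PrimitiveIterator.OfInt {
    private final PrimitiveIterator.OfInt delegate;
    private final ArrayDeque<Integer> pending = new ArrayDeque<>();

    LazyDecomposingIterator(PrimitiveIterator.OfInt delegate) {
        this.delegate = delegate;
    }

    public boolean hasNext() {
        return !pending.isEmpty() || delegate.hasNext();
    }

    public int nextInt() {
        if (pending.isEmpty()) decompose(delegate.nextInt());
        return pending.pollFirst();
    }

    private void decompose(int cp) {
        if (cp >= 0xAC00 && cp <= 0xD7A3) {
            // Hangul syllables decompose algorithmically (Unicode ch. 3.12):
            // SBase=0xAC00, LBase=0x1100, VBase=0x1161, TBase=0x11A7,
            // VCount=21, TCount=28, NCount=VCount*TCount=588.
            int sIndex = cp - 0xAC00;
            pending.add(0x1100 + sIndex / 588);        // leading consonant
            pending.add(0x1161 + (sIndex % 588) / 28); // vowel
            int t = sIndex % 28;
            if (t != 0) pending.add(0x11A7 + t);       // trailing consonant
        } else {
            pending.add(cp); // everything else passed through in this sketch
        }
    }
}
```

Because nothing is decomposed until `nextInt()` is called, a consumer that stops early never pays for the rest of the input, which is the property the ticket asks for.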