The Java Tutorials have been written for JDK 8. Examples and practices described in this page don't take advantage of improvements introduced in later releases and might use technology no longer available.
As of the JDK 7 release, Regular Expression pattern matching has expanded functionality to support Unicode 6.0.
You can match a specific Unicode code point using an escape sequence of the form \uFFFF, where FFFF is the hexadecimal value of the code point you want to match. For example, \u6771 matches the Han character for east.
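As a minimal sketch of the escape form described above, the following class compiles \u6771 as a regular expression and searches for it. The class name UnicodeEscapeDemo and the method containsEast are hypothetical names chosen for illustration. The backslash is doubled in the pattern literal so that the regular expression engine, rather than the Java compiler, interprets the escape.

```java
import java.util.regex.Pattern;

class UnicodeEscapeDemo {
    // The doubled backslash hands the six-character escape to the regex
    // engine instead of letting the compiler translate it first.
    static final Pattern EAST = Pattern.compile("\\u6771");

    // Returns true if the text contains the Han character U+6771.
    static boolean containsEast(String text) {
        return EAST.matcher(text).find();
    }

    public static void main(String[] args) {
        // U+6771 U+4EAC spells "Tokyo" in Han characters.
        System.out.println(containsEast("\u6771\u4EAC")); // true
        System.out.println(containsEast("Tokyo"));        // false
    }
}
```

A single backslash in the pattern literal would also work here, because the compiler would place the character itself in the pattern; the doubled form simply keeps the escape visible to the regular expression engine.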
Alternatively, you can specify a code point using Perl-style hex notation, \x{...}. For example:
String hexPattern = "\\x{" + Integer.toHexString(codePoint) + "}";
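The hex notation is useful for code points above U+FFFF, which do not fit in four hexadecimal digits. The sketch below builds such a pattern at run time; HexNotationDemo, forCodePoint, and contains are hypothetical names introduced for this example, and the two code points are arbitrary choices.

```java
import java.util.regex.Pattern;

class HexNotationDemo {
    // Builds a pattern that matches exactly one code point.
    // The backslash is doubled so the string literal compiles.
    static Pattern forCodePoint(int codePoint) {
        return Pattern.compile("\\x{" + Integer.toHexString(codePoint) + "}");
    }

    // Returns true if the text contains the given code point.
    static boolean contains(String text, int codePoint) {
        return forCodePoint(codePoint).matcher(text).find();
    }

    public static void main(String[] args) {
        int east = 0x6771;   // Han character for east
        int face = 0x1F600;  // a supplementary code point (above U+FFFF)
        String text = new StringBuilder("east: ")
                .appendCodePoint(east)
                .append(", face: ")
                .appendCodePoint(face)
                .toString();

        System.out.println(contains(text, east));          // true
        System.out.println(contains(text, face));          // true
        System.out.println(contains("plain ASCII", east)); // false
    }
}
```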
Each Unicode character, in addition to its value, has certain attributes, or properties. You can match a single character belonging to a particular category with the expression \p{prop}. You can match a single character not belonging to a particular category with the expression \P{prop}.
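To illustrate how \p{prop} and \P{prop} complement each other, here is a small sketch that filters a string by property. The class PropertyFilterDemo and its methods keepOnly and remove are hypothetical helpers, not part of any library; the property names Lu (uppercase letter) and Nd (decimal digit) are used only as sample inputs.

```java
class PropertyFilterDemo {
    // Keeps only the characters that have the property, by deleting
    // every character matched by the negated form \P{prop}.
    static String keepOnly(String text, String prop) {
        return text.replaceAll("\\P{" + prop + "}", "");
    }

    // Removes the characters that have the property, by deleting
    // every character matched by the positive form \p{prop}.
    static String remove(String text, String prop) {
        return text.replaceAll("\\p{" + prop + "}", "");
    }

    public static void main(String[] args) {
        String text = "Hello World 42";
        System.out.println(keepOnly(text, "Lu")); // HW
        System.out.println(remove(text, "Lu"));   // ello orld 42
        System.out.println(keepOnly(text, "Nd")); // 42
    }
}
```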
The three supported property types are scripts, blocks, and a "general" category.
To determine if a code point belongs to a specific script, you can either use the script keyword, or the sc short form, for example, \p{script=Hiragana}. Alternatively, you can prefix the script name with the string Is, such as \p{IsHiragana}.
Valid script names supported by Pattern are those accepted by UnicodeScript.forName.
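The three script forms can be compared side by side. The following sketch, which uses the hypothetical class ScriptDemo and method formsMatching, counts how many of the three spellings match a given code point; the sample code points (a Hiragana letter and a Katakana letter) are assumptions picked for the demonstration.

```java
import java.lang.Character.UnicodeScript;
import java.util.regex.Pattern;

class ScriptDemo {
    // Three equivalent spellings of the same script test.
    static final Pattern[] FORMS = {
        Pattern.compile("\\p{script=Hiragana}"),
        Pattern.compile("\\p{sc=Hiragana}"),
        Pattern.compile("\\p{IsHiragana}")
    };

    // Returns how many of the three forms match the given code point.
    static int formsMatching(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        int count = 0;
        for (Pattern p : FORMS) {
            if (p.matcher(s).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(formsMatching(0x3042)); // HIRAGANA LETTER A: 3
        System.out.println(formsMatching(0x30A2)); // KATAKANA LETTER A: 0
        // The same name is accepted by UnicodeScript.forName.
        System.out.println(UnicodeScript.forName("Hiragana") == UnicodeScript.of(0x3042)); // true
    }
}
```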
A block can be specified using the block keyword, or the blk short form, for example, \p{block=Mongolian}. Alternatively, you can prefix the block name with the string In, such as \p{InMongolian}.
Valid block names supported by Pattern are those accepted by UnicodeBlock.forName.
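Block matching follows the same shape as script matching. In the sketch below, BlockDemo and formsMatching are hypothetical names, and U+1820 (MONGOLIAN LETTER A) is an arbitrary sample from the Mongolian block.

```java
import java.lang.Character.UnicodeBlock;
import java.util.regex.Pattern;

class BlockDemo {
    // Three equivalent spellings of the same block test.
    static final Pattern[] FORMS = {
        Pattern.compile("\\p{block=Mongolian}"),
        Pattern.compile("\\p{blk=Mongolian}"),
        Pattern.compile("\\p{InMongolian}")
    };

    // Returns how many of the three forms match the given code point.
    static int formsMatching(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        int count = 0;
        for (Pattern p : FORMS) {
            if (p.matcher(s).matches()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(formsMatching(0x1820)); // MONGOLIAN LETTER A: 3
        System.out.println(formsMatching(0x0041)); // LATIN CAPITAL LETTER A: 0
        // The same name is accepted by UnicodeBlock.forName.
        System.out.println(UnicodeBlock.forName("Mongolian") == UnicodeBlock.of(0x1820)); // true
    }
}
```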
Categories can be specified with the optional prefix Is. For example, \p{IsL} matches the category of Unicode letters. Categories can also be specified by using the general_category keyword, or the short form gc. For example, an uppercase letter can be matched using \p{general_category=Lu} or \p{gc=Lu}.
Supported categories are those of The Unicode Standard in the version specified by the Character class.
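The category forms can be checked against each other with a short counting sketch. CategoryDemo and count are hypothetical names, and the sample text, which mixes Latin and Greek letters with digits, is an assumption made for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CategoryDemo {
    // Counts the non-overlapping matches of the regex in the text.
    static int count(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // U+00C9 is LATIN CAPITAL LETTER E WITH ACUTE,
        // U+03A9 is GREEK CAPITAL LETTER OMEGA.
        String text = "\u00C9cole 42 \u03A9";

        System.out.println(count("\\p{L}", text));                   // 6
        System.out.println(count("\\p{IsL}", text));                 // 6
        System.out.println(count("\\p{general_category=Lu}", text)); // 2
        System.out.println(count("\\p{gc=Lu}", text));               // 2
    }
}
```

Both uppercase matches are non-ASCII letters, which a character class such as [A-Z] would miss.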