This tool parses a string and shows extended grapheme cluster boundaries (except for Korean jamo and emoji character sequences).

Click on the segments to reveal the character names.

This tool segments text in three different ways:

  • BCv graphemes start with a base character and include all following combining marks. However, if the base character is preceded by a character with the virama or invisible stacker Indic property, it extends the previous grapheme rather than starting a new one. This may produce inaccurate results where a virama is meant to signal the end of a syllable with a visible marker.
  • BC graphemes start with a base character and include all following combining marks. They do not extend the grapheme across viramas or stackers, so conjuncts are split into separate parts.
  • Unicode grapheme clusters are an approximation to user-perceived graphemes, with boundaries established by rules applied to code point sequences according to UAX #29. The rules tend to be biased towards producing the units of text needed for cursor positioning. (A different set of rules establishes break opportunities for line-breaking.) Grapheme clusters may also be tailored for particular languages.
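
The difference between the BC and BCv approaches can be illustrated with a minimal Python sketch. This is not the tool's own code. It assumes that "combining mark" means any character whose general category starts with M, and it approximates the virama and invisible stacker Indic properties with canonical combining class 9, because Python's standard library does not expose the Indic syllabic category. The function names are hypothetical.

```python
import unicodedata

# Assumption: canonical combining class 9 ("Virama") stands in for the
# virama / invisible stacker Indic properties, which unicodedata lacks.
VIRAMA_CCC = 9


def is_mark(ch):
    """True for combining marks (general categories Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def segment(text, join_after_virama=False):
    """Split text into BC segments, or BCv segments if join_after_virama."""
    segments = []
    for ch in text:
        extends = bool(segments) and (
            is_mark(ch)
            or (
                join_after_virama
                and unicodedata.combining(segments[-1][-1]) == VIRAMA_CCC
            )
        )
        if extends:
            # Combining mark, or (BCv only) a character following a virama.
            segments[-1] += ch
        else:
            # Anything else starts a new segment.
            segments.append(ch)
    return segments


# Devanagari KA + VIRAMA + SSA + VOWEL SIGN I
sample = "\u0915\u094D\u0937\u093F"
bc = segment(sample)                            # conjunct split in two
bcv = segment(sample, join_after_virama=True)   # conjunct kept together
```

With this sample, the BC pass ends the first segment at the virama, while the BCv pass lets the following consonant extend it, so the whole conjunct stays in one segment.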

To pass a string in the URL, use one of:

  • ?bcv=<string>
  • ?bc=<string>
  • ?gc=<string>

To indicate in the URL the font you want to use for the display, add &font=<font_name>.
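
As a rough illustration, a URL with these parameters could be assembled as follows. The base address here is a placeholder, not the tool's real location, and the helper name is hypothetical. Percent-encoding the values as UTF-8 is an assumption about what the page accepts.

```python
from urllib.parse import quote, urlencode

# Placeholder only: substitute the actual address of the tool.
BASE_URL = "https://example.org/grapheme-tool/"
MODES = ("bcv", "bc", "gc")


def build_url(mode, text, font=None, base_url=BASE_URL):
    """Build a URL that passes text (and optionally a font name) to the tool."""
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    params = {mode: text}
    if font is not None:
        params["font"] = font
    # quote_via=quote encodes spaces as %20 rather than "+".
    return base_url + "?" + urlencode(params, quote_via=quote)


url = build_url("bcv", "\u0915\u094D\u0937\u093F", font="Noto Sans Devanagari")
```

The text parameter comes first and the font parameter is appended with &, matching the order described above.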

See also the ICU line-break segmenter.

Updated 4 April, 2022

Licence CC-By © r12a