Sinhala

orthography notes

Updated 2 May, 2025

This page brings together basic information about the Sinhala script and its use for the Sinhalese language. It aims to provide a brief, descriptive summary of the modern, printed orthography and typographic features, and to advise how to write Sinhalese using Unicode.

Referencing this document

Richard Ishida, Sinhala Orthography Notes, 02-May-2025, https://r12a.github.io/scripts/sinh/si

Sample


1 වන වගන්තිය සියලු මනුෂ්‍යයෝ නිදහස්ව උපත ලබා ඇත. ගරුත්වයෙන් හා අයිතිවාසිකම්වලින් සමාන වෙති. යුක්ති අයුක්ති පිළිබඳ හැඟීමෙන් හා හෘදය සාක්ෂියෙන් යුත් ඔවුන්, ඔවුනොවුන්ට සැළකිය යුත්තේ සහෝදරත්වය පිළිබඳ හැඟීමෙනි.

2 වන වගන්තිය ජාති, වංශ, වර්ණ, ස්ත්‍රී පුරුෂ භාවය, භාෂාව, ආගම්, දේශපාලන ආදී කවර බේදයක් හෝ සමාජ, ජාතික, දේපළ, උපත ආදී කවර තත්ත්වයක විශේෂයක් හෝ නොමැතිව මේ ප්‍රකාශනයේ සඳහන් සියලු හිමිකම්වලට හා ස්වාධීනත්වයන්ට සෑම පුද්ගලයකුම උරුම වන්නේය. තවද යම් පුද්ගලයකු අයත්වන රටේ දේශපාලන, නීතිමය හෝ ජාත්‍යන්තර තත්ත්වයන් පිළිබඳ කිසිදු විශේෂයක් ද ඒ රටේ ස්වාධීන, භාරකාර, අස්වාධීන ආදී කවර තත්ත්වයක් පිළිබඳ විශේෂයක් ද නොමැතිව මේ හිමිකම් ඔහු සතු වන්නේය.

Source: Unicode UDHR, articles 1 & 2

Usage & history

Origins of the Sinhala script, 3rdC – today.

Phoenician

└ Aramaic

└ Brahmi

└ Sinhala

+ Tamil-Brahmi

+ Gupta

+ Bhattiprolu

+ Kadamba

+ Tocharian

The Sinhala script is used for writing the Sinhala language, spoken by approximately 16 million people in Sri Lanka.

සිංහල අක්ෂර මාලාව Siṁhala Akṣara Mālāva Sinhalese alphabet

The alphabet is a descendant of the ancient Indian Brahmi script and is closely related to the South Indian Grantha script and Kadamba alphabet.

More information: Scriptsource and Wikipedia.

Basic features

The Sinhala script is an abugida, ie. consonants carry an inherent vowel sound that is overridden using vowel signs. See the table to the right for a brief overview of features for the modern Sinhala orthography.

Sinhala is a diglossic language, that is, the spoken and written forms of the language show considerable variation.

Words are separated by spaces, and text runs horizontally, from left to right.

Sinhalese is also often considered two alphabets, or an alphabet within an alphabet, due to the presence of two sets of letters within the Unicode block. The core set, known as the śuddha siṃhala (pure Sinhalese, ශුද්ධ සිංහල) or eḷu hōḍiya (Eḷu alphabet එළු හෝඩිය), can represent all native phonemes, and is taught in schools. In order to accurately transcribe Sanskrit, Pali, Hindi and English loanwords, an extended set, the miśra siṃhala (mixed Sinhalese, මිශ්‍ර සිංහල), is available.


The eḷu hōḍiya system contains consonant and vowel letters and can be used to represent the sounds of the spoken language almost perfectly. The miśra hōḍiya set contains additional consonant letters, many of which are aspirated equivalents of existing letters (but which are pronounced in the same way as the unaspirated ones).

Unusually for Indic scripts, there is a set of prenasalised consonants, and there is also an extra æ vowel.

The virama is usually visible in consonant clusters, like in Tamil. However, it is also possible to render clusters using conjunct forms (ligatures or reduced glyphs), especially for clusters involving r or j. A zero width joiner is used after the virama to signal the intention for that. Putting the ZWJ before the virama produces another form of conjunct, where adjacent consonants touch each other, but this is not used for modern Sinhalese.

One particular affix, යි yi, is pronounced j and treated as a final consonant.

Onset consonant clusters are limited in number. Syllable-final consonants can be written using ordinary consonants or one of 2 combining marks.


This orthography is an abugida with an inherent vowel, that is pronounced a in stressed syllables and ə in unstressed. Other post-consonant vowels and 2 diphthongs are written using vowel signs, all combining marks.

There are 2 pre-base vowel signs and 4 circumgraphs. In principle there are no multipart vowels; however, several vowel signs decompose to more than one character.

Standalone vowel sounds are written using independent vowel letters.

Sinhala also has vocalic letters and combining marks, but only one pair is in regular use.

A set of Sinhala digits exists, but modern Sinhala uses ASCII numbers.

Notable features

Notable features of the Sinhala orthography include:

  1. a double tier system of letters, one for native and one for Sanskrit/Pali words
  2. consonant clusters normally show a visible virama, and conjuncts are formed using ZWJ
  3. older texts use touching consonants to indicate consonant clusters
  4. a set of pre-nasalised consonant letters
  5. additional æ and æː vowels
  6. significant variation in shaping of dependent vowels
  7. two alternative shapes for the visible virama

Character index

Letters


Basic consonants

ක␣ග␣ඟ␣ච␣ජ␣ට␣ඩ␣ණ␣ඬ␣ත␣ද␣න␣ඳ␣ප␣බ␣ම␣ඹ␣ය␣ර␣ල␣ව␣ස␣හ␣ළ

Extended consonants

ඛ␣ඝ␣ඞ␣ඡ␣ඣ␣ඤ␣ඥ␣ඨ␣ඪ␣ථ␣ධ␣ඵ␣භ␣ශ␣ෂ␣ෆ

Vowels

ඉ␣ඊ␣උ␣ඌ␣එ␣ඒ␣ඔ␣ඕ␣ඇ␣ඈ␣අ␣ආ␣ඓ␣ඖ

Vocalics

Not used for modern Sinhala

ඤ␣ඦ␣ඎ␣ඏ␣ඐ

Combining marks


Vowels

ි␣ී␣ු␣ූ␣ෙ␣ේ␣ො␣ෝ␣ැ␣ෑ␣ා␣ෛ␣ෞ

Vocalics

Bindu

Virama

Visarga

Not used for modern Sinhala

ෲ␣ෟ␣ෳ

Numbers

෦␣෧␣෨␣෩␣෪␣෫␣෬␣෭␣෮␣෯␣𑇡␣𑇢␣𑇣␣𑇤␣𑇥␣𑇦␣𑇧␣𑇨␣𑇩␣𑇪␣𑇫␣𑇬␣𑇭␣𑇮␣𑇯␣𑇰␣𑇱␣𑇲␣𑇳␣𑇴

Punctuation

‘␣’␣“␣”

ASCII

(␣)␣,␣.␣:␣;␣?␣!

Not used for modern Sinhala

Other

‌␣‍

Phonology

These are sounds of the Sinhala language.


Phones in a lighter colour are non-native or allophones. Source Wikipedia.

Vowel sounds

i u e o ə əː ə əː æ æː ɐ a

əː is restricted to English loans.

a and ə are allophones in Sinhala: as inherent vowels, a occurs in stressed syllables and ə in unstressed syllables. [wl,#Phonology]

Consonant sounds

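               | labial | alveolar | post-alveolar  | retroflex | palatal | velar | glottal
stops          | p b    | t d      |                | ʈ ɖ       |         | k ɡ   |
pre-nasalised  | ᵐb     | ⁿd       |                | ᶯɖ        |         | ᵑɡ    |
affricates     |        |          | t͡ʃ d͡ʒ, t͡ɕ d͡ʑ |           |         |       |
fricatives     | f, ɸ   | s        | ʃ, ɕ           |           |         |       | h
nasals         | m      | n        |                |           |         | ŋ     |
approximants   | ʋ      | l        |                |           | j       |       |
trills/flaps   |        | r        |                |           |         |       |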

Tone

Sinhala is not a tonal language.

Structure

tbd

Vowels

Vowel summary table

This table summarises basic vowel to character assignments.

ⓘ represents the inherent vowel. Diacritics are added to the vowels to indicate nasalisation (not shown here).


  post-consonant standalone
Plain:
ි␣ී␣ු␣ූ
ඉ␣ඊ␣උ␣ඌ
ෙ␣ේ␣ො␣ෝ
එ␣ඒ␣ඔ␣ඕ
ⓘ␣ැ␣ෑ␣ා
ඇ␣ඈ␣අ␣ආ
Diphthongs:
ෛ␣ෞ
ඓ␣ඖ
Vocalics:

For additional details see vowel_mappings.

Inherent vowel

ක ka: U+0D9A SINHALA LETTER ALPAPRAANA KAYANNA

The inherent vowel is typically transcribed as a, and pronounced a in stressed syllables, and otherwise ə. [wl,#Phonology] So ka is written by simply using the consonant letter. The following example shows inherent vowels.

ටකනවා

ට␣ක␣න␣ව␣ා

Inherent vowel suppression

U+0DCA SINHALA SIGN AL-LAKUNA (the virama, or vowel killer) is attached to a consonant to indicate that the inherent vowel is not pronounced. It has 2 different shapes, depending on which base consonant it is attached to.

ක්   ඛ්
The two different shapes of AL-LAKUNA. Combined with shuddha k on the left, and mishra k on the right.

Consonants without a following vowel typically occur at the end of a word, or as part of a consonant cluster or geminate (see clusters), and a vowel killer is typically used whenever the inherent vowel is absent, eg. අලුත් ඇතැම් කන්ද

Post-consonant vowels

Post-consonant vowels and 2 diphthongs are written using 13 vowel signs, all combining marks. There are 2 pre-base vowel signs and 4 circumgraphs. In principle there are no multipart vowels; however, several vowel signs decompose to more than one character.

Nine vowel signs are spacing marks, meaning that they consume horizontal space when added to a base consonant.
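The count of spacing marks can be checked against the Unicode character database. The following is a minimal sketch using Python's standard unicodedata module. It assumes that "spacing mark" corresponds to General_Category Mc, and the variable names are only illustrative.

```python
import unicodedata

# The 13 vowel signs covered in this section, written as code points.
VOWEL_SIGNS = [
    "\u0DD2", "\u0DD3", "\u0DD4", "\u0DD6",  # i, iː, u, uː
    "\u0DD9", "\u0DDA", "\u0DDC", "\u0DDD",  # e, eː, o, oː
    "\u0DD0", "\u0DD1", "\u0DCF",            # æ, æː, aː
    "\u0DDB", "\u0DDE",                      # the two diphthongs
]

# Assumption: "spacing mark" means General_Category Mc (Spacing_Mark);
# the remaining vowel signs are Mn (Nonspacing_Mark).
spacing = [c for c in VOWEL_SIGNS if unicodedata.category(c) == "Mc"]
nonspacing = [c for c in VOWEL_SIGNS if unicodedata.category(c) == "Mn"]

print(len(spacing), "spacing;", len(nonspacing), "non-spacing")
```

The non-spacing signs are the ones drawn above or below the base consonant, which is why they do not add to the advance width.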

All vowel signs are stored after the base consonant, and the rendering process puts them in the correct place for display. Conjuncts are treated as indivisible units when it comes to rendering vowel signs, meaning that pre-base vowel signs and left-side glyphs of circumgraphs are rendered before the conjunct as a whole (see prebase).

The shapes of vowel signs can vary significantly, depending on what they combine with. For details, see context.

Plain vowels

කී U+0D9A SINHALA LETTER ALPAPRAANA KAYANNA + U+0DD3 SINHALA VOWEL SIGN DIGA IS-PILLA

Sinhala uses the following dedicated combining marks for vowels.

ි␣ී␣ු␣ූ␣ෙ␣ේ␣ො␣ෝ␣ැ␣ෑ␣ා

The vowel letters of Sinhala are divided into a core set and an extended set. The core (ʃuddʰa) alphabet covers the sounds of modern spoken Sinhala. The extended (miʃra) letters and vocalics are used for writing Sanskrit, Pali, and Tamil words. These are the ʃuddʰa vowel signs.

It is worth noting that Sinhala has vowel signs for the sounds æ and æː, which is unusual for the major scripts in this region, eg.

බැහැර

ගෑනී

Diphthongs

කී U+0D9A SINHALA LETTER ALPAPRAANA KAYANNA + U+0DD3 SINHALA VOWEL SIGN DIGA IS-PILLA

Sinhala uses the following dedicated combining marks for diphthongs.

ෛ␣ෞ

These are extended (miʃra) letters which, with vocalics, are used for writing Sanskrit, Pali, and Tamil words.

Examples:

තෛලය

ගෞරවය

Pre-base vowel signs

ෙ␣ ␣ෛ

Two vowel signs appear to the left of the base consonant letter or cluster when rendered. The first of these is a core letter, the second an extended letter, eg.

බෙක

වෛද්‍ය

These are combining marks that are always typed and stored after the base consonant(s), ie. the codepoints follow the order in which the items are pronounced. The rendering process places the glyph before the base consonant without changing the code points. The following shows the sequence of code points that make up the first word just above.

බ␣ෙ␣ක
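To confirm the stored order of a word, it can help to list its code points. This is a small sketch using Python's standard unicodedata module; the helper name describe is hypothetical.

```python
import unicodedata

def describe(text):
    """Return (code point, character name) pairs in storage order."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# The example word: BA, then the pre-base vowel sign, then KA.
word = "\u0DB6\u0DD9\u0D9A"
for code_point, name in describe(word):
    print(code_point, name)
```

The vowel sign is listed second, after the consonant it follows in pronunciation, even though its glyph is drawn first.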

Because modern Sinhala usually indicates consonant clusters with a visible virama, pre-base vowel signs normally appear before the consonant that immediately precedes them in pronunciation (see fig_prebase).

එක්වෙනවා

එ␣ක␣්␣ව␣ෙ␣න␣ව␣ා

However, when the consonant cluster is rendered as a conjunct, the vowel sign is actually rendered before the start of the conjunct, ie. the sequence of glyphs for the orthographic syllable is rendered VCC, whereas the pronunciation is CCV.

පළවෙනි
A prebase vowel, rendered to the left of the consonant after which it is pronounced.

Circumgraphs

ේ␣ො␣ෝ␣ ␣ෞ

Four vowels are produced by a single combining character with visually separate parts that appear on different (mostly opposite) sides of the consonant onset. These are all core letters, except for the diphthong. Examples:

මේද

දොළ

නෝනා

ගෞරවය

Like pre-base glyphs, these are combining marks that are always typed and stored after the base consonant or consonant cluster. The rendering process places the glyphs around the base consonant(s), as needed, eg.

කොර

ක␣ො␣ර

Again, like pre-base vowel signs, in Sinhala consonant clusters the circumgraph normally surrounds only the consonant that phonetically precedes it. In the cases where it is pronounced after a cluster that is rendered as a conjunct, it surrounds the whole conjunct.

ප්රේත

ප␣්␣ර␣ේ␣ත

ප්‍රේමය

ප␣්␣‍␣ර␣ේ␣ම␣ය
ලෝකය
A circumgraph vowel: a single code point with glyphs on both sides of the consonant after which it is pronounced.

All of these circumgraphs can be written as a single code point, or as multiple code points. See encoding.

Multipart vowels

Multipart vowels only occur in Sinhalese decomposed text. Usually there are no multipart vowels.

The following are the vowel signs that decompose in NFD and recompose under NFC, shown here as decomposed sequences.

ේ␣ො␣ෝ␣ෞ

Vowel sign placement

The following list shows where vowel signs are positioned around a base consonant to produce vowels, and how many instances of that pattern there are.

  • 2 pre-base, eg. කෙ ke
  • 3 post-base, eg. කැ
  • 2 superscript, eg. කි ki
  • 2 subscript, eg. කු ku
  • 3 pre+post-base, eg. කො ko
  • 1 pre+superscript, eg. කේ

Standalone vowels

Sinhala represents standalone vowels using a set of independent vowel letters. The set includes a character to represent the inherent vowel sound.

The core (ʃuddʰa) alphabet includes the following.

ඉ␣ඊ␣උ␣ඌ␣එ␣ඒ␣ඔ␣ඕ␣ඇ␣ඈ␣අ␣ආ

The extended (miʃra) letters are as follows, but see also vocalics:

ඓ␣ඖ

Examples:

අකුර

ඉරු

ඊයේ

එකතු

The pronunciations of and vary, but in a fairly predictable way. The former is a in the first syllable, except for a few words, and before double consonants or clusters, and ə word finally and before single consonants. The latter represents everywhere except word-finally, where it may be a, depending on the word structure. Similar length rules apply to e and o in final position.

Vowel sounds to characters

This section maps Sinhala vowel sounds to common graphemes in the Sinhala orthography.

Graphemes are labelled as either dependent (post-consonant) or standalone.

Plain vowels

i

dependent

standalone

dependent

standalone

u

dependent

standalone

dependent

standalone

e

dependent

standalone

dependent

standalone

o

dependent

standalone

dependent

standalone

ə

inherent vowel eg. අගය

æ

dependent

standalone

æː

dependent

standalone

a

standalone

dependent

standalone

Complex vowels

ɑj

dependent

standalone

ɑw

dependent

standalone

Vocalics

These are all classed as extended (miʃra) letters. Most are no longer in contemporary use.

ඍ␣ෘ

Example:

කෘෂ්ණ

ඎ␣ඏ␣ඐ␣ෲ␣ෟ␣ෳ

Consonants

Consonant summary table

This table summarises basic consonant to character assignments.

               | ʃuddʰa | miʃra | finals
Stops          | ප␣බ␣ත␣ද␣ට␣ඩ␣ක␣ග | ඵ␣භ␣ථ␣ධ␣ඨ␣ඪ␣ඛ␣ඝ |
Pre-nasalised  | ඹ␣ඳ␣ඬ␣ඟ | |
Affricates     | ච␣ජ | ඡ␣ඣ |
Fricatives     | ස␣හ | ෆ␣ශ␣ෂ |
Nasals         | ම␣න␣ණ | ඥ␣ඞ |
Other          | ව␣ර␣ල␣ළ␣ය | |

For additional details see Consonant sounds to characters.

Basic (ʃuddʰa) consonants

The core set, or ʃuddʰa hōɖiya, is based on the classical grammar of the middle ages (called එළු හෝඩිය ẹɭu hōɖiya) and contains the following consonants. They are highly phonetic.

Click on each letter for examples of usage.

ප␣බ␣ත␣ද␣ට␣ඩ␣ක␣ග␣ච␣ජ␣ස␣හ␣ම␣ණ␣ව␣ර␣ල␣ළ␣ය

Some argue that doesn't belong in this group for academic reasons, but it is certainly one of the basic sounds of modern Sinhala.

Prenasalised consonants

ඹ␣ඳ␣ඬ␣ඟ

A peculiarity of Sinhalese among Indic scripts is the inclusion of prenasalised consonants, representing a nasal sound followed by a stop. The orthography distinguishes these graphemes from the more straightforward nasal consonant followed by a stop. For example, compare අණ්ඩ අඬ

The prenasalised shapes are formed from a combination of the shapes of the participating characters.

The Sinhala block includes another, archaic pre-nasalised consonant, ඦ ᶮd͡ʒ, which is only attested in a few words.

miʃra & other consonants

The full set of consonants includes the additional consonants in this section, known as miʃra hōɖiya (mixed alphabet).

ඵ␣භ␣ථ␣ධ␣ඨ␣ඪ␣ඛ␣ඝ␣ඡ␣ඣ␣ඥ␣ෆ␣ශ␣ෂ␣න␣ඤ␣ඞ

The miʃra stops are mapped to aspirated consonants in Sanskrit and Pali, but they are pronounced in modern Sinhala in the same way as the unaspirated ʃuddʰa ones.

This list includes a new character for f, ෆ. Sometimes, instead, a character is used that combines the Latin letter 'f' with the Sinhalese d. That letter doesn't appear to be encoded in Unicode.

ඥ is an atomic character representing the conjunct JA + NYA.

Onsets

Clusters of consonant letters at the beginning of an orthographic syllable occur in Sinhala, and they are handled as described in the section clusters.

Finals

Syllable codas are written using ordinary characters, on the whole, followed by a virama or a conjunct sequence. The -r coda has a slight difference, and there are 2 combining marks.

RA coda

Mechanically, the -r syllable coda is handled in the same way as other codas, but the glyph position in a conjunct is slightly different. It is represented by a special glyph above the full-sized glyph for the following consonant. For example:

අර්‍තාපල්

අ␣ර␣්␣‍␣ත␣ා␣ප␣ල␣්

Combining marks

Two combining characters are used to represent syllable-final consonant sounds.

ං␣ඃ

U+0D82 SINHALA SIGN ANUSVARAYA (ං) usually represents the sound ŋ, eg. සිංහල

U+0D83 SINHALA SIGN VISARGAYA (ඃ) is also in the repertoire, but it is not clear how it is used in Sinhala.

Either of these 'semi-consonants' must be used after a vowel or after a consonant+vowel (including the inherent vowel), and must be the last combining character in the syllable.

Consonant clusters

Consonant cluster handling is a little unusual in Sinhala, compared to other Indic scripts.

There are 3 ways of managing consonant clusters. Modern Sinhala uses only the first two alternatives.

  1. Visible virama : Show 0DCA over the first character in the cluster. Unlike Devanagari, this is the default for Sinhala.
  2. Conjunct forms : Use a reduced or ligated form, especially for r or y. Since the approach changes the shape of the constituent components, the cluster is referred to as a conjunct.
  3. Touching consonants : Make the consonants touch (not used in modern Sinhala).

See also finals.

See a table of 2-consonant clusters.
The table allows you to test results for various fonts.

Using a visible virama

The virama indicates that a consonant has no vowel (see novowel). The shape of the virama can take two forms, depending on the base character it is appended to: with k you get ක්; with kh you get ඛ්, eg.

ලක්ෂය

ල␣ක␣්␣ෂ␣ය

අම්මා

අ␣ම␣්␣ම␣ා

If a pre-base vowel sign is added after the last consonant in a cluster, it will appear immediately to the left of that consonant, rather than before the first consonant in the cluster, eg. see how the vowel in fig_kko cuts between the two consonants in the cluster.

The pre-base part of the vowel (highlighted) appears immediately before the consonant after which it is pronounced, rather than at the beginning of the consonant cluster.

Conjuncts

The combination U+0DCA U+200D (al-lakuna followed by ZWJ) causes the font to hide the virama glyph and form a conjunct, eg.

ව්‍යාඝ්‍රයා

ව␣්␣‍␣ය␣ා␣ඝ␣්␣‍␣ර␣ය␣ා

This approach is principally used when combining r or y with another consonant (both before and after, in the case of r), and produces a reduced or ligated form.

ර්‍ක ක්‍ර ක්‍ය
Common conjuncts in Sinhala.

When a u or uː vowel appears below a conjoined conjunct, it is placed below the final consonant, eg.

ක්‍යු kju

Although the use of the conjunct with r is required in normal Sinhalese text, it is possible to omit it: both of the following are valid ways to write karma. [s]

කර්ම

ක␣ර␣්␣ම

කර්‍ම

ක␣ර␣්␣‍␣ම

Wikipedia lists several more conjuncts, some of which are reproduced below. The availability of these conjuncts is font dependent, eg. ඳ්‍ව ⁿd͓₊v doesn't ligate using the default font of this page, but may with another.

ක්‍ව␣ක්‍ෂ␣ත්‍ථ␣ත්‍ව␣න්‍ද␣න්‍ධ␣න්‍ද්‍ර

Touching consonants

The third approach is used in ancient scriptures but is not used in modern Sinhala. [ws] It hides the virama and moves the consonants alongside each other, so that they are touching, eg. මම becomes ම‍්ම mm

For this, put the ZWJ first, ie. U+200D U+0DCA.
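The three cluster encodings described in this section differ only in where the ZWJ sits relative to the virama. The following sketch builds each sequence in Python; the function names are hypothetical, and whether a conjunct or touching form actually appears depends on the font.

```python
AL_LAKUNA = "\u0DCA"  # SINHALA SIGN AL-LAKUNA (virama)
ZWJ = "\u200D"        # ZERO WIDTH JOINER

def visible_virama(first, second):
    """Default modern form: the virama is shown on the first consonant."""
    return first + AL_LAKUNA + second

def conjunct(first, second):
    """Request a conjunct form: virama followed by ZWJ."""
    return first + AL_LAKUNA + ZWJ + second

def touching(first, second):
    """Request touching consonants (pre-modern texts): ZWJ followed by virama."""
    return first + ZWJ + AL_LAKUNA + second

KA, RA, MA = "\u0D9A", "\u0DBB", "\u0DB8"
print(visible_virama(KA, RA))  # k + r with a visible virama
print(conjunct(KA, RA))        # k + r as a conjunct, if the font supports it
print(touching(MA, MA))        # m + m as touching consonants, if the font supports it
```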

Consonant length

Gemination and consonant lengthening are handled using the normal approach to consonant clusters (see clusters), eg.

අවුරුද්‍ද

කුරුල්‍ලා

Consonant sounds to characters

This section maps Sinhala consonant sounds to common graphemes in the Sinhala orthography.

Graphemes are labelled as either śuddha or miśra consonants.

p

śuddha

miśra

b

śuddha

miśra

ᵐb

śuddha

t

śuddha

miśra

t͡ʃ

śuddha

miśra

d

śuddha

miśra

ⁿd

śuddha

d͡ʒ

śuddha

miśra

ᶮd͡ʒ

miśra

ʈ

śuddha

miśra

ɖ

śuddha

miśra

ⁿɖ

śuddha

k

śuddha

miśra

g

śuddha

miśra

ᵑɡ

śuddha

f

miśra

s

śuddha

ʃ

miśra

miśra

ɦ

śuddha

m

śuddha

coda Coda.

n

śuddha

miśra

ɲ

miśra

miśra

ŋ

miśra

coda Coda.

ʋ

śuddha

r

śuddha

ri

miśra

miśra

ru

miśra

miśra

l

śuddha

ɭ

śuddha

li

miśra

j

śuddha

Encoding choices

Canonical equivalence

All of these circumgraphs can be written as a single code point, or as multiple code points.

  1. 0DDA
    0DD9 0DCA
  2. 0DDC
    0DD9 0DCF
  3. 0DDD
    0DD9 0DCF 0DCA
  4. 0DDE
    0DD9 0DDF

The single code point per vowel sign is the form preferred by the Sinhala encoding standards and the form in common use for Sinhala. The parts are separated, however, in Unicode when normalised using Normalisation Form D (NFD). If Normalisation Form C (NFC) is applied, they recompose.

Whichever approach is used, the vowel signs must be typed and stored after the consonant characters they surround. In the case of decomposed vowel signs, the order is also important and must be as shown above.
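The equivalences listed above can be checked with Python's standard unicodedata module. This is a minimal sketch; the helper same_text is a hypothetical name, and comparing after NFC is just one reasonable way to treat the two spellings as equal.

```python
import unicodedata

# Precomposed circumgraph vowel signs and the sequences listed above.
EQUIVALENTS = {
    "\u0DDA": "\u0DD9\u0DCA",
    "\u0DDC": "\u0DD9\u0DCF",
    "\u0DDD": "\u0DD9\u0DCF\u0DCA",
    "\u0DDE": "\u0DD9\u0DDF",
}

for composed, decomposed in EQUIVALENTS.items():
    # NFD splits the single code point into its parts ...
    assert unicodedata.normalize("NFD", composed) == decomposed
    # ... and NFC puts them back together.
    assert unicodedata.normalize("NFC", decomposed) == composed

def same_text(a, b):
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

KA = "\u0D9A"
print(same_text(KA + "\u0DDD", KA + "\u0DD9\u0DCF\u0DCA"))
```

Normalising before comparison or search avoids missing matches when some text has been stored in decomposed form.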

Non-equivalences

It is possible to visually analyse the atomic letters below as being composed of sub-parts, but the Unicode Standard strongly recommends that they should each be written with one of the single code points listed here. Those code points do not decompose in NFD.

Use Do not use!
0D86 0D85 0DCF
0D87 0D85 0DD0
0D88 0D85 0DD1
0D8C 0D8B 0DDF
0D8E 0D8D 0DD8
0D90 0D8F 0DDF
0D92 0D91 0DCA
0D93 0D91 0DD9
0D96 0D94 0DDF
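Because normalisation does not repair these sequences, they have to be found by explicit matching. The sketch below scans for the discouraged pairs from the table above; the function name find_discouraged is hypothetical, and the code only reports matches rather than rewriting the text.

```python
import unicodedata

# Pairs from the table above: discouraged sequence -> recommended letter.
DISCOURAGED = {
    "\u0D85\u0DCF": "\u0D86",
    "\u0D85\u0DD0": "\u0D87",
    "\u0D85\u0DD1": "\u0D88",
    "\u0D8B\u0DDF": "\u0D8C",
    "\u0D8D\u0DD8": "\u0D8E",
    "\u0D8F\u0DDF": "\u0D90",
    "\u0D91\u0DCA": "\u0D92",
    "\u0D91\u0DD9": "\u0D93",
    "\u0D94\u0DDF": "\u0D96",
}

def find_discouraged(text):
    """Return (index, sequence, recommended letter) for each discouraged pair."""
    hits = []
    for i in range(len(text) - 1):
        pair = text[i:i + 2]
        if pair in DISCOURAGED:
            hits.append((i, pair, DISCOURAGED[pair]))
    return hits

sample = "\u0D85\u0DCF"
# NFC leaves the sequence unchanged: the two spellings are not canonically equivalent.
assert unicodedata.normalize("NFC", sample) == sample
print(find_discouraged(sample))
```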

Numbers, dates, currency, etc.

Sinhala uses European digits.

There is, however, a set of native digits, that were used into the 20th century, but mostly associated with horoscopes. The shapes of some of these are identical to characters used for other purposes.

෦␣෧␣෨␣෩␣෪␣෫␣෬␣෭␣෮␣෯

There is also another, older set that was used in an archaic number system, called Sinhala Illakkam, prior to 1815. These are all in the Sinhala Archaic Numbers block.

𑇡␣𑇢␣𑇣␣𑇤␣𑇥␣𑇦␣𑇧␣𑇨␣𑇩␣𑇪␣𑇫␣𑇬␣𑇭␣𑇮␣𑇯␣𑇰␣𑇱␣𑇲␣𑇳␣𑇴
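Where older text containing the native decimal digits needs to be processed, the digits can be mapped to ASCII by code point offset. This is a minimal sketch; the function name lith_to_ascii is hypothetical, and it covers only the ten digits U+0DE6..U+0DEF, not the archaic numbers, which are not a positional decimal system.

```python
LITH_ZERO = 0x0DE6  # SINHALA LITH DIGIT ZERO

def lith_to_ascii(text):
    """Replace Sinhala Lith digits (U+0DE6..U+0DEF) with ASCII digits."""
    out = []
    for ch in text:
        offset = ord(ch) - LITH_ZERO
        # Characters outside the digit range are passed through unchanged.
        out.append(str(offset) if 0 <= offset <= 9 else ch)
    return "".join(out)

print(lith_to_ascii("\u0DE7\u0DEF\u0DEB\u0DEA"))
```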

Text direction

Sinhala text runs left to right in horizontal lines.


Glyph shaping & positioning

You can experiment with examples using the Sinhala character app.

Context-based shaping & positioning

A significant amount of shaping and positioning of glyphs is needed for rendering Sinhala text. Listed here are just a few examples.

Vowel shaping

Similarly to the Tamil script, the u and ū vowels assume various different shapes and connection points, depending on what consonant they follow.

See a table of all consonants and all vowel signs.
The table allows you to test results for various fonts.

          | -u | -uː
k         | කු | කූ
p         | පු | පූ
r         | රු | රූ
ɭ         | ළු | ළූ
kr (ක්‍ර) | ක්‍රු | ක්‍රූ
Shape variants for the u and uː vowels.

Other special shaping approaches are also required, such as the following.

ප␣ි␣පි
ර␣ි␣රි
ඬ␣ි␣ඬි
Differently shaped i in pi, ri and ⁿɖi.
ර␣ැ␣රැ
ර␣ෑ␣රෑ
Shape variants for the æ and æː vowels.

Vowel signs may appear above, below, to the right, to the left, or on both sides of the base consonant.

ක කි කු කැ කෙ කො

Position of vowel signs for the sequence ka ki ku kæ ke ko.

Vowel signs are positioned around a conjunct, rather than around a specific consonant. So a part of a vowel sign that appears to the left of its base will appear to the left of a conjunct.

ක්ව␣ො␣ක්වො
ක්‍ව␣ො␣ක්‍වො
A circumgraph vowel sign following a regular consonant cluster and following a conjunct form.

Consonant cluster shaping

Shaping is also required for rendering consonant clusters. Various special forms are involved, from just displaying the virama to creating conjuncts (see also clusters). Conjunct ligations are generally expected for r and y, and other conjuncts depend on font availability. Generally, a conjunct is formed by reducing the non-final consonant shapes. The following is just a sample.

ක␣්␣ක්
ඛ␣්␣ඛ්
Two different versions of hal kirīma.
ක␣්␣‍␣ව␣ක්‍ව
ර␣්␣‍␣ක␣ර්‍ක
Conjoined and stacked conjuncts.

Explicit shaping controls

200D (ZWJ) is used to produce conjuncts (see clusters).

Typographic units

Word boundaries

Words are separated by spaces.

Graphemes

This section is still undergoing research and development.

Grapheme clusters can be used much of the time to segment Sinhala words, because the virama is displayed without causing a conjunct. However, there are conjuncts in Sinhala, and these should not be split apart by edit operations that visually change the text (such as letter-spacing, first-letter highlighting, and in-word line breaking). For those operations one needs to segment the text using orthographic syllables, which string grapheme clusters together with 0DCA 200D, where the al-lakuna has an Indic Syllabic Category of Virama.

The fact that modern Sinhala only combines grapheme clusters if a virama is accompanied by a ZWJ makes it much easier to manage situations where the virama should be displayed and end a typographic unit, and situations where it should become invisible and form a conjunct.

Grapheme clusters

Base Combining_mark*

Combining marks may include zero or more of the following types of character.

  1. Dependent_Vowel [13] (see combiningV)
  2. Final consonant marks [2] (see finals)
  3. a visible Virama (see novowel) (called al-lakuna).

Any of the above may occur after a consonant base. Independent vowel bases usually only have final consonant marks. There is usually only one vowel sign per base consonant, but there can be 2 in decomposed text.

A virama only occurs alone after a consonant base and indicates a syllable coda or a vowelless consonant in a cluster. Because a virama used alone is a visible vowel-killer and doesn't create conjuncts, it can be treated as just another combining mark and segmentation can break after it.

The following examples show a variety of grapheme clusters:

අදිනවා
පුංචි
අලුත්
කන්ද

Larger typographic units

(Consonant Al_lakuna ZWJ)* Grapheme_cluster

Editorial operations that change the visual appearance of the text, such as letter-spacing, first-letter highlighting, in-word line-breaking, and justification, should never split conjunct forms apart. For this reason, an alternative way of segmenting graphemes is needed. This may not apply, however, for some other operations such as cursor movement or backwards delete.

Where conjuncts appear, a typographic unit contains multiple grapheme clusters. The non-final grapheme clusters all end with the sequence 0DCA 200D, and the final grapheme cluster begins with a consonant.
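The pattern above can be sketched in Python as a two-step process: first split the text into approximate grapheme clusters, then join any cluster that ends in al-lakuna + ZWJ to the cluster that follows. This is a simplified illustration, not a full implementation of Unicode text segmentation: the cluster step assumes that only combining marks (categories Mn and Mc), ZWJ, and ZWNJ extend a cluster, and both function names are hypothetical.

```python
import unicodedata

AL_LAKUNA = "\u0DCA"
ZWJ = "\u200D"
ZWNJ = "\u200C"

def simple_clusters(text):
    """Approximate grapheme clusters: a base plus following marks and joiners."""
    clusters = []
    for ch in text:
        extends = unicodedata.category(ch) in ("Mn", "Mc") or ch in (ZWJ, ZWNJ)
        if clusters and extends:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def typographic_units(text):
    """Join clusters that end in al-lakuna + ZWJ to the following cluster."""
    units = []
    for cluster in simple_clusters(text):
        if units and units[-1].endswith(AL_LAKUNA + ZWJ):
            units[-1] += cluster
        else:
            units.append(cluster)
    return units

# I + anusvara, GA + al-lakuna + ZWJ + RA + II, SA + I
word = "\u0D89\u0D82\u0D9C\u0DCA\u200D\u0DBB\u0DD3\u0DC3\u0DD2"
print(len(simple_clusters(word)), len(typographic_units(word)))
```

A sequence with a virama but no ZWJ is left as separate units, which matches the behaviour described for the visible virama.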

The following are examples.

ඉංග්‍රීසි
චර්‍මය

Pre-modern orthographies may bring consonants in a cluster closer together, rather than creating a conjunct (see touchingconsonants). In this case, the trigger is a ZWJ followed by a virama.

Complicating factors

It can be difficult to know how to type a word that you see on a non-digital platform. For example, there are several words in Wiktionary that are rendered with a visible al-lakuna but that have both the al-lakuna and ZWJ in the underlying code. The latter is invisible, so cannot be detected from looking at the word on paper, and fonts don't produce a conjunct form, but the way the word is typed will affect the behaviour in the digital world by producing different segmentation, as shown just below, where the top spelling has the ZWJ and the bottom doesn't.

කුරුල්‍ලා
කුරුල්ලා

The difference between just al-lakuna and al-lakuna with ZWJ can also affect vowel sign positioning. For the purposes of illustration, see fig_kro, where the word on the left is written with ZWJ to produce a conjunct, whereas the word in the middle has no ZWJ and no conjunct. Otherwise the characters are the same. Observe the placement of the pre-base vowel. In the syllable kro on the left, the vowel sign surrounds the whole conjunct. In the middle we drop the ZWJ to give -k.ro, and now the pre-base glyph precedes the RA. The same should happen if the code points indicate a conjunct but the font doesn't have the necessary glyphs.

ක්‍රො  ක්රො  *කේරා
Placement of pre-base vowel glyphs.

Browser behaviour

Test in your browser. Left to right, the following words contain 2 conjunct sequences with virama+ZWJ, one that displays as a conjunct, another that doesn't, and two sequences with virama but no ZWJ. First, the text is displayed in a contenteditable paragraph, then in a textarea. Results are reported for Gecko (Firefox), Blink (Chrome), and WebKit (Safari) on a Mac.

ක්‍රො ක්‍සො ක්රො කන්ද

Cursor movement. Move the cursor through the text.
Gecko steps through the whole text using grapheme clusters. The cursor visually stops in the middle of the virama+ZWJ sequences. Blink steps through the virama+ZWJ sequences using grapheme clusters, however the cursor appears to skip to the end of the whole sequence and you have to hit the cursor key again (with no apparent movement) to actually clear it. Blink treats the sequences with just a virama as a single unit. WebKit skips all sequences with a virama (whether or not there is a ZWJ) as a single unit.

Selection. Place the cursor next to a character and hold down shift while pressing an arrow key.
The behaviour is the same as for cursor movement. This has the effect of sometimes appearing to highlight backwards in Blink.

Deletion. Forward deletion works in the same way as cursor movement. The backspace key deletes code point by code point, except that WebKit deletes both the virama and the ZWJ at the same time.

Line-break. See this test. The CSS sets the value of the line-break property to anywhere. Change the size of the box to slowly move the line break point.
Gecko wraps at grapheme cluster boundaries except that it wraps a sequence with virama+ZWJ as a single unit. Blink and WebKit wrap everything at grapheme cluster boundaries, which has the effect of breaking a conjunct in half at the end of a line.

Punctuation & inline features

Phrase & section boundaries

,␣:␣;␣.␣?␣!␣෴

Sinhala uses western punctuation.

phrase: , ; :

sentence: . ? !

The punctuation character ෴ (U+0DF4 SINHALA PUNCTUATION KUNDDALIYA) once functioned to indicate the end of a paragraph, but is not used for modern Sinhala content.

Bracketed text

(␣)

Sinhala commonly uses ASCII parentheses to insert parenthetical information into text.

          | start | end
standard  | (     | )

Quotations & citations

‘␣’␣“␣”

Sinhala texts use quotation marks around quotations. Of course, due to keyboard design, quotations may also be surrounded by ASCII double and single quote marks.

         | start | end
initial  | “     | ”
nested   | ‘     | ’

Single quotation marks are used for quotations within quotations.

Line & paragraph layout

Line breaking & hyphenation

Sinhala is normally wrapped where spaces mark word boundaries.

Line-edge rules

As in almost all writing systems, certain punctuation characters should not appear at the end or the start of a line. The Unicode line-break properties help applications decide whether a character should appear at the start or end of a line.


The following list gives examples of typical behaviours for characters used in modern Sinhala. Context may affect the behaviour of some of these and other characters.


  • “ ‘ (   should not be the last character on a line
  • ” ’ ) ? ! %   should not begin a new line

Baselines, line height, etc.

Sinhala uses the so-called 'alphabetic' baseline, which is the same as for Latin and many other scripts.

Diacritics appear above and below Sinhala letters, and consonant clusters are stacked. However, these remain reasonably close to the letters, and in fact, tall letters may be reshaped to avoid large extensions.

To give an approximate idea, fig_baselines compares Latin and Sinhala glyphs from Noto fonts. The basic height of Sinhala letters is typically around (just marginally higher than) the Latin x-height, however combining marks reach a little beyond the Latin ascenders, creating a need for slightly larger line spacing.

Xhqxන්දලලූයුඬිග්‍රීම𑇬𑇳 Xhqxන්දලලූයුඬිග්‍රීම𑇬𑇳
Font metrics for Latin text compared with Sinhala glyphs in the Noto Serif Sinhala (top) and Noto Sans Sinhala (bottom) fonts.

fig_baselines_other shows similar comparisons for the Iskoola Pota and Sinhala MN fonts.

Xhqxන්දලලූයුඬිග්‍රීම𑇬𑇳 Xhqxන්දලලූයුඬිග්‍රීම𑇬𑇳
Latin font metrics compared with Sinhala glyphs in the Iskoola Pota (top) and Sinhala MN (bottom) fonts.

Page & book layout

Input

The Sinhala keyboard has dead keys, which change the assignments of the keys around them when pressed. For example, pressing the key for e will change several keys to letters that start with the same symbol.

Sinhala keyboard in default state.

Sinhala keyboard after the key for e is pressed.

Note also, in the bottom left corner, that the keyboard has a key for the combination of 0DCA 200D 0DBB, ie. the conjoined -r. The shifted layout has a similar key for -y.

There is a rephaya key (for the sequence 0DBB 0DCA 200D), but it is typed after the consonant that normally follows it in memory. The input method then has to rearrange the codepoints in canonical order.

Effectively, you type characters or parts of multipart characters in visual order, and the system then has to rearrange things to produce the expected codepoint order.
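The reordering described for the rephaya key can be illustrated with a small sketch. This is not the logic of any actual keyboard or input method: the function apply_rephaya_key is hypothetical, it only handles a rephaya typed directly after a single consonant, and it treats every code point in U+0D9A..U+0DC6 as a consonant.

```python
REPHAYA = "\u0DBB\u0DCA\u200D"  # RA + al-lakuna + ZWJ

def is_consonant(ch):
    """Rough check: Sinhala consonant letters lie in U+0D9A..U+0DC6."""
    return "\u0D9A" <= ch <= "\u0DC6"

def apply_rephaya_key(buffer):
    """Insert the rephaya sequence before the consonant that was just typed."""
    if buffer and is_consonant(buffer[-1]):
        return buffer[:-1] + REPHAYA + buffer[-1]
    # No preceding consonant: append the sequence in place.
    return buffer + REPHAYA

# Typing KA, MA, then the rephaya key gives KA + RA + al-lakuna + ZWJ + MA.
typed = "\u0D9A\u0DB8"
print([f"U+{ord(c):04X}" for c in apply_rephaya_key(typed)])
```

A real input method would also need to handle vowel signs or other marks already typed after the consonant, which this sketch ignores.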

References