Sorani (draft)

Arabic script orthography notes

Updated 13 April, 2024

This page brings together basic information about the Arabic script and its use for the Sorani (Central Kurdish) language. It aims to provide a brief, descriptive summary of the modern, printed orthography and typographic features, and to advise how to write Sorani using Unicode.

Referencing this document

Richard Ishida, Sorani (Arabic script) Orthography Notes, 13-Apr-2024, https://r12a.github.io/scripts/arab/ckb

Sample


هەموو مرۆڤ ئازاد و دوەقار و مافان دە وەکهەڤ تێن دنیایێ. ئەو خوەدی هش و شوئوورن و دڤێ لهەمبەر هەڤ بزهنیەتەکە براتیێ بلڤن.

Source: Unicode UDHR, article 1, from Omniglot.

Usage & history

Origins of the Arabic script, 6thC – today.

Phoenician

└ Aramaic

└ Nabataean

└ Arabic

The use of the Sorani (or Central Kurdish) language during the 20th century alternated between periods when it was encouraged and others when it was suppressed. The language currently has a range of uses, principally in Iraq and Iran, that include education, government, and online. According to the Ethnologue the language is used by just over 5 million people, the majority of whom are in Iraq.

زمانێ سۆرانی

Although the language had been written using Arabic-script letters from some centuries beforehand, the Sorani orthography described here first began to take its current form after reforms introduced by Taufiq Wahby in the 1920s. This occurred while the British were encouraging the use of the language.

The following map of Kurdish dialects was created for Wikipedia. The Wikipedia article on Sorani contains useful additional details about the use of Sorani since the 1700s.

Map of Kurdish language use.

Basic features

The Sorani Arabic orthography is derived from the Arabic/Persian abjads, where in normal use the script represents only consonant and long vowel sounds. However, the script has been adapted in this orthography to use letters for vowel sounds, making it an alphabet. See the table to the right for a brief overview of features for the modern Sorani orthography using the Arabic script.

Sorani text runs right-to-left in horizontal lines, but numbers and embedded Latin text are read left-to-right.

The writing is cursive (ie. letters are joined), and some basic letter shapes change significantly, depending on what they join to. The baseline is the same as for Latin text. There is no case distinction. Words are separated by spaces.

A mandatory ligature is used for combinations of lam + alif.

❯ consonantSummary

Standard Sorani represents consonant sounds using 28 consonant letters. (6 other characters have been noted for dialectal orthographies but are poorly attested.) Unusually, the basic consonant letters distinguish between r and ɾ, and between l and ɫ.

Sorani doesn't use any special features (such as sukun or shadda) to indicate consonant clusters or gemination.

❯ basicV

Sorani is an alphabet where vowels are written using letters; there are no combining marks. However, it is not completely alphabetic because the sound ɪ is unwritten (like medial ə in Armenian). Sorani uses 4 dedicated vowel letters and 2 consonants to write the other 7 vowel sounds (plus some contextual variants of æ). The long vowel is represented by a doubled WAW digraph (وو).

Word-initial standalone vowels are preceded by 0626.

The orthography has no special feature to indicate an absence of a vowel following a consonant.

Sorani uses both ASCII and native digits, and a mixture of ASCII and Arabic code points for common punctuation marks.

Joining forms

Because the Arabic script is 'cursive' (ie. joined-up) writing, letters tend to have different shapes depending on whether they join with adjacent letters or not (see cursive). In addition, vowels can be represented using different characters, depending on where in a word they appear.

In scripts such as Arabic, several characters have no left-joining form. In what follows we'll use the characters ي and د to illustrate shapes. The former can join on both sides, but the latter can only join on the right.

Left-joining glyphs are commonly called initial; dual-joining are called medial; and right-joining are called final. Glyphs that don't join on either side are called isolated. However, these glyph shapes can be found in various places within a single word.

Word-initial characters usually have initial glyph shapes (eg. 064A ). However, characters that only join to the right will use an isolated glyph shape (eg. 062F ). Furthermore, words beginning with a vowel are always preceded by a vowel carrier, which is normally ا (eg. 0627 06CC or 0627 064E ).

Word-medial characters will typically join on both sides (eg. 064A ) but those that only join to the right will use a final glyph (eg. 062F ). However, if either of those is preceded by another character that only joins to the right, the glyph shapes rendered will be initial (eg. 064A ) and isolated (eg. 062F ), respectively.

Word-final characters will typically use a final glyph shape (eg. 064A and 062F ). However, if the previous character joins only to the right, they will use isolated glyph shapes (eg. 064A and 062F ).

In all this contextual glyph shaping the basic shapes used for a character can vary significantly in a script like Arabic. This also includes some characters that only have ijam dots in certain contexts.
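
The positional rules above can be sketched in a few lines of Python. This is an illustration only: the name RIGHT_JOINING and the function joining_forms are hypothetical, the set of right-joining letters is copied by hand from the joining tables on this page, every other letter is assumed to be dual-joining, and formatting characters such as ZWJ and ZWNJ are ignored.

```python
# Letters that join only to the right, copied from the right-joining table on this page:
# alef, dal, reh, zain, jeh, reh with small v below, waw, oe, waw with two dots above,
# ae, reh with small v, reh with dot below.
RIGHT_JOINING = {
    "\u0627", "\u062F", "\u0631", "\u0632", "\u0698", "\u0695",
    "\u0648", "\u06C6", "\u06CA", "\u06D5", "\u0692", "\u0694",
}


def joining_forms(word: str) -> list[str]:
    """Return 'isolated', 'initial', 'medial' or 'final' for each letter of a word.

    Simplifying assumption: every character is an Arabic letter, and any
    letter not listed in RIGHT_JOINING is dual-joining.
    """
    forms = []
    for i, ch in enumerate(word):
        # A letter connects to the previous letter only if that letter can join leftwards.
        joins_prev = i > 0 and word[i - 1] not in RIGHT_JOINING
        # A letter connects to the next letter only if it can itself join leftwards.
        joins_next = i < len(word) - 1 and ch not in RIGHT_JOINING
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms
```

In practice the font and the rendering engine make these decisions; the sketch is only meant to make the rules concrete.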

Character index

Letters


Basic consonants

ئ␣ب␣ت␣ج␣ح␣خ␣د␣ر␣ز␣س␣ش␣ع␣غ␣ف␣ق␣ل␣م␣ن␣ه␣و␣پ␣چ␣ڕ␣ژ␣ڤ␣ک␣گ␣ڵ␣ھ␣ی

Vowels

ا␣ۆ␣ۊ␣ێ␣ە

Other

ك␣ڶ␣ڷ␣ڒ␣ڔ␣ۊ␣ي

Combining marks


Decomposed text

ٔ

Not used for Sorani

ّ␣ْ

Numbers

۱␣۲␣۳␣۴␣۵␣۶␣۷␣۸␣۹␣۰

Punctuation

،␣؛␣؟␣“␣”␣…

ASCII

!␣(␣)␣.␣:␣[␣]

Other

‌␣‍␣⁧␣‫␣⁦␣‪␣⁨␣⁩␣‬␣‏␣‎

To be investigated

%␣-␣«␣»␣‑␣–␣—␣‘␣’␣“␣‰␣‹␣›

Phonology

The following represents the general repertoire of the Sorani language.


Phones in a lighter colour are non-native or allophones.

Vowel sounds

Plain vowels

ɪ ə ə ɛ æ ɑː ɑː

Consonant sounds

labial alveolar post-alveolar palatal velar uvular pharyngeal glottal
stop p b t d     k ɡ q   ʔ
affricate     t͡ʃ d͡ʒ          
fricative f v s z ʃ ʒ   x ɣ χ ħ ʕ h
nasal m n     ŋ    
approximant w l ɫ   j      
trill/flap   r ɾ    

The velar consonants tend to be palatalised before i and e [ua].

The sounds ħ and ʕ are the result of Arabic influence on pronunciation, and are most often heard outside of Iran. They are not necessarily used for loan words. The word حەوت is commonly pronounced using ħ.

Tone

Sorani is not a tonal language.

Structure

tbd

Vowels

Vowel summary table

The following table summarises the main vowel to character assignments.

ⓘ represents the unwritten vowel. Each table cell shows word-initial, word-medial, and word-final forms from right to left. The glyphs shown are illustrative; alternative shapes may occur (see joining_forms).

Simple:
‍ی␣‍ی‍␣ئی‍
‍وو␣‍وو␣ئوو
‍و␣‍و␣ئو
‍ێ␣‍ێ‍␣ئێ‍
‍ۆ␣‍ۆ␣ئۆ
‍ە␣‍ە␣ئە
 
‍ە␣‍ە␣ئە
‍ا␣‍ا␣ئا‍

For additional details see vowel_mappings.

Here is the full set of characters described in this section.

ئ␣ا␣و␣ۆ␣ۊ␣ی␣ێ␣ە

Unwritten vowel

The vowel sound ə~ɪ is not written in Sorani text. This is similar to the way Armenian doesn't write the sound ə, and means that Sorani isn't a perfect alphabet.

مرۆڤ

سفر

وشک

مامر

Vowel letters

All written Sorani vowels use letters. Those letters do not decompose, so there are no combining marks involved.

Sorani uses 4 dedicated vowel letters and 2 consonants to write the 7 post-consonant vowel sounds (plus some contextual variants of æ). The long vowel is represented by a doubled WAW digraph (وو).

Plain vowels

This panel shows the characters used for monophthongs in Sorani. The shape of the letter changes according to position (see basicV above), but the vowel is always written with the same, single character.

ی␣وو␣و␣ێ␣ۆ␣ە␣ا

06D5 is shown here to represent the short æ vowel. In some resources 0647 200C is used, instead. See unresolved_encoding for more information.

06D5 is generally pronounced æ, but before the codas -w and -j it becomes ə. Before a j- onset it becomes ɛ.

ی and و are consonants that are also used to indicate vowels.

Non-standard letter

The Unicode Standard mentions the letter below as representing a dialectal or other poorly attested alternative form of the Sorani alphabet [u15,397].

ۊ

Diphthongs/semi-vowels

The main vowel in a syllable can be followed by -j or -w. These semivowels are written using the consonant letters ی and و, respectively.

چووی

بنەوشە

As already mentioned, ە, which normally represents the sound æ, is pronounced ə before these glides.

قەیچی

شەو

These glides can also appear before a main vowel, but for that see onsets.

Vowel length

Vowel length is indicated by the choice of vowel letter.

Standalone vowels

ئ

Word-initial standalone vowels are preceded by ئ.

ئافرەت

ئەمڕۆ

ئێرانی
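
As a small illustration of this rule, the following Python sketch prefixes 0626 to a word that begins with one of the four dedicated vowel letters. The names are hypothetical, and the sketch deliberately ignores و and ی, since those letters can also be word-initial consonants and the distinction cannot be made from the spelling alone.

```python
HAMZA_CARRIER = "\u0626"  # YEH WITH HAMZA ABOVE, written before a word-initial vowel

# The four dedicated vowel letters: alef, oe, yeh with small v, ae.
VOWEL_LETTERS = {"\u0627", "\u06C6", "\u06CE", "\u06D5"}


def add_initial_carrier(word: str) -> str:
    """Prefix 0626 when a word starts with a dedicated vowel letter."""
    if word and word[0] in VOWEL_LETTERS:
        return HAMZA_CARRIER + word
    # Words that already start with 0626, or with a consonant, are returned unchanged.
    return word
```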

Vowel absence

Sorani doesn't use 0652 (sukun) to indicate vowel absence. A little care needs to be exercised, however, when reading, since Sorani has an unwritten vowel ɪ, which may occur, unmarked, between consonants. For example, compare:

ئەمڕۆ

مرۆڤ

Vowel sounds to characters

This section maps Sorani vowel sounds to common graphemes in the Arabic orthography.

The left column shows dependent vowels, and the right column independent vowel letters.


Plain vowels

 

06CC

بیست

06CC06CC06CC

 

0648 0648

بووڵ

0648 06480648 0648

ɪ
 

Not written

سفر

 
ʊ
 

0648

کورد

06480648

 

06CE

شێر

06CE06CE06CE

 

06C6

بۆن

06C606C6

ɛ
 

06D5 esp. before a j- onset.

06D506D5

ə
 

06D5 esp. before a -w or -j coda.

شەو

06D506D5

æ
 

06D5

ئەمڕۆ

06D506D5

ɑː
 

0627

شەو

06270627

Consonants

Consonant summary table

The following table summarises the main consonant to character assignments.

Stops
پ␣ب␣ت␣د␣ک␣گ␣ق
Affricates
چ␣ج
Fricatives
ف␣ڤ␣س␣ز␣ش␣ژ␣خ␣غ␣ح␣ع␣ه␣ھ
Nasals
م␣ن
Approximants
trills & flaps
و␣ر␣ڕ␣ل␣ڵ␣ی

For additional details see consonant_mappings.

Basic consonants

Whereas the table just above takes you from sounds to letters, the following simply lists the basic consonant letters (however, since the orthography is highly phonetic there is little difference in ordering).

پ␣ب␣ت␣د␣ک␣گ␣ق␣چ␣ج␣ف␣ڤ␣س␣ز␣ش␣ژ␣خ␣غ␣ح␣ع␣ه␣ھ␣م␣ن␣ل␣ڵ␣ی␣ڕ␣ر␣و

The list above includes 2 letters for the sound h because some texts use one, whereas some use the other. For more information, see unresolved_encoding.

Non-standard letters

The letters in the list below represent dialectal or other poorly attested alternative forms of the Sorani alphabet extensions [u15,397].

نٚ␣ڒ␣ڔ␣ڶ␣ڷ␣ك

See also encoding for alternative code points used in some texts.

Onsets

Sorani syllables may begin with a consonant followed by a j or w glide. These are written using the ordinary consonants ی and و.

پیاو

ژوان

Consonant clusters

Sorani doesn't use any special features to deal with consonant clusters, or syllable-final consonants. There are no conjuncts, and 0652 (sukun) is not used.

ئاشتی

زانستگە

Consonant sounds to characters

This section maps Sorani consonant sounds to common graphemes in the Arabic orthography.

Each entry gives the sound, the code point used to write it, and an example word. Sounds listed as 'infrequent' are allophones, or sounds used for foreign words, etc.

p: 067E, eg. پێنج
b: 0628, eg. بادەم
t: 062A, eg. توتن
d: 062F, eg. دیوار
k: 06A9, eg. کەر
ɡ: 06AF, eg. گران
q: 0642, eg. قاز
ʔ: 0626 before standalone vowels, eg. ئەرمەنی
t͡ʃ: 0686, eg. چاو
d͡ʒ: 062C, eg. جووتیار
f: 0641, eg. فیلم
v: 06A4, eg. مرۆڤ
s: 0633, eg. سوور
z: 0632, eg. زستان
ʃ: 0634, eg. شەش
ʒ: 0698, eg. ژان
x: 062E, eg. خۆر
ɣ: 063A, eg. کاغەز
ħ: 062D, eg. حەوت
ʕ: 0639, eg. عێراق
h: 06BE, eg. ھیوا, or 0647, eg. هەز. Only one of these 2 characters is used for this sound, but there are no clear rules at present as to which, so either can be found in online texts. The KRG expects the second to be used [kk].
m: 0645, eg. مازوو
n: 0646, eg. نەچە
ŋ: 0646 before a velar consonant, eg. مانگ
w: 0648, eg. وشک
ɾ: 0631, eg. زەرد
r: 0695, eg. ڕەقە
l: 0644, eg. لەبن
ɫ: 06B5, eg. چۆڵ
j: 06CC, eg. یانزە

Other features

Ligatures

The combination لا is always written as a ligature. The underlying code points are, however, preserved. The shape varies slightly, depending on whether the ligature joins to the right or not. Compare:

لادێ

سڵاو
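
To see that the ligature is purely a rendering effect, the underlying code points can be listed with Python's standard unicodedata module. The helper name code_points is hypothetical.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """List each character as 'HEX NAME' so the stored sequence can be inspected."""
    return [f"{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# The first example word above: lam and alef remain two separate code points
# in memory, even though they are displayed as a single ligature.
word = "\u0644\u0627\u062F\u06CE"
for entry in code_points(word):
    print(entry)
```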

Formatting characters

Arabic script text makes use of a relatively large set of invisible formatting characters, especially in plain text, many of which are used to manage text direction. Descriptions of these characters can be found in the following sections:

Presentation Forms

The code points in the Unicode blocks Arabic Presentation Forms-A and Arabic Presentation Forms-B provide positional forms of Arabic letters and ligatures. They should not be used for ordinary text. Those code points are provided for compatibility with legacy code pages, and have (compatibility) character decomposition mappings. Normally, Arabic text should be written with code points from the main Arabic block and its extensions; positional forms are dealt with by the font and rendering algorithms.
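
One possible way to detect and replace presentation-form code points is sketched below in Python. The function names are hypothetical. The sketch relies on the compatibility decomposition mappings mentioned above, applying NFKC only to characters in the two presentation-forms blocks so that the rest of the text is left untouched. Characters in those blocks that have no decomposition are returned as they are.

```python
import unicodedata


def is_presentation_form(ch: str) -> bool:
    """True for code points in Arabic Presentation Forms-A (FB50-FDFF) or -B (FE70-FEFF)."""
    cp = ord(ch)
    return 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFF


def replace_presentation_forms(text: str) -> str:
    """Map presentation forms to ordinary letters via their compatibility decompositions."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if is_presentation_form(ch) else ch
        for ch in text
    )
```

For example, the isolated lam-alef ligature FEFB becomes the two-letter sequence 0644 0627, which the font then renders as a ligature.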

For more information see the Arabic orthography notes.

Encoding choices

This section offers advice about characters or character sequences to avoid, and what to use instead. It takes into account the relevance of Unicode Normalisation Form D (NFD) and Unicode Normalisation Form C (NFC).

Although usage is recommended here, content authors may well be unaware of such recommendations. Therefore, applications should look out for the non-recommended approach and treat it the same as the recommended approach wherever possible.

Canonically equivalent encodings

Only one letter can be represented either as an atomic character (the norm) or as a sequence of base letter plus combining mark. The parts are separated in Unicode Normalisation Form D (NFD), and recomposed in Unicode Normalisation Form C (NFC), so both approaches should be treated as canonically equivalent.

Atomic (recommended)   Decomposed (NOT recommended)
ئ   064A 0654

Note that the base character in the decomposed sequence is 064A, and not 06CC, which is used elsewhere for yeh in Sorani. 064A is only used for this specific decomposed sequence; it is inappropriate to use it elsewhere in Sorani text.
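
The equivalence can be checked with Python's standard unicodedata module. In the sketch below the constant and function names are hypothetical; the code points are 0626 (YEH WITH HAMZA ABOVE) and its canonical decomposition 064A 0654.

```python
import unicodedata

ATOMIC = "\u0626"            # YEH WITH HAMZA ABOVE as a single code point
DECOMPOSED = "\u064A\u0654"  # ARABIC YEH followed by combining HAMZA ABOVE


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalising both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# NFD separates the parts; NFC recomposes them.
assert unicodedata.normalize("NFD", ATOMIC) == DECOMPOSED
assert unicodedata.normalize("NFC", DECOMPOSED) == ATOMIC
```

An application that normalises both sides before comparing, as same_text does, treats the two encodings as the same text.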

Unresolved encodings

A couple of Sorani letters are encoded in different ways in different texts, and in fact can be mixed within the same text. At the moment there doesn't appear to be a clear ruling on which is expected. This section lists the alternatives.

Alternative 1 Alternative 2 Notes
06BE 0647

Wikipedia and Wiktionary represent the sound h using only HEH DOACHASHMEE, whereas gov.krd uses only ARABIC HEH. Other online resources examined are not completely one or the other. Note that Uighur uses HEH DOACHASHMEE to represent h.

The Kurdish Regional Government pages standardising keyboard layout expect the use of 0647 [kk].

06D5 0647 200C

The majority usage seems to favour AE, although again some resources mix both to some degree (although they tend to mostly use AE). This makes sense, since the use of HEH plus ZWNJ has the appearance of a hack intended to prevent HEH joining to the left, whereas AE will do this naturally, without any formatting code point. Use of AE also gets around practical difficulties that arise because the ZWNJ character is invisible and is not readily accessible from many keyboards – difficulties that are amplified by the fact that this vowel letter is one of the most commonly used letters in the Sorani alphabet. Note that Uighur also uses AE to represent a vowel, and a different character for the sound h.

The Kurdish Regional Government pages standardising keyboard layout distinguish between both [kk].
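
Because both alternatives occur in real text, an application that searches or compares Sorani text may want to fold them onto a single encoding first. The following Python sketch shows one way to do that. The function name is hypothetical, and the choice of 0647 as the target for h is an assumption made for the example, not a ruling. The sketch also cannot recognise a word-final 0647 that was typed for the vowel without a following ZWNJ.

```python
HEH = "\u0647"              # ARABIC LETTER HEH
HEH_DOACHASHMEE = "\u06BE"  # ARABIC LETTER HEH DOACHASHMEE
AE = "\u06D5"               # ARABIC LETTER AE
ZWNJ = "\u200C"             # ZERO WIDTH NON-JOINER


def fold_for_search(text: str) -> str:
    """Fold the alternative encodings onto one form before searching or comparing."""
    # HEH + ZWNJ is the alternative spelling of the vowel; fold it to AE first,
    # so that the HEH it contains is not mistaken for the consonant h.
    text = text.replace(HEH + ZWNJ, AE)
    # Then fold the two spellings of the consonant h onto a single code point.
    return text.replace(HEH_DOACHASHMEE, HEH)
```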

Confusables & spelling errors

This table lists characters that are often mistakenly used because they look the same as or similar to the code points used for Sorani, or perhaps because the correct character is not available on the user's keyboard.

Incorrect Correct Notes
064A 06CC The Arabic YEH doesn't drop the dots below in isolate and final positions.
0643 06A9 Common fonts tend not to show the difference between these two characters, but the ability to search and compare text is impaired unless the application is aware of and takes counter-measures against this substitution.
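
A counter-measure of the kind mentioned in the table could look like the following Python sketch. The names are hypothetical. The text is normalised to NFC first, so that a decomposed 064A 0654 sequence is recomposed to 0626 before any remaining 064A is replaced.

```python
import unicodedata

# Incorrect code point -> correct code point, as listed in the table above.
CONFUSABLES = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC YEH -> FARSI YEH
    "\u0643": "\u06A9",  # ARABIC KAF -> KEHEH
})


def fix_confusables(text: str) -> str:
    """Replace look-alike code points with the ones expected in Sorani text."""
    # Compose first: the 064A in a decomposed 064A 0654 sequence is legitimate
    # and must not be turned into 06CC.
    return unicodedata.normalize("NFC", text).translate(CONFUSABLES)
```

Whether to rewrite stored text or only to fold it at search time is a design choice for the application.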

False friends

The following atomic characters look as if they could be composed of parts, but in fact there is no equivalence during normalisation, and so the atomic characters only should be used.

Atomic Sequence ( DO NOT use! )
ڵ 0644 065A
ێ 06CC 065A
ۆ 0648 065A
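
The lack of equivalence can be confirmed with Python's unicodedata module: normalisation leaves both the atomic characters and the look-alike sequences unchanged. The sketch below also includes a hypothetical repair function that an application might use to treat the sequences the same as the atomic characters.

```python
import unicodedata

# Atomic character -> look-alike sequence from the table above.
FALSE_FRIENDS = {
    "\u06B5": "\u0644\u065A",  # lam with small v
    "\u06CE": "\u06CC\u065A",  # yeh with small v
    "\u06C6": "\u0648\u065A",  # oe
}


def repair_false_friends(text: str) -> str:
    """Replace each look-alike sequence with the corresponding atomic character."""
    for atomic, sequence in FALSE_FRIENDS.items():
        text = text.replace(sequence, atomic)
    return text


for atomic, sequence in FALSE_FRIENDS.items():
    # NFC does not compose the sequence, and NFD does not split the atomic character.
    assert unicodedata.normalize("NFC", sequence) != atomic
    assert unicodedata.normalize("NFD", atomic) == atomic
```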

Codepoint sequences

Combining marks always follow the base character. For Sorani, however, which doesn't normally use combining marks, this is only relevant for ئ when it occurs in decomposed text.

Numbers

Digits

Sorani uses the set of native digits from the extended Arabic range, but most modern texts appear to use ASCII digits much of the time.

۱␣۲␣۳␣۴␣۵␣۶␣۷␣۸␣۹␣۰
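
Since both digit sets occur in Sorani text, converting between them is a common need. A minimal Python sketch follows, with hypothetical function names. The native digits are the extended Arabic-Indic digits 06F0 to 06F9.

```python
ASCII_DIGITS = "0123456789"
# Extended Arabic-Indic digits, 06F0 (zero) to 06F9 (nine).
NATIVE_DIGITS = "".join(chr(0x06F0 + n) for n in range(10))

TO_ASCII = str.maketrans(NATIVE_DIGITS, ASCII_DIGITS)
TO_NATIVE = str.maketrans(ASCII_DIGITS, NATIVE_DIGITS)


def to_ascii_digits(text: str) -> str:
    """Replace native digits with ASCII digits, leaving other characters alone."""
    return text.translate(TO_ASCII)


def to_native_digits(text: str) -> str:
    """Replace ASCII digits with native digits, leaving other characters alone."""
    return text.translate(TO_NATIVE)
```

Python's built-in int() also accepts the native digits directly, so conversion is only needed when the digits themselves must be displayed or compared as text.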

Text direction

Arabic script text is written horizontally and right-to-left in the main but, as in most right-to-left scripts, numbers and embedded text in other scripts are written left-to-right (producing 'bidirectional' text).

العاشر ليونيكود (Unicode Conference)،الذي سيعقد في 10-12 آذار 1997 مبدينة
Arabic words are read right-to-left, starting from the right of this line, but numbers and Latin text (highlighted) are read left-to-right.

The Unicode Bidirectional Algorithm automatically takes care of the ordering for all the text in fig_bidi, as long as the 'base direction' is set to RTL. In HTML this can be set using the dir attribute, or in plain text using formatting controls.

If the base direction is not set appropriately, the directional runs will be ordered incorrectly as shown in fig_bidi_no_base_direction, making it very difficult to get the meaning.

في XHMTL 1.0 يتم تحقيق ذلك بإضافة العنصر المضمن bdo.
في XHMTL 1.0 يتم تحقيق ذلك بإضافة العنصر المضمن bdo.
The exact same sequence of characters with the base direction set to RTL (top), and with no base direction set on this LTR page (bottom). Certain items are highlighted to help track their position.
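
The bidirectional algorithm works from a property assigned to each character. As a rough illustration, Python's unicodedata module can report that property; the helper name bidi_classes is hypothetical.

```python
import unicodedata


def bidi_classes(text: str) -> list[tuple[str, str]]:
    """Return (hex code point, Bidi_Class) for each character in the text."""
    return [(f"{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


# Arabic-script letters are AL (strongly right-to-left), Latin letters are L,
# and both ASCII and native digits are EN (European number).
print(bidi_classes("\u06A9a1\u06F1"))
```

Characters with weak or neutral classes, such as digits, spaces and punctuation, take their direction from the surrounding context, which is why the base direction matters.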


For other aspects of dealing with right-to-left writing systems see the following sections:

For more information about how directionality and base direction work, see Unicode Bidirectional Algorithm basics. For information about plain text formatting characters see How to use Unicode controls for bidi text. And for working with markup in HTML, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.

For authoring HTML pages, one of the most important things to remember is to use <html dir="rtl" … > at the top of the page. Also, use markup to manage direction, and do not use CSS styling.

Managing text direction

Unicode provides a set of 10 formatting characters that can be used to control the direction of text when displayed. These characters have no visual form in the rendered text, however text editing applications may have a way to show their location.

202B (RLE), 202A (LRE), and 202C (PDF) are in widespread use to set the base direction of a range of characters. RLE/LRE comes at the start, and PDF at the end of a range of characters for which the base direction is to be set.

In Unicode 6.3, the Unicode Standard added a set of characters which do the same thing but also isolate the content from surrounding characters, in order to avoid spillover effects. They are 2067 (RLI), 2066 (LRI), and 2069 (PDI). The Unicode Standard recommends that these be used instead.

There is also 2068 (FSI), used initially to set the base direction according to the first recognised strongly-directional character.

061C (ALM) is used to produce correct sequencing of numeric data. Follow the link and see expressions for details.

200F (RLM) and 200E (LRM) are invisible characters with strong directional properties that are also sometimes used to produce the correct ordering of text.
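
For plain text, wrapping a run of characters in isolate controls can be sketched as follows in Python. The function name and its direction argument are hypothetical. The code points used are RLI (2067), LRI (2066), FSI (2068) and PDI (2069).

```python
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str, direction: str = "auto") -> str:
    """Wrap text in isolate controls; direction is 'rtl', 'ltr' or 'auto'."""
    opener = {"rtl": RLI, "ltr": LRI, "auto": FSI}[direction]
    # PDI closes whichever isolate was opened.
    return f"{opener}{text}{PDI}"
```

This is typically useful when inserting text of unknown direction, such as a user name, into a longer string.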

For more information about how to use these formatting characters see How to use Unicode controls for bidi text. Note, however, that when writing HTML you should generally use markup rather than these control codes. For information about that, see Creating HTML Pages in Arabic, Hebrew and Other Right-to-left Scripts.

Glyph shaping & positioning

Experiment with examples using the Sorani character app.

Cursive script

Arabic script is always cursive, ie. letters in a word are joined up. Fonts need to produce the appropriate joining form for a letter, according to its visual context, but the code point used doesn't change. This results in four different shapes for most letters (including an isolated shape). Ligated forms also join with characters alongside them.

The highlights in the example below show the same letter, ع, with three different joining forms.

على • متعددة • وسيجمع

The letter ع (ain) in 3 different joining contexts.

Most Arabic script letters join on both sides. A few only join on the right-hand side: this involves 5 underlying shapes for the Sorani orthography.

Cursive joining forms

Most dual-joining characters add or become a swash when they don't join to the left. A number of characters, however, undergo additional shape changes across the joining forms. fig_joining_forms and fig_right_joining_forms show the basic shapes in the Sorani orthography and what their joining forms look like. Significant variations are highlighted.

isolated   right-joined   dual-joined   left-joined   Sorani letters
ب ـب ـبـ بـ
ب␣ت␣پ
ن ـن ـنـ نـ
ن
ق ـق ـقـ قـ
ق
ف ـف ـفـ فـ
ف␣ڤ
س ـس ـسـ سـ
س␣ش
ک ـک ـکـ کـ
ک␣گ
ل ـل ـلـ لـ
ل␣ڵ␣ڶ␣ڷ
ه ـه ـهـ هـ
ه␣ھ
م ـم ـمـ مـ
م
ع ـع ـعـ عـ
ع␣غ
ح ـح ـحـ حـ
ج␣ح␣خ␣چ
ي ـي ـيـ يـ
ی␣ێ␣ئ
Joining forms for shapes that join on both sides.
isolated   right-joined   Sorani letters
ا ـا
ا
ر ـر
ڒ␣ڔ␣ڕ␣ژ␣ر␣ز
د ـد
د
و ـو
و␣ۆ␣ۊ
ە ـە
ە
Joining forms for shapes that join on the right only.

Managing glyph shaping

200D (ZWJ) and 200C (ZWNJ) are used to control the joining behaviour of cursive glyphs. They are particularly useful in educational contexts, but also have real world applications.

ZWJ permits a letter to form a cursive connection without a visible neighbour. For example, the marker for hijri dates is an initial form of heh, even though it doesn't join to the left, ie. ه‍. For this, use ZWJ immediately after the heh, eg. الاثنين 10 رجب 1415 ه‍..

ZWNJ prevents two adjacent letters forming a cursive connection with each other when rendered. For example, it is used in Persian for plural suffixes, some proper names, and Ottoman Turkish vowels. Ignoring or removing the ZWNJ will result in text with a different meaning or meaningless text, eg. تن‌ها is the plural of body, whereas تنها is the adjective alone [2]. The only difference is the presence or absence of ZWNJ after noon.

People who don't use 06D5 to represent the sound æ, but use 0647 instead, need to follow the latter with a ZWNJ character to prevent it joining with any following letter. For more information see unresolved_encoding.

Context-based shaping & positioning

See just above for shaping related to cursive joining.

See also the section on glyph shaping in the Arabic orthography notes.

Typographic units

Word boundaries

Words are separated by spaces.

Graphemes

Since there are no combining marks or decompositions in normal Sorani text, grapheme clusters correspond to individual characters. Where combining marks appear in decomposed text, the combination of base and combining mark still fits within the definition of a grapheme cluster.
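
The following Python sketch approximates this behaviour by attaching each combining mark to the character before it. It is not an implementation of the Unicode grapheme cluster rules: the function name is hypothetical, a character is treated as a combining mark if it has a non-zero canonical combining class, and formatting characters such as ZWJ and ZWNJ are counted as separate units.

```python
import unicodedata


def simple_clusters(text: str) -> list[str]:
    """Group each base character with any combining marks that follow it."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            # Non-zero combining class: attach the mark to the preceding cluster.
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters
```

With normal Sorani text the number of clusters equals the number of characters; with decomposed text the base and mark are counted as one unit.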

Punctuation & inline features

Phrase & section boundaries

tbd

،␣؛␣؟␣!␣.␣:

Sorani uses a mixture of ASCII and Arabic punctuation.

phrase

،

؛

:

sentence

.

؟

!

Bracketed text

(␣)

Sorani commonly uses ASCII parentheses to insert parenthetical information into text.

  start end
standard

(

)

Mirrored characters

The words 'left' and 'right' in the Unicode names for parentheses, brackets, and other paired characters should be ignored. LEFT should be read as if it said START, and RIGHT as END. The direction in which the glyphs point will be automatically determined according to the base direction of the text.

a > b > c
ا > ب > ج
Both of these lines use > U+003E GREATER-THAN SIGN, but the direction it faces depends on the base direction at the point of display.

The number of characters that are mirrored in this way is around 550, most of which are mathematical symbols. Some are single characters, rather than pairs. The following are some of the more common ones for Arabic.

(␣)␣<␣>␣[␣]␣{␣}␣«␣»␣‹␣›
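
Whether a given character is mirrored can be looked up in the Unicode character database. A minimal Python sketch, with a hypothetical helper name:

```python
import unicodedata


def is_mirrored(ch: str) -> bool:
    """True if the character has the Bidi_Mirrored property."""
    return unicodedata.mirrored(ch) == 1


# Parentheses, angle brackets and guillemets are mirrored;
# curly quotation marks are not.
for ch in "()<>\u00AB\u00BB\u201C\u201D":
    print(f"{ord(ch):04X}", is_mirrored(ch))
```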

Quotations & citations

“␣”␣‘␣’

Sorani texts may use quotation marks around quotations. Of course, due to keyboard design, quotations may also be surrounded by ASCII double and single quote marks.

  start end
initial

nested

Unlike the brackets, these characters are not mirrored during display. This means that LEFT means use on the left, and RIGHT means use on the right.

Line & paragraph layout

Line breaking & hyphenation

tbd

The primary line break opportunity is the space between words.

In-word line-breaking

tbd

Line-edge rules

As in almost all writing systems, certain punctuation characters should not appear at the end or the start of a line. The Unicode line-break properties help applications decide whether a character should appear at the start or end of a line.


The following list gives examples of typical behaviours for some of the characters used in modern Sorani. Context may affect the behaviour of some of these and other characters.

  • “ ‘ (   should not be the last character on a line.
  • ” ’ ) . , ; ! ? %   should not begin a new line.

Breaking between Latin words

When a line break occurs in the middle of an embedded left-to-right sequence, the items in that sequence need to be rearranged visually so that it isn't necessary to read lines from bottom to top.

latin-line-breaks shows how two Latin words are apparently reordered in the flow of text to accommodate this rule. (The text is in Arabic.) Of course, the rearrangement is only that of the visual glyphs: nothing affects the order of the characters in memory.

Text with no line break in Latin text.

Text with line break in Latin text.

The lower of these two images shows the result of decreasing the line width, so that text wraps between a sequence of Latin words embedded in Arabic script text.

Baselines, line height, etc.

tbd

Sorani uses the 'alphabetic' baseline.

Page & book layout

Online resources

  1. Gov.krd, Kurdistan Regional Government
  2. Gav News
  3. Xebat News (Iraq)
  4. Kurdish Wikipedia
  5. Bible.is

References