These are notes culled from various places. There may well be
some copy-pasting involved, but I did it long enough ago that I no
longer remember all the sources. But these are notes, it’s not an
article.
Case conversions are not always possible in Unicode by applying an
offset to a codepoint, although this can work for the ASCII range by
adding 32, or by adding 1 for many other characters in the Latin
extensions. There are many cases where the corresponding cased character
is in another block, or in an irregularly offset location.
In addition, there are linguistic issues that mean that simple
mappings of one character to another are not sufficient for case
conversion.
In German, the uppercase of ß is SS. German and Greek cannot,
however, be easily transformed from upper to lower case: German because
SS could be converted either to ß or ss, depending on the word; Greek
because all tonos marks are omitted in upper case, eg. does ΑΘΗΝΑ
convert to Αθηνά (the goddess) or Αθήνα (capital of Greece)? German may
also uppercase ß to ẞ sometimes for things like signboards.
Also Greek converts uppercase sigma to either a final or non-final
form, depending on the position in a word, eg. ΟΔΥΣΣΕΥΣ becomes
οδυσσευς. This contextual difference is easy to manage, however,
compared to the lexical issues in the previous paragraph.
In Serbo-Croatian there is an important distinction between uppercase
and titlecase. The single letter dž converts to DŽ when the whole word is
uppercased, but Dž when titlecased. Both of these forms revert to dž in
lowercase, so there is no ambiguity here.
In Dutch, the titlecase of ijsvogel is IJsvogel, ie. which commonly
means that the first two letters have to be titlecased. There is a
single character IJ (U+0132 LATIN CAPITAL LIGATURE IJ)
in Unicode that will behave as expected, but this single character is
very often not available on a keyboard, and so the word is commonly
written with the two letters I+J.
In Greek, tonos diacritics are dropped during uppercasing, but not
dialytika. Greek diphthongs with tonos over the first vowel are
converted during uppercasing to no tonos but a dialytika over the
second vowel in the diphthong, eg. Νεράιδα becomes ΝΕΡΑΪΔΑ. A letter
with both tonos and dialytika above drops the tonos but keeps the
dialytika, eg. ευφυΐα becomes ΕΥΦΥΪΑ. Also, contrary to the initial rule
mentioned here, Greek does not drop the tonos on the disjunctive eta
(usually meaning ‘or’), eg. ήσουν ή εγώ ή εσύ becomes ΗΣΟΥΝ Ή ΕΓΩ Ή ΕΣΥ
(note that the initial eta is not disjunctive, and so does drop the
tonos). This is to maintain the distinction between ‘either/or’ ή from
the η feminine form of the article, in the nominative case, singular
number.
Greek titlecased vowels, ie. a vowel at the start of a word that is uppercased, retains its tonos accent, eg. Όμηρος.
Turkish, Azeri, Tatar and Bashkir pair dotted and undotted i’s, which
requires special handling for case conversion, that is
language-specific. For example, the name of the second largest city in
Turkey is “Diyarbakır”, which contains both the dotted and dotless
letters i. When rendered into upper case, this word appears like this:
DİYARBAKIR.
Lithuanian also has language-specific rules that retain the dot over i
when combined with accents, eg. i̇̀ i̇́ i̇̃, whereas the capital I has
no dot.
Sometimes European French omits accents from uppercase letters,
whereas French Canadian typically does not. However, this is more of a
stylistic than a linguistic rule. Sometimes French people uppercase œ to
OE, but this is mostly due to issues with lack of keyboard support, it
seems (as is the issue with French accents).
Capitalisation may ignore leading symbols and punctuation for a word,
and titlecase the first casing letter. This applies not only to
non-letters. A letter such as the (non-casing version of the) glottal
stop, ʔ, may be ignored at the start of a word, and the following letter
titlecased, in IPA or Americanist phonetic transcriptions. (Note that,
to avoid confusion, there are separate case paired characters available
for use in orthographies such as Chipewyan, Dogrib and Slavey. These are
Ɂ and ɂ.)
Another issue for titlecasing is that not all words in a sequence are
necessarily titlecased. German uses capital letters to start noun
words, but not verbs or adjectives. French and Italian may expect to
titlecase the ‘A’ in “L’Action”, since that is the start of a word. In
English, it is common not to titlecase words like ‘for’, ‘of’, ‘the’ and
so forth in titles.
Unicode provides only algorithms for generic case conversion and case
folding. CLDR provides some more detail, though it is hard to
programmatically achieve all the requirements for case conversion.
Case folding is a way of converting to a standard sequence of
(lowercase) characters that can be used for comparisons of strings.
(Note that this sequence may not represent normal lowercase text: for
example, both the uppercase Greek sigma and lowercase final sigma are
converted to a normal sigma, and the German ß is converted to ‘ss’.)
There are also different flavours of case folding available: common,
full, and simple.
November 7th, 2016 at 4:43 am
Isn’t Adlam bicameral?
November 7th, 2016 at 9:30 am
Yes, indeed. Hmmm. I used http://unicode.org/cldr/utility/list-unicodeset.jsp?a=%5B:General_Category=Uppercase_Letter:%5D to find the cased letters, but it seems that that CLDR table doesn’t know about Adlam or Osage (which is also bicameral) :(. I searched using UniView instead, and added Adlam and Osage to the list, plus a note about Mathematical Alphanumeric Symbols. Thanks, Amir.