
I've long thought programming languages need a "localizable string" (a.k.a. user-facing string) type, distinct from regular UTF-8 strings. Something like what gettext and other i18n libraries fake for you, but native to the language.

Behaviour like this is definitely a good reason why: sorting, changing case, etc. should be consistent when dealing with strings used as constants and identifiers, while Python's .lower() behaviour makes sense in a localizable-string context.
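For instance, Python's str.lower() applies Unicode's default case mappings regardless of locale (a quick demo; the comments note what a Turkish locale would want instead):

    print("I".lower())         # 'i'; a Turkish locale would want dotless 'ı' (U+0131)
    print("İ".lower())         # 'i' + U+0307 (combining dot above), per the default mapping
    print("İ".lower() == "i")  # False: the combining dot remains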



Along similar lines, I've thought that it would be useful if Unicode included language marks (i.e. codepoints to identify blocks of text as being written in a specific language). It would be strictly more useful than the barebones left-to-right/right-to-left marks (U+200E/U+200F) when deciding how to process and display text. And it would be a step towards correcting the mess that was Han unification.


See RFC 2482 — Language Tagging in Unicode Plain Text:

https://tools.ietf.org/html/rfc2482

But it was deprecated later on:

https://tools.ietf.org/html/rfc6082


Interesting. Unfortunate that the deprecation notice doesn't include much rationale. I found at least one mail thread about it[1], which seems to confirm that the main thought was that semantic information about text should be handled at a higher layer (e.g. XML). I can understand that argument for a general purpose tagging mechanism, but language and glyphs are strongly semantically linked.

(Somewhat ironically, the previous thread on that mailing list is about the struggles of case folding in a general fashion across multiple language scripts[2])

Edit: I also found [3], which offers the following:

----

- Most of the data sources used to assemble the documents on the Web will not contain these characters; producers, in the process of assembling or serializing the data, will need to introspect and insert the characters as needed—changing the data from the original source. Consumers must then deserialize and introspect the information using an identical agreement. The consumer has no way of knowing if the characters found in the data were inserted by the producer (and should be removed) or if the characters were part of the source data. Overzealous producers might introduce additional and unnecessary characters, for example adding an additional layer of bidi control codes to a string that would not otherwise require it. Equally, an overzealous consumer might remove characters that are needed by or intended for downstream processes.

- Another challenge is that many applications that use these data formats have limitations on content, such as length limits or character set restrictions. Inserting additional characters into the data may violate these externally applied requirements, and interfere with processing. In the worst case, portions (or all of) the data value itself might be rejected, corrupted, or lost as a result.

- Inserting additional characters changes the identity of the string. This may have important consequences in certain contexts.

- Inserting and removing characters from the string is not a common operation for most data serialization libraries. Any processing that adds language or direction controls would need to introspect the string to see if these are already present or might need to do other processing to insert or modify the contents of the string as part of serializing the data.

----

Other than #3 (the one about string identity), I find these wholly unpersuasive. And #3 isn't that great a reason either, considering that programmatic processors already have to deal with that issue due to case folding.

[1] https://www.unicode.org/mail-arch/unicode-ml/y2010-m11/0039....

[2] https://www.unicode.org/mail-arch/unicode-ml/y2010-m11/0038....

[3] https://www.w3.org/TR/string-meta/


What this gets right down to is that Unicode is a flawed idea: the meaning/behavior/whatever of characters is insanely dependent on their context.

The problem was never gazillions of code pages, but our inability to write C to deal with that amount of complexity circa 1990.

With modern machines, and good programming languages with good type systems, I absolutely think we could store a language per string, and concatenate into a polylinguistic rope if needed.

This would hopefully push us away from stringly-typed crap in general.


Unicode goes to great pains to avoid ascribing any meaning/behavior/whatever to characters. Because, to your point, you can’t. Unicode is actually incredibly well thought out. That’s why we have code units, code points and grapheme clusters. I don’t think the Unicode standard even defines casing except in the human-readable names ascribed to code points.

If you want to build a polylinguistic rope you can certainly do that with Unicode, but you won’t have solved anything because language alone without context doesn’t really define many of the operations you’re describing.

The answer is usually the same as “doctor it hurts when I...” — stop doing it. Stop manipulating user input without context. Stop trying to limit user visible strings by character count, use pixel width in the rendered font. And so on.


> I don’t think the Unicode standard even defines casing except in the human-readable names ascribed to code points

Sure it does; the Unicode Character Database includes fields for the lowercase, uppercase and titlecase mappings. But it also acknowledges that these are just default mappings, and may need to be tailored for specific languages/locales.


Unicode is well thought out! And that's what makes it hard to critique :). I think it's one of the best-maintained, well-thought out standards there is, but I still think the premise is wrong.

If all that good effort had gone into something along the lines I am describing, where languages, or at least scripts, cannot be arbitrarily mixed at the character level, I think we would have an even better result for the same level of effort.


If you treat Unicode as your backing representation (a pile of glyphs), you can build what you're asking for on top, right?


That is like saying I can take an untyped language and then add types. Sure you can! That said, it's much nicer (to me) to first define the typing rules (static semantics) and then define evaluation (dynamic semantics) only on well-typed programs. This avoids the need to include lots of annoying stuff in the domain. See my other comment, https://news.ycombinator.com/item?id=24180620, for an example of something I'd rather leave ill-typed.

That said, any "multicode" had better describe the interop with "unicode" in great detail, for practical reasons. Still, this is the "FFI", and one can be careful not to let it muddle things by, e.g., not allowing every unicode string to be imported without additional metadata.


I'm suggesting it's more like layering a programming language on top of assembly. The lower level is the universe of what you can do (in this case, the set of all glyphs) and the higher level is an imposition of specific constraints (in your case, which ones go together).


Languages need not be defined by how they compile. (If they are, we tend to call it "desugaring".) At the very least, they usually compile to multiple ISAs, and none is more definitive than the others.

I am happy to define how to translate Multicode to Unicode, but I wouldn't want any of the internal notions of Multicode to be defined in terms of that translation.


> the meaning/behavior/whatever of characters is insanely dependent on their context

I wish you would give an example instead of just proclaiming crapness. You know, so we n00bs can learn something.


Different languages have different rules for changing case (as seen here) or for transliterating to 7-bit ASCII: in French you can mostly drop accents if you need to; in German you need to transform an umlaut into an e following the vowel. Of course, many languages don't have a way to transliterate to 7-bit ASCII at all.
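A minimal Python sketch of that kind of language-dependent transliteration (the per-language rules here are deliberately simplified illustrations, not complete tables):

    import unicodedata

    # German umlauts gain a trailing e; ß becomes ss
    GERMAN_MAP = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue",
                                "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"})

    def to_ascii(text: str, lang: str) -> str:
        if lang == "de":
            text = text.translate(GERMAN_MAP)
        # For French (and anything left over): decompose, then drop combining marks
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    print(to_ascii("déjà vu", "fr"))          # deja vu
    print(to_ascii("Über Fußgänger", "de"))   # Ueber Fussgaenger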

Sorting of strings is language dependent, but I don't know that there's a defined order for mixed-language lists. The user's locale works if you're sorting for user purposes, but if you're sorting for machine purposes, you'd better not use the locale-aware sort without giving it a hardcoded locale that doesn't change between localization-library versions.
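For example, in Python (assuming the de_DE.UTF-8 locale is installed on the system):

    import locale

    words = ["zebra", "Äpfel", "apple"]
    # Machine-stable: plain codepoint order, no locale involved
    print(sorted(words))                       # ['apple', 'zebra', 'Äpfel']
    # User-facing: pin an explicit locale instead of inheriting the environment
    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")
    print(sorted(words, key=locale.strxfrm))   # German collation: Äpfel sorts with the As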


@toast0, @lazulicurio, both of your points seem to illustrate the complexities of the languages, not "...that Unicode is a flawed idea" as the original poster said. AFAICS this is intrinsic complexity showing itself, and it gives no indication of how things should be done correctly, or better.


> both of your points seem to illustrate the complexities of the languages, not "...that Unicode is a flawed idea"

The flaw in Unicode is that it punts on the intrinsic complexity by pretending that codepoints have language-independent, plain-text, semantic meaning.

A couple of threads that have molded my views over time:

I can't write my name in Unicode https://news.ycombinator.com/item?id=9219162 (Specifically these two comments https://news.ycombinator.com/item?id=9220530 and https://news.ycombinator.com/item?id=9220970)

Why isn't the external link symbol in Unicode? https://news.ycombinator.com/item?id=23016832


> The flaw in Unicode is that it punts on the intrinsic complexity by pretending that codepoints have language-independent, plain-text, semantic meaning.

> Pretending "plain text" isn't an oxymoron

FTFY :)


The benefit of looking at languages/scripts in isolation is that the combinatorial explosion of all languages/scripts at once is dodged.

E.g. lookalike characters, and social engineering by using a vs а (one is Cyrillic). I don't even want to define "a == а". I want Latin and Cyrillic to be different types of characters, and that expression to be ill-typed.

This solves the Turkish problem, where the upper-case I is two different characters in two different types (Turkish Roman script?), and the case-folding functions likewise have disjoint types.
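Both facts are already visible at the codepoint level:

    # Lookalikes: Latin 'a' (U+0061) and Cyrillic 'а' (U+0430) render identically
    print("a" == "\u0430")    # False: distinct codepoints
    # Turkish pairs the cases differently from other Latin-script locales:
    #   I (U+0049) <-> ı (U+0131)   and   İ (U+0130) <-> i (U+0069)
    for c in "I\u0131\u0130i":
        print(hex(ord(c)))    # 0x49, 0x131, 0x130, 0x69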


> I want Latin and Cyrillic to be different types of characters

How do you concatenate English and Ру́сская text, and what is the type of this sentence?


[Either [Latin] [Cyrillic]] is a very simple type, taking advantage of the fact that the language only switches at word boundaries.
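A toy Python rendering of that type (the Latin/Cyrillic classes are hypothetical; with mypy's --strict-equality, comparing them is flagged as a non-overlapping comparison, i.e. "ill-typed"):

    from dataclasses import dataclass
    from typing import Union

    @dataclass(frozen=True)
    class Latin:
        word: str

    @dataclass(frozen=True)
    class Cyrillic:
        word: str

    Word = Union[Latin, Cyrillic]  # the Either above
    Sentence = list[Word]          # language switches only at word boundaries

    s: Sentence = [Latin("English"), Cyrillic("Ру́сская")]
    # Latin("a") == Cyrillic("а") compares non-overlapping types;
    # mypy --strict-equality reports it rather than silently returning False.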


Huh. That doesn't quite address my objection (CamelCase like EnglishEtРу́сская still un-works), but that's actually a good point in the overwhelming majority of cases. I'm not quite convinced this approach works in practice (I'm sticking with "A"="A"="A"), but I'd definitely like to see a more technically fleshed-out design.


How about: case folding for the letter 'I' is dependent on whether the locale is Turkish or not.

;)


Unicode supported this with tag sequences, but those are deprecated and unlikely to work with modern libs.



.NET is one of the few ecosystems to get this right. It offers the invariant culture for identifier-like things, "fr" for the French language and "fr-FR" for French as used in France, allowing you to specify your intention to every string-modifying function.

Support at the type level would be a lot less verbose, but support at the function level is already much better than what many other popular languages offer.


It would be great if strings and especially date-time values always carried locale and timezone information with them.

It would take slightly more memory, but nothing significant on modern machines.
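Timezone-aware date-times already work this way in Python; a locale tag on strings would be the analogous extra field:

    from datetime import datetime
    from zoneinfo import ZoneInfo  # standard library since Python 3.9

    ts = datetime(2020, 8, 14, 9, 30, tzinfo=ZoneInfo("Europe/Istanbul"))
    print(ts.isoformat())  # 2020-08-14T09:30:00+03:00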


Putting the locale information on the string sounds like a good idea. However I'm not sure how that should handle combined strings with components from different locales. For example `logLevel + ": " + logMessage` might produce "info: bağlantı kesildi" in Turkish. How to annotate that? Neither English nor Turkish would work correctly, each would produce the wrong result when uppercasing.

You could treat it as a series of string slices with different locales: `[("info", "en"), (": ", ""), ("bağlantı kesildi", "tr")]`. That would work correctly, and you could now uppercase each slice according to its appropriate locale, but it wouldn't really be low overhead anymore. Maybe still worth it. It would be an interesting approach that might even be implementable fairly seamlessly as a library in some languages (C++ or Rust, for example).
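A minimal sketch of that slice idea (the Turkish special case is hardcoded purely for illustration; a real implementation would delegate to ICU's locale-aware case mappings, e.g. via PyICU):

    # (text, BCP 47 language tag); "" marks locale-neutral slices
    LocaleSlice = tuple[str, str]

    def upper_one(text: str, lang: str) -> str:
        if lang == "tr":  # toy stand-in for a proper locale-aware mapping
            text = text.replace("i", "İ").replace("ı", "I")
        return text.upper()

    def upper_rope(rope: list[LocaleSlice]) -> list[LocaleSlice]:
        return [(upper_one(text, lang), lang) for text, lang in rope]

    rope = [("info", "en"), (": ", ""), ("bağlantı kesildi", "tr")]
    print("".join(text for text, _ in upper_rope(rope)))  # INFO: BAĞLANTI KESİLDİ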


That just seems to be a parameter for locale-dependent functions. Very useful, but no, I'm talking about splitting the unicode-string datatype in two: "user-facing unicode string" vs "internal unicode string".

Example: logging.log("INFO", i"This is a localizable string")

In the i18n world, we could gather i-strings just like gettext does (where it looks like `logging.log("INFO", _("This is a localizable string"))`). The language could then have other useful hooks/behaviours for that datatype, and one of them would definitely be whether various methods have i18n behaviour enabled, versus using a C locale.
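A sketch of what that split could look like, with a plain function i() standing in for the hypothetical i"..." literal:

    class IString(str):
        """User-facing, localizable string; bare str stays internal-only."""

    def i(text: str) -> IString:
        # A gettext-style extractor could harvest these call sites for translators
        return IString(text)

    def log(level: str, message: IString) -> None:
        # A type checker can now reject log("INFO", "bare string"):
        # only strings routed through i() are accepted as user-facing
        print(f"{level}: {message}")

    log("INFO", i("This is a localizable string"))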


In Java, there is Locale.ROOT, which can be used in a similar way. In particular, it is useful when performing locale-dependent operations in locale-independent contexts (e.g. working with case-insensitive identifiers) where you don’t want the behavior of your code to depend on the current default locale.


That would be great! For example, in Python you currently have to do something like this

    import locale
    locale.setlocale(locale.LC_COLLATE, "")  # pick up the user's locale; easy to forget too
    sorted(list_of_strings, key=locale.strxfrm)
To sort using the current locale, which many people forget.



