There are various ways in which the Asyntactic script may be (or appear to be) corrupted by an application or its environment. This post is an attempt to place the various symptoms of script corruption into a classification system.
R – Rendering
Asyntactic script characters are rendered:
1. as expected
2. no characters visible
3. unrecognisable characters, (mostly) all the same
  3.1 BMP characters (2 bytes in UTF-16) OK, higher-plane characters (4 bytes in UTF-16) no good
  3.2 all planes no good
4. unrecognisable characters, (mostly) all different. (sub-categories same as 3.)
5. recognisable but unexpected characters. (sub-categories same as 3.)
e.g. rendering in the two images in the previous post would be classified respectively as R3.2 and R1.
C – Code Points
Each Asyntactic script code point:
1. maintains integrity through input / storage / communication / output
2. gets translated into an incorrect code point
  2.1 each one into the same character
  2.2 each one into different characters in the BMP only
  2.3 each one into different characters in the BMP and other planes
e.g. corruption in the modified Spaz Twitter client sample would be classified as R3.1 C1. Although Spaz (AIR) did not render most characters correctly, the destination site, Twitter, demonstrates that data integrity was preserved in the Spaz client.
L – Length of character string
Strings encoded in the Asyntactic script are:
1. the expected length
2. shorter than expected
  2.1 approximately half of the length
  2.2 other
3. longer than expected
e.g. the Twitter web client and the standard Spaz client both use the JavaScript String.length property to count characters in a message. Because this property counts UTF-16 code units rather than code points, each higher-plane Unicode character is counted as two, so they both corrupt data by classification L2.1.
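The String.length behaviour described above can be illustrated with a minimal sketch. The helper name countCodePoints and the sample character (U+1D11E, an arbitrary higher-plane character, not necessarily one from the Asyntactic script) are assumptions made for illustration only; this is not the code used by either client.

```typescript
// Count Unicode code points rather than UTF-16 code units.
// A high surrogate (0xD800-0xDBFF) followed by a low surrogate
// (0xDC00-0xDFFF) together encode one higher-plane code point.
function countCodePoints(s: string): number {
  let count = 0;
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff && i + 1 < s.length) {
      const next = s.charCodeAt(i + 1);
      if (next >= 0xdc00 && next <= 0xdfff) {
        i++; // skip the low surrogate: the pair is a single code point
      }
    }
    count++;
  }
  return count;
}

// Three copies of U+1D11E, written as explicit surrogate pairs.
// The character is a hypothetical stand-in chosen for illustration.
const sample = "\uD834\uDD1E\uD834\uDD1E\uD834\uDD1E";

console.log(sample.length);           // 6 (UTF-16 code units)
console.log(countCodePoints(sample)); // 3 (code points)
```

A client that enforces a message limit using the first figure would accept only about half as many higher-plane characters as intended, which matches the "approximately half of the length" category.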
If you have the LastResort font installed, it ensures that every code point in the Unicode range has a representative glyph. The glyphs differ from one another in that each displays the code point’s numeric value. They also show the code point’s script (if known), whether it is valid or invalid, and more.
Font priority
Wherever an existing font already covers a particular code point, LastResort should leave it alone. That code point would be represented using the other font. Only when no other font is available for a code point does LastResort’s glyph get used. Hence the name.
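The priority rule above can be modelled roughly as follows. This is only a sketch of the idea, not how any particular OS implements font fallback; the Font type, the pickFont function, the font name ExampleLatin and the coverage ranges are all hypothetical.

```typescript
// Hypothetical model: each font declares the inclusive code point
// ranges it covers.
interface Font {
  name: string;
  ranges: Array<[number, number]>;
}

function covers(font: Font, codePoint: number): boolean {
  for (let i = 0; i < font.ranges.length; i++) {
    const lo = font.ranges[i][0];
    const hi = font.ranges[i][1];
    if (codePoint >= lo && codePoint <= hi) return true;
  }
  return false;
}

// Return the first ordinary font that covers the code point.
// The last-resort font is used only when no other font matches.
function pickFont(codePoint: number, fonts: Font[], lastResort: Font): Font {
  for (let i = 0; i < fonts.length; i++) {
    if (covers(fonts[i], codePoint)) return fonts[i];
  }
  return lastResort;
}

// Assumed example data, for illustration only.
const installed: Font[] = [{ name: "ExampleLatin", ranges: [[0x0000, 0x024f]] }];
const lastResort: Font = { name: "LastResort", ranges: [[0x0000, 0x10ffff]] };

console.log(pickFont(0x41, installed, lastResort).name);    // ExampleLatin
console.log(pickFont(0x1d11e, installed, lastResort).name); // LastResort
```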
Examples
Here are a couple of screenshots of an earlier post demonstrating the Base16b Plugin, with and without LastResort installed.
LastResort not installed
LastResort installed
Note the empty boxes, which are how this particular OS (Windows XP Japanese) happens to display code points for which no suitable font is installed.
Asyntactic script
The Asyntactic script, used by Base16b for encoding, is in a code point range little used by other mainstream Unicode scripts. Therefore it is likely that, without LastResort or a similar font installed, Base16b encoded characters would not be represented coherently on the screen.
Encoding functionality
Whether or not a suitable font is installed should in no way affect the functionality of the Base16b encoding. The font is merely a visual convenience.
If you have any example to the contrary, please let us know.
This blog post is an alternative public forum for suggesting changes to the spec. We welcome suggestions, improvements and clarifications, as well as reports of typos and errors. Please feel free to comment below.
The previous post showed base64 encoding of part of a WordPress post. Base64 encoded 148 characters (a few of which were due to htmlentity expansion).
Now let’s do the same thing using a version of the plugin modified for Base16b encoding.
After hitting Encode, if you see only empty boxes, you might want to install a font such as LastResort.
Note: it is the image tag, comprising a URL of 84 characters, that is encoded, not the actual binary image. See the next post for the encoded img tag.
Note2: the original text is additionally altered before it is encoded, through htmlentity substitution. e.g. a double quote ” is replaced with & quot; (written without the space), and it is this entity that gets base64 encoded.
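To give a feel for how entity substitution inflates the encoded result, here is a minimal sketch. It assumes that only double quotes are substituted and that the text is plain ASCII (one byte per character); the function names and the sample string are hypothetical, and the plugin’s actual substitution routine may cover more characters.

```typescript
// Replace each double quote with its HTML entity, as a stand-in for
// the plugin's htmlentity substitution (assumed, simplified).
function escapeDoubleQuotes(text: string): string {
  return text.replace(/"/g, "&quot;");
}

// Length of the padded base64 output for a given number of input bytes:
// every group of up to 3 bytes becomes 4 characters.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

// Hypothetical ASCII fragment of a tag attribute.
const original = 'alt="x"';
const escaped = escapeDoubleQuotes(original);

console.log(original.length, base64Length(original.length)); // 7 12
console.log(escaped.length, base64Length(escaped.length));   // 17 24
```

In this made-up fragment, two quotes add ten characters before encoding and double the base64 output, which is why a handful of substitutions can noticeably change the encoded length.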
The base16b.org web site was started both as a means to an end and as an end in itself.
I have a particular use case in mind for a binary encoding method: structured binary data, from 0.1Kib to over 100Kib, communicated over web transport layers, for semantic tagging. That data format is itself new and I hope to announce it here some time soon.
In the meantime, base16b seems like it could be a good generic solution to the problem of more efficient, standards-compliant binary data encoding in Unicode. So I decided to spin it off as a separate project. Should this encoding method gain general acceptance, it seems likely that others will step in to contribute improvements to the design. Eventually we can move on to standardisation.
This eponymous web site and its blog are intended to become a focus of discussion for interested developers.