Looking into Automated Transcription and Transliteration for Aksara Jawa

Preliminary notes on attempting to transcribe and transliterate text written in Aksara Jawa

Over the last two days I have been looking into possibilities for using OCR on texts written using Aksara Jawa and subsequently transliterating them in an automated fashion. As I will not have time to work on this further in the next few days and am far from finished, I will note down preliminary results here.

OCR

I am certainly not the first to look into the problem. There have been increasing efforts to digitize old Javanese manuscripts, and papers have been published on different aspects of applying OCR to the manuscripts and then transliterating them (see Widiarti et al. 2013; Widiarti et al. 2014). Unfortunately, I have not been able to find the corresponding software.

During the course "Chinese Media Language", I had already used VietOCR (then still on Windows) to digitize the worksheets. VietOCR is based on tesseract, so I looked into using tesseract for transcribing Javanese script. Indeed, there is already a trained data set for running OCR on Javanese texts, but it is unfortunately limited to Javanese texts in Latin script. On the other hand, support for Indic and Arabic scripts has improved considerably in more recent versions, and tesseract can be trained for new languages, which means that using tesseract might be a good idea.

If tesseract is to be trained for Aksara Jawa, there are some requirements. First, fonts for Aksara Jawa need to be available. Second, a training text needs to be generated that uses all available characters in a more or less realistic way (for this, the data used by tesseract's OCR for Javanese in Latin script might be useful later).
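Whether a candidate training text really uses all available characters can be checked mechanically. The following is a minimal sketch, not part of any tesseract tooling: the function name missing_characters is made up for illustration, and it assumes that "all available characters" means every assigned code point in the Javanese Unicode block (U+A980 to U+A9DF).

```python
import unicodedata

# The Javanese Unicode block, U+A980 to U+A9DF inclusive.
JAVANESE_BLOCK = range(0xA980, 0xA9E0)


def missing_characters(training_text):
    """Return the assigned Javanese characters that never occur in the text.

    Hypothetical helper: code points without a Unicode name are treated
    as unassigned and skipped.
    """
    present = set(training_text)
    return [
        chr(cp)
        for cp in JAVANESE_BLOCK
        if unicodedata.name(chr(cp), "") and chr(cp) not in present
    ]
```

An empty result means every assigned character occurs at least once. It says nothing about whether the characters occur in realistic combinations.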

Playing Around With tesseract

While testing tesseract on the command line, I wrote a little script that captures an area of my screen and runs OCR on the captured image. As a little gimmick, it then runs the transcribed text through Google Translate. To get better results, the captured image is converted to grayscale and scaled up to 300% of its original size (that is, I borrowed some methods from OCRdesktop). The script can be found here.

An area with text is selected in the left window. The OCRed text and the suggested translation are returned in the window to the right.
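The capture, preprocessing and recognition steps can be sketched roughly as follows. This is not the script mentioned above: the use of ImageMagick's import and convert commands (ImageMagick 6 style names), the file names and the function names are all assumptions made for illustration. Only the grayscale conversion, the 300% scaling and the call to tesseract come from the description. The translation step is left out.

```python
import subprocess


def build_commands(raw="capture.png", prepared="prepared.png", lang="eng"):
    """Return the three command lines of the pipeline as argument lists.

    Assumed tools: ImageMagick ("import", "convert") and tesseract.
    """
    return [
        # Let the user select a screen region with the mouse and save it.
        ["import", raw],
        # Convert to grayscale and scale up to 300% of the original size.
        ["convert", raw, "-colorspace", "Gray", "-resize", "300%", prepared],
        # Run OCR and write the recognized text to standard output.
        ["tesseract", prepared, "stdout", "-l", lang],
    ]


def ocr_screen_area(lang="eng"):
    """Run the pipeline and return the recognized text."""
    capture, prepare, recognize = build_commands(lang=lang)
    subprocess.run(capture, check=True)
    subprocess.run(prepare, check=True)
    result = subprocess.run(
        recognize, check=True, capture_output=True, text=True
    )
    return result.stdout
```

The lang argument has to match a traineddata file installed for tesseract. Keeping the command construction separate from the execution makes it easy to inspect the commands without actually capturing the screen.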

Fonts

To display Aksara Jawa, an appropriate font needs to be installed. Since the script was only added to Unicode in 2009, there are two approaches to translating Aksara Jawa letters into a computer font:

I could find two fonts that fit into the latter group: Tuladha Jejeg and Carakan Unicode. According to Utami (2012) there is another one named adjikasa, which I have been unable to find; the link does not lead to the font anymore.

To test the fonts, I wrote a little script that prints all characters within a given Unicode range. For Aksara Jawa, that is the range from U+A980 to U+A9DF (Wikipedia).

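# Print every code point in the Javanese block (U+A980 to U+A9DF).
output = ''
for i in range(0x80, 0xE0):
    k = format(i, 'X')      # low byte as two hex digits, e.g. '80'
    code = 'A9' + k         # fixed prefix plus the changing low byte
    output = output + chr(int(code, 16)) + ' '
print(output)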

Unicode uses hexadecimal values to number characters (conventionally written with at least four digits). Since all code points of Aksara Jawa start with A9, this part can remain a fixed value, while the following two hex digits are varied in a loop.
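Because code points are plain integers, the same range can also be walked numerically, without assembling hex strings. The sketch below is an alternative to the loop above rather than the script I used. It additionally looks up each character's Unicode name, so unassigned positions in the block become visible. The variable name rows and the placeholder text for unassigned code points are my own choices.

```python
import unicodedata

rows = []
for cp in range(0xA980, 0xA9E0):  # U+A980 up to and including U+A9DF
    ch = chr(cp)
    # Code points without a name are reserved or unassigned.
    name = unicodedata.name(ch, "<unassigned>")
    rows.append(f"U+{cp:04X} {ch} {name}")

print("\n".join(rows))
```

Looking at the names next to the glyphs also makes it easier to spot which characters a font fails to render.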

Tuladha Jejeg has 99.44% coverage of the output; Carakan Unicode has 100% coverage.

Transliteration

There have already been some attempts to write programs for transliterating Javanese script to Latin and vice versa, the most notable of which are JawaTeX and Benny Lin's effort. JawaTeX transliterates based on LaTeX and originally returned PDF files (it was later expanded to also output HTML files (Utami 2012)). It is relatively well documented and able to transliterate from Latin script to Aksara Jawa, but I have not been able to test it myself (this would require logging in on the project's website), nor have I been able to find any code.

Benny Lin's Javanese transliteration program is written in JavaScript and did not initially work in my browser (maybe my settings are too strict). The code is, however, available on GitHub and licensed under Creative Commons, and it contains a nice list of corresponding characters and rules for transliterating. I therefore ported the transliteration script to Python. The ported script is available on GitHub.

The script works rather well, but does not add spaces between words and misses some characters. It is, I think, nevertheless good for a start. Division into words might be added later on, e.g. based on the wordlist found in tesseract's language data (linked above).

$ python3 transliterate-jav.py ꦲꦏ꧀ꦱꦫꦗꦮ  ꦕꦫꦏꦤ꧀  ꦲꦤꦕꦫꦏ
ak​sarajawa carakan​ anacaraka

The string "ꦲꦏ꧀ꦱꦫꦗꦮ" (Aksara Jawa) is given as a single argument and is thus not divided into separate words. "ꦲꦤꦕꦫꦏ" (Hanacaraka) is transliterated without the initial h.
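To illustrate both the table-driven approach and the missing word division, here is a minimal sketch. It is neither the ported script nor Benny Lin's code. The mapping covers only eight base letters plus the pangkon (the sign that removes a letter's inherent vowel), and it always reads ꦲ as "ha". The function names, the greedy longest-match strategy and the two-word wordlist are assumptions for illustration; a real wordlist would come from tesseract's language data.

```python
# Hypothetical, deliberately tiny table: eight base letters only.
CONSONANTS = {
    "\ua9b2": "h", "\ua9a4": "n", "\ua995": "c", "\ua9ab": "r",
    "\ua98f": "k", "\ua9b1": "s", "\ua997": "j", "\ua9ae": "w",
}
PANGKON = "\ua9c0"  # removes the inherent vowel of the preceding letter


def transliterate(text):
    """Map each known letter to consonant + inherent 'a'."""
    out = []
    inherent = False  # True if the last item is an inherent vowel
    for ch in text:
        if ch in CONSONANTS:
            out.extend([CONSONANTS[ch], "a"])
            inherent = True
        elif ch == PANGKON and inherent:
            out.pop()  # drop the inherent 'a'
            inherent = False
        else:
            out.append(ch)  # pass unknown characters through unchanged
            inherent = False
    return "".join(out)


def segment(text, wordlist):
    """Greedy longest-match split of a Latin string using a wordlist."""
    words = []
    longest = max(map(len, wordlist), default=1)
    i = 0
    while i < len(text):
        for length in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in wordlist:
                break
        else:
            candidate = text[i]  # no match: emit a single character
        words.append(candidate)
        i += len(candidate)
    return words


latin = transliterate("\ua995\ua9ab\ua98f\ua9a4\ua9c0\ua997\ua9ae")
print(latin)                                # carakanjawa
print(segment(latin, {"carakan", "jawa"}))  # ['carakan', 'jawa']
```

Greedy longest-match is the simplest possible strategy and can split wrongly where one word is a prefix of another, so a real implementation would probably need something more careful.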

References