Over the last two days I have been looking into possibilities for running OCR on texts written in Aksara Jawa and then transliterating them automatically. As I will not have time to work on this further in the next few days and am far from finished, I will note down preliminary results here.
I am certainly not the first to look into the problem. Efforts to digitize old Javanese manuscripts have been increasing, and papers have been published on the different steps needed to run OCR on such manuscripts and then transliterate them (see Widiarti et al. 2013; Widiarti et al. 2014). Unfortunately, I have not been able to find the corresponding software.
During the course "Chinese Media Language", I had already used VietOCR (then still on Windows) to digitize the worksheets. VietOCR is based on tesseract, so I looked into using tesseract for transcribing Javanese script. There is in fact already trained data for running OCR on Javanese texts, but it is unfortunately limited to Javanese written in Latin script. On the other hand, support for Indic and Arabic scripts has improved considerably in recent versions, and tesseract can be trained for new languages, so using tesseract might be a good idea.
If tesseract is to be trained for Aksara Jawa, there are some requirements. First, fonts need to be available for Aksara Jawa. Second, a training text needs to be generated that uses all available characters in a more or less realistic way (for this, the data used by tesseract's OCR for Javanese in Latin script might be useful later).
Playing Around With tesseract
Since I was testing tesseract on the command line, I wrote a little script that captures an area of my screen and then runs OCR on the captured image. As a little gimmick, it then sends the transcribed text through Google Translate. To get better results, the captured image is converted to grayscale and scaled up to 300% of its original size (I borrowed some of these methods from OCRdesktop). The script can be found here.
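The capture, preprocessing and OCR steps can be sketched roughly as follows. This is not the script mentioned above but a minimal illustration: it assumes that ImageMagick's `import` and `convert` tools and the `tesseract` binary are installed and on the path, the function names `build_commands` and `capture_and_ocr` are made up for this example, and the translation step is left out.

```python
import subprocess
import tempfile
from pathlib import Path


def build_commands(raw, prepared, lang="eng"):
    """Return the commands for capture, preprocessing and OCR (assumed tools)."""
    return [
        # ImageMagick's `import` lets the user drag a rectangle on the screen.
        ["import", str(raw)],
        # Convert to grayscale and enlarge to 300% of the original size.
        ["convert", str(raw), "-colorspace", "Gray", "-resize", "300%", str(prepared)],
        # Using `stdout` as output base makes tesseract print the recognized text.
        ["tesseract", str(prepared), "stdout", "-l", lang],
    ]


def capture_and_ocr(lang="eng"):
    """Capture a screen area, preprocess it and return the recognized text."""
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp) / "capture.png"
        prepared = Path(tmp) / "prepared.png"
        capture, preprocess, ocr = build_commands(raw, prepared, lang)
        subprocess.run(capture, check=True)
        subprocess.run(preprocess, check=True)
        result = subprocess.run(ocr, check=True, capture_output=True, text=True)
        return result.stdout.strip()
```

Calling `print(capture_and_ocr())` would wait for a screen selection and then print whatever tesseract recognizes. Keeping the command construction separate from the execution makes it easy to inspect or swap individual steps, for instance a different screenshot tool.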
To display Aksara Jawa, an appropriate font needs to be installed. Since the script was only added to Unicode in 2009, fonts take one of two approaches to representing Aksara Jawa letters:
- Older fonts replace existing letters with Javanese characters. Since Aksara Jawa works very differently from Latin script, this is a bit awkward, but it was the best option at a time when Aksara Jawa had not yet been added to Unicode.
- Some newer fonts make use of the inclusion of Aksara Jawa in Unicode. This gives much more accurate results (from the programmer's view) and is thus preferable.
I could find two fonts that fit into the latter group: Tuladha Jejeg and Carakan Unicode. According to Utami (2012) there is another one named adjikasa, which I have been unable to find; the link no longer leads to the font.
For testing the fonts, I wrote a little script that returns all characters within a given Unicode range. For Aksara Jawa, that is the range from U+A980 to U+A9DF (Wikipedia).
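A minimal sketch of such a script, using only Python's standard library, might look like this. The function name `assigned_characters` is made up for this example, and skipping unassigned code points is my own assumption about what is useful here, not necessarily what the original script does.

```python
import unicodedata

# Boundaries of the Javanese block as given above.
JAVANESE_START = 0xA980
JAVANESE_END = 0xA9DF


def assigned_characters(start, end):
    """Return every assigned character in the inclusive code point range."""
    chars = []
    for code_point in range(start, end + 1):
        char = chr(code_point)
        # Category "Cn" marks code points that have no character assigned.
        if unicodedata.category(char) != "Cn":
            chars.append(char)
    return chars


javanese = assigned_characters(JAVANESE_START, JAVANESE_END)
print(" ".join(javanese))
```

Pasting the printed characters into a document set in the font under test shows at a glance which glyphs are missing. Which code points count as assigned depends on the Unicode version shipped with the Python interpreter.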
Tuladha Jejeg covers 99.44% of the output, while Carakan Unicode covers 100%.
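How such a coverage figure can be computed is sketched below. This only illustrates the arithmetic: the function name `coverage_percent` and the example font are hypothetical, and the set of code points a real font supports would have to be read from the font itself, which is not shown here.

```python
def coverage_percent(required, supported):
    """Return the percentage of required code points that a font supports."""
    required = set(required)
    if not required:
        raise ValueError("required must not be empty")
    return 100.0 * len(required & set(supported)) / len(required)


# Code points of the Javanese block, U+A980 to U+A9DF inclusive.
block = range(0xA980, 0xA9DF + 1)

# Hypothetical font that lacks exactly one code point of the block.
hypothetical_font = set(block) - {0xA9CF}

print(f"{coverage_percent(block, hypothetical_font):.2f}%")
```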
There have already been some attempts to write programs for Javanese-to-Latin transliteration and vice versa, the most notable of which are JawaTeX and Benny Lin's effort. JawaTeX transliterates based on LaTeX and originally returned PDF files (it was later expanded to also output HTML files (Utami 2012)). It is relatively well documented and able to transliterate from Latin script to Aksara Jawa, but I have not been able to test it myself (this would require logging in on the project's website), nor have I been able to find any code.
The script works rather well, but does not add spaces between words and misses some characters. I think it is nevertheless good for a start. Splitting the output into words might be added later, e.g. based on the word list found in tesseract's language data (linked above).
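One simple way to split unspaced output into words with the help of a word list is greedy longest-match segmentation, sketched below. This is only an idea for the future step described above, not something the script already does. The function name `segment` and the tiny word list in the usage example are made up, and a real word list would have to be loaded from tesseract's language data.

```python
def segment(text, words):
    """Split text greedily, always taking the longest known word first."""
    vocabulary = set(words)
    longest = max((len(word) for word in vocabulary), default=1)
    pieces = []
    position = 0
    while position < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for length in range(min(longest, len(text) - position), 0, -1):
            candidate = text[position:position + length]
            if candidate in vocabulary:
                break
        else:
            # No known word starts here: fall back to a single character.
            candidate = text[position]
        pieces.append(candidate)
        position += len(candidate)
    return pieces


print(" ".join(segment("akuturu", ["aku", "turu"])))
```

Greedy matching is easy to implement but can pick the wrong split when one word is a prefix of another, so a more careful approach would probably be needed for real text.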