Importing JMdict

Well, it seems Kotoba is a good deal closer to its baseline functionality as of this weekend: it can now import words from JMdict.

Importing using Ruby is both straightforward and circuitous.  It is straightforward in that the REXML API itself is easy to use.  However, the dearth of competent examples means that implementing a parser is strewn with misdirections, especially if you take endorsements at face value.  In particular, working out which parser to use, and predicting how each would perform, required me to write three different parsers over the weekend before I ended up with one that performed decently.  While Enterprise Ruby :: Parsing is a well-written article, it makes some implicit assumptions about file size that limit the utility of XMLLib as a parser for large files.  More pointedly, any time a tree/DOM parser is advocated for large files (10+ MB), readers should seriously question the credibility of the source.  The only appropriate solution in these situations is a stream parser, whether callback-based (the SAX1 or SAX2 APIs) or built on a similar listener pattern.
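To make the listener pattern concrete, here is a minimal sketch using REXML's stream API.  The EntryCounter class and the two-entry sample are made up for illustration (the sample is a heavily simplified stand-in for JMdict, not the real file); only REXML::StreamListener and REXML::Document.parse_stream come from REXML itself.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: counts <entry> elements as the parser streams past
# them, without ever building a tree of node objects.
class EntryCounter
  include REXML::StreamListener

  attr_reader :count

  def initialize
    @count = 0
  end

  # Called by the parser each time it reads an opening tag.
  def tag_start(name, _attrs)
    @count += 1 if name == 'entry'
  end
end

# Hand-written, simplified sample for illustration only.
SAMPLE_XML = <<~XML
  <JMdict>
    <entry><k_ele><keb>言葉</keb></k_ele><sense><gloss>word</gloss></sense></entry>
    <entry><k_ele><keb>辞書</keb></k_ele><sense><gloss>dictionary</gloss></sense></entry>
  </JMdict>
XML

counter = EntryCounter.new
REXML::Document.parse_stream(SAMPLE_XML, counter)
puts counter.count
```

The listener only overrides the callbacks it cares about; every other event falls through to the empty defaults that StreamListener provides.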

To wit, JMdict is itself a 47 MB XML file with some 150,000 Japanese-English entries.  Moreover, there are a number of nodes that are extraneous, at least from the perspective of Kotoba, and that need to be processed but ignored.  A tree parser has to create objects for every node, even if a large portion of the tree will be ignored.  Given that each Ruby object minimally requires 12 bytes, large files that are first loaded into run-time memory before being processed will significantly tax most modern machines.

Fortunately, apart from the cross-references and some trivial references to parts of speech and dialects, the majority of the information within each entry node is self-contained.  Consequently, DOM parsers are neither necessary nor realistic, given their memory requirements.  It took some searching to find a great resource on Ruby XML parsers.  I ultimately wrote a StreamListener and a StreamParser that work incrementally: we parse N entries, then persist those N entries.  This ensures that we never load our entire set of words (150,000) into memory before persisting to our database.
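The parse-N-then-persist idea can be sketched roughly as follows.  This is not Kotoba's actual code: the BatchingEntryListener class, the choice of captured elements (keb, reb, gloss), the persist block, and the three-entry sample are all assumptions made for illustration, and the "database" here is just an array.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects entries into a batch and hands each full
# batch to a caller-supplied block, so only N entries are held at a time.
class BatchingEntryListener
  include REXML::StreamListener

  # Elements we keep; everything else is parsed but ignored (an assumption).
  CAPTURED = %w[keb reb gloss].freeze

  def initialize(batch_size, &persist)
    @batch_size = batch_size
    @persist = persist
    @batch = []
    @entry = nil   # entry currently being read, or nil outside <entry>
    @field = nil   # captured element currently open, or nil
  end

  def tag_start(name, _attrs)
    if name == 'entry'
      @entry = Hash.new { |hash, key| hash[key] = [] }
    elsif @entry && CAPTURED.include?(name)
      @field = name
      @entry[@field] << +''   # start a fresh, mutable string for this value
    end
  end

  # Text may arrive in several chunks, so append rather than assign.
  def text(chars)
    @entry[@field].last << chars if @entry && @field
  end

  def tag_end(name)
    if name == 'entry'
      @batch << @entry
      @entry = nil
      flush if @batch.size >= @batch_size
    elsif name == @field
      @field = nil
    end
  end

  # Persist whatever is pending; call once more after parsing finishes.
  def flush
    return if @batch.empty?
    @persist.call(@batch)
    @batch = []
  end
end

# Hand-written, simplified sample for illustration only.
JMDICT_SAMPLE = <<~XML
  <JMdict>
    <entry><k_ele><keb>言葉</keb></k_ele><r_ele><reb>ことば</reb></r_ele><sense><gloss>word</gloss></sense></entry>
    <entry><k_ele><keb>辞書</keb></k_ele><r_ele><reb>じしょ</reb></r_ele><sense><gloss>dictionary</gloss></sense></entry>
    <entry><r_ele><reb>ひらがな</reb></r_ele><sense><gloss>hiragana</gloss></sense></entry>
  </JMdict>
XML

saved_batches = []   # stand-in for the database
listener = BatchingEntryListener.new(2) { |entries| saved_batches << entries }
REXML::Document.parse_stream(JMDICT_SAMPLE, listener)
listener.flush       # persist the final, partially filled batch
puts saved_batches.map(&:size).inspect
```

For the real file, an open File handle should be usable in place of the sample string, and the block would write to the database instead of appending to an array.  The final flush matters: without it, the last partial batch would be silently dropped.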

Author: Ward

