Storing a Directed Graph in SQL

I recently ran into the problem of having to store a directed graph in MySQL. Actually, I was trying to efficiently store a thesaurus in SQL so that it both used little space and had high performing lookups. A thesaurus, as you probably know, could best be represented as a graph of some type; perhaps as a tree (in a naive implementation), or a directed acyclic graph, or even a directed multigraph.

SQL, unfortunately, does not readily lend itself to these type of data structures at first glance. With some thought, however, one can readily store a tree or any other graph in SQL. Consider the following example:


This quickly becomes much more complex as we want to attach definitions to words, and some words in the thesaurus may be the same ‘word’, but a differing class of speech, such as a noun, verb, etc.

How to store this in SQL? One possible solution is to build a node list and an adjacency list:

The node list:

ID Word Definition …
1392 dork nerd; stupid person …
3729 geek odd person; computer expert
7318 nerd geek, but lacking in social grace …
8504 tech computer guru …

The adjacency list:

src dst
1392 7318
3729 1392
3729 7318
3729 8504
7318 1392
7318 3729
7318 8504

Relatively fast and efficient lookups can now be done using SQL JOIN. While I’m sure this isn’t the most optimal structure for performance, it has to be close to the most efficient(*) means of storing these cross references.

I’ve set up a live demonstration of the thesaurus right here:

It contains 83,318 words and phrases, and 1,112,705 cross references for synonyms. Most uncached lookups take place on the order of less than a 1/10th of a second on this busy shared hosting server. Cached queries are virtually instantaneous. And of course, MySQL is not my area of expertise; I would imagine that this could be made even faster.

(*) The context of this problem was specifically storing a directed graph in SQL, not a thesaurus in SQL, nor a directed graph in general. If you’re looking for an optimal structure for a thesaurus or dictionary, you should look into a DAWG like this one: How to Create an Optimal DAWG