Monday, January 5, 2015

Undirected graphs in PostgreSQL

Introduction

The PostgreSQL manual, in its presentation of recursive queries, provides an example of how to write a recursive query to traverse a directed graph. Lorenzo Alberton also write a very nice post on Tech Portal on this subject. I've been wanting to get my head around how PostgreSQL's recursive queries could be used to unravel the general case of a linked list of nodes for a while, and finally got around to spending some time on it. I focused my attention on the case of an undirected graph. The approach I came up with is slightly different than Lorenzo's, in that I'm using two records to store each connection (or edge), with each of the two records sharing a common id to identify them as being the same connection. I don't know if this is better or worse, but it was fun to do, so I thought I'd share. Hopefully someone finds this helpful. At the very least, I'll find this helpful myself someday, because I'll forget what I was doing and need a reminder.


Table setup

Here are some simple tables to represent a set of nodes, and connections between them. For the purposes of illustrating my approach to maintaining and querying an undirected graph, they could be even simpler, but hopefully you'll excuse my idiomatic quirks, like always using UUID's for id values.

The connections table describes a link between two nodes. A trigger function, which we'll see below, ensures that this table contains a mirror pair of nodes. In other words, if node_a is 'x' and node_b is 'y', then there will also be a record where node_a is 'y' and node_b is 'x'. The id value of these two records will be the same. I thought I might make use of the id field when I constructed my recursive query, but in the end, I didn't use it.


What are we shooting for?

There are a lot of helper functions, views, triggers and so on to cover, but in the end, we only need to use a couple of functions to illustrate what we're trying to accomplish. The 'create_new_nodes' function will clear the nodes table and all connections, and will then repopulate the nodes table with the indicated number of nodes. The nodes are all given a simple name, like 'n0001', 'n0002', and so on. The 'create_random_graph' function populates the connections table. It's a bit clumsy, but it works. The way I'm grabbing random rows is inefficient, but large random graphs are so complicated that they would kill the recursive queries we're dealing with, so it doesn't really matter. The 'connection_node_names_v' view gives us a simple way to see what our connection table looks like.

And finally we get to the main point of this whole exercise - a query to expose the structure of our network. The 'paths_from_node_pretty' function is really just a simple wrapper around a recursive query that does all the hard work. What does it show us? All possible paths from a starting node to every other node that it can reach. The 'final' column tells us the name of a reachable node. The 'path_count' column tells us how many unique paths we can take to reach that node. The 'distance' column tells us the number of hops. The 'distance_count' function tells us how many paths have the same distance. And the 'path' column indicates the specific path. There will be a record for each path we can take to a reachable node.


Utility functions

I wrote a few functions to help me with my node table. The 'create_new_nodes' function, as you might expect, creates new nodes. The 'create_random_graph' function is not my proudest creation, but it serves its purpose, which is simply to create a somewhat random graph network. In the end, we end up with some partially overlapping (or not, depending on the parameters you use) completely interconnected sets. I'd like to spend more time considering different ways of building random networks. This was sufficient to test my recursive queries though. There's also a 'create_tree_graph' function you can use to explore tree structures, if you like.


The graph creation function use the following utility functions to hook things up. As you can see, these functions are based on a view, not the connections table. There is an INSTEAD OF trigger function, which we will see below, which inserts or deletes records from the underlying connections table. Why are the arguments TEXT instead of UUID you ask? Hmm. That's because if I expose these functions as a web service using the WSO2 Data Services Server (DSS), I need to pass TEXT arguments, because the DSS doesn't know what a UUID is. But's that's a topic for a different day.


Trigger functions

This is what I did to keep my connections table in order. Whenever a record is inserted or deleted by one of the utility functions above, 'connection_insert_t' takes over, and ensures that the connections table contains two records to represent a single connection - one record for each direction we can traverse the connection, if you want to think about it that way. It's really just a single connection though, so each of the two records share the same id value.

The 'no_self_connections_t' trigger function simply ensures that we don't hook a node up to itself. We could do that if we want, I suppose, but I didn't want to.


Recursive queries to enumerate paths

And now we finally get to the fun stuff! I wish I had a lot to say here, but to be honest, I haven't taken this much further than what you can find in the fine manual. The main difference is that I'm dealing with an undirected graph, which, because of the doubled connection entries, turns out to be pretty much the same exercise.

The 'all_paths' function will return all paths to all reachable nodes, for every node. Of course, the performance of these queries is quite sensitive to how big and complicated your network is. Just be aware, you don't need to create a very big or complicated network for this function to really bog down. Speaking of bogging down, I ran into some trouble running out of temp table space on some of my queries. Increasing the value of 'temp_buffers' in postgresql.conf seemed to help.

The 'paths_from_node' function is a parameterized version of the same thing. It tells all paths to all reachable nodes from a single given node. That's a lot less work than doing this calculation for every node, and may very well be what you really want to know in the first place. Why do extra work if you don't have to?

As we can see, 'paths_from_node_pretty' simply wraps 'paths_from_node' in a ribbon. I threw in a couple of window functions to count the number of unique paths to any particular final node, and to count the number of paths that had the same distance. I also convert the path array into text with a symbolic delimiter.


Final Thoughts

Hopefully you found this helpful, or were at least amused by my bad writing and terrible code. I think the most interesting thing I discovered when I started playing with this was how complicated the path structure of a network described by even a small set of connections can be. Notice that the output above is for a graph with few nodes and few connections. Small changes to the create_random_graph' function's parameters have profound consequences on how complicated the path structure is. I'd like to come up with some more general mechanisms for populating my connections table, in order to better understand the differences between different types of networks.