Sunday, September 07, 2014

Graph Databases

Graph Databases are not new - sites like LinkedIn and Facebook are based on highly connected data which is not managed on traditional RDBMS (Relational Database Management System) infrastructure. Graph DB technology is being rapidly commoditised with platforms like Neo4J and OrientDB leading the way. I believe they will become a new defacto standard in developing all sorts of business and online applications once the inertia of 30+ years of RDBMS thinking is slowly broken down.

Often when I describe what a graph database is - the ideal way to store highly connected data - most people just shrug and say that was solved years ago with the RDBMS platforms - they have 'Relational' in the name after all, right there as the first letter of the acronym! This post is an attempt to explain what makes them a better choice for many applications.

Firstly let's take a look at an example. Say you have a permissioning service which manages permissions for various systems grouped by roles which in turn have a list of functions. In a relational model we may end up with a tables for 'System', 'Role', 'Function', and 'Person' with additional join tables for 'Role_Function' and 'Person_Role'. A typical query of this model would be to determine which functions Person 'A' has permissions to for Application 'X'. The most basic TSQL implementation would be something like

SELECT Function.Id, Function.Name
FROM Function 
INNER JOIN Role_Function ON Role_Function.FunctionId = Function.Id
INNER JOIN Role ON Role.Id = Role_Function.RoleId
INNER JOIN System ON System.Id = Role.SystemId
INNER JOIN Person_Role ON Person_Role.RoleId = Role.Id
INNER JOIN Person ON Person.Id = Person_Role.PersonId
WHERE Person.Name = 'A'
AND System.Name = 'X'

Of course if you add in some reasonable complexity like supporting the fact that some functions may imply permissions to other functions (to edit a record you need to be able to view or search for it). Or that you might have profiles linked to positions rather than people you end up with an explosion in the JOIN factory and the TSQL becomes many times more complicated.

In a graph world however each row in each table simply becomes a node in the graph. The person A node would have a 'IS_IN_ROLE' relationship to a bunch of Roles which would be linked to systems with a 'HAS_ROLE relationship and to functions with an 'INCLUDES' relationship. Functions could relate to each other in a hierarchy. You could add Profile nodes which a Person could hold which can have Roles of their own etc. E.g. a graph looking something like this:


With graph technology comes new querying languages / syntaxes. Neo4J provides a very elegant Cypher language which allows you to query the graph very succinctly. E.g. our complex non-performat TSQL statement might look like this in a graph world:

MATCH (:System {Name:"X"})-->(r:Role)-[*]->(f:Function)
,(p:Person {Name:"A")-[*]->(r)
RETURN DISTINCT f

This query would accommodate function hierarchies and any sorts of connections from a Person to a role (e.g. via positions). In fact whilst it's already simplier than the TSQL version above it is also more powerful, performant, and flexible to changes in the underlying model. We actually have this permissioning system in our organisation written with a RDBMS back end and with the additional complexities mentioned, the TSQL query to retrieve functions for a user takes up over 200 lines which in Cypher condenses down to the simple 3-line statement above.

So with the intro out of the way, what are the benefits?

  • One of the obvious benefits of graph DBs is the types of queries that are easily supported and often DO NOT require changing even for changes to the graph structure itself. Greatly speeding up development time.
  • Another benefit is performance. In the TSQL world there are many index lookups going on to find data in separate tables to JOIN on. In the graph world each node has direct references to its related nodes meaning that traversing the graph (given known starting points like the Person with Name "A" and system with name "X") is super fast as it only ever considers related nodes to see if they match the query. In fact, although indexes are supported in graph DBs they are generally only used to 'anchor' the query to fixed starting points in the graph not to find the data being retrieved.
  • Flexibility to requirements changes. In an agile development world (which is everywhere now really right?) Graph databases accommodate changes to requirements far more easily. The rise of ORMs was due largely to the impedance mismatch between Object Oriented development and the RDBMS data storage structure. Graph DBs remove this issue by allowing data to be stored in a way that more closely matches the code. In fact Graph DBs do not strictly have schemas (though this is somewhat dependent on the technology used) - there is nothing to prevent one node representing a Person having an 'Eye Colour' attribute and another node having a 'Height' attribute. Obviously for use in business applications you will expect some conformity but this is held and defined in code rather than in a separate DB schema as with RDBMS.
  • Deployment of changes is also simplified. Though there are gotchas to look out for with the lack of a schema driven model you are free to add and remove nodes and relationships dynamically meaning you could re-organise the structure of the graph in a live environment
And the drawbacks?
  • Most obvious is the lack of mainstream support. Graph technology is new and untrusted in both enterprise architect and development worlds. This will change over time as exposure increases.
  • The market has not yet stabilized meaning even the most prominent players have not yet settled on a standardized querying language or code base (e.g. Neo4J have recently deprecated their original APIs)
  • There are some applications where a 'good old' RDBMS is still more suitable. Any application with serious aggregation / number crunching requirements or where the structure of the information is very static, not highly related, nor subject to frequent change is probably still going to be developed using an RDBMS backend. Though, I'd hazard a guess that there are fewer and fewer of these systems left. 
  • Reporting requirements are also probably better suited to a properly structured reporting cube maintained separately to the graph. This is actually true of systems running on an RDBMS but since TSQL can aggregate data well often reporting and transactional requirements are supported by a single DB. If you are a reporting purist in someways this is another benefit of the Graph DB as it forces us to think about the reporting requirements separately to the transactional requirements of the system.
If you want to investigate Graph DBs some further reading and suggestions:
  • Neo4J - the self-proclaimed worlds leading Graph DB has a free community edition and a fantastic query language in Cypher. Beware of the 'Oracle' like license model for enterprise implementations though. The site includes some great intro information and links to graph DB examples, demos, and tutorials.
  • OrientDB is another great graph DB tool which also operates as a document DB. Its query language is based on SQL to make it 'more familiar to TSQL developers'. It also supports more of a controlled definition of node types with inheritance from higher order nodes. Like Neo4J there is a community edition available and licensing for enterprise is very reasonable.