This is the multi-page printable view of this section. Click here to print.

Return to the regular view of this page.

RedisGraph: A High Performance In-Memory Graph Database

1: Technical specification for writing RedisGraph client libraries
2: RedisGraph Result Set Structure
3: Implementation details for the GRAPH.BULK endpoint

Abstract

Graph-based data is everywhere nowadays. Facebook, Google, Twitter and Pinterest are just a few who've realize the power behind relationship data and are utilizing it to its fullest. As a direct result, we see a rise both in interest for and the variety of graph data solutions.

With the introduction of Redis Modules we've recognized the great potential of introducing a graph data structure to the Redis arsenal, and developed RedisGraph. Bringing new graph capabilities to Redis through a native C implementation with an emphasis on performance, RedisGraph is now available as an open source project.

In this documentation, we'll discuss the internal design and features of RedisGraph and demonstrate its current capabilities.

RedisGraph At-a-Glance

RedisGraph is a graph database developed from scratch on top of Redis, using the new Redis Modules API to extend Redis with new commands and capabilities. Its main features include:

Simple, fast indexing and querying
Data stored in RAM using memory-efficient custom data structures
On-disk persistence
Tabular result sets
Uses the popular graph query language openCypher

A Little Taste: RedisGraph in Action

Let’s look at some of the key concepts of RedisGraph using this example over the redis-cli tool:

Constructing a graph

It is common to represent entities as nodes within a graph. In this example, we'll create a small graph with both actors and movies as its entities, and an "act" relation that will connect actors to the movies they acted in. We'll use the graph.QUERY command to issue a CREATE query, which will introduce new entities and relations to our graph.

graph.QUERY <graph_id> 'CREATE (:<label> {<attribute_name>:<attribute_value>,...})'

graph.QUERY <graph_id> 'CREATE (<source_node_alias>)-[<relation> {<attribute_name>:<attribute_value>,...}]->(<dest_node_alias>)'

We'll construct our graph in one command:

graph.QUERY IMDB 'CREATE (aldis:actor {name: "Aldis Hodge", birth_year: 1986}),
                         (oshea:actor {name: "OShea Jackson", birth_year: 1991}),
                         (corey:actor {name: "Corey Hawkins", birth_year: 1988}),
                         (neil:actor {name: "Neil Brown", birth_year: 1980}),
                         (compton:movie {title: "Straight Outta Compton", genre: "Biography", votes: 127258, rating: 7.9, year: 2015}),
                         (neveregoback:movie {title: "Never Go Back", genre: "Action", votes: 15821, rating: 6.4, year: 2016}),
                         (aldis)-[:act]->(neveregoback),
                         (aldis)-[:act]->(compton),
                         (oshea)-[:act]->(compton),
                         (corey)-[:act]->(compton),
                         (neil)-[:act]->(compton)'

Querying the graph

RedisGraph exposes a subset of the openCypher graph language. Although only some language capabilities are supported, there's enough functionality to extract valuable insights from your graphs. To execute a query, we'll use the GRAPH.QUERY command:

GRAPH.QUERY <graph_id> <query>

Let's execute a number of queries against our movies graph.

Find the sum, max, min and avg age of the 'Straight Outta Compton' cast:

GRAPH.QUERY IMDB 'MATCH (a:actor)-[:act]->(m:movie {title:"Straight Outta Compton"})
RETURN m.title, SUM(2020-a.birth_year), MAX(2020-a.birth_year), MIN(2020-a.birth_year), AVG(2020-a.birth_year)'

RedisGraph will reply with:

1) 1) "m.title"
   2) "SUM(2020-a.birth_year)"
   3) "MAX(2020-a.birth_year)"
   4) "MIN(2020-a.birth_year)"
   5) "AVG(2020-a.birth_year)"
2) 1) 1) "Straight Outta Compton"
      2) "135"
      3) (integer) 40
      4) (integer) 29
      5) "33.75"

The first row is our result-set header which names each column according to the return clause.
The second row contains our query result.

Let's try another query. This time, we'll find out how many movies each actor played in.

GRAPH.QUERY IMDB "MATCH (actor)-[:act]->(movie) RETURN actor.name, COUNT(movie.title) AS movies_count ORDER BY
movies_count DESC"

1) "actor.name, movies_count"
2) "Aldis Hodge,2.000000"
3) "O'Shea Jackson,1.000000"
4) "Corey Hawkins,1.000000"
5) "Neil Brown,1.000000"

The Theory: Ideas behind RedisGraph

Representation

RedisGraph uses sparse adjacency matrices to represent graphs. As directed relationship connecting source node S to destination node T is recorded within an adjacency matrix M, by setting M's S,T entry to 1 (M[S,T]=1). As a rule of thumb, matrix rows represent source nodes while matrix columns represent destination nodes.

Every graph stored within RedisGraph has at least one matrix, referred to as THE adjacency matrix (relation-type agnostic). In addition, every relation with a type has its own dedicated matrix. Consider a graph with two relationships types:

visits
friend

The underlying graph data structure maintains three matrices:

THE adjacency matrix - marking every connection within the graph
visit matrix - marking visit connections
friend matrix - marking friend connections

A 'visit' relationship E that connects node A to node B, sets THE adjacency matrix at position [A,B] to 1. Also, the visit matrix V sets position V[A,B] to 1.

To accommodate typed nodes, one additional matrix is allocated per label, and a label matrix is symmetric with ones along the main diagonal. Assume that node N was labeled as a Person, then the Person matrix P sets position P[N,N] to 1.

This design lets RedisGraph modify its graph easily, including:

Adding new nodes simply extends matrices, adding additional rows and columns
Adding new relationships by setting the relevant entries at the relevant matrices
Removing relationships clears relevant entries
Deleting nodes by deleting matrix row/column.

One of the main reasons we chose to represent our graphs as sparse matrices is graph traversal.

Traversal

Graph traversal is done by matrix multiplication. For example, if we wanted to find friends of friends for every node in the graph, this traversal can be expressed by FOF = F^2. F stands for the friendship relation matrix, and the result matrix FOF summarizes the traversal. The rows of the FOF represent source nodes and its columns represent friends who are two hops away: if FOF[i,j] = 1 then j is a friend of a friend of i.

Algebraic expression

When a search pattern such as (N0)-[A]->(N1)-[B]->(N2)<-[A]-(N3) is used as part of a query, we translate it into a set of matrix multiplications. For the given example, one possible expression would be: A * B * Transpose(A).

Note that matrix multiplication is an associative and distributive operation. This gives us the freedom to choose which terms we want to multiply first (preferring terms that will produce highly sparse intermediate matrices). It also enables concurrency when evaluating an expression, e.g. we could compute AZ and BY in parallel.

GraphBLAS

To perform all of these operations for sparse matrices, RedisGraph uses GraphBLAS - a standard API similar to BLAS. The current implementation uses the CSC sparse matrix format (compressed sparse columns), although the underlying format is subject to change.

Query language: openCypher

There are a number of graph query languages, so we didn't want to reinvent the wheel. We decided to implement a subset of one of the most popular graph query languages out there: openCypher While the openCypher project provides a parser for the language, we decided to create our own parser. We used Lex as a tokenizer and Lemon to generate a C target parser.

As mentioned earlier, only a subset of the language is currently supported, but we plan to continue adding new capabilities and extend RedisGraph's openCypher capabilities.

Runtime: query execution

Let's review the steps RedisGraph takes when executing a query.

Consider this query that finds all actors who played alongside Aldis Hodge and are over 30 years old:

MATCH (aldis::actor {name:"Aldis Hodge"})-[:act]->(m:movie)<-[:act]-(a:actor) WHERE a.age > 30 RETURN m.title, a.name

RedisGraph will:

Parse the query, and build an abstract syntax tree (AST)
Compose traversal algebraic expressions
Build filter trees
Construct an optimized query execution plan composed of:
- Filtered traverse
- Conditional traverse
- Filter
- Project
Execute the plan
Populate a result set with matching entity attributes

Filter tree

A query can filter out entities by creating predicates. In our example, we filter actors younger then 30.

It's possible to combine predicates using OR and AND keywords to form granular conditions.

During runtime, the WHERE clause is used to construct a filter tree, and each node within the tree is either a condition (e.g. A > B) or an operation (AND/OR). When finding candidate entities, they are passed through the tree and evaluated.

Benchmarks

Depending on the underlying hardware results may vary. However, inserting a new relationship is done in O(1). RedisGraph is able to create over 1 million nodes under half a second and form 500K relations within 0.3 of a second.

License

RedisGraph is published under Redis Source Available License Agreement.

Conclusion

Although RedisGraph is still a young project, it can be an alternative to other graph databases. With its subset of operations, you can use it to analyze and explore graph data. Being a Redis Module, this project is accessible from every Redis client without adjustments. It's our intention to keep on improving and extending RedisGraph with the help of the open source community.

1 - Technical specification for writing RedisGraph client libraries

By design, there is not a full standard for RedisGraph clients to adhere to. Areas such as pretty-print formatting, query validation, and transactional and multithreaded capabilities have no canonically correct behavior, and the implementer is free to choose the approach and complexity that suits them best. RedisGraph does, however, provide a compact result set format for clients that minimizes the amount of redundant data transmitted from the server. Implementers are encouraged to take advantage of this format, as it provides better performance and removes ambiguity from decoding certain data. This approach requires clients to be capable of issuing procedure calls to the server and performing a small amount of client-side caching.

Retrieving the compact result set

Appending the flag --compact to any query issued to the GRAPH.QUERY endpoint will cause the server to issue results in the compact format. Because we don't store connection-specific configurations, all queries should be issued with this flag.

GRAPH.QUERY demo "MATCH (a) RETURN a" --compact

Formatting differences in the compact result set

The result set has the same overall structure as described in the Result Set documentation.

Certain values are emitted as integer IDs rather than strings:

Node labels
Relationship types
Property keys

Instructions on how to efficiently convert these IDs in the Procedure Calls section below.

Additionally, two enums are exposed:

ColumnType, which as of RedisGraph v2.1.0 will always be COLUMN_SCALAR. This enum is retained for backwards compatibility, and may be ignored by the client unless RedisGraph versions older than v2.1.0 must be supported.

ValueType indicates the data type (such as Node, integer, or string) of each returned value. Each value is emitted as a 2-array, with this enum in the first position and the actual value in the second. Each property on a graph entity also has a scalar as its value, so this construction is nested in each value of the properties array when a column contains a node or relationship.

Decoding the result set

Given the graph created by the query:

GRAPH.QUERY demo "CREATE (:plant {name: 'Tree'})-[:GROWS {season: 'Autumn'}]->(:fruit {name: 'Apple'})"

Let's formulate a query that returns 3 columns: nodes, relationships, and scalars, in that order.

Verbose (default):

127.0.0.1:6379> GRAPH.QUERY demo "MATCH (a)-[e]->(b) RETURN a, e, b.name"
1) 1) "a"
   2) "e"
   3) "b.name"
2) 1) 1) 1) 1) "id"
            2) (integer) 0
         2) 1) "labels"
            2) 1) "plant"
         3) 1) "properties"
            2) 1) 1) "name"
                  2) "Tree"
      2) 1) 1) "id"
            2) (integer) 0
         2) 1) "type"
            2) "GROWS"
         3) 1) "src_node"
            2) (integer) 0
         4) 1) "dest_node"
            2) (integer) 1
         5) 1) "properties"
            2) 1) 1) "season"
                  2) "Autumn"
      3) "Apple"
3) 1) "Query internal execution time: 1.326905 milliseconds"

Compact:

127.0.0.1:6379> GRAPH.QUERY demo "MATCH (a)-[e]->(b) RETURN a, e, b.name" --compact
1) 1) 1) (integer) 1
      2) "a"
   2) 1) (integer) 1
      2) "e"
   3) 1) (integer) 1
      2) "b.name"
2) 1) 1) 1) (integer) 8
         2) 1) (integer) 0
            2) 1) (integer) 0
            3) 1) 1) (integer) 0
                  2) (integer) 2
                  3) "Tree"
      2) 1) (integer) 7
         2) 1) (integer) 0
            2) (integer) 0
            3) (integer) 0
            4) (integer) 1
            5) 1) 1) (integer) 1
                  2) (integer) 2
                  3) "Autumn"
      3) 1) (integer) 2
         2) "Apple"
3) 1) "Query internal execution time: 1.085412 milliseconds"

These results are being parsed by redis-cli, which adds such visual cues as array indexing and indentation, as well as type hints like (integer). The actual data transmitted is formatted using the RESP protocol. All of the current RedisGraph clients rely upon a stable Redis client in the same language (such as redis-rb for Ruby) which handles RESP decoding.

Top-level array results

The result set above had 3 members in its top-level array:

1) Header row
2) Result rows
3) Query statistics

All queries that have a RETURN clause will have these 3 members. Queries that don't return results have only one member in the outermost array, the query statistics:

127.0.0.1:6379> GRAPH.QUERY demo "CREATE (:plant {name: 'Tree'})-[:GROWS {season: 'Autumn'}]->(:fruit {name: 'Apple'})" --compact
1) 1) "Labels added: 2"
   2) "Nodes created: 2"
   3) "Properties set: 3"
   4) "Relationships created: 1"
   5) "Query internal execution time: 1.972868 milliseconds"

Rather than introspecting on the query being emitted, the client implementation can check whether this array contains 1 or 3 elements to choose how to format data.

Reading the header row

Our sample query MATCH (a)-[e]->(b) RETURN a, e, b.name generated the header:

1) 1) (integer) 1
   2) "a"
3) 1) (integer) 1
   3) "e"
4) 1) (integer) 1
   3) "b.name"

The 4 array members correspond, in order, to the 3 entities described in the RETURN clause.

Each is emitted as a 2-array:

1) ColumnType (enum)
2) column name (string)

The first element is the ColumnType enum, which as of RedisGraph v2.1.0 will always be COLUMN_SCALAR. This element is retained for backwards compatibility, and may be ignored by the client unless RedisGraph versions older than v2.1.0 must be supported.

Reading result rows

The entity representations in this section will closely resemble those found in Result Set Graph Entities.

Our query produced one row of results with 3 columns (as described by the header):

1) 1) 1) (integer) 8
      2) 1) (integer) 0
         2) 1) (integer) 0
         3) 1) 1) (integer) 0
               2) (integer) 2
               3) "Tree"
   2) 1) (integer) 7
      2) 1) (integer) 0
         2) (integer) 0
         3) (integer) 0
         4) (integer) 1
         5) 1) 1) (integer) 1
               2) (integer) 2
               3) "Autumn"
   3) 1) (integer) 2
      2) "Apple"

Each element is emitted as a 2-array - [ValueType, value].

It is the client's responsibility to store the ValueType enum. RedisGraph guarantees that this enum may be extended in the future, but the existing values will not be altered.

The ValueType for the first entry is VALUE_NODE. The node representation contains 3 top-level elements:

The node's internal ID.
An array of all label IDs associated with the node (currently, each node can have either 0 or 1 labels, though this restriction may be lifted in the future).
An array of all properties the node contains. Properties are represented as 3-arrays - [property key ID, ValueType, value].

[	
    Node ID (integer),
    [label ID (integer) X label count]
    [[property key ID (integer), ValueType (enum), value (scalar)] X property count]
]

The ValueType for the first entry is VALUE_EDGE. The edge representation differs from the node representation in two respects:

Each relation has exactly one type, rather than the 0+ labels a node may have.
A relation is emitted with the IDs of its source and destination nodes.

As such, the complete representation is as follows:

The relation's internal ID.
The relationship type ID.
The source node's internal ID.
The destination node's internal ID.
The key-value pairs of all properties the relation possesses.

[	
    Relation ID (integer),
    type ID (integer),
    source node ID (integer),
    destination node ID (integer),
    [[property key ID (integer), ValueType (enum), value (scalar)] X property count]
]

The ValueType for the third entry is VALUE_STRING, and the other element in the array is the actual value, "Apple".

Reading statistics

The final top-level member of the GRAPH.QUERY reply is the execution statistics. This element is identical between the compact and standard response formats.

The statistics always include query execution time, while any combination of the other elements may be included depending on how the graph was modified.

"Labels added: (integer)"
"Nodes created: (integer)"
"Properties set: (integer)"
"Nodes deleted: (integer)"
"Relationships deleted: (integer)"
"Relationships created: (integer)"
"Query internal execution time: (float) milliseconds"

Procedure Calls

Property keys, node labels, and relationship types are all returned as IDs rather than strings in the compact format. For each of these 3 string-ID mappings, IDs start at 0 and increase monotonically.

As such, the client should store an string array for each of these 3 mappings, and print the appropriate string for the user by checking an array at position ID. If an ID greater than the array length is encountered, the local array should be updated with a procedure call.

These calls are described generally in the Procedures documentation.

To retrieve each full mapping, the appropriate calls are:

db.labels()

127.0.0.1:6379> GRAPH.QUERY demo "CALL db.labels()"
1) 1) "label"
2) 1) 1) "plant"
   2) 1) "fruit"
3) 1) "Query internal execution time: 0.321513 milliseconds"

db.relationshipTypes()

127.0.0.1:6379> GRAPH.QUERY demo "CALL db.relationshipTypes()"
1) 1) "relationshipType"
2) 1) 1) "GROWS"
3) 1) "Query internal execution time: 0.429677 milliseconds"

db.propertyKeys()

127.0.0.1:6379> GRAPH.QUERY demo "CALL db.propertyKeys()"
1) 1) "propertyKey"
2) 1) 1) "name"
   2) 1) "season"
3) 1) "Query internal execution time: 0.318940 milliseconds"

Because the cached values never become outdated, it is possible to just retrieve new values with slightly more complex constructions:

CALL db.propertyKeys() YIELD propertyKey RETURN propertyKey SKIP [cached_array_length]

Though the property calls are quite efficient regardless of whether this optimization is used.

As an example, the Python client checks its local array of labels to resolve every label ID as seen here.

In the case of an IndexError, it issues a procedure call to fully refresh its label cache as seen here.

Reference clients

All the logic described in this document has been implemented in most of the clients listed in Client Libraries. Among these, node-redis, redis-py and jedis are currently the most sophisticated.

2 - RedisGraph Result Set Structure

This document describes the format RedisGraph uses to print data when accessed through the redis-cli utility. The language-specific clients retrieve data in a more succinct format, and provide their own functionality for printing result sets.

Top-level members

Queries that return data emit an array with 3 top-level members:

The header describing the returned records. This is an array which corresponds precisely in order and naming to the RETURN clause of the query.
A nested array containing the actual data returned by the query.
An array of metadata related to the query execution. This includes query runtime as well as data changes, such as the number of entities created, deleted, or modified by the query.

Result Set data types

A column in the result set can be populated with graph entities (nodes or relations) or scalar values.

Scalars

RedisGraph replies are formatted using the RESP protocol. The current RESP iteration provides fewer data types than RedisGraph supports internally, so displayed results are mapped as follows:

RedisGraph type	Display format
Integer	Integer
NULL	NULL (nil)
String	String
Boolean	String ("true"/"false")
Double	String (15-digit precision)

Graph Entities

When full entities are specified in a RETURN clause, all data relevant to each entity value is emitted within a nested array. Key-value pairs in the data, such as the combination of a property name and its corresponding value, are represented as 2-arrays.

Internal IDs are returned with nodes and relations, but these IDs are not immutable. After entities have been deleted, higher IDs may be migrated to the vacated lower IDs.

Nodes

The node representation contains 3 top-level elements:

The node's internal ID.
Any labels associated with the node.
The key-value pairs of all properties the node possesses.

[	
    [“id”, ID (integer)]
    [“labels”,
        [label (string) X label count]
    ]
    [properties”, [
        [prop_key (string), prop_val (scalar)] X property count]
    ]
]

Relations

The relation representation contains 5 top-level elements:

The relation's internal ID.
The type associated with the relation.
The source node's internal ID.
The destination node's internal ID.
The key-value pairs of all properties the relation possesses.

[	
    [“id”, ID (integer)]
    [“type”, type (string)]
    [“src_node”, source node ID (integer)]
    [“dest_node”, destination node ID (integer)]
    [properties”, [
        [prop_key (string), prop_val (scalar)] X property count]
    ]
]

Collections

Arrays

When array values are specified in a RETURN clause, the representation in the response is the array string representation. This is done solely for better readability of the response. The string representation of an array which contains graph entities, will print only their ID. A node string representation is round braces around its ID, and edge string representation is brackets around the edge ID.

Paths

Returned path value is the string representation of an array with the path's nodes and edges, interleaved.

Example

Given the graph created by the command:

CREATE (:person {name:'Pam', age:27})-[:works {since: 2010}]->(:employer {name:'Dunder Mifflin'})""

We can run a query that returns a node, a relation, and a scalar value.

"MATCH (n1)-[r]->(n2) RETURN n1, r, n2.name"
1) 1) "n1"
   2) "r"
   3) "n2.name"
2) 1) 1) 1) 1) "id"
            2) (integer) 0
         2) 1) "labels"
            2) 1) "person"
         3) 1) "properties"
            2) 1) 1) "age"
                  2) (integer) 27
               2) 1) "name"
                  2) "Pam"
      2) 1) 1) "id"
            2) (integer) 0
         2) 1) "type"
            2) "works"
         3) 1) "src_node"
            2) (integer) 0
         4) 1) "dest_node"
            2) (integer) 1
         5) 1) "properties"
            2) 1) 1) "since"
                  2) (integer) 2010
      3) "Dunder Mifflin"
3) 1) "Query internal execution time: 1.858986 milliseconds"

3 - Implementation details for the GRAPH.BULK endpoint

The RedisGraph bulk loader uses the GRAPH.BULK endpoint to build a new graph from 1 or more Redis queries. The bulk of these queries is binary data that is unpacked to create nodes, edges, and their properties. This endpoint could be used to write bespoke import tools for other data formats using the implementation details provided here.

Caveats

The main complicating factor in writing bulk importers is that Redis has a maximum string length of 512 megabytes and a default maximum query size of 1 gigabyte. As such, large imports must be written incrementally.

The RedisGraph team will do their best to ensure that future updates to this logic do not break current implementations, but cannot guarantee it.

Query Format

GRAPH.BULK [graph name] ["BEGIN"] [node count] [edge count] ([binary blob] * N)

Arguments

graph name

The name of the graph to be inserted.

BEGIN

The endpoint cannot be used to update existing graphs, only to create new ones. For this reason, the first query in a sequence of BULK commands should pass the string literal "BEGIN".

node count

Number of nodes being inserted in this query.

edge count

Number of edges being inserted in this query.

binary blob

A binary string of up to 512 megabytes that partially or completely describes a single label or relationship type.

Any number of these blobs may be provided in a query provided that Redis's 1-gigabyte query limit is not exceeded.

Module behavior

The endpoint will parse binary blobs as nodes until the number of created nodes matches the node count, then will parse subsequent blobs as edges. The import tool is expected to correctly provide these counts.

If the BEGIN token is found, the module will verify that the graph key is unused, and will emit an error if it is. Otherwise, the partially-constructed graph will be retrieved in order to resume building.

Binary Blob format

Node format

Nodes in node blobs do not need to specify their IDs. The ID of each node is an 8-byte unsigned integer corresponding to the node count at the time of its creation. (The first-created node has the ID of 0, the second has 1, and so forth.)

The blob consists of:

header specification
1 or more property specifications

Edge format

The import tool is responsible for tracking the IDs of nodes used as edge endpoints.

The blob consists of:

header specification
1 or more:
1. 8-byte unsigned integer representing source node ID
2. 8-byte unsigned integer representing destination node ID
3. property specification

Header specification

name - A null-terminated string representing the name of the label or relationship type.
property count - A 4-byte unsigned integer representing the number of properties each entry in this blob possesses.
property names - an ordered sequence of property count null-terminated strings, each representing the name for the property at that position.

Property specification

property type - A 1-byte integer corresponding to the TYPE enum:

BI_NULL = 0,
BI_BOOL = 1,
BI_DOUBLE = 2,
BI_STRING = 3,
BI_LONG = 4,
BI_ARRAY = 5,

property:
- 1-byte true/false if type is boolean
- 8-byte double if type is double
- 8-byte integer if type is integer
- Null-terminated C string if type is string
- 8-byte array length followed by N values of this same type-property pair if type is array

Redis Reply

Redis will reply with a string of the format:

[N] nodes created, [M] edges created