Redis programming patterns

Novel patterns for working with Redis data structures

The following documents describe some novel development patterns you can use with Redis.

1 - Bulk loading

Writing data in bulk using the Redis protocol

Bulk loading is the process of loading Redis with a large amount of pre-existing data. Ideally, you want to perform this operation quickly and efficiently. This document describes some strategies for bulk loading data in Redis.

Bulk loading using the Redis protocol

Using a normal Redis client to perform bulk loading is not a good idea for a few reasons: the naive approach of sending one command after the other is slow because you have to pay for the round trip time for every command. It is possible to use pipelining, but for bulk loading of many records you need to write new commands while you read replies at the same time to make sure you are inserting as fast as possible.

Only a small percentage of clients support non-blocking I/O, and not all the clients are able to parse the replies in an efficient way in order to maximize throughput. For all of these reasons the preferred way to mass import data into Redis is to generate a text file containing the Redis protocol, in raw format, in order to call the commands needed to insert the required data.

For instance, if I need to generate a large data set where there are billions of keys in the form `keyN -> ValueN`, I will create a file containing the following commands, encoded in the Redis protocol format:

SET Key0 Value0
SET Key1 Value1
...
SET KeyN ValueN

Once this file is created, the remaining action is to feed it to Redis as fast as possible. In the past the way to do this was to use netcat with the following command:

(cat data.txt; sleep 10) | nc localhost 6379 > /dev/null

However this is not a very reliable way to perform mass import because netcat does not really know when all the data was transferred and can't check for errors. Starting with Redis 2.6, the redis-cli utility supports a new mode called pipe mode that was designed to perform bulk loading.

Using the pipe mode the command to run looks like the following:

cat data.txt | redis-cli --pipe

That will produce an output similar to this:

All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 1000000

The redis-cli utility will also make sure to only redirect errors received from the Redis instance to the standard output.

Generating Redis Protocol

The Redis protocol is extremely simple to generate and parse, and is documented here. However, in order to generate protocol for the purpose of bulk loading, you don't need to understand every detail of the protocol; it is enough to know that every command is represented in the following way:

*<args><cr><lf>
$<len><cr><lf>
<arg0><cr><lf>
<arg1><cr><lf>
...
<argN><cr><lf>

Where <cr> means "\r" (or ASCII character 13) and <lf> means "\n" (or ASCII character 10).

For instance the command SET key value is represented by the following protocol:

*3<cr><lf>
$3<cr><lf>
SET<cr><lf>
$3<cr><lf>
key<cr><lf>
$5<cr><lf>
value<cr><lf>

Or represented as a quoted string:

"*3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$5\r\nvalue\r\n"

The file you need to generate for bulk loading is just composed of commands represented in the above way, one after the other.

The following Ruby function generates valid protocol:

def gen_redis_proto(*cmd)
    proto = ""
    proto << "*"+cmd.length.to_s+"\r\n"
    cmd.each{|arg|
        proto << "$"+arg.to_s.bytesize.to_s+"\r\n"
        proto << arg.to_s+"\r\n"
    }
    proto
end

puts gen_redis_proto("SET","mykey","Hello World!").inspect

Using the above function it is possible to easily generate the key value pairs in the above example, with this program:

(0...1000).each{|n|
    STDOUT.write(gen_redis_proto("SET","Key#{n}","Value#{n}"))
}

We can pipe the output of the program directly into redis-cli in order to perform our first mass import session.

$ ruby proto.rb | redis-cli --pipe
All data transferred. Waiting for the last reply...
Last reply received from server.
errors: 0, replies: 1000

How the pipe mode works under the hood

The trick needed inside the pipe mode of redis-cli is to be as fast as netcat while still being able to understand when the last reply was sent by the server.

This is obtained in the following way:

  • redis-cli --pipe tries to send data as fast as possible to the server.
  • At the same time it reads data when available, trying to parse it.
  • Once there is no more data to read from stdin, it sends a special ECHO command with a random 20-byte string: we are sure this is the last command sent, and we are sure we can match the reply by checking if we receive the same 20 bytes as a bulk reply.
  • Once this special final command is sent, the code receiving replies starts to match replies with these 20 bytes. When the matching reply is reached it can exit with success.

Using this trick we don't need to parse the protocol we send to the server in order to understand how many commands we are sending, but just the replies.

However while parsing the replies we also count them, so that at the end we are able to tell the user how many commands were transferred to the server by the mass insert session.

2 - Distributed Locks with Redis

A Distributed Lock Pattern with Redis

Distributed locks are a very useful primitive in many environments where different processes must operate with shared resources in a mutually exclusive way.

There are a number of libraries and blog posts describing how to implement a DLM (Distributed Lock Manager) with Redis, but every library uses a different approach, and many use a simple approach with lower guarantees compared to what can be achieved with slightly more complex designs.

This page describes a more canonical algorithm to implement distributed locks with Redis. We propose an algorithm, called Redlock, which implements a DLM which we believe to be safer than the vanilla single instance approach. We hope that the community will analyze it, provide feedback, and use it as a starting point for implementations of more complex or alternative designs.

Implementations

Before describing the algorithm, here are a few links to implementations already available that can be used for reference.

Safety and Liveness Guarantees

We are going to model our design with just three properties that, from our point of view, are the minimum guarantees needed to use distributed locks in an effective way.

  1. Safety property: Mutual exclusion. At any given moment, only one client can hold a lock.
  2. Liveness property A: Deadlock free. Eventually it is always possible to acquire a lock, even if the client that locked a resource crashes or gets partitioned.
  3. Liveness property B: Fault tolerance. As long as the majority of Redis nodes are up, clients are able to acquire and release locks.

Why Failover-based Implementations Are Not Enough

To understand what we want to improve, let’s analyze the current state of affairs with most Redis-based distributed lock libraries.

The simplest way to use Redis to lock a resource is to create a key in an instance. The key is usually created with a limited time to live, using the Redis expires feature, so that eventually it will get released (property 2 in our list). When the client needs to release the resource, it deletes the key.

Superficially this works well, but there is a problem: this is a single point of failure in our architecture. What happens if the Redis master goes down? Well, let’s add a replica! And use it if the master is unavailable. This is unfortunately not viable. By doing so we can’t implement our safety property of mutual exclusion, because Redis replication is asynchronous.

There is a race condition with this model:

  1. Client A acquires the lock in the master.
  2. The master crashes before the write to the key is transmitted to the replica.
  3. The replica gets promoted to master.
  4. Client B acquires the lock to the same resource A already holds a lock for. SAFETY VIOLATION!

Sometimes it is perfectly fine that, under special circumstances, for example during a failure, multiple clients can hold the lock at the same time. If this is the case, you can use your replication-based solution. Otherwise we suggest implementing the solution described in this document.

Correct Implementation with a Single Instance

Before trying to overcome the limitation of the single instance setup described above, let’s check how to do it correctly in this simple case, since this is actually a viable solution in applications where a race condition from time to time is acceptable, and because locking in a single instance is the foundation we’ll use for the distributed algorithm described here.

To acquire the lock, the way to go is the following:

SET resource_name my_random_value NX PX 30000

The command will set the key only if it does not already exist (NX option), with an expire of 30000 milliseconds (PX option). The key is set to a value “my_random_value”. This value must be unique across all clients and all lock requests.

Basically the random value is used in order to release the lock in a safe way, with a script that tells Redis: remove the key only if it exists and the value stored at the key is exactly the one I expect it to be. This is accomplished by the following Lua script:

if redis.call("get",KEYS[1]) == ARGV[1] then
    return redis.call("del",KEYS[1])
else
    return 0
end

This is important in order to avoid removing a lock that was created by another client. For example a client may acquire the lock, get blocked performing some operation for longer than the lock validity time (the time at which the key will expire), and later remove a lock that has already been acquired by some other client. Using just DEL is not safe, as a client may remove another client's lock. With the above script instead every lock is “signed” with a random string, so the lock will be removed only if it is still the one that was set by the client trying to remove it.

What should this random string be? We assume it’s 20 bytes from /dev/urandom, but you can find cheaper ways to make it unique enough for your tasks. For example a safe pick is to seed RC4 with /dev/urandom, and generate a pseudo random stream from that. A simpler solution is to use a UNIX timestamp with microsecond precision, concatenating the timestamp with a client ID. It is not as safe, but probably sufficient for most environments.
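As a concrete illustration, here is a minimal sketch of acquire and release using the redis-rb client; the key name, the 30 second TTL and the token size mirror the example above, and error handling is omitted:

require 'redis'
require 'securerandom'

# Release script from above: delete the key only if it still holds our token.
UNLOCK_SCRIPT = <<~'LUA'
  if redis.call("get", KEYS[1]) == ARGV[1] then
      return redis.call("del", KEYS[1])
  else
      return 0
  end
LUA

redis = Redis.new
token = SecureRandom.hex(20)   # 20 random bytes, hex encoded

# SET resource_name <token> NX PX 30000
if redis.set("resource_name", token, nx: true, px: 30000)
  begin
    # ... do the work protected by the lock, within the validity time ...
  ensure
    # Release only if the key still holds our token.
    redis.eval(UNLOCK_SCRIPT, keys: ["resource_name"], argv: [token])
  end
end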

The "lock validity time" is the time we use as the key's time to live. It is both the auto release time, and the time the client has in order to perform the operation required before another client may be able to acquire the lock again, without technically violating the mutual exclusion guarantee, which is only limited to a given window of time from the moment the lock is acquired.

So now we have a good way to acquire and release the lock. With this system, reasoning about a non-distributed system composed of a single, always available, instance, is safe. Let’s extend the concept to a distributed system where we don’t have such guarantees.

The Redlock Algorithm

In the distributed version of the algorithm we assume we have N Redis masters. Those nodes are totally independent, so we don’t use replication or any other implicit coordination system. We already described how to acquire and release the lock safely in a single instance. We take for granted that the algorithm will use this method to acquire and release the lock in a single instance. In our examples we set N=5, which is a reasonable value, so we need to run 5 Redis masters on different computers or virtual machines in order to ensure that they’ll fail in a mostly independent way.

In order to acquire the lock, the client performs the following operations (a minimal Ruby sketch follows the list):

  1. It gets the current time in milliseconds.
  2. It tries to acquire the lock in all the N instances sequentially, using the same key name and random value in all the instances. During step 2, when setting the lock in each instance, the client uses a timeout which is small compared to the total lock auto-release time in order to acquire it. For example if the auto-release time is 10 seconds, the timeout could be in the ~ 5-50 milliseconds range. This prevents the client from remaining blocked for a long time trying to talk with a Redis node which is down: if an instance is not available, we should try to talk with the next instance ASAP.
  3. The client computes how much time elapsed in order to acquire the lock, by subtracting from the current time the timestamp obtained in step 1. If and only if the client was able to acquire the lock in the majority of the instances (at least 3), and the total time elapsed to acquire the lock is less than lock validity time, the lock is considered to be acquired.
  4. If the lock was acquired, its validity time is considered to be the initial validity time minus the time elapsed, as computed in step 3.
  5. If the client failed to acquire the lock for some reason (either it was not able to lock N/2+1 instances or the validity time is negative), it will try to unlock all the instances (even the instances it believed it was not able to lock).
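Here is a minimal sketch of these steps with the redis-rb client. The five server addresses, the 50 millisecond per-instance timeout and the 1% drift allowance are illustrative assumptions rather than part of the specification, and UNLOCK_SCRIPT is the release Lua script defined in the single-instance sketch above:

require 'redis'
require 'securerandom'

# Hypothetical set of N=5 independent masters (addresses are illustrative).
SERVERS = %w[10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.4 10.0.0.5].map do |host|
  Redis.new(host: host, port: 6379, timeout: 0.05)   # short per-instance timeout
end

# UNLOCK_SCRIPT is the release Lua script from the single-instance sketch above.
def redlock_acquire(servers, resource, ttl_ms)
  token = SecureRandom.hex(20)
  start_ms = (Time.now.to_f * 1000).to_i              # step 1: current time in ms
  acquired = servers.count do |r|
    begin
      r.set(resource, token, nx: true, px: ttl_ms)    # step 2: same key and value everywhere
    rescue Redis::BaseError
      false                                           # a down instance simply counts as a failure
    end
  end
  elapsed = (Time.now.to_f * 1000).to_i - start_ms    # step 3
  drift = (ttl_ms * 0.01).to_i + 2                    # small clock drift allowance (assumption)
  validity = ttl_ms - elapsed - drift                 # step 4
  return { token: token, validity_ms: validity } if acquired >= servers.size / 2 + 1 && validity > 0

  servers.each do |r|                                 # step 5: unlock everything on failure
    begin
      r.eval(UNLOCK_SCRIPT, keys: [resource], argv: [token])
    rescue Redis::BaseError
    end
  end
  nil
end

lock = redlock_acquire(SERVERS, "resource_name", 10_000)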

Is the Algorithm Asynchronous?

The algorithm relies on the assumption that while there is no synchronized clock across the processes, the local time in every process updates at approximately the same rate, with a small margin of error compared to the auto-release time of the lock. This assumption closely resembles a real-world computer: every computer has a local clock and we can usually rely on different computers to have a clock drift which is small.

At this point we need to better specify our mutual exclusion rule: it is guaranteed only as long as the client holding the lock terminates its work within the lock validity time (as obtained in step 3), minus some time (just a few milliseconds in order to compensate for clock drift between processes).

This paper contains more information about similar systems requiring a bound clock drift: Leases: an efficient fault-tolerant mechanism for distributed file cache consistency.

Retry on Failure

When a client is unable to acquire the lock, it should try again after a random delay in order to try to desynchronize multiple clients trying to acquire the lock for the same resource at the same time (this may result in a split brain condition where nobody wins). Also the faster a client tries to acquire the lock in the majority of Redis instances, the smaller the window for a split brain condition (and the need for a retry), so ideally the client should try to send the SET commands to the N instances at the same time using multiplexing.

It is worth stressing how important it is for clients that fail to acquire the majority of locks, to release the (partially) acquired locks ASAP, so that there is no need to wait for key expiry in order for the lock to be acquired again (however if a network partition happens and the client is no longer able to communicate with the Redis instances, there is an availability penalty to pay as it waits for key expiration).

Releasing the Lock

Releasing the lock is simple, and can be performed whether or not the client believes it was able to successfully lock a given instance.

Safety Arguments

Is the algorithm safe? Let's examine what happens in different scenarios.

To start let’s assume that a client is able to acquire the lock in the majority of instances. All the instances will contain a key with the same time to live. However, the key was set at different times, so the keys will also expire at different times. But if the first key was set at worst at time T1 (the time we sample before contacting the first server) and the last key was set at worst at time T2 (the time we obtained the reply from the last server), we are sure that the first key to expire in the set will exist for at least MIN_VALIDITY=TTL-(T2-T1)-CLOCK_DRIFT. All the other keys will expire later, so we are sure that the keys will be simultaneously set for at least this time.

During the time that the majority of keys are set, another client will not be able to acquire the lock, since N/2+1 SET NX operations can’t succeed if N/2+1 keys already exist. So if a lock was acquired, it is not possible to re-acquire it at the same time (violating the mutual exclusion property).

However we want to also make sure that multiple clients trying to acquire the lock at the same time can’t simultaneously succeed.

If a client locked the majority of instances using a time near, or greater, than the lock maximum validity time (the TTL we use for SET basically), it will consider the lock invalid and will unlock the instances, so we only need to consider the case where a client was able to lock the majority of instances in a time which is less than the validity time. In this case for the argument already expressed above, for MIN_VALIDITY no client should be able to re-acquire the lock. So multiple clients will be able to lock N/2+1 instances at the same time (with "time" being the end of Step 2) only when the time to lock the majority was greater than the TTL time, making the lock invalid.

Liveness Arguments

The system liveness is based on three main features:

  1. The auto release of the lock (since keys expire): eventually keys are available again to be locked.
  2. The fact that clients, usually, will cooperate removing the locks when the lock was not acquired, or when the lock was acquired and the work terminated, making it likely that we don’t have to wait for keys to expire to re-acquire the lock.
  3. The fact that when a client needs to retry a lock, it waits a time which is comparably greater than the time needed to acquire the majority of locks, in order to probabilistically make split brain conditions during resource contention unlikely.

However, we pay an availability penalty equal to TTL time on network partitions, so if there are continuous partitions, we can pay this penalty indefinitely. This happens every time a client acquires a lock and gets partitioned away before being able to remove the lock.

Basically if there are infinite continuous network partitions, the system may become not available for an infinite amount of time.

Performance, Crash Recovery and fsync

Many users using Redis as a lock server need high performance in terms of both latency to acquire and release a lock, and number of acquire / release operations that it is possible to perform per second. In order to meet this requirement, the strategy to talk with the N Redis servers to reduce latency is definitely multiplexing (putting the socket in non-blocking mode, sending all the commands, and reading all the replies later, assuming that the RTT between the client and each instance is similar).

However there is another consideration around persistence if we want to target a crash-recovery system model.

Basically, to see the problem here, let’s assume we configure Redis without persistence at all. A client acquires the lock in 3 of 5 instances. One of the instances where the client was able to acquire the lock is restarted. At this point there are again 3 instances that can be locked for the same resource, and another client can lock it again, violating the safety property of lock exclusivity.

If we enable AOF persistence, things will improve quite a bit. For example we can upgrade a server by sending it a SHUTDOWN command and restarting it. Because Redis expires are semantically implemented so that time still elapses when the server is off, all our requirements are fine. However everything is fine as long as it is a clean shutdown. What about a power outage? If Redis is configured, as by default, to fsync on disk every second, it is possible that after a restart our key is missing. In theory, if we want to guarantee the lock safety in the face of any kind of instance restart, we need to enable fsync=always in the persistence settings. This will affect performance due to the additional sync overhead.

However things are better than they look at first glance. Basically, the algorithm safety is retained as long as when an instance restarts after a crash, it no longer participates in any currently active lock. This means that the set of currently active locks when the instance restarts were all obtained by locking instances other than the one which is rejoining the system.

To guarantee this we just need to make an instance, after a crash, unavailable for at least a bit more than the max TTL we use. This is the time needed for all the keys about the locks that existed when the instance crashed to become invalid and be automatically released.

Using delayed restarts it is basically possible to achieve safety even without any kind of Redis persistence available, however note that this may translate into an availability penalty. For example if a majority of instances crash, the system will become globally unavailable for TTL (here globally means that no resource at all will be lockable during this time).

Making the algorithm more reliable: Extending the lock

If the work performed by clients consists of small steps, it is possible to use smaller lock validity times by default, and extend the algorithm with a lock extension mechanism. Basically, if the lock validity is approaching a low value while the computation is still in progress, the client may extend the lock by sending a Lua script to all the instances that extends the TTL of the key if the key exists and its value is still the random value the client assigned when the lock was acquired.

The client should only consider the lock re-acquired if it was able to extend the lock in the majority of instances, and within the validity time (basically the algorithm to use is very similar to the one used when acquiring the lock).
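A sketch of such an extension script, under the same assumptions as the earlier sketches (redis-rb client, the same random token); the timing check against the remaining validity, analogous to the acquisition steps, is omitted for brevity:

require 'redis'

# Push the TTL forward only if the key still holds our token.
EXTEND_SCRIPT = <<~'LUA'
  if redis.call("get", KEYS[1]) == ARGV[1] then
      return redis.call("pexpire", KEYS[1], ARGV[2])
  else
      return 0
  end
LUA

def redlock_extend(servers, resource, token, ttl_ms)
  extended = servers.count do |r|
    begin
      r.eval(EXTEND_SCRIPT, keys: [resource], argv: [token, ttl_ms]) == 1
    rescue Redis::BaseError
      false
    end
  end
  extended >= servers.size / 2 + 1
end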

However this does not technically change the algorithm, so the maximum number of lock reacquisition attempts should be limited, otherwise one of the liveness properties is violated.

Want to help?

If you are into distributed systems, it would be great to have your opinion / analysis. Also reference implementations in other languages could be great.

Thanks in advance!

Analysis of Redlock


  1. Martin Kleppmann analyzed Redlock here. A counterpoint to this analysis can be found here.

3 - Secondary indexing

Building secondary indexes in Redis

Redis is not exactly a key-value store, since values can be complex data structures. However it has an external key-value shell: at API level data is addressed by the key name. It is fair to say that, natively, Redis only offers primary key access. However since Redis is a data structures server, its capabilities can be used for indexing, in order to create secondary indexes of different kinds, including composite (multi-column) indexes.

This document explains how it is possible to create indexes in Redis using the following data structures:

  • Sorted sets to create secondary indexes by ID or other numerical fields.
  • Sorted sets with lexicographical ranges for creating more advanced secondary indexes, composite indexes and graph traversal indexes.
  • Sets for creating random indexes.
  • Lists for creating simple iterable indexes and last N items indexes.

Implementing and maintaining indexes with Redis is an advanced topic, so most users that need to perform complex queries on data should understand if they are better served by a relational store. However often, especially in caching scenarios, there is the explicit need to store indexed data into Redis in order to speed up common queries which require some form of indexing in order to be executed.

Simple numerical indexes with sorted sets

The simplest secondary index you can create with Redis is by using the sorted set data type, which is a data structure representing a set of elements ordered by a floating point number, the score of each element. Elements are ordered from the lowest to the highest score.

Since the score is a double precision float, indexes you can build with vanilla sorted sets are limited to things where the indexing field is a number within a given range.

The two commands used to build this kind of index are ZADD and ZRANGEBYSCORE, to respectively add items and retrieve items within a specified range.

For instance, it is possible to index a set of person names by their age by adding elements to a sorted set. The element will be the name of the person and the score will be the age.

ZADD myindex 25 Manuel
ZADD myindex 18 Anna
ZADD myindex 35 Jon
ZADD myindex 67 Helen

In order to retrieve all persons with an age between 20 and 40, the following command can be used:

ZRANGEBYSCORE myindex 20 40
1) "Manuel"
2) "Jon"

By using the WITHSCORES option of ZRANGEBYSCORE it is also possible to obtain the scores associated with the returned elements.

The ZCOUNT command can be used in order to retrieve the number of elements within a given range, without actually fetching the elements, which is also useful, especially given the fact the operation is executed in logarithmic time regardless of the size of the range.

Ranges can be inclusive or exclusive, please refer to the ZRANGEBYSCORE command documentation for more information.

Note: Using the ZREVRANGEBYSCORE it is possible to query a range in reversed order, which is often useful when data is indexed in a given direction (ascending or descending) but we want to retrieve information the other way around.

Using objects IDs as associated values

In the above example we associated names to ages. However in general we may want to index some field of an object which is stored elsewhere. Instead of using the sorted set value directly to store the data associated with the indexed field, it is possible to store just the ID of the object.

For example I may have Redis hashes representing users. Each user is represented by a single key, directly accessible by ID:

HMSET user:1 id 1 username antirez ctime 1444809424 age 38
HMSET user:2 id 2 username maria ctime 1444808132 age 42
HMSET user:3 id 3 username jballard ctime 1443246218 age 33

If I want to create an index in order to query users by their age, I could do:

ZADD user.age.index 38 1
ZADD user.age.index 42 2
ZADD user.age.index 33 3

This time the value associated with the score in the sorted set is the ID of the object. So once I query the index with ZRANGEBYSCORE I'll also have to retrieve the information I need with HGETALL or similar commands. The obvious advantage is that objects can change without touching the index, as long as we don't change the indexed field.

In the next examples we'll almost always use IDs as values associated with the index, since this is usually the sounder design, with a few exceptions.

Updating simple sorted set indexes

Often we index things which change over time. In the above example, the age of the user changes every year. In such a case it would make sense to use the birth date as index instead of the age itself, but there are other cases where we simply want some field to change from time to time, and the index to reflect this change.

The ZADD command makes updating simple indexes a trivial operation, since re-adding an element with a different score and the same value simply updates the score and moves the element to the right position. So if the user antirez turned 39 years old, in order to update both the hash representing the user and the index, we need to execute the following two commands:

HSET user:1 age 39
ZADD user.age.index 39 1

The operation may be wrapped in a MULTI/EXEC transaction in order to make sure both are updated or neither is.
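For example, wrapping the two updates above in a transaction looks like this:

MULTI
HSET user:1 age 39
ZADD user.age.index 39 1
EXEC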

Turning multi dimensional data into linear data

Indexes created with sorted sets are able to index only a single numerical value. Because of this you may think it is impossible to index something which has multiple dimensions using this kind of index, but actually this is not always true. If you can efficiently represent something multi-dimensional in a linear way, then it is often possible to use a simple sorted set for indexing.

For example the Redis geo indexing API uses a sorted set to index places by latitude and longitude using a technique called Geo hash. The sorted set score represents alternating bits of longitude and latitude, so that we map the linear score of a sorted set to many small squares on the earth's surface. By doing an 8+1 style center plus neighborhoods search it is possible to retrieve elements by radius.

Limits of the score

Sorted set element scores are double precision floats. It means that they can represent different decimal or integer values with different errors, because they use an exponential representation internally. However what is interesting for indexing purposes is that the score is always able to represent, without any error, integers between -9007199254740992 and 9007199254740992, which is ±2^53.

When representing much larger numbers, you need a different form of indexing that is able to index numbers at any precision, called a lexicographical index.

Lexicographical indexes

Redis sorted sets have an interesting property. When elements are added with the same score, they are sorted lexicographically, comparing the strings as binary data with the memcmp() function.

For people who don't know the C language nor the memcmp function, what it means is that elements with the same score are sorted comparing the raw values of their bytes, byte after byte. If the first byte is the same, the second is checked, and so forth. If one string is a prefix of the other, the longer string is considered the greater of the two, so "foobar" is greater than "foo".

There are commands such as ZRANGEBYLEX and ZLEXCOUNT that are able to query and count ranges in a lexicographical fashion, assuming they are used with sorted sets where all the elements have the same score.

This Redis feature is basically equivalent to a b-tree data structure which is often used in order to implement indexes with traditional databases. As you can guess, because of this, it is possible to use this Redis data structure in order to implement pretty fancy indexes.

Before we dive into using lexicographical indexes, let's check how sorted sets behave in this special mode of operation. Since we need to add elements with the same score, we'll always use the special score of zero.

ZADD myindex 0 baaa
ZADD myindex 0 abbb
ZADD myindex 0 aaaa
ZADD myindex 0 bbbb

Fetching all the elements from the sorted set immediately reveals that they are ordered lexicographically.

ZRANGE myindex 0 -1
1) "aaaa"
2) "abbb"
3) "baaa"
4) "bbbb"

Now we can use ZRANGEBYLEX in order to perform range queries.

ZRANGEBYLEX myindex [a (b
1) "aaaa"
2) "abbb"

Note that in the range queries we prefixed the min and max elements identifying the range with the special characters [ and (. These prefixes are mandatory, and they specify if the elements of the range are inclusive or exclusive. So the range [a (b means give me all the elements lexicographically between a inclusive and b exclusive, which are all the elements starting with a.

There are also two more special characters indicating the infinitely negative string and the infinitely positive string, which are - and +.

ZRANGEBYLEX myindex [b +
1) "baaa"
2) "bbbb"

That's it basically. Let's see how to use these features to build indexes.

A first example: completion

An interesting application of indexing is completion. Completion is what happens when you start typing your query into a search engine: the user interface will anticipate what you are likely typing, providing common queries that start with the same characters.

A naive approach to completion is to just add every single query we get from the user into the index. For example if the user searches banana we'll just do:

ZADD myindex 0 banana

And so forth for each search query ever encountered. Then when we want to complete the user input, we execute a range query using ZRANGEBYLEX. Imagine the user is typing "bit" inside the search form, and we want to offer possible search keywords starting with "bit". We send Redis a command like this:

ZRANGEBYLEX myindex "[bit" "[bit\xff"

Basically we create a range using the string the user is typing right now as the start, and the same string plus a trailing byte set to 255, which is \xff in the example, as the end of the range. This way we get all the strings that start with the string the user is typing.

Note that we don't want too many items returned, so we may use the LIMIT option in order to reduce the number of results.
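As a sketch, a completion helper using the redis-rb client could look like this; the trailing 0xff byte and the LIMIT option mirror the command above, and the key name is just the one from the examples:

require 'redis'

def complete(redis, prefix, count = 10)
  # All entries starting with the prefix, at most `count` of them.
  redis.zrangebylex("myindex", "[#{prefix}", "[#{prefix}\xff", limit: [0, count])
end

complete(Redis.new, "bit")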

Adding frequency into the mix

The above approach is a bit naive, because this way all user searches have the same weight. In a real system we want to complete strings according to their frequency: very popular searches will be proposed with a higher probability compared to search strings typed very rarely.

In order to implement something which depends on the frequency, and at the same time automatically adapts to future inputs, by purging searches that are no longer popular, we can use a very simple streaming algorithm.

To start, we modify our index in order to store not just the search term, but also the frequency the term is associated with. So instead of just adding banana we add banana:1, where 1 is the frequency.

ZADD myindex 0 banana:1

We also need logic in order to increment the frequency if the search term already exists in the index, so what we'll actually do is something like this:

ZRANGEBYLEX myindex "[banana:" + LIMIT 0 1
1) "banana:1"

This will return the single entry of banana if it exists. Then we can increment the associated frequency and send the following two commands:

ZREM myindex banana:1
ZADD myindex 0 banana:2

Note that because it is possible that there are concurrent updates, the above three commands should be sent via a Lua script instead, so that the script can atomically get the old count and re-add the item with the incremented score.
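A possible sketch of such a script, driven from Ruby with redis-rb; the key name and the term:frequency entry format are the ones used above:

require 'redis'

INCR_TERM_SCRIPT = <<~'LUA'
  -- KEYS[1] = index key, ARGV[1] = search term (e.g. "banana")
  local entry = redis.call("zrangebylex", KEYS[1],
                           "[" .. ARGV[1] .. ":", "[" .. ARGV[1] .. ":\255",
                           "LIMIT", 0, 1)
  local count = 1
  if #entry > 0 then
      count = tonumber(string.sub(entry[1], #ARGV[1] + 2)) + 1
      redis.call("zrem", KEYS[1], entry[1])
  end
  redis.call("zadd", KEYS[1], 0, ARGV[1] .. ":" .. count)
  return count
LUA

redis = Redis.new
redis.eval(INCR_TERM_SCRIPT, keys: ["myindex"], argv: ["banana"])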

So the result will be that, every time a user searches for banana we'll get our entry updated.

There is more: our goal is to have an index containing only items that are searched frequently. So we need some form of purging. When we actually query the index in order to complete the user input, we may see something like this:

ZRANGEBYLEX myindex "[banana:" + LIMIT 0 10
1) "banana:123"
2) "banaooo:1"
3) "banned user:49"
4) "banning:89"

Apparently nobody searches for "banaooo", for example, but the query was performed a single time, so we end up presenting it to the user.

This is what we can do. Out of the returned items, we pick a random one, decrement its score by one, and re-add it with the new score. However if the score reaches 0, we simply remove the item from the list. You can use much more advanced systems, but the idea is that the index in the long run will contain the top searches, and if the top searches change over time it will adapt automatically.

A refinement to this algorithm is to pick entries in the list according to their weight: the higher the score, the less likely an entry is to be picked to have its score decremented, or to be evicted.

Normalizing strings for case and accents

In the completion examples we always used lowercase strings. However reality is much more complex than that: languages have capitalized names, accents, and so forth.

One simple way to deal with these issues is to actually normalize the string the user searches for. Whether the user searches for "Banana", "BANANA" or "Ba'nana", we may always turn it into "banana".
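A possible normalization helper in Ruby (assuming Ruby 2.2+ for unicode_normalize); how aggressive the normalization should be is application dependent:

# Lowercase, strip accents by removing combining marks after NFKD
# decomposition, and drop everything that is not a letter, digit or space.
def normalize(term)
  term.unicode_normalize(:nfkd)
      .gsub(/\p{Mn}/, "")
      .downcase
      .gsub(/[^a-z0-9 ]/, "")
end

normalize("Ba'nana")   # => "banana"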

However sometimes we may like to present the user with the original item typed, even if we normalize the string for indexing. In order to do this, what we do is to change the format of the index so that instead of just storing term:frequency we store normalized:frequency:original like in the following example:

ZADD myindex 0 banana:273:Banana

Basically we add another field that we'll extract and use only for visualization. Ranges will always be computed using the normalized strings instead. This is a common trick which has multiple applications.

Adding auxiliary information in the index

When using a sorted set in a direct way, we have two different attributes for each object: the score, which we use as an index, and an associated value. When using lexicographical indexes instead, the score is always set to 0 and basically not used at all. We are left with a single string, which is the element itself.

Like we did in the previous completion examples, we are still able to store associated data using separators. For example we used the colon in order to add the frequency and the original word for completion.

In general we can add any kind of associated value to our indexing key. In order to use a lexicographical index to implement a simple key-value store we just store the entry as key:value:

ZADD myindex 0 mykey:myvalue

And search for the key with:

ZRANGEBYLEX myindex [mykey: + LIMIT 0 1
1) "mykey:myvalue"

Then we extract the part after the colon to retrieve the value. However there is a problem to solve in this case: collisions. The colon character may be part of the key itself, so the separator must be chosen so that it can never collide with the keys we add.

Since lexicographical ranges in Redis are binary safe you can use any byte or any sequence of bytes. However if you receive untrusted user input, it is better to use some form of escaping in order to guarantee that the separator will never happen to be part of the key.

For example if you use two null bytes as the separator "\0\0", you may want to always escape single null bytes in your strings into some two-byte escape sequence.

Numerical padding

Lexicographical indexes may look useful only when the problem at hand is to index strings. Actually it is very simple to use this kind of index in order to index numbers of arbitrary precision as well.

In the ASCII character set, digits appear in the order from 0 to 9, so if we left-pad numbers with leading zeroes, the result is that comparing them as strings will order them by their numerical value.

ZADD myindex 0 00324823481:foo
ZADD myindex 0 12838349234:bar
ZADD myindex 0 00000000111:zap

ZRANGE myindex 0 -1
1) "00000000111:zap"
2) "00324823481:foo"
3) "12838349234:bar"

We effectively created an index using a numerical field which can be as big as we want. This also works with floating point numbers of any precision, by making sure we left pad the integer part with leading zeroes and right pad the fractional part with trailing zeroes, like in the following list of numbers:

    01000000000000.11000000000000
    01000000000000.02200000000000
    00000002121241.34893482930000
    00999999999999.00000000000000

Using numbers in binary form

Storing numbers in decimal may use too much memory. An alternative approach is just to store numbers, for example 128 bit integers, directly in their binary form. However for this to work, you need to store the numbers in big endian format, so that the most significant bytes are stored before the least significant bytes. This way when Redis compares the strings with memcmp(), it will effectively sort the numbers by their value.
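For instance, in Ruby a 64-bit unsigned integer can be stored big endian with Array#pack; the text mentions 128-bit integers, for which the same idea applies with more bytes:

# Most significant byte first, so a raw byte comparison (memcmp) orders
# the encoded strings like the numbers themselves.
encoded = [1_000_000].pack("Q>")   # => "\x00\x00\x00\x00\x00\x0FB@"
decoded = encoded.unpack1("Q>")    # => 1000000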

Keep in mind that data stored in binary format is less observable for debugging, harder to parse and export. So it is definitely a trade off.

Composite indexes

So far we explored ways to index single fields. However we all know that SQL stores are able to create indexes using multiple fields. For example I may index products in a very large store by room number and price.

I need to run queries in order to retrieve all the products in a given room having a given price range. What I can do is to index each product in the following way:

ZADD myindex 0 0056:0028.44:90
ZADD myindex 0 0034:0011.00:832

Here the fields are room:price:product_id. I used just four digits padding in the example for simplicity. The auxiliary data (the product ID) does not need any padding.

With an index like that, to get all the products in room 56 having a price between 10 and 30 dollars is very easy. We can just run the following command:

ZRANGEBYLEX myindex [0056:0010.00 [0056:0030.00

The above is called a composite index. Its effectiveness depends on the order of the fields and the queries I want to run. For example the above index cannot be used efficiently in order to get all the products having a specific price range regardless of the room number. However I can use the leftmost field of the index in order to run queries regardless of the price, like give me all the products in room 44.

Composite indexes are very powerful, and are used in traditional stores in order to optimize complex queries. In Redis they could be useful both to implement a very fast in-memory Redis index of something stored into a traditional data store, or in order to directly index Redis data.

Updating lexicographical indexes

In a lexicographical index, the value stored in the index can get pretty fancy, and hard or slow to rebuild from what we store about the object. So one approach to simplify the handling of the index, at the cost of using more memory, is to also keep, alongside the sorted set representing the index, a hash mapping the object ID to the current index value.

So for example, when we index we also add to a hash:

MULTI
ZADD myindex 0 0056:0028.44:90
HSET index.content 90 0056:0028.44:90
EXEC

This is not always needed, but it simplifies the operations of updating the index. In order to remove the old information we indexed for the object ID 90, regardless of the current field values of the object, we just have to retrieve the hash value by object ID and ZREM it from the sorted set view.
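A sketch of such an update using redis-rb; the race between reading the hash and the transaction is ignored here (a Lua script or WATCH could close it), and the key names are the ones used in the example above:

require 'redis'

def reindex(redis, object_id, new_entry)
  old_entry = redis.hget("index.content", object_id)
  redis.multi do |tx|
    tx.zrem("myindex", old_entry) if old_entry
    tx.zadd("myindex", 0, new_entry)
    tx.hset("index.content", object_id, new_entry)
  end
end

reindex(Redis.new, "90", "0056:0030.00:90")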

Representing and querying graphs using a hexastore

One cool thing about composite indexes is that they are handy in order to represent graphs, using a data structure which is called a hexastore.

The hexastore provides a representation for relations between objects, formed by a subject, a predicate and an object. A simple relation between objects could be:

antirez is-friend-of matteocollina

In order to represent this relation I can store the following element in my lexicographical index:

ZADD myindex 0 spo:antirez:is-friend-of:matteocollina

Note that I prefixed my item with the string spo. It means that the item represents a subject,predicate,object relation.

I can add 5 more entries for the same relation, but in a different order:

ZADD myindex 0 sop:antirez:matteocollina:is-friend-of
ZADD myindex 0 ops:matteocollina:is-friend-of:antirez
ZADD myindex 0 osp:matteocollina:antirez:is-friend-of
ZADD myindex 0 pso:is-friend-of:antirez:matteocollina
ZADD myindex 0 pos:is-friend-of:matteocollina:antirez
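A small helper can produce all six permutations at once; this is just a sketch with redis-rb, using the same key as the examples:

require 'redis'

def hexastore_add(redis, subject, predicate, object)
  {
    "spo" => [subject, predicate, object],
    "sop" => [subject, object, predicate],
    "ops" => [object, predicate, subject],
    "osp" => [object, subject, predicate],
    "pso" => [predicate, subject, object],
    "pos" => [predicate, object, subject]
  }.each do |prefix, triple|
    redis.zadd("myindex", 0, ([prefix] + triple).join(":"))
  end
end

hexastore_add(Redis.new, "antirez", "is-friend-of", "matteocollina")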

Now things start to be interesting, and I can query the graph in many different ways. For example, who are all the people antirez is friend of?

ZRANGEBYLEX myindex "[spo:antirez:is-friend-of:" "[spo:antirez:is-friend-of:\xff"
1) "spo:antirez:is-friend-of:matteocollina"
2) "spo:antirez:is-friend-of:wonderwoman"
3) "spo:antirez:is-friend-of:spiderman"

Or, what are all the relationships antirez and matteocollina have where the first is the subject and the second is the object?

ZRANGEBYLEX myindex "[sop:antirez:matteocollina:" "[sop:antirez:matteocollina:\xff"
1) "sop:antirez:matteocollina:is-friend-of"
2) "sop:antirez:matteocollina:was-at-conference-with"
3) "sop:antirez:matteocollina:talked-with"

By combining different queries, I can ask fancy questions. For example: who are all my friends that like beer, live in Barcelona, and that matteocollina considers friends as well? To get this information I start with an spo query to find all the people I'm friends with. Then for each result I get I perform an spo query to check if they like beer, removing the ones for which I can't find this relation. I do it again to filter by city. Finally I perform an ops query to find, of the list I obtained, who is considered a friend by matteocollina.

Make sure to check Matteo Collina's slides about Levelgraph in order to better understand these ideas.

Multi dimensional indexes

A more complex type of index is an index that allows you to perform queries where two or more variables are queried at the same time for specific ranges. For example I may have a data set representing people's age and salary, and I want to retrieve all the people between 50 and 55 years old having a salary between 70000 and 85000.

This query may be performed with a multi column index, but this requires us to select the first variable and then scan the second, which means we may do a lot more work than needed. It is possible to perform these kinds of queries involving multiple variables using different data structures. For example, multi-dimensional trees such as k-d trees or r-trees are sometimes used. Here we'll describe a different way to index data into multiple dimensions, using a representation trick that allows us to perform the query in a very efficient way using Redis lexicographical ranges.

Let's start by visualizing the problem. In this picture we have points in the space, which represent our data samples, where x and y are our coordinates. The maximum value of both variables is 400.

The blue box in the picture represents our query. We want all the points where x is between 50 and 100, and where y is between 100 and 300.

Points in the space

In order to represent data that makes these kinds of queries fast to perform, we start by padding our numbers with 0. So for example imagine we want to add the point 10,25 (x,y) to our index. Given that the maximum range in the example is 400 we can just pad to three digits, so we obtain:

x = 010
y = 025

Now what we do is to interleave the digits, taking the leftmost digit in x, and the leftmost digit in y, and so forth, in order to create a single number:

001205

This is our index, however in order to more easily reconstruct the original representation, if we want (at the cost of space), we may also add the original values as additional columns:

001205:10:25
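The digit interleaving itself is only a few lines of Ruby; a minimal sketch (the function name is illustrative):

# Interleave the decimal digits of x and y, zero padded to the given width.
def interleave(x, y, digits)
  xs = x.to_s.rjust(digits, "0")
  ys = y.to_s.rjust(digits, "0")
  xs.chars.zip(ys.chars).flatten.join
end

interleave(10, 25, 3)   # => "001205"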

Now, let's reason about this representation and why it is useful in the context of range queries. For example let's take the center of our blue box, which is at x=75 and y=200. We can encode this number as we did earlier by interleaving the digits, obtaining:

027050

What happens if we substitute the last two digits respectively with 00 and 99? We obtain a range which is lexicographically continuous:

027000 to 027099

What this maps to is to a square representing all values where the x variable is between 70 and 79, and the y variable is between 200 and 209. We can write random points in this interval, in order to identify this specific area:

Small area

So the above lexicographic query allows us to easily query for points in a specific square in the picture. However the square may be too small for the box we are searching, so that too many queries are needed. So we can do the same but instead of replacing the last two digits with 00 and 99, we can do it for the last four digits, obtaining the following range:

020000 029999

This time the range represents all the points where x is between 0 and 99 and y is between 200 and 299. Drawing random points in this interval shows us this larger area:

Large area

Oops, now our area is way too big for our query, and still our search box is not completely included. We need more granularity, but we can easily obtain it by representing our numbers in binary form. This time, when we replace digits, instead of getting squares which are ten times bigger we get squares which are just two times bigger.

Our numbers in binary form, assuming we need just 9 bits for each variable (in order to represent numbers up to 400 in value) would be:

x = 75  -> 001001011
y = 200 -> 011001000

So by interleaving digits, our representation in the index would be:

000111000011001010:75:200

Let's see what our ranges are as we substitute the last 2, 4, 6, 8, ... bits with 0s and 1s in the interleaved representation:

2 bits: x between 74 and 75, y between 200 and 201 (range=2)
4 bits: x between 72 and 75, y between 200 and 203 (range=4)
6 bits: x between 72 and 79, y between 200 and 207 (range=8)
8 bits: x between 64 and 79, y between 192 and 207 (range=16)

And so forth. Now we have definitely better granularity! As you can see substituting N bits from the index gives us search boxes of side 2^(N/2).

So what we do is check the dimension where our search box is smaller, and check the nearest power of two to this number. Our search box was 50,100 to 100,300, so it has a width of 50 and a height of 200. We take the smaller of the two, 50, and check the nearest power of two, which is 64. 64 is 2^6, so we would work with indexes obtained replacing the last 12 bits from the interleaved representation (so that we end up replacing just 6 bits of each variable).

However single squares may not cover all our search, so we may need more. What we do is to start with the left bottom corner of our search box, which is 50,100, and find the first range by substituting the last 6 bits in each number with 0. Then we do the same with the right top corner.

With two trivial nested for loops where we increment only the significant bits, we can find all the squares between these two. For each square we convert the two numbers into our interleaved representation, and create the range using the converted representation as our start, and the same representation but with the last 12 bits set to 1 as the end of the range.

For each square found we perform our query and get the elements inside, removing the elements which are outside our search box.

Turning this into code is simple. Here is a Ruby example:

def spacequery(x0,y0,x1,y1,exp)
    bits=exp*2
    x_start = x0/(2**exp)
    x_end = x1/(2**exp)
    y_start = y0/(2**exp)
    y_end = y1/(2**exp)
    (x_start..x_end).each{|x|
        (y_start..y_end).each{|y|
            x_range_start = x*(2**exp)
            x_range_end = x_range_start | ((2**exp)-1)
            y_range_start = y*(2**exp)
            y_range_end = y_range_start | ((2**exp)-1)
            puts "#{x},#{y} x from #{x_range_start} to #{x_range_end}, y from #{y_range_start} to #{y_range_end}"

            # Turn it into interleaved form for ZRANGEBYLEX query.
            # We assume we need 9 bits for each integer, so the final
            # interleaved representation will be 18 bits.
            xbin = x_range_start.to_s(2).rjust(9,'0')
            ybin = y_range_start.to_s(2).rjust(9,'0')
            s = xbin.split("").zip(ybin.split("")).flatten.compact.join("")
            # Now that we have the start of the range, calculate the end
            # by replacing the specified number of bits from 0 to 1.
            e = s[0..-(bits+1)]+("1"*bits)
            puts "ZRANGEBYLEX myindex [#{s} [#{e}"
        }
    }
end

spacequery(50,100,100,300,6)

While not immediately trivial, this is a very useful indexing strategy that in the future may be implemented in Redis in a native way. For now, the good thing is that the complexity may be easily encapsulated inside a library that can be used in order to perform indexing and queries. One example of such a library is Redimension, a proof of concept Ruby library which indexes N-dimensional data inside Redis using the technique described here.

Multi dimensional indexes with negative or floating point numbers

The simplest way to represent negative values is just to work with unsigned integers and represent them using an offset, so that when you index, before translating numbers into the indexed representation, you add the absolute value of your most negative integer.

For floating point numbers, the simplest approach is probably to convert them to integers by multiplying the number by a power of ten proportional to the number of digits after the decimal point you want to retain.

Non range indexes

So far we checked indexes which are useful to query by range or by single item. However other Redis data structures such as Sets or Lists can be used in order to build other kinds of indexes. They are very commonly used but maybe we don't always realize they are actually a form of indexing.

For instance I can index object IDs into a Set data type in order to retrieve a set of random objects via SRANDMEMBER. Sets can also be used to check for existence, when all I need is to test if a given item exists or not, or has a single boolean property or not.

Similarly lists can be used in order to index items into a fixed order. I can add all my items into a Redis list and rotate the list with RPOPLPUSH using the same key name as source and destination. This is useful when I want to process a given set of items again and again forever in the same order. Think of an RSS feed system that needs to refresh the local copy periodically.

Another popular index often used with Redis is a capped list, where items are added with LPUSH and trimmed with LTRIM, in order to create a view with just the latest N items encountered, in the same order they were seen.
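For example (key and item names are illustrative):

RPOPLPUSH worklist worklist    (rotate: the tail element goes back to the head)
LPUSH recent.items item:1234
LTRIM recent.items 0 99        (keep only the latest 100 items)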

Index inconsistency

Keeping the index updated may be challenging: in the course of months or years it is possible that inconsistencies are introduced because of software bugs, network partitions or other events.

Different strategies could be used. If the index data is outside Redis, read repair can be a solution, where data is fixed in a lazy way when it is requested. When we index data which is stored in Redis itself, the SCAN family of commands can be used in order to verify, update or rebuild the index from scratch, incrementally.

4 - Redis patterns example

Learn several Redis patterns by building a Twitter clone

This article describes the design and implementation of a very simple Twitter clone written using PHP with Redis as the only database. The programming community has traditionally considered key-value stores as a special purpose database that couldn't be used as a drop-in replacement for a relational database for the development of web applications. This article will try to show that Redis data structures on top of a key-value layer are an effective data model to implement many kinds of applications.

Note: the original version of this article was written in 2009 when Redis was released. It was not exactly clear at that time that the Redis data model was suitable to write entire applications. Now after 5 years there are many cases of applications using Redis as their main store, so the goal of the article today is to be a tutorial for Redis newcomers. You'll learn how to design a simple data layout using Redis, and how to apply different data structures.

Our Twitter clone, called Retwis, is structurally simple, has very good performance, and can be distributed among any number of web and Redis servers with little effort. You can find the source code here.

I used PHP for the example since it can be read by everybody. The same (or better) results can be obtained using Ruby, Python, Erlang, and so on. A few clones exist (however not all the clones use the same data layout as the current version of this tutorial, so please, stick with the official PHP implementation for the sake of following the article better).

  • Retwis-RB is a port of Retwis to Ruby and Sinatra written by Daniel Lucraft! Full source code is included of course, and a link to its Git repository appears in the footer of this article. The rest of this article targets PHP, but Ruby programmers can also check the Retwis-RB source code since it's conceptually very similar.
  • Retwis-J is a port of Retwis to Java, using the Spring Data Framework, written by Costin Leau. Its source code can be found on GitHub, and there is comprehensive documentation available at springsource.org.

What is a key-value store?

The essence of a key-value store is the ability to store some data, called a value, inside a key. The value can be retrieved later only if we know the specific key it was stored in. There is no direct way to search for a key by value. In some sense, it is like a very large hash/dictionary, but it is persistent, i.e. when your application ends, the data doesn't go away. So, for example, I can use the command SET to store the value bar in the key foo:

SET foo bar

Redis stores data permanently, so if I later ask "What is the value stored in key foo?" Redis will reply with bar:

GET foo => bar

Other common operations provided by key-value stores are DEL, to delete a given key and its associated value, SET-if-not-exists (called SETNX on Redis), to assign a value to a key only if the key does not already exist, and INCR, to atomically increment a number stored in a given key:

SET foo 10
INCR foo => 11
INCR foo => 12
INCR foo => 13

Atomic operations

There is something special about INCR. You may wonder why Redis provides such an operation if we can do it ourselves with a bit of code. After all, it is as simple as:

x = GET foo
x = x + 1
SET foo x

The problem is that incrementing this way will work as long as there is only one client working with the key foo at one time. See what happens if two clients are accessing this key at the same time:

x = GET foo (yields 10)
y = GET foo (yields 10)
x = x + 1 (x is now 11)
y = y + 1 (y is now 11)
SET foo x (foo is now 11)
SET foo y (foo is now 11)

Something is wrong! We incremented the value two times, but instead of going from 10 to 12, our key holds 11. This is because the increment done with GET / increment / SET is not an atomic operation. The INCR provided by Redis, Memcached, and so on is instead an atomic implementation: the server takes care of protecting the key for the time needed to complete the increment, preventing simultaneous accesses.

What makes Redis different from other key-value stores is that it provides other operations similar to INCR that can be used to model complex problems. This is why you can use Redis to write whole web applications without using another database like an SQL database, and without going crazy.

Beyond key-value stores: lists

In this section we will see which Redis features we need to build our Twitter clone. The first thing to know is that Redis values can be more than strings. Redis supports Lists, Sets, Hashes, Sorted Sets, Bitmaps, and HyperLogLog types as values, and there are atomic operations to operate on them so we are safe even with multiple accesses to the same key. Let's start with Lists:

LPUSH mylist a (now mylist holds 'a')
LPUSH mylist b (now mylist holds 'b','a')
LPUSH mylist c (now mylist holds 'c','b','a')

LPUSH means Left Push, that is, add an element to the left (or to the head) of the list stored in mylist. If the key mylist does not exist it is automatically created as an empty list before the PUSH operation. As you can imagine, there is also an RPUSH operation that adds the element to the right of the list (on the tail). This is very useful for our Twitter clone. User updates can be added to a list stored in username:updates, for instance.
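For completeness, this is RPUSH in action (using a new key, otherlist, so the mylist examples below stay exactly as shown):

RPUSH otherlist x (now otherlist holds 'x')
RPUSH otherlist y (now otherlist holds 'x','y')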

There are operations to get data from Lists, of course. For instance, LRANGE returns a range from the list, or the whole list.

LRANGE mylist 0 1 => c,b

LRANGE uses zero-based indexes - that is the first element is 0, the second 1, and so on. The command arguments are LRANGE key first-index last-index. The last-index argument can be negative, with a special meaning: -1 is the last element of the list, -2 the penultimate, and so on. So, to get the whole list use:

LRANGE mylist 0 -1 => c,b,a

Other important operations are LLEN, which returns the number of elements in the list, and LTRIM, which is like LRANGE but instead of returning the specified range it trims the list to that range. It is like "get a range from mylist and set this range as the new value", but done atomically.
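For example, continuing with the mylist key from above (which holds 'c','b','a'):

LLEN mylist => 3
LTRIM mylist 0 1 (mylist now holds 'c','b')
LRANGE mylist 0 -1 => c,b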

The Set data type

Currently we don't use the Set type in this tutorial, but since we use Sorted Sets, which are kind of a more capable version of Sets, it is better to start introducing Sets first (which are a very useful data structure per se), and later Sorted Sets.

There are more data types than just Lists. Redis also supports Sets, which are unsorted collections of elements. It is possible to add, remove, and test for existence of members, and perform the intersection between different Sets. Of course it is possible to get the elements of a Set. Some examples will make it more clear. Keep in mind that SADD is the add to set operation, SREM is the remove from set operation, SISMEMBER is the test if member operation, and SINTER is the perform intersection operation. Other operations are SCARD to get the cardinality (the number of elements) of a Set, and SMEMBERS to return all the members of a Set.

SADD myset a
SADD myset b
SADD myset foo
SADD myset bar
SCARD myset => 4
SMEMBERS myset => bar,a,foo,b

Note that SMEMBERS does not return the elements in the same order we added them since Sets are unsorted collections of elements. When you want to store in order it is better to use Lists instead. Some more operations against Sets:

SADD mynewset b
SADD mynewset foo
SADD mynewset hello
SINTER myset mynewset => foo,b

SINTER can return the intersection between Sets but it is not limited to two Sets. You may ask for the intersection of 4, 5, or 10000 Sets. Finally let's check how SISMEMBER works:

SISMEMBER myset foo => 1
SISMEMBER myset notamember => 0

The Sorted Set data type

Sorted Sets are similar to Sets: collections of elements. However, in Sorted Sets each element is associated with a floating point value, called the element score. Because of the score, elements inside a Sorted Set are ordered, since we can always compare two elements by score (and if the score happens to be the same, we compare the two elements as strings).

As with Sets, in Sorted Sets it is not possible to add repeated elements: every element is unique. However, it is possible to update an element's score.

Sorted Set commands are prefixed with Z. The following is an example of Sorted Sets usage:

ZADD zset 10 a
ZADD zset 5 b
ZADD zset 12.55 c
ZRANGE zset 0 -1 => b,a,c

In the above example we added a few elements with ZADD, and later retrieved the elements with ZRANGE. As you can see the elements are returned in order according to their score. In order to check if a given element exists, and also to retrieve its score if it exists, we use the ZSCORE command:

ZSCORE zset a => 10
ZSCORE zset non_existing_element => NULL
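As mentioned above, an element's score can be updated simply by calling ZADD again for the same element; the element is re-positioned accordingly:

ZADD zset 20 b (b's score goes from 5 to 20)
ZRANGE zset 0 -1 => a,c,b
ZSCORE zset b => 20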

Sorted Sets are a very powerful data structure: you can query elements by score range, lexicographically, in reverse order, and so forth. To learn more, please check the Sorted Set sections in the official Redis commands documentation.

The Hash data type

This is the last data structure we use in our program, and it is extremely easy to grasp since there is an equivalent in almost every programming language out there: Hashes. Redis Hashes are basically like Ruby or Python hashes, a collection of fields associated with values:

HMSET myuser name Salvatore surname Sanfilippo country Italy
HGET myuser surname => Sanfilippo

HMSET can be used to set fields in the hash, which can be retrieved with HGET later. It is possible to check if a field exists with HEXISTS, or to increment a hash field with HINCRBY, and so forth.
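For example, continuing with the myuser Hash above (the visits field is made up just for this example):

HEXISTS myuser surname => 1
HEXISTS myuser age => 0
HINCRBY myuser visits 1 => 1
HINCRBY myuser visits 10 => 11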

Hashes are the ideal data structure to represent objects. For example we use Hashes in order to represent Users and Updates in our Twitter clone.

Okay, we have just covered the basics of the main Redis data structures, so we are ready to start coding!

Prerequisites

If you haven't downloaded the Retwis source code already please grab it now. It contains a few PHP files, and also a copy of Predis, the PHP client library we use in this example.

Another thing you probably want is a working Redis server. Just get the source, build with make, run with ./redis-server, and you're ready to go. No configuration is required at all in order to play with or run Retwis on your computer.

Data layout

When working with a relational database, a database schema must be designed so that we'd know the tables, indexes, and so on that the database will contain. We don't have tables in Redis, so what do we need to design? We need to identify what keys are needed to represent our objects and what kind of values these keys need to hold.

Let's start with Users. We need to represent users, of course, with their username, userid, password, the set of users following a given user, the set of users a given user follows, and so on. The first question is, how should we identify a user? Like in a relational DB, a good solution is to identify different users with different numbers, so we can associate a unique ID with every user. Every other reference to this user will be done by id. Creating unique IDs is very simple to do by using our atomic INCR operation. When we create a new user we can do something like this, assuming the user is called "antirez":

INCR next_user_id => 1000
HMSET user:1000 username antirez password p1pp0

Note: you should use a hashed password in a real application, for simplicity we store the password in clear text.

We use the next_user_id key in order to always get a unique ID for every new user. Then we use this unique ID to name the key holding a Hash with the user's data. This is a common design pattern with key-value stores! Keep it in mind. Besides the fields already defined, we need some more information in order to fully define a User. For example, sometimes it can be useful to be able to get the user ID from the username, so every time we add a user, we also populate the users key, which is a Hash, with the username as field, and its ID as value.

HSET users antirez 1000

This may appear strange at first, but remember that we are only able to access data in a direct way, without secondary indexes. It's not possible to tell Redis to return the key that holds a specific value. This is also our strength. This new paradigm is forcing us to organize data so that everything is accessible by primary key, speaking in relational DB terms.
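With the users Hash in place, resolving a username to its ID is a single direct lookup:

HGET users antirez => 1000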

Followers, following, and updates

There is another central need in our system. A user might have users who follow them, which we'll call their followers. A user might follow other users, which we'll call their following. We have a perfect data structure for this: Sets. The uniqueness of Set elements, and the fact that we can test for membership in constant time, are two interesting features. However, what about also remembering the time at which a given user started following another one? In an enhanced version of our simple Twitter clone this may be useful, so instead of using a simple Set, we use a Sorted Set, using the user ID of the following or follower user as the element, and the unix time at which the relation between the users was created as the score.

So let's define our keys:

followers:1000 => Sorted Set of the IDs of all the users following user 1000
following:1000 => Sorted Set of the IDs of all the users that user 1000 follows

We can add new followers with:

ZADD followers:1000 1401267618 1234 => Add user 1234 with time 1401267618

Another important thing we need is a place where we can add the updates to display in the user's home page. We'll need to access this data in chronological order later, from the most recent update to the oldest, so the perfect data structure for this is a List. Basically every new update will be LPUSHed into the user's updates key, and thanks to LRANGE we can implement pagination and so on. Note that we use the words updates and posts interchangeably, since updates are actually "little posts" in some way.

posts:1000 => a List of post ids - every new post is LPUSHed here.

This list is basically the user's timeline. We'll push into it the IDs of the user's own posts, and the IDs of all the posts created by the users they follow. Basically, we'll implement a write fanout.

Authentication

OK, we have more or less everything about the user except for authentication. We'll handle authentication in a simple but robust way: we don't want to use PHP sessions, as our system must be ready to be distributed among different web servers easily, so we'll keep the whole state in our Redis database. All we need is a random unguessable string to set as the cookie of an authenticated user, and a key that will contain the user ID of the client holding the string.

We need two things in order to make this work in a robust way. First: the current authentication secret (the random unguessable string) should be part of the User object, so when the user is created we also set an auth field in its Hash:

HSET user:1000 auth fea5e81ac8ca77622bed1c2132a021f9

Moreover, we need a way to map authentication secrets to user IDs, so we also keep an auths key, whose value is a Hash mapping authentication secrets to user IDs.

HSET auths fea5e81ac8ca77622bed1c2132a021f9 1000
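Putting these pieces together, creating a user could look roughly like the following sketch. This is not the actual Retwis code; it just strings together the commands described above, reusing the gt(), goback(), getrand(), and redisLink() helpers that appear in the Retwis snippets later in this article.

include("retwis.php");

# A sketch of user creation: not the actual Retwis code, just the
# key layout described above expressed with the Predis client.
$r = redisLink();
$username = gt("username");
$password = gt("password");

if ($r->hexists("users",$username))
    goback("Sorry, the selected username is already in use.");

# Allocate a unique ID and store the user Hash.
$userid = $r->incr("next_user_id");
$authsecret = getrand();
$r->hmset("user:$userid",
    "username",$username,
    "password",$password,   # remember: hash passwords in a real app
    "auth",$authsecret);

# Secondary lookups: username -> ID and auth secret -> ID.
$r->hset("users",$username,$userid);
$r->hset("auths",$authsecret,$userid);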

In order to authenticate a user we'll do these simple steps (see the login.php file in the Retwis source code):

  • Get the username and password via the login form.
  • Check if the username field actually exists in the users Hash.
  • If it exists we have the user id, (i.e. 1000).
  • Check if the user:1000 password field matches; if not, return an error message.
  • Ok authenticated! Set "fea5e81ac8ca77622bed1c2132a021f9" (the value of user:1000 auth field) as the "auth" cookie.

This is the actual code:

include("retwis.php");

# Form sanity checks
if (!gt("username") || !gt("password"))
    goback("You need to enter both username and password to login.");

# The form is ok, check if the username is available
$username = gt("username");
$password = gt("password");
$r = redisLink();
$userid = $r->hget("users",$username);
if (!$userid)
    goback("Wrong username or password");
$realpassword = $r->hget("user:$userid","password");
if ($realpassword != $password)
    goback("Wrong username or password");

# Username / password OK, set the cookie and redirect to index.php
$authsecret = $r->hget("user:$userid","auth");
setcookie("auth",$authsecret,time()+3600*24*365);
header("Location: index.php");

This happens every time a user logs in, but we also need a function isLoggedIn in order to check if a given user is already authenticated or not. These are the logical steps performed by the isLoggedIn function:

  • Get the "auth" cookie from the user. If there is no cookie, the user is not logged in, of course. Let's call the value of the cookie <authcookie>.
  • Check if <authcookie> field in the auths Hash exists, and what the value (the user ID) is (1000 in the example).
  • To make the system more robust, verify that the user:1000 auth field also matches.
  • OK the user is authenticated, and we loaded a bit of information in the $User global variable.

The code is simpler than the description, possibly:

function isLoggedIn() {
    global $User, $_COOKIE;

    if (isset($User)) return true;

    if (isset($_COOKIE['auth'])) {
        $r = redisLink();
        $authcookie = $_COOKIE['auth'];
        # Map the cookie value to a user ID using the auths Hash.
        if ($userid = $r->hget("auths",$authcookie)) {
            # Double check against the authoritative user:<id> auth field.
            if ($r->hget("user:$userid","auth") != $authcookie) return false;
            loadUserInfo($userid);
            return true;
        }
    }
    return false;
}

function loadUserInfo($userid) {
    global $User;

    $r = redisLink();
    $User['id'] = $userid;
    $User['username'] = $r->hget("user:$userid","username");
    return true;
}

Having loadUserInfo as a separate function is overkill for our application, but it's a good approach in a complex application. The only thing that's missing from all the authentication is the logout. What do we do on logout? That's simple, we'll just change the random string in user:1000 auth field, remove the old authentication secret from the auths Hash, and add the new one.

Important: the logout procedure explains why we don't simply authenticate the user after looking up the authentication secret in the auths Hash, but instead double check it against the user:1000 auth field. The latter is the true authentication string, while the auths Hash is just a secondary lookup that may even be volatile or, if there are bugs in the program or a script gets interrupted, may end up with multiple entries pointing to the same user ID. The logout code is the following (logout.php):

include("retwis.php");

if (!isLoggedIn()) {
    header("Location: index.php");
    exit;
}

$r = redisLink();
$newauthsecret = getrand();
$userid = $User['id'];
$oldauthsecret = $r->hget("user:$userid","auth");

$r->hset("user:$userid","auth",$newauthsecret);
$r->hset("auths",$newauthsecret,$userid);
$r->hdel("auths",$oldauthsecret);

header("Location: index.php");

That is just what we described and should be simple to understand.

Updates

Updates, also known as posts, are even simpler. In order to create a new post in the database we do something like this:

INCR next_post_id => 10343
HMSET post:10343 user_id $owner_id time $time body "I'm having fun with Retwis"

As you can see, each post is just represented by a Hash with three fields: the ID of the user owning the post, the time at which the post was published, and the body of the post, which is the actual status message.

After we create a post and we obtain the post ID, we need to LPUSH the ID in the timeline of every user that is following the author of the post, and of course in the list of posts of the author itself (everybody is virtually following herself/himself). This is the file post.php that shows how this is performed:

include("retwis.php");

if (!isLoggedIn() || !gt("status")) {
    header("Location:index.php");
    exit;
}

$r = redisLink();
$postid = $r->incr("next_post_id");
$status = str_replace("\n"," ",gt("status"));
$r->hmset("post:$postid","user_id",$User['id'],"time",time(),"body",$status);
$followers = $r->zrange("followers:".$User['id'],0,-1);
$followers[] = $User['id']; /* Add the post to our own posts too */

foreach($followers as $fid) {
    $r->lpush("posts:$fid",$postid);
}
# Push the post on the timeline, and trim the timeline to the
# newest 1000 elements.
$r->lpush("timeline",$postid);
$r->ltrim("timeline",0,1000);

header("Location: index.php");

The core of the function is the foreach loop. We use ZRANGE to get all the followers of the current user, then the loop LPUSHes the post ID into every follower's timeline List.

Note that we also maintain a global timeline for all the posts, so that in the Retwis home page we can show everybody's updates easily. This requires just doing an LPUSH to the timeline List. Let's face it, aren't you starting to think it was a bit strange to have to sort things added in chronological order using ORDER BY with SQL? I think so.

There is an interesting thing to notice in the code above: we used a new command called LTRIM after we perform the LPUSH operation in the global timeline. This is used in order to trim the list to just 1000 elements. The global timeline is actually only used in order to show a few posts in the home page, there is no need to have the full history of all the posts.

Basically LTRIM + LPUSH is a way to create a capped collection in Redis.

Paginating updates

Now it should be pretty clear how we can use LRANGE in order to get ranges of posts, and render these posts on the screen. The code is simple:

function showPost($id) {
    $r = redisLink();
    $post = $r->hgetall("post:$id");
    if (empty($post)) return false;

    $userid = $post['user_id'];
    $username = $r->hget("user:$userid","username");
    $elapsed = strElapsed($post['time']);
    $userlink = "<a class=\"username\" href=\"profile.php?u=".urlencode($username)."\">".utf8entities($username)."</a>";

    echo('<div class="post">'.$userlink.' '.utf8entities($post['body'])."<br>");
    echo('<i>posted '.$elapsed.' ago via web</i></div>');
    return true;
}

function showUserPosts($userid,$start,$count) {
    $r = redisLink();
    # User ID -1 selects the global timeline instead of a user's posts.
    $key = ($userid == -1) ? "timeline" : "posts:$userid";
    # Fetch one extra element so we can tell if a next page exists.
    $posts = $r->lrange($key,$start,$start+$count);
    $c = 0;
    foreach($posts as $p) {
        if (showPost($p)) $c++;
        if ($c == $count) break;
    }
    # True if there are more posts to show after this page.
    return count($posts) == $count+1;
}

showPost will simply convert and print a post in HTML, while showUserPosts gets a range of posts and passes them to showPost.

Note: LRANGE is not very efficient if the list of posts starts to become very big and we want to access elements in the middle of the list, since Redis Lists are backed by linked lists. If a system is designed for deep pagination of millions of items, it is better to resort to Sorted Sets instead.
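As a sketch of that alternative (not what Retwis actually does), post IDs could be added to a Sorted Set with the post time as the score. The posts.by.time:1000 key below is hypothetical; both index-based and score-based access then stay fast even deep into the history:

ZADD posts.by.time:1000 1401267618 10343 (post ID 10343 with its unix time as score)
ZREVRANGE posts.by.time:1000 5000 5049 (the 5001st-5050th most recent post IDs)
ZREVRANGEBYSCORE posts.by.time:1000 1401267618 -inf LIMIT 0 50 (up to 50 posts not newer than the given time)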

Following users

It is not hard, but we have not yet checked how we create the following / follower relationships. If user ID 1000 (antirez) wants to follow user ID 5000 (pippo), we need to create both a following and a follower relationship. We just need two ZADD calls:

    ZADD following:1000 1401267618 5000
    ZADD followers:5000 1401267618 1000

Note the same pattern again and again. In theory, with a relational database the list of following and followers would be contained in a single table with fields like following_id and follower_id, and you could extract the followers or following of any user with an SQL query. With a key-value DB things are a bit different, since we need to set both the "1000 is following 5000" and "5000 is followed by 1000" relations. This is the price to pay, but on the other hand accessing the data is simpler and extremely fast. Having these as separate Sorted Sets also allows us to do interesting things. For example, using ZINTERSTORE we can compute the intersection of the followers of two different users, so we may add a feature to our Twitter clone that tells you very quickly, when you visit somebody else's profile, "you and Alice have 34 followers in common", and things like that.
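For instance, at the command level (the destination key name and the count of 34 are just illustrative):

ZINTERSTORE common:1000:5000 2 followers:1000 followers:5000
ZCARD common:1000:5000 => 34

If such intersections are computed often, the destination key could even be given a short time to live with EXPIRE so it acts as a small cache.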

You can find the code that sets or removes a following / follower relation in the follow.php file.

Making it horizontally scalable

Gentle reader, if you read till this point you are already a hero. Thank you. Before talking about scaling horizontally it is worth checking performance on a single server. Retwis is extremely fast, without any kind of cache. On a very slow and loaded server, an Apache benchmark with 100 parallel clients issuing 100000 requests measured the average pageview to take 5 milliseconds. This means you can serve millions of users every day with just a single Linux box, and this one was monkey ass slow... Imagine the results with more recent hardware.

However, you can't go with a single server forever, so how do you scale a key-value store?

Retwis does not perform any multi-key operation, so making it scalable is simple: you can use client-side sharding, a sharding proxy such as Twemproxy, or the upcoming Redis Cluster.

To learn more about these topics, please read our documentation about sharding. The point to stress here, however, is that in a key-value store, if you design with care, the data set is split among many small independent keys, and distributing those keys across multiple nodes is more straightforward and predictable than doing so with a semantically more complex database system.
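As a toy illustration of client-side sharding (not part of Retwis: the node list and the shardedLink() helper are hypothetical, and real deployments usually prefer consistent hashing, or the client library's own sharding support where available, so that adding nodes moves fewer keys):

require "predis/autoload.php";
Predis\Autoloader::register();

$nodes = array("tcp://10.0.0.1:6379", "tcp://10.0.0.2:6379", "tcp://10.0.0.3:6379");

function shardedLink($key) {
    global $nodes;
    # Hash the key to deterministically pick one of the nodes.
    $i = abs(crc32($key)) % count($nodes);
    return new Predis\Client($nodes[$i]);
}

# All accesses to a given key always land on the same node.
$r = shardedLink("user:1000");
$username = $r->hget("user:1000", "username");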