1 - Query syntax

Details of the query syntax

Search Query Syntax

We support a simple syntax for complex queries with the following rules:

  • Multi-word phrases are simply a list of tokens, e.g. foo bar baz, and imply intersection (AND) of the terms.
  • Exact phrases are wrapped in quotes, e.g. "hello world".
  • OR Unions (i.e word1 OR word2), are expressed with a pipe (|), e.g. hello|hallo|shalom|hola.
  • NOT negation (i.e. word1 NOT word2) of expressions or sub-queries. e.g. hello -world. As of version 0.19.3, purely negative queries (i.e. -foo or -@title:(foo|bar)) are supported.
  • Prefix matches (all terms starting with a prefix) are expressed with a *. For performance reasons, a minimum prefix length is enforced (2 by default, but is configurable)
  • A special "wildcard query" that returns all results in the index - * (cannot be combined with anything else).
  • Selection of specific fields using the syntax hello @field:world.
  • Numeric Range matches on numeric fields with the syntax @field:[{min} {max}].
  • Geo radius matches on geo fields with the syntax @field:[{lon} {lat} {radius} {m|km|mi|ft}]
  • Tag field filters with the syntax @field:{tag | tag | ...}. See the full documentation on tag fields below.
  • Optional terms or clauses: foo ~bar means bar is optional but documents with bar in them will rank higher.
  • Fuzzy matching on terms (as of v1.2.0): %hello% means all terms with Levenshtein distance of 1 from it.
  • An expression in a query can be wrapped in parentheses to disambiguate, e.g. (hello|hella) (world|werld).
  • Query attributes can be applied to individual clauses, e.g. (foo bar) => { $weight: 2.0; $slop: 1; $inorder: false; }
  • Combinations of the above can be used together, e.g. hello (world|foo) "bar baz" bbbb
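
For illustration, the following query (the index and field names here are hypothetical) combines several of the rules above: a plain term, a field-scoped union, a negated tag filter and a numeric range:

FT.SEARCH books "dune @author:(frank|herbert) -@genre:{fantasy} @price:[10 30]"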

Pure negative queries

As of version 0.19.3 it is possible to have a query consisting of just a negative expression, e.g. -hello or -(@title:(foo|bar)). The results will be all the documents NOT containing the query terms.

!!! warning Any complex expression can be negated this way, however, caution should be taken here: if a negative expression has little or no results, this is equivalent to traversing and ranking all the documents in the index, which can be slow and cause high CPU consumption.

Field modifiers

As of version 0.12 it is possible to specify field modifiers in the query and not just using the INFIELDS global keyword.

Per query expression or sub-expression, it is possible to specify which fields it matches, by prepending the expression with the @ symbol, the field name and a : (colon) symbol.

If a field modifier precedes multiple words or expressions, it applies only to the adjacent expression.

If a field modifier precedes an expression in parentheses, it applies only to the expression inside the parentheses. The expression should be valid for the specified field, otherwise it is skipped.

Multiple modifiers can be combined to create complex filtering on several fields. For example, if we have an index of car models, with a vehicle class, country of origin and engine type, we can search for SUVs made in Korea with hybrid or diesel engines - with the following query:

FT.SEARCH cars "@country:korea @engine:(diesel|hybrid) @class:suv"

Multiple modifiers can be applied to the same term or grouped terms. e.g.:

FT.SEARCH idx "@title|body:(hello world) @url|image:mydomain"

This will search for documents that have "hello" and "world" either in the body or the title, and the term "mydomain" in their url or image fields.

Numeric filters in query

If a field in the schema is defined as NUMERIC, it is possible to either use the FILTER argument in the Redis request, or to filter within the query itself by specifying filtering rules. The syntax is @field:[{min} {max}] - e.g. @price:[100 200].

A few notes on numeric predicates

  1. It is possible to specify a numeric predicate as the entire query, whereas it is impossible to do it with the FILTER argument.

  2. It is possible to intersect or union multiple numeric filters in the same query, be it for the same field or different ones.

  3. -inf, inf and +inf are acceptable numbers in a range. Thus greater-than 100 is expressed as [(100 inf].

  4. Numeric filters are inclusive. Exclusive min or max are expressed with ( prepended to the number, e.g. [(100 (200].

  5. It is possible to negate a numeric filter by prepending a - sign to the filter, e.g. returning a result where price differs from 100 is expressed as: @title:foo -@price:[100 100].
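
As a brief sketch (the index and field names are hypothetical), unioning and intersecting numeric filters in a single query could look like this:

# Price either between 100 and 200, or above 500:
FT.SEARCH products "@price:[100 200] | @price:[500 +inf]"

# Price greater than 100 but below 500 (negated filter):
FT.SEARCH products "@price:[(100 +inf] -@price:[500 +inf]"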

Tag filters

RediSearch (starting with version 0.91) allows a special field type called "tag field", with simpler tokenization and encoding in the index. The values in these fields cannot be accessed by general field-less search, and can be used only with a special syntax:

@field:{ tag | tag | ...}

e.g.

@cities:{ New York | Los Angeles | Barcelona }

Tags can have multiple words or include punctuation marks other than the field's separator (a comma by default). Punctuation marks in tags should be escaped with a backslash (\). It is also recommended (but not mandatory) to escape spaces; the reason is that if a multi-word tag includes stopwords, it will create a syntax error. So tags like "to be or not to be" should be escaped as "to\ be\ or\ not\ to\ be". For good measure, you can escape all spaces within tags.

Notice that multiple tags in the same clause create a union of documents containing any of the tags. To create an intersection of documents containing all tags, you should repeat the tag filter several times, e.g.:

# This will return all documents containing all three cities as tags:
@cities:{ New York } @cities:{Los Angeles} @cities:{ Barcelona }

# This will return all documents containing either city:
@cities:{ New York | Los Angeles | Barcelona }

Tag clauses can be combined into any sub-clause, used as negative expressions, optional expressions, etc.

Geo filters in query

As of version 0.21, it is possible to add geo radius queries directly into the query language with the syntax @field:[{lon} {lat} {radius} {m|km|mi|ft}]. This filters the result to a given radius from a lon,lat point, defined in meters, kilometers, miles or feet. See Redis' own GEORADIUS command for more details, as it is used internally to perform the search.

Radius filters can be added into the query just like numeric filters. For example, in a database of businesses, looking for Chinese restaurants near San Francisco (within a 5km radius) would be expressed as: chinese restaurant @location:[-122.41 37.77 5 km].

Vector Similarity search in query

It is possible to add vector similarity queries directly into the query language. The basic syntax is "*=>[ KNN {num|$num} @vector $query_vec ]" for running a K nearest neighbors query on the @vector field. It is also possible to run a hybrid query on filtered results.

A hybrid query allows the user to specify filter criteria that ALL results in a KNN query must satisfy. The filter criteria can only include fields with non-vector indexes (e.g. indexes created on scalar values such as TEXT, PHONETIC, NUMERIC, GEO, etc.).

The general syntax is {some filter query}=>[ KNN {num|$num} @vector $query_vec]. For example, @published_year:[2020 2021]=>[KNN 10 @vector_field $query_vec], where:

  • @published_year:[2020 2021] - Only entities published between 2020 and 2021.

  • => - Separates filter query from vector query.

  • [KNN {num|$num} @vector_field $query_vec] - Return the num entities whose @vector_field value is most similar to query_vec.

As of version 2.4, we allow vector similarity to be used once in the query. For more information on vector similarity syntax, see Vector Fields, "Querying vector fields" section.
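
As a rough sketch (the index, vector field, and parameter names here are hypothetical, and recent versions may also require the DIALECT 2 argument for parameterized queries), a pure KNN query and a hybrid query could look like this:

FT.SEARCH my_idx "*=>[KNN 10 @vec $query_vec]" PARAMS 2 query_vec "<binary blob>" DIALECT 2
FT.SEARCH my_idx "(@published_year:[2020 2021])=>[KNN 10 @vec $query_vec]" PARAMS 2 query_vec "<binary blob>" DIALECT 2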

Prefix matching

On index updating, we maintain a dictionary of all terms in the index. This can be used to match all terms starting with a given prefix. Selecting prefix matches is done by appending * to a prefix token. For example:

hel* world

Will be expanded to cover (hello|help|helm|...) world.

A few notes on prefix searches

  1. As prefixes can be expanded into very many terms, use them with caution. There is no magic going on, the expansion will create a Union operation of all suffixes.

  2. As a protective measure to avoid selecting too many terms and blocking Redis, which is single-threaded, there are two limitations on prefix matching (see the configuration example below):

  • Prefixes are limited to 2 letters or more. You can change this number by using the MINPREFIX setting on the module command line.

  • Expansion is limited to 200 terms or less. You can change this number by using the MAXEXPANSIONS setting on the module command line.

  3. Prefix matching fully supports Unicode and is case insensitive.

  4. Currently, there is no sorting or bias based on suffix popularity, but this is on the near-term roadmap.
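
For example, both limits can be raised by passing the settings as module arguments when loading RediSearch (the values below are arbitrary and shown only as a sketch):

redis-server --loadmodule ./redisearch.so MINPREFIX 3 MAXEXPANSIONS 500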

Fuzzy matching

As of v1.2.0, the dictionary of all terms in the index can also be used to perform Fuzzy Matching. Fuzzy matches are performed based on Levenshtein distance (LD). Fuzzy matching on a term is performed by surrounding the term with '%', for example:

%hello% world

Will perform fuzzy matching on 'hello' for all terms where LD is 1.

As of v1.4.0, the LD of the fuzzy match can be set by the number of '%' surrounding it, so that %%hello%% will perform fuzzy matching on 'hello' for all terms where LD is 2.

The maximal LD for fuzzy matching is 3.

Wildcard queries

As of version 1.1.0, we provide a special query to retrieve all the documents in an index. This is meant mostly for the aggregation engine. You can call it by specifying only a single star sign as the query string - i.e. FT.SEARCH myIndex *.

This cannot be combined with any other filters, field modifiers or anything inside the query. It is technically possible to use the deprecated FILTER and GEOFILTER request parameters outside the query string in conjunction with a wildcard, but this makes the wildcard meaningless and only hurts performance.

Query attributes

As of version 1.2.0, it is possible to apply specific query modifying attributes to specific clauses of the query.

The syntax is (foo bar) => { $attribute: value; $attribute:value; ...}, e.g:

(foo bar) => { $weight: 2.0; $slop: 1; $inorder: true; }
~(bar baz) => { $weight: 0.5; }

The supported attributes are:

  • $weight: determines the weight of the sub-query or token in the overall ranking of the result (default: 1.0).
  • $slop: determines the maximum allowed "slop" (space between terms) in the query clause (default: 0).
  • $inorder: whether or not the terms in a query clause must appear in the same order as in the query, usually set alongside $slop (default: false).
  • $phonetic: whether or not to perform phonetic matching (default: true). Note: setting this attribute for fields which were not created as PHONETIC will produce an error.

A few query examples

  • Simple phrase query - hello AND world

      hello world
    
  • Exact phrase query - hello FOLLOWED BY world

      "hello world"
    
  • Union: documents containing either hello OR world

      hello|world
    
  • Not: documents containing hello but not world

      hello -world
    
  • Intersection of unions

      (hello|halo) (world|werld)
    
  • Negation of union

      hello -(world|werld)
    
  • Union inside phrase

      (barack|barrack) obama
    
  • Optional terms with higher priority to ones containing more matches:

      obama ~barack ~michelle
    
  • Exact phrase in one field, one word in another field:

      @title:"barack obama" @job:president
    
  • Combined AND, OR with field specifiers:

      @title:"hello world" @body:(foo bar) @category:(articles|biographies)
    
  • Prefix Queries:

      hello worl*
    
      hel* worl*
    
      hello -worl*
    
  • Numeric Filtering - products named "tv" with a price range of 200-500:

      @name:tv @price:[200 500]
    
  • Numeric Filtering - users with age greater than 18:

      @age:[(18 +inf]
    

Mapping common SQL predicates to RediSearch

SQL Condition                               RediSearch Equivalent               Comments
WHERE x='foo' AND y='bar'                   @x:foo @y:bar                       for less ambiguity use (@x:foo) (@y:bar)
WHERE x='foo' AND y!='bar'                  @x:foo -@y:bar
WHERE x='foo' OR y='bar'                    (@x:foo)|(@y:bar)
WHERE x IN ('foo', 'bar','hello world')     @x:(foo|bar|"hello world")          quotes mean exact phrase
WHERE y='foo' AND x NOT IN ('foo','bar')    @y:foo (-@x:foo) (-@x:bar)
WHERE x NOT IN ('foo','bar')                -@x:(foo|bar)
WHERE num BETWEEN 10 AND 20                 @num:[10 20]
WHERE num >= 10                             @num:[10 +inf]
WHERE num > 10                              @num:[(10 +inf]
WHERE num < 10                              @num:[-inf (10]
WHERE num <= 10                             @num:[-inf 10]
WHERE num < 10 OR num > 20                  @num:[-inf (10] | @num:[(20 +inf]
WHERE name LIKE 'john%'                     @name:john*

Technical note

The query parser is built using the Lemon Parser Generator and a Ragel based lexer. You can see the grammar definition at the git repo.

2 - Stop-words

Stop-words support

Stop-Words

RediSearch has a pre-defined default list of stop-words. These are words that are usually so common that they do not add much information to search, but take up a lot of space and CPU time in the index.

When indexing, stop-words are discarded and not indexed. When searching, they are also ignored and treated as if they were not sent to the query processor. This is done when parsing the query.

At the moment, the default stop-word list applies to all full-text indexes in all languages and can be overridden manually at index creation time.

Default stop-word list

The following words are treated as stop-words by default:

 a,    is,    the,   an,   and,  are, as,  at,   be,   but,  by,   for,
 if,   in,    into,  it,   no,   not, of,  on,   or,   such, that, their,
 then, there, these, they, this, to,  was, will, with

Overriding the default stop-words

Stop-words for an index can be defined (or disabled completely) on index creation using the STOPWORDS argument of the FT.CREATE command.

The format is STOPWORDS {number} {stopword} ... where number is the number of stopwords given. The STOPWORDS argument must come before the SCHEMA argument. For example:

FT.CREATE myIndex STOPWORDS 3 foo bar baz SCHEMA title TEXT body TEXT 

Disabling stop-words completely

Disabling stopwords completely can be done by passing STOPWORDS 0 on FT.CREATE.
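
For example, using the same syntax as above, an index with stop-words disabled entirely would be created like this:

FT.CREATE myIndex STOPWORDS 0 SCHEMA title TEXT body TEXT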

Avoiding stop-word detection in search queries

In rare use cases, where queries are very long and are guaranteed by the client application to not contain stopwords, it is possible to avoid checking for them when parsing the query. This saves some CPU time and is only worth it if the query has dozens or more terms in it. Using this without verifying that the query doesn't contain stop-words might result in empty queries.

3 - Aggregations

Details of FT.AGGREGATE. Grouping and projections and functions

RediSearch Aggregations

Aggregations are a way to process the results of a search query, group, sort and transform them - and extract analytic insights from them. Much like aggregation queries in other databases and search engines, they can be used to create analytics reports, or perform Faceted Search style queries.

For example, indexing a web-server's logs, we can create a report for unique users by hour, country or any other breakdown; or create different reports for errors, warnings, etc.

Core concepts

The basic idea of an aggregate query is this:

  • Perform a search query, filtering for records you wish to process.
  • Build a pipeline of operations that transform the results by zero or more steps of:
    • Group and Reduce: grouping by fields in the results, and applying reducer functions on each group.
    • Sort: sort the results based on one or more fields.
    • Apply Transformations: Apply mathematical and string functions on fields in the pipeline, optionally creating new fields or replacing existing ones
    • Limit: Limit the number of results, regardless of whether the results are sorted.
    • Filter: Filter the results (post-query) based on predicates relating to its values.

The pipeline is dynamic and re-entrant, and every operation can be repeated. For example, you can group by property X, sort the top 100 results by group size, then group by property Y and sort the results by some other property, then apply a transformation on the output.

Figure 1: Aggregation Pipeline

Aggregate request format

The aggregate request's syntax is defined as follows:

FT.AGGREGATE
  {index_name:string}
  {query_string:string}
  [VERBATIM]
  [LOAD {nargs:integer} {property:string} ...]
  [GROUPBY
    {nargs:integer} {property:string} ...
    REDUCE
      {FUNC:string}
      {nargs:integer} {arg:string} ...
      [AS {name:string}]
    ...
  ] ...
  [SORTBY
    {nargs:integer} {string} ...
    [MAX {num:integer}] ...
  ] ...
  [APPLY
    {EXPR:string}
    AS {name:string}
  ] ...
  [FILTER {EXPR:string}] ...
  [LIMIT {offset:integer} {num:integer} ] ...
  [PARAMS {nargs} {name} {value} ... ]

Parameters in detail

Parameters which may take a variable number of arguments are expressed in the form of param {nargs} {property_1... property_N}. The first argument to the parameter is the number of arguments following the parameter. This allows RediSearch to avoid a parsing ambiguity in case one of your arguments has the name of another parameter. For example, to sort by first name, last name, and country, one would specify SORTBY 6 firstName ASC lastName DESC country ASC.

  • index_name: The index the query is executed against.

  • query_string: The base filtering query that retrieves the documents. It follows the exact same syntax as the search query, including filters, unions, not, optional, etc.

  • LOAD {nargs} {property} …: Load document fields from the document HASH objects. This should be avoided as a general rule of thumb. Fields needed for aggregations should be stored as SORTABLE (and optionally UNF to avoid any normalization), where they are available to the aggregation pipeline with very low latency. LOAD hurts the performance of aggregate queries considerably since every processed record needs to execute the equivalent of HMGET against a redis key, which when executed over millions of keys, amounts to very high processing times. The document ID can be loaded using @__key.

  • GROUPBY {nargs} {property}: Group the results in the pipeline based on one or more properties. Each group should have at least one reducer (See below), a function that handles the group entries, either counting them or performing multiple aggregate operations (see below).

  • REDUCE {func} {nargs} {arg} … [AS {name}]: Reduce the matching results in each group into a single record, using a reduction function. For example, COUNT will count the number of records in the group. See the Reducers section below for more details on available reducers.

    The reducers can have their own property names using the AS {name} optional argument. If a name is not given, the resulting name will be the name of the reduce function and the group properties. For example, if a name is not given to COUNT_DISTINCT by property @foo, the resulting name will be count_distinct(@foo).

  • SORTBY {nargs} {property} {ASC|DESC} [MAX {num}]: Sort the pipeline up until the point of SORTBY, using a list of properties. By default, sorting is ascending, but ASC or DESC can be added for each property. nargs is the number of sorting parameters, including ASC and DESC. For example: SORTBY 4 @foo ASC @bar DESC.

    MAX is used to optimize sorting, by sorting only for the n-largest elements. Although it is not connected to LIMIT, you usually need just SORTBY … MAX for common queries.

  • APPLY {expr} AS {name}: Apply a 1-to-1 transformation on one or more properties, and either store the result as a new property down the pipeline, or replace any property using this transformation. expr is an expression that can be used to perform arithmetic operations on numeric properties, or functions that can be applied on properties depending on their types (see below), or any combination thereof. For example: APPLY "sqrt(@foo)/log(@bar) + 5" AS baz will evaluate this expression dynamically for each record in the pipeline and store the result as a new property called baz, that can be referenced by further APPLY / SORTBY / GROUPBY / REDUCE operations down the pipeline.

  • LIMIT {offset} {num}. Limit the number of results to return just num results starting at index offset (zero based). As mentioned above, it is much more efficient to use SORTBY … MAX if you are interested in just limiting the output of a sort operation.

    However, limit can be used to limit results without sorting, or for paging the n-largest results as determined by SORTBY MAX. For example, getting results 50-100 of the top 100 results is most efficiently expressed as SORTBY 1 @foo MAX 100 LIMIT 50 50. Removing the MAX from SORTBY will result in the pipeline sorting all the records and then paging over results 50-100.

  • FILTER {expr}. Filter the results using predicate expressions relating to values in each result. They are applied post-query and relate to the current state of the pipeline. See FILTER Expressions below for full details.

  • PARAMS {nargs} {name} {value}. Define one or more value parameters. Each parameter has a name and a value. Parameters can be referenced in the query string by a $, followed by the parameter name, e.g., $user, and each such reference in the search query to a parameter name is substituted by the corresponding parameter value. For example, with parameter definition PARAMS 4 lon 29.69465 lat 34.95126, the expression @loc:[$lon $lat 10 km] would be evaluated to @loc:[29.69465 34.95126 10 km]. Parameters cannot be referenced in the query string where concrete values are not allowed, such as in field names, e.g., @loc.
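
As a minimal sketch of the geo example above (the index and field names are hypothetical, and recent versions may also require the DIALECT 2 argument for parameterized queries):

FT.AGGREGATE myIdx "@loc:[$lon $lat 10 km]" PARAMS 4 lon 29.69465 lat 34.95126 DIALECT 2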

Quick example

Let's assume we have log of visits to our website, each record containing the following fields/properties:

  • url (text, sortable)
  • timestamp (numeric, sortable) - unix timestamp of visit entry.
  • country (tag, sortable)
  • user_id (text, sortable, not indexed)

Example 1: unique users by hour, ordered chronologically.

First of all, we want all records in the index, because why not. The first step is to determine the index name and the filtering query. A filter query of * means "get all records":

FT.AGGREGATE myIndex "*"

Now we want to group the results by hour. Since we have the visit times as unix timestamps in second resolution, we need to extract the hour component of the timestamp. So we first add an APPLY step that strips the sub-hour information from the timestamp and stores it as a new property, hour:

FT.AGGREGATE myIndex "*"
  APPLY "@timestamp - (@timestamp % 3600)" AS hour

Now we want to group the results by hour, and count the distinct user ids in each hour. This is done by a GROUPBY/REDUCE step:

FT.AGGREGATE myIndex "*"
  APPLY "@timestamp - (@timestamp % 3600)" AS hour
  
  GROUPBY 1 @hour
  	REDUCE COUNT_DISTINCT 1 @user_id AS num_users

Now we'd like to sort the results by hour, ascending:

FT.AGGREGATE myIndex "*"
  APPLY "@timestamp - (@timestamp % 3600)" AS hour
  
  GROUPBY 1 @hour
  	REDUCE COUNT_DISTINCT 1 @user_id AS num_users
  	
  SORTBY 2 @hour ASC

And as a final step, we can format the hour as a human readable timestamp. This is done by calling the transformation function timefmt that formats unix timestamps. You can specify a format to be passed to the system's strftime function (see documentation), but not specifying one is equivalent to specifying %FT%TZ to strftime.

FT.AGGREGATE myIndex "*"
  APPLY "@timestamp - (@timestamp % 3600)" AS hour
  
  GROUPBY 1 @hour
  	REDUCE COUNT_DISTINCT 1 @user_id AS num_users
  	
  SORTBY 2 @hour ASC
  
  APPLY timefmt(@hour) AS hour

Example 2: Sort visits to a specific URL by day and country:

In this example we filter by the url, transform the timestamp to its day part, and group by the day and country, simply counting the number of visits per group, sorted by day ascending and country descending.

FT.AGGREGATE myIndex "@url:\"about.html\""
    APPLY "@timestamp - (@timestamp % 86400)" AS day
    GROUPBY 2 @day @country
    	REDUCE count 0 AS num_visits 
    SORTBY 4 @day ASC @country DESC

GROUPBY reducers

GROUPBY steps work similarly to SQL GROUP BY clauses, and create groups of results based on one or more properties in each record. For each group, we return the "group keys", or the values common to all records in the group, by which they were grouped together - along with the results of zero or more REDUCE clauses.

Each GROUPBY step in the pipeline may be accompanied by zero or more REDUCE clauses. Reducers apply some accumulation function to each record in the group and reduce them into a single record representing the group. When we are finished processing all the records upstream of the GROUPBY step, each group emits its reduced record.

For example, the simplest reducer is COUNT, which simply counts the number of records in each group.

If multiple REDUCE clauses exist for a single GROUPBY step, each reducer works independently on each result and writes its final output once. Each reducer may have its own alias determined using the AS optional parameter. If AS is not specified, the alias is the reduce function and its parameters, e.g. count_distinct(foo,bar).
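
As a sketch (the field names are hypothetical), a single GROUPBY step with two reducers, one aliased and one left with its default name (count_distinct(@user_id)), could look like this:

FT.AGGREGATE myIndex "*"
  GROUPBY 1 @country
    REDUCE COUNT 0 AS num_visits
    REDUCE COUNT_DISTINCT 1 @user_id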

Supported GROUPBY reducers

COUNT

Format

REDUCE COUNT 0

Description

Count the number of records in each group

COUNT_DISTINCT

Format

REDUCE COUNT_DISTINCT 1 {property}

Description

Count the number of distinct values for property.

!!! note The reducer creates a hash-set per group, and hashes each record. This can be memory heavy if the groups are big.

COUNT_DISTINCTISH

Format

REDUCE COUNT_DISTINCTISH 1 {property}

Description

Same as COUNT_DISTINCT - but provides an approximation instead of an exact count, at the expense of less memory and CPU in big groups.

!!! note The reducer uses HyperLogLog counters per group, at ~3% error rate, and 1024 Bytes of constant space allocation per group. This means it is ideal for a few huge groups and not ideal for many small groups. In the former case, it can be an order of magnitude faster and consume much less memory than COUNT_DISTINCT, but again, it does not fit every use case.

SUM

Format

REDUCE SUM 1 {property}

Description

Return the sum of all numeric values of a given property in a group. Non-numeric values in the group are counted as 0.

MIN

Format

REDUCE MIN 1 {property}

Description

Return the minimal value of a property, whether it is a string, number or NULL.

MAX

Format

REDUCE MAX 1 {property}

Description

Return the maximal value of a property, whether it is a string, number or NULL.

AVG

Format

REDUCE AVG 1 {property}

Description

Return the average value of a numeric property. This is equivalent to reducing by sum and count, and later on applying the ratio of them as an APPLY step.

STDDEV

Format

REDUCE STDDEV 1 {property}

Description

Return the standard deviation of a numeric property in the group.

QUANTILE

Format

REDUCE QUANTILE 2 {property} {quantile}

Description

Return the value of a numeric property at a given quantile of the results. Quantile is expressed as a number between 0 and 1. For example, the median can be expressed as the quantile at 0.5, e.g. REDUCE QUANTILE 2 @foo 0.5 AS median .

If multiple quantiles are required, just repeat the QUANTILE reducer for each quantile. e.g. REDUCE QUANTILE 2 @foo 0.5 AS median REDUCE QUANTILE 2 @foo 0.99 AS p99

TOLIST

Format

REDUCE TOLIST 1 {property}

Description

Merge all distinct values of a given property into a single array.

FIRST_VALUE

Format

REDUCE FIRST_VALUE {nargs} {property} [BY {property} [ASC|DESC]]

Description

Return the first or top value of a given property in the group, optionally by comparing that or another property. For example, you can extract the name of the oldest user in the group:

REDUCE FIRST_VALUE 4 @name BY @age DESC

If no BY is specified, we return the first value we encounter in the group.

If you wish to get the top or bottom value in the group sorted by the same value, you are better off using the MIN/MAX reducers, but the same effect will be achieved by doing REDUCE FIRST_VALUE 4 @foo BY @foo DESC.

RANDOM_SAMPLE

Format

REDUCE RANDOM_SAMPLE {nargs} {property} {sample_size}

Description

Perform a reservoir sampling of the group elements with a given size, and return an array of the sampled items with an even distribution.

APPLY expressions

APPLY performs a 1-to-1 transformation on one or more properties in each record. It either stores the result as a new property down the pipeline, or replaces any property using this transformation.

The transformations are expressed as a combination of arithmetic expressions and built-in functions. Evaluating functions and expressions is recursively nested and can be composed without limit. For example: sqrt(log(foo) * floor(@bar/baz)) + (3^@qaz % 6) or simply @foo/@bar.

If an expression or a function is applied to values that do not match the expected types, no error is emitted but a NULL value is set as the result.

APPLY steps must have an explicit alias determined by the AS parameter.

Literals inside expressions

  • Numbers are expressed as integers or floating point numbers, i.e. 2, 3.141, -34, etc. inf and -inf are acceptable as well.
  • Strings are quoted with either single or double quotes. Single quotes are acceptable inside strings quoted with double quotes and vice versa. Punctuation marks can be escaped with backslashes. e.g. "foo's bar" ,'foo\'s bar', "foo \"bar\"" .
  • Any literal or sub expression can be wrapped in parentheses to resolve ambiguities of operator precedence.

Arithmetic operations

For numeric expressions and properties, we support addition (+), subtraction (-), multiplication (*), division (/), modulo (%) and power (^). We currently do not support bitwise logical operators.

Note that these operators apply only to numeric values and numeric sub expressions. Any attempt to multiply a string by a number, for instance, will result in a NULL output.
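
As a small sketch (the field names are hypothetical), an APPLY step combining several of these operators could look like this:

FT.AGGREGATE myIndex "*"
  APPLY "(@price - @discount) * @quantity" AS order_total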

List of field APPLY functions

  • exists(s): Checks whether a field exists in a document, e.g. exists(@field)

List of numeric APPLY functions

  • log(x): Return the logarithm of a number, property or sub-expression, e.g. log(@foo)
  • abs(x): Return the absolute value of a numeric expression, e.g. abs(@foo-@bar)
  • ceil(x): Round to the smallest value not less than x, e.g. ceil(@foo/3.14)
  • floor(x): Round to the largest value not greater than x, e.g. floor(@foo/3.14)
  • log2(x): Return the logarithm of x to base 2, e.g. log2(2^@foo)
  • exp(x): Return the exponent of x, i.e. e^x, e.g. exp(@foo)
  • sqrt(x): Return the square root of x, e.g. sqrt(@foo)

List of string APPLY functions

  • upper(s): Return the uppercase conversion of s, e.g. upper('hello world')
  • lower(s): Return the lowercase conversion of s, e.g. lower("HELLO WORLD")
  • startswith(s1,s2): Return 1 if s2 is a prefix of s1, 0 otherwise, e.g. startswith(@field, "company")
  • contains(s1,s2): Return the number of occurrences of s2 in s1, or 0 if there are none. If s2 is an empty string, return length(s1) + 1. e.g. contains(@field, "pa")
  • substr(s, offset, count): Return the substring of s, starting at offset and having count characters. If offset is negative, it represents the distance from the end of the string. If count is -1, it means "the rest of the string starting at offset". e.g. substr("hello", 0, 3), substr("hello", -2, -1)
  • format(fmt, ...): Use the arguments following fmt to format a string. Currently the only format specifier supported is %s and it applies to all types of arguments. e.g. format("Hello, %s, you are %s years old", @name, @age)
  • matched_terms([max_terms=100]): Return the query terms that matched for each record (up to 100), as a list. If a limit is specified, we will return the first N matches we find, based on query order. e.g. matched_terms()
  • split(s, [sep=","], [strip=" "]): Split a string by any character in the string sep, and strip any characters in strip. If only s is specified, we split by commas and strip spaces. The output is an array. e.g. split("foo,bar")
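
As a sketch (the field names are hypothetical), string functions can be combined in APPLY steps like this:

FT.AGGREGATE myIndex "*"
  APPLY "upper(@last_name)" AS last_upper
  APPLY "format(\"%s, %s\", @last_name, @first_name)" AS display_name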

List of date/time APPLY functions

  • timefmt(x, [fmt]): Return a formatted time string based on a numeric timestamp value x. See strftime for formatting options. Not specifying fmt is equivalent to %FT%TZ.
  • parsetime(timestr, [fmt]): The opposite of timefmt() - parse a formatted time string using a given format string.
  • day(timestamp): Round a Unix timestamp to midnight (00:00) at the start of the current day.
  • hour(timestamp): Round a Unix timestamp to the beginning of the current hour.
  • minute(timestamp): Round a Unix timestamp to the beginning of the current minute.
  • month(timestamp): Round a Unix timestamp to the beginning of the current month.
  • dayofweek(timestamp): Convert a Unix timestamp to the day of week number (Sunday = 0).
  • dayofmonth(timestamp): Convert a Unix timestamp to the day of month number (1 .. 31).
  • dayofyear(timestamp): Convert a Unix timestamp to the day of year number (0 .. 365).
  • year(timestamp): Convert a Unix timestamp to the current year (e.g. 2018).
  • monthofyear(timestamp): Convert a Unix timestamp to the current month (0 .. 11).
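
For example (a sketch with hypothetical field names), rounding visit timestamps to the start of their day and formatting the result could look like this:

FT.AGGREGATE myIndex "*"
  APPLY "day(@timestamp)" AS day_start
  APPLY "timefmt(@day_start)" AS day_str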

List of geo APPLY functions

geodistance returns the distance in meters between two points. Each point can be given as a geo field, a "lon,lat" string, or a pair of lon,lat numbers:

  • geodistance(field,field), e.g. geodistance(@field1,@field2)
  • geodistance(field,"lon,lat"), e.g. geodistance(@field,"1.2,-3.4")
  • geodistance(field,lon,lat), e.g. geodistance(@field,1.2,-3.4)
  • geodistance("lon,lat",field), e.g. geodistance("1.2,-3.4",@field)
  • geodistance("lon,lat","lon,lat"), e.g. geodistance("1.2,-3.4","5.6,-7.8")
  • geodistance("lon,lat",lon,lat), e.g. geodistance("1.2,-3.4",5.6,-7.8)
  • geodistance(lon,lat,field), e.g. geodistance(1.2,-3.4,@field)
  • geodistance(lon,lat,"lon,lat"), e.g. geodistance(1.2,-3.4,"5.6,-7.8")
  • geodistance(lon,lat,lon,lat), e.g. geodistance(1.2,-3.4,5.6,-7.8)
FT.AGGREGATE myIdx "*"  LOAD 1 location  APPLY "geodistance(@location,\"-1.1,2.2\")" AS dist

To print out the distance:

FT.AGGREGATE myIdx "*"  LOAD 1 location  APPLY "geodistance(@location,\"-1.1,2.2\")" AS dist

Note: Geo field must be preloaded using LOAD.

Results can also be sorted by distance:

FT.AGGREGATE idx "*" LOAD 1 @location FILTER "exists(@location)" APPLY "geodistance(@location,-117.824722,33.68590)" AS dist SORTBY 2 @dist DESC

Note: Make sure no location is missing, otherwise the SORTBY will not return any result. Use FILTER to make sure you do the sorting on all valid locations.

FILTER expressions

FILTER expressions filter the results using predicates relating to values in the result set.

The FILTER expressions are evaluated post-query and relate to the current state of the pipeline. Thus they can be useful to prune the results based on group calculations. Note that the filters are not indexed and will not speed the processing per se.

Filter expressions follow the syntax of APPLY expressions, with the addition of the conditions ==, !=, <, <=, >, >=. Two or more predicates can be combined with logical AND (&&) and OR (||). A single predicate can be negated with a NOT prefix (!).

For example, filtering all results where the user name is 'foo' and the age is less than 20 is expressed as:

FT.AGGREGATE 
  ...
  FILTER "@name=='foo' && @age < 20"
  ...

Several filter steps can be added, although at the same stage in the pipeline, it is more efficient to combine several predicates into a single filter step.

Cursor API

FT.AGGREGATE ... WITHCURSOR [COUNT {read size} MAXIDLE {idle timeout}]
FT.CURSOR READ {idx} {cid} [COUNT {read size}]
FT.CURSOR DEL {idx} {cid}

You can use cursors with FT.AGGREGATE, with the WITHCURSOR keyword. Cursors allow you to consume only part of the response, allowing you to fetch additional results as needed. This is much quicker than using LIMIT with offset, since the query is executed only once, and its state is stored on the server.

To use cursors, specify the WITHCURSOR keyword in FT.AGGREGATE, e.g.

FT.AGGREGATE idx * WITHCURSOR

This will return a response of an array with two elements. The first element is the actual (partial) results, and the second is the cursor ID. The cursor ID can then be fed to FT.CURSOR READ repeatedly, until the cursor ID is 0, in which case all results have been returned.

To read from an existing cursor, use FT.CURSOR READ, e.g.

FT.CURSOR READ idx 342459320

Assuming 342459320 is the cursor ID returned from the FT.AGGREGATE request.

Here is an example in pseudo-code:

response, cursor = FT.AGGREGATE "idx" "redis" "WITHCURSOR";
while (1) {
  processResponse(response)
  if (!cursor) {
    break;
  }
  response, cursor = FT.CURSOR read "idx" cursor
}

Note that even if the cursor is 0, a partial result may still be returned.

Cursor settings

Read size

You can control how many rows are read per each cursor fetch by using the COUNT parameter. This parameter can be specified both in FT.AGGREGATE (immediately after WITHCURSOR) or in FT.CURSOR READ.

FT.AGGREGATE idx query WITHCURSOR COUNT 10

Will read 10 rows at a time.

You can override this setting by also specifying COUNT in CURSOR READ, e.g.

FT.CURSOR READ idx 342459320 COUNT 50

Will return at most 50 results.

The default read size is 1000

Timeouts and limits

Because cursors are stateful resources which occupy memory on the server, they have a limited lifetime. In order to safeguard against orphaned/stale cursors, cursors have an idle timeout value. If no activity occurs on the cursor before the idle timeout, the cursor is deleted. The idle timer resets to 0 whenever the cursor is read from using CURSOR READ.

The default idle timeout is 300000 milliseconds (or 300 seconds). You can modify the idle timeout using the MAXIDLE keyword when creating the cursor. Note that the value cannot exceed the default 300s.

FT.AGGREGATE idx query WITHCURSOR MAXIDLE 10000

Will set the limit for 10 seconds.

Other cursor commands

Cursors can be explicitly deleted using the CURSOR DEL command, e.g.

FT.CURSOR DEL idx 342459320

Note that cursors are automatically deleted if all their results have been returned, or if they have been timed out.

All idle cursors can be forcefully purged at once using the FT.CURSOR GC idx 0 command. By default, RediSearch uses a lazy throttled approach to garbage collection, which collects idle cursors every 500 operations, or every second - whichever is later.

4 - Tokenization

Controlling Text Tokenization and Escaping

Controlling Text Tokenization and Escaping

At the moment, RediSearch uses a very simple tokenizer for documents and a slightly more sophisticated tokenizer for queries. Both allow a degree of control over string escaping and tokenization.

Note: There is a different mechanism for tokenizing text and tag fields, this document refers only to text fields. For tag fields please refer to the Tag Fields documentation.

The rules of text field tokenization

  1. All punctuation marks and whitespace (besides underscores) separate the document and queries into tokens. e.g. any character of ,.<>{}[]"':;!@#$%^&*()-+=~ will break the text into terms. So the text foo-bar.baz...bag will be tokenized into [foo, bar, baz, bag]

  2. Escaping separators in both queries and documents is done by prepending a backslash to any separator. For example, the text hello\-world hello-world will be tokenized as [hello-world, hello, world]. NOTE that in most languages you will need an extra backslash when formatting the document or query, to signify an actual backslash, so the actual text in redis-cli for example, will be entered as hello\\-world (see the example after this list).

  3. Underscores (_) are not used as separators in either document or query. So the text hello_world will remain as is after tokenization.

  4. Repeating spaces or punctuation marks are stripped.

  5. In Latin characters, everything gets converted to lowercase.

  6. A backslash before the first digit of a number will tokenize it as a term. In that case the - sign is translated as NOT, which otherwise would make the number negative. Add a backslash before . if you are searching for a float. (e.g. -20 -> {-20} vs -\20 -> {NOT{20}})
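
As a minimal sketch of rule 2 in redis-cli (the index name and key prefix are hypothetical), note the doubled backslash in both the document value and the query:

FT.CREATE myIdx ON HASH PREFIX 1 doc: SCHEMA body TEXT
HSET doc:1 body "hello\\-world"
FT.SEARCH myIdx "hello\\-world"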

5 - Sorting

Support for sorting query results

Sorting by Indexed Fields

As of RediSearch 0.15, it is possible to bypass the scoring function mechanism, and order search results by the value of different document properties (fields) directly - even if the sorting field is not used by the query. For example, you can search for first name and sort by last name.

Declaring Sortable Fields

When creating the index with FT.CREATE, you can declare TEXT and NUMERIC properties to be SORTABLE. When a property is sortable, we can later decide to order the results by its values. For example, in the following schema:

> FT.CREATE users SCHEMA first_name TEXT last_name TEXT SORTABLE age NUMERIC SORTABLE

The fields last_name and age are sortable, but first_name isn't. This means we can search by either first and/or last name, and sort by last name or age.

Note on sortable TEXT fields

In the current implementation, when declaring a sortable field, its content gets copied into a special location in the index, for fast access on sorting. This means that making long text fields sortable is very expensive, and you should be careful with it.

Normalization (UNF option)

By default, text fields get normalized and lowercased in a Unicode-safe way when stored for sorting. This means that America and america are considered equal in terms of sorting.

Using the argument UNF (un-normalized form) it is possible to disable the normalization and keep the original form of the value. Therefore, America will come before america.
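
For example, a sketch of declaring an un-normalized sortable text field (the UNF keyword follows SORTABLE):

FT.CREATE users SCHEMA last_name TEXT SORTABLE UNF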

Specifying SORTBY

If an index includes sortable fields, you can add the SORTBY parameter to the search request (outside the query body), and order the results by it. This overrides the scoring function mechanism, and the two cannot be combined. If WITHSCORES is specified along with SORTBY, the scores returned are simply the relative position of each result in the result set.

The syntax for SORTBY is:

SORTBY {field_name} [ASC|DESC]

  • field_name must be a sortable field defined in the schema.

  • ASC means the order will be ascending, DESC that it will be descending.

  • The default ordering is ASC if not specified otherwise.

Quick example

> FT.CREATE users SCHEMA first_name TEXT SORTABLE last_name TEXT age NUMERIC SORTABLE

# Add some users
> FT.ADD users user1 1.0 FIELDS first_name "alice" last_name "jones" age 35
> FT.ADD users user2 1.0 FIELDS first_name "bob" last_name "jones" age 36

# Searching while sorting

# Searching by last name and sorting by first name
> FT.SEARCH users "@last_name:jones" SORTBY first_name DESC

# Searching by both first and last name, and sorting by age
> FT.SEARCH users "alice jones" SORTBY age ASC

6 - Tags

Details about tag fields

Tag Fields

Tag fields are similar to full-text fields but use simpler tokenization and encoding in the index. The values in these fields cannot be accessed by general field-less search and can be used only with a special syntax.

The main differences between tag and full-text fields are:

  1. We do not perform stemming on tag indexes.

  2. The tokenization is simpler: The user can determine a separator (defaults to a comma) for multiple tags, and we only do whitespace trimming at the end of tags. Thus, tags can contain spaces, punctuation marks, accents, etc.

  3. The only two transformations we perform are lower-casing (for latin languages only as of now) and whitespace trimming. Lower-case transformation can be disabled by passing CASESENSITIVE.

  4. Tags cannot be found from a general full-text search. If a document has a field called "tags" with the values "foo" and "bar", searching for foo or bar without a special tag modifier (see below) will not return this document.

  5. The index is much simpler and more compressed: We do not store frequencies, offset vectors or field flags. The index contains only document IDs encoded as deltas. This means that an entry in a tag index is usually one or two bytes long. This makes them very memory efficient and fast.

  6. We can create up to 1024 tag fields per index.

Creating a tag field

Tag fields can be added to the index schema with the following FT.CREATE syntax:

FT.CREATE ... SCHEMA ... {field_name} TAG [SEPARATOR {sep}] [CASESENSITIVE]

SEPARATOR defaults to a comma (,), and can be any printable ASCII character. CASESENSITIVE can be specified to keep the original letter casing. For example:

FT.CREATE idx ON HASH PREFIX 1 test: SCHEMA tags TAG SEPARATOR ";"

Querying tag fields

As mentioned above, just searching for a tag without any modifiers will not retrieve documents containing it.

The syntax for matching tags in a query is as follows (the curly braces are part of the syntax in this case):

   @<field_name>:{ <tag> | <tag> | ...}

For example, this query finds documents with either the tag hello world or foo bar:

    FT.SEARCH idx "@tags:{ hello world | foo bar }"

Tag clauses can be combined into any sub-clause, used as negative expressions, optional expressions, etc. For example, given the following index:

FT.CREATE idx ON HASH PREFIX 1 test: SCHEMA title TEXT price NUMERIC tags TAG SEPARATOR ";"

You can combine a full-text search on the title field, a numerical range on price, and match either the foo bar or hello world tag like this:

FT.SEARCH idx "@title:hello @price:[0 100] @tags:{ foo bar | hello world }

Tags support prefix matching with the regular * character:

FT.SEARCH idx "@tags:{ hell* }"
FT.SEARCH idx "@tags:{ hello\\ w* }"

Multiple tags in a single filter

Notice that including multiple tags in the same clause creates a union of all documents that contain any of the included tags. To create an intersection of documents containing all of the given tags, you should repeat the tag filter several times.

For example, imagine an index of travellers, with a tag field for the cities each traveller has visited:

FT.CREATE myIndex ON HASH PREFIX 1 traveller: SCHEMA name TEXT cities TAG

HSET traveller:1 name "John Doe" cities "New York, Barcelona, San Francisco"

For this index, the following query will return all the people who visited at least one of the following cities:

FT.SEARCH myIndex "@cities:{ New York | Los Angeles | Barcelona }"

But the next query will return all people who have visited all three cities:

FT.SEARCH myIndex "@cities:{ New York } @cities:{Los Angeles} @cities:{ Barcelona }"

Including punctuation in tags

A tag can include punctuation other than the field's separator (by default, a comma). You do not need to escape punctuation when using the HSET command to add the value to a Redis Hash.

For example, given the following index:

FT.CREATE punctuation ON HASH PREFIX 1 test: SCHEMA tags TAG

You can add tags that contain punctuation like this:

HSET test:1 tags "Andrew's Top 5,Justin's Top 5"

However, when you query for tags that contain punctuation, you must escape that punctuation with a backslash character (\).

NOTE: In most languages you will need an extra backslash. This is also the case in the redis-cli.

For example, querying for the tag Andrew's Top 5 in the redis-cli looks like this:

FT.SEARCH punctuation "@tags:{ Andrew\\'s Top 5 }"

Tags that contain multiple words

As the examples in this document show, a single tag can include multiple words. We recommend that you escape spaces when querying, though doing so is not required.

You escape spaces the same way that you escape punctuation -- by preceding the space with a backslash character (or two backslashes, depending on the programming language and environment).

Thus, you would escape the tag "to be or not to be" like so when querying in the redis-cli:

FT.SEARCH idx "@tags:{ to\\ be\\ or\\ not\\ to\\ be }"

You should escape spaces because if a tag includes multiple words and some of them are stop words like "to" or "be," a query that includes these words without escaping spaces will create a syntax error.

You can see what that looks like in the following example:

127.0.0.1:6379> FT.SEARCH idx "@tags:{ to be or not to be }"
(error) Syntax error at offset 27 near be

NOTE: Stop words are words that are so common that a search engine ignores them. We have a dedicated page about stop words in RediSearch if you would like to learn more.

Given the potential for syntax errors, we recommend that you escape all spaces within tag queries.

7 - Highlighting

Highlighting full-text results

Highlighting API

The highlighting API allows you to have only the relevant portions of a document matching a search query returned as a result. This allows users to quickly see how a document relates to their query, with the search terms highlighted, usually in bold letters.

RediSearch implements high performance highlighting and summarization algorithms, with the following API:

Command syntax

FT.SEARCH ...
    SUMMARIZE [FIELDS {num} {field}] [FRAGS {numFrags}] [LEN {fragLen}] [SEPARATOR {sepstr}]
    HIGHLIGHT [FIELDS {num} {field}] [TAGS {openTag} {closeTag}]

There are two sub-commands used for highlighting. One is HIGHLIGHT, which surrounds matching text with an open and/or close tag, and the other is SUMMARIZE, which splits a field into contextual fragments surrounding the found terms. It is possible to summarize a field, highlight a field, or perform both actions in the same query.

Summarization

FT.SEARCH ...
    SUMMARIZE [FIELDS {num} {field}] [FRAGS {numFrags}] [LEN {fragLen}] [SEPARATOR {sepStr}]

Summarization will fragment the text into smaller sized snippets; each snippet will contain the found term(s) and some additional surrounding context.

RediSearch can perform summarization using the SUMMARIZE keyword. If no additional arguments are passed, all returned fields are summarized using built-in defaults.

The SUMMARIZE keyword accepts the following arguments:

  • FIELDS: If present, must be the first argument. This should be followed by the number of fields to summarize, which itself is followed by a list of fields. Each field present is summarized. If no FIELDS directive is passed, then all fields returned are summarized.

  • FRAGS: How many fragments should be returned. If not specified, a default of 3 is used.

  • LEN The number of context words each fragment should contain. Context words surround the found term. A higher value will return a larger block of text. If not specified, the default value is 20.

  • SEPARATOR The string used to divide between individual summary snippets. The default is ... which is common among search engines; but you may override this with any other string if you desire to programmatically divide them later on. You may use a newline sequence, as newlines are stripped from the result body anyway (thus, it will not be conflated with an embedded newline in the text)

Highlighting

FT.SEARCH ... HIGHLIGHT [FIELDS {num} {field}] [TAGS {openTag} {closeTag}]

Highlighting will highlight the found term (and its variants) with a user-defined tag. This may be used to display the matched text in a different typeface using a markup language, or to otherwise make the text appear differently.

RediSearch can perform highlighting using the HIGHLIGHT keyword. If no additional arguments are passed, all returned fields are highlighted using built-in defaults.

The HIGHLIGHT keyword accepts the following arguments:

  • FIELDS If present, must be the first argument. This should be followed by the number of fields to highlight, which itself is followed by a list of fields. Each field present is highlighted. If no FIELDS directive is passed, then all fields returned are highlighted.

  • TAGS If present, must be followed by two strings; the first is prepended to each term match, and the second is appended to it. If no TAGS are specified, a built-in tag value is appended and prepended.
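
For example (the index and field names are hypothetical), a query that highlights matches in the title and returns two 25-word summary fragments of the body could look like this:

FT.SEARCH books "shield" HIGHLIGHT FIELDS 1 title TAGS "<b>" "</b>" SUMMARIZE FIELDS 1 body FRAGS 2 LEN 25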

Field selection

If no specific fields are passed to the RETURN, SUMMARIZE, or HIGHLIGHT keywords, then all of a document's fields are returned. However, if any of these keywords contain a FIELDS directive, then the SEARCH command will only return the sum total of all fields enumerated in any of those directives.

The RETURN keyword is treated specially, as it overrides any fields specified in SUMMARIZE or HIGHLIGHT.

In the command RETURN 1 foo SUMMARIZE FIELDS 1 bar HIGHLIGHT FIELDS 1 baz, the field foo is returned as-is, while bar and baz are not returned, because RETURN was specified, but did not include those fields.

In the command SUMMARIZE FIELDS 1 bar HIGHLIGHT FIELDS 1 baz, bar is returned summarized and baz is returned highlighted.

8 - Scoring

Full-text scoring functions

Scoring in RediSearch

RediSearch comes with a few very basic scoring functions to evaluate document relevance. They are all based on document scores and term frequency. This is regardless of the ability to use sortable fields. Scoring functions are specified by adding the SCORER {scorer_name} argument to a search query.

If you prefer a custom scoring function, it is possible to add more functions using the Extension API.

These are the pre-bundled scoring functions available in RediSearch and how they work. Each function is listed by its registered name, which can be passed as a SCORER argument in FT.SEARCH.

TFIDF (default)

Basic TF-IDF scoring with a few extra features thrown inside:

  1. For each term in each result, we calculate the TF-IDF score of that term to that document. Frequencies are weighted based on field weights that are pre-determined, and each term's frequency is normalized by the highest term frequency in each document.

  2. We multiply the total TF-IDF for the query term by the a priori document score given on FT.ADD.

  3. We give a penalty to each result based on "slop" or cumulative distance between the search terms: exact matches will get no penalty, but matches where the search terms are distant see their score reduced significantly. For each 2-gram of consecutive terms, we find the minimal distance between them. The penalty is the inverse of the square root of the sum of the squared distances: 1/sqrt(d(t2-t1)^2 + d(t3-t2)^2 + ...).

So for N terms in document D, T1...Tn, the resulting score could be described with this python function:

def get_score(terms, doc):
    # the sum of tf-idf
    score = 0

    # the distance penalty for all terms
    dist_penalty = 0

    for i, term in enumerate(terms):
        # tf normalized by maximum frequency
        tf = doc.freq(term) / doc.max_freq

        # idf is global for the index, and not calculated each time in real life
        idf = log2(1 + total_docs / docs_with_term(term))

        score += tf*idf

        # sum up the distance penalty
        if i > 0:
            dist_penalty += min_distance(term, terms[i-1])**2

    # multiply the score by the document score
    score *= doc.score

    # divide the score by the root of the cumulative distance
    if len(terms) > 1:
        score /= sqrt(dist_penalty)

    return score

TFIDF.DOCNORM

Identical to the default TFIDF scorer, with one important distinction:

Term frequencies are normalized by the length of the document (expressed as the total number of terms). The length is weighted, so that if a document contains two terms, one in a field that has a weight 1 and one in a field with a weight of 5, the total frequency is 6, not 2.

FT.SEARCH myIndex "foo" SCORER TFIDF.DOCNORM

BM25

A variation on the basic TF-IDF scorer, see this Wikipedia article for more info.

We also multiply the relevance score for each document by the a priori document score and apply a penalty based on slop as in TFIDF.

FT.SEARCH myIndex "foo" SCORER BM25

DISMAX

A simple scorer that sums up the frequencies of the matched terms; in the case of union clauses, it will give the maximum value of those matches. No other penalties or factors are applied.

It is not a one-to-one implementation of Solr's DISMAX algorithm, but it follows it in broad terms.

FT.SEARCH myIndex "foo" SCORER DISMAX

DOCSCORE

A scoring function that just returns the a priori score of the document without applying any calculations to it. Since document scores can be updated, this can be useful if you'd like to use an external score and nothing further.

FT.SEARCH myIndex "foo" SCORER DOCSCORE

HAMMING

Scoring by the (inverse) Hamming Distance between the documents' payload and the query payload. Since we are interested in the nearest neighbors, we invert the Hamming distance (1/(1+d)) so that a distance of 0 gives a perfect score of 1 and is the highest rank.

This works only if:

  1. The document has a payload.
  2. The query has a payload.
  3. Both are exactly the same length.

Payloads are binary-safe, and having payloads with a length that's a multiple of 64 bits yields slightly faster results.

Example:

127.0.0.1:6379> FT.CREATE idx SCHEMA foo TEXT
OK
127.0.0.1:6379> FT.ADD idx 1 1 PAYLOAD "aaaabbbb" FIELDS foo hello
OK
127.0.0.1:6379> FT.ADD idx 2 1 PAYLOAD "aaaacccc" FIELDS foo bar
OK

127.0.0.1:6379> FT.SEARCH idx "*" PAYLOAD "aaaabbbc" SCORER HAMMING WITHSCORES
1) (integer) 2
2) "1"
3) "0.5" // hamming distance of 1 --> 1/(1+1) == 0.5
4) 1) "foo"
   2) "hello"
5) "2"
6) "0.25" // hamming distance of 3 --> 1/(1+3) == 0.25
7) 1) "foo"
   2) "bar"

9 - Extensions

Details about extensions for query expanders and scoring functions

Extending RediSearch

RediSearch supports an extension mechanism, much like Redis supports modules. The API is very minimal at the moment, and it does not yet support dynamic loading of extensions at runtime. Instead, extensions must be written in C (or a language that has an interface with C) and compiled into dynamic libraries that are loaded when the module is initialized.

There are two kinds of extension APIs at the moment:

  1. Query Expanders, whose role is to expand query tokens (e.g. stemmers).
  2. Scoring Functions, whose role is to rank search results at query time.

Registering and loading extensions

Extensions should be compiled into .so files, and loaded into RediSearch on initialization of the module.

  • Compiling

    Extensions should be compiled and linked as dynamic libraries. An example Makefile for an extension can be found here.

    That folder also contains an example extension that is used for testing and can be taken as a skeleton for implementing your own extension.

  • Loading

    Loading an extension is done by appending EXTLOAD {path/to/ext.so} after the loadmodule configuration directive when loading RediSearch. For example:

    $ redis-server --loadmodule ./redisearch.so EXTLOAD ./ext/my_extension.so
    

    This causes RediSearch to automatically load the extension and register its expanders and scorers.

Initializing an extension

The entry point of an extension is a function with the signature:

int RS_ExtensionInit(RSExtensionCtx *ctx);

When loading the extension, RediSearch looks for this function and calls it. This function is responsible for registering and initializing the expanders and scorers.

It should return REDISEARCH_ERR on error or REDISEARCH_OK on success.

Example init function


#include <redisearch.h> //must be in the include path

int RS_ExtensionInit(RSExtensionCtx *ctx) {

  /* Register  a scoring function with an alias my_scorer and no special private data and free function */
  if (ctx->RegisterScoringFunction("my_scorer", MyCustomScorer, NULL, NULL) == REDISEARCH_ERR) {
    return REDISEARCH_ERR;
  }

  /* Register a query expander  */
  if (ctx->RegisterQueryExpander("my_expander", MyExpander, NULL, NULL) ==
      REDISEARCH_ERR) {
    return REDISEARCH_ERR;
  }

  return REDISEARCH_OK;
}

Calling your custom functions

When performing a query, you can tell RediSearch to use your scorers or expanders by specifying the SCORER or EXPANDER arguments, with the given alias. e.g.:

FT.SEARCH my_index "foo bar" EXPANDER my_expander SCORER my_scorer

NOTE: Expander and scorer aliases are case sensitive.

The query expander API

At the moment, we only support basic query expansion, one token at a time. An expander can decide to expand any given token into as many tokens as it wishes, and these will be union-merged at query time.

The API for an expander is the following:

#include <redisearch.h> //must be in the include path

void MyQueryExpander(RSQueryExpanderCtx *ctx, RSToken *token) {
    ...
}

RSQueryExpanderCtx

RSQueryExpanderCtx is a context that contains private data of the extension, and a callback method to expand the query. It is defined as:

typedef struct RSQueryExpanderCtx {

  /* Opaque query object used internally by the engine, and should not be accessed */
  struct RSQuery *query;

  /* Opaque query node object used internally by the engine, and should not be accessed */
  struct RSQueryNode **currentNode;

  /* Private data of the extension, set on extension initialization */
  void *privdata;

  /* The language of the query, defaults to "english" */
  const char *language;

  /* ExpandToken allows the user to add an expansion of the token in the query, that will be
   * union-merged with the given token in query time. str is the expanded string, len is its length,
   * and flags is a 32 bit flag mask that can be used by the extension to set private information on
   * the token */
  void (*ExpandToken)(struct RSQueryExpanderCtx *ctx, const char *str, size_t len,
                      RSTokenFlags flags);

  /* SetPayload allows the query expander to set GLOBAL payload on the query (not unique per token)
   */
  void (*SetPayload)(struct RSQueryExpanderCtx *ctx, RSPayload payload);

} RSQueryExpanderCtx;

RSToken

RSToken represents a single query token to be expanded and is defined as:

/* A token in the query. The expanders receive query tokens and can expand the query with more query
 * tokens */
typedef struct {
  /* The token string - which may or may not be NULL terminated */
  const char *str;
  /* The token length */
  size_t len;
  
  /* 1 if the token is the result of query expansion */
  uint8_t expanded:1;

  /* Extension specific token flags that can be examined later by the scoring function */
  RSTokenFlags flags;
} RSToken;

The scoring function API

A scoring function receives each document being evaluated by the query, for final ranking. It has access to all the query terms that brought up the document, and to metadata about the document such as its a priori score, length, etc.

Since the scoring function is evaluated for each potential document, possibly millions of times, and since Redis is single-threaded, it is important that it runs as fast as possible and is heavily optimized.

A scoring function is applied to each potential result (per document) and is implemented with the following signature:

double MyScoringFunction(RSScoringFunctionCtx *ctx, RSIndexResult *res,
                                    RSDocumentMetadata *dmd, double minScore);

RSScoringFunctionCtx is a context that implements some helper methods.

RSIndexResult is the result information - containing the document id, frequency, terms, and offsets.

RSDocumentMetadata is an object holding global information about the document, such as its a priori score.

minScore is the minimal score that will yield a result that is relevant to the search. It can be used to stop processing midway, or before we even start.

The return value of the function is a double representing the final score of the result. Returning 0 causes the result to be counted, but if there are results with a score greater than 0, they will appear above it. To completely filter out a result and not count it in the totals, the scorer should return the special value RS_SCORE_FILTEROUT (which is internally set to negative infinity, or -1/0).

RSScoringFunctionCtx

This is an object containing the following members:

  • void *privdata: a pointer to an object set by the extension at initialization time.
  • RSPayload payload: A Payload object set either by the query expander or the client.
  • int GetSlop(RSIndexResult *res): A callback method that yields the total minimal distance between the query terms. This can be used to prefer results where the "slop" is smaller and the terms are nearer to each other.

RSIndexResult

This is an object holding the information about the current result in the index, which is an aggregate of all the terms that resulted in the current document being considered a valid result.

See redisearch.h for details

RSDocumentMetadata

This is an object describing global information, unrelated to the current query, about the document being evaluated by the scoring function.

Example query expander

This example query expander expands each token with the term foo:

#include <redisearch.h> //must be in the include path
#include <string.h>     //for strdup and strlen

void DummyExpander(RSQueryExpanderCtx *ctx, RSToken *token) {
    ctx->ExpandToken(ctx, strdup("foo"), strlen("foo"), 0x1337);
}

Example scoring function

This is an actual scoring function, calculating TF-IDF for the document, multiplying that by the document score, and dividing that by the slop:

#include <redisearch.h> //must be in the include path

double TFIDFScorer(RSScoringFunctionCtx *ctx, RSIndexResult *h, RSDocumentMetadata *dmd,
                   double minScore) {
  // no need to evaluate documents with score 0 
  if (dmd->score == 0) return 0;

  // calculate sum(tf-idf) for each term in the result
  double tfidf = 0;
  for (int i = 0; i < h->numRecords; i++) {
    // take the term frequency and multiply by the term IDF, add that to the total
    tfidf += (float)h->records[i].freq * (h->records[i].term ? h->records[i].term->idf : 0);
  }
  // normalize by the maximal frequency of any term in the document   
  tfidf /=  (double)dmd->maxFreq;

  // multiply by the document score (between 0 and 1)
  tfidf *= dmd->score;

  // no need to factor the slop if tfidf is already below minimal score
  if (tfidf < minScore) {
    return 0;
  }

  // get the slop and divide the result by it, making sure we prefer results with closer terms
  tfidf /= (double)ctx->GetSlop(h);
  
  return tfidf;
}

10 - Stemming

Stemming support

Stemming Support

RediSearch supports stemming - that is adding the base form of a word to the index. This allows the query for "going" to also return results for "go" and "gone", for example.

The current stemming support is based on the Snowball stemmer library, which supports most European languages, as well as Arabic and others. We hope to include more languages soon (if you need support for a specific language, please open an issue).

For further details see the Snowball Stemmer website.
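A minimal example with redis-py (the index and key names are arbitrary); since the index language defaults to English, both indexed terms and query terms are stemmed, so a search for go also matches a document containing going:

import redis

r = redis.Redis()

# TEXT fields are stemmed using the index language (English by default)
r.execute_command("FT.CREATE", "stem_idx", "SCHEMA", "t", "TEXT")
r.hset("doc:1", mapping={"t": "going"})

# "go" and "going" share the same stem, so this returns doc:1
print(r.execute_command("FT.SEARCH", "stem_idx", "go"))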

Supported languages

The following languages are supported and can be passed to the engine when indexing or querying, with lowercase letters:

  • arabic
  • armenian
  • danish
  • dutch
  • english
  • finnish
  • french
  • german
  • hungarian
  • italian
  • norwegian
  • portuguese
  • romanian
  • russian
  • serbian
  • spanish
  • swedish
  • tamil
  • turkish
  • yiddish
  • chinese (see below)

Chinese support

Indexing a Chinese document is different than indexing a document in most other languages because of how tokens are extracted. While most languages can have their tokens distinguished by separation characters and whitespace, this is not common in Chinese.

Chinese tokenization is done by scanning the input text and checking every character or sequence of characters against a dictionary of predefined terms and determining the most likely (based on the surrounding terms and characters) match.

RediSearch makes use of the Friso Chinese tokenization library for this purpose. This is largely transparent to the user and often no additional configuration is required.

Using custom dictionaries

If you wish to use a custom dictionary, you can do so at the module level when loading the module. The FRISOINI setting can point to the location of a friso.ini file which contains the relevant settings and paths to the dictionary files.

Note that there is no "default" friso.ini file location. RediSearch comes with its own friso.ini and dictionary files, which are compiled into the module binary at build time.

11 - Synonym

Synonym support

Synonyms Support

Overview

RediSearch supports synonyms - that is, searching for a term's synonyms as defined by the synonym data structure.

The synonym data structure is a set of groups, each group contains synonym terms. For example, the following synonym data structure contains three groups, each group contains three synonym terms:

{boy, child, baby}
{girl, child, baby}
{man, person, adult}

When these three groups are located inside the synonym data structure, it is possible to search for 'child' and receive documents containing 'boy', 'girl', 'child' and 'baby'.

The synonym search technique

We use a simple HashMap to map between the terms and the group ids. While building the index, we check if the current term appears in the synonym map, and if it does we take all the group ids that the term belongs to.

For each group id, we add another record to the inverted index called "~<id>" that contains the same information as the term itself. When performing a search, we check if the searched term appears in the synonym map, and if it does we take all the group ids the term belongs to. For each group id, we search for "~<id>" and return the combined results. This technique ensures that we return all the synonyms of a given term.
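Conceptually, using the three groups above (a simplified sketch, not the engine's actual data structures):

# term -> set of synonym group ids
synonym_map = {
    "boy": {1}, "girl": {2}, "child": {1, 2}, "baby": {1, 2},
    "man": {3}, "person": {3}, "adult": {3},
}

def index_keys(term):
    # at indexing time the term is indexed under itself and under "~<id>"
    # for every synonym group it belongs to
    return [term] + ["~%d" % gid for gid in sorted(synonym_map.get(term, ()))]

def query_keys(term):
    # at query time the same expansion is applied and the results are unioned
    return index_keys(term)

# index_keys("boy")   -> ["boy", "~1"]
# query_keys("child") -> ["child", "~1", "~2"]   (also matches documents with boy/girl/baby)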

Handling concurrency

Since the indexing is performed in a separate thread, the synonyms map may change during indexing, which in turn may cause data corruption or crashes during indexing/searches. To solve this issue, we create a read-only copy for indexing purposes. The read-only copy is maintained using a reference count.

As long as the synonyms map does not change, the original synonym map holds a reference to its read-only copy so it will not be freed. Once the data inside the synonyms map has changed, the synonyms map decreases the reference count of its read-only copy. This ensures that when all the indexers are done using the read-only copy, it will automatically be freed. It also ensures that the next time an indexer asks for a read-only copy, the synonyms map will create a new copy (containing the new data) and return it.
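A rough, illustrative Python sketch of this reference-counting scheme (not the engine's actual implementation):

class ReadOnlyCopy:
    """A frozen snapshot shared by indexers, freed when the last reference is dropped."""
    def __init__(self, data):
        self.data = dict(data)
        self.refs = 1            # held by the owning synonyms map

    def incref(self):
        self.refs += 1

    def decref(self):
        self.refs -= 1
        if self.refs == 0:
            self.data = None     # "freed"

class SynonymMap:
    def __init__(self):
        self.data = {}           # term -> set of group ids
        self.copy = None         # current read-only copy, if any

    def get_read_only_copy(self):
        # indexers call this; the same copy is reused while the map is unchanged
        if self.copy is None:
            self.copy = ReadOnlyCopy(self.data)
        self.copy.incref()
        return self.copy

    def add(self, term, group_id):
        self.data.setdefault(term, set()).add(group_id)
        # the map drops its own reference; the copy survives until the last
        # indexer releases it, and the next indexer gets a fresh copy
        if self.copy is not None:
            self.copy.decref()
            self.copy = None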

Quick example

# Create an index
> FT.CREATE idx schema t text

# Create a synonym group 
> FT.SYNUPDATE idx group1 hello world

# Insert documents
> HSET foo t hello
(integer) 1
> HSET bar t world
(integer) 1

# Search
> FT.SEARCH idx hello
1) (integer) 2
2) "foo"
3) 1) "t"
   2) "hello"
4) "bar"
5) 1) "t"
   2) "world"

12 - Payload

Payload support (deprecated)

Document Payloads

!!! note The payload feature is deprecated in 2.0

Usually, RediSearch stores documents as hash keys. But if you want to access some data for aggregation or scoring functions, you might want to store that data as an inline payload. This allows evaluating properties of a document for scoring purposes at very low cost.

Since the scoring functions already have access to the DocumentMetaData, which contains document flags and score, we can add custom payloads that can be evaluated at run-time.

Payloads are NOT indexed and are not treated by the engine in any way. They are simply there for the purpose of evaluating them at query time, and optionally retrieving them. They can be JSON objects, strings, or preferably, if you are interested in fast evaluation, some sort of binary encoded data which is fast to decode.

Adding payloads for documents

When inserting a document using FT.ADD, you can ask RediSearch to store an arbitrary binary safe string as the document payload. This is done with the PAYLOAD keyword:

FT.ADD {index_name} {doc_id} {score} PAYLOAD {payload} FIELDS {field} {data}...

Evaluating payloads in query time

When implementing a scoring function, the signature of the function exposed is:

double (*ScoringFunction)(DocumentMetadata *dmd, IndexResult *h);

!!! note Currently, scoring functions cannot be dynamically added, and forking the engine and replacing them is required.

DocumentMetaData includes a few fields, one of them being the payload. It wraps a simple byte array with arbitrary length:

typedef struct {
    char *data;
    uint32_t len;
} DocumentPayload;

If no payload was set for the document, it is simply NULL. If it is not, you can go ahead and decode it. It is recommended to encode some metadata about the payload inside it, like a leading version number, etc.
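For example, a minimal Python sketch of one possible layout, with a leading version byte followed by 32-bit floats (the layout itself is entirely up to the application):

import struct

PAYLOAD_VERSION = 1

def encode_payload(values):
    # 1-byte version header, then the values as little-endian 32-bit floats
    return struct.pack("<B%df" % len(values), PAYLOAD_VERSION, *values)

def decode_payload(data):
    if data[0] != PAYLOAD_VERSION:
        raise ValueError("unknown payload version %d" % data[0])
    n = (len(data) - 1) // 4
    return struct.unpack_from("<%df" % n, data, 1)

# decode_payload(encode_payload([0.5, 1.0])) == (0.5, 1.0)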

Retrieving payloads from documents

When searching, it is possible to request the document payloads from the engine.

This is done by adding the keyword WITHPAYLOADS to FT.SEARCH.

If WITHPAYLOADS is set, the payloads follow the document id in the returned result. If WITHSCORES is set as well, the payloads follow the scores. e.g.:

127.0.0.1:6379> FT.CREATE foo SCHEMA bar TEXT
OK
127.0.0.1:6379> FT.ADD foo doc2 1.0 PAYLOAD "hi there!" FIELDS bar "hello"
OK
127.0.0.1:6379> FT.SEARCH foo "hello" WITHPAYLOADS WITHSCORES
1) (integer) 1
2) "doc2"           # id
3) "1"              # score
4) "hi there!"      # payload
5) 1) "bar"         # fields
   2) "hello"

13 - Spellchecking

Query spelling correction support

Query Spelling Correction

Query spelling correction, a.k.a "did you mean", provides suggestions for misspelled search terms. For example, the term 'reids' may be a misspelled 'redis'.

In such cases, as of v1.4, RediSearch can be used for generating alternatives to misspelled query terms. A misspelled term is a full text term (i.e., a word) that is:

  1. Not a stop word
  2. Not in the index
  3. At least 3 characters long

The alternatives for a misspelled term are generated from the corpus of already-indexed terms and, optionally, one or more custom dictionaries. Alternatives become spelling suggestions based on their respective Levenshtein distances (LD) from the misspelled term. Each spelling suggestion is given a normalized score based on its occurrences in the index.

To obtain the spelling corrections for a query, refer to the documentation of the FT.SPELLCHECK command.
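For example, with redis-py (the index name is arbitrary); the reply lists the suggestions and their normalized scores for each misspelled term:

import redis

r = redis.Redis()

# 'reids' is not in the index, is not a stop word and is at least 3 characters long,
# so the engine will try to suggest alternatives such as 'redis'
print(r.execute_command("FT.SPELLCHECK", "my_idx", "reids"))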

Custom dictionaries

A dictionary is a set of terms. Dictionaries can be added with terms, have terms deleted from them and have their entire contents dumped using the FT.DICTADD, FT.DICTDEL and FT.DICTDUMP commands, respectively.

Dictionaries can be used to modify the behavior of RediSearch's query spelling correction, by including or excluding their contents from potential spelling correction suggestions.

When used for term inclusion, the terms in a dictionary can be provided as spelling suggestions regardless of their occurrences (or lack thereof) in the index. Scores of suggestions from inclusion dictionaries are always 0.

Conversely, terms in an exclusion dictionary will never be returned as spelling alternatives.
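A short redis-py sketch of both directions, assuming the index and dictionary names used here (see FT.DICTADD, FT.DICTDUMP and FT.SPELLCHECK for the full syntax):

import redis

r = redis.Redis()

# maintain a custom dictionary of terms
r.execute_command("FT.DICTADD", "product_names", "redis", "redisearch")
print(r.execute_command("FT.DICTDUMP", "product_names"))

# include the dictionary terms as suggestions (their score is always 0),
# and never suggest terms from the exclusion dictionary
print(r.execute_command("FT.SPELLCHECK", "my_idx", "reids",
                        "TERMS", "INCLUDE", "product_names",
                        "TERMS", "EXCLUDE", "banned_terms"))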

14 - Phonetic

Phonetic matching

Phonetic Matching

Phonetic matching, a.k.a "Jon or John", allows searching for terms based on their pronunciation. This capability can be a useful tool when searching for names of people.

Phonetic matching is based on the use of a phonetic algorithm. A phonetic algorithm transforms the input term to an approximate representation of its pronunciation. This allows indexing terms, and consequently searching, by their pronunciation.

As of v1.4 RediSearch provides phonetic matching via the definition of text fields with the PHONETIC attribute. This causes the terms in such fields to be indexed both by their textual value as well as their phonetic approximation.

Performing a search on PHONETIC fields will, by default, also return results for phonetically similar terms. This behavior can be controlled with the $phonetic query attribute.

Phonetic algorithms support

RediSearch currently supports a single phonetic algorithm, the Double Metaphone (DM). It uses the implementation at slacy/double-metaphone, which provides general support for Latin languages.
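A minimal redis-py sketch; the index and field names are arbitrary, and dm:en is assumed to be the Double Metaphone matcher for English:

import redis

r = redis.Redis()

# the PHONETIC attribute indexes the field both textually and phonetically
r.execute_command("FT.CREATE", "people", "SCHEMA", "name", "TEXT", "PHONETIC", "dm:en")
r.hset("person:1", mapping={"name": "John"})

# returns person:1, because "Jon" and "John" have the same phonetic form
print(r.execute_command("FT.SEARCH", "people", "Jon"))

# phonetic expansion can be disabled per clause with the $phonetic query attribute
print(r.execute_command("FT.SEARCH", "people", "@name:(Jon)=>{$phonetic:false;}"))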

15 - Vector similarity

Details about vector fields and vector similarity queries

Vector Fields

Vector fields offer the ability to use vector similarity queries in the FT.SEARCH command.

Vector similarity search offers the ability to load, index and query vectors stored as fields in Redis hashes.

At present, the key functionalities offered are creating vector fields and running vector similarity queries against them, as described below.

Creating a vector field

Vector fields can be added to the schema in FT.CREATE with the following syntax:

FT.CREATE ... SCHEMA ... {field_name} VECTOR {algorithm} {count} [{attribute_name} {attribute_value} ...]
  • {algorithm}

    Must be specified and be a supported vector similarity index algorithm. The supported algorithms are:

    FLAT - brute force algorithm.

    HNSW - Hierarchical Navigable Small World algorithm.

    The algorithm attribute specifies which algorithm to use when searching for the k most similar vectors in the index.

  • {count}

    Specify the number of attributes for the index. Must be specified.

    Notice that this count is the total number of arguments passed for the index in the command (attribute names and values alike), even though algorithm parameters are submitted as named arguments. For example:

    FT.CREATE my_idx SCHEMA vec_field VECTOR FLAT 6 TYPE FLOAT32 DIM 128 DISTANCE_METRIC L2
    

    Here we pass 3 parameters for the index (TYPE, DIM, DISTANCE_METRIC), and the count is the total number of arguments passed (6), since each parameter consists of a name and a value.

  • {attribute_name} {attribute_value}

    Algorithm attributes for the creation of the vector index. Every algorithm has its own mandatory and optional attributes.

Specific creation attributes per algorithm

FLAT

  • Mandatory parameters

    • TYPE - Vector type. Current supported type is FLOAT32.

    • DIM - Vector dimension. Should be a positive integer.

    • DISTANCE_METRIC - Supported distance metric. Currently one of {L2, IP, COSINE}

  • Optional parameters

    • INITIAL_CAP - Initial vector capacity in the index. Affects memory allocation size of the index.

    • BLOCK_SIZE - Block size, i.e. the number of vectors held in each contiguous array. This is useful when the index is dynamic with respect to addition and deletion. Defaults to 1048576 (1024*1024).

  • Example

    FT.CREATE my_index1 
    SCHEMA vector_field VECTOR 
    FLAT 
    10 
    TYPE FLOAT32 
    DIM 128 
    DISTANCE_METRIC L2 
    INITIAL_CAP 1000000 
    BLOCK_SIZE 1000
    

HNSW

  • Mandatory parameters

    • TYPE - Vector type. Current supported type is FLOAT32.

    • DIM - Vector dimension. Should be a positive integer.

    • DISTANCE_METRIC - Supported distance metric. Currently one of {L2, IP, COSINE}

  • Optional parameters

    • INITIAL_CAP - Initial vector capacity in the index. Affects memory allocation size of the index.

    • M - The maximal number of allowed outgoing edges for each node in the graph, per layer. On layer zero the maximal number of outgoing edges is 2M. Defaults to 16.

    • EF_CONSTRUCTION - The maximal number of candidate outgoing edges considered for each node during graph construction. Defaults to 200.

    • EF_RUNTIME - The maximal number of top candidates to hold during the KNN search. Higher values of EF_RUNTIME lead to more accurate results at the expense of a longer runtime. Defaults to 10.

  • Example

    FT.CREATE my_index2 
    SCHEMA vector_field VECTOR 
    HNSW 
    14 
    TYPE FLOAT32 
    DIM 128 
    DISTANCE_METRIC L2 
    INITIAL_CAP 1000000 
    M 40 
    EF_CONSTRUCTION 250 
    EF_RUNTIME 20
    

Querying vector fields

We allow using vector similarity queries in the FT.SEARCH "query" parameter. The syntax for vector similarity queries is *=>[{vector similarity query}] for running the query on an entire vector field, or {primary filter query}=>[{vector similarity query}] for running the similarity query on the results of the primary filter query. To use a vector similarity query, you must specify the option DIALECT 2, either in the command itself or by setting the DEFAULT_DIALECT option to 2, with the command FT.CONFIG SET or when loading the redisearch module with the argument DEFAULT_DIALECT 2.

As of version 2.4, we allow vector similarity to be used once in the query, and over the entire query filter.

  • Invalid example: "(@title:Matrix)=>[KNN 10 @v $B] @year:[2020 2022]"

  • Valid example: "(@title:Matrix @year:[2020 2022])=>[KNN 10 @v $B]"

The {vector similarity query} part inside the square brackets needs to be in the following format:

KNN { number | $number_attribute } @{vector field} $blob_attribute [{vector query param name} {value|$value_attribute} [...]] [ AS {score field name | $score_field_name_attribute}]

Every "*_attribute" parameter should refer to an attribute in the PARAMS section.

  • { number | $number_attribute } - The number of requested results ("K").

  • @{vector field} - The name of a vector field in the index.

  • $blob_attribute - An attribute that holds the query vector as a blob. Must be passed through the PARAMS section.

  • [{vector query param name} {value|$value_attribute} [...]] - An optional part for passing vector similarity query parameters. Parameters should come in key-value pairs and should be valid parameters for the query. See which runtime parameters are valid for each algorithm below.

  • [ AS {score field name | $score_field_name_attribute}] - An optional part for specifying a score field name, for later sorting by the similarity score. By default the score field name is "__{vector field}_score" and it can be used for sorting without using AS {score field name} in the query.

Specific runtime attributes per algorithm

FLAT

Currently there are no runtime parameters available for FLAT indexes.

HNSW

  • Optional parameters

    • EF_RUNTIME - The maximal number of top candidates to hold during the KNN search. Higher values of EF_RUNTIME lead to more accurate results at the expense of a longer runtime. Defaults to the EF_RUNTIME value passed on creation (which defaults to 10).

A few notes

  1. Although the query specifies K requested results, the default LIMIT in RediSearch is 10, so to get all K results, make sure to specify LIMIT 0 {K} in your command.

  2. By default, the results are sorted by their document's default RediSearch score. To get the results sorted by similarity score, use SORTBY {score field name} as explained earlier.

Examples for querying vector fields

  • FT.SEARCH idx "*=>[KNN 100 @vec $BLOB]" PARAMS 2 BLOB "\12\a9\f5\6c" DIALECT 2
    
  • FT.SEARCH idx "*=>[KNN 100 @vec $BLOB]" PARAMS 2 BLOB "\12\a9\f5\6c" SORTBY __vec_score DIALECT 2
    
  • FT.SEARCH idx "*=>[KNN $K @vec $BLOB EF_RUNTIME $EF]" PARAMS 6 BLOB "\12\a9\f5\6c" K 10 EF 150 DIALECT 2
    
  • FT.SEARCH idx "*=>[KNN $K @vec $BLOB AS my_scores]" PARAMS 4 BLOB "\12\a9\f5\6c" K 10 SORTBY my_scores DIALECT 2
    
  • FT.SEARCH idx "(@title:Dune @num:[2020 2022])=>[KNN $K @vec $BLOB AS my_scores]" PARAMS 4 BLOB "\12\a9\f5\6c" K 10 SORTBY my_scores DIALECT 2
    
  • FT.SEARCH idx "(@type:{shirt} ~@color:{blue})=>[KNN $K @vec $BLOB AS my_scores]" PARAMS 4 BLOB "\12\a9\f5\6c" K 10 SORTBY my_scores DIALECT 2