Fast phrase matching using Roaring Bitmaps

A couple weeks ago I wrote about RoaringLite, the SQLite Roaring Bitmaps extension. Today we are going to delve a little bit deeper and explore how can Roaring Bitmaps be used to implement fast (and space efficient) phrase search over a trigram index and then we compare that to SQLite’s FTS5 implementation.

Assumptions

We are not going to implement a full fledged full text search engine, rather, a very simplified phrase search engine that assumes the following about the data:

We will be processing log lines, which are generally limited in size, usually under a KB
The log lines don’t have structured attributes, just a sequence of characters delimited by a newline
No stemming, synonym support or typo tolerance is required, only exact sequences are matched
No ranking of the result set, it is always sorted by date of insertion (we will only be doing count queries anyway), this could actually be acceptable in log search, since we are mostly interested in the analysis of the log data.
Since we are interested in exact matches then we will be using a trigram index for both the Roaring Bitmaps implementation and the FTS5 one.

A trigram index is simply a sequence of three letter tokens, as such:

'SQLite rocks!'.tokenize
# ['SQL', 'QLi', 'Lit', 'ite', 'te ', 'e r', ' ro' , 'roc', 'ock', 'cks', 'ks!']

Data & Schema

For experimentation we will be using the OpenSSH logs from loghub, it is a ~78MB log file with over 600K log lines.

The database schema looks as such:

-- our data table
CREATE TABLE logs (
  id INTEGER PRIMARY KEY, 
  line TEXT
);

-- the fts5 index, using the trigram tokenizer and referring to the data table
CREATE VIRTUAL TABLE logs_idx USING fts5 (
  line,
  content='logs',
  tokenize='trigram'
); 

-- our Roaring Bitmaps index
CREATE TABLE trigrams (
  trigram TEXT,
  pos INTEGER, -- the position of the trigram in the log line
  line_ids BLOB, -- the roaring bitmap
  PRIMARY KEY (trigram, pos)
) WITHOUT ROWID;

We can get away with exposing the pos as a col since there should not be so many positions in a log line, if you are indexing something like books then expect that each trigram will expand over a huge number of rows.

Populating the index

To populate the FTS5 index we run the following straight forward SQL

INSERT INTO logs_idx(rowid, line) SELECT id, line FROM logs;

This takes ~24 seconds to complete, which is decent

Afterwards we check the size of the created index

-- merge FTS5 data segments to save space and improve performance
INSERT INTO logs_idx(logs_idx) VALUES ('optimize');
-- we need to query the shadow table created by FTS5 that contains the actual data
SELECT count(*) * 4.0 / 1024 FROM dbstat('main') WHERE name = 'logs_idx_data';

The size returned was ~200MB, which is quite sizable for a ~78M input file.

For the Roaring Bitmaps we use the following query

WITH RECURSIVE extract(id, trigram, line, pos) AS (
  SELECT id, lower(substr(line, 1, 3)), lower(line), 0 FROM logs
  UNION ALL
  SELECT id, substr(line, pos + 1, 3), line, pos + 1 FROM extract
  WHERE pos >= length(line) - 3
) INSERT INTO trigrams 
  SELECT trigram, pos, rb_group_create(id) FROM extract 
  GROUP BY trigram, pos 
  ON CONFLICT 
  DO UPDATE SET line_ids = rb_or(line_ids, EXCLUDED.line_ids);

This query iterates recursively over each log line, converting it into tri-grams then inserting those into the trigrams table against the proper trigram and position combo.

This query took ~120 seconds to complete, which is 5 times slower than FTS5. Not a great start for the Roaring Bitmaps, though to be fair, this query spends most of its time in the tokenization part, moving that to an extension function would probably end up with similar timings compared to FTS5.

Now we look at the index size

-- optimize the table layout
VACUUM;
SELECT count(*) * 4.0 / 1024 FROM dbstat('main') WHERE name = 'trigrams';

The size measured was ~110MB, which 45% smaller than the FTS5 version, that’s a win in my book given FTS5 already uses a pretty compact (varint) format extensively.

Querying the index

Querying phrases in FTS5 is straight forward and elegant, the query looks like this (as supplied to the prepare function):

SELECT :term AS phrase, count(*) AS count FROM logs_idx('\"'||:term||'\"')

On the other hand, querying the trigrams table is quite involved, we use the following query

WITH RECURSIVE extract_trigrams(trigram, term, pos, next_pos) AS MATERIALIZED (
    SELECT substr(:term, 1, 3), :term, 0, 3
    UNION ALL
    SELECT 
      substr(term, iif(length(term) - next_pos >= 3, 
      next_pos, length(term) - 3) + 1, 3), 
      term, 
      iif(length(term) - next_pos >= 3 , next_pos, length(term) - 3), 
      next_pos + 3 
    FROM find_term WHERE next_pos < length(term)
), stats AS MATERIALIZED (
    SELECT count(*) AS term_count, sum(pos) AS pos_sum FROM extract_trigrams
), all_bitmaps AS MATERIALIZED (
    SELECT 
      t.trigram AS tr, 
      rb_group_and(t.line_ids) AS bm,  
      (t.pos - s.pos) AS rel_pos, 
    FROM trigrams AS t, extract_trigrams AS s, stats 
    WHERE t.trigram = s.trigram 
    GROUP BY rel_pos 
    HAVING sum(s.pos) = pos_sum AND count(*) = term_count
    ORDER BY rel_pos DESC
) SELECT 
      :term AS phrase, 
      rb_count(rb_group_or(bm)) AS count 
  FROM all_bitmaps;

Let’s quickly explain what this query does, it is composed of a few sub queries:

extract_trigrams, a recursive query to tokenize the search phrase into trigrams
stats, collects aggregate info about the phrase, like the count of tokens and the sum of positions
all_bitmaps, the actual query, it extracts bitmap data from the trigrams table, aggregates them based on the relative position of each term in the phrase and then filters out those instances that don’t account for all the terms in the phrase.
Outermost query, returns the count of all the records matching the phrase

To test the correctness of the implementation we ran a few phrases on both engines and the results were identical

Phrase	SQLite FTS5 Index results count	Roaring Bitmaps Index results count
invalid user	49472	49472
failed password	197573	197573
connection closed	68957	68957
getaddrinfo	18909	18909
ssh2	198184	198184
break-in	19406	19406
received disconnect	48692	48692
received disconnect from	48692	48692
jan	352822	352822
dec 30	40372	40372
bye bye	46580	46580
fatal	961	961
connection reset by peer	955	955
pam_unix(sshd:auth):	174895	174895

Phrase search results (log line counts)

It is worthy to mention that a much simpler query could have been used by relying on whatever host language you are calling SQLite from (Ruby in my case). But pushing as much logic down the SQLite pipes usually results in faster execution as it minimizes the amount of objects created in the host environment and hence lowers the pressure on thing like the garbage collector and minimize the calls that cross from C to Ruby and the other way around.

Performance Benchmarks

The same set of phrases were used to benchmark both solutions, the benchmark code would loop for around one second repeatedly issuing a certain phrase to the index then counting the number of invocations. Here are the results:

Phrase	SQLite FTS5 Index searches/sec	Roaring Bitmaps Index searches/sec
invalid user	35	809
failed password	9	679
connection closed	20	1069
getaddrinfo	124	2768
ssh2	30	674
break-in	135	1487
received disconnect	25	676
received disconnect from	18	563
jan	39	27214
dec 30	125	11162
bye bye	73	1686
fatal	4415	14959
connection reset by peer	135	1450
pam_unix(sshd:auth):	8	612

Phrase search performance (searches per second)

A few comments on the results, please take note that these apply only in the context of phrase search, on a trigram index without result ranking:

Roaring bitmaps are generally much faster than FTS5 for counting the results of a phrase search
The performance advantage ranges from 3 times faster all the way to almost 700 times faster
SQLite performance is very sensitive to two variables:
- The size of the phrase (hence the number of trigrams)
- The number of docs that a phrase can be found in
  That’s why the best performing phrase is “fatal” since it is short (3 trigrams) and only found in 951 log lines
On the other hand, this Roaring Bitmaps implementation’s performance is more sensitive to
- The number of unique positions a trigram can be found in
  That’s why it is very fast for the phrase “jan”, a single trigram, only found in a single position in all the log lines

In conclusion, by using a modern and efficient data structure, and by imposing some constraints on the problem at hand we were able to perform phrase search much faster (sometimes orders of magnitude faster) than SQLite’s FTS5. Of course FTS5 can do a lot more than this simplistic implementation, but it is still promising to see these results.

Fast phrase matching using Roaring Bitmaps

Assumptions

Data & Schema

Populating the index

Querying the index

Performance Benchmarks

Leave a comment Cancel reply

Oldmoe's blog

Fast phrase matching using Roaring Bitmaps

Assumptions

Data & Schema

Populating the index

Querying the index

Performance Benchmarks

Share this:

Leave a comment Cancel reply