Blog at WordPress.com.

A couple weeks ago I wrote about RoaringLite, the SQLite Roaring Bitmaps extension. Today we are going to delve a little bit deeper and explore how can Roaring Bitmaps be used to implement fast (and space efficient) phrase search over a trigram index and then we compare that to SQLite’s FTS5 implementation.

Assumptions

We are not going to implement a full fledged full text search engine, rather, a very simplified phrase search engine that assumes the following about the data:

  • We will be processing log lines, which are generally limited in size, usually under a KB
  • The log lines don’t have structured attributes, just a sequence of characters delimited by a newline
  • No stemming, synonym support or typo tolerance is required, only exact sequences are matched
  • No ranking of the result set, it is always sorted by date of insertion (we will only be doing count queries anyway), this could actually be acceptable in log search, since we are mostly interested in the analysis of the log data.
  • Since we are interested in exact matches then we will be using a trigram index for both the Roaring Bitmaps implementation and the FTS5 one.

A trigram index is simply a sequence of three letter tokens, as such:

'SQLite rocks!'.tokenize
# ['SQL', 'QLi', 'Lit', 'ite', 'te ', 'e r', ' ro' , 'roc', 'ock', 'cks', 'ks!'] 

Data & Schema

For experimentation we will be using the OpenSSH logs from loghub, it is a ~78MB log file with over 600K log lines.

The database schema looks as such:

-- our data table
CREATE TABLE logs (
  id INTEGER PRIMARY KEY, 
  line TEXT
);

-- the fts5 index, using the trigram tokenizer and referring to the data table
CREATE VIRTUAL TABLE logs_idx USING fts5 (
  line,
  content='logs',
  tokenize='trigram'
); 

-- our Roaring Bitmaps index
CREATE TABLE trigrams (
  trigram TEXT,
  pos INTEGER, -- the position of the trigram in the log line
  line_ids BLOB, -- the roaring bitmap
  PRIMARY KEY (trigram, pos)
) WITHOUT ROWID;

We can get away with exposing the pos as a col since there should not be so many positions in a log line, if you are indexing something like books then expect that each trigram will expand over a huge number of rows.

Populating the index

To populate the FTS5 index we run the following straight forward SQL

INSERT INTO logs_idx(rowid, line) SELECT id, line FROM logs;

This takes ~24 seconds to complete, which is decent

Afterwards we check the size of the created index

-- merge FTS5 data segments to save space and improve performance
INSERT INTO logs_idx(logs_idx) VALUES ('optimize');
-- we need to query the shadow table created by FTS5 that contains the actual data
SELECT count(*) * 4.0 / 1024 FROM dbstat('main') WHERE name = 'logs_idx_data';

The size returned was ~200MB, which is quite sizable for a ~78M input file.

For the Roaring Bitmaps we use the following query

WITH RECURSIVE extract(id, trigram, line, pos) AS (
  SELECT id, lower(substr(line, 1, 3)), lower(line), 0 FROM logs
  UNION ALL
  SELECT id, substr(line, pos + 1, 3), line, pos + 1 FROM extract
  WHERE pos >= length(line) - 3
) INSERT INTO trigrams 
  SELECT trigram, pos, rb_group_create(id) FROM extract 
  GROUP BY trigram, pos 
  ON CONFLICT 
  DO UPDATE SET line_ids = rb_or(line_ids, EXCLUDED.line_ids);

This query iterates recursively over each log line, converting it into tri-grams then inserting those into the trigrams table against the proper trigram and position combo.

This query took ~120 seconds to complete, which is 5 times slower than FTS5. Not a great start for the Roaring Bitmaps, though to be fair, this query spends most of its time in the tokenization part, moving that to an extension function would probably end up with similar timings compared to FTS5.

Now we look at the index size

-- optimize the table layout
VACUUM;
SELECT count(*) * 4.0 / 1024 FROM dbstat('main') WHERE name = 'trigrams';

The size measured was ~110MB, which 45% smaller than the FTS5 version, that’s a win in my book given FTS5 already uses a pretty compact (varint) format extensively.

Querying the index

Querying phrases in FTS5 is straight forward and elegant, the query looks like this (as supplied to the prepare function):

SELECT :term AS phrase, count(*) AS count FROM logs_idx('\"'||:term||'\"')

On the other hand, querying the trigrams table is quite involved, we use the following query

WITH RECURSIVE extract_trigrams(trigram, term, pos, next_pos) AS MATERIALIZED (
    SELECT substr(:term, 1, 3), :term, 0, 3
    UNION ALL
    SELECT 
      substr(term, iif(length(term) - next_pos >= 3, 
      next_pos, length(term) - 3) + 1, 3), 
      term, 
      iif(length(term) - next_pos >= 3 , next_pos, length(term) - 3), 
      next_pos + 3 
    FROM find_term WHERE next_pos < length(term)
), stats AS MATERIALIZED (
    SELECT count(*) AS term_count, sum(pos) AS pos_sum FROM extract_trigrams
), all_bitmaps AS MATERIALIZED (
    SELECT 
      t.trigram AS tr, 
      rb_group_and(t.line_ids) AS bm,  
      (t.pos - s.pos) AS rel_pos, 
    FROM trigrams AS t, extract_trigrams AS s, stats 
    WHERE t.trigram = s.trigram 
    GROUP BY rel_pos 
    HAVING sum(s.pos) = pos_sum AND count(*) = term_count
    ORDER BY rel_pos DESC
) SELECT 
      :term AS phrase, 
      rb_count(rb_group_or(bm)) AS count 
  FROM all_bitmaps;

Let’s quickly explain what this query does, it is composed of a few sub queries:

  • extract_trigrams, a recursive query to tokenize the search phrase into trigrams
  • stats, collects aggregate info about the phrase, like the count of tokens and the sum of positions
  • all_bitmaps, the actual query, it extracts bitmap data from the trigrams table, aggregates them based on the relative position of each term in the phrase and then filters out those instances that don’t account for all the terms in the phrase.
  • Outermost query, returns the count of all the records matching the phrase

To test the correctness of the implementation we ran a few phrases on both engines and the results were identical

PhraseSQLite FTS5 Index results countRoaring Bitmaps Index results count
invalid user4947249472
failed password197573197573
connection closed6895768957
getaddrinfo1890918909
ssh2198184198184
break-in1940619406
received disconnect4869248692
received disconnect from4869248692
jan352822352822
dec 304037240372
bye bye4658046580
fatal961961
connection reset by peer955955
pam_unix(sshd:auth):174895174895
Phrase search results (log line counts)

It is worthy to mention that a much simpler query could have been used by relying on whatever host language you are calling SQLite from (Ruby in my case). But pushing as much logic down the SQLite pipes usually results in faster execution as it minimizes the amount of objects created in the host environment and hence lowers the pressure on thing like the garbage collector and minimize the calls that cross from C to Ruby and the other way around.

Performance Benchmarks

The same set of phrases were used to benchmark both solutions, the benchmark code would loop for around one second repeatedly issuing a certain phrase to the index then counting the number of invocations. Here are the results:

PhraseSQLite FTS5 Index searches/secRoaring Bitmaps Index searches/sec
invalid user35809
failed password9679
connection closed201069
getaddrinfo1242768
ssh230674
break-in1351487
received disconnect25676
received disconnect from18563
jan3927214
dec 3012511162
bye bye731686
fatal441514959
connection reset by peer1351450
pam_unix(sshd:auth):8612
Phrase search performance (searches per second)

A few comments on the results, please take note that these apply only in the context of phrase search, on a trigram index without result ranking:

  • Roaring bitmaps are generally much faster than FTS5 for counting the results of a phrase search
  • The performance advantage ranges from 3 times faster all the way to almost 700 times faster
  • SQLite performance is very sensitive to two variables:
    • The size of the phrase (hence the number of trigrams)
    • The number of docs that a phrase can be found in
      That’s why the best performing phrase is “fatal” since it is short (3 trigrams) and only found in 951 log lines
  • On the other hand, this Roaring Bitmaps implementation’s performance is more sensitive to
    • The number of unique positions a trigram can be found in
      That’s why it is very fast for the phrase “jan”, a single trigram, only found in a single position in all the log lines

In conclusion, by using a modern and efficient data structure, and by imposing some constraints on the problem at hand we were able to perform phrase search much faster (sometimes orders of magnitude faster) than SQLite’s FTS5. Of course FTS5 can do a lot more than this simplistic implementation, but it is still promising to see these results.

Leave a comment