For many projects full-text search (FTS) is now a requirement. SQLite's fts5 extension provides FTS functionality that's simple and fast. Here I document a solution for serverless applications that runs SQLite FTS indexes in memory.
Data
My data is books: a 20 byte ID, a title, an array of authors and the publish year, roughly 50-100 bytes per book. Per-item data size is an important thing to optimise, as it determines the minimum size of your indexes. The smaller the better.
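For concreteness, each record looks roughly like the struct below. The field names are just illustrative, not the actual schema:

// Book is roughly the record being indexed: a 20 byte ID, a title,
// an array of authors and the publish year, ~50-100 bytes in total.
type Book struct {
	ID      string // 20 byte identifier
	Title   string
	Authors []string
	Year    int
}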
Building the index
CREATE VIRTUAL TABLE IF NOT EXISTS books_search USING fts5(
    id, title, authors, year
);
Queries can now be run returning the book IDs, titles and their ranking:
SELECT id, title, rank FROM books_search
WHERE title MATCH $title
ORDER BY rank LIMIT 10;
150K books can be stored in a changeset file of around 22 MB.
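Generating that changeset in the first place might look roughly like the sketch below, using the session support in crawshaw.io/sqlite (the Go library used later in this post). The buildChangeset helper and file handling are my own illustration, and it assumes Session.Attach("") attaches every table (mirroring sqlite3session_attach with NULL) so the fts5 shadow tables end up in the changeset:

import (
	"os"
	"strings"

	"crawshaw.io/sqlite"
	"crawshaw.io/sqlite/sqlitex"
)

// buildChangeset indexes books into a scratch database and writes the
// recorded changes out as a changeset file. A sketch, not the exact pipeline.
func buildChangeset(books []Book, dbPath, outPath string) error {
	conn, err := sqlite.OpenConn(dbPath, 0)
	if err != nil {
		return err
	}
	defer conn.Close()

	// Create the schema first so it isn't part of the recorded changes.
	if err := sqlitex.ExecScript(conn, `
		CREATE VIRTUAL TABLE IF NOT EXISTS books_search USING fts5(
			id, title, authors, year
		);`); err != nil {
		return err
	}

	// Record every change made from this point on.
	sess, err := conn.CreateSession("main")
	if err != nil {
		return err
	}
	defer sess.Delete()
	// Assumption: an empty table name attaches all tables, including the
	// fts5 shadow tables that actually hold the index data.
	if err := sess.Attach(""); err != nil {
		return err
	}

	for _, b := range books {
		err := sqlitex.Exec(conn,
			"INSERT INTO books_search (id, title, authors, year) VALUES (?, ?, ?, ?);",
			nil, b.ID, b.Title, strings.Join(b.Authors, ", "), b.Year)
		if err != nil {
			return err
		}
	}

	// Serialise the recorded changes to the changeset file.
	f, err := os.Create(outPath)
	if err != nil {
		return err
	}
	defer f.Close()
	return sess.Changeset(f)
}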
Running Serverless
The usual way to use these indexes would be to store the data on disk and have SQLite query the disk directly. Moving to serverless we lose the option of local disk IO and have to use the cloud provider's storage services. We can instead run the SQLite index in memory, fetching the data when the table is needed.
In my case I'm running a Go application on Google Cloud Run. I store the changesets in Google Cloud Storage and apply them to an in-memory database at runtime.
The library I used for Go is crawshaw.io/sqlite, which importantly supports SQLite's session extension. Below is a snippet showing how to apply the changeset to an in-memory database:
// Apply changes from every table in the changeset.
filterFn := func(tn string) bool {
	return true
}
// Abort on any conflict; the in-memory database starts empty, so there
// shouldn't be any.
conflictFn := func(ct sqlite.ConflictType, ci sqlite.ChangesetIter) sqlite.ConflictAction {
	return sqlite.SQLITE_CHANGESET_ABORT
}

// Stream the changeset from the cloud storage bucket.
blob, err := bkt.NewReader(ctx, name, nil)
if err != nil {
	return err
}
defer blob.Close()

// Transparently decompress if the object is gzipped.
var r io.Reader = blob
if strings.HasSuffix(name, ".gz") {
	gr, err := gzip.NewReader(r)
	if err != nil {
		return err
	}
	defer gr.Close()
	r = gr
}

// Apply the changeset to the in-memory connection c.
if err := c.ChangesetApply(r, filterFn, conflictFn); err != nil {
	return err
}
We can then apply the changeset when a search query is run. The first query will be slow, since it has to download the index and apply it, but subsequent queries run very quickly against the in-memory index.
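To make applying on the first query concrete, here's a rough sketch of the wiring: a sync.Once guards the load, and queries then run against the warm in-memory connection. The searchBooks and applyChangeset helpers are illustrative names of my own (the latter wrapping the snippet above), with imports as before plus context and sync:

var (
	searchConn *sqlite.Conn
	loadOnce   sync.Once
	loadErr    error
)

// searchBooks lazily builds the in-memory index on the first call, then
// runs the fts5 query against it.
func searchBooks(ctx context.Context, title string) ([]string, error) {
	loadOnce.Do(func() {
		// Everything lives in RAM; nothing is written to disk.
		searchConn, loadErr = sqlite.OpenConn(":memory:", 0)
		if loadErr != nil {
			return
		}
		// The schema must exist before the changeset is applied.
		loadErr = sqlitex.ExecScript(searchConn, `
			CREATE VIRTUAL TABLE IF NOT EXISTS books_search USING fts5(
				id, title, authors, year
			);`)
		if loadErr != nil {
			return
		}
		// applyChangeset wraps the download-and-apply snippet above.
		loadErr = applyChangeset(ctx, searchConn, "books.changeset.gz")
	})
	if loadErr != nil {
		return nil, loadErr
	}

	var ids []string
	err := sqlitex.Exec(searchConn,
		`SELECT id, title, rank FROM books_search
		 WHERE title MATCH ? ORDER BY rank LIMIT 10;`,
		func(stmt *sqlite.Stmt) error {
			ids = append(ids, stmt.GetText("id"))
			return nil
		}, title)
	return ids, err
}

Note that a single crawshaw Conn isn't meant to be used from multiple goroutines at once, so a real handler would guard it with a mutex or use a sqlitex.Pool.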
Going Smaller
The above worked great on a small dataset of around 150K books at 22 MB. However, going to a million books upped the index size to 192 MB, a bit on the heavy side for Cloud Run's RAM usage.
To handle a reasonably larger dataset we can compress, or more accurately throw away, more of the data.
SQLite fts5 tables have a content option, which allows you to specify the backing table for the search index. Setting content='' creates a contentless table: the indexed columns won't be readable through SQLite. Importantly, for contentless fts tables we now need to set the rowid explicitly on inserts (see the sketch after the schema below). The mapping table must also have an explicit rowid INTEGER PRIMARY KEY, to stop SQLite VACUUM from rearranging the rowids and corrupting what the search index points to.
CREATE TABLE IF NOT EXISTS books_index (
    rowid INTEGER PRIMARY KEY,
    bookid TEXT
);
CREATE INDEX idx_books_index ON books_index (bookid);

CREATE VIRTUAL TABLE IF NOT EXISTS books_search USING fts5(
    title, authors, year, content=''
);
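With the contentless table, each insert now has to supply the same explicit rowid to both tables; roughly, reusing the Book type and imports from the earlier sketches:

// insertBook writes one book into the contentless index, using the same
// explicit rowid for the mapping table and the fts5 table.
func insertBook(conn *sqlite.Conn, rowid int64, b Book) error {
	if err := sqlitex.Exec(conn,
		"INSERT INTO books_index (rowid, bookid) VALUES (?, ?);",
		nil, rowid, b.ID); err != nil {
		return err
	}
	// The rowid is set explicitly so the fts entry lines up with books_index.
	return sqlitex.Exec(conn,
		"INSERT INTO books_search (rowid, title, authors, year) VALUES (?, ?, ?, ?);",
		nil, rowid, b.Title, strings.Join(b.Authors, ", "), b.Year)
}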
Queries can now be run returning the book IDs and their ranking:
SELECT bookid, rank FROM books_search S
JOIN books_index I ON S.rowid = I.rowid
WHERE title MATCH $title
ORDER BY rank LIMIT 10;
Content isn't available via the index, so all information must be fetchable based on the bookid.
1 million books can now be stored in 91 MB. Going further, we can compress the changeset on disk with gzip for a 47 MB file.
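The compression step is just the mirror image of the gzip branch in the loading snippet; for example:

// compressChangeset gzips an existing changeset file.
func compressChangeset(in, out string) error {
	src, err := os.Open(in)
	if err != nil {
		return err
	}
	defer src.Close()

	dst, err := os.Create(out)
	if err != nil {
		return err
	}
	defer dst.Close()

	gw := gzip.NewWriter(dst)
	if _, err := io.Copy(gw, src); err != nil {
		return err
	}
	// Close flushes the remaining compressed data.
	return gw.Close()
}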
Alternatives
Hosted search services like Algolia are great and allow serverless applications to remain stateless without worrying about the implementation. SQLite also has some interesting solutions for running client side, by compiling SQLite to WASM and dynamically fetching parts of the database from storage as queries run (write up). For a small and simple application, though, running the SQLite index as part of the application showed how easy it is to get a simple, workable solution.