I came across a cat-v article titled “all software sucks”, which lists software the cat-v author considers “harmful” and bloated, along with less harmful alternatives.
On that list I found SQL. SQL is undoubtedly the default choice for applications these days, and it also undoubtedly carries a lot of complexity. Backups, testing, encryption, and performance are all hard problems when you use a SQL database. Serialization is also messy, and SQL as a language doesn’t compose. Self-hosting a SQL database is rarely considered a good idea; instead you’re recommended extremely expensive SaaS solutions. So there are quite a few drawbacks. Is there a cure?
The author lists a few alternatives. One that piqued my interest was pq. A search for “pq” and “pq database” yielded nothing of interest, so the first time around I gave up on trying to find it. Months later, I returned to the article and realized that pq must be connected to plan9. And sure enough, when I searched for “pq plan9 database” I was able to find it. Will it solve everything I lament about SQL? There is hardly any information about pq online, but I managed to find it in the addons section on 9p.io. I downloaded it and began reading the manual by invoking man -l man/1/pq. The manual begins:
Pq connects to an Implicit Relational Database (IRDB) directory and outputs the record(s) that match the query arguments. The directory contacted may be specified by a modules argument (see dispatch(7)).
I also found a Reddit thread where a previous maintainer talks about it:
pq (named before convenient names for web searches were a primary concern) is here: https://9p.io/sources/extra/pq.tgz I’ve done two updated releases since that, but both are wonky in different ways; I still hope to get back to cleaning them up. I have done a lot of work with pq and really like it. One nice property is that you can treat it as just a front-end for plain files (the data file indexing is optional, and the data files themselves are just character-separated text)[…]
The previous maintainer also mentions the difference between pq and SQL:
[…] I’ve done a lot of SQL. They’re very different models. I think the biggest difference, from a language/interface perspective, is that in pq the server knows the relationships between items (we call it an “implicit relational database”), rather than you having to tell it. If you’re familiar with SQL, you can think of it as omitting the FROM and JOIN clauses, leaving just SELECT and WHERE. A bit more constrained setting up the schema initially, but all your queries become faster. The data store being just plain text files, and the very modular structure, is a big deal, too. You could easily put a pq interface on /etc/passwd, for example, and integrate that with the rest of your schema. pq is very plan9-ish in that the module interface is just open/close/read/write, and anything can implement that. So modules can query network resources, other servers, data files, or synthesize values, and you can mix that all together arbitrarily.
My first thought is that this sounds a bit like a predecessor to duckdb, a database where you can query arbitrary files. Or maybe it is like recutils, a plain-text database engine with a human-readable and writable file format? Unlike both tools, pq carries the minimalism and brevity that you find everywhere in plan9; the source code seems to sit at around 5000 lines. And unlike recutils, pq comes with the ability to build an indexing layer for faster queries, and it can query multiple sources and join them together.
Unfortunately, I could not get pq to compile on Debian, not even with plan9port. So I had to boot up plan9 in a VM and compile it there. After some fumbling with the esoteric rio interface, I somehow managed to manually modify and move some files, and I finally got it to run.

There are some files to be aware of in pq. The primary file you will reference in queries is the dispatch file, which lists the queryable attributes and where to find them. The program comes with a basic example database of phone numbers and home addresses, including this example dispatch file:
> +ssn first middle last street street2 city state zip
ev /usr/glenda/pq/example/name
> +ssn number type
ev /usr/glenda/pq/example/phone
opt call host.com
The example readme tells us that the above dispatch file has relations for two directories, plus a pointer to a host for attributes that cannot be found locally. I had to remove the opt call host.com line to get the examples to run.
To query with pq, we invoke the pq command-line tool with a module argument, which here is the dispatch file:
% pq -m 'join /usr/glenda/pq/example/dispatch' -o '%number' 'last=Smith|type=Home'
+1 204 567 8900
Here pq has performed a join to find the phone number of a user in the phone ev database. It’s not explicitly stated in the manual, but I guessed that the + in +ssn marks the field that can be used to join against other files. The -o parameter specifies what you want the output to look like, so you can do something like -o 'your number is %number'.
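To make the implicit join concrete, here is a rough Python sketch of the idea (pq itself is C, and this is my reconstruction, not its actual code). The phone records come from pq’s bundled example; the name records are made up to match the example query:

```python
# Sketch of an "implicit relational database" query: the schema already
# knows that name and phone records share the ssn key, so the query only
# needs the SELECT (-o) and WHERE parts -- no FROM or JOIN clauses.

NAME_RECORDS = [  # (ssn, first, last) -- invented to match the example
    ("111-22-3333", "John", "Smith"),
    ("123-45-6789", "Jane", "Doe"),
]
PHONE_RECORDS = [  # (ssn, number, type) -- from pq's example data file
    ("111-22-3333", "+1 204 567 8900", "Home"),
    ("111-22-3333", "+1 204 567 2345", "Work"),
    ("123-45-6789", "+1 303 456 5432", "Work"),
]

def query(last, phone_type):
    """Answer a 'last=Smith|type=Home' style query. The join on ssn is
    implicit: the caller never names the tables or the join key."""
    ssns = {ssn for ssn, _first, l in NAME_RECORDS if l == last}
    return [num for ssn, num, t in PHONE_RECORDS
            if ssn in ssns and t == phone_type]

print(query("Smith", "Home"))  # ['+1 204 567 8900']
```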
ev in the dispatch file denotes a module. A module is a particular IRDB directory or network address, and ev is a module that stores data (I’ll get to it later). Other modules include call, which connects to a network address, and opt, which appeared in the dispatch file above. The manual is a bit unclear about what opt does: one source defines it as “Optimize[s] the performance of the module argument”, while another manual states that it acts as a pointer for finding non-local attributes. So I don’t quite understand what it is for, but my rough guess is that it’s used to locate attributes. At a high level, the dispatch file describes which columns are available and where to find them. I imagine it’d be possible to extend pq with more modules, for example to support encrypted data or other file formats and protocols.
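The maintainer’s remark that “the module interface is just open/close/read/write” suggests a shape like the following. This is a hypothetical Python sketch of that interface; the names (Module, ListModule) are mine, not pq’s:

```python
# Hypothetical sketch of a pq-style module interface: anything that can
# open/close/read/write can be a module -- a data file, a network
# server, or something that synthesizes values.

from abc import ABC, abstractmethod

class Module(ABC):
    @abstractmethod
    def open(self) -> None: ...

    @abstractmethod
    def close(self) -> None: ...

    @abstractmethod
    def read(self, query: str) -> list[str]: ...

    @abstractmethod
    def write(self, record: str) -> None: ...

class ListModule(Module):
    """Toy module backed by an in-memory list of '|'-separated records."""

    def __init__(self, records):
        self.records = list(records)
        self.is_open = False

    def open(self):
        self.is_open = True

    def close(self):
        self.is_open = False

    def read(self, query):
        # Crude WHERE clause: 'attr=value' matches if value equals any field.
        _attr, _, want = query.partition("=")
        return [r for r in self.records if want in r.split("|")]

    def write(self, record):
        self.records.append(record)
```

A module wrapping /etc/passwd, a network socket, or a synthesized clock value would implement the same four methods.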
Going back to the ev module: an ev module contains a data file, a proto file, and optionally an index file for efficiency. The data file is similar to a CSV file without a header, with an arbitrary character separating each column. Here’s an example:
111-22-3333|+1 204 567 8900|Home
111-22-3333|+1 204 567 2345|Work
111-22-3333|+1 204 567 6666|Mobile
123-45-6789|+1 303 456 5432|Work
123-45-6789|+1 303 456 1234|Mobile
I was surprised that you don’t define the separator directly in the dispatch file, because pq works with any character-separated plain-text file. If your data has a different separator than |, you specify it on invocation with the -s parameter.
The proto file contains the database attributes and is used to create the index files to resolve runtime queries. For the above data, the example looks like this:
ssn 1 . . e . .
number 2 . . e . .
type 3 . . e . .
The minimalist in me finds the proto file format nice to look at, and I suppose it is nice that pq is prudent enough not to overwhelm me with verbosity. But I must also admit that the format is a bit cryptic, and I don’t know what I’m looking at. Luckily the manual has my back: these are space- or tab-separated columns, and here is what each column corresponds to:
- Attribute name.
- Attribute address: the position of the attribute in the data file.
- Primary and secondary length (columns 3 and 4): used for indexing, giving the length of entries in the index file. Denoted with a dot if index files are not used.
- Match: a single character representing the matching algorithm used for this attribute in queries. There are 4 available algorithms.
- Multiple: denotes when a field is a multiple field.
- Default index.
So the proto file in the example doesn’t seem to have an index generated for it, which is why it has so many dots. Indices are generated using pqgen and need to be regenerated each time you change the data file. I had to dig into the code to understand how the indexing works, since the manual page for pqgen was missing, but it essentially seems to be a two-layer binary file used for faster look-ups. I could not figure out how to use pqgen; the manual says you just point it at an ev database directory, but that didn’t seem to work.
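To illustrate the idea behind such an index (this is NOT pq’s actual on-disk format, which I haven’t fully decoded), here is a Python sketch: map an attribute’s value to the byte offsets of matching records, so a query can seek straight to them instead of scanning the whole data file:

```python
# Toy offset index over a character-separated data file: one pass maps
# each value of a chosen column to the offsets of its records, and
# look-ups seek to those offsets instead of scanning every line.

import io

DATA = (
    "111-22-3333|+1 204 567 8900|Home\n"
    "111-22-3333|+1 204 567 2345|Work\n"
    "123-45-6789|+1 303 456 5432|Work\n"
)

def build_index(data, field, sep="|"):
    """value of column `field` -> list of record start offsets."""
    index, offset = {}, 0
    for line in data.splitlines(keepends=True):
        key = line.rstrip("\n").split(sep)[field]
        index.setdefault(key, []).append(offset)
        offset += len(line)
    return index

def lookup(data, index, key):
    """Seek directly to each matching record; no full scan."""
    f = io.StringIO(data)
    out = []
    for off in index.get(key, []):
        f.seek(off)
        out.append(f.readline().rstrip("\n"))
    return out

idx = build_index(DATA, field=0)  # index on the ssn column
print(lookup(DATA, idx, "123-45-6789"))
```

Like pq’s real index, this one goes stale the moment the data file changes and has to be rebuilt.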
There is also a pqsrv command that takes a module parameter and sets up a server that handles incoming pq requests. I imagine you’d set up such a server and then reference it in the dispatch file to query across different systems.
As far as I can tell, pq does not support writing data, which would’ve been really cool. There is also a note that index generation doesn’t support real-time updates, so you must regenerate the index by hand whenever the file changes. It might therefore be slow for data that changes often.
I decided to end it there with my initial exploration. I am not entirely sure how I’d use pq. It seems like the system could potentially be a cool minimal tool to query system data across many servers or microservices. With some extensions, maybe it could also serve as a lightweight alternative to duckdb. pq also reminds me a bit of GraphQL, but without mutations.
I think plain-text data is quite interesting, as it’s easy to manipulate with basically any tool you want. I like that unlike recfiles, pq has a way to index the data for faster look-ups, and it’d be interesting to see some performance benchmarks for it.
I don’t think I’ll replace Postgres with pq, but I was left very inspired to maybe attempt to build my own database. I really like the minimalism of pq, and the query language. The dispatch file format is also quite nice with how it shows all the attributes that are available to query, and it seems like it can very easily connect many systems together. It was also fun to be a plan9-archaeologist and explore these older tools.
If I ever were to build my own version of this tool, here are some features I’d look into:
- Support for more protocols.
- Support for more file formats: maybe a recfile-like format, maybe some binary format.
- Support for encrypted data.
- Some way to mutate data.
- Some protocol to update the indices on file changes.
Thanks to anth, the previous maintainer, for guiding me over IRC on how to install pq.