kvspool data stream tools

Back to the kvspool Github page. Back to my other software.

This library is deprecated.

kv-spool ("key-value" spool): a Linux-based C library to stream data between programs as key-value dictionaries. It has tools to pipe the streams around, such as by network pub/sub or tee, to concentrate streams together, and to monitor the consumption status of streams.

kvspool’s niche

Kvspool is a tiny API to stream dictionaries between programs. The dictionaries have text keys and values. The dictionary is a set of key-value pairs.

To use kvspool, two programs open the same spool- which is just a directory. The writer puts dictionaries into the spool. The reader gets dictionaries from the spool. It blocks when it’s caught up, waiting for more data. Like this,

A spool writer and reader

Why did I write kvspool?

I wanted a very simple library that only writes to the local file system, so applications can use kvspool without having to set anything up ahead of time- no servers to run, no configuration files. I wanted no "side effects" to happen in my programs-- no thread creation, no sockets, nothing going on underneath. I wanted fewer rather than more features- taking the Unix pipe as a role model.

Loose coupling

Because the spooled data goes into the disk, the reader and writer are decoupled. They don’t have to run at the same time. They can come and go. Also, if the reader exits and restarts, it picks up where it left off.

Space management

In a streaming data environment, the writer and reader can run for months on end. A key requirement is that we don’t fill up the disk. So, when we make a spool directory, we tell kvspool what its maximum size should be:

% mkdir spool
% kvsp-init -s 1G spool

This configures a maximum of 1 GB of data in the spool. After it reaches that size, it stays that size, by deleting old data as new data comes into the spool. Deletion is done in units that are about 10% of the total spool size.

A reader that’s offline long enough can eventually lose data- that is, miss the opportunity to read it before it’s deleted. Data loss is a deliberate feature - otherwise it’d be necessary to block the writer or use unbounded disk space.

Shared memory I/O

You can locate a spool on a RAM disk if you want the speed of shared memory without true disk persistence- kvspool comes with a ramdisk utility to make one.

Rewind and replay

After data has been read from the spool, it remains in the spool directory until a future time when kvspool deletes it to make room for new data. This is the basis for a handy "rewind and replay" feature:

% kvsp-rewind spool

After a rewind, reading starts at the beginning of the spool. The reader should be restarted for this to take effect. Rewind and replay is a convenient way to develop a program on a consistent source of test data.

Snapshot

A snapshot is just a copy of the spool. You can bring the copy back to a development environment, rewind it, and use it as a consistent source of test data.

% cp -r spool snapshot

Copying a spool of production data lets you develop programs on real data but without needing a writer present- the data has been "canned".

Local and Network Replication

You can also publish a spool over the network, like this:

% kvsp-pub -d spool tcp://192.168.1.9:1110

Now, on the remote computers where you wish to subscribe to the spool, run:

% kvsp-sub -d spool tcp://192.168.1.9:1110

If you give multiple addresses to kvsp-sub, it connects to all of them and concentrates their published output into a single spool.

% kvsp-sub -d spool tcp://192.168.1.9:1110 tcp://192.168.1.10:1111

Kvspool includes a few kinds of replication utilities described below.

The big picture

Before moving on- let’s take a deep breath and recap. With kvspool, the writer is unaware of whether network replication is taking place. The writer just writes to the local spool. We run the kvsp-pub utility in the background; as data comes into the spool, it transmits it on the network.

On the other computer- the receiving side- we run kvsp-sub in the background. It receives the network transmissions, and writes them to its local spool.

Use kvsp-pub and kvsp-sub to maintain a live, continuous replication. As data is written to the source spool, it just "shows up" in the remote spool. The reader and writer are completely uninvolved in the replication process.

Publish and Subscribe

You can run kvsp-sub and kvsp-pub in the background at system boot by setting them up as services or pmtr jobs.

License

Kvspool uses the MIT license.

Getting kvspool

You can clone kvspool from github:

% git clone git://github.com/troydhanson/kvspool.git

To build and install kvspool, you need autotools installed. The configure script looks to see if optional libraries including ZeroMQ, Nanomsg, Jansson and librdkafka re installed. It builds additional utilities if they are present.

% cd kvspool

% ./autogen.sh
% ./configure
% make
% sudo make install

Utilities

Basic

command	example
`kvsp-init`	`kvsp-init -s 1G spool`
`kvsp-status`	`kvsp-status spool`
`kvsp-rewind`	`kvsp-rewind spool`
`kvsp-tee`	`kvsp-tee -s spool copy1 copy2`
`kvsp-concen`	`kvsp-concen -d spool1 -d spool2 spool`
`kvsp-bcat`	`kvsp-bcat -b config -d spool`

The kvsp-init command is used when a spool directory is first created, to set the maximum capacity of the spool. It accepts k/m/g/t suffixes. If kvsp-init is run later, after the spool already exists and has data, it is resized.

Run kvsp-status to see what percentage of the spool has been consumed by a reader. It can take multiple spools as arguments. For each spool it prints a line like:

/tmp/spool            99%        877kb         28secs

The fields show what percentage of the spool has been read, how much data is currently in the spool, and how recently data was written into the spool.

The kvsp-rewind command resets the reader position to the beginning (oldest frame) in the spool. Use this command in order to "replay" the spooled data. Disconnect (terminate) any readers before running this command.

Use kvsp-tee to support multiple readers from one input spool. First make a separate spool directory for each reader (and use kvsp-init to set the capacity of each one); then use kvsp-tee as the reader on the source spool. It maintains a continuous copy of the spool to the multiple destination spools. This command needs to be left running to maintain the tee. The -W flag can be used to run tee in the more efficient raw mode, which copies the binary input to the binary output representation in the spool file. It is possible to filter the incoming data using -k key -r regex options. In this usage the tee only passes a dictionary if it has the key and its value matches regex.

The kvsp-concen utility is the opposite of kvsp-tee. It takes multiple source spools and makes a single output spool from them. It is a spool concentrator. The source spools are flagged with -d spool and the final argument is the output spool.

The kvsp-bcat command operates like kvsp-bpub (see below). It writes the binary encoded spool content to standard output.

Network utilities

command	example
`kvsp-pub`	`kvsp-pub -d spool tcp://192.168.1.9:1110`
`kvsp-sub`	`kvsp-sub -d spool tcp://192.168.1.9:1110`
`kvsp-bpub`	`kvsp-bpub -b cast.cfg -d spool tcp://192.168.1.9:2110`
`kvsp-bsub`	`kvsp-bsub -b cast.cfg -d spool tcp://192.168.1.9:2110`
`kvsp-kkpub`	`kvsp-kkpub -b kafka.host.name -t topic`
`kvsp-tpub`	`kvsp-tpub -b cast.cfg -d spool -p 2110`

The network utilities keep a local spool continuously replicated to a remote spool.

kvsp-pub/kvsp-sub

The utilities kvsp-pub and kvsp-sub exist to publish a source spool to a remote spool. This pair of utilities communicates in JSON over ZeroMQ. The publisher listens on the specified TCP port. The subscriber connects to it. If more than one subscriber connects, each receives a copy of the data. When no subscribers are connected, the publisher drops data. However, when the -s flag is given to both kvsp-pub and kvsp-sub, two things change: the publisher queues data when waiting for a subscriber instead of dropping it; and secondly, if more than one subscriber connects, the data gets divided among them rather than duplicated to all of them.

A subscriber can concentrate data (that is, "fan-in" the data) from many publishers, simply by listing multiple ZeroMQ endpoints on the command line.

kvsp-bpub/kvsp-bsub

The kvsp-bpub and kvsp-bsub utilities implement binary-over-ZeroMQ replication. Their usage is the same as kvsp-pub and kvsp-sub except the required -b cast.cfg argument. This file lists the binary data type for each required key. Here’s an example cast.cfg:

i32 timestamp
str sensor_name
i8 temp

In this example, every dictionary in the source spool is expected to have three keys (timestamp, sensor_name and temp); any other keys get ignored. Their values are parsed to the binary types listed (i32, str and i8). The binary buffer is transmitted without any keys. On the receiving subscriber, the binary is unpacked using the same file. The reason for using binary publishing is speed- it’s often an order of magnitude faster. These data types may appear in cast.cfg

i8        //  byte (8-bit int)
i16       //  short (16-bit int)
i32       //  integer (32-bit int)
ipv4      //  dotted quad IP address
ipv46     //  IPv4 or IPv6 address
mac       //  MAC address (aa:bb:cc:dd:ee:ff)
str       //  string
str8      //  string (max length 255)
d64       //  double (64-bit float)

kvsp-kkpub

Publishes the spool in JSON encoding to a Kakfa topic. This tool requires librdkakfa and Jansson to be installed.

kvsp-tpub

Finally there is a "plain TCP" binary publisher. It has no subscriber counterpart yet, so you have to code your own subscriber to use it. It takes a cast.cfg of the same form as above. It listens on the specified port, and when a subscriber connects to it, the dictionaries in the spool are transmitted as length-prefixed binary messages. The length prefix is a 32-bit integer (host-endianness) specifying the message length that follows. The remaining binary data is transmitted in host-endianness, except IP addresses in network order.

Other utilities

command	example
`kvsp-spr`	`kvsp-spr -B 0 spool`
`kvsp-spw`	`kvsp-spw -i 10 spool`
`kvsp-mod`	`kvsp-mod -k key -o spool2 spool`
`kvsp-speed`	`kvsp-speed`
`ramdisk`	`ramdisk -c -s 1G /mnt/ramdisk`

The kvsp-spr utility is used to manually read a spool and print its frames to the screen. Normally it will block waiting for data once it reaches the end of the spool but the -B 0 (no-block) option tells it to stop reading if the end of the spool is reached.

The kvsp-spw utility is used only for testing; it writes a frame of data to the spool (or several frames if the -i option is used with a count); in the latter mode there is a sleep (delay) between each frame, which can be adjusted using the -d <seconds> option.

The kvsp-mod command "obfuscates" selected values from a source spool to hash values in the output spool (named with the -o option). For each key named with the -k flag, its value in the output spool is replaced with a mathematical hash. The hash numbers preserve consistency (so the same input value produces the same output value) but the value itself is a meaningless number.

A simple benchmark is performed by the kvsp-speed utility to measure read and write performance.

The ramdisk utility creates, queries or unmounts a ramdisk- a Linux tmpfs filesystem. In the form shown in the table above it creates a 1G ramdisk on the /mnt/ramdisk mount point (this directory must already exist). A ramdisk created this way will appear in the /proc/mounts listing. If a ramdisk already exists on that mount point, the command does nothing. In create (-c) mode, the -d <dir> option may be used one or more times to specify (as absolute paths) directories to create within the ramdisk. Using ramdisk -u /mnt/ramdisk unmounts it. The -q option queries a directory to see if its a ramdisk and show its size. The ramdisk utility is included with kvspool because it is often convenient to locate a spool on a ramdisk for performance.

API

C programs must be linked with -lkvspool -lshr.

#include "kvspool.h"
...
void *sp = kv_spoolwriter_new("spool");
void *set = kv_set_new();
kv_adds(set, "day", "Wednesday");
kv_adds(set, "user", "Troy");
kv_spool_write(sp,set);
...
kv_set_free(set);
kv_spoolwriter_free(sp);

In C, a spool is opened for writing by calling kv_spoolwriter_new which takes the spool directory as argument. It returns an opaque handle (which you should eventually free with kv_spoolwriter_free).

Kvspool provides a data structure that implements a dictionary in C. It is created using kv_set_new which returns an opaque handle (which should eventually be freed using kv_set_free). In the meantime, it can be used for the whole lifetime of the program, by adding key-value pairs to it, writing them out, clearing it, and reusing it over and over. To add a key-value pair, use kv_adds which takes the set handle, the key and the value. The key and value get copied into the set (the set does not keep a pointer to them). To write the set into the spool, call kv_spool_write with the spool and set handle. To re-use the set, call kv_set_clear with the set handle. For debugging you can dump a set to stderr using kv_set_dump(set,stderr).

To open a spool for reading, call kv_spoolreader_new which takes the spool directory and returns an opaque handle to the spool. Then call kv_spool_read to read the spool.

void *sp = kv_spoolreader_new("spool");
void *set = kv_set_new();
kv_spool_read(sp,set,1);

The final argument to kv_spool_read specifies whether it should block if there is no data ready in the spool. A positive return value means success (data was read from the spool and it’s been populated into the set). A zero value means that non-blocking mode was used, but no data is currently available in the spool.

A C program can iterate through all the key-value pairs in the result set like this:

kv_t *kv = NULL;
while ( (kv = kv_next(set, kv))) {
  printf("key is %s\n", kv->key);
  printf("value is %s\n", kv->val);
}

You can also fetch a particular key-value pair from the set using kv_get, providing the key as the final argument:

kv_t *kv = kv_get(set, "user");

The number of key-value pairs in the set can be obtained using kv_len:

int count = kv_len(set);

The C API also has a function to rewind the spool, which works if there is no reader that has the spool open at the time. It takes the spool directory as its only argument.

sp_reset(dir);