Back to the kvspool Github page. Back to my other software.
This library is deprecated.
- kv-spool ("key-value" spool)
-
a Linux-based C library to stream data between programs as key-value dictionaries. It has tools to pipe the streams around, such as by network pub/sub or tee, to concentrate streams together, and to monitor the consumption status of streams.
kvspool’s niche
Kvspool is a tiny API to stream dictionaries between programs. The dictionaries have text keys and values. The dictionary is a set of key-value pairs.
To use kvspool, two programs open the same spool- which is just a directory. The writer puts dictionaries into the spool. The reader gets dictionaries from the spool. It blocks when it’s caught up, waiting for more data. Like this,
Because the spooled data goes into the disk, the reader and writer are decoupled. They don’t have to run at the same time. They can come and go. Also, if the reader exits and restarts, it picks up where it left off.
Space management
In a streaming data environment, the writer and reader can run for months on end. A key requirement is that we don’t fill up the disk. So, when we make a spool directory, we tell kvspool what its maximum size should be:
% mkdir spool
% kvsp-init -s 1G spool
This configures a maximum of 1 GB of data in the spool. After it reaches that size, it stays that size, by deleting old data as new data comes into the spool. Deletion is done in units that are about 10% of the total spool size.
A reader that’s offline long enough can eventually lose data- that is, miss the opportunity to read it before it’s deleted. Data loss is a deliberate feature - otherwise it’d be necessary to block the writer or use unbounded disk space.
Shared memory I/O
You can locate a spool on a RAM disk if you want the speed of shared memory without true
disk persistence- kvspool comes with a ramdisk
utility to make one.
Rewind and replay
After data has been read from the spool, it remains in the spool directory until a future time when kvspool deletes it to make room for new data. This is the basis for a handy "rewind and replay" feature:
% kvsp-rewind spool
After a rewind, reading starts at the beginning of the spool. The reader should be restarted for this to take effect. Rewind and replay is a convenient way to develop a program on a consistent source of test data.
Snapshot
A snapshot is just a copy of the spool. You can bring the copy back to a development environment, rewind it, and use it as a consistent source of test data.
% cp -r spool snapshot
Copying a spool of production data lets you develop programs on real data but without needing a writer present- the data has been "canned".
Local and Network Replication
You can also publish a spool over the network, like this:
% kvsp-pub -d spool tcp://192.168.1.9:1110
Now, on the remote computers where you wish to subscribe to the spool, run:
% kvsp-sub -d spool tcp://192.168.1.9:1110
If you give multiple addresses to kvsp-sub
, it connects to all of them and concentrates
their published output into a single spool.
% kvsp-sub -d spool tcp://192.168.1.9:1110 tcp://192.168.1.10:1111
Kvspool includes a few kinds of replication utilities described below.
You can run kvsp-sub
and kvsp-pub
in the background at system boot by setting
them up as services or pmtr jobs.
License
Kvspool uses the MIT license.
Getting kvspool
You can clone kvspool from github:
% git clone git://github.com/troydhanson/kvspool.git
To build and install kvspool, you need autotools installed. The configure script looks to see if optional libraries including ZeroMQ, Nanomsg, Jansson and librdkafka re installed. It builds additional utilities if they are present.
% cd kvspool
% ./autogen.sh
% ./configure
% make
% sudo make install
Utilities
Basic
command | example |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
The kvsp-init
command is used when a spool directory is first created, to set
the maximum capacity of the spool. It accepts k/m/g/t suffixes. If kvsp-init
is
run later, after the spool already exists and has data, it is resized.
Run kvsp-status
to see what percentage of the spool has been consumed by a reader. It
can take multiple spools as arguments. For each spool it prints a line like:
/tmp/spool 99% 877kb 28secs
The fields show what percentage of the spool has been read, how much data is currently in the spool, and how recently data was written into the spool.
The kvsp-rewind
command resets the reader position to the beginning (oldest frame) in the
spool. Use this command in order to "replay" the spooled data. Disconnect (terminate) any
readers before running this command.
Use kvsp-tee
to support multiple readers from one input spool. First make a separate
spool directory for each reader (and use kvsp-init
to set the capacity of each one);
then use kvsp-tee
as the reader on the source spool. It maintains a continuous copy of
the spool to the multiple destination spools. This command needs to be left running to
maintain the tee. The -W
flag can be used to run tee in the more efficient raw mode,
which copies the binary input to the binary output representation in the spool file.
It is possible to filter the incoming data using -k key -r regex
options. In this
usage the tee only passes a dictionary if it has the key and its value matches regex.
The kvsp-concen
utility is the opposite of kvsp-tee
. It takes multiple source
spools and makes a single output spool from them. It is a spool concentrator. The
source spools are flagged with -d spool
and the final argument is the output spool.
The kvsp-bcat
command operates like kvsp-bpub
(see below). It writes the binary
encoded spool content to standard output.
Network utilities
command | example |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
The network utilities keep a local spool continuously replicated to a remote spool.
kvsp-pub/kvsp-sub
The utilities kvsp-pub
and kvsp-sub
exist to publish a source spool to a remote spool.
This pair of utilities communicates in JSON over ZeroMQ. The publisher listens on the
specified TCP port. The subscriber connects to it. If more than one subscriber connects,
each receives a copy of the data. When no subscribers are connected, the publisher drops
data. However, when the -s
flag is given to both kvsp-pub
and kvsp-sub
, two things
change: the publisher queues data when waiting for a subscriber instead of dropping it;
and secondly, if more than one subscriber connects, the data gets divided among them
rather than duplicated to all of them.
A subscriber can concentrate data (that is, "fan-in" the data) from many publishers, simply by listing multiple ZeroMQ endpoints on the command line.
kvsp-bpub/kvsp-bsub
The kvsp-bpub
and kvsp-bsub
utilities implement binary-over-ZeroMQ replication. Their
usage is the same as kvsp-pub
and kvsp-sub
except the required -b cast.cfg
argument.
This file lists the binary data type for each required key. Here’s an example cast.cfg
:
i32 timestamp
str sensor_name
i8 temp
In this example, every dictionary in the source spool is expected to have three keys
(timestamp, sensor_name and temp); any other keys get ignored. Their values are parsed
to the binary types listed (i32, str and i8). The binary buffer is transmitted
without any keys. On the receiving subscriber, the binary is unpacked using the same file.
The reason for using binary publishing is speed- it’s often an order of magnitude faster.
These data types may appear in cast.cfg
i8 // byte (8-bit int)
i16 // short (16-bit int)
i32 // integer (32-bit int)
ipv4 // dotted quad IP address
ipv46 // IPv4 or IPv6 address
mac // MAC address (aa:bb:cc:dd:ee:ff)
str // string
str8 // string (max length 255)
d64 // double (64-bit float)
kvsp-kkpub
Publishes the spool in JSON encoding to a Kakfa topic. This tool requires librdkakfa and Jansson to be installed.
kvsp-tpub
Finally there is a "plain TCP" binary publisher. It has no subscriber counterpart yet, so you have to code your own subscriber to use it. It takes a cast.cfg of the same form as above. It listens on the specified port, and when a subscriber connects to it, the dictionaries in the spool are transmitted as length-prefixed binary messages. The length prefix is a 32-bit integer (host-endianness) specifying the message length that follows. The remaining binary data is transmitted in host-endianness, except IP addresses in network order.
Other utilities
command | example |
---|---|
|
|
|
|
|
|
|
|
|
|
The kvsp-spr
utility is used to manually read a spool and print its frames to the
screen. Normally it will block waiting for data once it reaches the end of the spool but
the -B 0
(no-block) option tells it to stop reading if the end of the spool is reached.
The kvsp-spw
utility is used only for testing; it writes a frame of data to the spool
(or several frames if the -i
option is used with a count); in the latter mode there is a
sleep (delay) between each frame, which can be adjusted using the -d <seconds>
option.
The kvsp-mod
command "obfuscates" selected values from a source spool to hash values
in the output spool (named with the -o
option). For each key named with the -k
flag,
its value in the output spool is replaced with a mathematical hash. The hash numbers
preserve consistency (so the same input value produces the same output value) but the
value itself is a meaningless number.
A simple benchmark is performed by the kvsp-speed
utility to measure read and write
performance.
The ramdisk
utility creates, queries or unmounts a ramdisk- a Linux tmpfs filesystem.
In the form shown in the table above it creates a 1G ramdisk on the /mnt/ramdisk
mount
point (this directory must already exist). A ramdisk created this way will appear in the
/proc/mounts
listing. If a ramdisk already exists on that mount point, the command does
nothing. In create (-c
) mode, the -d <dir>
option may be used one or more times to
specify (as absolute paths) directories to create within the ramdisk. Using ramdisk -u
/mnt/ramdisk
unmounts it. The -q
option queries a directory to see if its a ramdisk
and show its size. The ramdisk
utility is included with kvspool because it is often
convenient to locate a spool on a ramdisk for performance.
API
C programs must be linked with -lkvspool -lshr.
#include "kvspool.h" ... void *sp = kv_spoolwriter_new("spool"); void *set = kv_set_new(); kv_adds(set, "day", "Wednesday"); kv_adds(set, "user", "Troy"); kv_spool_write(sp,set); ... kv_set_free(set); kv_spoolwriter_free(sp);
In C, a spool is opened for writing by calling kv_spoolwriter_new
which takes the spool
directory as argument. It returns an opaque handle (which you should eventually free with
kv_spoolwriter_free
).
Kvspool provides a data structure that implements a dictionary in C. It is created using
kv_set_new
which returns an opaque handle (which should eventually be freed using
kv_set_free
). In the meantime, it can be used for the whole lifetime of the program, by
adding key-value pairs to it, writing them out, clearing it, and reusing it over and over.
To add a key-value pair, use kv_adds
which takes the set handle, the key and the value.
The key and value get copied into the set (the set does not keep a pointer to them).
To write the set into the spool, call kv_spool_write
with the spool and set handle.
To re-use the set, call kv_set_clear
with the set handle. For debugging you can dump a
set to stderr using kv_set_dump(set,stderr)
.
To open a spool for reading, call kv_spoolreader_new
which takes the spool directory and
returns an opaque handle to the spool. Then call kv_spool_read
to read the spool.
void *sp = kv_spoolreader_new("spool"); void *set = kv_set_new(); kv_spool_read(sp,set,1);
The final argument to kv_spool_read
specifies whether it should block if there is no
data ready in the spool. A positive return value means success (data was read from the
spool and it’s been populated into the set). A zero value means that non-blocking mode
was used, but no data is currently available in the spool.
A C program can iterate through all the key-value pairs in the result set like this:
kv_t *kv = NULL; while ( (kv = kv_next(set, kv))) { printf("key is %s\n", kv->key); printf("value is %s\n", kv->val); }
You can also fetch a particular key-value pair from the set using kv_get
, providing
the key as the final argument:
kv_t *kv = kv_get(set, "user");
The number of key-value pairs in the set can be obtained using kv_len
:
int count = kv_len(set);
The C API also has a function to rewind the spool, which works if there is no reader that has the spool open at the time. It takes the spool directory as its only argument.
sp_reset(dir);