HFT

My time at my previous company made me realise that, within software, I have a thing for creative designs and abstractions.

I worked on two low-latency C++ projects:

With HFIR, I spent 3.5 months learning C++ from scratch, continually reading about HFT infrastructure on Reddit, and using my intuition to experiment with designs and speed up the algo, one part at a time. Then, in the space of 3 weeks, I started exch entirely by myself, in my vision of a well-organised, technical-debt-free C++ Linux project. This is the kind of stuff I got up to.

The code snippets are unmodified except for the Bitcoin exchange name I made up.

Visual Studio, this ain’t. Countless hours had accumulated fixing compilation errors caused by a bad setting in the VS project properties, or by not noticing you’re editing the wrong build config. There were hundreds of property fields, about three of which were relevant for including an external library – which itself had to be downloaded with vcpkg or nuget or… copy+paste?

The basic idea here is to have transparent and minimal source code and setup instructions, with no magical static files – small and Git-friendly. Then you can generate a dynamic build directory, anyone else can too, and everyone will always get the same thing. CMake is hard to learn, but its files are minimal and self-documenting – they answer “what exactly is required to build this?”.

At the top level:

# == global setup ==
cmake_minimum_required(VERSION 3.12)
project(EXCH)                                           # corresponds to VS solution
set(CMAKE_CXX_STANDARD 17)
set(CMAKE_DISABLE_IN_SOURCE_BUILD ON)                   # builds everything into top-level build folder instead of all over source tree
include_directories(.)                                  # lets us include headers from source tree (e.g. config/, common/ folders)

# == external library interfaces ==                     # subordinate CMake files to package external libraries with their dependencies, ready for linking
include(cmake-external/openssl.cmake)                   # ext-openssl
include(cmake-external/g3log.cmake)                     # ext-g3log
include(cmake-external/simple-websocket-server.cmake)   # ext-simple-websocket-server

# == projects to build ==                               # comment out to skip
add_subdirectory(ws)
add_subdirectory(test)
add_subdirectory(json)

For an external library:

# packages g3log alongside its dependency (the native Threads library)

# dependency installation: none
# self installation:
#   1. clone https://github.com/KjellKod/g3log
#   2. in that root directory, run "mkdir build" then "cd build" then "cmake .." then "make install"
# * if permissions error, run "sudo chown -R $(whoami) /usr/local" to take ownership of local installation folder and try again

include_guard(GLOBAL)
add_library(ext-g3log INTERFACE)
find_package(Threads REQUIRED)
target_link_libraries(ext-g3log INTERFACE ${CMAKE_THREAD_LIBS_INIT})
target_link_libraries(ext-g3log INTERFACE g3logger)

And for our own projects, ez:

add_library(WebSocketClient STATIC WebSocketClient.cpp)
target_link_libraries(WebSocketClient ext-simple-websocket-server)
target_link_libraries(WebSocketClient ext-g3log)

add_executable(ws_ demo.cpp)
target_link_libraries(ws_ WebSocketClient)
target_link_libraries(ws_ ext-g3log)

Check out the selective linking of cpp files as well! It’s really easy to quickly write an informal test script (“demo.cpp”) and compile it against the rest of the project, and then retain it for later testing use, knowing it’s just one file you can switch in/out of the build process. If I write some mad-optimised .cpp, I can keep the original comprehensible version, use a common header, and just switch the .cpp in CMake.

add_executable(json_ 
    demo.cpp
    JSONEncoder.cpp
)
target_link_libraries(json_ ext-g3log)
target_link_libraries(json_ ext-openssl)

This was my 2nd iteration of a CMake project, the first being in VS on Windows, which was a little buggier. CMake is still syntactically terrible – e.g. the arbitrary, inconsistently-named variables set by find_package – and I once had it randomly stop finding Boost until I restarted the OS, but it’s a step in the right direction.

This took 5h to figure out… I had the conviction that this is the way you’re supposed to do it in idiomatic C++, but given how long it took, it arguably wound up as premature optimisation. It’s a shame idiomatic C++ is so hard and disgusting.

We want to expose an exchange’s API endpoints as functions in our exchange class – a conversion layer. All messages conform to JSON-RPC, so only the internal params field depends on per-endpoint arguments. method identifies the endpoint, and the request needs to be signed with hashed credentials iff it’s to a “private” endpoint.

Code duplication vs dynamic data structures.

A basic encode(method, params) method avoids code duplication. We could switch on method into different blocks dealing with different endpoints, but params varies, so it would need to be some dynamic map-like type, which circumvents the type system and makes a mess of passing data around in slow, dynamic data structures. Would we have to make a function table if we didn’t want to put every params-encoder variant inline? And have lots of if statements distinguishing public/private at each point where we need to work the signing algorithm?
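
For illustration, that rejected dynamic version might look something like this (a sketch of the anti-pattern, not real project code – the names and types are my assumptions):

// sketch of the rejected approach: params degenerates into a heap-allocated variant-map
#include <map>
#include <string>
#include <type_traits>
#include <variant>

using Params = std::map<std::string, std::variant<int, double, std::string>>;

std::string encode_dynamic(std::string const& method, Params const& params) {
    std::string out = "{\"jsonrpc\":\"2.0\",\"method\":\"" + method + "\",\"params\":{";
    for (auto const& [key, value] : params) {               // per field: tree walk + variant dispatch + string churn
        out += "\"" + key + "\":";
        std::visit([&out](auto const& v) {
            if constexpr (std::is_same_v<std::decay_t<decltype(v)>, std::string>) out += "\"" + v + "\"";
            else out += std::to_string(v);
        }, value);
        out += ",";
    }
    if (!params.empty()) out.pop_back();                    // drop trailing comma
    return out + "}}";                                      // and signing would still need an if-chain per call-site
}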

Writing specialised functions keeps them lean – we pick the bits we need – but… all the code is duplicated. Targeting an ever-changing API is hard enough without our own inconsistency.

Seems like a job for the compiler.

This is what I got. Minimal header: the public encode method we expose will call _resolve_body and _resolve_sig. encode and _resolve_body are variadic templates – we specialise them not only on the endpoint (which is just an enum) but on the arbitrary call signature of whatever data _resolve_body needs, which gets forwarded from encode.

class JSONEncoderBitypto : public JSONEncoder {
private:
    std::string _sig_raw;                                                                                   // a buffer to store the generated signature of a message, before it's encoded; unbounded length (on heap), so allocated once here
    template <BityptoEndpoint, typename... Args> void _resolve_body(Args const&... args) {};                // called within encode method to encode the payload ("params")
    void _resolve_sig(char* buffer, int pos_m_begin, int pos_m_end, int pos_p_begin, int pos_p_end);        // called within encode method to encode the signature ("sig"); requires an output buffer, 4 ints to index the rest of the JSON, and writes to _sig_raw
public:
    JSONEncoderBitypto() : JSONEncoder() { _sig_raw.reserve(128); }
    template <BityptoEndpoint, typename... Args> const char* encode(int id, Args const&... args);           // encodes a JSON message (which lives in _output) and returns a pointer to _output's c-string
    void clear() {JSONEncoder::clear(); _sig_raw.clear(); }                                                 // overrides base method
};

The implementation relies on constexpr to evaluate code-branching of private (signed) vs public (not) instances of encode, determined from the endpoint at compile-time.

// helper function to find values for "method" field (API endpoint URL) from the enum of endpoints.
constexpr const char* method(BityptoEndpoint endpoint) {
    switch (endpoint) {
        case BityptoEndpoint::Auth:         return "public/auth";
        case BityptoEndpoint::Logout:       return "private/logout";
...
template <BityptoEndpoint endpoint, typename... Args> const char* JSONEncoderBitypto::encode(int id, Args const& ... args) {
    // private endpoints need to be signed; we evaluate this at compile-time per template instance using:
    // - constexpr variables (guaranteed to be evaluated by compiler)
    // - constexpr functions (can be evaluated by compiler, so get them to return to constexpr variables)
    // - if constexpr (compiler discards code if false)
    constexpr const char* endpoint_text(method(endpoint));
    constexpr bool sign = (endpoint_text[1] == 'r');        // should the request be signed? yes iff private (= 2nd char is 'r')
    int pos_m_begin, pos_m_end, pos_p_begin, pos_p_end;     // have to declare all variables whether signing or not

    _writer.StartObject();
        _writer.Key("jsonrpc");         _writer.String("2.0");
        _writer.Key("id");              _writer.Int(id);
        if constexpr (sign) { pos_m_begin = _output.GetSize(); }
        _writer.Key("method");          _writer.String(endpoint_text);
        if constexpr (sign) { pos_m_end = _output.GetSize(); }
        _writer.Key("params");
            _writer.StartObject();
                if constexpr (sign) { pos_p_begin = _output.GetSize(); }
                _resolve_body<endpoint>(args...);
                if constexpr (sign) { pos_p_end = _output.GetSize(); }
            _writer.EndObject();
        if constexpr (sign) {
            char sig[128]; _resolve_sig(sig, pos_m_begin, pos_m_end, pos_p_begin, pos_p_end); // should be length 71, but unsafe against overflow
            _writer.Key("sig");         _writer.String(sig, 71);
        }
    _writer.EndObject();

    return _output.GetString();
}

To keep the hpp/cpp separation (for interface/implementation distinction and faster building), we need to explicitly instantiate this template, so the compiler knows which ones it needs to compile.

template const char* JSONEncoderBitypto::encode<BityptoEndpoint::Auth>(int id);
template const char* JSONEncoderBitypto::encode<BityptoEndpoint::Buy, ProtoOrder>(int id, ProtoOrder const& order);
template const char* JSONEncoderBitypto::encode<BityptoEndpoint::Sell, ProtoOrder>(int id, ProtoOrder const& order);
template const char* JSONEncoderBitypto::encode<BityptoEndpoint::Cancel, std::string>(int id, std::string const& orderId);
template const char* JSONEncoderBitypto::encode<BityptoEndpoint::CancelAll>(int id);

And now, we can specialise all of the _resolve_body instances (which override the default empty definition in the header).

template <> void JSONEncoderBitypto::_resolve_body<BityptoEndpoint::Auth>() {
    _writer.Key("grant_type");          _writer.String("client_credentials");
    _writer.Key("scope");               _writer.String("session:oi");
    _writer.Key("client_id");           _writer.String(Keys::Bitypto::api_key.c_str());
    _writer.Key("client_secret");       _writer.String(Keys::Bitypto::api_secret.c_str());
}
template <> void JSONEncoderBitypto::_resolve_body<BityptoEndpoint::Buy>(ProtoOrder const& order) {
    _writer.Key("amount");              _writer.Int(std::abs(order.quantity));
    _writer.Key("instrument_name");     _writer.String(!std::strcmp(order.symbol, "bp") ? "BTC-PERPETUAL" : "???");
    _writer.Key("post_only");           _writer.Bool(true);
    _writer.Key("price");               _writer.Double(order.price);
    _writer.Key("type");                _writer.String(order.type == 'l' ? "limit" : "market");
}
...
template <> void JSONEncoderBitypto::_resolve_body<BityptoEndpoint::Cancel>(std::string const& orderId) {
    _writer.Key("order_id");             _writer.String(orderId.c_str());
}
template <> void JSONEncoderBitypto::_resolve_body<BityptoEndpoint::CancelAll>() {
    _writer.Key("type");                _writer.String("all");
}

It’s pretty heavy. There’s not really enough syntactic distinction between compile-time and runtime code in C++; the template syntax uses all these random symbols and orderings; there’s inconsistent use of qualifiers (const&)… it sucks. But the code is clear, hopefully enough to cover most further development needs, and I tried my best to explain every syntactic and theoretical detail in further comments, for when it’s not.

I’m def a fan of RAII. Managing your program’s data/object/memory allocation using scopes and stack intuition extends how we’re used to seeing function calls. It’s very predictable where objects end up in memory, what their lifecycles are, and it makes it easy to take care of memory (avoiding leaks) and push towards optimising for cache locality (just intuitively; I’ve not tried any scientific approach or memory-related compiler intrinsics).

I haven’t tried to set up a debugger on Ubuntu, but I managed to solve two memory violations in my WebSocket client just by thinking about RAII.

1: Destruction of WebSocket thread

The client always seg-faulted on exit. I originally had a trivial destructor, meaning the background thread that received messages and ran callbacks would always end up doing something weird. Easy fix: set a flag that the thread checks before each reconnect, trigger a disconnect by sending a close message, then await the thread’s termination.

WebSocketClient::~WebSocketClient() {
    active = false;                                                 // when run() returns, manage() will check this and exit
    bool status = close(1000);                                      // status = was there a connection / did anything happen?
    if (_thread.joinable()) {_thread.join();}                       // awaits thread if it had been run (and so is joinable)
    LOGG(_log_id, INFO, "Ended thread; close sent = %d", status);
}

2: But the callback…!

Destroying the app (which uses the WebSocket client) caused a seg-fault any time a message came in during destruction – and indeed, it happened in the callback. It’s up to the app to provide a callback for the WebSocket to run on received messages. It’s a part of the app, with access to all its private variables…

Well, in what is just one of those C++ things, a consequence of good stack-based RAII is that a class’s members are destroyed in the opposite order to their declaration. Which is nice! You know the exact process that happens on class birth+death. But it’s a bit of a low-level detail… I should be able to declare things in the app in whichever order I want, right?

Not, say, declare the WebSocket client before some random string that’s set by the callback, and then have the callback, running concurrently with the destruction, violate memory cos the string was gone. Whoopsie. Tbh, RAII is what made this easy to spot. To fix it, we could swap the declarations, but that sucks: the ordering is arbitrary, not a constraint the programmer should ideally have to think about. Alas, I spotted it late and couldn’t think of a way around it in time.
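
A toy program (just printf-members, nothing from the real classes) demonstrates the hazard:

#include <cstdio>

struct Member {
    const char* name;
    explicit Member(const char* n) : name(n) {}
    ~Member() { std::printf("destroying %s\n", name); }
};

struct App {
    Member W{"W (the WebSocket client)"};       // declared first => destroyed last
    Member order_id{"order_id"};                // declared second => destroyed first
};

int main() { App app; }     // prints "destroying order_id", then "destroying W (the WebSocket client)"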

2+: Better solution?

What I’d like to do is force a compile error if the declarations are mis-ordered. For starters, I should rewrite the callback as a native lambda, rather than a lambda wrapping a method, as I had done:

class Bitypto {
private:
    WebSocketClient W;
...
public:
    std::string order_id;                   // each placed order ID gets saved     
    Bitypto() :
        W(url, [this](std::string const& text){ return this->receive(text); }, 'd')
    {
        W.start();                          // start WebSocket connection manager (runs background thread; connects; reconnects on failure)
        LOGG("c0", INFO, "Bitypto started.");
    }

The callback would then be viewed as an object, bringing its definition under the flow of the constructor (of the app, Bitypto), RAII-style. Now, if we fully define the callback in the constructor’s initialiser list (ew – maybe use a helper function that requires all the pieces as args), the callback definition would hopefully cause a compile error if all the variables it needs haven’t been constructed as part of that very same initialiser list. Seems like wishful thinking, though. Maybe some things would have to be defined on declaration? A curious unsolved problem.

In the quest for the fewest allocations possible, I had an idea. HFIR’s arbitrage strategy needed to react to market data and send orders via FIX asap, but these messages were almost identical to each other. By padding some zeros, I could force all the FIX messages to use the same length per field, and so allocate one buffer to handle everything for multiple messages, updating it from structs of new data and running a char pointer over it to send out the message on TCP. My project was for one message, but easily extensible to multiple.

The simple way to do a fixed-length string in C++ is with a stack-allocated char array. If there’s a stdlib or Boost type I should have used instead, it wasn’t clear to me. We’ll be doing direct substring editing with pointer arithmetic, so unit-testing is imperative and TDD is helpful; there’s a parallel write-up of how I tested this project under the Testing section.

Two sections:

  • Editing the buffer fast: some bread-and-butter algorithmic thinking, cutting down on work and such.
  • Sending by bypassing the default QuickFIX method: some fun hacking, exposing internal methods, co-opting a mutex, writing socket code.

Initialisation

A simple start is to initialise the string with a default and have some enums specifying field types and lengths, to make future code legible and uncluttered with random numbers.

class FIXSendArbSingle {
public:
	static const int L = 262;
private:
	char _buffer[L + 1] = "";
...
FIXSendArbSingle::FIXSendArbSingle(bool FE_enabled)
	: FE(FE_enabled)
{
	std::vector<std::string> string_template = {
		// general header
		"8=FIX.4.2",						// BeginString
		"9=239",							// BodyLength
		"35=D",								// MsgType			= NewOrderSingle
		// D message header
		"34=00000",							// MsgSeqNum
		"49=20290",							// SenderCompID
		"52=00000000-00:00:00.000",			// SendingTime
...
	};

	for (std::string& item : string_template) {
		strcat_s(_buffer, L + 1, item.c_str());
		strcat_s(_buffer, L + 1, "\x1");
	}
}
// the things to be edited, with their initial position
enum CharIndex {
	I_MsgSeqNum = 24,		// ddddd
	I_SendingTime = 42,		// yyyymmdd-hh:mm:ss.mmm
	I_ClOrdID = 111,		// dddddddddddd
	I_OrderQty = 132,		// ddd
	I_Price = 144,			// dddd.ddd
...
enum CharSize {
	S_MsgSeqNum = 5,
...

Now, methods to set the data. Which comes in many forms:

  • Live: incoming orders from an Order struct: OrderQty, Price, Side, Symbol (= market ID);
  • Just-in-time: the timestamp of the message (SendingTime, TransactTime) and the sequence number, MsgSeqNum;
  • Dynamic: the CheckSum, computed from the rest of the message (BodyLength, by contrast, is constant!);
  • Prepared: the ClOrdID (order ID) can be set after a send, to prepare for the next one, off the critical path. Everything else can just be left alone and forgotten about, staying in memory with no management overhead.

Timestamps

We have an initialisation method, set_timestamps_initial(), to set the date, which persists through an instance (whose lifetime is bounded by a trading day). The timestamp is only needed for FIX validation, so it can be out by 2 mins – we could even skip setting the seconds. We write it in-place using a print_int stringifier. Some benchmarking would be needed to find the fastest datetime library.

// sets the time part (chrono, https://stackoverflow.com/questions/43963072/efficiently-generating-a-utc-timestamp)
// vulnerable to unpredictable slowdown sometimes
void FIXSendArbSingle::set_timestamps() {
	// chrono objects
	using namespace std::chrono;
	using days = duration<int, std::ratio<86400>>;
	auto now = time_point_cast<seconds>(system_clock::now());
	auto s = now - time_point_cast<days>(now);			// initialised to seconds passed today
	auto h = duration_cast<hours>(s);		s -= h;		// then hours subtracted
	auto M = duration_cast<minutes>(s);		s -= M;		// then minutes subtracted
	// print (format yyyymmdd-hh:MM:ss)
	auto out = &_buffer[I_SendingTime];
	print_int(&out[10], 2, h.count());
	print_int(&out[13], 2, M.count());
	print_int(&out[16], 2, s.count());
	// copy to TransactTime
	memcpy(&_buffer[I_TransactTime], &_buffer[I_SendingTime], S_SendingTime);
}

Stringifying Integers

We can write efficient algorithms with pointers, pasting digits right-to-left until we hit an = delimiter or run out of length. Correctness is nothing to worry about, cos we have unit-tests for print_int – both via the functions that use it, and directly, checking what it does if its input is too big. Price needs to respect a decimal point ., so:

void FIXSendArbSingle::set_live(Order const& order) {
	print_int(&_buffer[I_OrderQty + S_OrderQty - 1], order.quantity);
	print_int_dp(&_buffer[I_Price + S_Price - 1], order.price);
...
void FIXSendArbSingle::print_int_dp(char* end, int n) {
	while (n > 0 && *end != '=') {
		if (*end != '.') {
			*end-- = char(n % 10) + '0';
			n /= 10;
		} else {
			--end;
		}
	}
	while (*end != '=') {
		if (*end != '.') {
			*end-- = '0';
		}
		else {
			--end;
		}
	}
}

These loops are short (the fields are ≤7 chars), so they’d seemingly be better implemented using recursive templates, to remove the loops and conditionals.
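
For instance, a compile-time-unrolled digit printer might look like this (a speculative sketch – I never implemented it):

// unrolls the digit loop for a field of fixed width N; no loop, no '=' check
template <int N>
inline void print_int_fixed(char* end, int n) {
	*end = char(n % 10) + '0';					// write the rightmost digit
	print_int_fixed<N - 1>(end - 1, n / 10);	// recurse leftwards; fully inlined at compile-time
}
template <>
inline void print_int_fixed<1>(char* end, int n) {
	*end = char(n % 10) + '0';
}
// usage: print_int_fixed<S_OrderQty>(&_buffer[I_OrderQty + S_OrderQty - 1], order.quantity);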

Sequence Numbers

This is a bit harder: each FIX message we send out has to have a seq. number one greater than the previous, but the FIX engine manages this itself behind the scenes, sending all kinds of automatic messages (like heartbeats) as well as anything else we send through its normal interface. We can deal with this by exposing control of the seq. number, adding a couple of methods to our FixEngine class (which, in the actual code, would be the existing implementation of the FIX engine itself).

void FIXSendArbSingle::set_and_increment_seqnum() {								// (call within mutex)
	int seqnum = FE.get_seqnum();												// get next sequence number
	print_int(&_buffer[I_MsgSeqNum + S_MsgSeqNum - 1], seqnum);					// adjust buffer sequence number
	FE.set_seqnum(seqnum + 1);													// manually update sequence number
}
int FixEngine::get_seqnum() const		{ return _session->m_state.getNextSenderMsgSeqNum(); }
void FixEngine::set_seqnum(int seqnum)	{ _session->m_state.setNextSenderMsgSeqNum(seqnum); }

Though _session is private in the QuickFIX library, a quick forward-declared friend FixEngine in the QuickFIX Session header sorts it out without requiring recompilation.
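
The change amounts to something like this inside the library’s Session header (a sketch of the idea, not the verbatim QuickFIX source):

class FixEngine;						// forward declaration, added near the top of the header
namespace FIX {
class Session {
	friend class ::FixEngine;			// added line: FixEngine may now reach m_state, m_mutex, etc.
	// ... the rest of the original Session declaration, unchanged ...
};
}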

Concurrency

These seq. number increments are obviously not atomic, and we can’t assume anything about QuickFIX’s concurrency set-up for message sends, or how our own hacked sends would interfere. Going through the source code covering a normal send operation, it turns out that QuickFIX protects each connection it manages with a connection mutex; the seq. number and send are all guarded by it. So, we obtain direct access to the mutex, which is presented as an RAII-style guard, where construction/destruction = acquisition/release.

FIX::Locker FixEngine::lock_guard()		{ return FIX::Locker(_session->m_mutex); }

And so, our final sending routine looks something like this:

order.quantity = 900; order.price = 5500; order.side = '2';	memcpy(order.symbol, "92222222", 8);
FS.set_live(order);
FS.set_timestamps();
{
    FIX::Locker guard(FS.FE.lock_guard());
    FS.set_and_increment_seqnum();
    FS.set_checksum();
    success = FS.send();
}

The FIX checksum algorithm is naively implemented – iterate through each character, summing its ASCII byte-value, then take the result mod 256. Obviously, most of this could be pre-computed on the static parts of the FIX string, with the dynamic parts added on in tandem with being written to the buffer.
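
A naive version consistent with that description might be (a sketch – the real indices live in the CharIndex/CharSize enums):

void FIXSendArbSingle::set_checksum() {
	unsigned int sum = 0;
	for (int i = 0; i < L - 7; ++i) {			// every byte before the trailing "10=xxx" field (7 chars incl. delimiter)
		sum += (unsigned char)_buffer[i];
	}
	print_int(&_buffer[L - 2], 3, sum % 256);	// write the 3 checksum digits in place
}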

Socket Send

The send routine directly calls the BSD socket send function (which refers to WinSock or Unix Sockets), with the usual idiom of ramming data down the socket, incrementing the pointer based on number-of-bytes-sent (returned by send), and spinning until complete.

Unfortunately, getting the file descriptor of the socket required recompiling QuickFIX after adding a single virtual function declaration to a base class, to give the compiler access to the getSocket method that all the children already had in practice anyway. A really dumb reason to recompile, but a minimally-invasive change without side-effects.

int FixEngine::send_(char* data, int len) {
	char* ptr = data;												// the counter pointer through the buffer
	int socket_descriptor = _session->m_pResponder->getSocket();	// hacked in a virtual function, had to recompile quickfix (see quickfix-marlowa project)
	// printf("-SocketSend on %d: ", socket_descriptor);
	while (len > 0) {												// spin until all chars sent
		int res = send(socket_descriptor, ptr, len, 0);				// attempt send
		if (res >= 0) {												// res is the number of chars sent, or -1 if error
			ptr += res;
			len -= res;
			// printf("-%d", res);
		}
		else {
			// printf("\n-SocketError: %d\n", WSAGetLastError());		// winsock-specific. 10057 = "Socket is not connected."
			return WSAGetLastError();
		}
	}
	// printf("\n-SocketSend complete.\n");
	return 0;
}

And that’s our complete specialised-message no-allocation direct-send QuickFIX hack!

Crucial to an HFT algo’s speed is its concurrency set-up, so I thought about this a lot. When I joined the project, they were thinking about how to improve moving orderbook data from the market data class into the execution class, which ran on separate threads. The existing setup pushed copies of full mutable orderbooks, including hidden data like copies of product definitions and mutexes, into callbacks run on a threadpool, which updated shared memory that the executor live-spun to read.

They were looking for a lock-free replacement: considering encoding the price/quantity × buy/sell data – 4 integers – into one, by bitshifting lossily. I disliked this idea because:

  • There was encoding/decoding arithmetic overhead;
  • Introducing any kind of data loss added cognitive load and unpredictability in figuring out the behaviour of the algorithm consuming the data;
  • It’d be slow to test because of the lack of class interfaces and existing test coverage;
  • It wouldn’t be general enough to allow rapid change if the data being stored changed or the lossy compression was deemed inadequate.
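
For reference, the packing they had in mind was along these lines (my reconstruction, not their code):

#include <cstdint>

// squeeze price/quantity × buy/sell into one word by truncating each to 16 bits – lossy whenever a value needs more
uint64_t pack(uint16_t buy_price, uint16_t buy_qty, uint16_t sell_price, uint16_t sell_qty) {
	return (uint64_t)buy_price << 48 | (uint64_t)buy_qty << 32 | (uint64_t)sell_price << 16 | sell_qty;
}
uint16_t buy_price_of(uint64_t w) { return (uint16_t)(w >> 48); }	// decode arithmetic on every read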

RingBufferMulticast

So I proposed a multicast ring buffer – a lock-free, lossless implementation of a single shared variable of any type, with one writer and arbitrary readers.

// RingBufferMulticast: a ring-buffer based inter-thread broadcast system (1 writer, many readers, get latest data rather than consume all)
template <class item, int size> class RBM {
private:
	int _size = size;
	std::array<item, size> _buffer;
	int _counter = 0;
public:
	RBM() { write(item()); }
	void write(item data) {
		_buffer[_counter % _size] = data;
		++_counter;
	}
...
	item read_safe_block() {
		while (true) {
			int entry = _counter;								// writer's count before the copy
			item data = _buffer[(entry - 1) % _size];			// copy out the latest item
			int exit = _counter;								// writer's count after the copy
			if (exit - entry <= _size - 2) { return data; }		// retry if the writer could have lapped us mid-copy
		}
	}
	item read_unsafe() {
		return _buffer[(_counter - 1) % _size];					// no protection against a mid-copy overwrite
	}
};

No two ring buffers are designed alike: the synchronisation overhead can sit with the reader, the writer, or both; there can be one or many producers/consumers; the reader could require the latest or all of the data; or the sum total of readers could partition all the data.

In this case, the item was a 4-tuple of top-of-book price/quantity × buy/sell, for about 500 products. I didn’t initially know what the use-case was, but once we optimised the rest of the execution path, the 150µs overhead of reading 500 ring-buffers on each loop tick became clear. I realised it’d be much faster for the executor to replicate market data by processing a queue of changes – {market_index, new_price, new_quantity} – especially since it already had to keep a copy of all of it.

RingBufferQueue

So, I threw out the 500 RBMs in favour of one RBQ, RingBufferQueue:

template <class item, int size> class RBQ {
private:
	int _size = size;
	std::array<item, size> _buffer;
	int _w = 0;		// index of next write
	int _c = 0;		// first uncommitted index
	int _r = 0;		// index of next read
public:
	RBQ() {
		_buffer.fill(item());
	}

	// writes iff there is space
	bool write_non_block(item input) {
		bool success = _w - _r < size;
		if (success) {
			_buffer[_w  % size] = input;
			_w++;
		}
		return success;
	}

	// live spin write
	void write(item input) {
		while (!write_non_block(input));
	}

	// the latest data is made visible to the reader iff commit is called
	void commit() {
		_c = _w;
	}

	// use this to iterate through reads or peeks (which are otherwise unsafe to use)
	int available_items() {
		return _c - _r;
	}

	// read and mark as read
	item read() {
		assert(_r < _c);
		_r++;
		return _buffer[(_r - 1) % size];
	}

	// read without marking as read; must manually offset from read index
	item peek(int offset) {
		assert(_r + offset < _c);
		return _buffer[(_r + offset) % size];
	}

	// discard and mark all as read
	void drain() {
		volatile int rem = available_items();
		for (int i = 0; i < rem; ++i) {
			read();
		}
	}
};

This is an implementation of a 1-reader, 1-writer lock-free queue. The commit function makes all written data visible to the reader, which we thought might be useful to sync market data at one point. Two important usage notes:

  • the writer thread will hang live-spinning if there is no reader (this will crash an Excel VBA GUI 10 times out of 10);
  • any call to available_items must be assigned to a volatile int, because it returns the difference of two ints that are exclusively managed by separate threads, which the compiler will not spot and could optimise to a constant.

The reading idiom is:

volatile int R = rbq.available_items();
for (int r = 0; r < R; ++r) {
    item data = rbq.read();
    // use item
}

And with that, the 150µs reading overhead fell to 1µs. With a strategy as simple as this one, it seems like it’s optimally done as a fully-procedural single-threaded routine, which uses loops and branches to jump between reading market data from the sockets, and executing orders from it. As it is, we still have the overhead of passing data between logical CPU cores (perhaps they should be on one physical core, to share cache).

I think testing is the thing I struggle with the most in programming. Designing tests with good coverage challenges your creativity. I felt my poor unit-test designs were my biggest shortcoming in the Pokémon Go project, and since testing was irrelevant to Distributed Systems (where passing MIT’s existing tests was the goal), I sought chances to work on it here. In chronological order:

  • RingBufferQueue: an experiment with GoogleTest.
  • FixSend: a test-driven project.
  • exch: informal, maintainable, manual demo tests with which I developed the code.
  • Auxiliary Testing Tools: my logger and benchmarker from exch.

My projects up until this one had some manual tests, but they tended to be difficult to maintain and figure out in hindsight. For this one, I tried out Visual Studio’s Test Explorer with GoogleTest. Check the Ring Buffer section for more on this project.

TEST(A, RunOutOfSpace) {
	RBQ<item, 64> rbq;

	// first 64 writes should pass; nothing should be readable
	for (int i = 0; i < 64; ++i) {
		ASSERT_TRUE(rbq.write_non_block({ i, i * 1000, i * 2000 }));
	}
	ASSERT_EQ(rbq.available_items(), 0);
	// next write should fail; nothing should be readable
	int i = 70;
	ASSERT_FALSE(rbq.write_non_block({ i, i * 1000, i * 2000 }));
	ASSERT_EQ(rbq.available_items(), 0);
	// until a commit makes everything readable
	rbq.commit();
	ASSERT_EQ(rbq.available_items(), 64);
	ASSERT_FALSE(rbq.write_non_block({ i, i * 1000, i * 2000 }));
	// do the reads
	rbq.drain();
	ASSERT_EQ(rbq.available_items(), 0);
	// and now we can write again
	ASSERT_TRUE(rbq.write_non_block({ i, i * 1000, i * 2000 }));
}

I wrote these kinds of procedural unit tests. I found it hard to make them short and focused, and given how short the code is – and how unlikely it is to expand – the testing has limited value beyond the initial development, but it was at least a good exercise.

Conversely, this project was really easy to unit-test, and testing was important given the use of pointer arithmetic to manually edit substrings in a character buffer. In fact, it’s the only project I’ve done test-driven (write the interface in the header, then the test, then the implementation), since it came naturally. I used VS again, but with CMake, which is of course incompatible with Test Explorer… I used a setup from Stack Overflow to build GTest as a sub-project with minimal syntax, but on Linux, I could probably install and link it as an external library just like everything else.

Initialisation constructs the buffer by concatenating lots of string literals. A single change in length could break the whole thing. Hence, some nice unit-tests are:

  • Check that the FIX BodyLength field is legit (by definition, it should equal buffer.find("10=", 0) - buffer.find("35=", 0));
  • Check that appending a single character would overflow the buffer (we use strcat_s, which is overflow-safe, and the overhead of this safety is paid at initialisation, so unimportant). This test accesses a private testee variable, so the testee requires a friend class declaration for the tester.

// adding just one extra character should cause crash with strcat_s
TEST(FIXSendArbSingleTest, InitialFIXMessageOverflow) {
	FIXSendArbSingle FS(false);
	ASSERT_DEATH(strcat_s(FS._buffer, FS.L + 1, " "), "");
}

Each field-setting method gets its own unit-test. E.g. set_timestamps(): we can test it by generating the timestamp via a different method, and also check that the delimiters remain in the right place after the edit.

TEST(FIXSendArbSingleTest, SetTimestampLive) {
	const int lenD = 8;
	const int lenT = 12;
	const std::string date_now("20190301");

	// calculate time_now another way (ctime, http://www.cplusplus.com/forum/beginner/226899/)
	time_t tt;
	time(&tt);
	tm TM = *localtime(&tt);
	char time_now_c[13] = "00:00:00.000";
	int h = TM.tm_hour;	FIXSendArbSingle::print_int(&time_now_c[1], 2, h);
	int m = TM.tm_min;	FIXSendArbSingle::print_int(&time_now_c[4], 2, m);
	int s = TM.tm_sec;	FIXSendArbSingle::print_int(&time_now_c[7], 2, s);
	const std::string time_now(time_now_c);

	FIXSendArbSingle FS(false);
	FS.set_timestamps_initial();
	auto start = std::chrono::high_resolution_clock::now();
	FS.set_timestamps();
	auto end = std::chrono::high_resolution_clock::now();
	for (int index : {I_SendingTime, I_TransactTime}) {
		std::string timestamp = FS.get_buffer(index, S_SendingTime + 1);
		ASSERT_EQ(timestamp[8], '-'); ASSERT_EQ(timestamp[11], ':'); ASSERT_EQ(timestamp[14], ':'); ASSERT_EQ(timestamp[17], '.');
		ASSERT_EQ(timestamp.substr(0, lenD), date_now);
		ASSERT_EQ(timestamp.substr(lenD + 1, lenT), time_now);
		ASSERT_EQ(timestamp[S_SendingTime], '\x1');
	}
	auto benchmark = std::chrono::duration<double, std::micro>(end - start).count(); EXPECT_LE(benchmark, 2);	// sub-2 micro
}

The integer stringifiers are tested as part of the intended behaviour of every field method that uses them, but we can also check that they behave correctly with an unintended input.

// what happens when we try to print a number that is too big
TEST(FIXSendArbSingleTest, PrintIntOverflow) {
	char buffer[] = "8=HI" "\x1" "9=00000" "\x1" "10=0000.000" "\x1" "11=000" "\x1" "12=BI";
	FIXSendArbSingle::print_int(&buffer[11], 456789);
	ASSERT_EQ(std::string(&buffer[7], 6), "56789" "\x1");
	FIXSendArbSingle::print_int_dp(&buffer[23], 43210550);
	ASSERT_EQ(std::string(&buffer[16], 9), "3210.550" "\x1");
	FIXSendArbSingle::print_int(&buffer[30], 3, 2003);
	ASSERT_EQ(std::string(&buffer[0]), "8=HI" "\x1" "9=56789" "\x1" "10=3210.550" "\x1" "11=003" "\x1" "12=BI");
	FIXSendArbSingle::print_int(&buffer[29], 4, 6789);	// should corrupt
	ASSERT_EQ(std::string(&buffer[0]), "8=HI" "\x1" "9=56789" "\x1" "10=3210.550" "\x1" "167893" "\x1" "12=BI");
}

Finally, we can run a simulation of the send procedure (seq. number set, mutex guard and socket send), relying on the predictable automatic behaviour of QuickFIX. We can check our understanding of how the mutex guard works.

TEST(FIXSendArbSingleTest, Send) {
	FIXSendArbSingle FS(true);
	FS.set_timestamps_initial();
	double benchmark;
	std::this_thread::sleep_for(std::chrono::seconds(7));
	{
		FIX::Locker guard(FS.FE.lock_guard());							// named guard: holds the connection mutex for this scope
		ASSERT_EQ(FS.FE.get_seqnum(), 3);								// we expect seqnum to be 3 (after logon message and 1 heartbeat)
		FS.set_timestamps();
		FS.set_and_increment_seqnum();
		ASSERT_EQ(FS.get_buffer(I_MsgSeqNum, S_MsgSeqNum), "00003");
		FS.set_checksum();
		auto start_send = std::chrono::high_resolution_clock::now();
		ASSERT_EQ(FS.send(), 0);
		auto end_send = std::chrono::high_resolution_clock::now();
		benchmark = std::chrono::duration<double, std::micro>(end_send - start_send).count();
		ASSERT_EQ(FS.FE.get_seqnum(), 4);
	}
	std::this_thread::sleep_for(std::chrono::seconds(6));				// another heartbeat happens (moving this into mutex guard scope would disable it)
	ASSERT_EQ(FS.FE.get_seqnum(), 5);
	EXPECT_LE(benchmark, 80);											// sub-80 micro
}

For this project, I decided not to set up unit tests: I was aware the project could end at any time and, having already sunk a huge cost into finding a WebSocket library, a logger, and a nice CMake setup, I wanted to build some functionality instead. Fortunately, the CMake setup made it really easy to maintain test scripts that I’d written organically while developing the relevant class (see the CMake section). No commenting-out or lost work; just name the scripts demo.cpp or test-*.cpp and switch them in/out of the build process. These can easily evolve into unit-tests as the project matures.

E.g., the culmination of the project was a simple client that ties together the WebSocket and JSON encoder, to place and cancel some orders. Its behaviour can be manually verified using the API website and logs of the raw data sent/received.

int main() {
    Logger::Init();
    Bitypto D;
    ProtoOrder O; O.quantity = 10;
    sleep(1); D.auth();
    for (int i = 0; i < 3; ++i) {
        sleep(1); O.price = 3000; LOGG("b0", DEBUG, "buy: %.3f", BENCH_STATIC());         D.place(O);           O.quantity = -O.quantity;
        sleep(1);                 LOGG("b0", DEBUG, "cancel-buy: %.3f", BENCH_STATIC());  D.cancel(D.order_id);
        sleep(1); O.price = 5000; LOGG("b0", DEBUG, "sell: %.3f", BENCH_STATIC());        D.place(O);           O.quantity = -O.quantity;
        sleep(1);                 LOGG("b0", DEBUG, "cancel-sell: %.3f", BENCH_STATIC()); D.cancel(D.order_id);
    }
    sleep(2); LOGG("b0", DEBUG, "buy: %.3f", BENCH_STATIC());       D.place(O);
    sleep(1); LOGG("b0", DEBUG, "buy: %.3f", BENCH_STATIC());       D.place(O);
    sleep(1); LOGG("b0", DEBUG, "cancelall: %.3f", BENCH_STATIC()); D.cancel_all();
    sleep(2);   // this is just to let the callbacks come through; comment this out to see the last callback getting cancelled
}

You need tests to develop big projects class-by-class with confidence, but it’s also cool to have informal quality-of-life testing tools to inform the direction of your code as you write it. Particularly for performant code, it’s good to have a quick benchmarker; this is mine from exch. It can wrap a function call, give static times (for when start and end are in different functions or even threads, like measuring round-trips with callbacks), and flush the CPU data cache to simulate a fresh call of a function, rather than a repeated one that’s sped up by caching (s/o to Stack Overflow for this code).

// quick disposable benchmarker (use each handle only once in a given scope; can use empty handle)
// example:
//    BENCH_START(hi);
//    [stuff]
//    BENCH_END(hi);
//    printf("%.3f", BENCH_OUT(hi));
#define BENCH_START(handle) auto bench_start_##handle = std::chrono::high_resolution_clock::now()
#define BENCH_END(handle)   auto   bench_end_##handle = std::chrono::high_resolution_clock::now()
#define BENCH_OUT(handle)   std::chrono::duration<double, std::micro>(bench_end_##handle - bench_start_##handle).count()

// epoch time in microseconds as float
#define BENCH_STATIC()      ((double)std::chrono::high_resolution_clock::now().time_since_epoch().count())/1000

// flush the CPU data cache to simulate fresh benchmarking setup
namespace CacheFlush {
    const size_t flush_size = 10 * 1024 * 1024;
    long *flush_array = new long[flush_size];
}
#define BENCH_FLUSH()       for (size_t i = 0; i < CacheFlush::flush_size; i++) { CacheFlush::flush_array[i] = rand(); }

Also, my coloured console logger (built on the highly-flexible g3log library) made it easy to analyse synchronous code from different classes by colouring their log lines – benchmark times were in blue, raw JSON handled by the WebSocket was in red, etc. I had hoped to be able to use it without worrying about performance impact (it’s an asynchronous logger) – logging overhead was a big problem in HFIR – but it still tended to incur about 40µs at the logging call-site, and I didn’t get to investigate how to do better. The API was nice, with settings like:

// declaration of 1st and 2nd chars identifying files and different console colours
// unrecognised IDs passed in LOGG call will get ignored by all file loggers and be printed with default settings in console
static std::unordered_map<char, std::unordered_map<char, ColourCode>> LoggerTable {
    {'j', {
        {'0', MAGENTA},
    }},
    {'b', {
        {'0', CYAN}
    }},
...

and calls like LOGG("b0", DEBUG, "buy: %.3f", BENCH_STATIC());.

Here are a few misc things I’ve learnt to look out for while optimising existing code. Some of them are just my intuition, and I might make things worse by carrying them out, but I’m slowly learning the details of C++ and how to test changes, hoping to get it more right in the future.

Classes: Separation of Concerns

A subtlety with designing classes is that sometimes, your abstraction of a thing gets too broad because you didn’t consider the separate needs different parts of your program have for it.

Example #1: mutability. I have an orderbook that replicates market data and syncs updates – it is mutable. The execution part only needs to receive a feed of updates – a lightweight immutable struct of like 3 numbers. If you conflate the two, you can end up passing copies of your entire mutable object – vectors, maps, mutexes and all – into callbacks.
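
Concretely, the feed item can be as small as this (a sketch – cf. the {market_index, new_price, new_quantity} item from the RingBufferQueue section):

#include <type_traits>

struct BookUpdate {
    int    market_index;
    double new_price;
    int    new_quantity;
};
static_assert(std::is_trivially_copyable<BookUpdate>::value, "cheap to copy into a callback - no hidden allocations");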

Example #2: messages in vs out. To place an order, I can use an Order object: set its price, quantity… I send it to the exchange, which will do something – place an order at a different price, round something, reject it. The in-order is our intention, whereas the out-order is reality. Conflating the two literally caused:

  • infinite retries placing an order at a price that kept getting rounded and so never matched;
  • market orders to get detected as limit because they came back from the API with prices;
  • order IDs to inflate really fast, because an intended order was produced every 5s and then cancelled, as it was already up on the exchange.

So, with the benefit of hindsight, I implemented ProtoOrder and Order to give the programmer more predictable, tightly-scoped objects to work with.

APIs: Object Reuse

I made the biggest timesaves by looking closely at how external libraries were being used, and at which objects could be reused. Object creation, as a rule, involves a lot of allocations, because libraries (e.g. linear programming, FIX engines) tend not to be designed for performance, so they can afford to initialise everything they might want for each instance rather than compromising to save time.

Dynamic Allocations

Allocations of dynamic data types like std::vector and std::string (on the heap) are very slow. You can usually anticipate your length requirements, in which case allocating once on init and clearing between uses eliminates this cost – though it still doesn’t remove the indirection of accessing the data, compared to stack allocation. Ideally, go all the way to the stack, but be wary of writing unmanageable char buffer–based code :P.
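
The pattern is the same one _sig_raw follows in the JSON encoder above – reserve once, clear between uses (an illustrative sketch):

#include <string>

class Encoder {
    std::string _scratch;                       // heap buffer, but allocated only once
public:
    Encoder() { _scratch.reserve(1024); }       // pay the allocation at init, off the critical path
    const std::string& run(const char* payload) {
        _scratch.clear();                       // length -> 0; capacity kept, so no reallocation
        _scratch += "{\"data\":\"";
        _scratch += payload;
        _scratch += "\"}";
        return _scratch;
    }
};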

Hidden Allocations

Following on from separation of concerns: if you copy a hefty data type that badly abstracts what you want to copy, you could end up copying a ton of heap-allocated structures – vectors, strings, maps – implicitly and obliviously. I try to make classes as lightweight and stack-based as possible (with RAII, you can see the exact data you would be reallocating), and if I fail, I delete the copy constructor xD.
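
Deleting the copy constructor turns an oblivious copy into a compile error (sketch):

#include <vector>

class OrderBook {
    std::vector<double> _levels;                // heap-backed: copying would be a hidden allocation
public:
    OrderBook() = default;
    OrderBook(const OrderBook&) = delete;       // accidental pass-by-value now fails to compile
    OrderBook& operator=(const OrderBook&) = delete;
};

void process(OrderBook book);                   // calling this is now a compile error
void process_ref(const OrderBook& book);        // fine: no copy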

Indexing and AoS/SoA

Accessing an array and a map are both constant-time, but the map requires an extra hash + heap indirection, so I try to go for arrays. To store data on many financial products, I assigned them a global integer index, like in a database, which doubles as a key into an array. But here lies a big subtlety: should you store this data as an array of structs (AoS) or a struct of arrays (SoA)? Compare storing a matrix by flattening its rows vs its columns.

With the former, the structs make up a Product object, so it’s idiomatic OOP. But the latter lets you pick out the properties you’re interested in, which are stored contiguously across all entries, so it can be more CPU cache- and prefetch-friendly if you loop through every product looking at a specific property or two. But when you start looping through subsets, and your algorithm accesses the data in different ways during one iteration, it gets hard to choose the best option.
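
Concretely, for the ~500-product case (a sketch; the field names are illustrative):

// AoS: idiomatic OOP – each product's fields sit together in memory
struct Product { double bid; double ask; int qty; };
Product book_aos[500];

// SoA: each property is contiguous across all products
struct Book {
    double bid[500];
    double ask[500];
    int    qty[500];
};
Book book_soa;

// scanning one property touches 500 contiguous doubles in SoA,
// vs one strided load per ~24-byte struct in AoS
double best_bid(const Book& b) {
    double best = 0;
    for (int i = 0; i < 500; ++i) { if (b.bid[i] > best) { best = b.bid[i]; } }
    return best;
}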

As far as I know, C++ does not optimise AoS/SoA, but I noticed that recent experimental languages have been trying.


Other Things I Did at my Previous Job