Fast de/serialization with numerical arrays


Introduction

This post demonstrates two things. First, it shows how slow string parsing is in Python and why more efficient serialization patterns are needed for packing and unpacking an array of numerical values. Second, it demonstrates two methods: one that works within a “contained” Python-specific ecosystem, and one that is more flexible and agnostic to both language and package.

The impetus for this evaluation was the need to serialize an array of numerical values to be passed through a Redis cache and read by a subsequent Python microservice.

Python array serialization and unpacking

Imagine a condition where you need to pass an array of values from one service to another. For example, you need to xadd some value scores, represented as an array like [1.1, 2.4, 12.2 ... 2.3], to a Redis cache. Attempting to write a list directly will fail: redis.exceptions.DataError: Invalid input of type: 'list'. Convert to a bytes, string, int or float first. Instead, it will be necessary to serialize the array before writing it and deserialize it on the consuming service’s end.
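
As a minimal sketch of the failure and the workaround (assuming a local Redis instance and the redis-py client), the list has to be converted to a string or bytes value before it can be written:

import redis

r = redis.Redis()  # assumes a local Redis instance

scores = [1.1, 2.4, 12.2, 2.3]

# Writing the raw list raises DataError:
# r.xadd("scores", {"values": scores})

# Writing a serialized representation (string or bytes) succeeds:
r.xadd("scores", {"values": ",".join(map(str, scores))})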

We can first assume some array a defined as:

a = [round(random.random() * 100, 2) for _ in range(10)]

A naive serialization can be to just convert this list to a string. For example: ",".join(map(str, a)).
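
The matching deserialization is simply the reverse: split on the delimiter and cast each piece back to a float.

s = ",".join(map(str, a))                  # serialize: "12.34,56.78,..."
parsed = list(map(float, s.split(",")))    # deserialize back to a list of floats
assert parsed == a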

However, Python is quite slow at parsing strings and converting them back to numerical types. We can write a small test to explore the time cost of this conversion:

import random
import time

import numpy as np

# 10,000 arrays of 100 floats each, serialized as comma-separated strings
data = [[round(random.random() * 100, 2) for _ in range(100)] for _ in range(10_000)]
data_serialized = [",".join(map(str, a)) for a in data]

times = []
for i in range(100):
    start = time.time()
    # parse every serialized array back into a list of floats
    res = [list(map(float, n.split(","))) for n in data_serialized]
    stop = time.time()
    times.append(round(stop - start, 3))

# mean, 50th percentile, and 90th percentile of the run times (seconds)
np.mean(times).round(3), np.percentile(times, 50).round(3), np.percentile(times, 90).round(3)
# (0.333, 0.302, 0.303)

This demonstrates that it takes about 0.3 seconds to parse the set of arrays we’ve generated. However, when dealing with hundreds of thousands of these arrays, this performance becomes problematic within a larger pipeline that expects all of these values to be ingested “quickly” (e.g. the entire half million arrays processed within a second or two).

Python array serialization optimization with numpy

numpy allows for an out-of-the-box optimization to address this by allowing arrays of numbers to be serialized with an easy to/from pattern. A given array can be serialized with the tobytes() method: np.array([1.1,2.2,3.3]).tobytes(). It can then be read back with frombuffer such that:

import numpy as np

raw = [1.1, 2.2, 3.3]
a = np.array(raw).tobytes()      # serialize to raw float64 bytes
parsed = np.frombuffer(a)        # deserialize (dtype defaults to float64)
assert parsed.tolist() == raw    # true
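
One caveat: frombuffer has to be told the dtype of the original buffer if it is not the default float64. np.array built from Python floats produces float64, so the defaults line up above, but if the producer packs 32-bit floats the dtype must be passed explicitly. A small sketch:

import numpy as np

raw = [1.1, 2.2, 3.3]
a32 = np.array(raw, dtype=np.float32).tobytes()    # 4 bytes per value
parsed32 = np.frombuffer(a32, dtype=np.float32)    # must match the producer's dtype
assert np.allclose(parsed32, raw)                  # equal up to float32 precision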

We can then redesign our test from earlier but replace the serialization and deserialization pattern with the numpy library’s utilities:

import random
import time

import numpy as np

# the same 10,000 arrays of 100 floats, now serialized with tobytes()
data = [[round(random.random() * 100, 2) for _ in range(100)] for _ in range(10_000)]
data_serialized = [np.array(a).tobytes() for a in data]

times = []
for i in range(100):
    start = time.time()
    # deserialize every byte string back into a float64 array
    res = [np.frombuffer(a) for a in data_serialized]
    stop = time.time()
    times.append(round(stop - start, 3))

# mean, 50th percentile, and 90th percentile of the run times (seconds)
np.mean(times).round(3), np.percentile(times, 50).round(3), np.percentile(times, 90).round(3)
# (0.009, 0.008, 0.008)

We can see major improvements: we have gone from about 0.3 seconds per 10k arrays to 0.008 seconds, or 8 ms. This achieves about a 35x speedup.

However, this method has a downside: both the service producing the binary and the service consuming it need to have Python and numpy available. What if the producing service is instead a NodeJS service, for example?

Base64 encoded binaries

Imagine a condition where the service writing new entries to the Redis cache is a NodeJS service, and a Python service then reads from the cache. In order to address the deserialization performance issues while avoiding reliance on convenience utilities like numpy’s tobytes that tie us to that ecosystem, we can create the binaries ourselves and write the encoded binaries to Redis instead.

We can utilize NodeJS’s built-in Buffer class to create a binary representation of the list, and then convert it to a string using base64 encoding. Redis can store the resulting string as a value associated with a key, which can then be retrieved in Python and deserialized back into a list of floats using the base64 and struct modules.

function listAsBinary(arrNums, size) {
    // allocate `size` bytes per number (4 for 32-bit floats)
    const b = Buffer.alloc(arrNums.length * size);

    for (let i = 0; i < arrNums.length; i++) {
        // write each number as a little-endian 32-bit float at its slot's offset
        b.writeFloatLE(arrNums[i], i * size);
    }

    // encode the raw bytes as a base64 string, e.g. listAsBinary([1.1, 2.4, 12.2], 4)
    return b.toString('base64');
}

Once written to Redis, the Python step can, instead of using numpy.frombuffer, deserialize the list using Python’s struct library:

import base64
import struct
from typing import Tuple

def _deserialize_arr(str_arr: str, size: int) -> Tuple[float, ...]:
    decoded = base64.b64decode(str_arr)      # base64 string -> raw bytes
    l = len(decoded) // size                 # number of floats (size = bytes per float, 4 here)
    return struct.unpack(f"<{l}f", decoded)  # little-endian 32-bit floats, matching writeFloatLE

First, the base64-encoded string is decoded into raw bytes. Next, the values in each slot are read via struct’s unpack method to produce an array of the numerical values associated with each entry.
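
As a quick sanity check (a sketch that uses a hypothetical Python-side stand-in for the NodeJS producer, not part of the original pipeline), the same encoding can be produced with struct.pack and round-tripped through _deserialize_arr:

import base64
import struct

def _serialize_arr(arr) -> str:
    # pack as little-endian 32-bit floats (mirroring writeFloatLE), then base64-encode
    return base64.b64encode(struct.pack(f"<{len(arr)}f", *arr)).decode("ascii")

encoded = _serialize_arr([1.1, 2.4, 12.2])
print(_deserialize_arr(encoded, size=4))
# roughly (1.1, 2.4, 12.2) -- small float32 rounding differences apply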

Performance results should be broadly in line with numpy (with numpy often performing slightly better due to its internal optimizations). However, the order of magnitude of completion time is the same, and this approach addresses the severe performance cost of string conversion.
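
To verify this on a given machine, the earlier timing harness can be reused with the struct-based deserializer (a sketch; absolute timings will vary by hardware):

import base64
import random
import struct
import time

import numpy as np

# the same 10,000 arrays of 100 floats, base64-encoded as little-endian float32 binaries
data = [[round(random.random() * 100, 2) for _ in range(100)] for _ in range(10_000)]
data_serialized = [
    base64.b64encode(struct.pack(f"<{len(a)}f", *a)).decode("ascii") for a in data
]

times = []
for _ in range(100):
    start = time.time()
    res = [_deserialize_arr(s, size=4) for s in data_serialized]
    times.append(round(time.time() - start, 3))

print(np.mean(times).round(3), np.percentile(times, 50).round(3), np.percentile(times, 90).round(3))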