Affordable Computing Architecture for Deep CNNs (2)

In the last article, there were two teaser keywords which probably didn't make sense to anyone who hasn't worked with us: bucket and sentinel.

These are the two key building blocks of ShuqStyle, although the same technology could serve as a foundation for any future product that uses deep CNNs.

This article covers the sentinel part of our stack.

The sentinel actually consists of multiple components: the sentinel itself, the broker, and the client. A sentinel instance runs multiple sentinel plug-ins, which in the case of ShuqStyle are neural networks. A sentinel plug-in is essentially a pure function: it takes an input and returns the result of the computation.

We won't be dealing with the broker in this article, because there isn't much magic there.

Not this sentinel.

Contrary to common belief, Sentinel was not named after this guy. (Source: Marvel vs Capcom Wiki)

Sentinels are wrappers for heterogeneous compute functions: some can be CPU only, some can be CUDA/cuDNN, and so forth. Framework-wise, we have placeholder support for OpenCL and Xeon Phi, but few if any libraries use them, so that support hasn't been finished.

Sentinels have built-in resource management that maps to each compute device. Each plug-in has to report the minimum and maximum memory it needs, and the sentinel will try to find a fitting combination across multiple compute devices. (The memory can be either RAM or VRAM, although our production plug-ins only track VRAM.)

When a sentinel starts, it tries to get as much as it can out of whatever compute devices it can access, and gets as many instances of the available plug-ins up and running as possible.
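To make the idea of fitting plug-in instances onto devices more concrete, here is a minimal sketch of one possible greedy strategy. It is not our production allocator: the names `PluginSpec`, `Device`, and `place_instances` are hypothetical, and the policy (pick the device with the most free memory, grant up to the maximum, accept anything at or above the minimum) is just an assumption chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PluginSpec:
    """Hypothetical description of what a plug-in reports to the sentinel."""
    name: str
    min_mb: int  # least memory an instance can run with
    max_mb: int  # most memory an instance would like to have


@dataclass
class Device:
    """Hypothetical compute device with a simple free-memory counter."""
    name: str
    free_mb: int
    instances: list = field(default_factory=list)  # (plugin name, granted MB)


def place_instances(devices, plugins):
    """Greedily start plug-in instances until nothing else fits.

    Each round tries to start one more instance of every plug-in on the
    device that currently has the most free memory. An instance is granted
    up to max_mb, and is only started if at least min_mb is available.
    """
    placed_any = True
    while placed_any:
        placed_any = False
        for plugin in plugins:
            device = max(devices, key=lambda d: d.free_mb)
            grant = min(plugin.max_mb, device.free_mb)
            if grant >= plugin.min_mb:
                device.free_mb -= grant
                device.instances.append((plugin.name, grant))
                placed_any = True
    return devices


if __name__ == "__main__":
    gpus = [Device("gpu0", 4000), Device("gpu1", 4000)]
    specs = [PluginSpec("detector", 1000, 1200),
             PluginSpec("extractor", 800, 1000)]
    for gpu in place_instances(gpus, specs):
        print(gpu.name, gpu.free_mb, gpu.instances)
```

A real allocator would also have to deal with memory that other processes already hold and with instances that fail to start, which this sketch ignores.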

Here is the VRAM status taken from one of our production nodes. In this screenshot, it's running 3 convolutional neural networks on each GPU.

VRAM usage on a production node

A sentinel plug-in is effectively the same as f in this figure. Each sentinel plug-in is a different function, and is an isolated instance of a neural network.

Function example

In ShuqStyle, plug-ins use either Caffe or Keras depending on the function. The choice of multiple frameworks may sound strange, but it comes at a minimal cost: dependency setup for production servers is mostly automated, and whoever made the choice to use framework X over Y simply needs to wrap it according to the pre-defined sentinel plug-in interface.

This isolates the framework dependency to a single component, so the rest of the team doesn't have to care about how to install CUDA and whatever framework is needed to get the software running. If the next plug-in is going to be built on a slightly more obscure framework like Chainer, nobody needs to care apart from the plug-in author.

Each plug-in takes input in a pre-defined protocol (in the form of a serialized object), and returns its output. It's a completely stateless pure function, and for those reasons we can start and shut down as many sentinels as we need.
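As a rough illustration of what "serialized object in, serialized object out" can look like, here is a minimal sketch of a plug-in interface. Everything in it is an assumption made for the example: the class names `SentinelPlugin` and `MeanPixelPlugin`, the `handle` and `compute` methods, and the use of JSON as the wire format are hypothetical, and the toy plug-in stands in for a real neural network.

```python
import json
from abc import ABC, abstractmethod


class SentinelPlugin(ABC):
    """Hypothetical plug-in interface: serialized bytes in, serialized bytes out."""

    name = "base"

    @abstractmethod
    def compute(self, payload: dict) -> dict:
        """The pure function itself. It must not depend on earlier calls."""

    def handle(self, request: bytes) -> bytes:
        # Deserialize the request, run the pure function, serialize the result.
        payload = json.loads(request.decode("utf-8"))
        result = self.compute(payload)
        return json.dumps(result).encode("utf-8")


class MeanPixelPlugin(SentinelPlugin):
    """Toy stand-in for a network: returns the mean of a list of pixel values."""

    name = "mean_pixel"

    def compute(self, payload: dict) -> dict:
        pixels = payload["pixels"]
        return {"mean": sum(pixels) / len(pixels)}


if __name__ == "__main__":
    plugin = MeanPixelPlugin()
    request = json.dumps({"pixels": [0, 10, 20]}).encode("utf-8")
    print(plugin.handle(request))
```

Because `compute` keeps no state between calls, any instance on any machine can answer any request, which is what makes starting and stopping instances freely possible.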

If we really urgently need computing power, all of our desktops can just start a sentinel and join the compute grid.

The users of the sentinels are buckets, importers, or simply average Joe developers like me. Users go through sentinel clients, mostly without knowing it, since a call usually looks like an ordinary function call. ShuqStyle's main component transparently treats sentinel calls as native functions, so they can be called from whatever module needs them, from a batch job, or from a REPL. To the users, it's just a native class function bound to the bucket.

This allows developers using machines without GPUs (or with GPUs that are either incompatible or not powerful enough) to work on the rest of our stack without even knowing what is actually happening in these function calls, and to debug the actual search engine (which is inside the bucket) locally, as long as they have network connectivity.

As a simple example (this isn't the exact code we are shipping, just an example to make it easier to understand what it looks like):

bucket = SearchableBucket('customer')
detected = bucket.detect('foo')
bucket.set_rois(detected)
feature = bucket.extract_feature(detected.dresses[0])
results = bucket.search(feature)

In the simple example above, two lines transparently get computed by a sentinel plug-in. As a user, you don't have to care which ones those are.
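To give a feel for how a remote call can hide behind an ordinary method, here is a minimal sketch. It is not the ShuqStyle implementation: `SentinelClient`, `Bucket`, `fake_transport`, the plug-in name `"detector"`, and the JSON encoding are all hypothetical, and the transport is an in-process stand-in for whatever would really carry the request over the network.

```python
import json


class SentinelClient:
    """Hypothetical client that forwards a call to a named plug-in."""

    def __init__(self, transport):
        # transport: callable (plugin_name, request_bytes) -> response_bytes.
        # In a real setup this would talk to the compute grid over the network.
        self._transport = transport

    def call(self, plugin_name, **kwargs):
        request = json.dumps(kwargs).encode("utf-8")
        response = self._transport(plugin_name, request)
        return json.loads(response.decode("utf-8"))


class Bucket:
    """Hypothetical bucket whose method looks local but runs remotely."""

    def __init__(self, name, client):
        self.name = name
        self._client = client

    def detect(self, image_id):
        # To the caller this is just a method; the work happens elsewhere.
        return self._client.call("detector", image_id=image_id)


def fake_transport(plugin_name, request):
    """In-process stand-in for the network: echoes the request back."""
    payload = json.loads(request.decode("utf-8"))
    return json.dumps({"plugin": plugin_name, "echo": payload}).encode("utf-8")


if __name__ == "__main__":
    bucket = Bucket("customer", SentinelClient(fake_transport))
    print(bucket.detect("foo"))
```

Injecting the transport also makes it easy to swap in a fake like this one for tests, so code that calls the bucket can be exercised without any GPU or network at all.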

It is also a cost saver, since we don't have to buy everyone a GeForce, which becomes a problem when you also have people with Macs on your team.

This even works from other countries, which I personally think is pretty cool. (Accessing a GPU compute grid's neural network instance as a function call in IPython, while sitting at a Starbucks in another country, is an interesting experience. Trust me.)

In the next article, we will cover the bucket component of ShuqStyle.