NoSQL datastore for IoT

I recently answered “Which NoSQL DB is more advantageous for IoT data?” on Quora.

“Let me answer the question from two different aspects: 1. Design and 2. Choosing

1. Design:

What kind of data store and upstream system is warranted to back a large IoT (or even an Industrial IoT) implementation? The answer depends on two key aspects: time-series events and write-heaviness.

Let’s dig deeper.

A. Use-case: Imagine your favourite sports wearable, say a Fitbit, the simple variant that counts the number of steps you walk, a.k.a. a pedometer. Wearables use a sensor-fusion technique on MEMS sensors (accelerometers, gyroscopes, etc.) to detect your motion, infer that you are walking, and record it.

Don’t worry about the internals. What matters is that it collects your step counts and sends them to the cloud once every ‘x’ seconds. Say you took 10 steps in those ‘x’ seconds: you would have 10 time-ordered steps, or step-events in IoT speak. The wearable sends them all in one shot. Now imagine the number of Fitbit users sending “step-events” to the cloud every ‘x’ seconds. Also, step counts are just one event type; factor in all the other events that wearables send. You get the idea: the data coming into a typical IoT system is time-ordered, a.k.a. time-series data, and it arrives at an overwhelming velocity and volume. There may even be occasional spikes.

You need a system that can handle that kind of volume and velocity of events and eventually store the same in the datastore.
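To make the “step-events sent in one shot” idea concrete, here is a minimal sketch of what such a batch might look like. The field names (device_id, ts_ms, event_type) and the JSON layout are assumptions made for illustration, not Fitbit's actual wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StepEvent:
    device_id: str              # hypothetical identifier for the wearable
    ts_ms: int                  # event time, milliseconds since the epoch
    event_type: str = "step"    # steps are just one of many event types


def build_batch(device_id, step_times_ms):
    """Package the steps detected in one upload window, ordered by time."""
    events = [StepEvent(device_id, ts) for ts in sorted(step_times_ms)]
    return json.dumps({
        "device_id": device_id,
        "events": [asdict(e) for e in events],
    })


# Three steps detected during one 'x'-second window, uploaded together.
payload = build_batch(
    "tracker-42", [1700000003000, 1700000001000, 1700000002000]
)
decoded = json.loads(payload)
```

Multiply one such payload by every active device and every event type, and you get the write volume the rest of this answer is concerned with.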

B. Reasoning about your upstream system: You need a system that can take all those real-time events without getting overwhelmed and store them in a data store in time order. This is where event firehoses like Kafka and AWS Kinesis, and a host of streaming systems like Storm, Spark Streaming, Flink, and Kafka Streams, come into the picture. I am intentionally glossing over this part so we can focus on the data store.

C. Reasoning about the data storage: Say you have built a kick-ass upstream system. Now what? After all, you have to store the events, and if you don’t do a good job here, your data store can become a bottleneck.

Simple! What you need is a datastore that supports blind writes, like fast appends to the end of a file. What does that mean? In a data store whose storage engine uses a B-tree as the underlying data structure (think MySQL, or MongoDB with the MMAPv1 storage engine), you might think you are writing a single record, but internally each write needs a few reads before the actual write (recall the logic behind inserting a node into a B-tree: it has to read nodes to decide whether to go left or right, and so on). So ONE logical write done by your application internally needs physical reads before the physical write happens. That is undesirable and slow in the IoT case. A write is a blind write if it doesn’t imply a read prior to the write. In time-complexity terms, we want writes that are O(1), whereas B-tree writes are O(log n).
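The difference between a blind write and a read-before-write can be sketched in a few lines. This is only an illustration: SortedIndex below is a sorted list searched by binary search, used as a simplified stand-in for a B-tree, and the class names and the reads counter are made up for this example.

```python
class AppendLog:
    """Blind write: every record goes to the end; nothing is inspected first."""

    def __init__(self):
        self.records = []
        self.reads = 0  # stays 0, no existing record is read before a write

    def write(self, key, value):
        self.records.append((key, value))


class SortedIndex:
    """Simplified stand-in for a B-tree: must find the insert position first."""

    def __init__(self):
        self.keys = []
        self.values = []
        self.reads = 0  # existing keys inspected while locating the position

    def write(self, key, value):
        lo, hi = 0, len(self.keys)
        while lo < hi:  # binary search, roughly log2(n) steps
            mid = (lo + hi) // 2
            self.reads += 1
            if self.keys[mid] < key:
                lo = mid + 1
            else:
                hi = mid
        self.keys.insert(lo, key)
        self.values.insert(lo, value)


log, index = AppendLog(), SortedIndex()
for i in range(1000):
    key = (i * 7919) % 1000  # deterministic, non-sequential, distinct keys
    log.write(key, i)
    index.write(key, i)

print("reads before writes, append log:", log.reads)
print("reads before writes, sorted index:", index.reads)
```

The append log never looks at existing data, while the sorted structure pays a handful of reads on every single write. On a real disk-backed B-tree those reads can turn into page fetches, which is where the cost shows up.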

The write-amplification factor (WAF) tells you how many physical write operations you end up doing to accomplish one logical data store write. In IoT systems you need a data store with a lower WAF. LSM-trees, inspired by log-structured file systems, are data structures that support blind writes and are optimised for write-heavy workloads. So any datastore whose storage engine uses an LSM-tree can be a good fit for storing real-time events like IoT events.
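Here is a toy sketch of the LSM-tree idea: writes land in an in-memory table (the memtable), which is flushed as an immutable sorted run once it fills up, and reads check the memtable first and then the runs from newest to oldest. TinyLSM, memtable_limit and the two counters are names invented for this example. Real engines add a write-ahead log, bloom filters and levelled or tiered compaction, none of which is modelled here.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}        # in-memory buffer for recent writes
        self.sstables = []        # flushed sorted runs, oldest first
        self.memtable_limit = memtable_limit
        self.logical_writes = 0   # writes requested by the application
        self.physical_writes = 0  # records written out by flush/compaction

    def put(self, key, value):
        self.logical_writes += 1
        self.memtable[key] = value  # blind write: no flushed run is read
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        run = sorted(self.memtable.items())  # one sequential, sorted write
        self.sstables.append(run)
        self.physical_writes += len(run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):  # newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        """Merge all runs into one; this rewriting is where WAF comes from."""
        merged = {}
        for run in self.sstables:  # oldest first, so newer values overwrite
            merged.update(run)
        self.sstables = [sorted(merged.items())]
        self.physical_writes += len(merged)


db = TinyLSM(memtable_limit=4)
for ts in range(10):
    db.put(("tracker-42", ts), ts * 10)  # key = (device, timestamp)

waf_before = db.physical_writes / db.logical_writes
db.compact()
waf_after = db.physical_writes / db.logical_writes
```

In this toy, the ratio of physical to logical record writes grows each time compact() rewrites the runs, so how aggressively a store compacts is one of the knobs behind its WAF. The writes themselves stay sequential and never need a read first, which is the property we were after.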

Now comes the fun part: your choice doesn’t even have to be a NoSQL store, as even some relational data stores support LSM-tree-based storage engines.

2. Choosing (Choice of datastore)

Now that we have the basics right, let’s go back to your question, “What NoSQL store best suits IoT?” Simple! First, pick a data store which uses an LSM-tree as the underlying data structure for its storage engine.

  • AWS DynamoDB (Usage of LSM Tree here is an assumption)
  • Google Bigtable
  • Cassandra (Facebook built this, it’s actually a mashup of Dynamo and Bigtable)
  • Riak
  • MongoDB (with the WiredTiger storage engine, which supports LSM trees although its default is a B-tree)
  • HBase (to a large extent, an open-source clone of Bigtable)

These are just a few to name. Having an LSM tree is only one part of the whole choosing question. Does it mean reads suffer with an LSM tree? To a small extent, yes, but that’s the price you pay for keeping your writes fast. You store data in order to read it back, so it’s not as if LSM-tree data stores are a bad fit for reads. Choose based on the read access patterns you have to support and on other nuances, like the consistency guarantees and user experience you want. You could pick one data store from the short list above, or go for a polyglot persistence model where you mix and match multiple data stores as your use-cases dictate.

Mind you, Netflix relies heavily on Cassandra, only they don’t store the actual videos in Cassandra. Remember, they also collect telemetry from the variety of devices they support for playback: mobile phones, tablets, etc. Storing the video pause location for (each video) x (each user) can be a good use-case for time-series data in the Netflix scenario.
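As a sketch of what “(each video) x (each user)” could look like as a time-series access pattern, the example below keeps rows sorted by a composite key (user_id, video_id, ts_ms) and answers “where did this user last pause this video?” with a range scan. The key layout, the sample rows and the function names are all assumptions for illustration; this is not Netflix's actual schema.

```python
import bisect

# Rows sorted by (user_id, video_id, ts_ms), similar to how a sorted run
# in an LSM-backed store keeps related keys next to each other.
rows = sorted([
    (("u1", "v9", 1000), 12.0),   # value = pause position in seconds
    (("u1", "v9", 2000), 47.5),
    (("u1", "v7", 1500), 3.0),
    (("u2", "v9", 1200), 30.0),
])
keys = [key for key, _ in rows]


def pause_history(user_id, video_id):
    """Range scan: all pause events for one user and one video, in time order."""
    lo = bisect.bisect_left(keys, (user_id, video_id))
    hi = bisect.bisect_left(keys, (user_id, video_id, float("inf")))
    return [(key[2], position) for key, position in rows[lo:hi]]


def latest_pause(user_id, video_id):
    """The most recent pause position, or None if there is no history."""
    history = pause_history(user_id, video_id)
    return history[-1][1] if history else None


history_u1_v9 = pause_history("u1", "v9")
resume_at = latest_pause("u1", "v9")
```

Because the timestamp is the last part of the key, new events for a given user and video always sort after the old ones, so writes stay append-like and the “latest” read is a short scan over neighbouring keys.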

Hope this makes sense.

I have written in detail about the subject of choosing data stores here.”