This is my attempt to clear the air in the subjects of Column Stores and Column oriented databases (both at terminology and at understanding level). I will be talking a bit about how terrible is the idea of grouping column oriented databases as flavour of NoSQL data stores.
What is a column store really ?
There is no scope for any confusion, is there ? The data can be stored physically row-wise or column wise. Column stores store the data column-wise, duh !.
|2||Martin Luther King||Non-Violence||200|
|3||Mother Theresa||Social Service||50|
Figure 1: Row store
|Name||1:Gandhi||2:Martin Luther King||3:Mother Theresa||4:Jackie chan|
Figure 2: Column store of same data
Those who are acquainted with RDBMs must be very familiar with the default way of storing data shown in figure 1. In fact the word RDBMS served as short hand for row oriented RDMBS. Well, swapping columns with rows gives you the a column store or column oriented RDMBS. Remember just because you swapped, neither the relational nature of datastore go away nor it becomes a “NoSQL” data store. Now why would one want to organise data in a column store ? Simple, it optimises column-operations say aggregate queries involving a avg, sum, count.
Check out this use case.
Case for Column stores:
Imagine the above scenario you have to a run a query with table shown in figure 1 to calculate avg salary all employees. Say, there are a million employees and size of a single row be 60 bytes (id = 8 bytes, Name and dept code 15 bytes each and Salary 10 bytes). Processing is simple, do regular SQL read row by row get the salary create a ‘avg bucket’ variable and accumulate it to find ‘avg’ after reaching end of table (since the query touches almost all records, the query optimiser will suggest a table scan and not a index probe). Well whats wrong with this ? Imagine your server had 8 GB RAM, we have 60 bytes x million rows i.e. 60 GB worth data. So the disk IO 48/8 times and each time memory will hit the high limits, let alone the performance of this query. Of course you could argue you will setup a cluster , shard the database to with ID and improve the query throughput. Great, but what will be the cost of the deployment ?
Now, Imagine running the same query against the table shown in figure 2, single read will yield you a row containing all the salaries. 10 bytes x million columns i.e 10GB worth of data. Just see how storing data in column store can dramatically improve performance and query throughput. Data stores like Vertica, Greenplum have special storage engines to support column oriented storage.They also have optimisers to tune column oriented queries.
Why are “column-oriented” data stores are grouped as one of the 4 so called NoSQL data store types ?
Typically blog posts and articles started grouping NoSQL data stores under 4 categories: Document stores, K,V stores, Column-oriented stores and Graph stores. Also if you notice these write ups typically claim BigTable, HBASE, Cassandra as examples of column oriented stores. None of tables store data column wise and in fact Datastax documentation claims Cassandra is a partitioned row store. Cassandra technically is a Distributed Map of a Sorted Maps. If you dig deep one common feature all these data stores offers is a column family to group related maps. This terminology lead to the whole confusion. Read the rant in reference 3 for a detailed version of this.
So just because a NoSQL store offers a Column family feature you cannot call it a Column oriented data store. This is totally wrong.
Quick side bar: I like martin kleppmann for busting myths especially in the distributed data stores subject matter. I personally requested if he had mentioned about this confusion anywhere in his brilliant new book. He thanked for prompting him and offered to call out this confusion. Thanks to him.