When to use Columnar databases?
A column-oriented DBMS or columnar DBMS is a database management system that stores data tables by column rather than by row, which makes it suitable for analytical query processing, and thus for data warehouses (OLAP).
Let’s see the below diagram to understand, how actually Columnar Database stores data.
Table of Contents
Most popular databases that support Columnar Database
- Amazon Redshift
- Snowflake
- ClickHouse
- Druid
- Vertica
- Hbase
- Apache Kudu
- Apache Parquet
- Apache Cassandra
- Azure Synapse Analytics
- MariaDB
When to use Columnar Database?
Columnar databases are used in data warehouses where businesses send massive amounts of data from multiple sources for BI analysis. Columnar Databases reduced I/O for the queries having fewer columns as it has to lookup for particular files only based on query columns.
Colmnar Database Internal Architecture? Why Columnar Database are faster in performance for Analytics?
Column-oriented databases (aka columnar databases) are more suitable for analytical workloads because the data format (column format) lends itself to faster query processing — scans, aggregation, etc.
- Columnar databases store data of a single column together (contiguously) on disk or in-memory or both. Thus we can quickly loop over all the column values potentially loading some/all of them into CPU in a single memory reference.
- On the other hand, row oriented databases store a single row (and all its columns) contiguously. In order to go over some/all values of a particular column (and this is what most analytic operations do), we have to go over each tuple and skip the unnecessary columns simply because the values of a column are not stored contiguously.
- Columnar databases tend to easily leverage efficient parallel processing on hardware. For example, SIMD instructions are used in most columnar databases to get data level parallelism and evaluate multiple column values in a single instruction. This improves the performance of typical analytical workload queries — scans, scalar aggregation, joins. Thus most columnar databases use vectorized instructions for query processing.
- Compression – Because each column is stored contiguously as a separate unit, we can use compression technique that is most suitable for the particular column depending on the column cardinality, data type (number, string, varchar etc) and whether the column is sorted or not.
- Simple and powerful techniques like RLE (Run Length Encoding), Bit Vector Encoding, Null suppression etc can be efficiently used on a per column basis and give better compression ratios because the compression algorithm operates on related (same data type) values.
- Another technique used by columnar databases is that they directly operate on the compressed column values during predicate evaluation etc. They avoid decompression (and its CPU cost) until later when its absolutely necessary —presentation of results (necessary column values) to the end user.
- Because each column is stored separately, query processing doesn’t have to go over unnecessary columns that are not touched by the query at all for operator evaluation or output list etc. The required set of columns (usually few in analytical queries) can be directly accessed individually. In most row oriented databases, we need to walk the row column by column and then extract the set of required columns for projection, evaluation etc.
- Late Materialization – Many columnar databases use this technique for constructing only the necessary tuples that really qualify for the resultset. Thus values from multiple columns are not stitched together to form a tuple until later when the predicate evaluation has already happened on the column and we are ready to present the result set to user.
Few examples
Example 1 – SELECT SUM(SALARY) FROM EMPLOYEE;
This is a very simple analytical query typically used in report generation etc. The query can be efficiently processed because:
- All values in SALARY column are stored together. So looping over all the column values to do a simple summation is quick.
- The query can be made more efficient by using SIMD processing techniques to do a parallel ADD operation and some clever bit shifting to compute the sum.
- Unless there is a secondary index on the SALARY column, this operation is going to be very slow in traditional row oriented databases.
Example 2 – SELECT COLUMN1 + COLUMN2 AS COLUMN3;
This simple operation can easily leverage SIMD processing to do a parallel ADD operation between corresponding values of COLUMN1 and COLUMN2.
Example 3 – SELECT SALE_ID, SHIPPING_ADDR FROM SALES WHERE STATE=’CA’
- From the SALEID and SHIPPING_ADDR column we only need the relevant column values where the predicate evaluation is successful on the corresponding value in STATE column.
- Firstly, the predicate evaluation will be super fast because multiple STATE column values can be compared (against CA) in a single instruction by loading them into a SIMD register (typically 128 bit wide).
- Secondly, we will access only the corresponding values from SALEID and SHIPPING_ADDR column (and decompress if they are compressed) where the SIMD evaluation was TRUE. This is where late materialization and the ability to directly operate on compressed data comes in.
Example 4 – SELECT COUNT(*) FROM EMPLOYEE WHERE SALARY >= 80000 AND SALARY < = 150000
- Again, the predicate evaluation on the SALARY column can use SIMD processing to get data level parallelism.
- Once the SIMD evaluation has given its result (typically a bit vector or a mask) over a set of column values, all we have to do is count the number of bits that are set to compute the result of COUNT(*).
Example 5 – SELECT FIRST_NAME, LAST_NAME FROM EMPLOYEE WHERE SALARY >= 80000 AND SALARY < = 150000
- Should be similar to Example 3.
In general, analytical query workload comprises of queries that touch a subset of table columns but a large set of rows to construct the result. Such queries usually require predicate evaluation on all values of one or more columns. Thus they are more efficient with columnar databases because of the organization of data (column values are stored together). On the other hand, OLTP workloads run well against row-oriented databases since they comprise operations interested in full rows (all columns of a row).
Consider the query — SELECT * FROM EMPLOYEE.
This is not really an analytical query since it is interested in all the columns of a single row and will in fact be much slower with columnar databases because it will require several disks seeks to all the separately stored columns. The same query will be very fast with row stores since a row is stored contiguously and all the column values can be loaded into memory in a single physical I/O
[Example] Use case : Analytics for Food Delivery Application
For eg., different events occur in the interaction of users with the food ordering apps
1. They search for a dish (search event)
2. They select a restaurant and view menu (view event)
3. They may add a dish to the cart (add_to_cart event)
4. They may place an order (order event)
5. They may abandon cart (abondon_cart event)
Now if you want to have an answer to the queries like:
1. Top 10 restaurants who got the highest orders in the last one month after user searched “Cold Coffee” where the city was Hyderabad?
2. How many people searched for pure-veg restaurants and placed an order with a particular restaurant in the last year.
3. Most popular items in Jaipur in the last one week?
Column-oriented databases have faster query performance because the column design keeps data closer together, which reduces seek time.
Limitations of Columnar Database
- The columnar database is not suited for incremental data loading.
- Online Transaction Processing (OLTP) applications are not suitable in column-oriented databases, as isolation needs to be done aross different columns in different files, that’s I/O sensitive. Columnar databases have a reputation for performing poorly on join operations, which is why they’re not recommended for OLTP workloads.
- User queries against only a few rows cannot give you any benefits in columnar databases.
Fun fact: Swiggy uses Snowflake for Swiggylytics, here is the architecture of Swiggylytics: