The fight of the Big Data stores
In October 2017 at the dataMinds conference (former SQL User Days) I gave a presentation about Data Lake Analytics and one of my slides was the following:
This picture was taken and I found that it was quite a nice topic to write something about. So here we are!
In Azure there are 3 different big data stores that you can use and in this blog post I want to explain the differences between them.
Azure Cosmos DB (former Document DB) is a multi-model database or NoSQL database. What does that mean and what are some features of Azure Cosmos DB?
- Multi-Model Database: it can accommodate relational, hierarchical, object, document, graph, wide-column and key/value data. So as an example you can have a document with data talking about a user and one of the attributes is a list of for example delivery addresses. Instead of (if we think relational wise) having different tables for users and addresses and an extra table to make the relation between the two, all of the info is defined in one document (dataset) that can be saved in JSON.
- 1-2-3 Global Distribution – distribution to any number of Azure regions is done in a mouse click to ensure the lowest possible latency. It will also connect automatically to the closest region to your location.
- Popular API’s – many different populare API’s can read from Cosmos DB like DocumentDB, MongoDB & Cassandra API.
- Elastically scale – scale database througput at a per-second granularity, anytime you want.
- Always-On – 99.99% on a single-region and 99,999% on a multi-region account
- More info can be found at: https://docs.microsoft.com/en-us/azure/cosmos-db/
SQL Data Warehouse
Azure SQL Data Warehouse is the easiest to explain if you are already using SQL Server. It’s the same as an ordinary SQL Server database but developed for Petabyte scaled databases. By making use of Massively Parallel Processing it can quickly run complex queries across petabytes of data.
It stores relational tables with columnar storage. That way it can significantly reduce the data storage cost and improves query performance. Compared with traditional database systems, analysis queries finish in seconds instead of minutes, or hours instead of days.
- Elasticity Performance – it automatically scales storage and compute frequently to support short periods of peak activity.
- Compute Performance – make use of Solid State Disk cache to keep the most frequently accessed data close to the CPU
More info can be found at: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/
What is a data lake? One of my colleagues explained it quite nicely. If you take a bottle of water, it’s all cleansed and ready to drink. That water can be poured back into a lake or a river, and nothing will happen. Now if we do that the other way around and we fill our bottle with water from the lake, the bottle might be contaminated with trash or diseases.
So in a data lake we can store any kind of data, unprocessed (river water) and processed (bottled water). And the nicest thing is that it does not have to be structured. Relational data in any sort is not necessary. You can add images, music, documents, databases, text files, … No size limit to files or total storage (That is not possible with Azure Blob Storage). That are the storage capabilities of Data Lake. But an extra resource that is available is Data Lake Analytics that has a query language called “U-SQL”. This so called U-SQL gives us the ability to query text-files but also documents (docx, pdf, …) and images.
How nice would it be if you have a collection of photos and you would want to see only the photos that have a horse on it. Or only with people who are smiling. Or what about executing Python or R code on data that is just there. It’s all possible.
- Highly scalable – For every job you commit you can choose how many processing units you want to use. You pay per second.
- Simple Query Language – U-SQL is a combination of C# and SQL and easy to learn and to use.
- Fast development and quick debugging – by making use of familiar tools like Visual Studio development and debugging is quickly done.
When do we use what? If you need a multi-model database go to Cosmos DB. If you need a relational database make use of SQL Data Warehouse, but if your data exists of all kind of different types of files then make use of Data Lake to clean and transform it, so you can store it in one of the others if necessary.