Wednesday, April 2, 2008

Advances in Data Warehousing

As data volumes grow and enterprises expect deeper analysis over larger volumes of historical data, there is a clear need in the market for solutions that improve efficiency across several dimensions of Data Warehousing implementations:
1) Speed to query the data
2) Space to store the data
3) Hardware costs and reliability
etc.

Netezza has addressed the speed issue by throwing (mostly) hardware and hardware architectures at the problem.
Then there is another model, that of "cloud computing": Google has embraced it with Map/Reduce, and open source implementations of Map/Reduce such as Hadoop are following suit, using parallel computing across commodity hardware.
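
To make the Map/Reduce idea concrete, here is a minimal single-process sketch in Python. It is not Google's or Hadoop's API; the function names (map_phase, shuffle, reduce_phase, run_job) and the sample sales records are made up for illustration, and each inner list merely stands in for the data held on one commodity node.

```python
from collections import defaultdict

def map_phase(record):
    """Emit (key, value) pairs for one input record (hypothetical sales row)."""
    region, amount = record
    yield region, amount

def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def run_job(partitions):
    """Run map over every partition, then shuffle and reduce.

    A real cluster would run the map calls in parallel on separate
    machines; here they run one after another in a single process.
    """
    mapped = []
    for partition in partitions:
        for record in partition:
            mapped.extend(map_phase(record))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())

# Each inner list stands in for the data stored on one node (assumed data).
partitions = [
    [("east", 10), ("west", 5)],
    [("east", 7), ("north", 3)],
]
totals = run_job(partitions)
print(totals)
```

The point of the model is that map_phase never needs to see more than one record and reduce_phase never needs to see more than one key, which is what lets the work spread across many cheap machines.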

However, I've been intrigued by the advances in the fundamental data storage and data access technologies for OLAP / data warehousing style applications - especially around "column based databases". Column based databases are not a new concept (Sybase has been offering Sybase IQ for years now) - but some of the upstarts such as Vertica are showing how this fundamental database technology can be packaged and marketed to change the data warehousing and BI technology landscape altogether.

Vertica's database solution seems promising and the following features stand out:
- Column based storage (faster query processing than row-based storage - at least 20x improvement for selects across standard star schemas with fat fact tables).
- Compression of the data (10x compression is easily achievable).
- Shared Nothing architecture (built-in redundancy and parallelism) across commodity hardware nodes - with any kind of storage (SAN, etc.).
- Continuous loading architecture. One can write into the warehouse while concurrently reading from it (as long as the current epoch is not being read).
- Applications can use SQL (no MDX required).
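
The first two features are easier to see with a toy example. The sketch below is plain Python, not Vertica's storage engine: the fact rows, the column names, and the helper functions rle_encode and rle_decode are assumptions made for illustration. It shows the same table held column by column, and run-length encoding as one simple scheme that compresses a sorted, repetitive column well.

```python
from itertools import groupby

# A tiny fact table in row layout (hypothetical data), sorted by date.
rows = [
    ("2008-04-01", "east", 100),
    ("2008-04-01", "east", 250),
    ("2008-04-01", "west", 75),
    ("2008-04-02", "west", 300),
]

# The same data in column layout: one list per column.
columns = {
    "sale_date": [row[0] for row in rows],
    "region": [row[1] for row in rows],
    "amount": [row[2] for row in rows],
}

def rle_encode(values):
    """Run-length encode a list as (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in runs for _ in range(count)]

# A query such as SELECT SUM(amount) only needs to read the amount column.
total_amount = sum(columns["amount"])

# Sorted, low-cardinality columns collapse into a few runs.
encoded_dates = rle_encode(columns["sale_date"])

print(total_amount)
print(encoded_dates)
```

With a fat fact table, a row store would read every field of every row to answer that sum, while the column layout reads a single list. The date column shrinks from four values to two runs here, and the saving grows with the length of each run.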