SAP HANA, having entered the data 2.0/3.0 space at the right time, has been gaining traction lately; and there will be a lot of users like me who want to get their hands dirty and see how HANA can solve some complex data problems.
I am currently running a pilot deployment of the HANA appliance for one of my clients. Based on preliminary research, here are some high-level thoughts on HANA; I will follow up with a few technical blog posts in the coming days as my evaluation continues (performance stats, comparisons, pros & cons, etc.).
Most Important Features
A few fascinating points that made me evaluate HANA:
- Support for hybrid data stores: HANA supports both ROW and COLUMN table types in the same engine, which is a big missing piece in many leading data stores at present.
- Support for both SQL and MDX
- Support for the SQLScript language
- Support for modeling views that return aggregated data (non-materialized, so it avoids data duplication; this cuts both ways, as both approaches have their own advantages and disadvantages)
- Support for R integration and the most popular predictive analytics functions in the form of the Predictive Analysis Library (PAL), written in SQLScript, which can shorten development cycles for data scientists (the coverage of clustering and classification algorithms is good enough for a large audience).
- Support for time travel using history tables, which enables easy bookkeeping of sensitive data, especially for financial and SEC use cases, across UID (Update, Insert and Delete) operations; this is somewhat similar to HBase cell versions, and it avoids the append-only or audit-table logic that many companies have relied on.
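To make the hybrid-store and time-travel points concrete, here is a rough SQL sketch. The table and column names are hypothetical, and the DDL follows the HANA 1.0 documentation as I understand it; verify the exact syntax against your revision before relying on it.

```sql
-- Hypothetical tables; a sketch of HANA DDL, not a tested script.

-- Row store: suited to point lookups and OLTP-style access.
CREATE ROW TABLE orders_row (
    order_id  INTEGER PRIMARY KEY,
    customer  NVARCHAR(100),
    amount    DECIMAL(15,2)
);

-- Column store: suited to scans and aggregations.
CREATE COLUMN TABLE orders_col (
    order_id  INTEGER PRIMARY KEY,
    customer  NVARCHAR(100),
    amount    DECIMAL(15,2)
);

-- History table: retains prior row versions for time travel.
CREATE HISTORY COLUMN TABLE orders_hist (
    order_id  INTEGER PRIMARY KEY,
    amount    DECIMAL(15,2)
);

-- Read the table as of an earlier point in time (time travel),
-- instead of maintaining a separate audit/append-only table.
SELECT * FROM orders_hist AS OF UTCTIMESTAMP '2012-06-01 00:00:00';
```

The point of the last query is that the bookkeeping lives in the engine itself, rather than in application-level audit logic.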
How It Can Power Next Generation Analytics
Intel has taken computational power to new heights over the last few years, and the cost of memory and SSD/flash IO cards is dropping day by day; HANA tries to capitalize on this by standardizing its hardware requirements.
If SAP can deliver what has been promised, and unless something really blows up in terms of performance or stability, HANA can surely play a crucial role in the following areas, where there is still a big vacuum in today's (big) data world:
Real Time Analytics
Transactions, an in-memory grid, persistence and real-time analytics are all critical for any business. Real-time stats will become the de facto expectation for the majority of businesses in the coming years, as they reflect directly on user experience and take business insights to the next logical level.
As there is no good solution for today's real-time needs in large data-flow environments, people resort to alternatives like:
- Custom ETL (which takes away the "real" in real-time)
- In-memory counter-based solutions (just counts in real time; other stats are not exposed)
- Separate event handling via a parallel implementation alongside the transactional flow (currently either in-memory counters or a priority-queue solution)
The day is not far off when everything will be expected to be (near) real-time, and HANA will surely lead this space.
Reporting and Visualization:
Even though the reporting and visualization space is old and has matured over the years, the OLAP/MDX engine has always been at the core of the problem for most reporting and visualization needs in the big data analytics space, and a lot of reporting needs have been suppressed by the lack of a scalable in-memory OLAP solution.
Having personally been involved in a few analytics/reporting products over the last few years, I have seen a lot of reporting features stripped down or eliminated for the following reasons:
- lack of materialized data in time
- data being refreshed too frequently, causing heavy invalidation
- cubes not being able to keep up with the refresh rate
There are cases where cubes are built overnight to serve tomorrow's needs with yesterday's data, which by no means helps any business.
HANA can be a strong candidate in this space, and one could expect major changes in the reporting/visualization area once MDX becomes less of a problem on frequently changing data and can keep up in near real time by staying closer to the actual source of the data.
One Solution Fits All:
HANA bridges the gap between the typical OLTP engine and the warehouse/analytics world, while also positioning itself in the next-generation database SaaS suite by offering features like PAL (Predictive Analysis Library) and an advanced scripting language, SQLScript, which make the system more extensible and let developers and third-party vendors easily build an ecosystem around it.
HANA could eventually become the core of SAP's software-as-a-service (SaaS) offering; more than that, it can replace the need for many components of the typical (big) data architecture, such as OLTP, NoSQL, ETL, warehouse, data mart and OLAP.
More businesses are leaning towards predictive analytics, and Gartner predicts it will be the future of most businesses by 2016; yet no database vendor has seriously incorporated core algorithms (or efficient statistical functions) and made them extensible, so that the computation can run like a stored procedure, keeping the dataset within the engine and taking full advantage of the hardware.
But this is not easy for any database vendor to support without an expressive language, which none of the current databases offer (MongoDB came closest, but apparently did not deliver). HANA took a stab at it thanks to its support for SQLScript.
Basically, instead of fetching massive amounts of data into application logic and processing one row at a time, the logic needs to be pushed down to the server, provided the server can handle it efficiently; and HANA is designed for exactly this purpose (its parallel calculation engine).
This is surely a winning feature for HANA, targeting future business needs while solving a currently burning problem.
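As a rough illustration of this pushdown style, here is a hypothetical SQLScript procedure (the procedure, table and column names are my own invention, not from any real schema). The table-variable dataflow is what allows the calculation engine to evaluate independent steps in parallel rather than shipping rows to the application:

```sql
-- Hypothetical SQLScript sketch: the aggregation runs inside the engine
-- instead of row-by-row in application code. Verify syntax against your
-- HANA revision's SQLScript reference.
CREATE PROCEDURE top_customers (
    OUT result TABLE (customer NVARCHAR(100), total DECIMAL(15,2))
)
    LANGUAGE SQLSCRIPT READS SQL DATA AS
BEGIN
    -- Intermediate table variable; the optimizer can parallelize
    -- independent statements in this dataflow.
    big_orders = SELECT customer, amount
                 FROM orders
                 WHERE amount > 1000;

    result = SELECT customer, SUM(amount) AS total
             FROM :big_orders
             GROUP BY customer;
END;
```

The application then receives only the small aggregated result set, not the raw rows.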
What Is Missing
- Support for unstructured data: Currently, HANA has no way to handle unstructured data directly; one needs to transform it into structured data first. Hopefully we can expect this important feature down the line.
- Materialized views: Initially I was a little hesitant to evaluate HANA without materialized-view support; even though it can compute aggregates on the fly thanks to in-memory processing speeds, having aggregated data materialized in real time would add a lot of value in real-world analytical scenarios.
- NoSQL/key-value support: One does not really need a separate key-value store when there is a high-performance in-memory grid like HANA, but it would be nice to have a non-transactional table store (in-memory, or optionally persisted) with a native REST API (part of HANA XS) for PUT, GET, POST and DELETE operations.
- Standalone: It would be nice to have a version that runs on standard hardware, ideal for development, testing and staging, and also for smaller-footprint use cases.
- SUSE: HANA is supported only on the SUSE OS, which is somewhat outdated (at least in North America); I wish they had picked either Red Hat or Ubuntu.
- Cost: It is an expensive solution and may not be an option for small companies (forget about startups, unless SAP can spin the wheel by giving it away for free as part of their new cloud offering; and they should, if they really want market penetration).
- SQL: At first glance, it lacks a lot of the basic SQL support one may need (especially for people coming from a MySQL, PostgreSQL or Oracle background).
Even though I am not a great fan of the appliance model (or, for that matter, of commercial solutions in general), if SAP can deliver what has been promised (the multi-billion-dollar question: faster, better and cheaper) and make it more affordable to small and medium-sized companies, then we will see many HANA adoptions in the coming years, and HANA will be a dream development platform for analytics and data science engineers.
I will have more technical blog posts on HANA in the coming days, along with explorations of solutions like Platfora, Impala, ParStream, etc.; if you have a solution you would like me to consider, please drop me a note in the comments section.