In this blog series, we’ll explore the concepts that make up the Semantic Data Lake. We’ll begin with an introduction to Hadoop – what is it and why was it developed? Techniques traditionally applied to mastering complex organizational reference data, such as customer records, often require centralization to enforce standardization.
The time needed to construct and maintain these solutions diminishes responsiveness in many large organizations, because building large-scale data warehouses or message integration solutions requires enormous investments of resources and patience before value emerges. Alternatives were needed. Spurred by the growth of the digital economy, agility and flexibility have become critical success factors for digital transformation initiatives.
Internet giants like Google and Yahoo had managed to maintain and increase responsiveness, delivering new real-time capabilities while others struggled to deliver weekly refreshes, let alone daily refreshes or streaming data. Well-understood data warehousing techniques that required replicating and restructuring data simply could not be applied: the volumes were far too large, and the time required to process them into a defined structure was not available. Instead, these giants developed open source technologies, the cornerstone of which is today commonly known as Hadoop.
The basic premise behind Hadoop was to provide in-place query capabilities on a cluster of commodity servers. This direction was contrarian in many ways. But architectures and databases aside, the critical Hadoop success factor was its inherent ability to send the algorithm to the data, rather than relying on traditional reshaping techniques that centralize data for processing.
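To make "sending the algorithm to the data" concrete, here is a minimal word-count sketch in the style of a Hadoop Streaming job written in Python. It is an illustration of the general pattern rather than anything shown in the video: the framework ships the mapper to the nodes holding the input blocks, shuffles and sorts the emitted keys, and feeds each key's values to the reducer. The file name and the map/reduce command-line switch are our own assumptions.

```python
#!/usr/bin/env python
# wordcount_streaming.py -- an illustrative Hadoop Streaming sketch.
# Hadoop runs the mapper on the nodes that hold the input blocks, then
# shuffles and sorts the emitted keys before invoking the reducer.
import sys


def mapper():
    """Emit a tab-separated (word, 1) pair for every token on stdin."""
    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")


def reducer():
    """Sum counts per word; input arrives from the shuffle sorted by key."""
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")


if __name__ == "__main__":
    # Hadoop Streaming invokes this script as either the mapper or the reducer.
    mapper() if sys.argv[1] == "map" else reducer()
```

A job like this would typically be submitted with the standard streaming jar, for example `hadoop jar hadoop-streaming.jar -files wordcount_streaming.py -mapper "python wordcount_streaming.py map" -reducer "python wordcount_streaming.py reduce" -input /data/logs -output /data/wordcounts` (the jar path and HDFS paths are placeholders). The point is that only the small script travels across the cluster; the data stays where it lives.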
In the video above, we illustrate the query-in-place capability of Hadoop on commodity hardware. Hive is used as a programmatic interface in which a SQL-like script is developed. When executed, the script generates corresponding Map/Reduce jobs that are directed to the nodes containing the required files. The jobs run on those nodes, and the results are compiled and returned to the Hive interface. Note that these are all native Hadoop features that do not require additional developer intervention.
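As a rough sketch of what that programmatic interface can look like from a client's point of view, the snippet below uses the PyHive library to submit a SQL-like query; Hive compiles it into Map/Reduce jobs that run where the table's files reside, and only the results come back to the client. PyHive is our choice for illustration (the video may use the Hive CLI or another client), and the host, table, and column names are hypothetical.

```python
# A sketch of querying Hive programmatically with the PyHive library.
# Hive translates the HiveQL below into Map/Reduce jobs that execute on the
# nodes holding the table's files; only the compiled results are returned.
from pyhive import hive

# Hypothetical connection details and table/column names.
conn = hive.Connection(host="hadoop-edge-node", port=10000, database="default")
cursor = conn.cursor()

cursor.execute(
    """
    SELECT customer_region, COUNT(*) AS order_count
    FROM raw_orders
    GROUP BY customer_region
    """
)

for region, order_count in cursor.fetchall():
    print(region, order_count)

cursor.close()
conn.close()
```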
This critical concept of sending the algorithm to the data, coupled with the ability to query content without restructuring it, was foundational to the definition of the Hadoop Data Lake. Join us next time, when we’ll see how organizations have taken the operational savings of cost reduction and cost avoidance and applied them to optimization efforts built on 360-degree data views. How does your organization use a Data Lake? What 360-degree data views power your analytics? We welcome your thoughts, value your insights, and act on your feedback: share below!