In the future, the difference between successful and unsuccessful companies will increasingly depend on their ability to leverage information. To produce great decisions, an organization needs two kinds of data - real-time information and historical information - together with the ability to blend the two and predict outcomes. As time moves on, the amount of data gathered will grow by at least one, and in some cases two or three, orders of magnitude, particularly as we start pulling in data from outside the organization and from the "Internet of Things".
Relational data warehouses and their big price tags have long dominated complex analytics and reporting. However, their slow-changing data models and rigid field-to-field integration mappings are too fragile to support big data's volume and variety. The data lake approach sidesteps these problems.
A "data lake" is a storage repository that holds data in its native format and is used by data scientists for ideation and discovery. It is an effective approach to the challenges of data integration as enterprises increase their exposure to mobile and cloud-based applications and even the sensor-driven Internet of Things. Data lakes store large amounts of data at a cost 10-50 times lower than traditional data warehouses and give business users immediate access to all of it. A data lake can contain machine-generated data, social media content, clickstreams, and even video and audio. Traditional data warehouses are limited to structured data, but a data lake can hold data of any type.
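The "native format" idea above is often called schema-on-read: records land in the lake exactly as produced, and a schema is applied only when an analysis needs one. A minimal sketch in plain Python (the event names and fields are invented for illustration, not from any real system):

```python
import json

# Hypothetical raw events, landed in the lake in their original shape.
raw_events = [
    '{"user": "alice", "action": "click", "ts": 1700000000}',
    '{"user": "bob", "page": "/home", "duration_ms": 420}',
    '{"sensor": "t-12", "temp_c": 21.5}',
]

def load_raw(lines):
    """Ingest as-is: no upfront schema, every record is kept intact."""
    return [json.loads(line) for line in lines]

def read_with_schema(records, fields):
    """Schema-on-read: project only the fields a given analysis needs,
    filling gaps with None instead of rejecting records at ingest time."""
    return [{f: r.get(f) for f in fields} for r in records]

lake = load_raw(raw_events)
clicks = read_with_schema(lake, ["user", "action"])
print(clicks[0])  # {'user': 'alice', 'action': 'click'}
```

Note how the sensor record is never rejected at load time; it simply yields empty fields under a schema it does not match, which is what lets a lake hold any type of data.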
A data lake accepts inputs from various sources and can preserve both the original data fidelity and the lineage of data transformations. Data models emerge with usage over time rather than being imposed up front. The lake can serve as an assembly point for the data warehouse - the location where data is more carefully "treated" for reporting and analysis in batch mode.
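Preserving lineage can be as simple as logging each transformation step against a fingerprint of the original record, so any "treated" value can be traced back to its raw source. A sketch under assumed names (fingerprint, transform, and the temperature conversion are all illustrative):

```python
import hashlib
import json

def fingerprint(record):
    """Stable content hash of a record (illustrative lineage key)."""
    blob = json.dumps(record, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:12]

def transform(record, step_name, fn, lineage):
    """Apply a transformation while appending the step to the record's
    lineage log, keeping the original fidelity traceable."""
    lineage.setdefault(fingerprint(record), []).append(step_name)
    return fn(record)

lineage = {}
raw = {"temp_f": 70.0, "sensor": "t-12"}
cleaned = transform(
    raw, "fahrenheit_to_celsius",
    lambda r: {**r, "temp_c": round((r["temp_f"] - 32) * 5 / 9, 2)},
    lineage,
)
print(cleaned["temp_c"])  # 21.11
```

The raw record itself is never overwritten; the cleaned copy and the lineage log live alongside it, which is the fidelity-plus-lineage property the paragraph describes.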
However, some experts regard data lakes as a fallacy. Data lakes are marketed as enterprise-wide data management platforms for analyzing disparate sources of data in their native format. The idea is simple: rather than placing data in a purpose-built data store, you move it into a data lake in its original format. This eliminates the upfront cost of data transformation, making the lake a cheap store in a data-driven economy.
Data lakes carry substantial risks. The biggest is the inability to determine data quality: in the absence of any mechanism to maintain quality, a data lake turns into a data marsh. Another risk is security and access control, since the lake is an ungoverned store.
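One common mitigation for the "data marsh" risk is a quality gate at ingest: records that fail basic checks are quarantined for review instead of silently polluting the lake. A minimal sketch, assuming a hypothetical feed contract requiring a user and an integer timestamp:

```python
# Assumed contract for this feed (illustrative, not a standard).
REQUIRED_FIELDS = {"user", "ts"}

def quality_gate(records):
    """Split records into accepted and quarantined based on basic checks."""
    good, quarantined = [], []
    for r in records:
        if REQUIRED_FIELDS <= r.keys() and isinstance(r.get("ts"), int):
            good.append(r)
        else:
            quarantined.append(r)
    return good, quarantined

good, bad = quality_gate([
    {"user": "alice", "ts": 1700000000},
    {"user": "bob"},  # missing timestamp -> quarantined
])
print(len(good), len(bad))  # 1 1
```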
Data lakes are going to be critical for successful enterprises because, with emerging standards like Hadoop and MapReduce, companies are realizing they can keep all sorts of data about their business and bring it into a common structure. The lake brings application data and analytics together seamlessly, thanks to its huge storage capacity and its ability to model data on demand.
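The MapReduce pattern mentioned above can be sketched in plain Python as a toy word count - the classic example of the map, shuffle, and reduce phases that Hadoop runs at cluster scale:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Map: emit a (key, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in doc.split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values per key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["data lake", "data warehouse", "lake"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = reduce_phase(shuffle(pairs))
print(counts["data"], counts["lake"])  # 2 2
```

In a real Hadoop deployment the map and reduce functions run in parallel across many machines over data stored in the lake; this single-process sketch only shows the shape of the computation.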