Sunday, June 4, 2017

Big Data in the Context of the Enterprise Data Warehouse - by Mahendra Kumar

Published on January 8, 2015
https://www.linkedin.com/pulse/bigdata-context-enterprise-data-warehouse-mahendra-kumar


The Internet is full of articles on what big data is. In this post, I will focus on big data in the context of the enterprise data warehouse.

Enterprise data warehouses have been around for a long time, and many companies have made huge investments in building them. They bring many benefits, such as integration and standardization of data from multiple sources, access to historical data, pre-aggregation, OLAP, and isolation of the analytics load from OLTP. Big data technologies bring complementary benefits: distributed storage and parallel processing of large volumes of unstructured data, which can be stored cheaply on HDFS and processed using the MapReduce and Spark frameworks.
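To make that concrete, here is a minimal PySpark sketch that reads raw, unstructured log lines from HDFS and counts HTTP status codes in parallel. The HDFS path and the common-log-style field layout are assumptions for illustration, not a prescribed setup.

```python
# Minimal PySpark sketch: parallel processing of raw text stored on HDFS.
# The HDFS path and log layout are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("raw-log-counts").getOrCreate()

# Each line is an unstructured web-log record; Spark splits the files
# into partitions and processes them in parallel across the cluster.
lines = spark.sparkContext.textFile("hdfs:///data/raw/weblogs/2015/01/*")

# Assume the HTTP status code is the 9th whitespace-separated field
# (as in common log format); count requests per status code.
status_counts = (lines
                 .map(lambda line: line.split())
                 .filter(lambda fields: len(fields) > 8)
                 .map(lambda fields: (fields[8], 1))
                 .reduceByKey(lambda a, b: a + b))

for status, count in status_counts.collect():
    print(status, count)
```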

Here are some areas where big data technologies can augment traditional data warehousing:
- Organizations are gathering newer types of data such as social media feeds, public data, web logs, opinions, and reviews. These newer sources provide valuable insights into an organization's customers, products, and service offerings.
- With the advent of the Internet of Things, massive volumes of data are being generated by connected wearables, sensors, automobiles, smart home devices, etc. Organizations are looking to capture and process this data in real time to become more efficient and proactive; some are already using real-time data feeds for timely fraud detection.
- BI has evolved beyond simple reporting and analytics. Organizations are using machine learning algorithms to better understand their customers and to come up with product recommendations or service offerings. More data, coupled with a good approach, yields better predictions and recommendations (see the recommender sketch after this list).
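As a hedged illustration of the recommendation use case above, the sketch below trains a collaborative-filtering model with Spark MLlib's ALS. The input path and the userId/productId/rating column names are hypothetical.

```python
# Sketch of a collaborative-filtering recommender with Spark MLlib's ALS.
# The input path and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("product-recs").getOrCreate()

# Expect one (userId, productId, rating) row per customer interaction;
# userId and productId must be integer ids, rating a numeric score.
ratings = spark.read.parquet("hdfs:///data/curated/ratings")

als = ALS(userCol="userId", itemCol="productId", ratingCol="rating",
          coldStartStrategy="drop")  # drop users/items unseen in training
model = als.fit(ratings)

# Top 5 product recommendations per user, ready to load into the EDW.
recs = model.recommendForAllUsers(5)
recs.show(truncate=False)
```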

There are also use cases where big data technologies are being considered for improving existing warehousing processes and performance:
- ETL: Most warehousing solutions employ ETL to load data into the warehouse from a variety of operational data sources; ETL is also used for change data capture. With ETL, data is transformed before being loaded into the warehouse, and most ETL tools require separate hardware, which can be expensive. An alternative approach is to load the data first and then run the transformations in the database engine itself. Since Hadoop provides cheap storage and processing, raw data can be dumped directly into HDFS and the transformations applied there by running MapReduce or Spark jobs (see the ELT sketch after this list).
- ODS: The volume of transactional data gathered by enterprises keeps growing, putting pressure on the batch window available to process it. Warehouse practitioners typically provide an operational data store (ODS) to give access to more recent data, but an ODS adds cost and still does not deliver real-time insights. Hence, organizations are looking at distributed ingestion and messaging frameworks (such as Flume and Kafka) to ingest large volumes of data in real time.
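Here is a minimal sketch of the ELT pattern described in the ETL item above: raw extracts are landed on HDFS as-is and then cleansed by a Spark job running on the cluster itself. The paths, column names, and CSV layout are assumptions for illustration.

```python
# ELT sketch: raw data is landed on HDFS first, then transformed in place
# with Spark instead of on dedicated ETL hardware. Paths and columns are
# hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elt-orders").getOrCreate()

# Step 1 (load): raw CSV extracts were dumped into HDFS as-is.
raw = spark.read.csv("hdfs:///landing/orders/", header=True)

# Step 2 (transform): cleanse and conform types on the cluster itself.
orders = (raw
          .withColumn("order_ts", F.to_timestamp("order_ts"))
          .withColumn("amount", F.col("amount").cast("double"))
          .dropDuplicates(["order_id"]))

# Step 3: write a conformed dataset that the warehouse loader can pick up.
orders.write.mode("overwrite").parquet("hdfs:///conformed/orders/")
```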

While MapReduce and Spark provide the distributed processing frameworks, abstractions such as HiveQL, Spark SQL, and Pig cater to users familiar with SQL and scripting. For real-time processing, systems such as Spark Streaming and Storm provide distributed, fault-tolerant processing of incoming streams.
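For the streaming side, the sketch below consumes a Kafka topic with Spark Structured Streaming and appends the events to HDFS with checkpointing for fault tolerance. The broker address, topic, and paths are hypothetical, and the spark-sql-kafka connector must be on the classpath.

```python
# Minimal sketch: consuming a Kafka topic with Spark Structured Streaming.
# Broker address, topic name, and paths are hypothetical placeholders.
# (Requires the spark-sql-kafka connector package at submit time.)
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-ingest").getOrCreate()

# Subscribe to the raw events topic; Kafka delivers key/value as bytes.
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")
          .option("subscribe", "transactions")
          .load()
          .select(F.col("value").cast("string").alias("event")))

# Fault-tolerant sink: append each micro-batch to HDFS with checkpointing,
# so ingested events are also available for later batch analysis.
query = (events.writeStream
         .format("parquet")
         .option("path", "hdfs:///stream/transactions/")
         .option("checkpointLocation", "hdfs:///chk/transactions/")
         .start())

query.awaitTermination()
```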
Here is an architecture of an enterprise data warehouse integrated with big data technologies:
[Architecture diagram: traditional BI system (top), batch big data layer (middle), stream processing layer (bottom)]
The top portion of the diagram shows a traditional BI system with a staging database, an ODS, the EDW, and the various components of a BI stack. The middle portion shows big data technologies handling large volumes of unstructured data coming from social media, web logs, blogs, etc.: storage components such as HDFS/HBase and processing components such as MapReduce/Spark. Processed data can be loaded into the EDW or accessed directly using low-latency query engines such as Impala. The bottom portion shows stream processing: messaging frameworks such as Kafka or Flume feeding real-time stream processing systems such as Storm or Spark Streaming.
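As an example of that low-latency access path, here is a hedged sketch that queries HDFS-resident data through Impala using the impyla Python client. The host, port, and table name are placeholders, not a required configuration.

```python
# Sketch: low-latency SQL access to HDFS-resident data via Impala,
# using the impyla client. Host, port, and table are hypothetical.
from impala.dbapi import connect

conn = connect(host="impala-daemon.example.com", port=21050)
cur = conn.cursor()

# Query the processed data in place, without loading it into the EDW.
cur.execute("SELECT status, COUNT(*) FROM weblog_summary GROUP BY status")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```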

(Views expressed on this blog are mine and do not reflect the opinion of Oracle Corp.)
