The concept of the data warehouse can be traced back to the last century. With the continuous growth of big data and the development of the Hadoop ecosystem, offline data warehouses based on the Hive/HDFS architecture became widespread. In recent years, real-time frameworks such as Storm, Spark (Streaming), and Flink have emerged and developed rapidly, and nearly every company now needs a real-time data solution in its systems. In this article, we look at some typical real-time data architectures at Chinese internet companies such as Meituan, NetEase, and OPPO. Their choices of storage and computing engines, along with the way they divide layers, may offer some inspiration.

Four examples:

Meituan's Flink-based real-time data warehouse platform

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/d7bab29e-3008-4ea0-bea5-a0f9b1bf3b2e/2020-06-28-1.png

From a functional perspective, Meituan's real-time computing platform covers job configuration, publishing, and status management, as well as resource management. Resource management here means multi-tenant resource isolation, delivery, and deployment.

Traditional data warehouse model

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/542dbb50-def5-4f60-bf01-63ccabd8a3c3/2020-06-28-2.png

There are typically four layers, from bottom to top: ODS (Operational Data Store), DWD (Data Warehouse Detail), DWS (Data Warehouse Summary), and ADS (Application Data Store), with Hive or Spark used for queries.
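To make the layer division concrete, here is a minimal sketch in plain Python of how data might flow from ODS up to ADS. The order records, field names, and the `build_dwd`, `build_dws`, and `build_ads` helpers are all hypothetical and only illustrate the idea; a real offline warehouse would express these steps as Hive or Spark jobs over tables.

```python
from collections import defaultdict

# ODS: hypothetical raw order records, kept as they arrive (all strings).
ods_orders = [
    {"order_id": "1", "city": " Beijing ", "amount": "30.5"},
    {"order_id": "2", "city": "Shanghai", "amount": "12.0"},
    {"order_id": "3", "city": "Beijing", "amount": "bad"},  # malformed amount
    {"order_id": "4", "city": "Beijing", "amount": "9.5"},
]


def build_dwd(ods_rows):
    """DWD: cleaned, typed detail records; rows with a malformed amount are dropped."""
    detail = []
    for row in ods_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        detail.append({
            "order_id": int(row["order_id"]),
            "city": row["city"].strip(),
            "amount": amount,
        })
    return detail


def build_dws(dwd_rows):
    """DWS: summary per city (order count and total amount)."""
    summary = defaultdict(lambda: {"order_count": 0, "total_amount": 0.0})
    for row in dwd_rows:
        entry = summary[row["city"]]
        entry["order_count"] += 1
        entry["total_amount"] += row["amount"]
    return dict(summary)


def build_ads(dws_summary):
    """ADS: application-ready result, here cities ranked by total amount."""
    return sorted(
        dws_summary.items(),
        key=lambda item: item[1]["total_amount"],
        reverse=True,
    )


if __name__ == "__main__":
    for city, stats in build_ads(build_dws(build_dwd(ods_orders))):
        print(city, stats)
```

Each layer reads only from the layer below it, which is why summaries in DWS can be rebuilt from DWD without touching the raw ODS data again.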

Real-time data warehouse

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/8e5d382d-1d39-4419-b369-feb5cb11bc1a/2020-06-28-3.png

In the real-time model, the DWD and DWS layers are usually based on Kafka. For performance reasons, dimension data is usually kept in HBase or in the Tair KV store.
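The reason for keeping dimension data in a KV store is that every event in the stream needs a fast point lookup by key. The sketch below illustrates that pattern in plain Python. `DimensionStore` is a hypothetical in-memory stand-in for HBase or Tair, not their client API, and the event fields (`merchant_id`, `city`) are assumptions made for illustration. In production, the events would be consumed from a Kafka topic and the enriched records written back to another topic.

```python
from typing import Dict, Iterable, Iterator, Optional


class DimensionStore:
    """Hypothetical in-memory stand-in for a KV store such as HBase or Tair."""

    def __init__(self, rows: Dict[str, dict]) -> None:
        self._rows = dict(rows)

    def get(self, key: str) -> Optional[dict]:
        # Point lookup by key; returns None when the key is unknown.
        return self._rows.get(key)


def enrich(events: Iterable[dict], dims: DimensionStore) -> Iterator[dict]:
    """Join each detail event with its dimension row, one lookup per event."""
    for event in events:
        dim = dims.get(event["merchant_id"])
        enriched = dict(event)  # copy so the input event is not mutated
        enriched["city"] = dim["city"] if dim is not None else None
        yield enriched


if __name__ == "__main__":
    merchants = DimensionStore({"m1": {"city": "Beijing"}})
    stream = [
        {"order_id": 1, "merchant_id": "m1", "amount": 30.5},
        {"order_id": 2, "merchant_id": "m9", "amount": 12.0},  # unknown merchant
    ]
    for record in enrich(stream, merchants):
        print(record)
```

Events whose key is missing from the dimension store are passed through with `city` set to `None` rather than dropped. That is one possible policy; a real pipeline might instead retry or route such events to a side output.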

Quasi-real-time data warehouse model