Pivotal to modern data warehouse design is the data lake. The term "data lake", coined by James Dixon, CTO of Pentaho, refers to the ad hoc nature of data in a lake, where raw data sits alongside other data; this contrasts sharply with the clean, processed data stored in a traditional data warehouse system. Internal and external organisational data flows into this virtual lake with minimal transformation or change to the schema. By following this principle, organisations can quickly and effectively establish a central repository for structured as well as unstructured data, creating easy on-demand access to the data.
Numerous design considerations must be weighed when establishing the architecture of an analytics solution. Typically, this architecture is guided by organisational requirements and based on the organisation’s data maturity. One constant, however (no matter where the organisation is on the maturity curve), is that most organisational data, structured or unstructured, needs to be centralised as a starting point.
Data lakes are normally configured on a cluster of inexpensive, scalable infrastructure, allowing data to be dumped in the lake for later use without having to worry about storage capacity. Storage clusters can be hosted on-premises or in the cloud; cloud hosting has become prevalent because of the low cost of cloud storage.
The benefits of implementing a data lake are as follows:
Data lakes work on a principle called schema-on-read, implying that there is no predefined schema into which data needs to be fitted before storage.
Modern Data Architecture (MDA) design creates the flexibility to ingest data as fast as possible and enables data to be stored as is, and in any format.
Only when the data is read during processing is it parsed and adapted into a schema as needed, saving time usually spent on schema definition.
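The schema-on-read principle described above can be illustrated with a minimal Python sketch. It is not tied to any particular product: the raw records, the field names and the `read_with_schema` helper are all hypothetical, chosen only to show that data is stored untouched and that a schema is applied at the moment of reading.

```python
import json

# Hypothetical raw events, stored exactly as they arrived
# (one JSON document per line, no schema enforced on write).
RAW_LAKE_LINES = [
    '{"customer": "A001", "amount": "120.50", "channel": "web"}',
    '{"customer": "A002", "amount": 75, "notes": "phone order"}',
    '{"customer": "A003"}',
]

# The schema is declared by the reader, not by the storage layer.
# Field names and types here are illustrative assumptions.
SALES_SCHEMA = {"customer": str, "amount": float}


def read_with_schema(lines, schema):
    """Parse raw lines and project each record onto the reader's schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Missing fields become None; extra fields are simply ignored.
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


rows = read_with_schema(RAW_LAKE_LINES, SALES_SCHEMA)
for row in rows:
    print(row)
```

A different consumer could read the same raw lines with a different schema (for example, one that keeps `channel`), without any change to what was stored.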
One of the first considerations in the data design of a Microsoft Modern Data Architecture is data ingestion and storage. Azure Synapse uses Azure Data Lake Storage Gen2 (ADLS Gen2) as a next-level data storage solution to support large-volume data analytics. ADLS Gen2 combines ADLS Gen1 features (such as file-level security, scaling and file system semantics) with Azure Blob Storage features such as tiered storage, disaster recovery and high availability. ADLS Gen2 provides the flexibility to store masses of data in massive files (up to a petabyte per file) while still supporting queries that perform well. Azure Synapse Analytics integrates seamlessly with ADLS Gen2, which makes the lake well suited both as a data staging area and as an operational data store. PolyBase makes it possible for Azure Synapse Analytics to access data directly from the data lake without ingesting or committing it to the database itself, so that only recent or relevant data needs to flow into the data warehouse.
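The idea of leaving data in the lake and letting only recent or relevant data flow into the warehouse can be sketched in plain Python. This is not PolyBase and does not call any Azure API; it assumes a hypothetical lake layout with one folder per ingestion date, and the paths, dates and `recent_files` helper are invented for illustration.

```python
from datetime import date, timedelta

# Hypothetical lake layout: one folder per ingestion date.
LAKE_FILES = [
    "sales/2024-01-01/part-0.csv",
    "sales/2024-01-15/part-0.csv",
    "sales/2024-01-30/part-0.csv",
    "sales/2024-01-31/part-0.csv",
]


def recent_files(paths, today, days):
    """Return only the files whose date folder falls within the last `days` days."""
    cutoff = today - timedelta(days=days)
    selected = []
    for path in paths:
        # The second path segment is assumed to be an ISO date (YYYY-MM-DD).
        folder_date = date.fromisoformat(path.split("/")[1])
        if folder_date >= cutoff:
            selected.append(path)
    return selected


# Only the last week's files would be loaded into the warehouse;
# everything else stays in the lake and remains available on demand.
to_load = recent_files(LAKE_FILES, today=date(2024, 1, 31), days=7)
print(to_load)
```

Older files are never deleted or moved in this sketch; they are simply not selected for loading.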
The TrueNorth Group (a Microsoft Gold Partner) will be facilitating ‘Analytics in a Day’ workshops with companies that are interested in deep diving into the technology to accelerate their journey towards a Modern Data Architecture in Azure. The Analytics in a Day workshop is designed to enhance learning and help companies get started on Azure Synapse Analytics. The sessions are focused and address each customer’s specific needs.
Complete our Expression of Interest form here - https://tng.digital/analytics