AZURE SYNAPSE LINK FOR COSMOS DB DEEP DIVE — PART 1

Enable near real-time analytics over operational data using Synapse Link

RK Iyer
Microsoft Azure
Published in
5 min readNov 9, 2022

--

HOW TO LEVERAGE AZURE SYNAPSE LINK TO PERFORM ANALYTICS OVER AZURE COSMOS DB DATA

Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash

❑ Background

Most business applications store their transactional data in Cosmos DB containers and push the same data in data lake to perform analytics, BI, and machine learning over your operational data. This process of storing and maintaining both transactional and analytical stores is tedious, costly and requires high maintenance. Also, in this scenario, transactional and analytical workloads will compete for the same resources, forcing customers to provision throughput for both.

During my customer interactions, I have many times come across same query if Cosmos DB can also be leveraged to generate real-time insights on operational data without impacting the transactional store performance.

To solve the above problem, Azure Synapse Link was introduced. In this 2-part blog series, I will share how Synapse link can be leveraged along with the best practices based on my experience in implementing in customer projects.

❑ Traditional approach before Synapse Link?

  1. All application data in an Azure Cosmos DB container is internally stored in an indexed row-based “transactional store”.
  2. Azure Data Factory is used to retrieve the data from Cosmos DB and pushed periodically to Data Lake (Azure Data Lake Store Gen2) using Azure Data Factory (ADF). Some customers also use change feed to get latest data & push it into Data Lake.
  3. This data is transformed, standardized, aggregated & pushed to Cosmos DB to be used in a consumable form.

Below are the limitations of this approach -

  1. There is a performance & cost impact (Extra RU’s) on the transactional workloads.
  2. There is a delayed insight on Data as the data is pulled from transactional store in a batched interval.
  3. Data formats & storage layer needs to be managed for analytics.

Synapse link is used to solve the above limitations.

❑ How Synapse Link work?

Let’s break down how Synapse Link works -

  1. All application data in an Azure Cosmos DB container is internally stored in an indexed row-based “transactional store”. Row store format is designed & highly optimized for transactional reads and writes & operational queries with milliseconds response times.
  2. This transactional store data is automatically synced to analytical store (Column store) within 2 mins i.e. All the inserts, updates, deletes to operational data are automatically synced from transactional store to analytical store in near real time. This data stored is stored in Parquet format designed with efficient data compression & encoding schemes along with enhanced performance to handle complex data storage and retrieval in bulk. Also, an important point to note that most of the columnar format files are append only but here the updates as well as deletes are also taken care of, which means that latest data is available in analytical store.
  3. Synapse analytics (Synapse Apache Spark and serverless SQL pool) can query this analytical store through cloud-native hybrid transactional and analytical processing (HTAP) capability known as Synapse Link without impacting the transactional store’s provisioned throughput.

Azure Synapse Link creates a tight seamless integration between Azure Cosmos DB and Azure Synapse Analytics.

Synapse Link is currently supported for Azure Cosmos DB SQL API & Azure Cosmos DB for Mongo DB

❑ Advantages of Synapse Link -

🗹 Reduces complexity and manageability by using “Auto-Sync”— Your transactional data is ready to be used for analytics without any change feed jobs or complex ETL. There is no need to monitor and manage these complex pipelines.

🗹 Near Real time — Your transactional data is automatically synced to analytical store within 2 mins providing limitless opportunities for near real-time analytics over the operational data.

🗹 No performance or costs impact (No Extra RU’s) on your transactional workloads — The analytical workload is independent of the transactional workload traffic & not consuming any of the provisioned throughput of your operational data.

🗹 Automatic schema inference — Converting unstructured data into a structured data is very difficult. While Azure Cosmos DB transactional store is schema-agnostic, Azure Cosmos DB analytical store is schematized to optimize for analytical query performance. With the auto-sync capability, Azure Cosmos DB manages the schema inference over the latest updates from the transactional store. It also manages the schema representation in the analytical store out-of-the-box which, includes handling nested data types.

🗹 Compressed Cheaper Data Storage— Data is compressed automatically in a parquet format resulting in approximately 1/10 of actual data size.

🗹 There is a native integration with Synapse to analyze the data directly with SQL & Spark runtimes

🗹 There is a native integration with PowerBI — You can build Power BI dashboards with just a few clicks using Azure Cosmos DB portal. For more information, see Integrated Power BI experience in Azure Cosmos DB portal for Synapse Link enabled accounts.

🗹 Synapse Link Global Availability — Cosmos DB data is available for analytics in any global region where the Cosmos DB account is replicated

❑ Synapse Link Use Case Pattern (Ad hoc Analysis) -

Synapse Link use case patterns

Below are some of the common use case patterns for Synapse Link for Cosmos DB -

  • Run Analytical T-SQL ad hoc Queries — Query Cosmos Db data in place, in seconds using serverless SQL pool and full expressiveness of T-SQL language.
  • Build near real-time dashboards — Use PowerBI integrated in Synapse.
  • Build a logical data warehouse — Analyze unified view across Azure Cosmos DB, Azure Data Lake store & Azure Blob Storage.
  • Prepare & train predictive pipeline — Generate insights over the operational data using machine learning translates. You can query the data in an interactive notebook or scheduled remote jobs without complex data engineering using Synapse Spark pools & build Machine Learning (ML) models with Spark ML algorithms and Azure ML integration in Azure Synapse Analytics. You can write back the results after model inference into Azure Cosmos DB for operational near-real-time scoring.

I would like to thank Rodrigo Souza for reviewing the content.

I hope this blog helped you in understanding the basics of Cosmos DB Synapse Link, its benefits & use case patterns. In the next blog we will discuss best Practices & learnings for using Synapse Link

Happy Learning!!!

Please Note — All opinions expressed here are my personal views and not of my employer.

--

--

RK Iyer
Microsoft Azure

Architect@Microsoft, Technology Evangelist, Sports Enthusiast! All opinions here are my personal thoughts and not my employers.