LESSONS LEARNT WHILE MIGRATING DATA FROM S3 to BLOB — LEVEL:INTERMEDIATE CATEGORY:DATA MIGRATION

Large Scale Data Migration from AWS S3 to Azure Blob

RK Iyer
10 min read · Sep 7, 2021

✐ Co-Authors — Venkatesh Sankaranarayanan & Amit Damle

❑ Background


Relocation is always an exciting adventure, but one would surely agree that it is equally a stressful and exhausting journey. However, there are ways to ensure that your relocation goes as smoothly as possible by planning, preparing & executing the plan well. Similarly, data relocation/migration, especially from one cloud to another, requires a lot of thought to ensure that the migration journey is smooth.

While working with some of our large customers, we had the challenge of moving around 600 TB to 1 PB of data from AWS S3 to Azure Blob. In this blog, I will share my learnings on the different considerations and how to overcome the challenges using some of the best practices.

Typically there are two main categories of data in a large-scale data migration, irrespective of the domain (healthcare & life sciences, media & entertainment, financial services, retail):

  • Moving large media assets & content data, e.g. audio, video, images, DICOM files, product catalogs, email, web page data, etc.
  • Data lake/data warehouse migrations, moving historical data stored in filesystems/databases

Architectural Pillars of Large Scale Data Movement

Factors to be considered for large scale data movement
  • Cost — Development, maintenance & management, and monitoring & debugging costs need to be considered.
  • Time for Setup — Time for creating/arranging the infrastructure and for the end-to-end migration needs to be considered.
  • Ease of use — How easy it is to perform the migration.
  • Performance — End-to-end data migration must be performed within an acceptable time range depending on business requirements.
  • Security — Data should be transferred in a secure manner. Data protection must ensure that only authorized entities can view, modify, or delete your data.
  • Monitoring — Data transfer should be continuously monitored. For offline transfers, one should also be able to track the physical order through the Azure portal.
  • Reliability — Data must be consistent. There needs to be a mechanism to ensure that the source and destination data match.
  • Control — Selective transfer of data over a period of time.

❑ Migration Options with Pros, Cons & Recommendations

Large Data transfer migration options

Although there are multiple options (as shown in the above chart), we have seen customers wanting to avoid offline/physical transfer due to the time, uncertainty, high coordination effort & unreliability in the current pandemic scenario. In the online/non-physical transfer mode, customers want to avoid reinventing the wheel with custom scripts, since many of the required features like monitoring, auditing, resumability and security would need to be developed from scratch. Azure Data Factory (ADF) becomes the unanimous choice since most of the required features are available out of the box.

Lessons learnt and best practices using ADF

Selective copy of data with prioritization (start development work early rather than waiting for the full migration)

ADF allows a selective copy of data, copying only specific folders or files, so the overall turnaround time is lower. Fine-grained, priority-based control of the data transfer reduces migration risk, since a small amount of high-priority data can be transferred and validated first and development work can start on it, rather than waiting for the entire migration to complete.

Selective copy of data with prioritization

Please refer to Copy data from/to a file system — Azure Data Factory & Azure Synapse | Microsoft Docs for more details.
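In ADF this maps to wildcard folder paths or a file list on the Copy activity source. Purely for intuition, a hand-rolled equivalent of a prioritized selective copy might look like the sketch below (this is not ADF itself; it assumes boto3 and azure-storage-blob, and the bucket, container and prefix names are placeholders):

```python
# Illustrative sketch only: ADF's selective copy does this for you via
# wildcard paths / file lists on the Copy activity source.
# Bucket, container and prefix names below are placeholders.
import boto3
from azure.storage.blob import ContainerClient

PRIORITY_PREFIXES = ["dicom/2021/", "catalog/images/"]  # high-priority folders first

s3 = boto3.client("s3")
container = ContainerClient.from_connection_string(
    "<azure-storage-connection-string>", container_name="migrated-data")

for prefix in PRIORITY_PREFIXES:
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket="source-bucket", Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket="source-bucket", Key=obj["Key"])["Body"]
            # Keep the same key so downstream paths stay unchanged
            container.upload_blob(name=obj["Key"], data=body, overwrite=True)
```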

Delta migration — migration after a time period

Typically we have observed that, until the S3 bucket is fully sunset, downstream applications keep storing files into S3 after the initial one-time migration. This incremental data needs to be loaded into Azure Blob using an incremental data pipeline in ADF. ADF provides the capability to identify files newly created/updated in S3 buckets using the “Filter By Last Modified” property of the Copy Data activity. Users can specify a start and end date-time to fetch the incremental data.

Delta files migration

Note: the “Filter By Last Modified” feature of the Copy Data activity works efficiently if the number of files to be listed is under 100,000. To mitigate this limitation, users can capture the names of the new/modified files into a text file and use the above-mentioned selective copy method (a sketch follows below).
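The helper below sketches that mitigation: it lists S3 objects modified within a window and writes their keys to a text file that can drive the selective copy above. It is only an illustration of the idea behind “Filter By Last Modified”; the bucket name and time window are placeholders.

```python
# Sketch: write the keys of S3 objects modified in a time window to a text
# file, which can then drive the selective copy shown earlier.
# Bucket name and window are placeholders.
from datetime import datetime, timezone
import boto3

start = datetime(2021, 9, 1, tzinfo=timezone.utc)
end = datetime(2021, 9, 7, tzinfo=timezone.utc)

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

with open("delta_files.txt", "w") as out:
    for page in paginator.paginate(Bucket="source-bucket"):
        for obj in page.get("Contents", []):
            # Same idea as "Filter By Last Modified" on the Copy Data activity
            if start <= obj["LastModified"] < end:
                out.write(obj["Key"] + "\n")
```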

Time for Setup —

Since ADF is a fully managed PaaS service, an ADF instance can be created within a minute through a click of a button, or using an ARM template, the Azure CLI, or a PowerShell command.
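A factory can also be provisioned programmatically; a minimal sketch with the azure-mgmt-datafactory Python SDK (subscription ID, resource group, factory name and region below are placeholders) could look like:

```python
# Sketch: provision an ADF instance programmatically.
# Subscription ID, resource group, factory name and region are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
factory = client.factories.create_or_update(
    resource_group_name="rg-migration",
    factory_name="adf-s3-to-blob",
    factory=Factory(location="eastus"),
)
print(factory.provisioning_state)
```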

Monitoring and Alerting

ADF provides out-of-the-box monitoring capabilities to monitor copy pipeline runs and set alerts in case of failures. Detailed analysis can be performed by looking at parameters like data read, files read, data written, files written, copy duration, throughput, DIUs used, etc.

Pipeline Monitoring
Detailed Monitoring of Copy

For more details, please refer to Monitor data factories using Azure Monitor — Azure Data Factory | Microsoft Docs.
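Beyond the portal views, the same run information can be pulled programmatically, for instance to feed a custom dashboard. A rough sketch with the azure-mgmt-datafactory SDK (subscription, resource group and factory names are placeholders):

```python
# Sketch: query the last day's pipeline runs for a factory and print basic
# run metrics. Resource group and factory names are placeholders.
from datetime import datetime, timedelta, timezone
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
now = datetime.now(timezone.utc)
runs = client.pipeline_runs.query_by_factory(
    "rg-migration", "adf-s3-to-blob",
    RunFilterParameters(last_updated_after=now - timedelta(days=1),
                        last_updated_before=now),
)
for run in runs.value:
    # Detailed counters (data read/written, files, throughput, DIUs) are on
    # the copy activity's run output.
    print(run.pipeline_name, run.status, run.duration_in_ms)
```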


Auditing

Although ADF doesn’t provide out-of-the-box auditing, since auditing is application/use-case dependent, it is very easy to implement. Auditing can be performed for all files where the copy is successful and also for copy failures, be they user or system failures. This also helps future delta loads, since these audit tables can be referred to later. Some of the key fields that can be audited are item name, data read, data written, rows copied, copy duration, load date-time, and a status flag (success or failure).

Auditing
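As an illustration, such an audit table can be as simple as the sketch below (shown with SQLite for brevity; in practice the values would come from the Copy activity output and land in an Azure SQL or similar table, and the sample file name and figures are made up):

```python
# Sketch: a minimal audit table capturing the fields listed above.
# Shown with SQLite for brevity; the sample values are made up.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("migration_audit.db")
conn.execute("""CREATE TABLE IF NOT EXISTS copy_audit (
    item_name      TEXT,
    data_read      INTEGER,   -- bytes read from S3
    data_written   INTEGER,   -- bytes written to Blob
    rows_copied    INTEGER,
    copy_duration  REAL,      -- seconds
    load_datetime  TEXT,
    status_flag    TEXT       -- 'SUCCESS' or 'FAILURE'
)""")

def audit(item, data_read, data_written, rows, duration, status):
    conn.execute(
        "INSERT INTO copy_audit VALUES (?,?,?,?,?,?,?)",
        (item, data_read, data_written, rows, duration,
         datetime.now(timezone.utc).isoformat(), status))
    conn.commit()

audit("media/video/episode01.mp4", 104857600, 104857600, 0, 42.5, "SUCCESS")
```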

Reliability —

Reliability and building robust pipelines are key requirements for any data migration project. Please find below some of the key capabilities of ADF which we utilized.

Resumability — ADF has a resume capability with which you can build robust pipelines for many scenarios. With this capability, if one of the activities fails, you can rerun the pipeline from that failed activity. When moving data via the Copy activity, you can resume the copy from the last failure point at the file level instead of starting from the beginning, which greatly increases the resilience of your data movement solution, especially when moving large files between file-based stores.

Rerun from failed activity
  • When you copy data from Amazon S3, Azure Blob, Azure Data Lake Storage Gen2 or Google Cloud Storage, the Copy activity can resume from an arbitrary number of already-copied files (see the sketch below).
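Conceptually, file-level resume means a rerun skips files that were already copied successfully. The sketch below shows that idea as a custom script would implement it, not ADF's internal mechanism; bucket and container names are placeholders.

```python
# Sketch of file-level resume: skip objects that already exist at the
# destination with the same size, so a rerun continues from the failure
# point. Bucket/container names are placeholders.
import boto3
from azure.core.exceptions import ResourceNotFoundError
from azure.storage.blob import ContainerClient

s3 = boto3.client("s3")
container = ContainerClient.from_connection_string(
    "<azure-storage-connection-string>", container_name="migrated-data")

def copy_if_missing(key: str) -> None:
    src = s3.head_object(Bucket="source-bucket", Key=key)
    blob = container.get_blob_client(key)
    try:
        if blob.get_blob_properties().size == src["ContentLength"]:
            return  # already copied in a previous run
    except ResourceNotFoundError:
        pass  # not copied yet
    body = s3.get_object(Bucket="source-bucket", Key=key)["Body"]
    blob.upload_blob(body, overwrite=True)
```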

Data consistency verification — When the “Data consistency verification” option is selected, the Copy activity performs an additional data consistency verification between the source and destination stores after the data movement. The verification includes a file size check and checksum verification for binary files, and row count verification for tabular data.

Data Consistency verification

You have two options to handle inconsistencies: you can abort on failure, or ignore and continue. In the latter case, the Copy activity continues to copy the rest of the data by skipping the inconsistent objects, and logs them if you also enable logging in the Copy activity.
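For intuition, a hand-rolled equivalent of this check for binary files compares sizes and an MD5 checksum of the source object and the destination blob; the sketch below assumes boto3 and azure-storage-blob, with placeholder names.

```python
# Sketch: verify a copied binary file by size and MD5 checksum, mirroring
# what the "Data consistency verification" option checks.
# Bucket/container names are placeholders.
import hashlib
import boto3
from azure.storage.blob import ContainerClient

s3 = boto3.client("s3")
container = ContainerClient.from_connection_string(
    "<azure-storage-connection-string>", container_name="migrated-data")

def s3_md5(body, chunk_size=8 * 1024 * 1024) -> str:
    digest = hashlib.md5()
    for chunk in iter(lambda: body.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

def blob_md5(blob_client) -> str:
    digest = hashlib.md5()
    for chunk in blob_client.download_blob().chunks():
        digest.update(chunk)
    return digest.hexdigest()

def is_consistent(key: str) -> bool:
    src = s3.get_object(Bucket="source-bucket", Key=key)
    blob = container.get_blob_client(key)
    if blob.get_blob_properties().size != src["ContentLength"]:
        return False
    return s3_md5(src["Body"]) == blob_md5(blob)
```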

Fault tolerance settings — By selecting the fault tolerance settings, you can ignore some errors that occur in the middle of the copy process, e.g. incompatible rows between the source and destination store, a file being deleted during data movement, etc.
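The custom-script analogue of fault tolerance is a per-file try/except that skips and logs failures instead of aborting the whole run; a minimal sketch (the per-file copy callable and the list of keys are assumed to come from elsewhere, e.g. the resumability sketch above):

```python
# Sketch: skip-and-log fault tolerance around a per-file copy, analogous to
# ADF's fault tolerance setting. copy_file is any per-file copy callable
# (e.g. the copy_if_missing helper sketched under Resumability above).
import logging

logging.basicConfig(filename="skipped_files.log", level=logging.WARNING)

def copy_with_fault_tolerance(file_keys, copy_file):
    skipped = []
    for key in file_keys:
        try:
            copy_file(key)
        except Exception as exc:  # e.g. a file deleted mid-migration
            logging.warning("skipped %s: %s", key, exc)
            skipped.append(key)
    return skipped
```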

Security

As per this article, 74% of data breaches start with privileged credential abuse. It is important to store the credentials in Azure Key Vault so that they are hidden from the data engineers.

Azure Key Vault
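In ADF this is typically done by referencing Key Vault secrets from the linked service. The same pattern in a script, sketched with azure-keyvault-secrets and azure-identity (the vault URL and secret names are placeholders), looks like:

```python
# Sketch: pull the S3 credentials from Azure Key Vault instead of
# hard-coding them. Vault URL and secret names are placeholders.
import boto3
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

secrets = SecretClient(vault_url="https://my-migration-kv.vault.azure.net",
                       credential=DefaultAzureCredential())

s3 = boto3.client(
    "s3",
    aws_access_key_id=secrets.get_secret("s3-access-key-id").value,
    aws_secret_access_key=secrets.get_secret("s3-secret-access-key").value,
)
```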

Preserve metadata along with data —

While copying data from source to sink, in scenarios like data lake migration, you can also choose to preserve the metadata and ACLs along with the data using the Copy activity.
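For illustration, preserving user metadata in a custom copy means carrying the S3 object's metadata onto the destination blob; the sketch below covers only user metadata (ADF's preserve option additionally handles ACLs for ADLS Gen2 targets) and uses placeholder names.

```python
# Sketch: carry the S3 object's user metadata over to the destination blob.
# (ADF's preserve option also handles ACLs for ADLS Gen2 targets.)
# Bucket/container names are placeholders.
import boto3
from azure.storage.blob import ContainerClient

s3 = boto3.client("s3")
container = ContainerClient.from_connection_string(
    "<azure-storage-connection-string>", container_name="migrated-data")

def copy_with_metadata(key: str) -> None:
    obj = s3.get_object(Bucket="source-bucket", Key=key)
    container.upload_blob(
        name=key,
        data=obj["Body"],
        metadata=obj.get("Metadata", {}),  # user-defined x-amz-meta-* values
        overwrite=True,
    )
```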

Low Cost

Since ADF provides many of these features out of the box, it is less costly compared to custom scripts/solutions that need to be developed from scratch; please refer to the ADF pricing documentation for more details. The total cost of ownership with ADF is much lower compared to a custom solution.

❑ Infrastructure setup strategies

There are multiple strategies to implement secure connectivity. In fact, security, performance and cost are the three important factors which need to be considered while selecting any of the below strategies.

It is also important to note that whenever data moves out of S3 there will be S3 data egress charges (see the AWS Pricing Calculator).
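As a rough back-of-the-envelope for a 600 TB migration (the per-GB rates below are assumptions in the ~$0.09/GB internet and ~$0.02/GB Direct Connect range; always check the calculator for current, tiered, region-specific pricing):

```python
# Rough egress-cost estimate for a 600 TB migration. The per-GB rates are
# assumptions; use the AWS Pricing Calculator for current, tiered,
# region-specific numbers.
data_gb = 600 * 1024  # 600 TB expressed in GB

internet_rate = 0.09        # assumed $/GB for S3 egress over the internet
direct_connect_rate = 0.02  # assumed $/GB for egress via Direct Connect

print(f"Internet egress:       ~${data_gb * internet_rate:,.0f}")        # ~$55,296
print(f"Direct Connect egress: ~${data_gb * direct_connect_rate:,.0f}")  # ~$12,288
```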

➊ ADF with Express Route & Direct Connect —

Main components of the setup

Direct Connect — Using AWS Direct Connect, you create a dedicated private connection between AWS and your data center, office, or colocation environment.

Express Route — Helps in secure data transfer through a private, dedicated network between Azure datacenters and infrastructure on your premises or in a colocation environment.

Connectivity between Direct Connect and Express Route (e.g. Equinix) — Companies like Equinix can help bridge the network between AWS Direct Connect and Azure ExpressRoute through their physical routers.

Pros

🗹 Helps in secure data transfer through a private, dedicated network.

🗹 This can increase bandwidth throughput and provide a more consistent network experience than internet-based connections. Performance will be better compared to a site-to-site VPN, with an uptime SLA. It is available in speeds starting at 50 Mbps and scaling up to 100 Gbps.

🗹 Egress cost via Direct Connect will be lower compared to egress cost via the internet: ~$0.02 to $0.03 per GB of data transferred out vs ~$0.09 per GB transferred out via the internet.

Direct Connect & Express Route offer higher security, reliability and speeds, with lower and more consistent latency than typical connections over the internet using HTTPS or a site-to-site VPN. Setting up such an environment takes time, depending on the location and the existing setup of the customer. Use the above strategy when the volume is significantly higher (in the petabyte range) and security and predictability are of utmost importance.

➋ Site to Site VPN —

A site-to-site VPN is set up between AWS & Azure.

🗹 Helps in data transfer through a dedicated, secure channel.

🗹 Setup time is comparatively less.

Although the maximum bandwidth is 1.25 Gbps, based on experience we have seen a consistent transfer rate of ~100 Mbps. The reliability and speed consistency are lower compared to a Direct Connect/Express Route setup, and the egress cost is higher. It provides more security compared to HTTPS data transfer over the internet. Use the above strategy when the volume is not very high (in the 300–500 TB range) and security is of utmost importance.

➌ Over the Internet with IP restriction for SHIR (Self-hosted IR) —

In this approach, the S3 bucket is restricted to accept and allow transfer of data only to the IPs of the self-hosted IR in Azure (see the bucket policy sketch below).

🗹 Setup time is comparatively less.

🗹 Performance is better than Site-to-Site VPN

Although this method is less secure compared to a site-to-site VPN or Direct Connect/Express Route, it gives better performance than a site-to-site VPN (~400–500 Mbps consistently, based on experience). The egress cost is higher compared to a Direct Connect/Express Route setup.
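The IP restriction itself is just an S3 bucket policy with an aws:SourceIp condition. A sketch applying such a policy with boto3 (the bucket name and the self-hosted IR's public IP are placeholders):

```python
# Sketch: S3 bucket policy that denies object reads from any IP other than
# the self-hosted IR's public IP. Bucket name and IP are placeholders.
import json
import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "AllowOnlySelfHostedIR",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::source-bucket/*",
        "Condition": {"NotIpAddress": {"aws:SourceIp": "203.0.113.10/32"}},
    }],
}

boto3.client("s3").put_bucket_policy(Bucket="source-bucket",
                                     Policy=json.dumps(policy))
```

For strategy ➍ below, the same kind of policy would list the published Azure IR IP ranges for the ADF region instead of a single self-hosted IR address.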

➍ Over the Internet with IP restriction for ADF IP ranges

IP restriction for ADF IP ranges

In this approach, the S3 bucket is restricted to accept and allow transfer of data only from Azure's ADF IP ranges for the region where ADF is hosted. We do not need a self-hosted IR, since Blob and S3 are not inside a VNet or VPC, so the Azure IR can be used. Although this method is less secure compared to a site-to-site VPN or Direct Connect/Express Route, it gives better performance than a site-to-site VPN (~700 Mbps consistently, based on experience).

❑ Reference

Choosing a data transfer technology — Azure Architecture Center | Microsoft Docs

Monitor data factories using Azure Monitor — Azure Data Factory | Microsoft Docs

Visually monitor Azure Data Factory — Azure Data Factory | Microsoft Docs

❑ Acknowledgement

I would like to thank Abhishek Narain for his constant guidance and help whenever needed throughout different migration journeys.

Based on our experience working with different customers, ADF has only improved with respect to reliability, performance and security, making it the tool of choice, and I hope it supports many more migrations in the future…

Till then, Happy Learning!!!

Please Note — All opinions expressed here are my personal views and not of my employer.

Thought of the moment-

Yes it’s going to be hard but it is going to be worth it….
