From 5 Jobs to 500: How Senior AWS Architects Manage Scale
Joining us today is Roman Čerešňák: an AWS Cloud Architect and AI/ML specialist who builds production-ready cloud and generative AI solutions.
Today, he breaks down how to orchestrate serverless data pipelines at enterprise scale using AWS Glue and MWAA:
• Why Glue and MWAA are a powerful combination for modern data workflows
• How to design orchestration patterns that hold up in production
• Real-world architecture decisions, trade-offs, and lessons from the field
Now over to Roman!
💡 Quick note before we start:
If you want to get hands-on with AWS skills – with none of the setup, cleanup, or extra costs – we've got you covered. Check these out!
Become an AWS pro with this course developed by AWS Certified Solutions Architects. Get hands-on with secure, resilient, high-performing, and cost-optimized architecture design.
👋 Hi there, it's Roman.
I’ve been around long enough to remember when “the cloud” was just someone else’s data center you accessed via a sketchy VPN, and “ETL” meant spending your weekend babysitting a massive SQL Server Integration Services (SSIS) package. Fast forward twenty years, and we are living in the golden age of decoupling.
In the AWS ecosystem, AWS Glue has become the de facto standard for serverless ETL. It’s powerful, scales like a beast, and handles everything from schema discovery to Spark-heavy transformations. But as your data platform grows from five jobs to five hundred, the native Glue triggers start to feel like trying to conduct a symphony orchestra with a whistle.
That’s where Amazon Managed Workflows for Apache Airflow (MWAA) comes in. If Glue is the engine, MWAA is the sophisticated cockpit. Today, we’re going beyond the “Hello World” tutorials. We’re going to discuss a high-level, production-hardened architecture for managing Glue jobs that prioritizes idempotency, dynamic generation, and observability.
The Architectural Gap: Why Native Triggers Aren’t Enough
In a simple world, a Glue trigger starts Job B when Job A finishes. But enterprise data pipelines are rarely simple. You have cross-account dependencies, S3 data arrivals that don’t follow a schedule, and the need for complex branching logic (e.g., “If 10% of records fail validation, stop the pipeline and alert Slack; otherwise, proceed to Redshift”).
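The branching rule in that example can be reduced to a small pure function. This is only a minimal sketch: the function name and the task IDs `alert_slack` and `load_redshift` are hypothetical, and it assumes the validation step reports failed and total record counts.

```python
def choose_next_task(failed_records: int, total_records: int,
                     max_failure_ratio: float = 0.10) -> str:
    """Return the task_id of the branch to follow after validation.

    Task IDs are hypothetical placeholders for this sketch.
    """
    if total_records <= 0:
        # Nothing to judge: route to the alert branch so a human looks at it.
        return "alert_slack"

    failure_ratio = failed_records / total_records

    # Stop and alert when the failure ratio reaches the threshold.
    if failure_ratio >= max_failure_ratio:
        return "alert_slack"
    return "load_redshift"
```

In Airflow, a function like this could serve as the python_callable of a BranchPythonOperator, which follows the task whose ID the callable returns.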
Airflow brings Directed Acyclic Graphs (DAGs) to the table. By using MWAA, we remove the operational overhead of managing Airflow workers and schedulers, allowing us to focus on the logic.
An “Advanced” Approach: The Metadata-Driven Factory
The biggest mistake I see “junior” senior architects make is hard-coding Glue job names into Airflow DAGs. This creates a maintenance nightmare. Instead, we should treat our ETL orchestration as a Metadata-Driven Factory.
1. The Dynamic DAG Pattern
Instead of writing 50 Python files for 50 Glue jobs, you write one “Factory” script. This script reads a configuration file (YAML or JSON) stored in an S3 bucket.
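# A conceptual snippet of a metadata-driven DAG factory
# (load_config_from_s3 is your own helper; DAG arguments are elided)
config = load_config_from_s3("s3://my-ops-bucket/glue_configs.yaml")

for job_name, params in config.items():
    with DAG(dag_id=f"etl_{job_name}", ...) as dag:
        start_task = DummyOperator(task_id='start')
    # Older Airflow versions only discover DAGs bound to module-level names
    globals()[dag.dag_id] = dag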
This approach allows your data engineers to deploy new ETL processes simply by adding a config entry (YAML or JSON) and a Spark script to S3. No DAG code changes or Airflow redeployment required.
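To make the factory idea concrete, here is a hedged sketch of the config-parsing half, kept free of Airflow imports so it runs anywhere. The `GlueDagSpec` class, the `build_specs` function, and the config keys (`glue_job_name`, `schedule`, `script_args`) are all assumptions made for illustration, not a schema from any library. It uses JSON only because the standard library can parse it.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class GlueDagSpec:
    """Everything the factory needs to build one DAG (hypothetical schema)."""
    dag_id: str
    glue_job_name: str
    schedule: Optional[str] = None
    script_args: Dict[str, str] = field(default_factory=dict)


def build_specs(raw_config: str) -> List[GlueDagSpec]:
    """Turn the raw JSON config into validated, sorted DAG specs."""
    config = json.loads(raw_config)
    specs = []
    for job_name, params in sorted(config.items()):
        # Reject names that would produce awkward or unsafe dag_ids.
        if not job_name.replace("_", "").isalnum():
            raise ValueError(f"unsafe job name in config: {job_name!r}")
        specs.append(
            GlueDagSpec(
                dag_id=f"etl_{job_name}",
                # Default the Glue job name to the config key.
                glue_job_name=params.get("glue_job_name", job_name),
                schedule=params.get("schedule"),
                script_args=params.get("script_args", {}),
            )
        )
    return specs


# Example config, as it might be stored in S3 (contents are made up).
EXAMPLE_CONFIG = """
{
  "orders": {"schedule": "@daily",
             "script_args": {"--target": "s3://my-lake/orders/"}},
  "customers": {"glue_job_name": "customers_full_load"}
}
"""

for spec in build_specs(EXAMPLE_CONFIG):
    print(spec.dag_id, "->", spec.glue_job_name)
```

Inside the DAG file, each spec would then feed one DAG whose main task is a GlueJobOperator configured from `glue_job_name` and `script_args`. Validating the config before any DAG is built means one malformed entry fails loudly instead of silently producing a broken DAG.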
2. Handling the “Data Arrival” Problem (Sensors vs. Events)
One of the most elegant ways to trigger Glue via MWAA isn’t a schedule; it’s the S3KeySensor. However, polling S3 can get expensive and slow.
A more “senior” approach is to use Amazon EventBridge. When a file lands in S3, it triggers a Lambda that pokes the Airflow API (via the MWAA CLI token) to trigger a specific DAG. This transforms your batch system into a near-real-time reactive system.
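A rough sketch of that Lambda might look like the following. Treat the details as assumptions: the environment name, the prefix-to-DAG mapping, and the helper names are hypothetical; the event shape assumed is the S3 "Object Created" event delivered through EventBridge; and the `--conf` quoting is naive, so it would break on keys containing single quotes.

```python
import json
import urllib.request

MWAA_ENV_NAME = "my-mwaa-environment"  # assumption: your environment name

# Hypothetical mapping from S3 key prefix to the DAG that processes it.
PREFIX_TO_DAG = {
    "raw/orders/": "etl_orders",
    "raw/customers/": "etl_customers",
}


def dag_for_key(key):
    """Return the DAG id mapped to this object key, or None."""
    for prefix, dag_id in PREFIX_TO_DAG.items():
        if key.startswith(prefix):
            return dag_id
    return None


def build_trigger_request(hostname, cli_token, dag_id, conf):
    """Build the POST that sends an Airflow CLI command to MWAA."""
    command = f"dags trigger {dag_id} --conf '{json.dumps(conf)}'"
    return urllib.request.Request(
        url=f"https://{hostname}/aws_mwaa/cli",
        data=command.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {cli_token}",
            "Content-Type": "text/plain",
        },
        method="POST",
    )


def lambda_handler(event, context):
    detail = event["detail"]
    bucket = detail["bucket"]["name"]
    key = detail["object"]["key"]

    dag_id = dag_for_key(key)
    if dag_id is None:
        return {"triggered": False, "reason": f"no DAG mapped for {key}"}

    # Imported lazily so the helpers above can be tested without AWS access.
    import boto3

    token = boto3.client("mwaa").create_cli_token(Name=MWAA_ENV_NAME)
    request = build_trigger_request(
        token["WebServerHostname"],
        token["CliToken"],
        dag_id,
        {"bucket": bucket, "key": key},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return {"triggered": response.status == 200, "dag_id": dag_id}
```

Passing the bucket and key through `--conf` lets the DAG process exactly the file that arrived rather than re-scanning the prefix.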
Hard-Won Lessons in Security and Networking
After two decades of fixing “it works on my machine” issues, I can tell you: Networking is where most MWAA projects die.
VPC Security: Your MWAA environment and Glue jobs should reside in private subnets. Ensure you have VPC Endpoints for S3, Glue, and CloudWatch. This keeps your data traffic within the AWS backbone, reducing latency and increasing security.
IAM Least Privilege: Do not give your MWAA execution role AdministratorAccess. It needs specific Glue permissions such as glue:StartJobRun (plus glue:GetJobRun to poll status), scoped to the jobs it orchestrates. The airflow:CreateCliToken permission belongs on whatever calls the Airflow CLI from outside, such as a triggering Lambda. Similarly, your Glue jobs need a separate role that only touches the specific S3 buckets they need.
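As an illustration of scoping Glue permissions to named jobs, the sketch below generates a policy document. The function name, Sid, Region, account ID, and job names are made up, and the exact action list is an assumption: depending on your Airflow provider version, the operator may need more or fewer Glue actions.

```python
import json


def glue_run_policy(region, account_id, job_names):
    """Build an IAM policy allowing runs of only the named Glue jobs."""
    resources = [
        f"arn:aws:glue:{region}:{account_id}:job/{name}"
        for name in sorted(job_names)
    ]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "RunSpecificGlueJobs",
                "Effect": "Allow",
                # Assumed minimal set for starting and polling job runs.
                "Action": [
                    "glue:GetJob",
                    "glue:StartJobRun",
                    "glue:GetJobRun",
                    "glue:GetJobRuns",
                    "glue:BatchStopJobRun",
                ],
                # No wildcards: one ARN per orchestrated job.
                "Resource": resources,
            }
        ],
    }


policy = glue_run_policy("eu-west-1", "123456789012", ["orders", "customers"])
print(json.dumps(policy, indent=2))
```

Generating the policy from the same job list that drives your DAGs keeps permissions and orchestration in step as jobs are added or retired.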
Observability: Beyond “Success” or “Failure”
In a production environment, knowing a job failed is easy. Knowing why it failed without clicking through five layers of the AWS Console is the goal.
Custom Callbacks
Use Airflow’s on_failure_callback. When a Glue job fails, the callback can scrape the last 20 lines of the CloudWatch Log Stream and post them directly to a Microsoft Teams or Slack channel. This cuts your Mean Time to Recovery (MTTR) significantly.
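A callback along those lines could be sketched as follows. Several things here are assumptions: the log group is Glue's default error log group with the stream named after the job run ID, the webhook URL is a placeholder, and `run_id_lookup` is a hypothetical function you would supply to find the Glue run ID (for example from XCom or the Glue API).

```python
import json
import urllib.request

GLUE_ERROR_LOG_GROUP = "/aws-glue/jobs/error"  # assumed default log group
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # placeholder


def tail_glue_log(logs_client, run_id, lines=20):
    """Fetch the most recent log lines for one Glue job run."""
    response = logs_client.get_log_events(
        logGroupName=GLUE_ERROR_LOG_GROUP,
        logStreamName=run_id,
        limit=lines,
        startFromHead=False,  # read from the end of the stream
    )
    return [event["message"].rstrip() for event in response["events"]]


def build_alert(dag_id, task_id, log_lines):
    """Format the chat message payload."""
    body = "\n".join(log_lines) or "(no log lines found)"
    return {"text": f"{dag_id}.{task_id} failed\n{body}"}


def post_to_slack(payload):
    """Send the payload to an incoming webhook."""
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(request, timeout=10)


def make_failure_callback(logs_client, run_id_lookup, post=post_to_slack):
    """Return a function suitable for Airflow's on_failure_callback."""
    def on_failure(context):
        ti = context["task_instance"]
        run_id = run_id_lookup(context)  # hypothetical: you supply this
        log_lines = tail_glue_log(logs_client, run_id)
        post(build_alert(ti.dag_id, ti.task_id, log_lines))
    return on_failure
```

In a DAG, you would create the callback once with a boto3 "logs" client and pass it as on_failure_callback on the Glue task. Injecting the client and the post function keeps the callback testable without AWS or Slack.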
Resource Optimization
Glue is expensive if misconfigured. I always recommend implementing a “Cost-Monitor” task in your DAG. After the GlueJobOperator completes, a downstream task can query the GetJobRun API to retrieve the run's execution time and worker configuration, from which you can derive the DPU-Hours consumed.
We can estimate the cost of a single run using:
Cost = (Number of Workers × DPUs per Worker) × (Duration in seconds / 3600) × Price per DPU-Hour
Logging this to a dashboard allows you to spot “greedy” scripts before they blow your quarterly budget.
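Here is a small sketch of that calculation, operating on the JobRun structure returned by GetJobRun. The price constant is an assumption (check current pricing for your Region), the function name is hypothetical, and the estimate ignores billing minimums and auto scaling.

```python
# DPUs provided by each worker of the given Glue worker type.
DPU_PER_WORKER = {"G.025X": 0.25, "G.1X": 1, "G.2X": 2, "G.4X": 4, "G.8X": 8}

# Assumption: example price only; look up the rate for your Region.
PRICE_PER_DPU_HOUR_USD = 0.44


def estimate_run_cost(job_run, price_per_dpu_hour=PRICE_PER_DPU_HOUR_USD):
    """Estimate DPU-hours and cost from a Glue JobRun dictionary."""
    total_dpus = job_run["NumberOfWorkers"] * DPU_PER_WORKER[job_run["WorkerType"]]
    # ExecutionTime is reported in seconds.
    dpu_hours = total_dpus * job_run["ExecutionTime"] / 3600
    return {
        "dpu_hours": dpu_hours,
        "cost_usd": dpu_hours * price_per_dpu_hour,
    }


# Example: 10 G.2X workers running for 30 minutes.
example_run = {"NumberOfWorkers": 10, "WorkerType": "G.2X", "ExecutionTime": 1800}
print(estimate_run_cost(example_run))
```

In the Cost-Monitor task, the input would come from the "JobRun" key of the boto3 Glue client's get_job_run response for the run that just finished.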
The “Senior” Reality Check: When to Skip Airflow?
Despite my love for MWAA, it isn’t always the answer. If you are running simple, independent jobs with no complex dependencies, Glue Workflows or even Step Functions might be better.
Step Functions is particularly potent for shorter, high-volume orchestrations because it is state-machine based and has native service integration patterns, such as “Run a Job” and “Wait for Task Token”, that are more cost-effective than an Airflow worker sitting idle.
However, if you need cross-platform orchestration (e.g., run a Glue job, then refresh a Tableau extract, then trigger a dbt Cloud run), MWAA is the undisputed king.
Closing Thoughts
Managing AWS Glue with MWAA is about moving from “managing scripts” to “managing a data product.” By using dynamic DAG generation, event-driven triggers, and rigorous cost monitoring, you build a system that is robust, scalable, and - most importantly - easy to hand over to an operations team.
Remember: The best code is the code that handles its own failures and tells you exactly what happened when it did.