AWS Cross-Account Authentication in a Data Pipeline
This is the third article in our Behind The Build series, where we take you inside real-world projects and the decisions that shape them.
If you haven’t read the first two yet, start here:
🔹 Debating the Data Stack: Why We Chose SQLMesh Over dbt + DuckDB
🔹 DuckDB Meets SQLMesh: Building a Lean Data Architecture for the UN
Now, in Part 3: Optimizing Data for the World’s Leading Intergovernmental Organization, we’re tackling one of the trickiest pieces of modern infrastructure—secure, cross-account AWS authentication—and how we solved it without hardcoded credentials or vendor lock-in.
Building a secure, scalable cloud data pipeline means working through some genuinely tricky problems. One challenge we faced while building a data pipeline for a non-profit client was enabling secure access across AWS accounts. We needed a solution that was not only secure but also streamlined, easy to manage, and scalable for future use.
In this article, we’ll walk you through how we solved the problem of cross-account authentication between AWS accounts. We’ll explain not just what we did, but also why we made certain choices and how those choices led to an optimal solution. You’ll also learn about what worked, what didn’t, and how AWS tools helped us implement an efficient, secure setup.
The Problem: Secure Cross-Account Access
In our setup, the data pipeline runs in Account A but needs to pull data from resources like S3 buckets that live in Account B. This presents a challenge: how can the pipeline in Account A securely access resources in Account B without:
Hardcoding credentials (which is risky and insecure)
Writing complex custom authentication logic
Making the onboarding process difficult for other applications or services that might need similar access
A Simple Analogy: Guest Badges for Secure Access
To understand cross-account role assumption, think of Account B as a secure office building. You don’t work there full-time, but you visit often to access certain rooms.
Instead of giving you a permanent key (which could be lost, stolen, or misused), they issue you a guest badge every time you check in. This badge gives you access only to the rooms you need and automatically expires after a certain time.
That’s exactly how AWS STS (Security Token Service) works. When the pipeline in Account A assumes a role in Account B, AWS hands it a temporary set of credentials (like the guest badge). These credentials work only for a specific set of permissions and expire after a short time.
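To make the badge metaphor concrete, here is a minimal sketch of what that exchange looks like with boto3 (the role ARN is a hypothetical placeholder): the response contains a short-lived key pair and session token plus an explicit expiration time.

import boto3

sts = boto3.client("sts")

# Ask STS for a "guest badge": temporary credentials for a role in Account B
response = sts.assume_role(
    RoleArn="arn:aws:iam::222222222222:role/pipeline-s3-reader",  # hypothetical
    RoleSessionName="pipeline-demo",
    DurationSeconds=3600,  # the badge expires after one hour
)

creds = response["Credentials"]
# creds holds AccessKeyId, SecretAccessKey, SessionToken and an Expiration timestamp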
Our First Attempt
We started with a more direct setup: explicitly calling sts:AssumeRole, grabbing the temporary credentials, and passing them to the S3 client by hand (a simplified sketch follows the list of drawbacks below).
It worked, but:
The client’s internal policy didn’t allow anyone to create personal AWS credentials.
The code was longer and harder to read
We had to manage expiration ourselves, since the client only issued temporary credentials that expire every hour
It was easy to mess up the environment or forget to refresh the credentials
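For illustration, our first attempt looked roughly like the sketch below (role and session names are hypothetical). The pain point is visible at the bottom: every caller had to track the expiry and rebuild the S3 client once the hour was up.

from datetime import datetime, timezone

import boto3

ROLE_ARN = "arn:aws:iam::222222222222:role/pipeline-s3-reader"  # hypothetical role in Account B

def fresh_s3_client():
    # Explicitly assume the role in Account B and wire the temporary
    # credentials into a new S3 client by hand
    creds = boto3.client("sts").assume_role(
        RoleArn=ROLE_ARN,
        RoleSessionName="data-pipeline",
    )["Credentials"]
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    return s3, creds["Expiration"]

s3, expires_at = fresh_s3_client()

# ...and before every batch of S3 calls we had to remember to check the clock
if expires_at <= datetime.now(timezone.utc):
    s3, expires_at = fresh_s3_client()

It works, but the refresh logic leaks into every piece of code that touches S3, which is exactly the plumbing we wanted the SDK to own.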
One option we considered was using IAM User Access Keys, which provide permanent credentials tied to a specific user. However, these keys come with several drawbacks: they don’t expire (making them risky if leaked), require manual rotation, and make it difficult to trace actions to their source since everything just logs under the same user. While access keys can work for quick fixes, they’re considered legacy for automation purposes.
We also looked into AWS SSO (Single Sign-On), which is excellent for developers logging into the AWS console but less ideal for headless applications like pipelines. SSO requires human interaction to log in and makes token refresh difficult to automate, so it’s better suited to CLI workflows than production pipelines. While we do use SSO for local testing, it wasn’t practical for our automated production setup.
Another option we explored was EC2 or ECS Instance Profiles, which are useful when both your compute and resources reside in the same AWS account. However, since we needed cross-account access, these profiles didn’t fully address our use case. Even with instance profiles, we’d still need to implement assume-role logic to handle multiple accounts. Additionally, they tie your identity directly to the compute layer, and we preferred a more loosely coupled approach that keeps permissions and compute management separate. Ultimately, IAM role assumption offered the most secure, scalable, and automated solution for our cross-account pipeline.
In the end, we switched from stick shift to automatic: same destination, far less hassle. We needed a solution that was secure, automated, and easy to manage, and AWS already provides a pattern designed for exactly this: Cross-Account IAM Role Assumption. We’ll explain how it works below.
The Solution We Implemented
We used AWS’s cross-account authentication pattern, which follows this structure:
Account A: The account where the data pipeline runs
Account B: The account where the S3 bucket and other resources live
IAM Role in Account B: This role has permissions to access the resources in Account B and a trust policy that allows the pipeline in Account A to assume the role
To enable secure communication between the two, we created that IAM role in Account B: it carries specific permissions to access the S3 bucket, and its trust policy names the pipeline’s role in Account A as a principal that is allowed to assume it.
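As a rough sketch, such a trust policy looks something like the following (the account ID and role name are hypothetical placeholders, not our client’s actual values):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111111111111:role/data-pipeline"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The permissions attached to the same role then control what the pipeline can actually do once it has assumed it, which is covered in the checklist at the end of this article.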
This setup gives us several benefits:
No long-term credentials to manage: We don’t have to store or rotate permanent keys
Scoped, auditable permissions: The assumed role can only perform specific actions (e.g., read from the S3 bucket), and AWS logs all actions in CloudTrail
Automatic credential refresh: AWS SDKs handle the temporary credentials behind the scenes, so we don’t have to worry about expiration or manual refresh
How AWS Credentials Work Behind the Scenes
AWS SDKs don’t just guess which credentials to use — they follow a specific order called the default credential provider chain. This is a checklist that AWS goes through to find valid credentials:
Explicit credentials in the code (this is rare and not recommended).
Environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY)
Web identity tokens from identity providers like Kubernetes (this is how IRSA works).
AWS SSO tokens (for humans logging in manually).
Shared credentials files (~/.aws/credentials)
EC2 instance roles or container metadata (common in automated setups).
In our case, we use web identity tokens from Kubernetes. AWS injects a token and role ARN into the container’s environment, and the AWS SDK automatically assumes the role for us.
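To illustrate, in a pod configured with IRSA the environment contains AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE, injected by EKS, and the SDK resolves them through the provider chain with no extra code on our side. A quick sanity check we like to run (a throwaway sketch, not part of the pipeline itself) is to ask STS which identity the SDK ended up with:

import os

import boto3

# Injected by EKS when the pod's service account is annotated with a role ARN
print(os.environ.get("AWS_ROLE_ARN"))
print(os.environ.get("AWS_WEB_IDENTITY_TOKEN_FILE"))

# No credentials are passed explicitly: boto3 walks the default provider
# chain, finds the web identity token, and assumes the role for us
identity = boto3.client("sts").get_caller_identity()
print(identity["Arn"])  # should show the assumed role, not a long-lived user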
🔐 Security Wins
No long-term credentials – STS tokens expire automatically
Scoped access – the assumed role can only do what we allow
Audit trail – CloudTrail logs all role assumptions and actions
Revocation is easy – just update the role’s trust policy
And since credentials are never written to disk, there's little risk of accidental leakage.
🧩 DuckDB and S3
We needed DuckDB to read directly from S3. To do that, we configured a secret inside DuckDB using the active AWS credentials:
def setup_s3_secret(self):
    # Grab the temporary credentials that boto3 resolved via the provider chain
    creds = self.session.get_credentials().get_frozen_credentials()
    # Re-create the secret so DuckDB always holds the latest, unexpired token
    self.con.sql("DROP SECRET IF EXISTS my_s3_secret")
    self.con.sql(f"""
        CREATE PERSISTENT SECRET my_s3_secret (
            TYPE S3,
            KEY_ID '{creds.access_key}',
            SECRET '{creds.secret_key}',
            SESSION_TOKEN '{creds.token}',
            REGION '{self.session.region_name}'
        );
    """)
This lets us run SQL like the following (DuckDB matches the s3:// path against the stored secret automatically, so the query does not need to name the secret):

SELECT * FROM read_csv_auto('s3://name-data-pipeline/file.csv');
🛠️ Quick Setup Checklist
Create an IAM Role in Account B with a trust policy that allows Account A to assume it
Attach a policy to that role with S3 permissions like:
{ "Effect": "Allow", "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"], "Resource": ["arn:aws:s3:::name-data-pipeline/*"] }
Configure your environment (e.g., in Kubernetes with IRSA or via CI/CD env vars)
Let the AWS SDK do the rest
Final Takeaway
Cross-account IAM role assumption isn’t just secure — it’s efficient and scalable. By leveraging AWS’s built-in tools, we avoided the risks of long-term credentials, eliminated manual refresh logic, and gained fine-grained control over access.
If you’re new to AWS but experienced with engineering, think of this solution as learning how to “borrow” access when you need it without the risk of owning and managing permanent keys. AWS takes care of the locks and keys, and you can focus on building your pipeline.
This is the authentication approach that powers our data pipeline — and we highly recommend it for any multi-account AWS setup.
Key Terms
AWS Cross-Account Authentication: A method that allows applications or services in one AWS account to securely access resources in another AWS account without relying on hardcoded credentials or manual configuration.
IAM (Identity and Access Management): AWS’s service for managing access to AWS resources. It lets you create users, groups, and roles and define permissions using policies.
IAM Role Assumption: A process where an AWS service or application temporarily "borrows" permissions from a specific role in another account to access its resources.
STS (Security Token Service): A service that provides short-term, temporary credentials for accessing AWS resources. These credentials automatically expire after a set time.
AWS Credential Provider Chain: The order in which AWS searches for valid credentials, checking sources like environment variables, configuration files, or roles assigned to compute services.
IRSA (IAM Roles for Service Accounts): A feature that allows Kubernetes pods to automatically assume AWS roles, providing temporary credentials without storing secrets inside the pods.
Trust Policy: A set of rules attached to an AWS IAM role that defines who (e.g., other AWS accounts or services) is allowed to assume that role.
Ready to Build Your Own Open-Source Data Stack?
If your startup or organization is considering a modern data platform using open-source tools, we can help. Our team at Database Tycoon specializes in cost-effective, serverless architectures built on best-in-class technologies.