Many of our customers that use Amazon EMR as their big data platform need to integrate with their existing Microsoft Active Directory (AD) for user authentication. This integration requires the Kerberos daemon of Amazon EMR to establish a trusted connection with an AD domain, which involves a lot of moving pieces and can be difficult to get right.

This post describes what a one-way trust with Active Directory means and how the commands used to set it up work. We describe how DNS names, Kerberos realms, and AD domains differ, and what those differences mean for the Amazon EMR security configuration and the cluster’s one-way trust settings. We also discuss why you can’t use AWS Managed Microsoft AD for the Amazon EMR key distribution center (KDC) trust, and must instead use an existing or new AD server.

AWS has already put out documentation and blog posts that cover some of this area, and this post is meant to complement them rather than replace them. As such, we recommend reading the following before continuing:

Is this the right architecture for you?

There are several options for authenticating Amazon EMR with Kerberos, and choosing the right approach will depend on your use case. For more information, refer to Kerberos architecture options. In this post, we cover the case where your users are already in your AD domain, and you want to use those identities to authorize actions in Amazon EMR. If you don’t already have an AD, or you want to extend an existing non-AD Kerberos realm, then one of the other options will be better suited. Don’t make more work for yourself than you have to!

Connecting to Active Directory

There are two goals when connecting Amazon EMR to AD:

  • Establish a one-way trust from Kerberos to AD so that users working in Amazon EMR can use their AD credentials to access services. This requires ports 88 and 464 (for Kerberos) and 389 (for LDAP) on the AD server to be accessible from the EMR cluster.
  • Connect the EMR cluster with AD so that the servers making up the EMR cluster can be registered in the AD domain. This requires a user configured in the AD domain with sufficient privileges to make those registrations—typically called a bind user.
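
Before going further, it can help to confirm that the AD server is reachable on the required ports from a machine in the same subnet as the cluster. The following is a minimal sketch, not part of the official setup: the function name is our own, it assumes LDAP is on its standard port 389, and it only tests TCP (Kerberos can also use UDP, which this check doesn't cover).

```python
import socket

# Kerberos uses 88 and 464; LDAP's standard port is 389.
REQUIRED_TCP_PORTS = (88, 464, 389)


def check_tcp_ports(host, ports=REQUIRED_TCP_PORTS, timeout=3.0):
    """Return {port: bool}, where True means a TCP connection succeeded."""
    results = {}
    for port in ports:
        try:
            # create_connection resolves the host and performs the TCP handshake.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts, and DNS failures.
            results[port] = False
    return results
```

For example, calling check_tcp_ports("ad1.corp.mycompany.com") (a hypothetical AD server name) from an instance in the cluster’s subnet returns a dictionary that shows which ports are blocked by security groups or firewalls.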

Types of Active Directory

Taken together, the preceding requirements mean that only an AD server that you manage yourself can meet both goals. You can’t use AWS Managed Microsoft AD or Microsoft Azure AD. This leaves two recommended options.

Firstly, you can connect Amazon EMR directly to an existing AD server (whether on premises or in the cloud), as shown in the following diagram.

Alternatively, you can build a new AD server on Amazon Elastic Compute Cloud (Amazon EC2), add it to the existing corporate AD forest, and have Amazon EMR connect to this new cloud-based AD server (as shown in the following diagram).

Domains vs. realms

It is critical to understand which terms apply to which technology to avoid confusion.

Active Directory manages domains, which contain many registered entities like users and computers. They’re typically written in lowercase (for example, corp.mycompany.com) and look very similar to internet domains. They don’t necessarily have to resolve to an IP address, but they typically do.

Kerberos uses realms for a similar concept, and its registered entities are referred to as principals. Realms are typically written in uppercase, and often use a naming style that looks like an internet domain except for the case (for example, EMR.MYCOMPANY.COM). They don’t need to resolve to an IP address and typically do not.

When configuring the realm for an EMR cluster’s Kerberos daemon, the name is completely arbitrary. It serves as a namespace for the principals defined within it, but it doesn’t have to match the domain names of the instances in your EMR cluster. For example, if the private DNS names of your EC2 machines use the default ec2.internal domain, your EMR realm name doesn’t have to be ec2.internal; it could be mykerberosrealm or anything you like.

Bind user

A bind user has to be created in Active Directory with the permission to register (“bind”) EMR computers into the AD domain. Amazon EMR registers these computers under CN=Computers.

The Kerberos principals that Amazon EMR creates for use with the components of Hadoop are strictly local to the Amazon EMR Kerberos installation—they’re not registered in the AD domain.

DNS

The EMR Kerberos daemon has to be able to resolve the DNS name of your AD server in order to establish the trust between them. If those DNS domains are managed in your corporate DNS servers, you need Amazon Route 53 forwarders to your corporate DNS servers for Amazon EMR to resolve them. Because the trust is only one-way, the AD server doesn’t need to be able to resolve the internal DNS names of the EMR cluster nodes.
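
A quick way to test name resolution is to run a lookup from an instance in the cluster’s VPC. The sketch below uses only the Python standard library; the function name is our own, and any AD host name you pass in is your own value rather than something defined by Amazon EMR.

```python
import socket


def resolve_host(name):
    """Return the sorted IP addresses that `name` resolves to, or [] on failure."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        # The resolver could not find the name (for example, no DNS forwarder).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

An empty list for your AD server’s name suggests that the VPC’s DNS settings or forwarders need attention before you try to establish the trust.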

Establish trust

The trust from Amazon EMR to AD only needs to be one-way (Amazon EMR trusts AD), not two-way (they trust each other). To establish it, run the ksetup and netdom utilities from the command line on the AD machine. We recommend an AES-256 cipher for the domain’s encryption type attribute (SetEncTypeAttr). The following code uses the example of the EMR Kerberos realm EMR.MYCOMPANY.COM and the AD domain corp.mycompany.com:

C:\Users\Administrator> ksetup /addkdc EMR.MYCOMPANY.COM
C:\Users\Administrator> netdom trust EMR.MYCOMPANY.COM /Domain:corp.mycompany.com /add /realm /passwordt:MyVeryStrongPassword
C:\Users\Administrator> ksetup /SetEncTypeAttr EMR.MYCOMPANY.COM AES256-CTS-HMAC-SHA1-96

In the first ksetup command, you don’t need to provide the fully qualified domain name of the cluster KDC as a final argument. That is only needed for a two-way trust, and is optional in a one-way trust.
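
Because the realm name has to be typed identically in all three commands (and again when you start the cluster), one option is to generate the commands from a single value. The following sketch is only a convenience of our own, not an AWS or Microsoft tool; the function name and the password placeholder are hypothetical.

```python
def trust_commands(emr_realm, ad_domain, password_placeholder="<trust-password>"):
    """Render the three one-way trust commands with a consistent realm name."""
    return [
        f"ksetup /addkdc {emr_realm}",
        f"netdom trust {emr_realm} /Domain:{ad_domain} /add /realm "
        f"/passwordt:{password_placeholder}",
        f"ksetup /SetEncTypeAttr {emr_realm} AES256-CTS-HMAC-SHA1-96",
    ]


# Print the commands for the example realm and domain used in this post.
for command in trust_commands("EMR.MYCOMPANY.COM", "corp.mycompany.com"):
    print(command)
```

The placeholder is left in the output on purpose so that the real trust password is typed on the AD machine rather than stored in a script.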

Amazon EMR security configuration

Before you start the EMR cluster, you must create a security configuration that contains the details of the AD server and domain to which you’re connecting. The following screenshot shows an example of that configuration.

These fields are case-sensitive, so make sure that you enter everything correctly. Note that the domain and realm in this configuration both refer to the AD server, and not to the EMR cluster’s Kerberos realm. AD acts as a Kerberos KDC as well as a directory, which is why both fields are configured here.
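
If you prefer to create the security configuration from JSON instead of the console, the document can be built programmatically. The sketch below reflects the key names as we understand the EMR security configuration format, so verify them against the current Amazon EMR documentation. The AD server host name ad1.corp.mycompany.com and the 24-hour ticket lifetime are assumptions for illustration.

```python
import json


def build_security_configuration(ad_realm, ad_domain, ad_server, ticket_lifetime_hours=24):
    """Build a cluster-dedicated KDC configuration with a cross-realm trust to AD.

    Realm, Domain, AdminServer, and KdcServer all describe the AD side,
    not the EMR cluster's own Kerberos realm.
    """
    return {
        "AuthenticationConfiguration": {
            "KerberosConfiguration": {
                "Provider": "ClusterDedicatedKdc",
                "ClusterDedicatedKdcConfiguration": {
                    "TicketLifetimeInHours": ticket_lifetime_hours,
                    "CrossRealmTrustConfiguration": {
                        "Realm": ad_realm,         # AD's Kerberos realm (uppercase)
                        "Domain": ad_domain,       # AD domain (lowercase)
                        "AdminServer": ad_server,  # hypothetical AD host name
                        "KdcServer": ad_server,
                    },
                },
            }
        }
    }


config = build_security_configuration(
    "CORP.MYCOMPANY.COM", "corp.mycompany.com", "ad1.corp.mycompany.com"
)
print(json.dumps(config, indent=2))
```

The printed JSON can be saved to a file and supplied when you create the security configuration with the AWS CLI or an SDK.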

Amazon EMR security options when starting a cluster

When you start an EMR cluster, you configure the security options as shown in the following screenshot.

Here you use the security configuration you just created, and specify the details of the EMR Kerberos realm as well as the parameters needed to establish the trust with the AD domain in the security configuration. You provide information for the following fields:

  • Realm – The Kerberos realm you specify is entirely up to you, but must be the same as the one you used when establishing the trust on the AD machine.
  • KDC admin password – This isn’t used anywhere else in the cluster configuration, so you can set it to something unique and secure specifically for this cluster. It’s only needed for any future management of the cluster-dedicated KDC.
  • Cross-realm trust principal password – This is the password you set with the netdom command.
  • Active Directory domain join user – This is the bind user your AD admin created.
  • Active Directory domain join password – This is the password for the bind user from AD.
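
The same five fields can be supplied when you launch a cluster through the API instead of the console. The following sketch shows how they might map to the Kerberos attributes structure as we understand it from the Amazon EMR API; the function name, the bind user name, and the password placeholders are hypothetical, and in practice the passwords would come from a secrets store rather than source code.

```python
def build_kerberos_attributes(realm, kdc_admin_password, trust_password,
                              join_user, join_password):
    """Map the cluster security options to EMR Kerberos attribute names."""
    return {
        # The EMR realm: must match the realm used in ksetup and netdom.
        "Realm": realm,
        # Only used for managing the cluster-dedicated KDC.
        "KdcAdminPassword": kdc_admin_password,
        # The password set with /passwordt in the netdom command.
        "CrossRealmTrustPrincipalPassword": trust_password,
        # The bind user and its password from AD.
        "ADDomainJoinUser": join_user,
        "ADDomainJoinPassword": join_password,
    }


attributes = build_kerberos_attributes(
    realm="EMR.MYCOMPANY.COM",
    kdc_admin_password="<kdc-admin-password>",
    trust_password="<trust-password>",
    join_user="emr-bind-user",
    join_password="<bind-user-password>",
)
```

Note that the realm here is the EMR cluster’s realm, whereas the realm in the security configuration is the AD server’s realm.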

Clean up

When you’re done testing this solution, remember to clean up the resources. If you used the CloudFormation templates to create the resources, then use the AWS CloudFormation console to delete the stack. Alternatively, you can use the AWS Command Line Interface (AWS CLI) or SDKs. For instructions, refer to Deleting a stack. Deleting a stack also deletes the resources created by that stack.

If one of your stacks doesn’t delete, make sure that there are no dependencies on the resources created by that stack. For example, if you deployed an Amazon VPC using AWS CloudFormation and then deployed a domain controller into that VPC using a different CloudFormation stack, you must first delete the domain controller stack before you can delete the VPC stack.
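
If you have several interdependent stacks, you can work out a safe deletion order before you start. The sketch below is one way to do that with the Python standard library (version 3.9 or later); the stack names are hypothetical, and the dependency map is something you would write by hand from your own deployment.

```python
from graphlib import TopologicalSorter


def deletion_order(depends_on):
    """Return stack names in a safe deletion order.

    `depends_on` maps each stack to the set of stacks it depends on.
    Stacks are created dependencies-first, so they are deleted in reverse.
    """
    creation_order = list(TopologicalSorter(depends_on).static_order())
    return list(reversed(creation_order))


# The domain controller stack was deployed into the VPC stack's VPC.
stacks = {"domain-controller-stack": {"vpc-stack"}}
print(deletion_order(stacks))
```

In this example, the domain controller stack is listed before the VPC stack, which matches the order described above.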

Conclusion

The steps in this post walked you through creating the trust between Amazon EMR’s Kerberos daemon and an Active Directory domain. We hope that this has demystified the process and will make it easier for you to create secure, AD-integrated EMR clusters in the future.


About the Authors

Anandkumar Kaliaperumal is a Senior Data Architect with the Professional Services SDT, where he focuses on helping customers with their Hadoop and data lake migrations. He lives with his growing family in Dallas.

Bharath Kumar Boggarapu is a Data Architect at AWS Professional Services with expertise in big data technologies. He is passionate about helping customers build performant and robust data-driven solutions and realize their data and analytics potential. His areas of interests are open-source frameworks, automation, and data architecting. In his free time, he loves to spend time with family, play tennis, and travel.

Oliver Meyn was a Senior Data Architect in the Canadian Professional Services Shared Delivery Team, where he helped customers with migrating their data and workflows to AWS. He lives in Toronto with his family and far too many bicycles.