S3 credentials in Cloudera

jboothomas
3 min read · Aug 25, 2021

Part 2 of the series on Cloudera S3 access to a Pure Storage FlashBlade.

We cannot always rely on the credentials added under Administration > External Accounts to access our S3 storage. Here are the other methods available.

Link to Part 1: GUI configuration of Cloudera v7 to use on premise S3 storage

Link to Part 3: Cloudera S3 access to a Pure Storage FlashBlade covering Spark, Hive and distcp

Via a Credential Provider

With this method we create a credential store to protect the S3 access and secret keys. From one of the Cloudera nodes, run the following commands:

sudo -u hdfs hadoop credential create fs.s3a.access.key -value PSFBSAZRGKFPFFINEPNBCOHBNAJKGPLHHNIEJCNJP -provider jceks://hdfs/user/hdfs/home/keystores/fb01.jceks
sudo -u hdfs hadoop credential create fs.s3a.secret.key -value 5C4F948D77bf6a90/1cbf1A4BFA316222b48aOPPP -provider jceks://hdfs/user/hdfs/home/keystores/fb01.jceks
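Passing -value on the command line leaves the keys in the shell history and visible in the process list. As a minimal sketch, assuming the same keystore path as above, the -value flag can be left out so that hadoop credential create prompts for each key, and hadoop credential list then shows which aliases are stored. The PROVIDER variable is only a convenience introduced here, and the guard around the commands is an assumption that keeps the snippet harmless on a machine without the Hadoop CLI.

```shell
# Keystore location used in this article (PROVIDER is an illustrative variable name).
PROVIDER="jceks://hdfs/user/hdfs/home/keystores/fb01.jceks"

# Only run on a node where the Hadoop CLI is installed.
if command -v hadoop >/dev/null 2>&1; then
  # Without -value, the command prompts for the credential interactively.
  sudo -u hdfs hadoop credential create fs.s3a.access.key -provider "$PROVIDER"
  sudo -u hdfs hadoop credential create fs.s3a.secret.key -provider "$PROVIDER"

  # List the aliases stored in the keystore (the values are not printed).
  sudo -u hdfs hadoop credential list -provider "$PROVIDER"
fi
```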

The credential file is created with read and write permissions for my hdfs user only:

$ sudo -u hdfs hadoop fs -ls /user/hdfs/home/keystores
Found 2 items
-rw------- 3 hdfs supergroup 1040 2021-04-27 18:29 /user/hdfs/home/keystores/fb01.jceks

Then add the hadoop.security.credential.provider.path setting, with the value jceks://hdfs/user/hdfs/home/keystores/fb01.jceks, to the HDFS configuration under Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml.

Remember to restart the HDFS service so that the change propagates across the Cloudera cluster.

Checking our core-site.xml on a Cloudera node shows that the access and secret keys are no longer present:

$ tail -17 /etc/hadoop/conf.cloudera.hdfs/core-site.xml
  <property>
    <name>fs.s3a.connection.ssl.enabled</name>
    <value>false</value>
  </property>
  <property>
    <name>fs.s3a.path.style.access</name>
    <value>true</value>
  </property>
  <property>
    <name>fs.s3a.endpoint</name>
    <value>192.168.4.100</value>
  </property>
  <property>
    <name>hadoop.security.credential.provider.path</name>
    <value>jceks://hdfs/user/hdfs/home/keystores/fb01.jceks</value>
  </property>
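Rather than reading the tail of the file by eye, the same check can be scripted. This is a rough sketch only: no_plaintext_s3_keys is a hypothetical helper name, and it matches just the two global property names used in this article, so any other key property would go unnoticed.

```shell
# Hypothetical helper: succeeds when the given file contains neither the
# fs.s3a.access.key nor the fs.s3a.secret.key property name.
no_plaintext_s3_keys() {
  ! grep -qE '<name>fs\.s3a\.(access|secret)\.key</name>' "$1"
}

# Path taken from the article; the check is skipped if the file is not readable.
CORE_SITE=/etc/hadoop/conf.cloudera.hdfs/core-site.xml
if [ -r "$CORE_SITE" ]; then
  if no_plaintext_s3_keys "$CORE_SITE"; then
    echo "no S3 keys in $CORE_SITE"
  else
    echo "S3 keys found in clear text in $CORE_SITE"
  fi
fi
```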

Running a list on the S3 bucket confirms the credentials are valid and in use:

$ sudo -u hdfs hadoop fs -ls s3a://clouderabucket/
21/04/27 18:13:05 INFO s3a.S3AFileSystem: S3Guard is disabled on this bucket: clouderabucket
Found 1 items
-rw-rw-rw- 1 hdfs hdfs 2728058790 2021-04-16 11:19 s3a://clouderabucket/yellow_tripdata_2010-01.csv

For security reasons, a cluster-wide default credential provider may not be the best option. We can remove the hadoop.security.credential.provider.path setting from the HDFS configuration under Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml and apply the change to the HDFS cluster. Once that is done, we can point to a credential file directly on the command line:

$ sudo -u hdfs hadoop fs -D hadoop.security.credential.provider.path=jceks://hdfs/user/hdfs/home/keystores/fb01.jceks -ls s3a://clouderabucket/
21/04/27 18:16:16 INFO s3a.S3AFileSystem: S3Guard is disabled on this bucket: clouderabucket
Found 1 items
-rw-rw-rw- 1 hdfs hdfs 2728058790 2021-04-16 11:19 s3a://clouderabucket/yellow_tripdata_2010-01.csv
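Typing the full -D option on every call gets tedious. One possible shortcut, shown here as a sketch and not as anything Cloudera ships, is a small shell function that always attaches the keystore; s3a_fs and PROVIDER are names made up for this example.

```shell
# Keystore location used in this article (illustrative variable name).
PROVIDER="jceks://hdfs/user/hdfs/home/keystores/fb01.jceks"

# Hypothetical wrapper: run any "hadoop fs" subcommand as the hdfs user
# with the credential provider path passed as a generic -D option.
s3a_fs() {
  sudo -u hdfs hadoop fs \
    -D "hadoop.security.credential.provider.path=${PROVIDER}" "$@"
}

# Usage: s3a_fs -ls s3a://clouderabucket/
```

The -D option is kept ahead of the subcommand, in the same position as in the command above.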

This approach requires users who submit jobs to the Cloudera cluster to have permission to read the credential file. As created, my credential file is usable by the hdfs user only:

-rw------- 3 hdfs supergroup 1040 2021-04-27 18:29 /user/hdfs/home/keystores/fb01.jceks
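If other users need the keystore, one option is to give a dedicated group read access to it. The sketch below assumes a group called s3users, which is a hypothetical name and not part of my setup; grant_keystore_read is likewise just an illustrative function name. Members of that group would also need permission to traverse the parent directories.

```shell
KEYSTORE=/user/hdfs/home/keystores/fb01.jceks
S3_GROUP=s3users   # hypothetical group name, not from the article

# Hypothetical helper: assign the keystore to the group, then allow
# owner read/write and group read (mode 640).
grant_keystore_read() {
  sudo -u hdfs hadoop fs -chgrp "$S3_GROUP" "$KEYSTORE" &&
  sudo -u hdfs hadoop fs -chmod 640 "$KEYSTORE"
}

# Usage: grant_keystore_read
```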

Via HDFS parameters (insecure, not recommended)

We can add our S3 access key and secret key as the fs.s3a.access.key and fs.s3a.secret.key properties in our HDFS configuration, again under the Cluster-wide Advanced Configuration Snippet (Safety Valve) for core-site.xml.

The problem is that the credentials are written in clear text to core-site.xml on all the nodes:

$ tail -9 /etc/hadoop/conf.cloudera.hdfs/core-site.xml
  <property>
    <name>fs.s3a.access.key</name>
    <value>PSFBSAZRGKFPFFINEPNBCOHBNAJKGPLHHNIEJCNJP</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>5C4F948D77bf6a90/1cbf1A4BFA316222b48aOPPP</value>
  </property>

I can now run direct commands against my S3 bucket such as:

$ sudo -u hdfs hadoop fs -ls s3a://clouderabucket/
21/04/27 17:43:52 INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
Found 1 items
-rw-rw-rw- 1 hdfs hdfs 2728058790 2021-04-16 11:19 s3a://clouderabucket/yellow_tripdata_2010-01.csv
