GCP Credentials

In this lesson, we'll export our GCP credentials to S3 and then use them from PySpark.

We now have a dataset that we can use as input to a PySpark pipeline, but we don’t yet have access to the bucket on GCS from our Spark environment.

Accessing a GCP bucket

With AWS, we were able to set up programmatic access to S3 using an access key and secret key. With GCP, the process is a bit more involved, because we need to move the JSON credentials file to the driver node of the cluster in order to read and write files on GCS.
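To make the difference concrete, here is a minimal sketch of how Spark is typically configured for each platform. The s3a properties and the GCS connector properties are standard Hadoop connector settings, and they assume the hadoop-aws and GCS connector libraries are available on the cluster; the key values and keyfile path below are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gcp_creds").getOrCreate()
conf = spark.sparkContext._jsc.hadoopConfiguration()

# S3: programmatic access only needs a key pair set on the Hadoop config
conf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY>")    # placeholder
conf.set("fs.s3a.secret.key", "<AWS_SECRET_KEY>")    # placeholder

# GCS: the connector authenticates with a service-account JSON keyfile
# that must exist on the driver's local file system
conf.set("google.cloud.auth.service.account.enable", "true")
conf.set("google.cloud.auth.service.account.json.keyfile",
         "/tmp/gcp_credentials.json")                # placeholder path
```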

One of the challenges with using Spark is that you may not have SSH access to the driver node, which means we need to use persistent storage to move the credentials file to the driver machine. This approach isn't recommended for production environments; it is shown here as a proof of concept.
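One way to do this, sketched below, is to stage the JSON credentials file in an S3 bucket and then pull it onto the driver with boto3 before pointing the GCS connector at it. The bucket, key, and file names are hypothetical placeholders, and the sketch assumes the cluster's AWS credentials allow reading that bucket.

```python
import boto3
from pyspark.sql import SparkSession

# Assumed staging location for the GCP service-account file (placeholders)
CREDS_BUCKET = "my-staging-bucket"
CREDS_KEY = "credentials/gcp_credentials.json"
LOCAL_CREDS_PATH = "/tmp/gcp_credentials.json"

spark = SparkSession.builder.appName("gcp_creds").getOrCreate()

# Download the JSON credentials from S3 to the driver's local disk.
# This runs on the driver, which is where the GCS connector reads the keyfile.
s3 = boto3.client("s3")
s3.download_file(CREDS_BUCKET, CREDS_KEY, LOCAL_CREDS_PATH)

# Point the GCS connector at the downloaded keyfile
conf = spark.sparkContext._jsc.hadoopConfiguration()
conf.set("google.cloud.auth.service.account.enable", "true")
conf.set("google.cloud.auth.service.account.json.keyfile", LOCAL_CREDS_PATH)

# Spark can now read from the GCS bucket (placeholder path)
df = spark.read.csv("gs://my-gcs-bucket/data.csv", header=True)
```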

Managing credentials

The best practice for managing credentials in a production environment is to use IAM roles.
