Last updated: 19 November 2022
In this post, I am going to play around with Spark on Databricks running on AWS.
Set up a free trial Databricks account.
Created a new S3 bucket, spark-learning, in the cheapest region, Ohio (us-east-2).
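I created the bucket by hand, but the same step could be scripted. Here is a minimal sketch, assuming boto3 is installed and AWS credentials are already configured. The helper names `bucket_request` and `create_bucket` are my own for illustration, not part of any library.

```python
def bucket_request(name: str, region: str) -> dict:
    """Build the keyword arguments for the S3 create_bucket call."""
    params = {"Bucket": name}
    # us-east-1 is the default region and must not be sent as a LocationConstraint.
    if region != "us-east-1":
        params["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return params


def create_bucket(name: str, region: str) -> None:
    """Create the bucket (assumes boto3 and valid AWS credentials)."""
    import boto3  # imported lazily so bucket_request works without boto3

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**bucket_request(name, region))


# Ohio is us-east-2.
request = bucket_request("spark-learning", "us-east-2")
print(request)
```

Calling `create_bucket("spark-learning", "us-east-2")` would then actually create it. Bucket names are global, so the call fails if the name is already taken.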
# high concurrency is for a shared cluster
Cluster Mode = Standard
# Turn off Autopilot
Terminate after = 15 mins
# 10 cents an hour. Cheapest I can find.
Worker, Driver Type = m6gd.large
Num workers = 1
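For reference, the settings above could also be written down as a cluster spec instead of clicking through the UI. This is only a sketch: the field names follow the Databricks Clusters API as I understand it, the cluster name is hypothetical, and the runtime version is a placeholder you would need to fill in yourself.

```python
import json

# Sketch of a cluster spec mirroring the UI settings above.
cluster_spec = {
    "cluster_name": "spark-learning",      # hypothetical name
    "spark_version": "<runtime-version>",  # placeholder, pick a runtime from the UI
    "node_type_id": "m6gd.large",          # worker type
    "driver_node_type_id": "m6gd.large",   # driver type
    "num_workers": 1,                      # fixed size, so no "autoscale" block
    "autotermination_minutes": 15,         # terminate after 15 mins of inactivity
}

print(json.dumps(cluster_spec, indent=2))
```

Leaving out the `autoscale` block and setting `num_workers` directly is how I would express a fixed-size cluster here.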
Let’s first create a table. This can be done in the workspace console via Data -> Table -> Create table.
For data, we are using this telecommunication data from Kaggle. We are going to upload the CSV file
directly into DBFS using the Databricks table creator.
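Once the file is in DBFS, it can also be read straight from a notebook without going through the table creator. Here is a minimal sketch, assuming the CSV has a header row. `load_csv` is my own helper name and the file path in the comment is hypothetical; use whatever path the upload dialog shows you.

```python
def load_csv(spark, path: str):
    """Read a CSV with a header row into a DataFrame, letting Spark infer column types."""
    return (
        spark.read
        .option("header", "true")       # first line holds the column names
        .option("inferSchema", "true")  # extra pass over the data to guess types
        .csv(path)
    )


# In a Databricks notebook, `spark` is already defined:
# df = load_csv(spark, "dbfs:/FileStore/tables/telecom.csv")  # hypothetical path
# df.printSchema()
```

Schema inference costs an extra pass over the file, which is fine for a small dataset like this one.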
We create a Spark notebook in Databricks.
In order to tear down the resources, we need to