Databricks has shaken the data market once again. The company launched its free edition of the Databricks platform *with all the functionalities included*. It is a great resource for learning and testing, to say the least.
With that in mind, I created an end-to-end project to help you learn the fundamentals of the main resources within Databricks.
This project demonstrates a complete Extract, Transform, Load (ETL) workflow within Databricks. It integrates the OpenWeatherMap API for data retrieval and the OpenAI GPT-4o-mini model to provide personalized, weather-based dressing suggestions.
Let’s learn more about it.
The Project
The project implements a full data pipeline within Databricks, following these steps.
- Extract: Fetches current weather data for New York City via the OpenWeatherMap API [1].
- Transform: Converts UTC timestamps to New York local time and utilizes OpenAI’s [2] GPT-4o-mini to generate personalized dressing suggestions based on the temperature.
- Load: Persists the data into the Databricks Unity Catalog as both raw JSON files and a structured Delta table (Silver Layer).
- Orchestration: The notebook with this ETL code is added to a job and scheduled to run every hour in Databricks.
- Analytics: The silver layer feeds a Databricks Dashboard that displays relevant weather information alongside the LLM’s suggestions.
Here is the architecture.
Project Architecture. Image by the author.
Great. Now that we understand what we need to do, let’s move on with the how piece of this tutorial.
Note: if you don't have a Databricks account yet, go to the Databricks Free Edition page [3], click Sign up for Free Edition, and follow the on-screen prompts to get your free access.
Extract: Integrating API And Databricks
As I usually say, a data project needs data to begin, right? So our task here is integrating the OpenWeatherMap API to ingest data directly into a PySpark notebook within Databricks. This task may look complicated at first, but trust me, it is not.
On Databricks' initial page, create a new notebook using the **+ New** button, then select Notebook.
Create a new Notebook. Image by the author.
For the Extract part, we will need:
1. The API key from OpenWeatherMap.
To get that, go to the API’s signup page and complete your free registration process. Once logged in to the dashboard, click on the API Key tab, where you will be able to see it.
2. Import packages
# Imports
import requests
import json
Next, we are going to create a Python class to modularize our code and make it production-ready as well.
- This class receives the API_KEY we just created, as well as the city and country for the weather fetch.
- Returns the response in JSON format.
# Creating a class to modularize our code
class Weather:
    # Define the constructor
    def __init__(self, API_KEY):
        self.API_KEY = API_KEY
    # Define a method to retrieve weather data
    def get_weather(self, city, country, units='imperial'):
        self.city = city
        self.country = country
        self.units = units
        # Make a GET request to the API endpoint that returns JSON data
        url = f"https://api.openweathermap.org/data/2.5/weather?q={city},{country}&APPID={self.API_KEY}&units={units}"
        response = requests.get(url)
        # Raise an error for failed requests; otherwise parse and return the JSON response
        if response.status_code != 200:
            raise Exception(f"Error: {response.status_code} - {response.text}")
        return response.json()
Nice. Now we can run this class. Notice we use dbutils.widgets.get(): this command reads the Parameters configured in the scheduled job, which we will set up later in this article, so the API key is not hard-coded in the notebook and the secrets stay safe.
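As a side note, if secret scopes are available in your workspace, dbutils.secrets is an even more locked-down option. The one-liner below is just a sketch: the scope and key names are hypothetical and would need to be created beforehand, for example with the Databricks CLI. We will stick with job parameters for the rest of this article.
# Hypothetical alternative: read the key from a pre-created secret scope
API_KEY = dbutils.secrets.get(scope='weather-scope', key='openweather_api_key')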
# Get the API OpenWeatherMap key
API_KEY = dbutils.widgets.get('API_KEY')
# Instantiate the class
w = Weather(API_KEY=API_KEY)
# Get the weather data
nyc = w.get_weather(city='New York', country='US')
nyc
Here is the response.
{'coord': {'lon': -74.006, 'lat': 40.7143},
'weather': [{'id': 804,
'main': 'Clouds',
'description': 'overcast clouds',
'icon': '04d'}],
'base': 'stations',
'main': {'temp': 54.14,
'feels_like': 53.44,
'temp_min': 51.76,
'temp_max': 56.26,
'pressure': 992,
'humidity': 89,
'sea_level': 992,
'grnd_level': 993},
'visibility': 10000,
'wind': {'speed': 21.85, 'deg': 270, 'gust': 37.98},
'clouds': {'all': 100},
'dt': 1766161441,
'sys': {'type': 1,
'id': 4610,
'country': 'US',
'sunrise': 1766146541,
'sunset': 1766179850},
'timezone': -18000,
'id': 5128581,
'name': 'New York',
'cod': 200}
With that response in hand, we can move on to the Transformation part of our project, where we will clean and transform the data.
Transform: Formatting The Data
In this section, we will look at the clean and transform tasks performed over the raw data. We will start by selecting the pieces of data needed for our dashboard. This is simply getting data from a dictionary (or a JSON).
# Getting information
id = nyc['id']
timestamp = nyc['dt']
weather = nyc['weather'][0]['main']
temp = nyc['main']['temp']
tmin = nyc['main']['temp_min']
tmax = nyc['main']['temp_max']
country = nyc['sys']['country']
city = nyc['name']
sunrise = nyc['sys']['sunrise']
sunset = nyc['sys']['sunset']
Next, let's convert the timestamps to the New York time zone, since the API returns them in UTC.
# Transform sunrise and sunset to datetime in NYC timezone
from datetime import datetime, timezone
from zoneinfo import ZoneInfo
# Timestamp, Sunrise and Sunset to NYC timezone
target_timezone = ZoneInfo("America/New_York")
dt_utc = datetime.fromtimestamp(sunrise, tz=timezone.utc)
sunrise_nyc = str(dt_utc.astimezone(target_timezone).time())  # keep only the sunrise time
dt_utc = datetime.fromtimestamp(sunset, tz=timezone.utc)
sunset_nyc = str(dt_utc.astimezone(target_timezone).time())  # keep only the sunset time
dt_utc = datetime.fromtimestamp(timestamp, tz=timezone.utc)
time_nyc = str(dt_utc.astimezone(target_timezone))
Finally, we format it as a Spark dataframe.
# Create a dataframe from the variables
df = spark.createDataFrame([[id, time_nyc, weather, temp, tmin, tmax, country, city, sunrise_nyc, sunset_nyc]], schema=['id', 'timestamp','weather', 'temp', 'tmin', 'tmax', 'country', 'city', 'sunrise', 'sunset'])
Data cleaned and transformed. Image by the author.
The final step in this section is adding the suggestion from an LLM. In this step, we are going to pick some of the data fetched from the API and pass it to the model, asking it to return a suggestion on how a person could dress to be prepared for the weather.
- You will need an OpenAI API key.
- Pass the weather condition and the *max* and *min* temperatures (weather, tmax, tmin) to the model.
- Ask the LLM to return a suggestion about how to dress for the weather.
- Add the suggestion to the final dataframe.
%pip install openai --quiet
from openai import OpenAI
import pyspark.sql.functions as F
from pyspark.sql.functions import col
# Get OpenAI Key
OPENAI_API_KEY = dbutils.widgets.get('OPENAI_API_KEY')
# Instantiate the client with the key retrieved from the job parameters
client = OpenAI(
    api_key=OPENAI_API_KEY
)
# Ask the model for a dressing suggestion based on the weather data
response = client.responses.create(
    model="gpt-4o-mini",
    instructions="You are a weatherman that gives suggestions about how to dress based on the weather. Answer in one sentence.",
    input=f"The weather is {weather}, with max temperature {tmax} and min temperature {tmin}. How should I dress?"
)
suggestion = response.output_text
# Add the suggestion to the df
df = df.withColumn('suggestion', F.lit(suggestion))
display(df)
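One optional hardening step, since this notebook will run unattended every hour: wrap the LLM call so a temporary OpenAI failure does not break the whole job. A minimal sketch (the fallback message is just an illustration):
# Optional: fall back gracefully if the LLM call fails (hypothetical fallback text)
try:
    response = client.responses.create(
        model="gpt-4o-mini",
        instructions="You are a weatherman that gives suggestions about how to dress based on the weather. Answer in one sentence.",
        input=f"The weather is {weather}, with max temperature {tmax} and min temperature {tmin}. How should I dress?"
    )
    suggestion = response.output_text
except Exception as e:
    suggestion = f"No suggestion available ({e})"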
Cool. We are almost done with the ETL. Now it is all about loading it. That’s the next section.
Load: Saving the Data and Creating the Silver Layer
The last piece of the ETL is loading the data. We will load it in two different ways.
- Persisting the raw files in a Unity Catalog Volume.
- Saving the transformed dataframe directly into the silver layer, which is a Delta Table ready for the Dashboard consumption.
Let's create a *catalog* that will hold all the weather data that we get from the API.
-- Creating a Catalog
CREATE CATALOG IF NOT EXISTS pipeline_weather
COMMENT 'This is the catalog for the weather pipeline';
Next, we create a schema for the Lakehouse. This one will store the volume with the raw JSON files fetched.
-- Creating a Schema
CREATE SCHEMA IF NOT EXISTS pipeline_weather.lakehouse
COMMENT 'This is the schema for the weather pipeline';
Now, we create the volume for the raw files.
-- Let's create a volume
CREATE VOLUME IF NOT EXISTS pipeline_weather.lakehouse.raw_data
COMMENT 'This is the raw data volume for the weather pipeline';
We also create another schema to hold the Silver Layer Delta Table.
-- Creating a schema to hold the transformed data
CREATE SCHEMA IF NOT EXISTS pipeline_weather.silver
COMMENT 'This is the silver schema for the weather pipeline';
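By the way, if you prefer to keep everything in Python cells instead of switching to SQL, the same objects can be created with spark.sql. A quick sketch of the equivalent calls:
# Run the same DDL from Python instead of SQL cells
spark.sql("CREATE CATALOG IF NOT EXISTS pipeline_weather COMMENT 'This is the catalog for the weather pipeline'")
spark.sql("CREATE SCHEMA IF NOT EXISTS pipeline_weather.lakehouse COMMENT 'This is the schema for the weather pipeline'")
spark.sql("CREATE VOLUME IF NOT EXISTS pipeline_weather.lakehouse.raw_data COMMENT 'This is the raw data volume for the weather pipeline'")
spark.sql("CREATE SCHEMA IF NOT EXISTS pipeline_weather.silver COMMENT 'This is the silver schema for the weather pipeline'")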
Once we have everything set up, this is how our Catalog looks.
Catalog ready to receive data. Image by the author.
Now, let’s save the raw JSON response into our Raw Volume. To keep everything organized and prevent overwriting, we’ll attach a unique timestamp to each filename.
By *appending* these files to the volume rather than just overwriting them, we’re creating a reliable “audit trail”. This acts as a safety net, meaning that if a downstream process fails or we run into data loss later, we can always go back to the source and re-process the original data whenever we need it.
# Get timestamp
stamp = datetime.now().strftime('%Y-%m-%d_%H-%M-%S')
# Path to save (unique per run thanks to the timestamp suffix)
json_path = f'/Volumes/pipeline_weather/lakehouse/raw_data/weather_{stamp}.json'
# Save the data as JSON in the raw volume
df.write.mode('append').json(json_path)
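A note on what gets written: the snippet above saves the transformed dataframe as JSON (Spark writes a folder of part files under that path). If you also want to keep the untouched API response, closer to a true raw copy, one simple option is to dump the nyc dictionary with Python's json module, since Unity Catalog volumes are accessible through regular file I/O on the driver. A minimal sketch (the weather_raw_ filename is just a suggestion):
# Optionally persist the untouched API response as a single JSON file
raw_path = f'/Volumes/pipeline_weather/lakehouse/raw_data/weather_raw_{stamp}.json'
with open(raw_path, 'w') as f:
    json.dump(nyc, f)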
While we keep the raw JSON as our “source of truth,” saving the cleaned data into a Delta Table in the Silver layer is where the real magic happens. By using .mode("append") and the Delta format, we ensure our data is structured, schema-enforced, and ready for high-speed analytics or BI tools. This layer transforms messy API responses into a reliable, queryable table that grows with every pipeline run.
# Save the transformed data into a Delta table in the silver schema
(
    df
    .write
    .format('delta')
    .mode('append')
    .saveAsTable('pipeline_weather.silver.weather')
)
Beautiful! With this all set, let’s check how our table looks now.
Silver Layer Table. Image by the author.
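If you prefer to inspect it from a notebook cell rather than the Catalog Explorer, a quick query like the sketch below pulls the most recent rows (col was already imported in the Transform step):
# Peek at the latest records in the silver table
display(
    spark.table('pipeline_weather.silver.weather')
         .orderBy(col('timestamp').desc())
         .limit(5)
)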
Let’s start automating this pipeline now.
Orchestration: Scheduling the Notebook to Run Automatically
Moving on with the project, it is time to make this pipeline run by itself, with minimal supervision. For that, Databricks has the Jobs & Pipelines tab, where we can easily schedule jobs to run.
- Click the Jobs & Pipelines tab on the left panel
- Find the button Create and select Job
- Click on Notebook to add it to the Job.
- Configure it as shown below.
- Add the API Keys to the Parameters.
- Click Create task.
- Click Run Now to test if it works.
Adding a Notebook to the Job. Image by the author.
Once you click the Run Now button, it should start running the notebook and display the Succeeded message.
Jobs ran. Image by the author.
If the job is working fine, it is time to schedule it to run automatically. The UI steps are listed below, and a programmatic alternative is sketched right after them.
- Click on Add trigger on the right side of the screen, right under the section Schedules & Triggers.
- Trigger type = Scheduled.
- Schedule type: select Advanced
- Select **Every 1 hour** from the drop-downs.
- Save it.
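For completeness, here is a rough sketch of how the same job and hourly schedule could be created programmatically with the Databricks Python SDK (databricks-sdk). The notebook path and parameter values are placeholders, class names can differ slightly between SDK versions, and you may need to adjust compute settings for your workspace:
# Hedged sketch: create the job and its hourly schedule via the Databricks SDK
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

wc = WorkspaceClient()
wc.jobs.create(
    name='weather-pipeline',
    tasks=[
        jobs.Task(
            task_key='weather_etl',
            notebook_task=jobs.NotebookTask(
                notebook_path='/Workspace/Users/<your-user>/weather_etl',  # placeholder path
                base_parameters={'API_KEY': '<openweathermap-key>',        # placeholder values
                                 'OPENAI_API_KEY': '<openai-key>'},
            ),
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression='0 0 * * * ?',  # top of every hour
        timezone_id='America/New_York',
    ),
)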
Excellent. Our pipeline is on autopilot now! Every hour, the system will hit the OpenWeatherMap API, get fresh weather information for NYC, and save it to our Silver Layer Table.
Analytics: Building a Dashboard for Data-Driven Decisions
The last piece of this puzzle is creating the Analytics deliverable, which will show the weather information and provide the user with actionable information about how to dress for the weather outside.
- Click on the Dashboards tab on the left side panel.
- Click on the Create dashboard button
- It will open a blank canvas for us to work on.
Dashboard started. Image by the author.
Databricks dashboards are powered by data fetched from SQL queries. Therefore, before we start adding text and graphics to the canvas, we first need to create some metrics that will serve as the variables feeding the dashboard cards and charts.
So, click on the + Create from SQL button to start a metric and give it a name. For example, for a Location metric that retrieves the latest fetched city name, I use the query that follows.
-- Get the latest city name fetched
SELECT city
FROM pipeline_weather.silver.weather
ORDER BY timestamp DESC
LIMIT 1
And we must create one SQL query for each metric. You can see them all in the GitHub repository [4].
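If it helps, each metric query can be prototyped in a regular notebook cell before being wired into the dashboard. A quick sketch (the selected columns are just an example):
# Prototype a dashboard metric query in a notebook cell first
latest = spark.sql("""
    SELECT weather, temp, suggestion
    FROM pipeline_weather.silver.weather
    ORDER BY timestamp DESC
    LIMIT 1
""")
display(latest)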
Next, we click on the Dashboard tab and start dragging and dropping elements to the canvas.
Dashboard creation elements menu. Image by the author.
Once you click on the Text element, it lets you insert a box into the canvas and edit the text. When you click on the graphic element, it inserts a placeholder for a chart and opens the right-side menu for selecting the variables and configuring it.
Interacting with Dashboards in Databricks. Image by the author.
Ok. After all the elements are added, the dashboard will look like this.
Completed Dashboard. Image by the author.
So nice! And that concludes our project.
Before You Go
You can easily replicate this project in about an hour, depending on your experience with the Databricks ecosystem. While it’s a quick build, it packs a lot in terms of the core engineering skills you’ll get to exercise:
- Architectural Design: You’ll learn how to structure a modern Lakehouse environment from the ground up.
- Seamless Data Integration: You will bridge the gap between external web APIs and the Databricks platform for real-time data ingestion.
- Clean, Modular Code: We move beyond simple scripts by using Python classes and functions to keep the codebase organized and maintainable.
- Automation & Orchestration: You’ll get hands-on experience scheduling jobs to ensure your project runs reliably on autopilot.
- Delivering Real Value: The goal isn’t just to move data; it’s to provide value. By transforming raw weather metrics into actionable dressing suggestions via AI, we turn “cold data” into a helpful service for the end user.
If you liked this content, you can find my contacts and more about me on my website.
GitHub Repository
Here is the repository for this project.
https://github.com/gurezende/Databricks-Weather-Pipeline
References
[1] OpenWeatherMap API: https://openweathermap.org/
[2] OpenAI Platform: https://platform.openai.com/
[3] Databricks Free Edition: https://www.databricks.com/learn/free-edition
[4] GitHub Repository: https://github.com/gurezende/Databricks-Weather-Pipeline