Within this resource, we’re going to explore joining tables using a popular python library called Pandas.
WHY JOIN TABLES?
There are several ways to join datasets, but the bigger question is: why bother joining tables at all?
Why not simply work with the necessary table or data frame and avoid the extra hassle?
Well, let’s imagine it this way,
As a data expert, you’re given two datasets: one containing customer information and the other detailing store transactions.
Your task is to identify customers who spent over $999 bucks in the past week to qualify for a Black Friday discount.
Keep in mind that these are separate, existing datasets.
While working directly with the transactions dataset might help identify transactions over $999 bucks, the real challenge lies in identifyi…
Within this resource, we’re going to explore joining tables using a popular python library called Pandas.
WHY JOIN TABLES?
There are several ways to join datasets, but the bigger question is: why bother joining tables at all?
Why not simply work with the necessary table or data frame and avoid the extra hassle?
Well, let’s imagine it this way,
As a data expert, you’re given two datasets: one containing customer information and the other detailing store transactions.
Your task is to identify customers who spent over $999 bucks in the past week to qualify for a Black Friday discount.
Keep in mind that these are separate, existing datasets.
While working directly with the transactions dataset might help identify transactions over $999 bucks, the real challenge lies in identifying who made these transactions.
What are their names and contact details?
These questions can only be answered once you bring the customers dataset into the mix.
See how it works?
Joining datasets is just as important as putting the puzzle pieces together—without it, you’re limited!
But let’s pause for a moment to tackle the one super question you’ve been waiting for.
_Why am I getting these datasets one by one—like they’re dropping clues for a treasure hunt I didn’t sign up for?
Did the data gods see me toss that banana peel into the wrong bin at the park and decide I must suffer for my sins?
Why not just serve me a combined dataset straight from the source? Is that too much to ask?_
I hear you, my friend, and I’m here to help you understand why.
In most cases, databases are designed to store various datasets with different relationship patterns—much like how the customers dataset connects to the transactions dataset.
Apart from scalability, other reasons for practicing dataset separation include:
- Access control for sensitive data
- Query efficiency
- Logical organization
- Data ownership
- Compliance and regulation amongst others.
TYPES OF JOINING TABLES
Alright, before we wrap things up, let’s quickly go over a few join types you’re likely to come across.
- Inner Join
- Left Join
- Right Join
- Outer Join
Don’t panic, they’re interesting and fun to work with.
Inner Join is one of the simplest and most popular join types, where datasets are merged based on columns that are common to both datasets.
Let’s revisit our customers and transactions datasets:
Customers Dataset
Customer_ID Name Contact
101 Alice alice@example.com
102 Bob bob@example.com
103 Charlie charlie@example.com
104 Diana diana@example.com
105 Eve eve@example.com
Transactions Dataset
Transaction_ID Customer_ID Amount Date
201 101 1200 2024-12-01
202 103 1500 2024-12-02
203 104 999 2024-12-03
204 106 750 2024-12-04
205 107 1800 2024-12-05`
As you can see, both datasets contain various diverse columns ranging from customer names to contact and amount to date of transaction respectively, but…at least one column shares it’s presence on both dataset making it a common factor.
And that’s the Customer_ID column.
This commonality helps us perform joins effectively.
So inner join in this case is simply merging both tables on the Customer_ID column, excluding all rows whose customer_ID have no matching values on both tables.
I’ll show you:
import pandas as pd
inner_join_result = pd.merge(customers_df, transactions_df, on="Customer_ID", how="inner")
As demonstrated in the code snippet, performing an inner join requires importing the pandas library and using its merge() method. This method combines the datasets by specifying the on parameter, which indicates the common column to merge on (in this case, Customer_ID). The how="inner" argument specifies the join type which is "inner".
The results of the above inner join would be:
Customer_ID Name Contact Transaction_ID Amount Date
101 Alice alice@example.com 201 1200 2024-12-01
103 Charlie charlie@example.com 202 1500 2024-12-02
104 Diana diana@example.com 203 999 2024-12-03
Let’s analyse the result for better understanding:
- Alice (101) appears in both datasets, so her information and similar rows such as Charlie(103) and Diana(104) are included.
- Eve (105) and Bob(102) are in the Customers dataset but not in Transactions, so they are excluded.
- Customer_ID 106 and 107 from the Transactions dataset are excluded because they do not match any rows in the Customers dataset.
In summary, only rows with matching customer_id is the output of our inner join.
On the other hand, rows whose Customer_ID not found on the other merging dataset are completely excluded form the resulting dataset.
how is this useful and real life application
It’s important
Too much to take in?
I’ll explain.
If you wish to explore other types of join with me such as anti-join, semi-join, self-join and many more, do give a like and read here:
Till we talk again.
ciao.
Now go join those Datasets with conviction