Step 1: Connect to SQLite Database¶
In [1]:
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
conn = sqlite3.connect("mydatabase.db")
csv_files = {
"orders": "orders.csv",
"staffs": "staffs.csv",
"customers": "customers.csv",
"stores": "stores.csv",
"order_items": "order_items.csv"
}
for table_name, csv_file in csv_files.items():
df = pd.read_csv(csv_file)
df.to_sql(table_name, conn, if_exists='replace', index=False)
In [2]:
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())
[('orders',), ('staffs',), ('customers',), ('stores',), ('order_items',)]
In [3]:
# A function to display any table by putting its name into the function
def view_table(name):
df = pd.read_sql_query(f"SELECT * FROM {name} LIMIT 5", conn)
display(df)
view_table("orders")
order_id | customer_id | order_status | order_date | required_date | shipped_date | store_id | staff_id | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 259 | 4 | 2016-01-01 | 2016-01-03 | 2016-01-03 | 1 | 2 |
1 | 2 | 1212 | 4 | 2016-01-01 | 2016-01-04 | 2016-01-03 | 2 | 6 |
2 | 3 | 523 | 4 | 2016-01-02 | 2016-01-05 | 2016-01-03 | 2 | 7 |
3 | 4 | 175 | 4 | 2016-01-03 | 2016-01-04 | 2016-01-05 | 1 | 3 |
4 | 5 | 1324 | 4 | 2016-01-03 | 2016-01-06 | 2016-01-06 | 2 | 6 |
Step 2: Database Keys and Relationships¶
Primary Keys:
customers
:customer_id
orders
:order_id
order_items
: composite key of (order_id
,item_id
) oritem_id
if uniquestaff
:staff_id
stores
:store_id
Foreign Keys:
orders.customer_id
→customers.customer_id
orders.store_id
→stores.store_id
orders.staff_id
→staff.staff_id
order_items.order_id
→orders.order_id
order_items.product_id
→ (assuming aproducts
table, not listed here)staff.store_id
→stores.store_id
staff.manager_id
→staff.staff_id
(self-referential for manager)
Relationships:
- One customer can have many orders (1-to-many)
- One order can have many order_items (1-to-many)
- One store has many staff members (1-to-many)
- One staff member can manage multiple other staff members (1-to-many self relationship)
- Each order is created by one staff and belongs to one store
Step 3: Writing SQL Queries¶
In [4]:
# Defining a function to use to easily run SQL queries
def run_query(sql_query, conn):
df = pd.read_sql_query(sql_query, conn)
display(df)
return df
In [5]:
# Joining orders + customers tables using an inner join
query1 = """
SELECT
orders.order_id,
orders.order_date,
customers.first_name,
customers.last_name
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
LIMIT 10;
"""
df1 = run_query(query1, conn)
order_id | order_date | first_name | last_name | |
---|---|---|---|---|
0 | 1 | 2016-01-01 | Johnathan | Velazquez |
1 | 2 | 2016-01-01 | Jaqueline | Cummings |
2 | 3 | 2016-01-02 | Joshua | Robertson |
3 | 4 | 2016-01-03 | Nova | Hess |
4 | 5 | 2016-01-03 | Arla | Ellis |
5 | 6 | 2016-01-04 | Sharyn | Hopkins |
6 | 7 | 2016-01-04 | Laureen | Paul |
7 | 8 | 2016-01-04 | Leslie | Higgins |
8 | 9 | 2016-01-05 | Neil | Mccall |
9 | 10 | 2016-01-05 | Alane | Munoz |
In [6]:
# Joining staffs + stores tables using inner join
query2 = """
SELECT
staffs.first_name,
staffs.last_name,
stores.store_name
FROM staffs
JOIN stores ON staffs.store_id = stores.store_id
LIMIT 10;
"""
df2 = run_query(query2, conn)
first_name | last_name | store_name | |
---|---|---|---|
0 | Fabiola | Jackson | Santa Cruz Bikes |
1 | Mireya | Copeland | Santa Cruz Bikes |
2 | Genna | Serrano | Santa Cruz Bikes |
3 | Virgie | Wiggins | Santa Cruz Bikes |
4 | Jannette | David | Baldwin Bikes |
5 | Marcelene | Boyer | Baldwin Bikes |
6 | Venita | Daniel | Baldwin Bikes |
7 | Kali | Vargas | Rowlett Bikes |
8 | Layla | Terrell | Rowlett Bikes |
9 | Bernardine | Houston | Rowlett Bikes |
In [8]:
# Total revenue per order
query3 = """
SELECT
order_id,
SUM(quantity * list_price * (1 - discount)) AS total_revenue
FROM order_items
GROUP BY order_id
LIMIT 10;
"""
df3 = run_query(query3, conn)
order_id | total_revenue | |
---|---|---|
0 | 1 | 10231.0464 |
1 | 2 | 1697.9717 |
2 | 3 | 1519.9810 |
3 | 4 | 1349.9820 |
4 | 5 | 3900.0607 |
5 | 6 | 9442.5048 |
6 | 7 | 2165.0817 |
7 | 8 | 1372.4719 |
8 | 9 | 7199.9820 |
9 | 10 | 242.9910 |
In [9]:
# Total revenue per customer
query4 = """
SELECT
customers.customer_id,
customers.first_name,
customers.last_name,
SUM(order_items.quantity * order_items.list_price * (1 - order_items.discount)) AS total_spent
FROM customers
JOIN orders ON customers.customer_id = orders.customer_id
JOIN order_items ON orders.order_id = order_items.order_id
GROUP BY customers.customer_id, customers.first_name, customers.last_name
ORDER BY total_spent DESC
LIMIT 10;
"""
df4 = run_query(query4, conn)
customer_id | first_name | last_name | total_spent | |
---|---|---|---|---|
0 | 94 | Sharyn | Hopkins | 34807.9392 |
1 | 10 | Pamelia | Newman | 33634.2604 |
2 | 75 | Abby | Gamble | 32803.0062 |
3 | 6 | Lyndsey | Bean | 32675.0725 |
4 | 16 | Emmitt | Sanchez | 31925.8857 |
5 | 73 | Melanie | Hayes | 31913.6902 |
6 | 1 | Debra | Burks | 27888.1834 |
7 | 61 | Elinore | Aguilar | 25636.4531 |
8 | 93 | Corrina | Sawyer | 25612.7021 |
9 | 122 | Shena | Carter | 24890.6244 |
In [10]:
# List the first and last names of customers and the dates of the orders they placed.
query5 = """
SELECT
customers.first_name,
customers.last_name,
orders.order_date
FROM customers
JOIN orders ON customers.customer_id = orders.customer_id
LIMIT 10;
"""
df5 = run_query(query5, conn)
first_name | last_name | order_date | |
---|---|---|---|
0 | Debra | Burks | 2016-12-09 |
1 | Debra | Burks | 2018-04-18 |
2 | Debra | Burks | 2018-11-18 |
3 | Kasha | Todd | 2017-02-05 |
4 | Kasha | Todd | 2017-08-21 |
5 | Kasha | Todd | 2018-04-09 |
6 | Tameka | Fisher | 2018-03-27 |
7 | Tameka | Fisher | 2018-04-06 |
8 | Tameka | Fisher | 2018-10-21 |
9 | Daryl | Spence | 2017-02-07 |
In [11]:
# Where the staff work and their managers
query6 = """
SELECT
s.first_name AS staff_first,
s.last_name AS staff_last,
st.store_name,
m.first_name AS manager_first,
m.last_name AS manager_last
FROM staffs AS s
JOIN stores AS st ON s.store_id = st.store_id
JOIN staffs AS m ON s.manager_id = m.staff_id
LIMIT 10;
"""
df6 = run_query(query6, conn)
staff_first | staff_last | store_name | manager_first | manager_last | |
---|---|---|---|---|---|
0 | Mireya | Copeland | Santa Cruz Bikes | Fabiola | Jackson |
1 | Genna | Serrano | Santa Cruz Bikes | Mireya | Copeland |
2 | Virgie | Wiggins | Santa Cruz Bikes | Mireya | Copeland |
3 | Jannette | David | Baldwin Bikes | Fabiola | Jackson |
4 | Marcelene | Boyer | Baldwin Bikes | Jannette | David |
5 | Venita | Daniel | Baldwin Bikes | Jannette | David |
6 | Kali | Vargas | Rowlett Bikes | Fabiola | Jackson |
7 | Layla | Terrell | Rowlett Bikes | Venita | Daniel |
8 | Bernardine | Houston | Rowlett Bikes | Venita | Daniel |
Step 3: Visualize Data¶
In [12]:
# Top-Spending Customers (from query4)
plt.figure(figsize=(10, 6))
sns.barplot(x='total_spent', y='first_name', data=df4)
plt.title('Top 10 Customers by Total Spending')
plt.xlabel('Total Spent (USD)')
plt.ylabel('Customer')
plt.tight_layout()
plt.show()
💸 Top Spending Customers – Insights¶
- The top customer, Sharyn Hopkins, spent approximately $34.8K, making her one of the store’s most valuable clients.
- All top 10 customers spent over $24K, indicating a clear group of high-value customers.
- These insights could support targeted marketing efforts or loyalty programs to retain and reward top spenders.
In [13]:
# Staff Count per Store (from query2)
plt.figure(figsize=(10, 6))
sns.countplot(data=df2, y='store_name', order=df2['store_name'].value_counts().index)
plt.title('Number of Staff per Store')
plt.xlabel('Number of Staff')
plt.ylabel('Store')
plt.tight_layout()
plt.show()
🏬 Staff Distribution Across Stores – Insights¶
- The Santa Cruz Bikes store has a larger concentration of staff, with the top 4 employees all working there. This may indicate a bigger or busier location.
- Both Baldwin Bikes and Rowlett Bikes have multiple staff members, suggesting they are also active store locations but possibly smaller or less staffed than Santa Cruz.
- Understanding staff distribution helps identify where the workforce is concentrated and can inform decisions on resource allocation or management focus.
In [15]:
# Order Revenue Distribution (from query3)
plt.figure(figsize=(8, 5))
sns.histplot(df3['total_revenue'], bins=20)
plt.title('Distribution of Order Revenues')
plt.xlabel('Total Revenue per Order')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
💰 Total Revenue per Order – Insights¶
- Order ID 1 generated the highest revenue at $10,231.05, indicating a large or multiple-item purchase.
- Other orders show a wide range in revenue, from as low as $242.99 (Order 10) to several thousand dollars, highlighting varying order sizes.
- This spread suggests diverse customer purchasing behavior, with some orders being small and others significantly larger.
- Analyzing revenue per order helps identify high-value sales and patterns in purchasing that can inform sales strategies.