Step 1: Connect to SQLite Database¶

In [1]:
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

conn = sqlite3.connect("mydatabase.db")

csv_files = {
    "orders": "orders.csv",
    "staffs": "staffs.csv",
    "customers": "customers.csv",
    "stores": "stores.csv",
    "order_items": "order_items.csv"
}

for table_name, csv_file in csv_files.items():
    df = pd.read_csv(csv_file)
    df.to_sql(table_name, conn, if_exists='replace', index=False)
In [2]:
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())
[('orders',), ('staffs',), ('customers',), ('stores',), ('order_items',)]
In [3]:
# A function to display any table by putting its name into the function
def view_table(name):
    df = pd.read_sql_query(f"SELECT * FROM {name} LIMIT 5", conn) 
    display(df)
    
view_table("orders")
order_id customer_id order_status order_date required_date shipped_date store_id staff_id
0 1 259 4 2016-01-01 2016-01-03 2016-01-03 1 2
1 2 1212 4 2016-01-01 2016-01-04 2016-01-03 2 6
2 3 523 4 2016-01-02 2016-01-05 2016-01-03 2 7
3 4 175 4 2016-01-03 2016-01-04 2016-01-05 1 3
4 5 1324 4 2016-01-03 2016-01-06 2016-01-06 2 6

Step 2: Database Keys and Relationships¶

  • Primary Keys:

    • customers: customer_id
    • orders: order_id
    • order_items: composite key of (order_id, item_id) or item_id if unique
    • staff: staff_id
    • stores: store_id
  • Foreign Keys:

    • orders.customer_id → customers.customer_id
    • orders.store_id → stores.store_id
    • orders.staff_id → staff.staff_id
    • order_items.order_id → orders.order_id
    • order_items.product_id → (assuming a products table, not listed here)
    • staff.store_id → stores.store_id
    • staff.manager_id → staff.staff_id (self-referential for manager)
  • Relationships:

    • One customer can have many orders (1-to-many)
    • One order can have many order_items (1-to-many)
    • One store has many staff members (1-to-many)
    • One staff member can manage multiple other staff members (1-to-many self relationship)
    • Each order is created by one staff and belongs to one store

Step 3: Writing SQL Queries¶

In [4]:
# Defining a function to use to easily run SQL queries
def run_query(sql_query, conn):
    df = pd.read_sql_query(sql_query, conn)
    display(df)
    return df
In [5]:
# Joining orders + customers tables using an inner join

query1 = """
SELECT 
    orders.order_id,
    orders.order_date,
    customers.first_name,
    customers.last_name
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
LIMIT 10;
"""

df1 = run_query(query1, conn)
order_id order_date first_name last_name
0 1 2016-01-01 Johnathan Velazquez
1 2 2016-01-01 Jaqueline Cummings
2 3 2016-01-02 Joshua Robertson
3 4 2016-01-03 Nova Hess
4 5 2016-01-03 Arla Ellis
5 6 2016-01-04 Sharyn Hopkins
6 7 2016-01-04 Laureen Paul
7 8 2016-01-04 Leslie Higgins
8 9 2016-01-05 Neil Mccall
9 10 2016-01-05 Alane Munoz
In [6]:
# Joining staffs + stores tables using inner join

query2 = """
SELECT 
    staffs.first_name,
    staffs.last_name,
    stores.store_name
FROM staffs
JOIN stores ON staffs.store_id = stores.store_id
LIMIT 10;
"""

df2 = run_query(query2, conn)
first_name last_name store_name
0 Fabiola Jackson Santa Cruz Bikes
1 Mireya Copeland Santa Cruz Bikes
2 Genna Serrano Santa Cruz Bikes
3 Virgie Wiggins Santa Cruz Bikes
4 Jannette David Baldwin Bikes
5 Marcelene Boyer Baldwin Bikes
6 Venita Daniel Baldwin Bikes
7 Kali Vargas Rowlett Bikes
8 Layla Terrell Rowlett Bikes
9 Bernardine Houston Rowlett Bikes
In [8]:
# Total revenue per order

query3 = """
SELECT 
    order_id,
    SUM(quantity * list_price * (1 - discount)) AS total_revenue
FROM order_items
GROUP BY order_id
LIMIT 10;
"""

df3 = run_query(query3, conn)
order_id total_revenue
0 1 10231.0464
1 2 1697.9717
2 3 1519.9810
3 4 1349.9820
4 5 3900.0607
5 6 9442.5048
6 7 2165.0817
7 8 1372.4719
8 9 7199.9820
9 10 242.9910
In [9]:
# Total revenue per customer

query4 = """
SELECT 
    customers.customer_id,
    customers.first_name,
    customers.last_name,
    SUM(order_items.quantity * order_items.list_price * (1 - order_items.discount)) AS total_spent
FROM customers
JOIN orders ON customers.customer_id = orders.customer_id
JOIN order_items ON orders.order_id = order_items.order_id
GROUP BY customers.customer_id, customers.first_name, customers.last_name
ORDER BY total_spent DESC
LIMIT 10;
"""

df4 = run_query(query4, conn)
customer_id first_name last_name total_spent
0 94 Sharyn Hopkins 34807.9392
1 10 Pamelia Newman 33634.2604
2 75 Abby Gamble 32803.0062
3 6 Lyndsey Bean 32675.0725
4 16 Emmitt Sanchez 31925.8857
5 73 Melanie Hayes 31913.6902
6 1 Debra Burks 27888.1834
7 61 Elinore Aguilar 25636.4531
8 93 Corrina Sawyer 25612.7021
9 122 Shena Carter 24890.6244
In [10]:
# List the first and last names of customers and the dates of the orders they placed.

query5 = """
SELECT 
    customers.first_name,
    customers.last_name,
    orders.order_date
FROM customers
JOIN orders ON customers.customer_id = orders.customer_id
LIMIT 10;
"""

df5 = run_query(query5, conn)
first_name last_name order_date
0 Debra Burks 2016-12-09
1 Debra Burks 2018-04-18
2 Debra Burks 2018-11-18
3 Kasha Todd 2017-02-05
4 Kasha Todd 2017-08-21
5 Kasha Todd 2018-04-09
6 Tameka Fisher 2018-03-27
7 Tameka Fisher 2018-04-06
8 Tameka Fisher 2018-10-21
9 Daryl Spence 2017-02-07
In [11]:
# Where the staff work and their managers

query6 = """
SELECT 
    s.first_name AS staff_first,
    s.last_name AS staff_last,
    st.store_name,
    m.first_name AS manager_first,
    m.last_name AS manager_last
FROM staffs AS s
JOIN stores AS st ON s.store_id = st.store_id
JOIN staffs AS m ON s.manager_id = m.staff_id
LIMIT 10;
"""

df6 = run_query(query6, conn)
staff_first staff_last store_name manager_first manager_last
0 Mireya Copeland Santa Cruz Bikes Fabiola Jackson
1 Genna Serrano Santa Cruz Bikes Mireya Copeland
2 Virgie Wiggins Santa Cruz Bikes Mireya Copeland
3 Jannette David Baldwin Bikes Fabiola Jackson
4 Marcelene Boyer Baldwin Bikes Jannette David
5 Venita Daniel Baldwin Bikes Jannette David
6 Kali Vargas Rowlett Bikes Fabiola Jackson
7 Layla Terrell Rowlett Bikes Venita Daniel
8 Bernardine Houston Rowlett Bikes Venita Daniel

Step 3: Visualize Data¶

In [12]:
# Top-Spending Customers (from query4)

plt.figure(figsize=(10, 6))
sns.barplot(x='total_spent', y='first_name', data=df4)
plt.title('Top 10 Customers by Total Spending')
plt.xlabel('Total Spent (USD)')
plt.ylabel('Customer')
plt.tight_layout()
plt.show()
No description has been provided for this image

💸 Top Spending Customers – Insights¶

  • The top customer, Sharyn Hopkins, spent approximately $34.8K, making her one of the store’s most valuable clients.
  • All top 10 customers spent over $24K, indicating a clear group of high-value customers.
  • These insights could support targeted marketing efforts or loyalty programs to retain and reward top spenders.
In [13]:
# Staff Count per Store (from query2)

plt.figure(figsize=(10, 6))
sns.countplot(data=df2, y='store_name', order=df2['store_name'].value_counts().index)
plt.title('Number of Staff per Store')
plt.xlabel('Number of Staff')
plt.ylabel('Store')
plt.tight_layout()
plt.show()
No description has been provided for this image

🏬 Staff Distribution Across Stores – Insights¶

  • The Santa Cruz Bikes store has a larger concentration of staff, with the top 4 employees all working there. This may indicate a bigger or busier location.
  • Both Baldwin Bikes and Rowlett Bikes have multiple staff members, suggesting they are also active store locations but possibly smaller or less staffed than Santa Cruz.
  • Understanding staff distribution helps identify where the workforce is concentrated and can inform decisions on resource allocation or management focus.
In [15]:
# Order Revenue Distribution (from query3)

plt.figure(figsize=(8, 5))
sns.histplot(df3['total_revenue'], bins=20)
plt.title('Distribution of Order Revenues')
plt.xlabel('Total Revenue per Order')
plt.ylabel('Frequency')
plt.tight_layout()
plt.show()
No description has been provided for this image

💰 Total Revenue per Order – Insights¶

  • Order ID 1 generated the highest revenue at $10,231.05, indicating a large or multiple-item purchase.
  • Other orders show a wide range in revenue, from as low as $242.99 (Order 10) to several thousand dollars, highlighting varying order sizes.
  • This spread suggests diverse customer purchasing behavior, with some orders being small and others significantly larger.
  • Analyzing revenue per order helps identify high-value sales and patterns in purchasing that can inform sales strategies.