Connecting to PostgreSQL from Python#

In this notebook, we will learn how to connect to a PostgreSQL database from Python.
This is an essential skill if you want to:

  • Explore your data directly in pandas DataFrames

  • Build data pipelines (ETL/ELT) to refresh your tables regularly

  • Power machine learning or LLM-based applications that rely on structured data

  • Prototype dashboards and APIs that serve insights to end-users

We will go step by step:

  1. Load database credentials securely from a .env file

  2. Connect to PostgreSQL using SQLAlchemy and psycopg2

  3. Run sanity checks to confirm the connection

  4. Query our nightlights and population tables into pandas

  5. Perform simple summaries, joins, and a time-series query

Required Python Packages#

Ensure that you have installed the required packages.
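If they are not yet available in your environment, they can be installed from PyPI. The distribution names below are the standard ones for the libraries imported in this notebook; psycopg2-binary is assumed here as the pre-compiled build of the psycopg2 driver.

# Run once in a terminal (or prefix with ! in a notebook cell)
pip install python-dotenv sqlalchemy psycopg2-binary pandas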

import os
from dotenv import load_dotenv
from sqlalchemy import create_engine, text
import pandas as pd

PostgreSQL Environment Variables#

To connect to PostgreSQL, we need to provide a few basic pieces of information:

  • PGHOST → The host where PostgreSQL is running.
    For local setups this is usually localhost.

  • PGPORT → The port PostgreSQL listens on.
    Default is 5432.

  • PGDATABASE → The database name you created for this lab, e.g. ntl_pop.

  • PGUSER → Your PostgreSQL username.
    This is usually the same as your system username when running locally.

⚠️ Note: On a local setup, you usually don’t need a password if PostgreSQL is configured to trust local connections.
In that case, we will not use PGPASSWORD.
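For reference, a minimal .env file for this setup might look like the sketch below. The values are placeholders to adapt to your own machine; leave PGPASSWORD out entirely if your local server trusts local connections.

# .env (placed in the repo root)
PGHOST=localhost
PGPORT=5432
PGDATABASE=ntl_pop
PGUSER=your_username
# PGPASSWORD=your_password   # only if your server requires one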

# 1) Load environment variables
# The .env file should be in the repo root with keys:
# PGHOST, PGPORT, PGDATABASE, PGUSER, PGPASSWORD

# looks for .env in current directory by default
load_dotenv()  

PGHOST = os.getenv("PGHOST")
PGPORT = os.getenv("PGPORT", "5432")
PGDATABASE = os.getenv("PGDATABASE")
PGUSER = os.getenv("PGUSER")
PGPASSWORD = os.getenv("PGPASSWORD", "")  # default to empty so the connection string stays valid without a password
PGPORT
'5432'
print("PostgreSQL Connection Settings:")
print(f"Host: {PGHOST}")
print(f"Port: {PGPORT}")
print(f"Database: {PGDATABASE}")
print(f"User: {PGUSER}")
print(f"Password: {'[SET]' if PGPASSWORD else '[NOT SET]'}")
PostgreSQL Connection Settings:
Host: localhost
Port: 5432
Database: ntl_pop
User: dmatekenya
Password: [NOT SET]

Create SQLAlchemy engine#

At this stage, we are establishing the connection between Python and PostgreSQL.
The SQLAlchemy package plays the role of a database toolkit and Object Relational Mapper (ORM). In our case, we are mainly using it as a bridge: it translates Python code into SQL statements that PostgreSQL can understand, and it manages the underlying connection details for us (user, host, port, database). By creating an engine object, we set up a reusable gateway that allows us to open sessions, run queries, and easily pull results into Python tools such as pandas for analysis. While psycopg2 handles the low-level communication with PostgreSQL, SQLAlchemy provides a higher-level, more user-friendly interface.

# Create SQLAlchemy engine (no password needed for local connections)
connection_string = f"postgresql+psycopg2://{PGUSER}:{PGPASSWORD}@{PGHOST}:{PGPORT}/{PGDATABASE}"
print("Connection psql string:", connection_string)
Connection psql string: postgresql+psycopg2://dmatekenya:@localhost:5432/ntl_pop
engine = create_engine(
    connection_string,
    pool_pre_ping=True,
)

# Test the connection
with engine.connect() as conn:
    who = conn.execute(text("SELECT current_user, current_database();")).fetchone()
    print("Connected as:", who)

Running SQL Commands#

Once the connection is established through the SQLAlchemy engine, we can begin sending SQL commands from Python.
What happens in the background is:

  1. You write a SQL query as a string in Python (e.g., "SELECT * FROM ntl_annual LIMIT 5;").

  2. SQLAlchemy passes this query to the underlying driver (psycopg2), which handles the low-level communication with PostgreSQL.

  3. PostgreSQL executes the query on the database and returns the results (rows and columns).

  4. SQLAlchemy collects the results and makes them available to Python.

  5. If we use pandas.read_sql, the results are automatically converted into a DataFrame for analysis, filtering, and visualization.

In short: Python → SQLAlchemy → psycopg2 → PostgreSQL → back to Python as DataFrame.
This allows us to stay in a familiar Python environment while still harnessing the power of SQL.
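As a minimal sketch of this round trip, the example query from step 1 can be pulled straight into a DataFrame through the engine created above (this assumes the ntl_annual table used throughout this lab):

# One line: SQL in, DataFrame out (SQLAlchemy and psycopg2 handle the round trip)
sample = pd.read_sql("SELECT * FROM ntl_annual LIMIT 5;", engine)
sample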

In the same way, we can also create new tables, update existing tables with additional data, or even delete records—all from within Python.
For example:

  • Use CREATE TABLE statements to define new tables.

  • Use INSERT or COPY to add more rows of data.

  • Use UPDATE to modify existing records.

  • Use DROP to remove tables you no longer need.

This makes Python a powerful interface for both querying and managing your database directly.
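As a hedged sketch of that pattern, statements that modify the database can be wrapped in a transaction with engine.begin(), which commits automatically on success. The scratch_example table below is purely illustrative and not part of the lab schema.

# Create, populate, update, and drop a throwaway table inside one transaction
with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS scratch_example (
            id INTEGER PRIMARY KEY,
            label TEXT
        );
    """))
    conn.execute(text("INSERT INTO scratch_example (id, label) VALUES (1, 'demo');"))
    conn.execute(text("UPDATE scratch_example SET label = 'updated' WHERE id = 1;"))
    conn.execute(text("DROP TABLE scratch_example;"))

Using engine.begin() rather than engine.connect() means you do not have to call conn.commit() yourself; the changes are committed when the with block exits without an error.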

def peek(sql: str) -> pd.DataFrame:
    """Run a SQL query and return the results as a pandas DataFrame."""
    return pd.read_sql(sql, engine)
# ========== 3. Sanity checks ==========
tables = ["cells", "pop", "ntl_annual", "ntl_monthly"]

print("\n--- Table counts ---")
for t in tables:
    cnt = peek(f"SELECT COUNT(*) AS n FROM {t};")
    print(f"{t}: {cnt.loc[0, 'n']} rows")

print("\n--- First 5 rows from cells ---")
print(peek("SELECT * FROM cells LIMIT 5"))
# ========== 4. Annual summaries ==========
print("\n--- Average ntl_mean per year ---")
annual_avg = peek("""
SELECT year, AVG(ntl_mean) AS avg_ntl_mean
FROM ntl_annual
GROUP BY year
ORDER BY year;
""")
print(annual_avg.head())

print("\n--- Top 5 brightest cells in 2023 ---")
brightest = peek("""
SELECT cell_id, ntl_mean
FROM ntl_annual
WHERE year = 2023
ORDER BY ntl_mean DESC
LIMIT 5;
""")
print(brightest)


# ========== 5. Join with population ==========
print("\n--- Light per capita (2023) ---")
per_capita = peek("""
SELECT
  c.cell_name,
  c.district_name,
  a.year,
  a.ntl_sum / NULLIF(p.general_pop, 0) AS light_per_capita,
  a.ntl_sum,
  p.general_pop
FROM ntl_annual a
JOIN cells c ON a.cell_id = c.cell_id
JOIN pop   p ON a.cell_id = p.cell_id
WHERE a.year = 2023
ORDER BY light_per_capita DESC NULLS LAST
LIMIT 10;
""")
print(per_capita)

Exercise: Working with Your Database#

Now that you’ve seen how to connect and query PostgreSQL from Python, it’s time to practice.
In this exercise, you will write and run your own SQL commands inside Python.

Tasks#

  1. Create a New Table

    • Write a SQL command in Python to create a small table called demo_cells with the following columns:

      • id (integer, primary key)

      • cell_name (text)

      • population (integer)

  2. Insert Data

    • Insert at least two rows of data into demo_cells.

    • Hint: use INSERT INTO demo_cells (...) VALUES (...);.

  3. Query Your Data

    • Select all rows from demo_cells and load them into a pandas DataFrame.

    • Display the results.

  4. Update Data

    • Update one of the rows (for example, change the population of a cell).

    • Run a SELECT again to confirm the update worked.

  5. Cleanup (Optional)

    • Drop the table when you are done: DROP TABLE demo_cells;.


💡 Tip: Remember that you can use the with engine.begin() as conn: conn.execute(text("SQL HERE")) pattern (as sketched earlier) for commands like CREATE, INSERT, UPDATE, or DROP, and pd.read_sql("SQL HERE", engine) when you want to return results into a DataFrame.

Summary: Working with PostgreSQL in Python#

In this notebook, you learned how to connect Python to a PostgreSQL database, configure the connection using environment variables, and create a SQLAlchemy engine that acts as a bridge between Python and SQL. You executed key operations such as checking table counts, exploring data with a helper function, and running queries to analyze nightlight trends, identify the brightest cells, and compute light-per-capita statistics by joining nightlights with population data. In the same way, Python can interact with many other databases — whether they are hosted locally on your machine, running on a server in the cloud, or provided as enterprise solutions. This includes both proprietary systems such as Microsoft SQL Server and Oracle, as well as open-source databases like MySQL, MariaDB, and SQLite. Thanks to libraries like SQLAlchemy, the workflow you practiced here is portable: once you learn how to query and manage data in Python, you can apply the same approach across different database platforms.

Next Steps: Exploring Databases with Python#

After learning how to connect Python to PostgreSQL and run queries, you can explore:

  • Advanced SQL: joins, window functions, subqueries, CTEs

  • Database management: creating/updating tables, indexes, transactions

  • Python integration: automate queries, build ETL pipelines, visualize results

  • Other databases: SQLite, MySQL/MariaDB, SQL Server, Oracle

  • Scaling up: cloud-hosted databases, connection pooling, performance tuning

  • Applications: power ML workflows, LLM-based apps (RAG), dashboards, or APIs