unsprawl¶
Unsprawl - A hardware-accelerated compound AI system for high-fidelity urban simulation and autonomous infrastructure resilience.
This package provides:
1. Universal Modular Design (UMD): Region-agnostic schemas (Entity, Asset, Agent) and a dynamic Provider→Adapter→Loader architecture.
2. Legacy Singapore Valuation Pipeline: A complete CLI + programmatic API for lease-adjusted property valuation (backwards-compatible).
Quick Start (Global Platform - New UMD API)¶
>>> from unsprawl import Region, UniversalLoader
>>>
>>> loader = UniversalLoader()
>>> assets = loader.load(Region.SG) # Dynamic dispatch to SGAdapter
>>> print(assets[0].asset_type, assets[0].floor_area_sqm)
Quick Start (Legacy SG Pipeline - Backwards Compatible)¶
>>> from unsprawl import UnsprawlApp
>>>
>>> app = UnsprawlApp()
>>> results = app.process(
... input_path="resale.csv",
... town="PUNGGOL",
... budget=600000,
... top_n=10
... )
>>> print(results)
Main Classes (UMD Architecture)¶
Entity, Asset, Agent : Universal simulation schemas (Pydantic)
Region : Nested namespace for region codes (e.g., Region.SG, Region.US.CA.SF)
UniversalLoader : Dynamic dispatcher that routes region nodes to country adapters
GovSGProvider : Network-first data fetcher for Singapore (cached under ~/.unsprawl/data)
SGAdapter : Normalizes SG datasets into universal Asset objects
Legacy Classes (Singapore Valuation - v0 compat)¶
UnsprawlApp : High-level orchestrator for the complete SG valuation pipeline
HDBLoader : Load and normalize HDB resale CSV data
FeatureEngineer : Parse remaining lease and compute price efficiency
LeaseDepreciationModel : Bala’s Curve implementation for lease depreciation
ValuationEngine : Compute group-wise z-scores and valuation scores
TransportScorer : Calculate MRT accessibility scores
ReportGenerator : Filter, rank, and format results
Schema : Column name definitions for the pipeline
Submodules¶
Attributes¶
Classes¶
Application orchestrator wiring the pipeline and providing both programmatic and CLI access.
A dynamic actor (Commuter, Bus, Car).
A static economic unit (Building, Park, Transit Station).
Universal base class for the Unsprawl simulation.
Load and normalize HDB resale CSV data.
Canonical column names expected by the pipeline.
Dynamic dispatcher that loads Assets for a given Region node.
Engineer features required for valuation.
Non-linear lease depreciation model (Bala's Curve Approximation).
Compute group-wise Z-Scores, growth potential, and a final valuation score.
Filter, rank, and render a clean buy list table to console.
Compute MRT accessibility scores using spatial nearest-neighbor queries.
Functions¶
Entry point that dispatches to Typer app.
Ensure DataFrame has numeric lat and lon columns.
Configure root logger formatting and level.
Package Contents¶
- class UnsprawlApp(schema=None, transport_cache_dir=None)[source]¶
Application orchestrator wiring the pipeline and providing both programmatic and CLI access.
This class can be used directly as a Python module or via the CLI. For programmatic usage, use the process() method with explicit parameters. For CLI usage, use the run() method with parsed arguments.
Example (Module Usage)¶
>>> app = UnsprawlApp()
>>> results = app.process(
... input_path="resale.csv",
... town="PUNGGOL",
... budget=600000,
... top_n=10
... )
>>> print(results.head())
Example (With MRT Accessibility - Default)¶
>>> app = UnsprawlApp()
>>> results = app.process(
... input_path="resale.csv",
... town="BISHAN"
... )
Example (Custom MRT Catalog)¶
>>> results = app.process(
... input_path="resale.csv",
... mrt_catalog="stations.geojson",
... town="BISHAN"
... )
Initialize the valuation engine with optional custom schema and cache directory.
- Parameters:
schema (Schema | None) – Custom schema definition. If None, uses default Schema().
transport_cache_dir (Optional[str]) – Directory for caching transport KDTree data. If None, uses default .cache_transport.
- schema¶
- loader¶
- fe¶
- engine¶
- transport¶
- reporter¶
- logger¶
- load_data(input_path)[source]¶
Load HDB resale data from CSV file.
- Parameters:
input_path (str) – Path to the HDB resale CSV file.
- Returns:
Loaded and normalized DataFrame.
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If the file does not exist.
ValueError – If the CSV cannot be parsed.
- process(input_path=None, data=None, mrt_catalog=None, clear_transport_cache=False, group_by=None, enable_accessibility_adjust=True, town=None, town_like=None, budget=None, flat_type=None, flat_type_like=None, flat_model=None, flat_model_like=None, storey_min=None, storey_max=None, area_min=None, area_max=None, lease_min=None, lease_max=None, top_n=10, return_full=False)[source]¶
Process HDB resale data and return filtered, scored results.
This is the main programmatic entry point for using the valuation engine as a module.
- Parameters:
input_path (Optional[str]) – Path to HDB resale CSV. Required if data is not provided.
data (Optional[pd.DataFrame]) – Pre-loaded DataFrame. If provided, input_path is ignored.
mrt_catalog (Optional[str]) – Path to MRT stations GeoJSON or CSV for transport scoring.
clear_transport_cache (bool) – Whether to clear transport cache before processing.
group_by (Optional[List[str]]) – Columns to group by for peer comparison z-scores. Defaults to [town, flat_type].
enable_accessibility_adjust (bool) – Whether to adjust price efficiency based on MRT accessibility. Default True.
town (Optional[str]) – Exact town filter (case-insensitive).
town_like (Optional[str]) – Partial town match (substring).
budget (Optional[float]) – Maximum resale price.
flat_type (Optional[str]) – Exact flat type filter.
flat_type_like (Optional[str]) – Partial flat type match.
flat_model (Optional[str]) – Exact flat model filter.
flat_model_like (Optional[str]) – Partial flat model match.
storey_min (Optional[int]) – Minimum storey number.
storey_max (Optional[int]) – Maximum storey number.
area_min (Optional[float]) – Minimum floor area (sqm).
area_max (Optional[float]) – Maximum floor area (sqm).
lease_min (Optional[float]) – Minimum remaining lease (years).
lease_max (Optional[float]) – Maximum remaining lease (years).
top_n (int) – Number of top results to return. Default 10.
return_full (bool) – If True, return all filtered results instead of just top_n.
- Returns:
Filtered and scored results, sorted by valuation_score descending.
- Return type:
pd.DataFrame
- Raises:
ValueError – If neither input_path nor data is provided and the default dataset path is not available.
FileNotFoundError – If input_path does not exist.
Examples
>>> app = UnsprawlApp()
>>> results = app.process(
... input_path="resale.csv",
... town="PUNGGOL",
... budget=600000,
... top_n=5
... )
>>> print(f"Found {len(results)} undervalued properties")
- render_report(data=None, town=None, town_like=None, budget=None, flat_type=None, flat_type_like=None, flat_model=None, flat_model_like=None, storey_min=None, storey_max=None, area_min=None, area_max=None, lease_min=None, lease_max=None, top_n=10)[source]¶
Render a formatted string report from processed data.
- Parameters:
data (Optional[pd.DataFrame]) – Pre-processed DataFrame with scores. If None, uses internally stored data.
top_n (int) – Number of results to include in report.
Notes
This method accepts the same filter arguments as process().
- Returns:
Formatted table string ready for console output.
- Return type:
str
- render_rich_table(df, title='🏠 Top Undervalued Residential Properties')[source]¶
Render a Rich table from results DataFrame.
- Parameters:
df (pd.DataFrame) – Results DataFrame with valuation scores.
title (str) – Table title.
- Returns:
Formatted Rich table ready for console output.
- Return type:
rich.table.Table
- main(argv=None)[source]¶
Entry point that dispatches to Typer app.
Keeps return code semantics for tests, and supports legacy calls without a subcommand by defaulting to the valuate command when argv starts with flags.
- class Agent(/, **data)[source]¶
Bases: Entity
A dynamic actor (Commuter, Bus, Car).
Agents flow through the city graph / continuous space depending on the simulation backend.
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- goal: LatLon¶
- state: Literal['idle', 'moving', 'stuck'] = 'idle'¶
- class Asset(/, **data)[source]¶
Bases: Entity
A static economic unit (Building, Park, Transit Station).
This replaces the legacy Singapore-specific concept of “HDB Flat” with a generic container that can represent any asset class across any region.
The physics engine treats local_metadata as an opaque payload.
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- asset_type: Literal['residential', 'commercial', 'transport']¶
- class Entity(/, **data)[source]¶
Bases: pydantic.BaseModel
Universal base class for the Unsprawl simulation.
Everything in the simulation (static or moving) is an Entity.
Notes
Coordinate ordering is strictly (lat, lon) across the entire platform. Adapters must normalize any source data into this convention at the boundary.
Create a new model by parsing and validating input data from keyword arguments.
Raises [ValidationError][pydantic_core.ValidationError] if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- location: LatLon¶
- Region¶
- ensure_lat_lon_from_town_centroids(df, *, town_col='town', lat_col='lat', lon_col='lon')[source]¶
Ensure DataFrame has numeric lat and lon columns.
If lat and lon already exist, they are coerced to numeric and left as-is. If one or both are missing, they are inferred from town using TOWN_CENTROIDS. Unknown towns will remain NaN.
- class HDBLoader(schema=None)[source]¶
Load and normalize HDB resale CSV data.
The loader focuses on robust file I/O and schema normalization. It lowercases and strips column names to mitigate schema drift and attempts to coerce core numeric columns into numeric dtype with proper NA handling.
- schema¶
- logger¶
- load(path)[source]¶
Load CSV into a pandas DataFrame with normalized column names.
- Parameters:
path (str) – Path to the CSV file.
- Returns:
DataFrame with normalized columns and raw types preserved where possible.
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If the file does not exist.
ValueError – If the CSV cannot be parsed.
- class Schema[source]¶
Canonical column names expected by the pipeline.
This class centralizes schema expectations while allowing flexible mapping from real-world datasets where names may vary slightly in case or spacing.
- class FeatureEngineer(schema=None, use_lease_depreciation=True, depreciation_model=None)[source]¶
Engineer features required for valuation.
Responsibilities¶
Parse remaining lease strings of the form “85 years 3 months” into a float in units of years (e.g., 85.25) with robust handling of edge cases.
Compute price efficiency as: resale_price / (floor_area_sqm * remaining_lease_years)
Apply non-linear lease depreciation adjustment via LeaseDepreciationModel
Mathematical Notes¶
Price efficiency penalizes larger prices per effective area-year. By dividing price by both floor area (sqm) and remaining lease (years), the metric naturally adjusts for lease decay. The non-linear depreciation model further refines this by accounting for the accelerating loss of value as lease expiry approaches.
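The base metric above can be sketched in a few lines of pandas. This is an illustrative computation only (the sample prices and areas are made up); the real FeatureEngineer additionally handles missing values and the depreciation adjustment.

```python
import pandas as pd

# Illustrative rows; column names follow the pipeline's canonical schema.
df = pd.DataFrame({
    "resale_price": [500_000.0, 620_000.0],
    "floor_area_sqm": [90.0, 110.0],
    "remaining_lease_years": [85.25, 60.0],
})

# price_efficiency = resale_price / (floor_area_sqm * remaining_lease_years)
df["price_efficiency"] = df["resale_price"] / (
    df["floor_area_sqm"] * df["remaining_lease_years"]
)
print(df["price_efficiency"].round(2).tolist())  # [65.17, 93.94]
```

Note that a lower value is better here: it means fewer dollars paid per square meter per remaining lease year.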
Initialize FeatureEngineer with optional lease depreciation model.
- Parameters:
schema (Schema | None) – Schema definition for column names.
use_lease_depreciation (bool) – Whether to apply non-linear lease depreciation adjustment (default: True).
depreciation_model (LeaseDepreciationModel | None) – Custom depreciation model. If None and use_lease_depreciation=True, creates default LeaseDepreciationModel.
- _LEASE_YEARS_RE¶
- _LEASE_MONTHS_RE¶
- schema¶
- logger¶
- use_lease_depreciation = True¶
- depreciation_model: LeaseDepreciationModel | None¶
- _parse_lease_text(text)[source]¶
Parse a remaining lease string into float years.
Examples
“85 years 3 months” -> 85.25
“99 years” -> 99.0
“8 months” -> 0.666…
“less than 1 year” -> 0.5 (conservative placeholder)
- Parameters:
text (str | float | int | None) – Raw value from the dataset.
- Returns:
Parsed years as float, or None if parsing fails.
- Return type:
Optional[float]
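The parsing behavior documented in the examples above can be sketched as follows. The regexes and the `parse_lease_text` helper are illustrative, not the package's actual `_parse_lease_text` implementation, but they reproduce the documented outputs.

```python
import re
from typing import Optional

_YEARS_RE = re.compile(r"(\d+)\s*year", re.IGNORECASE)
_MONTHS_RE = re.compile(r"(\d+)\s*month", re.IGNORECASE)

def parse_lease_text(text) -> Optional[float]:
    """Parse strings like '85 years 3 months' into float years."""
    if text is None:
        return None
    s = str(text).strip().lower()
    if "less than 1 year" in s:
        return 0.5  # conservative placeholder, as documented
    years = _YEARS_RE.search(s)
    months = _MONTHS_RE.search(s)
    if not years and not months:
        try:
            return float(s)  # value is already numeric, e.g. "85.25"
        except ValueError:
            return None
    total = float(years.group(1)) if years else 0.0
    if months:
        total += float(months.group(1)) / 12.0
    return total

print(parse_lease_text("85 years 3 months"))  # 85.25
print(parse_lease_text("99 years"))           # 99.0
```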
- _infer_remaining_lease_from_commence(df, assumed_lease_years=99.0)[source]¶
Infer remaining lease (years) from lease_commence_date and month columns.
Mathematics¶
remaining_years = assumed_lease_years - ((year + month/12) - lease_commence_year) where (year, month) come from the transaction month string “YYYY-MM”.
Values are clipped to [0, assumed_lease_years]. Non-parsable rows yield NaN.
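The fallback formula above can be sketched with vectorized pandas operations. This is a minimal sketch assuming well-formed "YYYY-MM" strings; the real method additionally coerces non-parsable rows to NaN.

```python
import pandas as pd

# One illustrative row: a March 2024 transaction on a lease commenced in 1990.
df = pd.DataFrame({"month": ["2024-03"], "lease_commence_date": [1990]})

# remaining_years = assumed - ((year + month/12) - lease_commence_year)
parts = df["month"].str.split("-", expand=True).astype(float)
elapsed = (parts[0] + parts[1] / 12.0) - df["lease_commence_date"]
remaining = (99.0 - elapsed).clip(lower=0.0, upper=99.0)
print(remaining.round(2).tolist())  # [64.75]
```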
- parse_remaining_lease(df)[source]¶
Add a remaining_lease_years float column to the DataFrame.
The method attempts to parse the canonical remaining_lease column if present. If a numeric-looking remaining_lease_years already exists, it is respected. If missing, it falls back to inferring from (lease_commence_date, month) assuming a 99-year lease. All parsing errors coerce to NaN.
- Parameters:
df (pd.DataFrame) – Input dataframe.
- Returns:
DataFrame with an added/updated remaining_lease_years column.
- Return type:
pd.DataFrame
- compute_price_efficiency(df)[source]¶
Compute price efficiency metric with optional non-linear lease depreciation.
Formula (Base)¶
price_efficiency = resale_price / (floor_area_sqm * remaining_lease_years)
Formula (With Depreciation Adjustment)¶
price_efficiency_adjusted = base_efficiency / depreciation_factor(remaining_lease)
where depreciation_factor ∈ [0, 1] computed via Bala’s Curve.
Interpretation¶
Lower values indicate better cost per area-year. The non-linear depreciation adjustment increases the effective price for properties with shorter leases, reflecting the accelerating loss of market value as lease expiry approaches and keeping the score consistent with observed market behavior.
- class LeaseDepreciationModel(max_lease=99.0, decay_rate=3.0, steepness=2.5)[source]¶
Non-linear lease depreciation model (Bala’s Curve Approximation).
This model implements an economically rigorous depreciation curve for HDB leases, recognizing that a 99-year lease does not depreciate linearly. The value holds well for the first 30-40 years and then accelerates downward as lease expiry approaches.
Mathematical Model¶
The depreciation factor is computed using a sigmoid-like curve:
factor = exp(-k * ((99 - remaining) / 99)^n)
Where:
- remaining: years of lease remaining
- k: decay rate parameter (default: 3.0)
- n: curve steepness (default: 2.5)
This produces:
- Factor ≈ 1.0 for remaining > 80 years (minimal depreciation)
- Factor ≈ 0.8-0.9 for remaining = 50-80 years (moderate depreciation)
- Factor ≈ 0.3-0.7 for remaining = 20-50 years (accelerating depreciation)
- Factor ≈ 0.0-0.2 for remaining < 20 years (severe depreciation)
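The curve above is easy to evaluate directly. A minimal sketch of the documented formula (the function name and parameter defaults mirror the docs; this is not the class's actual method):

```python
import math

def depreciation_factor(remaining: float, k: float = 3.0, n: float = 2.5,
                        max_lease: float = 99.0) -> float:
    """factor = exp(-k * ((max_lease - remaining) / max_lease) ** n)."""
    remaining = max(0.0, min(remaining, max_lease))
    return math.exp(-k * ((max_lease - remaining) / max_lease) ** n)

# Factor decays slowly at first, then accelerates toward lease expiry.
for years in (95, 85, 50, 30, 10):
    print(years, round(depreciation_factor(years), 3))
```

Evaluating at a few points confirms the qualitative shape: near 1.0 above 80 years remaining, then an accelerating decline toward 0 as the lease runs out.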
References
This approximates the observed market behavior described in academic literature on HDB lease decay, including Bala’s studies on Singapore public housing valuation.
Initialize the lease depreciation model.
- Parameters:
max_lease (float) – Maximum lease period in years (default: 99.0 for HDB).
decay_rate (float) – Controls overall depreciation intensity (higher = more aggressive decay).
steepness (float) – Controls curve shape (higher = sharper decline near end of lease).
- max_lease = 99.0¶
- decay_rate = 3.0¶
- steepness = 2.5¶
- logger¶
- compute_depreciation_factor(remaining_years)[source]¶
Compute non-linear depreciation factor for given remaining lease years.
- Parameters:
remaining_years (pd.Series | float) – Remaining lease in years (can be Series or scalar).
- Returns:
Depreciation factor between 0 and 1, where 1 = no depreciation.
- Return type:
pd.Series | float
- adjust_price_efficiency(base_efficiency, remaining_years)[source]¶
Adjust price efficiency using non-linear lease depreciation.
The adjusted efficiency accounts for the non-linear loss of value over time. Lower depreciation factors increase the effective price per area-year, making properties with shorter leases appear more expensive on a value-adjusted basis.
- Parameters:
base_efficiency (pd.Series) – Base price efficiency (price / (area * remaining_years)).
remaining_years (pd.Series) – Remaining lease years for each property.
- Returns:
Lease-adjusted price efficiency.
- Return type:
pd.Series
- class ValuationEngine(schema=None)[source]¶
Compute group-wise Z-Scores, growth potential, and a final valuation score.
Methodology¶
Compute Z-Score of price_efficiency within groups defined by configurable grouping keys (default: (town, flat_type)). The Z-Score is defined as:
z = (x - mu) / sigma
where x is the observation’s price_efficiency, mu is the group mean, and sigma is the group standard deviation. If sigma == 0 or NaN, z is set to 0.
Define Valuation_Score = -Z_Price_Efficiency so that higher scores indicate better (cheaper-than-peers) properties.
Compute Growth_Potential metric based on Price-per-Sqm vs Town Average:
- Deep Value (High Growth): Unit PSM < 0.85 × Town Avg PSM
- Fair Value (Moderate Growth): 0.85 ≤ Unit PSM < 1.0 × Town Avg PSM
- Premium (Low Growth): Unit PSM ≥ 1.0 × Town Avg PSM
This civic value metric identifies properties trading significantly below their peer average, suggesting potential for price appreciation or representing exceptional value for money.
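The group-wise z-score and valuation score described above can be sketched with `groupby().transform()`. The data is made up; the zero-std handling mirrors the documented behavior (z = 0 when sigma is 0 or NaN).

```python
import pandas as pd

df = pd.DataFrame({
    "town": ["PUNGGOL"] * 3 + ["BISHAN"] * 2,
    "flat_type": ["4 ROOM"] * 5,
    "price_efficiency": [60.0, 70.0, 80.0, 90.0, 90.0],
})

# z = (x - mu) / sigma within each (town, flat_type) peer group.
g = df.groupby(["town", "flat_type"])["price_efficiency"]
z = (df["price_efficiency"] - g.transform("mean")) / g.transform("std")
df["z_price_efficiency"] = z.fillna(0.0)  # zero-std / singleton groups -> 0
df["valuation_score"] = -df["z_price_efficiency"]  # higher = more undervalued
print(df["valuation_score"].round(2).tolist())
```

The cheapest-per-peer PUNGGOL row gets the highest valuation_score; the BISHAN group has zero variance, so both its rows score 0.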
- schema¶
- logger¶
- _groupwise_zscore(series, groups)[source]¶
Compute group-wise Z-Score with robust handling of zero std.
- Parameters:
series (pd.Series) – Numeric series to standardize.
groups (pd.Series) – Group labels of same length as series.
- Returns:
Group-wise z-scores with NaN-safe handling; zeros where std is 0 or NaN.
- Return type:
pd.Series
- _compute_growth_potential(df)[source]¶
Compute future appreciation potential based on price-per-sqm vs town average.
This civic finance heuristic identifies “deep value” properties trading significantly below their peer group average, which may indicate: 1. Undervaluation relative to neighborhood 2. Higher potential for price appreciation 3. Exceptional value-for-money opportunities
The metric uses vectorized pandas operations for performance.
- Parameters:
df (pd.DataFrame) – Input DataFrame with resale_price, floor_area_sqm, town, and flat_type.
- Returns:
DataFrame with added columns:
- price_per_sqm: Unit price per square meter
- town_avg_psm: Average PSM for (town, flat_type) peer group
- psm_ratio: Unit PSM / Town Avg PSM
- growth_potential: Categorical score (High/Moderate/Low)
- Return type:
pd.DataFrame
- score(df, group_by=None)[source]¶
Add Z-Score, Valuation Score, and Growth Potential columns to the DataFrame.
Adds the following columns:
- z_price_efficiency: group-wise Z-Score of price_efficiency within selected groups
- valuation_score: -z_price_efficiency, so higher is more undervalued
- price_per_sqm: Price per square meter
- town_avg_psm: Average PSM for peer group (town, flat_type)
- psm_ratio: Unit PSM / Town Average PSM
- growth_potential: Categorical (High/Moderate/Low) appreciation potential
- Parameters:
df (pd.DataFrame) – Input DataFrame containing required columns.
group_by (Optional[List[str]]) – Column names to define peer groups. Defaults to [town, flat_type].
- Returns:
DataFrame with added score columns.
- Return type:
pd.DataFrame
- class ReportGenerator(schema=None)[source]¶
Filter, rank, and render a clean buy list table to console.
Filtering¶
Optional exact/partial town filter.
Optional budget filter for maximum resale price.
Extended filters: flat_model, flat_type (exact or partial), storey_min/max, area_min/max, lease_min/max.
Ranking¶
Sort by valuation_score descending (highest implies most undervalued), with ties broken by lowest price_efficiency and then lowest resale_price.
Display the top N results (default: 10).
Rendering¶
Human-friendly table using pandas’ built-in formatting.
- schema¶
- logger¶
- static _parse_storey_range(sr)[source]¶
Parse HDB storey range strings like “07 TO 09” into (min, max).
Non-parsable inputs return (None, None).
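A sketch of this parsing (the regex is illustrative, not necessarily the one used internally):

```python
import re
from typing import Optional, Tuple

_STOREY_RE = re.compile(r"^\s*(\d+)\s*TO\s*(\d+)\s*$", re.IGNORECASE)

def parse_storey_range(sr) -> Tuple[Optional[int], Optional[int]]:
    """Parse HDB storey ranges like '07 TO 09' into (min, max)."""
    m = _STOREY_RE.match(str(sr)) if sr is not None else None
    if not m:
        return (None, None)
    return (int(m.group(1)), int(m.group(2)))

print(parse_storey_range("07 TO 09"))    # (7, 9)
print(parse_storey_range("penthouse"))   # (None, None)
```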
- _apply_filters(df, town=None, town_like=None, budget=None, flat_type=None, flat_type_like=None, flat_model=None, flat_model_like=None, storey_min=None, storey_max=None, area_min=None, area_max=None, lease_min=None, lease_max=None)[source]¶
Apply user-specified filters to DataFrame.
- Parameters:
df (pd.DataFrame) – The scored dataset.
town (Optional[str]) – Town name for exact case-insensitive filtering.
town_like (Optional[str]) – Substring case-insensitive match for town.
budget (Optional[float]) – Maximum resale price.
flat_type (Optional[str]) – Exact match filter for flat_type (case-insensitive).
flat_type_like (Optional[str]) – Substring match for flat_type.
flat_model (Optional[str]) – Exact match filter for flat_model.
flat_model_like (Optional[str]) – Substring match for flat_model.
storey_min, storey_max (Optional[int]) – Min/max storey number filter (overlap with storey_range).
area_min, area_max (Optional[float]) – Floor area filters.
lease_min, lease_max (Optional[float]) – Remaining lease (years) filters.
- generate_dataframe(df, town=None, town_like=None, budget=None, flat_type=None, flat_type_like=None, flat_model=None, flat_model_like=None, storey_min=None, storey_max=None, area_min=None, area_max=None, lease_min=None, lease_max=None, top_n=10, full=False)[source]¶
Produce the filtered, sorted DataFrame for display/export.
If full is True, returns all rows after sorting; otherwise, returns the top_n rows.
- render(df, town=None, town_like=None, budget=None, flat_type=None, flat_type_like=None, flat_model=None, flat_model_like=None, storey_min=None, storey_max=None, area_min=None, area_max=None, lease_min=None, lease_max=None, top_n=10)[source]¶
Generate the formatted table for the buy list.
The table prioritizes the most undervalued units by sorting on valuation_score desc, breaking ties by price_efficiency asc and resale_price asc.
- class TransportScorer(stations_df=None, cache_dir=None)[source]¶
Compute MRT accessibility scores using spatial nearest-neighbor queries.
This scorer loads a catalog of MRT station coordinates, strictly excluding all LRT stations using a regex filter ‘^(BP|S[WE]|P[WE])’. The pattern matches the line codes for Bukit Panjang (BP), Sengkang (SW/SE), and Punggol (PW/PE) LRT loops, ensuring that only heavy rail stations are retained.
A KDTree (from scikit-learn) is used for vectorized nearest-neighbor computation across thousands of records, avoiding per-row Python loops.
Accessibility score definition¶
score = max(0, 10 - (dist_km * 2)) where dist_km is the Euclidean distance in kilometers from the HDB listing coordinate to the nearest MRT station in the filtered catalog.
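The score definition above is a simple clamped linear decay, reaching 0 at 5 km from the nearest station:

```python
def accessibility_score(dist_km: float) -> float:
    """score = max(0, 10 - (dist_km * 2)), per the definition above."""
    return max(0.0, 10.0 - dist_km * 2.0)

print(accessibility_score(0.5))  # 9.0 (500 m walk)
print(accessibility_score(6.0))  # 0.0 (clamped at 5 km and beyond)
```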
- logger¶
- _cache_dir¶
- static _exclude_lrt(df)[source]¶
Exclude LRT stations using strict regex on line codes.
Excludes station rows whose line_code matches ‘^(BP|S[WE]|P[WE])’. Column expectations:
- name: station name (str)
- line_code: string line code such as ‘NS’, ‘EW’, ‘DT’, ‘CC’, ‘BP’, ‘SW’
- lat, lon: numeric coordinates in degrees
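The exclusion can be sketched with a vectorized `str.match`. The station rows below are illustrative; a non-capturing group `(?:...)` is used here, which matches the same strings as the documented pattern.

```python
import pandas as pd

stations = pd.DataFrame({
    "name": ["Jurong East", "Phoenix LRT", "Sengkang LRT", "Punggol LRT"],
    "line_code": ["NS", "BP", "SW", "PE"],
})

# Drop BP / SW / SE / PW / PE line codes (the LRT loops), keep heavy rail.
mrt_only = stations[~stations["line_code"].str.match(r"^(?:BP|S[WE]|P[WE])")]
print(mrt_only["name"].tolist())  # ['Jurong East']
```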
- load_stations(stations_df)[source]¶
Load station catalog, exclude LRT, and build KDTree index.
- Parameters:
stations_df (pd.DataFrame) – DataFrame with columns: [‘name’, ‘line_code’, ‘lat’, ‘lon’].
- load_stations_geojson(path)[source]¶
Load MRT stations from an LTA Exit GeoJSON file and build KDTree.
The GeoJSON is expected to be a FeatureCollection where each feature is a station exit with properties containing station information. This loader will:
Extract station name and line code from common property keys.
Preserve robust fallback logic for station name parsing across GeoJSON variants (STATION_NA / STN_NAME / STN_NAM / NAME / etc.).
Strictly exclude LRT using the regex ‘^(BP|S[WE]|P[WE])’ on line codes when available, and additionally filter out any stations with ‘LRT’ in the name as a safety fallback.
Build a KDTree over exit coordinates (lon, lat). Using exits provides accurate pedestrian access points for distance calculations.
- Parameters:
path (str) – Path to the GeoJSON file.
- static _haversine_meters(latlon1, latlon2)[source]¶
Compute haversine distance in meters between arrays of points.
- Parameters:
latlon1 (np.ndarray) – Array of shape (n, 2) with columns [lat_rad, lon_rad] in radians.
latlon2 (np.ndarray) – Array of shape (n, 2) with columns [lat_rad, lon_rad] in radians.
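A minimal vectorized haversine matching the signature above might look like this (the coordinates below are approximate, for illustration only):

```python
import numpy as np

def haversine_meters(latlon1: np.ndarray, latlon2: np.ndarray) -> np.ndarray:
    """Great-circle distance in meters; inputs are (n, 2) [lat, lon] in radians."""
    r = 6_371_000.0  # mean Earth radius in meters
    dlat = latlon2[:, 0] - latlon1[:, 0]
    dlon = latlon2[:, 1] - latlon1[:, 1]
    a = (np.sin(dlat / 2.0) ** 2
         + np.cos(latlon1[:, 0]) * np.cos(latlon2[:, 0])
         * np.sin(dlon / 2.0) ** 2)
    return 2.0 * r * np.arcsin(np.sqrt(a))

sg = np.radians([[1.3521, 103.8198]])  # approx. central Singapore
pg = np.radians([[1.4043, 103.9022]])  # approx. Punggol
print(haversine_meters(sg, pg))        # roughly 10-11 km, in meters
```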
- calculate_accessibility_score(df)[source]¶
Annotate DataFrame with nearest MRT and accessibility score.
Adds columns:
- Nearest_MRT: name of nearest heavy-rail MRT station
- Dist_m: distance to nearest station in meters
- Accessibility_Score: score = max(0, 10 - (dist_km * 2))
Expectations: Input df must have ‘lat’ and ‘lon’ columns (degrees).