~/articles/design-yelp-proximity

◆◆Intermediateasked at Googleasked at Yelpasked at Uberasked at Metaasked at DoorDash

Design Yelp / Nearby Search (proximity service)

Q: What is geohash and how does it enable fast proximity queries?

Geohash encodes a (lat, lng) coordinate as a string where shared prefixes indicate geographic proximity. A 5 km radius query computes the geohash5 prefix for the query point, fetches that cell plus its 8 neighbors, and runs exact distance calculations on the resulting ~500 candidates — turning a full 100 million-row scan into a handful of indexed point lookups that complete in under 50 ms.

Q: Why does Uber's H3 hexagonal grid outperform square geohash cells for neighbor lookups?

On a square grid, diagonal neighbors are roughly 1.41x farther from the center than cardinal neighbors, so a radius query over a 3x3 cell block over-fetches corners — the diagonal cells extend well past the circular radius. Hexagons place all six neighbors at the same distance from the center, making the neighbor ring geometrically uniform and avoiding the corner over-fetch problem. H3 cell IDs are also 64-bit integers, easy to index, though less human-readable than geohash strings.

Q: What geohash precision should I use for a 5 km versus a 1 km radius search?

Use geohash5 for a 5 km search: each cell spans roughly 4.9 km x 4.9 km, so the center cell plus its 8 neighbors form a 3x3 grid that covers the full radius. Use geohash6 for a 1 km search: cells measure approximately 1.2 km x 0.6 km, keeping the candidate set small.

Q: Why does geographic cell-based sharding create a hot-shard problem, and what are the mitigations?

Sharding by cell collocates nearby businesses but concentrates query load: Manhattan or central Tokyo can generate 100x more QPS than a rural shard holding the same number of cells. The article recommends three complementary mitigations: virtual shards that let hot cell ranges be redistributed independently, read replicas on hot shards, and a Redis cell-level cache with a 60-second TTL that absorbs burst traffic before it reaches the shard.

Q: When should a proximity service use PostGIS instead of geohash or H3?

PostGIS (via its GIST R-tree index and ST_DWithin) is the right tool for queries that require arbitrary geometry — such as 'businesses inside this neighborhood polygon' — or when you need the full ST_Distance and spatial-containment API. Its limitation at this scale is that R-trees do not shard horizontally as cleanly as cell-based indexes, so most production systems use geohash or H3 for sharding and fast radius lookups and reserve PostGIS for complex polygon queries.

Find restaurants/businesses near a location, fast. Geohash, quadtree, hexagonal cells, and the right index for "within 5 km of me".

16 min read2026-02-18Ironclad Academy

#interview #geo #search #indexing #distributed-systems

// DEPTH

the full breakdown — requirements, capacity, evolution, trade-offs

The problem

Yelp serves roughly half a billion monthly queries from users looking for restaurants, coffee shops, and mechanics within a few kilometers of wherever they're standing (the exercise below uses a simplified 10M/day scale to keep the arithmetic tractable). Type "tacos near me" on a Tuesday afternoon, and within 200 ms Yelp returns a ranked list of nearby businesses filtered by category, hours, and rating — across a global catalog of 100 million listings.

The engineering challenge isn't throughput. At ~120 QPS average and ~1k QPS peak, you're not building the next Google Search in terms of raw traffic. The hard part is the geometry: SQL has no efficient primitive for "give me everything within 5 km of this coordinate." A naïve WHERE distance(lat, lng, ?, ?) < 5000 scans every row, and with 100 million businesses that takes seconds, not milliseconds.

The solution is a spatial index — a data structure that converts a continuous two-dimensional space into discrete cells so that a proximity query becomes a handful of indexed point lookups instead of a full-table scan. Geohash, quadtrees, and Uber's H3 hexagonal grid are the main options, each making a different trade-off between implementation simplicity and density uniformity. This is foundational tech that shows up in Uber and Lyft (driver-to-rider matching), DoorDash (restaurant selection), Instagram (location tagging), and every dating app's "people nearby" feature.

Geographic indexing also surfaces a subtler tension: cell-based indexes make proximity fast but introduce boundary effects — a business 10 meters outside your query cell is invisible unless you explicitly fetch neighboring cells. And sharding by geographic cell creates hot-shard problems when one city generates 100× more queries than rural areas. The design has to handle both.

Functional requirements

Search businesses near a point, with optional category/keyword filter.
Return up to 50 results, sorted by distance (or rating, or relevance).
Add / update business listings.
(Stretch) reviews, photos, hours.

Non-functional

p99 query < 200ms.
100M businesses.
10M searches/day → ~120 QPS average, ~1k peak.
Updates rare (~10k/day).

Capacity

Dimension	Estimate	How we got there
Business listings	100 M	Given
Avg listing size	~1 KB metadata	Text fields; photos stored separately in object storage
Total metadata storage	~100 GB	`100M × 1 KB`
Location density	City block can have hundreds of businesses	Variable — drives need for adaptive indexing
Searches	10 M/day, mostly read	Given
Avg search QPS	~120 QPS	`10M ÷ 86,400 s`
Peak search QPS	~1k QPS	~8× avg burst factor

Takeaway: The bottleneck is spatial queries, not raw throughput.

Building up to the design

The path from V1 to production here is almost entirely about the data structure. The infrastructure is straightforward; the index does all the interesting work. Walking forward shows why each approach breaks and makes the final choice feel inevitable rather than arbitrary.

V1: Full-table scan with distance formula

SELECT id, name,
       earth_distance(point(lat, lng), point(?, ?)) AS d
FROM businesses
WHERE d < 5000
ORDER BY d LIMIT 50;

This gives you correct results with no setup — and reads all 100M rows doing it. Even with parallelism, that's seconds, not 50ms.

V2: Bounding-box pre-filter

Compute a lat/lng box around the query point and pre-filter with a B-tree index on (lat, lng):

WHERE lat BETWEEN ?-Δ AND ?+Δ
  AND lng BETWEEN ?-Δ AND ?+Δ

The index narrows candidates from 100M to a few thousand; exact distance is computed only on those. In a city like Manhattan, though, even that bounding box contains 5k+ businesses, and sorting them is still hundreds of milliseconds. There's also a geometry problem: longitude degrees shrink near the poles, making the box anisotropic. And queries near the antimeridian (180° longitude) wrap the wrong way entirely.

V3: Geohash — convert (lat, lng) to a sortable string

Geohash encodes geography as a string prefix: shared prefix means close. Index on geohash (a normal B-tree). A 5km query becomes a 5-char geohash lookup, then fetch the cell plus 8 neighbors. The spatial query is now a handful of indexed range scans — under 50ms easily.

The catch: cells are uniform size but density isn't. A geohash6 cell in midtown Manhattan holds a thousand businesses; the same cell in rural Wyoming holds zero. You're either over-fetching in cities or under-fetching in countryside.

V4: Quadtree — adaptive cells

Instead of uniform cells, recursively subdivide until each leaf holds at most K businesses. Dense areas get tiny cells; sparse areas get big ones. Result size per cell is bounded regardless of density. The tradeoff: implementation is more complex and updates can trigger rebalancing.

V5: S2 / H3 — hierarchical cell grids

H3 (Uber) uses hexagons at multiple resolutions. Hexagons have an important geometric property that squares lack: every neighbor is equidistant from the center. On a square grid, a diagonal neighbor is √2× farther than a cardinal neighbor, which means "query this cell + its neighbors" captures an uneven ring. On a hex grid, all six neighbors sit at the same distance — radius queries are cleaner and don't over-fetch corners.

Pure proximity still isn't enough, though. "Nearby Italian restaurants with 4+ stars open now" — the cell index narrows to a spatial window, but filters are still needed.

V6: Inverted index for filters + geo for proximity

Pre-filter by category and keyword via Elasticsearch (inverted index), then intersect with the geo result. Alternatively, store businesses pre-grouped by (geohash, category) so both filters collapse into a single indexed lookup.

V7: Production Yelp

The hex cell index (H3 or geohash) for fast spatial lookup, Elasticsearch for text search and facets, Cassandra for business metadata, CDN for photos, and a per-cell Redis cache that absorbs the city-center surge — NYC's 10am "lunch near me" query can hit that cache a million times before a single business changes.

flowchart LR
    V1[V1: SQL scan<br/>seconds] --> V2[V2: + bbox index<br/>fast but dense fails]
    V2 --> V3[V3: geohash<br/>uniform cells]
    V3 --> V4[V4: quadtree<br/>adaptive cells]
    V4 --> V5[V5: H3 / S2<br/>hierarchical grids]
    V5 --> V6[V6: + filters + caching<br/>production Yelp]
    style V1 fill:#0e7490,color:#fff
    style V3 fill:#15803d,color:#fff
    style V5 fill:#ff6b1a,color:#0a0a0f
    style V6 fill:#a855f7,color:#fff

The rest of the article details each candidate index — pick one based on your density characteristics.

The naive solution and why it fails

SELECT *
FROM businesses
WHERE earth_distance(point(lat, lng), point(?, ?)) < 5000
ORDER BY earth_distance(...)
LIMIT 50

This scans every row, computes distance, sorts. With 100M businesses, even with parallelism, it's ~seconds — not 50ms.

We need a spatial index.

Option 1: Geohash

Geohash encodes (lat, lng) as a string where shared prefixes mean geographic proximity.

(37.7749, -122.4194)  → '9q8yyk8u'   ← San Francisco
(37.7750, -122.4193)  → '9q8yyk8u'   ← same prefix (very close)
(37.5485, -77.4360)   → 'dq8tcp3'    ← Richmond, VA (different)

Each character of geohash represents a doubling of precision. Cells are rectangles, not squares — width and height alternate which axis is halved:

Length	Cell dimensions (approx.)
4	≤ 39 km × 20 km
5	≤ 4.9 km × 4.9 km
6	≤ 1.2 km × 0.6 km
7	≤ 153 m × 153 m

For a 5 km radius query, geohash5 is the natural choice — each cell spans roughly 4.9 km, so the center cell plus its 8 neighbors form a 3×3 grid that covers the full radius. You fetch all nine cells, then compute exact distances on the ~500 candidates to produce the final ranked list.

flowchart TD
    QUERY["Query at lat=37.77, lng=-122.42<br/>radius=5km"] --> COMPUTE[geohash5 prefix: 9q8yy]
    COMPUTE --> NEIGHBOR[Compute 8 neighboring cells]
    NEIGHBOR --> FETCH[Fetch businesses in 9 cells]
    FETCH --> FILTER[Filter by exact distance]
    FILTER --> SORT[Sort by distance]
    SORT --> RES[Top 50]
    style COMPUTE fill:#ff6b1a,color:#0a0a0f
    style FETCH fill:#15803d,color:#fff

Schema:

businesses (
  id BIGINT PK,
  name TEXT,
  lat DOUBLE,
  lng DOUBLE,
  geohash6 CHAR(6),    -- ~1.2 km × 0.6 km cell
  geohash5 CHAR(5),    -- ~4.9 km × 4.9 km cell
  category INT[],
  ...
)

INDEX ON geohash6
INDEX ON geohash5

Query: pick the geohash precision matching the radius (radius 5 km → use geohash5), look up that cell + 8 neighbors. ~500 candidates → exact distance → sort.

Simple enough to run on any database. The limitation is those fixed cell sizes: in dense cities you fetch many businesses you don't need; at cell edges, a business 10 meters outside the query cell won't appear unless you also fetch its neighbor. Neighbor-fetching is the standard fix, but it still doesn't solve the density mismatch.

Option 2: Quadtree

A tree where each node represents a region; if too many points fit in a node, split it into 4 quadrants.

flowchart TD
    R[Whole world]
    R --> NW[NW quadrant]
    R --> NE[NE quadrant]
    R --> SW[SW quadrant]
    R --> SE[SE quadrant]
    SW --> SW1[Bay Area]
    SW1 --> SF[San Francisco]
    SF --> M[Mission district<br/>32 businesses]
    style M fill:#ff6b1a,color:#0a0a0f

Dense regions get fine-grained splits; sparse regions stay coarse. The result set per leaf node is bounded by your split threshold (say, K=100 businesses), which geohash can't guarantee. Memory-efficient because the tree only materializes cells that contain data.

The complexity cost is real: harder to shard and persist than geohash, and inserts into a dense node can trigger a rebalance that propagates up the tree. You also can't compute "which cell does this coordinate land in" on the client side — you have to ask the tree.

Option 3: H3 (hexagonal)

Uber's open-source library. Hexagons tile uniformly; neighbor relationships are simpler (6 neighbors, all equidistant). 16 resolution levels from continent to ~1m.

import h3
cell = h3.latlng_to_cell(37.7749, -122.4194, resolution=9)  # ~201m avg edge, ~0.105 km² area
neighbors = h3.grid_disk(cell, 1)  # cell + 6 neighbors

Same query pattern as geohash, but hexagons mean better neighbor uniformity.

H3 resolution guidance:

Resolution	Avg edge length	Avg area	Typical use
7	~1.4 km	~5.16 km²	City-level bucketing
8	~0.53 km	~0.74 km²	Neighborhood
9	~0.20 km	~0.105 km²	Block-level (good for 1 km radius)
10	~0.076 km	~0.015 km²	Fine-grained ride-share

For a 5 km proximity search, resolution 7 or 8 cells plus grid_disk(cell, k) rings are a natural fit.

The most uniform grid available, with great library support — what Uber actually runs. Cell IDs are 64-bit integers (easy to index, less human-readable than geohash strings), and you're taking on a dependency. Worth it if you need the clean neighbor math or plan to layer multiple resolution queries.

Option 4: PostGIS / native spatial extensions

Postgres has the PostGIS extension; MySQL/MariaDB have spatial types; MongoDB has 2dsphere indexes. They build R-trees under the hood, support ST_DWithin, ST_Distance, polygon containment, etc.

CREATE INDEX idx_geom ON businesses USING GIST (geom);

SELECT id, name
FROM businesses
WHERE ST_DWithin(geom, ST_MakePoint(?, ?)::geography, 5000)
ORDER BY ST_Distance(geom, ST_MakePoint(?, ?)::geography)
LIMIT 50;

PostGIS handles arbitrary shapes — useful for "businesses inside this neighborhood polygon" or other geometry queries beyond radius search. It's a mature, well-tested option. The limitation for our scale: R-trees don't shard horizontally as cleanly as cell-based indexes. Most production systems mix approaches — geohash or H3 for sharding and quick proximity lookups, PostGIS for complex polygon queries.

Architecture

flowchart TD
    USER[User] --> APIGW[API Gateway]
    APIGW --> SVC[Search Service]
    SVC --> CACHE[(Redis<br/>cell → results)]
    CACHE -.miss.-> IDX[Index Service]
    IDX --> SHARDS[(Sharded Geo Index<br/>by H3 cell)]
    SVC --> RANK[Ranker]
    RANK --> META[(Business Metadata)]

    INGEST[Add/Update Business] --> META
    INGEST --> IDX

    style IDX fill:#ff6b1a,color:#0a0a0f
    style CACHE fill:#15803d,color:#fff

Request flow end to end

A search request doesn't just hit one service and return. Here's how the pieces connect:

sequenceDiagram
    participant C as Client
    participant S as Search Service
    participant R as Redis Cache
    participant I as Index Service
    participant SH as Geo Shards
    participant M as Metadata DB

    C->>S: "pizza near (37.77, -122.42), 5km"
    S->>R: get(cell=9q8yy, filter=pizza)
    alt cache hit
        R-->>S: cached result list
    else cache miss
        R-->>S: nil
        S->>I: query cell 9q8yy + 8 neighbors
        I->>SH: parallel fetch (1-3 shards)
        SH-->>I: ~500 business IDs
        I-->>S: candidate list
        S->>M: fetch metadata for candidates
        M-->>S: names, ratings, hours
        S->>R: set(cell=9q8yy, filter=pizza, ttl=60s)
    end
    S->>S: rank by distance + rating + relevance
    S-->>C: top 50 results

Notice that the cache key includes both the cell and the filter. A search for "pizza near downtown" and "sushi near downtown" are separate cache entries — both can be warm simultaneously.

Sharding

Shard by cell ID at some coarse resolution (e.g. H3 resolution 5 — city-sized hexagons ~252 km² each, or geohash4 ~39 × 20 km cells). Each shard owns the businesses in its set of cells.

A query at point P:

Compute the cells covering the search radius.
Hash each cell to its shard.
Query each shard in parallel.
Merge and re-rank results.

For radius 5 km in a city, you might touch 5–20 cells across 1–3 shards. Parallel queries = ~20 ms total.

Hot-shard problem. Geographic sharding by cell creates a density skew: Manhattan and central Tokyo have orders-of-magnitude more businesses per km² than rural areas. A naive shard-by-cell approach leaves the NYC shard handling 100× more QPS than the Wyoming shard.

Three mitigations work well together:

Virtual shards: assign multiple virtual cell ranges to large physical shards; redistributed independently when a shard gets hot.
Read replicas per hot shard — all geo-index reads are idempotent.
Cell-level caching: the Redis cache (below) absorbs bursts before they hit the shard.

Shard boundary at query radius. For very large radii (> 50 km), the number of cells grows and a query might fan out to 10+ shards. Cap the max radius (Yelp's API enforces a hard cap at 40,000 m / ~25 miles); for larger needs, use a coarser index at a higher resolution level.

Caching

Most searches happen in the same areas — popular city centers, transit hubs. Cache popular (cell, filter) queries in Redis with a short TTL (60s) so they auto-refresh.

For long-tail rural queries, cache miss → DB. Acceptable.

Filtering and ranking

Beyond distance, users want filters (category=restaurant, open_now=true) and ranking (popularity, rating, "Yelp Score").

A typical pipeline:

flowchart LR
    Q[Query: lat,lng,radius,filter] --> CAND[Spatial candidates<br/>~500 within radius]
    CAND --> FILT[Apply filters<br/>category, open_now, etc.]
    FILT --> SCORE[Score: blend distance + rating + popularity + relevance]
    SCORE --> TOP[Top 50]
    style SCORE fill:#ff6b1a,color:#0a0a0f

The score function is tunable:

score = w1 × (1 / distance)
      + w2 × business_rating
      + w3 × user_relevance(category, business)
      + w4 × popularity_signal

For Yelp, popularity is reviews × time-decay; for Uber Eats, delivery speed dominates.

Updates and re-indexing

Adding a new business:

Insert into metadata DB (source of truth).
Compute its cell at all relevant resolutions.
Add to spatial index shard(s) — write to the shard owning that cell.
Publish a cell-invalidation event; cache workers delete the Redis key for that cell.

Most updates are sparse (~10 k/day vs. 100 M existing) — purely incremental, no full reindex.

Dual-write consistency risk. If step 3 succeeds but step 4 fails, a stale cached result will serve the old business list until TTL expiry (60 s). For Yelp, that is acceptable. A brand-new business appearing 60 seconds late is fine; a business that closes permanently disappearing 60 seconds late is fine.

If stricter consistency is needed: use a change-data-capture (CDC) pipeline (Debezium on Postgres → Kafka → index worker + cache invalidator). The CDC stream serializes updates and ensures both the index and cache are updated from the same event, with at-least-once delivery.

Business moves (rare but real). A restaurant that relocates must be removed from its old cell shard and inserted into the new one. Implement as a delete + insert, not an in-place update, to keep the cell-to-business mapping correct.

Edge cases

Radius spanning the antimeridian

A point near Fiji (lng ≈ 179°) with a 100 km radius crosses 180°/−180°. A naive bounding-box query WHERE lng BETWEEN 175 AND -175 returns zero rows because the comparison wraps the wrong way.

Fix: detect when lng_min > lng_max (i.e. the box crosses the antimeridian) and split into two queries:

lng BETWEEN 175 AND 180
lng BETWEEN -180 AND -175

PostGIS and H3's grid_disk handle this correctly because they operate in cell-space, not coordinate-space — grid_disk traverses the cell graph regardless of where 180° falls. Geohash neighbor computation handles it correctly in most library implementations (standard algorithms explicitly account for the wrap), but naive bounding-box code built by hand does not. The bug lives in homegrown bounding-box arithmetic, not in the standard cell-neighbor APIs.

Polar regions

Geohash and H3 both handle poles, but cell shape distorts heavily. Most consumer apps don't care (no users at the South Pole).

Moving objects

For "drivers near me," the index must update in real time as drivers move (see Uber design).

Polygon queries

"Inside this neighborhood" requires actual geometry. Use PostGIS or a specialized geo store.

Storage choices

Component	Tech
Spatial index	Custom hexagonal cell store, or PostGIS, or Elasticsearch (geo_point)
Metadata	Postgres
Cache	Redis
Photos / reviews (long tail)	S3 + Cassandra

Performance numbers

For 100 M businesses, sharded across 50 nodes by H3 cell (illustrative targets):

Query: 20 ms p50, 80 ms p99 — each query touches ~9 cells, 1–3 shards in parallel.
1 k QPS sustained per shard (index-only reads on small key space; easily raised with read replicas).
Cluster throughput: 50 × 1 k = 50 k QPS (well above the 1 k peak requirement; headroom for burst and future growth).
Index size: ~50 GB for geo index (small fields — cell ID + business ID only); full metadata in a separate store.
Redis cache hit ratio: ~95 %+ for city-center traffic; cache miss rate drives DB load to well under 60 QPS at peak.

Things to discuss in an interview

Geohash vs. quadtree vs. H3: pick one with reason.
Sharding by cell for horizontal scale.
Caching popular queries to absorb city-center hot spots.
Multi-stage ranking: spatial filter → attribute filter → score.
Updates as a smaller, separate concern — no need to optimize.

Things you should now be able to answer

Why is naive WHERE dist < radius too slow at 100M rows?
What's the trade-off between geohash and quadtree?
Why does Uber prefer hexagons over squares?
How do you handle a search radius that spans cell boundaries?
How would you cache results for popular search areas?

Frequently asked questions

▸What is geohash and how does it enable fast proximity queries?