
Advanced · asked at Meta, Snapchat, Google, Apple

Design Nearby Friends (Real-Time Friend Location Sharing)

How to stream real-time friend locations to millions of users without collapsing under the write amplification of a 30× social-graph fanout.

21 min read · 2026-06-05 · Ironclad Academy

#geo #realtime #pub/sub #websocket #social-graph #privacy


The problem

Snapchat's Snap Map and Apple's Find My Friends put small location dots on a map beside your friends' names. Open the app and you can see that Maya is at Dolores Park and Raj is two blocks from your office. The dot moves as they move. It feels magical — and the machinery underneath it is one of the nastier fanout problems in social infrastructure.

The core mechanics: each mobile client sends a GPS coordinate to the server every few seconds (a "heartbeat"). The server must take that single write and push the updated position to every friend who is currently watching the map. With 500 M users and an average social graph of 200 friends, a single location write fans out to roughly 30 online viewers. At 10 M location writes per second that is 300 M downstream pushes per second — a 30× amplification factor that must be absorbed by a delivery tier that scales independently of ingestion.

Two things make this genuinely hard. First, write amplification: the server receives far fewer writes than it must emit — the ratio is determined by social-graph density and online concurrency, not by the number of writers. Second, delivery routing: when Alice moves, the server must push her position to Bob's phone — but Bob's WebSocket lives on one of 500 connection servers, and the Fanout Service has no direct way to know which one. Getting that routing right without a global coordination table is the central design challenge.

Privacy and battery life layer on top. Opt-out must propagate within seconds. The mobile client must not drain the battery polling every second. These constraints shape the adaptive heartbeat, the TTL-based eviction, and the per-friend opt-in check that runs on every fanout — all of which the sections below explain in order.

Functional requirements

A user can see the approximate real-time location of any friend who has opted in to sharing with them.
Users control visibility: share with all friends, a custom list, or nobody (ghost mode).
Location dots refresh on the viewer's map within a few seconds of the friend moving.
The feature is mobile-first; battery and data usage matter.

Non-functional requirements

Freshness: a friend's position on the map should be at most ~10 s stale while they are moving.
Scale: 500 M MAU, average social graph of 200 friends.
Privacy: opting out of sharing must propagate within one heartbeat cycle.
Availability: dropped updates are acceptable (location is inherently lossy); the map should not go blank on transient failures.

What makes this different from proximity search

Before diving in, it is worth separating Nearby Friends from proximity search. Proximity search asks "find businesses or strangers near me" — a spatial index query over a fixed set of points. Nearby Friends is a social graph fanout problem: given a known set of people (your friends), stream their positions to each other. The spatial index question almost disappears; you never need to search for unknown entities by radius, because you already know exactly which user IDs you care about. The hard problem shifts entirely to delivery.

The real-time communication patterns article covers WebSocket and long-poll in detail. Here we focus on how those primitives compose at social-graph scale.

Capacity

Let's pin down numbers early so every design choice is grounded.

Dimension	Estimate	How we got there
MAU	500 M	Given
Concurrent active sessions	50 M	`500 M × 10%`
Heartbeat interval (moving)	5 s	Design target
Location writes/s	10 M/s	`50 M ÷ 5 s`
Avg friends per user	200	Given
Online friends per user (est.)	30	`200 × 15%`
Fanout per write	30 downstream pushes	= online friends
Total downstream pushes/s	300 M/s	`10 M × 30`
WebSocket connections	50 M	= concurrent active sessions
Connections per server	100 k	Large async I/O instance
Connection servers needed	500	`50 M ÷ 100 k`

Takeaway: The 30× fanout is the critical insight: the server receives 10 M location writes per second and must emit 300 M downstream messages per second. These are different tiers with different scaling axes: ingestion can be a few stateless services behind a load balancer; the push tier needs to know where each user's WebSocket lives.
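The arithmetic behind the table can be reproduced in a few lines. This is only a restatement of the estimates above; the constant names are illustrative, and percentages are kept as integers to avoid floating-point rounding.

```python
# Back-of-the-envelope capacity model using the estimates from the table.
MAU = 500_000_000
CONCURRENT_PCT = 10              # share of MAU with an active session
HEARTBEAT_INTERVAL_S = 5         # design target while moving
AVG_FRIENDS = 200
ONLINE_FRIEND_PCT = 15           # share of a user's friends online at once
CONNECTIONS_PER_SERVER = 100_000

concurrent_sessions = MAU * CONCURRENT_PCT // 100
writes_per_s = concurrent_sessions // HEARTBEAT_INTERVAL_S
fanout_per_write = AVG_FRIENDS * ONLINE_FRIEND_PCT // 100
pushes_per_s = writes_per_s * fanout_per_write
connection_servers = concurrent_sessions // CONNECTIONS_PER_SERVER

print(f"writes/s: {writes_per_s:,}")
print(f"pushes/s: {pushes_per_s:,}")
print(f"connection servers: {connection_servers}")
```

Changing any single input (for example, a 20% online-friend share) shows immediately how the push tier, not the ingestion tier, absorbs the difference.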

Building up to the design

V1: Clients poll on a timer

The simplest thing you can write in an afternoon: every 5 seconds each client calls GET /friends/locations, and the server returns the positions of all opted-in friends.

GET /friends/locations
→ { "alice": [37.781, -122.412], "bob": [37.775, -122.408], ... }

At 50 M active users polling every 5 s, that's 10 M HTTP requests per second directed at the server. Each request has to look up your friend list (Social Graph), then fetch 30 positions — 30 Redis reads. So really it is 300 M Redis reads per second triggered by clients. This isn't impossible, but you're paying a full HTTP round-trip and doing 30 cache reads per poll even when nobody moved. Every friend poll, whether or not anything changed, burns battery, radio, and backend capacity.

The fatal flaw is wasted work, not just burstiness. If the 5-second timers line up, you get a thundering herd of synchronized polls; adding jitter spreads them out, but 50 M clients polling every 5 s still present a steady 10 M req/s wall to your backend. You are running the server hot to deliver "nothing changed" answers.

V2: Server pushes over WebSocket

Flip the model. When a user opens the map view, the client opens a WebSocket to a connection server and stays connected. The server pushes location updates only when friends actually move.

Now the server is event-driven. A location write from Alice triggers a push to Alice's 30 online friends — not 30 polls. The connection count is 50 M persistent WebSockets, but each message is small (a few dozen bytes) and only sent when state changes. Battery on the viewing side improves dramatically because the radio stays mostly idle, waking only to receive the server push.

The new challenge: when a location update arrives for Alice, how does the server know which machine hosts each of Alice's friends' WebSocket connections?

V3: Channel-per-user pub/sub

This is where the design snaps into shape. Assign every user a pub/sub channel: a logical address, named by user ID, that the server can publish messages to. When Alice's friend Bob opens the map view, his connection server subscribes to Alice's channel (and to the channels of every other friend who has opted in to share with Bob). When Alice sends a heartbeat, the Fanout Service publishes to Alice's channel. Every subscriber on that channel, wherever it lives, receives the update.

Redis Pub/Sub is a common choice here: the connection server issues SUBSCRIBE location:alice_id, and the broker delivers any subsequent PUBLISH location:alice_id <payload> to all subscribers on any node. If Bob's connection server and Carol's connection server both have users watching Alice, both receive the message with a single publish.

sequenceDiagram
    participant App as Alice's phone
    participant ING as Ingestion API
    participant FS as Fanout Service
    participant SG as Social Graph
    participant BR as Pub/Sub Broker
    participant CS as Bob's Conn Server
    participant Bob as Bob's phone

    App->>ING: POST /location {lat, lng}
    ING->>ING: write to Redis location:alice
    ING->>FS: async: fanout for alice
    FS->>SG: online opted-in friends of alice?
    SG-->>FS: [bob_id, carol_id, ...]
    FS->>BR: PUBLISH location:alice {lat,lng,ts}
    BR->>CS: deliver (subscribed for alice)
    CS->>Bob: WebSocket push

The sequence makes the flow concrete. Notice that the Social Graph lookup happens once per location write, not once per friend. That's important: the social-graph cost is paid once on the write side rather than once per viewer on the read side.

V4: Full production shape

V3 has two loose ends. First, when a connection server crashes, all of its subscriptions disappear with it. Second, a viewer who has just come online needs to see where everyone is immediately, not wait for each friend's next heartbeat. The location cache solves both: on connect, the client fetches the latest position of every opted-in friend from the Redis location cache (through the backend), then subscribes to their channels for live updates. A cache miss means the friend hasn't sent a heartbeat since the TTL expired; treat that as offline.
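A minimal in-memory sketch of that on-connect fetch, assuming a cache whose entries expire after a TTL (60 seconds is the value this article uses for the location cache). `LocationCache`, `snapshot`, and the injectable clock are hypothetical names for illustration; in production this role is played by Redis, not a Python dict.

```python
import time


class LocationCache:
    """In-memory stand-in for the Redis location cache (illustrative only)."""

    def __init__(self, ttl_s=60.0, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # uid -> (expires_at, position dict)

    def write(self, uid, lat, lng, ts):
        # Every heartbeat overwrites the position and resets the TTL.
        expires_at = self._clock() + self._ttl_s
        self._entries[uid] = (expires_at, {"lat": lat, "lng": lng, "ts": ts})

    def read(self, uid):
        entry = self._entries.get(uid)
        if entry is None:
            return None
        expires_at, position = entry
        if self._clock() >= expires_at:
            # Lazy eviction: an expired entry is treated as "offline".
            del self._entries[uid]
            return None
        return position

    def snapshot(self, friend_ids):
        """On-connect fetch: positions of the friends that are still live."""
        live = {}
        for uid in friend_ids:
            position = self.read(uid)
            if position is not None:
                live[uid] = position
        return live


# Usage with a fake clock so expiry is deterministic.
now = [0.0]
cache = LocationCache(ttl_s=60.0, clock=lambda: now[0])
cache.write("alice", 37.7812, -122.4134, ts=1749100800)
cache.write("bob", 37.7750, -122.4080, ts=1749100800)
now[0] = 45.0
cache.write("bob", 37.7751, -122.4081, ts=1749100845)  # refreshes Bob's TTL
now[0] = 61.0                                           # Alice silent for 61 s
online_now = cache.snapshot(["alice", "bob", "carol"])
```

Here only Bob survives the snapshot: Alice's entry has expired and Carol never sent a heartbeat, so both are treated as offline.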

High-level architecture

flowchart TD
    PHN[Mobile client] -.heartbeat POST /location.-> LB[Load Balancer]
    LB --> ING1[Ingestion Server]
    LB --> ING2[Ingestion Server]

    ING1 --> CACHE[("Location Cache<br/>Redis<br/>location:{uid} → lat,lng,ts,TTL=60s")]
    ING1 --> KAFKA[Kafka\nlocation-events topic]

    KAFKA --> FS[Fanout Service\nN workers]
    FS --> SG[Social Graph Service\ncached opt-in lists]
    FS --> BROKER[(Pub/Sub Broker\nRedis Cluster)]

    BROKER --> CS1[Conn Server 1]
    BROKER --> CS2[Conn Server 2]
    BROKER --> CS3[Conn Server ...]

    CS1 -.WebSocket push.-> PHN2[Viewer's phone]
    CS2 -.WebSocket push.-> PHN3[Viewer's phone]

    LOOKUP[On-connect fetch] --> CACHE

    style ING1 fill:#ff6b1a,color:#0a0a0f
    style ING2 fill:#ff6b1a,color:#0a0a0f
    style FS fill:#a855f7,color:#fff
    style CACHE fill:#15803d,color:#fff
    style BROKER fill:#0e7490,color:#fff
    style SG fill:#ffaa00,color:#0a0a0f

Location ingestion pipeline

The heartbeat message

Keep it small. A full position update from phone to server:

{ "lat": 37.7812, "lng": -122.4134, "accuracy_m": 15, "ts": 1749100800 }

That's around 60 bytes as compact JSON, or half that as a binary frame. With 10 M writes per second, 60 bytes × 10 M = 600 MB/s ingest — very manageable across a few dozen ingestion servers. The ingestion server writes a single Redis hash:

HSET location:alice_id lat 37.7812 lng -122.4134 ts 1749100800
EXPIRE location:alice_id 60

The 60-second TTL means a user who stops sending heartbeats (closed app, phone locked) automatically disappears from friends' maps within a minute, with no explicit "went offline" message required.
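As a rough check on the heartbeat size estimate, the sketch below compares the compact JSON form of the sample heartbeat with one possible binary layout. The layout (`<iiHI`, coordinates as integer microdegrees) is an assumption made for illustration, not something the article specifies; a real protocol might use different field widths or a schema format such as Protocol Buffers.

```python
import json
import struct

# Hypothetical wire layout (an assumption, not from the article), little-endian:
#   int32  latitude in microdegrees
#   int32  longitude in microdegrees
#   uint16 accuracy in metres
#   uint32 Unix timestamp in seconds
HEARTBEAT_FORMAT = "<iiHI"
MICRO = 1_000_000


def encode_heartbeat(lat, lng, accuracy_m, ts):
    """Pack one heartbeat into a fixed-size binary frame."""
    return struct.pack(
        HEARTBEAT_FORMAT, round(lat * MICRO), round(lng * MICRO), accuracy_m, ts
    )


def decode_heartbeat(frame):
    """Inverse of encode_heartbeat."""
    lat_u, lng_u, accuracy_m, ts = struct.unpack(HEARTBEAT_FORMAT, frame)
    return {"lat": lat_u / MICRO, "lng": lng_u / MICRO,
            "accuracy_m": accuracy_m, "ts": ts}


sample = {"lat": 37.7812, "lng": -122.4134, "accuracy_m": 15, "ts": 1749100800}
compact_json = json.dumps(sample, separators=(",", ":")).encode()
frame = encode_heartbeat(**sample)

print(len(compact_json), "bytes as compact JSON")
print(len(frame), "bytes as a binary frame")
```

With this particular layout the frame is considerably smaller than half the JSON size; the exact saving depends on which fields and precisions the protocol keeps.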

Adaptive heartbeat rate

The mobile client measures movement with the accelerometer. When the device has been stationary for 20 seconds, it lengthens the heartbeat interval to 30 seconds. When movement is detected again, it snaps back to 5 seconds. This alone cuts average server load by a factor of 3 to 4 for a typical population that is mostly (roughly 80 to 90%) stationary at any given moment (sitting at a desk, eating lunch).
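The interval rule and the load-reduction claim can be checked with a small model. The function names are illustrative, and the model ignores the 20-second detection delay, so treat the result as an approximation.

```python
MOVING_INTERVAL_S = 5
STATIONARY_INTERVAL_S = 30
STATIONARY_AFTER_S = 20


def heartbeat_interval(seconds_since_last_movement):
    """Interval the client should use, given how long it has been still."""
    if seconds_since_last_movement >= STATIONARY_AFTER_S:
        return STATIONARY_INTERVAL_S
    return MOVING_INTERVAL_S


def load_reduction(stationary_fraction):
    """Ratio of the fixed 5 s write rate to the adaptive write rate."""
    baseline_rate = 1 / MOVING_INTERVAL_S
    adaptive_rate = (
        (1 - stationary_fraction) / MOVING_INTERVAL_S
        + stationary_fraction / STATIONARY_INTERVAL_S
    )
    return baseline_rate / adaptive_rate


for fraction in (0.5, 0.8, 0.9):
    print(f"{fraction:.0%} stationary -> {load_reduction(fraction):.2f}x fewer writes")
```

Under this model a 3x reduction corresponds to about 80% of users being stationary and a 4x reduction to about 90%.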

Why Kafka sits between ingestion and fanout

The ingestion servers write the location to Redis synchronously (that's the "source of truth" and the initial cache) and then drop an event on Kafka asynchronously. The Fanout Service reads from Kafka. This decoupling means a fanout backlog — say, during a viral event — does not slow down the ingestion path. The phone's heartbeat ACK comes back as soon as Redis is written, not after 30 downstream pushes complete.
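The decoupling can be sketched with an in-process queue standing in for the Kafka topic. This is a toy model under that assumption: `handle_heartbeat` and `drain_fanout` are hypothetical names, and a real deployment would use a Kafka producer and consumer rather than `queue.Queue`.

```python
import queue

location_cache = {}            # stand-in for the Redis location cache
fanout_events = queue.Queue()  # stand-in for the Kafka location-events topic


def handle_heartbeat(uid, lat, lng, ts):
    """Ingestion path: cache write, enqueue, ACK. No fanout work happens here."""
    position = {"lat": lat, "lng": lng, "ts": ts}
    location_cache[uid] = position                 # synchronous cache write
    fanout_events.put({"uid": uid, **position})    # hand-off to the fanout tier
    return {"status": "ok"}                        # ACK before any push is sent


def drain_fanout(publish):
    """Fanout worker: consume whatever is queued and publish each event."""
    processed = 0
    while True:
        try:
            event = fanout_events.get_nowait()
        except queue.Empty:
            return processed
        publish(event)
        processed += 1


# Three heartbeats are acknowledged while the fanout worker is "behind".
acks = [handle_heartbeat("alice", 37.78 + i * 0.001, -122.41, 1749100800 + 5 * i)
        for i in range(3)]
backlog = fanout_events.qsize()
published = []
processed = drain_fanout(published.append)
```

The point of the sketch is the ordering: every ACK is returned before `drain_fanout` runs, so a growing backlog delays pushes but never delays the phone.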

Looking up online opted-in friends

The Fanout Service worker receives an event for alice_id. It needs to answer: "which of Alice's friends have opted in to see Alice's location and are currently online?"

The opt-in list is a social-graph attribute — think of it as an edge property on the friendship graph. The Social Graph Service owns this data and can return a list of user IDs who have opted in to see Alice. This list changes rarely (opt-in is a deliberate user action), so it is cached aggressively — a local in-process LRU cache with a 60-second TTL is reasonable. The Fanout Service worker holds a warm cache of the ~5 000 users it has fanned out for most recently.

"Currently online" is a separate attribute: the Presence Service (or the connection servers themselves) maintains a set of currently-connected user IDs. The Fanout Service intersects the opt-in list with the online set before publishing. Publishing to a channel with no subscribers is harmless but wastes a Redis round-trip; skipping offline users saves 70–80% of publishes in practice (most of Alice's friends are not actively on the map at any given moment).
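The intersection step might look like the following sketch. The function names and the `publish` callback are assumptions for illustration; the opt-in list and the online set would come from the Social Graph and Presence services respectively.

```python
def fanout_targets(opted_in_viewers, online_users):
    """Viewers who may see the owner's location and are connected right now."""
    return set(opted_in_viewers) & set(online_users)


def publish_if_watched(owner_id, position, opted_in_viewers, online_users, publish):
    """Publish to the owner's channel only when at least one target exists."""
    targets = fanout_targets(opted_in_viewers, online_users)
    if not targets:
        return 0  # nobody is watching: skip the broker round-trip
    publish(f"location:{owner_id}", position)
    return len(targets)


sent = []
position = {"lat": 37.7812, "lng": -122.4134, "ts": 1749100800}

watched = publish_if_watched(
    "alice", position,
    opted_in_viewers=["bob", "carol", "dave"],
    online_users={"bob", "dave", "erin"},
    publish=lambda channel, message: sent.append((channel, message)),
)
unwatched = publish_if_watched(
    "frank", position,
    opted_in_viewers=["gina"],
    online_users={"bob"},
    publish=lambda channel, message: sent.append((channel, message)),
)
```

In this run Alice's update is published once for two live viewers, while Frank's is dropped before it reaches the broker.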

Push vs. poll: why push wins here

The argument is worth making explicitly.

Pull: each viewer polls GET /friends/locations every 5 s. At 30 online friends per user, the server does 30 cache reads per poll. With 50 M viewers, that's 300 M Redis reads per second driven by polling. The server must respond even when nothing changed.

Push: a location write triggers fanout to the ~30 online opted-in friends (we already computed that as 200 × 15% = 30). With 10 M writes per second the server emits 300 M downstream pushes per second, the same headline number as the polling scenario, but each push is tiny and only reaches the connection servers that actually care. Stationary users generate far fewer pushes (one per 30 s instead of one per 5 s) because their heartbeat interval stretches to 30 seconds; only resumed motion triggers a return to the full rate.

Push delivers lower latency (the map updates as soon as Alice moves, not when the viewer's poll timer next fires), lower battery on the viewer side (radio is idle until the push arrives), and lower aggregate server load because stationary users generate one-sixth of the traffic and nobody pays for "nothing changed" responses.

The one case where pull makes sense is when the viewer has only a few friends on the map and the update frequency is high — then the polling overhead is small and the implementation is simpler. At 500 M MAU, we are squarely in push territory.

Real-time delivery

Connection servers and how they route

A connection server is a long-lived process that:

Accepts WebSocket connections from mobile clients.
Subscribes to pub/sub channels for each user the connected client is watching.
Forwards received channel messages down the WebSocket to the client.

Fifty million WebSockets across 500 connection servers is 100 k connections per server — achievable with an async I/O server (Node.js, Go, or an async Python framework). Each connection is mostly idle (occasional small pushes), so the bottleneck is connection state memory, not CPU.

When a viewer opens the map, their client connects to any available connection server (picked via DNS round-robin or a load balancer). The connection server then subscribes to the pub/sub channels for all opted-in friends:

SUBSCRIBE location:friend1_id location:friend2_id ... location:friendN_id

When the Fanout Service publishes to location:alice_id, Redis delivers the message to every subscriber, across all connection servers, in a single pub/sub operation.

Routing a message to the right connection server

This is the key correctness question: how does the pub/sub broker know which connection server to send to? It doesn't have to. That's the elegance of the channel-per-user model. Each connection server subscribes to the channels it cares about, and the broker delivers to all subscribers. No routing table, no "which server hosts user X" lookup.

The alternative — a routing table mapping user_id to connection server — is more efficient for unicast but adds a coordination problem: the table must be updated whenever a user connects, disconnects, or their server crashes. The pub/sub model trades slightly higher broker load (all subscribers on a channel receive the message) for much simpler connection lifecycle management.
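A toy in-memory model makes the "no routing table" property visible. `Broker` and `ConnectionServer` are hypothetical classes written for this sketch; they imitate the subscribe and publish semantics of a pub/sub broker but are not a Redis client.

```python
from collections import defaultdict


class Broker:
    """Channel-per-user broker: knows subscribers per channel, nothing else."""

    def __init__(self):
        self._subscribers = defaultdict(set)  # channel -> connection servers

    def subscribe(self, channel, server):
        self._subscribers[channel].add(server)

    def publish(self, channel, message):
        receivers = list(self._subscribers.get(channel, ()))
        for server in receivers:
            server.on_message(channel, message)
        return len(receivers)  # number of subscribers that received the message


class ConnectionServer:
    def __init__(self, name, broker):
        self.name = name
        self._broker = broker
        self._watchers = defaultdict(set)  # channel -> viewer ids hosted here
        self.sent = []                     # (viewer_id, message) "WebSocket pushes"

    def open_map(self, viewer_id, friend_ids):
        for friend_id in friend_ids:
            channel = f"location:{friend_id}"
            self._watchers[channel].add(viewer_id)
            self._broker.subscribe(channel, self)

    def on_message(self, channel, message):
        for viewer_id in sorted(self._watchers[channel]):
            self.sent.append((viewer_id, message))


broker = Broker()
cs1 = ConnectionServer("cs1", broker)
cs2 = ConnectionServer("cs2", broker)
cs1.open_map("bob", ["alice"])
cs2.open_map("carol", ["alice", "dave"])

update = {"uid": "alice", "lat": 37.7812, "lng": -122.4134, "ts": 1749100855}
reached = broker.publish("location:alice", update)
nobody = broker.publish("location:erin", {"uid": "erin"})
```

The publisher never names `cs1` or `cs2`: the only state that ties Alice to those servers is the subscription each server created for itself, and it vanishes with the server.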

For very high-fanout cases (a celebrity with 50 000 opted-in friends all on the map at once), you'd cap the subscriber count per channel or apply a push throttle. We discuss this in Failure Modes.

Connection server sharding

Five hundred connection servers is a non-trivial fleet. Discovery works via a service registry (e.g. etcd or Consul): each server registers its address on startup, and the client load balancer uses the registry to pick a healthy server. Sticky sessions are not needed — the client can reconnect to any server; the new server simply re-subscribes to the relevant channels.

Geospatial layer

Does the backend need a spatial index?

Recall that Nearby Friends is a closed-world problem: you know exactly which user IDs you want positions for. You never ask "find all users within 3 km of me." You ask "give me the current position of user IDs [alice, bob, carol]." That is a set of point lookups, not a radius query.

So the answer is: no spatial index required on the backend. The Location Cache is a simple key-value store keyed by user ID. The rendering happens entirely on the client: you have a list of (lat, lng) pairs for your friends, and you plot them on a map tile layer. The phone's map SDK handles the projection.

This is a meaningful simplification compared to Uber's geohash/H3 dispatch problem or a proximity search system, where the server must answer spatial range queries against a large unknown population. Here, the server only ever does point lookups.

What the client receives and renders

Each WebSocket push is a compact delta:

{ "uid": "alice_id", "lat": 37.7812, "lng": -122.4134, "ts": 1749100855 }

The client maintains a local map from uid → (lat, lng, last_seen_ts). Each push updates one entry. The map view reads from this local map on every frame render — no network call needed for smooth animation. If a friend's last_seen_ts is more than 60 seconds old, the client dims or removes their dot (the server TTL should have already evicted the key, but the client-side age check is a belt-and-suspenders guard for stale pushes that arrived just before the user went offline).
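The client-side state can be sketched as a small class. `FriendMap` is a hypothetical name, and rejecting out-of-order pushes is an extra guard assumed here rather than something the article requires.

```python
STALE_AFTER_S = 60


class FriendMap:
    """Local uid -> (lat, lng, last_seen_ts) table kept by the viewer's client."""

    def __init__(self):
        self._friends = {}

    def apply_push(self, push):
        """Apply one WebSocket delta; returns False if it was older than known state."""
        uid = push["uid"]
        known = self._friends.get(uid)
        if known is not None and push["ts"] < known[2]:
            return False  # assumed guard: ignore out-of-order pushes
        self._friends[uid] = (push["lat"], push["lng"], push["ts"])
        return True

    def visible_dots(self, now_ts):
        """Dots to draw: friends whose last update is at most 60 s old."""
        return {
            uid: (lat, lng)
            for uid, (lat, lng, ts) in self._friends.items()
            if now_ts - ts <= STALE_AFTER_S
        }


friends = FriendMap()
friends.apply_push({"uid": "alice_id", "lat": 37.7812, "lng": -122.4134, "ts": 1749100855})
late = friends.apply_push({"uid": "alice_id", "lat": 37.7800, "lng": -122.4100, "ts": 1749100850})
fresh = friends.visible_dots(now_ts=1749100900)  # 45 s after the last push
stale = friends.visible_dots(now_ts=1749100916)  # 61 s after the last push
```

Rendering reads only from `visible_dots`, so a dot disappears on the client even if no "went offline" message ever arrives.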

Privacy and presence

Per-friend opt-in

Ghost mode and per-friend opt-in are stored as edge attributes in the Social Graph Service. The data model might look like:

sharing_permission(owner_id, viewer_id) → enabled | disabled | ghost

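When owner_id is in ghost mode, the Fanout Service skips the publish entirely. When a specific friendship pair has sharing disabled, that viewer_id is excluded from the fanout target list. This check happens in the Fanout Service at publish time, using the cached opt-in list. Opting out therefore takes effect on the next heartbeat (within 5 s when moving, 30 s when stationary), provided the permission change also invalidates the cached opt-in list. If the cache were left to expire on its 60-second TTL alone, a stale entry could keep leaking positions for up to a minute, which would break the one-heartbeat-cycle requirement.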
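One way to derive the opt-in list from those edge attributes is sketched below. Modelling ghost mode as an owner-level set and defaulting a missing edge to disabled are both assumptions made for this example; the article's data model leaves those details open.

```python
from enum import Enum


class Sharing(Enum):
    ENABLED = "enabled"
    DISABLED = "disabled"


def allowed_viewers(owner_id, friend_ids, edge_permissions, ghost_owners):
    """Friends permitted to see owner_id's location.

    edge_permissions maps (owner_id, viewer_id) -> Sharing.
    A missing edge is treated as DISABLED (privacy-safe default, assumed here).
    """
    if owner_id in ghost_owners:
        return []  # ghost mode: share with nobody
    return [
        viewer_id
        for viewer_id in friend_ids
        if edge_permissions.get((owner_id, viewer_id), Sharing.DISABLED)
        is Sharing.ENABLED
    ]


edges = {
    ("alice", "bob"): Sharing.ENABLED,
    ("alice", "carol"): Sharing.DISABLED,
    ("alice", "dave"): Sharing.ENABLED,
}
visible_to = allowed_viewers("alice", ["bob", "carol", "dave", "erin"], edges, set())
ghosted = allowed_viewers("alice", ["bob", "carol", "dave", "erin"], edges, {"alice"})
```

Here Carol is excluded by an explicit edge, Erin by the missing-edge default, and turning on ghost mode empties the list regardless of individual edges.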

Some designs offer "approximate location" sharing — the user's position is intentionally blurred to the nearest kilometer before being sent to friends. This is done at ingestion time: the ingestion server snaps the lat/lng to a grid cell before writing to the cache. Friends see "Alice is near Union Square" rather than "Alice is at 37.7876° N, 122.4074° W." The implementation is a rounding operation:

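lat, lng = 37.7876, -122.4074  # exact position from the example above
precision = 0.01  # degrees: ~1.1 km north-south, ~0.9 km east-west at 37° N
# Snap to the nearest grid point; the outer round() trims float noise from the product.
snapped_lat = round(round(lat / precision) * precision, 2)
snapped_lng = round(round(lng / precision) * precision, 2)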

Exact sharing uses the unrounded coordinates. The permission model determines which version is written to the cache or sent in the push.

Presence

The Presence Service maintains a set of currently-connected user IDs with a heartbeat TTL. The Fanout Service queries this service (or its cache) to filter the opt-in list before publishing. Presence is a classic "who is online" problem — the design-chat-system article covers this in more depth. For Nearby Friends, we need only a binary online/offline signal, not presence details like "typing" or "last seen."

Failure modes

Connection server crash

When a connection server crashes, all 100 k WebSocket connections it hosted drop simultaneously. Each mobile client implements exponential backoff and reconnects to any available server (from the load balancer). On reconnect, the new server re-subscribes to the relevant channels. Meanwhile, the Location Cache still holds the most recent positions: on re-subscription, the client fetches current positions with a pipelined batch of HMGET calls (one per friend key, since each user's position is stored in its own hash) and then resumes the live channel for updates. The worst case is a few seconds of stale data on the viewer's map; the user experience is a brief freeze followed by an update burst.

The Pub/Sub broker (Redis) is unaffected by a connection-server crash — channels simply lose their subscribers, and the next connection re-subscribes.

Stale location and the "stuck dot" problem

If a user's app crashes or their phone goes offline without a clean disconnect, the Location Cache entry ages out after 60 seconds and the connection server stops receiving pushes (there are no more publishes). The viewer's client notices the stale timestamp and greys out the dot. No explicit "went offline" signal is required.

A more subtle failure: a user's app is backgrounded by the OS and the heartbeat silently pauses. The adaptive heartbeat helps here — if the phone detects no movement, it sends a 30-second heartbeat that keeps the cache entry alive. If even that stops (e.g., airplane mode), the TTL expires and the dot fades. This is the correct behavior: the server should not keep displaying a stale position as "current."

Back-pressure with high fanout

A user with 3 000 opted-in friends all watching their location is a pathological case. Each of their heartbeats triggers 3 000 channel publishes. At 5-second intervals, the Fanout Service generates 600 publishes per second for this one user alone. Multiply across a handful of such users and the Fanout Service is doing meaningful extra work.

Mitigations:

Cap the fanout count per heartbeat: if a user has more than, say, 500 opted-in online friends, apply a random sample or a priority queue (closest friends first, per an application-level social weight). The vast majority of users never hit this limit.
Throttle the heartbeat at the ingestion layer: for users with very large online audiences, the ingestion server can coalesce back-to-back heartbeats and only pass one through to the Fanout Service every 10 s instead of every 5 s.
Horizontally scale the Fanout Service: workers are stateless and can be scaled independently of ingestion and connection servers.
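The first two mitigations can be sketched as follows. The names, the weight function, and the 500 and 10 s thresholds (taken from the examples in the list) are illustrative; a random sample would work in place of the weight-based cut.

```python
MAX_FANOUT = 500


def cap_fanout(targets, weight, limit=MAX_FANOUT):
    """Keep at most `limit` viewers, preferring the highest social weight."""
    targets = list(targets)
    if len(targets) <= limit:
        return targets
    return sorted(targets, key=weight, reverse=True)[:limit]


class HeartbeatCoalescer:
    """Forward at most one heartbeat per user every `min_interval_s` seconds."""

    def __init__(self, min_interval_s=10):
        self._min_interval_s = min_interval_s
        self._last_forwarded = {}  # uid -> timestamp of last forwarded heartbeat

    def should_forward(self, uid, ts):
        last = self._last_forwarded.get(uid)
        if last is not None and ts - last < self._min_interval_s:
            return False  # coalesced: the cache is still updated upstream
        self._last_forwarded[uid] = ts
        return True


weights = {"bob": 0.9, "carol": 0.2, "dave": 0.7, "erin": 0.4}
top_two = cap_fanout(weights, weight=weights.get, limit=2)

coalescer = HeartbeatCoalescer(min_interval_s=10)
forwarded = [t for t in (0, 5, 10, 15, 20) if coalescer.should_forward("celebrity", t)]
```

With 5-second heartbeats and a 10-second minimum, every second heartbeat reaches the Fanout Service, which halves the fanout work for that user without touching the ingestion path.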

Pub/Sub broker scaling

Redis Pub/Sub does not persist messages and does not support consumer groups. Delivery is at-most-once: a subscriber that is disconnected when a message is published never sees it, and a subscriber that falls too far behind is disconnected by Redis once its output buffer exceeds the configured limit. This is fine for location data (a missed position update is replaced by the next heartbeat in 5 seconds) but means you cannot use Redis Pub/Sub for guaranteed delivery. For Nearby Friends, best-effort delivery is exactly what you want: location data has a very short shelf life, and delivery guarantees would require persistence that adds latency and cost.

At 300 M channel publishes per second, a single Redis instance is nowhere near enough. There is a trap here worth knowing: classic Redis Pub/Sub in Cluster mode broadcasts every published message to every node regardless of which shard the channel key hashes to. That means adding cluster nodes does not reduce per-node message load: every node still handles every message, and inter-node traffic grows with the cluster. The fix introduced in Redis 7.0 is Sharded Pub/Sub (SSUBSCRIBE / SPUBLISH), which assigns each channel to a specific hash slot and confines message delivery to the shard that owns that slot, giving true horizontal scale.
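To see how a channel name maps to a hash slot, the sketch below follows the Redis Cluster key-slot rule: CRC16 (XMODEM variant) of the name, or of its `{hash tag}` if one is present, modulo 16384. The `shard_for_channel` helper assumes slots are split into equal contiguous ranges, which is a simplification; real clusters assign slot ranges explicitly.

```python
import binascii

SLOTS = 16384


def key_slot(name: str) -> int:
    """Hash slot for a key or sharded channel name."""
    data = name.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]  # hash only the non-empty tag
    # binascii.crc_hqx with initial value 0 is the CRC16-CCITT (XMODEM) variant.
    return binascii.crc_hqx(data, 0) % SLOTS


def shard_for_channel(channel: str, num_shards: int) -> int:
    """Hypothetical even split of the slot space across num_shards shards."""
    return key_slot(channel) * num_shards // SLOTS


for uid in ("alice_id", "bob_id", "carol_id"):
    channel = f"location:{uid}"
    print(channel, "slot", key_slot(channel), "shard", shard_for_channel(channel, 8))
```

Because each channel lands on exactly one shard, a connection server only needs subscriptions on the shards that own its viewers' friend channels, and adding shards spreads the publish load instead of replicating it.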

For a greenfield design at this volume the better choice is a dedicated pub/sub broker: a self-hosted NATS cluster (benchmarked at tens of millions of messages per second across a three-node cluster, no persistence required) or a horizontally partitioned Kafka topic with connection-server consumer groups per region. Both give you more operational visibility and throughput headroom than Redis Pub/Sub at extreme fan-out rates.

Storage summary

Data	Store	Why
Current location per user	Redis hash + TTL	Sub-ms reads; auto-eviction; no disk I/O
Location history	Kafka → cold storage (S3/Parquet)	Retention for analytics, not on the live read path
Social graph + opt-in flags	Graph DB or Postgres (Social Graph Service)	Complex traversals; strong consistency for privacy
Presence (online users)	Redis key per user + TTL	Fast existence check; auto-expires when heartbeats stop (Redis cannot expire individual set members)
User account data	Postgres	Durable, transactional

Session lifecycle: from app-open to map-close

stateDiagram-v2
    [*] --> Disconnected
    Disconnected --> Connecting: user opens map view
    Connecting --> Active: WebSocket established\nbatch-fetch current positions
    Active --> Streaming: subscribe to friend channels\nreceive live updates
    Streaming --> Reconnecting: network drop / server crash
    Reconnecting --> Connecting: exponential backoff
    Streaming --> Disconnected: user closes map view\nWebSocket closed
    Active --> Disconnected: auth failure

The reconnecting path is where most of the resilience lives. Exponential backoff (starting at 500 ms, capped at 30 s, with ±20% jitter) prevents thundering-herd reconnect storms when an entire connection server fails and its 100 k clients all reconnect simultaneously.
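The backoff schedule described above can be written as a single function. This is a sketch: `reconnect_delay` is a hypothetical name, and applying the jitter after the cap (so the longest wait can reach 36 s) is one reading of "capped at 30 s, with ±20% jitter".

```python
import random


def reconnect_delay(attempt, base_s=0.5, cap_s=30.0, jitter=0.2, rng=random.random):
    """Seconds to wait before reconnect attempt number `attempt` (0-based)."""
    delay = min(cap_s, base_s * (2 ** attempt))   # 0.5, 1, 2, 4, ... capped at 30
    factor = 1 + jitter * (2 * rng() - 1)         # uniform in [1 - jitter, 1 + jitter)
    return delay * factor


# With jitter neutralised (rng always 0.5) the underlying schedule is visible.
schedule = [reconnect_delay(n, rng=lambda: 0.5) for n in range(8)]
print(schedule)

# With real randomness, 100 k clients spread out instead of retrying in lockstep.
first_retries = [reconnect_delay(0) for _ in range(5)]
print(first_retries)
```

The jitter matters more than the exact base: without it, every client from the failed server would retry at 0.5 s, 1 s, 2 s and so on in synchronized waves.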

Things you should now be able to answer

Why is Nearby Friends a fanout problem and not a proximity search problem?
Why does push over WebSocket beat polling at this scale, even though the volume of downstream messages is equivalent?
How does the channel-per-user pub/sub model avoid the need for a routing table?
What determines the heartbeat interval, and how does adaptive rate help battery life?
How does ghost mode propagate, and what is the latency to opt-out?
What happens to the viewer's map when a connection server crashes?
Why does the backend not need a geospatial index?

Frequently asked questions

▸Why is Nearby Friends a fanout problem rather than a proximity search problem?

Nearby Friends operates on a closed social graph: you already know the exact user IDs you care about (your opted-in friends), so the server never performs a radius query against an unknown population. Every update reduces to a set of point lookups keyed by user ID, and the hard problem is delivering those updates to the right connection servers, not finding nearby strangers in a spatial index.

▸How many downstream messages per second does the system need to handle, and what drives that number?

At peak the system must emit 300 million downstream messages per second. This is 10 million location writes per second multiplied by a 30-way fanout: 15 percent of each user's 200 friends are concurrently online, so each heartbeat triggers roughly 30 push targets. Ingestion and fanout are separate tiers because they scale on different axes.

▸How does the channel-per-user pub/sub model eliminate the need for a routing table?

Each user gets a named pub/sub channel (for example, location:alice_id). When a viewer's connection server opens a session, it subscribes to the channels of every friend being watched. The Fanout Service publishes once to a channel and the broker delivers to all subscribed connection servers automatically, with no centralized table mapping user IDs to specific machines.

▸When should you use Redis Sharded Pub/Sub instead of classic Redis Pub/Sub for this use case?

Classic Redis Pub/Sub in Cluster mode broadcasts every published message to every node, so adding cluster nodes does not reduce per-node load: every node still handles every message. Redis 7.0 introduced Sharded Pub/Sub (SSUBSCRIBE and SPUBLISH), which pins each channel to a specific hash slot and confines delivery to the owning shard, enabling true horizontal scale. At 300 million publishes per second, a single Redis node is insufficient and Sharded Pub/Sub or a dedicated broker like NATS is required.

▸How does the adaptive heartbeat reduce battery drain, and what are the two intervals?

The mobile client reads the accelerometer and sends a heartbeat every 5 seconds while moving but drops to every 30 seconds when the device has been stationary for 20 seconds. This 6x reduction in transmit frequency cuts average server load by a factor of 3 to 4 for a population that is mostly stationary at any given moment.
