Auth, Reconnects, Heartbeats, and Presence
An authenticated WebSocket keeps running after the HTTP request that opened it has finished.
Most of the lifecycle work comes out of that. The server gets an upgrade request, checks the credential material on it, accepts the socket, and after that the original request object is stale context. The connection is still running. Messages keep arriving, tokens reach their expiry, a backgrounded tab stops sending for a while, a phone moves between networks, an idle proxy drops the connection, and the user opens the same app on a second device. Meanwhile a presence system can keep reporting that user as online because the process never cleaned up the record for a closed connection.
The transport keeps working. The work is in the state the server has to track about the connection once the upgrade is done.
Before a connection carries any application messages, the server has to decide who it belongs to. On a WebSocket that decision happens during the upgrade request. The server authenticates the handshake and accepts or rejects the connection right there. SSE and long polling make the same decision inside ordinary HTTP request handling. The timing is the same everywhere. Authentication comes before any application events flow.
Once the connection is accepted, the server attaches an identity to it. This is the subject the server controls, the fields it needs for later checks like user id, tenant id, session id, and device id. A client can keep sending messages about rooms, cursors, and commands, and the server still reads identity from the connection record it built rather than from any of those messages.
server.on('upgrade', (req, socket, head) => {
  const identity = authenticateUpgrade(req);
  if (!identity) return rejectUpgrade(socket, 401);
  wss.handleUpgrade(req, socket, head, ws => {
    bindConnection(ws, identity);
  });
});
The line that does the work is bindConnection(ws, identity). After handleUpgrade() takes the socket, the later message handlers need that identity object stored somewhere durable. A closure is enough for a small example. Once you add heartbeats, presence, reconnects, and cleanup, a connection registry is easier to work with.
Upgrade rejection still uses HTTP bytes because the protocol switch is still pending.
import http from 'node:http';
function rejectUpgrade(socket, status) {
  const reason = http.STATUS_CODES[status] ?? 'Error';
  socket.end(`HTTP/1.1 ${status} ${reason}\r\nConnection: close\r\n\r\n`);
}
The server writes a normal HTTP response and ends the socket. After a 101 Switching Protocols response, failure reporting moves to the WebSocket close mechanism. Before that response, an HTTP status code is the clean way to refuse the connection.
Authentication material can arrive a few ways, through cookies, an Authorization header from non-browser clients, a short-lived query token, or a protocol-specific first message, and each one has tradeoffs. Cookie and session-store design sits in the identity chapter, along with JWT validation and revocation. For this chapter the transport-level rule is enough. Accept the connection only after the server has built a connection identity it trusts.
A bearer token in a query string needs tight limits. A token placed in wss://host/socket?token=... lands in places you do not control. Reverse-proxy access logs, load-balancer logs, browser history, and any error report that captures the request URL all keep a copy, and those copies outlive the connection. If you have to pass a credential in the query string, issue a single-use, short-lived ticket that is only good for opening one connection, then exchange it for a server-side identity right away. The long-lived identity decision stays on the server after validation. Never put a long-lived access or refresh token in the URL.
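The single-use ticket idea can be sketched as a small in-memory store. This is a minimal sketch under stated assumptions: createTicketStore, issue, and redeem are hypothetical names, the 30-second lifetime is a placeholder, and a multi-process deployment would need a shared store instead of a Map.

```javascript
// Hypothetical single-use ticket store for opening one connection.
// `now` and `newId` are injectable so the expiry logic can be tested.
function createTicketStore({
  ttlMs = 30_000,
  now = Date.now,
  newId = () => crypto.randomUUID()
} = {}) {
  const tickets = new Map(); // ticket id -> { identity, expiresAt }

  // Called from an authenticated HTTP endpoint before the socket opens.
  function issue(identity) {
    const id = newId();
    tickets.set(id, { identity, expiresAt: now() + ttlMs });
    return id;
  }

  // Called during the upgrade. Returns the identity once, then forgets the ticket.
  function redeem(id) {
    const entry = tickets.get(id);
    if (!entry) return null;
    tickets.delete(id); // single use, even when the ticket turns out to be expired
    return now() > entry.expiresAt ? null : entry.identity;
  }

  return { issue, redeem };
}
```

Tickets that are issued but never redeemed stay in the Map in this sketch, so a real store would also need a periodic sweep or a backing store with native expiry.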
Auth on SSE and Long Polling
WebSocket makes the lifecycle awkward because the accepted socket keeps running after the HTTP upgrade. SSE and long polling stay closer to ordinary HTTP, but they still need the same auth check. The server authenticates the request, builds a connection identity or request identity, and attaches a realtime auth context to the stream or pending poll.
For SSE, the response stays open, and that open response becomes the connection-like object. It needs the same identity binding as a WebSocket.
function sse(req, res) {
  const identity = authenticateHttp(req);
  if (!identity) return sendUnauthorized(res);
  const ctx = createRealtimeContext(identity);
  openSseStream(res, ctx);
}
The HTTP request handler accepts the stream. After the headers flush, the response object is what the server writes events to for the life of the connection. Message fanout, presence, and cleanup all read identity from ctx. Query parameters go back to being request metadata once the connection is accepted.
The browser EventSource API cannot set request headers, so there is no way to send an Authorization: Bearer header on a native SSE connection. That leaves two workable options. You rely on cookies that the browser attaches automatically, or you pass a short-lived ticket in the query string and validate it inside authenticateHttp. If you choose cookies, you inherit the same cross-site exposure as any cookie-authenticated endpoint, so SameSite and the origin checks apply here too.
Long polling repeats the check on every poll request, because each poll is a new HTTP request. The cursor tells the server where the client wants to resume event delivery, but the cursor carries no authority by itself. The current request still needs authentication, and the authenticated identity decides which events the cursor may read.
async function poll(req, res) {
  const identity = authenticateHttp(req);
  if (!identity) return sendUnauthorized(res);
  const ctx = createRealtimeContext(identity);
  sendEvents(res, await waitForEvents(ctx, req.query.cursor));
}
The check prevents a common long-poll bug where the server treats the cursor as proof that the request is allowed to read the stream. The current request still has to authenticate, and the resulting identity decides which events are readable. The cursor only picks the starting point.
Cleanup works differently for long polling. A pending request can time out, finish with events, abort because the browser navigated away, or get cut by a proxy. Each of those has to remove the pending request from whatever in-process list is waiting to deliver events. Presence should wait until the poll cadence goes stale, since the client often sends the next poll immediately. A steady cadence of accepted polls plus heartbeats is better evidence of liveness than a single response ending.
SSE falls between WebSocket and long polling. There is one accepted response, so the auth context can be stored on that response record. Reconnect uses Last-Event-ID, which sends the next request back through the HTTP handler, where it authenticates again. The old SSE response closing and the new one opening are two separate lifecycle events, even though the browser presents them as one continuous stream.
The Auth Context After the Request Ends
The server keeps an auth record for the connection after the HTTP request is gone. Call it the realtime auth context. It holds the connection identity plus the claims that message handlers will need, things like tenant, scopes, plan, role labels, expiration time, and which auth source produced them. Keep it small and explicit.
const ctx = {
  connectionId: crypto.randomUUID(),
  userId: identity.userId,
  tenantId: identity.tenantId,
  scopes: identity.scopes,
  expiresAt: identity.expiresAt
};
The examples here use epoch-millisecond numbers for expiration fields. ctx.expiresAt and ctx.closeAfter should stay numeric if later code compares them with Date.now().
Every later message goes through this context. A join-room message checks ctx.userId and ctx.tenantId, a subscribe message checks ctx.scopes, and a presence update takes ctx.userId straight from the connection record, because that is the value the server trusts.
Treat any userId, tenantId, or role that arrives inside a message body as untrusted input rather than identity. A client can put anything in a payload, so an identity field there is just data the client chose to send. The only identity the server may act on is the one it attached to the connection at handshake time and stored on the auth context. Reading the actor from the message instead of the connection is how a client ends up acting as another user, and it slips in the moment a handler starts pulling fields out of msg for convenience.
ws.on('message', raw => {
  let msg;
  try {
    msg = JSON.parse(raw);
  } catch {
    return ws.close(1008, 'invalid message');
  }
  // Optional chaining covers a payload that parses to null.
  const handler = handlers.get(msg?.type);
  if (handler) handler(ctx, msg);
});
In that handler the trusted and untrusted inputs are easy to tell apart. Malformed JSON closes the socket before any dispatch. The server passes ctx and the client passes msg, so every handler has both, and the trusted identity source is clear to anyone reading the code.
Treat the connection's claims as a snapshot. If a user loses a permission while the socket stays open, the connection keeps the stale claims until the server reauthorizes or closes it. Some systems accept that for a short window, others run a live check on every message. Which one you pick is part of the API contract, and it should be written down, because it changes how the server fails.
One rule keeps this manageable. Every accepted connection has one auth context, every auth context has an expiration policy, and every message handler reads from that context. The moment code starts pulling user ids out of individual messages, presence and authorization start to disagree.
Per-message checks are still normal. The connection identity is who the actor is, the message is what they are asking to do, and the handler checks that action against the auth context and the resource named in the message.
function subscribe(ctx, msg) {
  if (!ctx.scopes.includes('rooms:read')) return deny(ctx, msg);
  const room = `${ctx.tenantId}:${msg.roomId}`;
  joinRoom(ctx.connectionId, room);
}
Here the check stays attached to the action. A WebSocket should not become a connection where every later message inherits every permission. Authorization models are Chapter 24's topic. The realtime server still needs one habit. Each command reads the current auth context at the point where it acts.
The auth context also needs a version at the API-contract level. A deploy might add a scope, rename a tenant claim, or start requiring a device id for presence. Long-lived clients stay connected across deploys, so handlers should either tolerate an old context during a rollout or close those connections with a documented code. An assumption baked into ctx can break only the oldest sockets, and that makes the bug hard to reproduce.
A small authVersion field on the context handles this. When a deploy needs a new auth format, the server closes the older contexts with a reauth code and lets those clients reconnect. The migration logic stays in one place instead of in every message handler.
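The version gate can be sketched as one function that runs before dispatch. This is a sketch under stated assumptions: CURRENT_AUTH_VERSION, the private close code 4002, and ensureCurrentAuthVersion are hypothetical, and ws is assumed to expose the close(code, reason) method that server-side WebSocket libraries provide.

```javascript
// Hypothetical values: the auth format this deploy expects and a private
// close code that the API contract documents as "reconnect to reauthorize".
const CURRENT_AUTH_VERSION = 3;
const CLOSE_REAUTH_REQUIRED = 4002;

// Returns true when the context matches the current format. Otherwise it
// closes the socket so the client reconnects and gets a fresh context.
function ensureCurrentAuthVersion(ws, ctx) {
  if (ctx.authVersion === CURRENT_AUTH_VERSION) return true;
  ws.close(CLOSE_REAUTH_REQUIRED, 'auth format upgraded');
  return false;
}
```

Calling it once at the top of the message dispatcher keeps individual handlers free of version branches.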
The Origin Check on Browser Handshakes
When a browser opens a WebSocket, it attaches an Origin header naming the page that created the socket. The server reads that header and compares it against an allowed-origin policy before accepting the upgrade. That comparison is the origin check.
const allowedOrigins = new Set(['https://app.example.com']);
function acceptsOrigin(req) {
  const origin = req.headers.origin;
  return typeof origin === 'string' && allowedOrigins.has(origin);
}
Run that check before handleUpgrade(). A browser can open a WebSocket to any origin, and the handshake includes Origin, so the server has what it needs to decide. CORS preflight is a separate request flow handled in the API security chapter. A WebSocket server still needs its own origin policy.
The browser same-origin policy gives you nothing on a WebSocket. A ws:// or wss:// handshake has no CORS preflight, so a page on any origin can open a socket to your server, and the browser will attach your site's cookies to that handshake whenever the cookie policy allows it. That attack has a name, Cross-Site WebSocket Hijacking, or CSWSH. A malicious page authenticates as your logged-in user over a socket you never meant to accept. Validating the Origin header at the upgrade is the main server-side defense against it, so treat the check as required for any cookie-authenticated WebSocket.
Non-browser clients are a separate case. Some send no Origin at all, some send one anyway. A public machine-to-machine endpoint might allow a missing origin and lean on bearer authentication. A browser-only app might reject every missing origin. Pick one policy and make it explicit at the upgrade.
server.on('upgrade', (req, socket, head) => {
  if (!acceptsOrigin(req)) return rejectUpgrade(socket, 403);
  const identity = authenticateUpgrade(req);
  if (!identity) return rejectUpgrade(socket, 401);
  wss.handleUpgrade(req, socket, head, ws => bindConnection(ws, identity));
});
The origin check runs first because it is cheap, so a disallowed origin is rejected before the more expensive credential check. Authentication still decides who the connection belongs to. Both run before any WebSocket code takes the socket.
Skipping the origin check often works in local tests because local clients are trusted by habit. Production browser traffic comes from many more places. A stale tab, an embedded page, or a compromised origin can still cause the browser to send cookies during a WebSocket handshake if the cookie policy permits it. The WebSocket server should decide which origins get to attempt that handshake.
The origin check should run before subprotocol negotiation and room subscription setup. The request is still cheap at that point, with no user-level state allocated, no heartbeat timer started, and no presence record created. Rejection is just an HTTP response and a closed socket.
Allowed-origin policy should be exact. A suffix rule such as "anything ending in example.com" can admit names the application team never meant to trust. A development environment can keep its own allowed list, and a preview deployment can mint its own expected origin. The runtime check should compare concrete origins, including scheme and host.
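The difference between a suffix rule and an exact rule is easy to see on a few sample values. This is an illustrative sketch: the origins are made-up examples, and acceptsBySuffix exists only to show the failure mode.

```javascript
const allowedOrigins = new Set(['https://app.example.com']);

// Suffix rule, shown only to illustrate the failure mode.
function acceptsBySuffix(origin) {
  return origin.endsWith('example.com');
}

// Exact rule: the whole origin string must match an allowed entry.
function acceptsExactly(origin) {
  return allowedOrigins.has(origin);
}

for (const origin of [
  'https://app.example.com',  // the intended origin
  'https://evil-example.com', // unrelated domain that shares the suffix
  'http://app.example.com'    // right host, wrong scheme
]) {
  console.log(origin, acceptsBySuffix(origin), acceptsExactly(origin));
}
```

The suffix rule accepts all three values, while the exact rule accepts only the first.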
Reverse proxies can complicate origin handling when they rewrite host headers or terminate TLS. The WebSocket server should still read the browser's Origin header as the browser supplied it, then compare that value to the public origins the application supports. Forwarded headers and proxy trust are covered in the HTTP and platform chapters. Origin policy at the upgrade stays a local realtime decision.
Token Expiry on an Open Connection
On a long-lived connection, credential expiry becomes something that happens at runtime. A normal HTTP request finishes before most tokens expire. A WebSocket can stay open long enough that the credential it was opened with expires while the socket is still live.
When that happens the server has three usual moves: close the connection at expiry, ask the client to reconnect with fresh credentials, or reauthorize the existing connection with a token refresh message.
An expired token does nothing to the socket on its own. The transport stays open and messages keep flowing until your code acts. If you want expiration to mean something, the server has to enforce it. Hold expiresAt on the connection record and let a timer or the sweep loop close or reauthorize the connection once the deadline passes. Waiting for the client to do the right thing leaves authenticated sockets running on credentials that expired minutes ago.
To refresh in place, the client sends an application message that carries new credential material over the connection that is already open. Give it a narrow type and a narrow response. The server validates the message, updates the realtime auth context, and either accepts the new context or closes the connection.
ws.send(JSON.stringify({
  type: 'auth.expiring',
  expiresAt: ctx.expiresAt
}));
The server uses that message to warn the client that the connection's auth is close to expiry. The client still has to get fresh credential material from the auth system. Token issuance is Chapter 24's topic, and the realtime server only consumes the result.
Client-side code sends the refresh message back over the same socket.
ws.send(JSON.stringify({
  type: 'auth.refresh',
  token: freshAccessToken
}));
Keep the refresh message plain, just a message type plus the credential material. Do not fold it into a normal subscribe or publish command. Auth state changes should be easy to see in logs, tests, and review.
if (msg.type === 'auth.refresh') {
  // Assumes verifyRealtimeToken returns null when validation fails.
  const next = verifyRealtimeToken(msg.token);
  if (!next) return ws.close(1008, 'reauthorization failed');
  ctx.scopes = next.scopes;
  ctx.expiresAt = next.expiresAt;
}
The server reauthorizes the connection in that step. It takes an open connection, validates the new credentials, and replaces or narrows the auth context on it. If validation fails, close with a policy code. 1008 covers many policy violations. Teams often reserve private close codes in the 4000-4999 range too, such as 4001 for expired auth. Document the code in the API contract, because reconnect behavior depends on it.
These examples use server-side ws objects. Node's browser-compatible global WebSocket follows the browser close API, which accepts client-initiated close codes 1000 and 3000-4999. Server-side libraries can send protocol policy codes such as 1008 and restart codes such as 1012.
Reauthorization should preserve identity. A refresh token for user A should keep the connection under user A. If the fresh credential belongs to user B, the server should close and force a new connection. A single socket changing users mid-flight makes presence, room membership, and audit trails harder to reason about.
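The identity-preservation rule can be sketched as a guard that runs before the context is updated. This is a sketch under stated assumptions: applyRefresh is a hypothetical helper, and next stands for the claims of a token that has already passed verification.

```javascript
// Applies already-verified claims to an existing auth context.
// Refuses the update when the new credential names a different subject.
function applyRefresh(ctx, next) {
  if (next.userId !== ctx.userId || next.tenantId !== ctx.tenantId) {
    return { ok: false, reason: 'subject-changed' };
  }
  ctx.scopes = next.scopes;
  ctx.expiresAt = next.expiresAt;
  return { ok: true };
}
```

A caller would close the connection when ok is false, so the different user has to open a new socket and gets a new lifecycle record.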
Auth expiry and the send queue interact too. A connection that has fallen behind can receive its auth.expiring message late, because that message waits behind application events in the queue. A server that needs tight expiry should schedule the close against server time, since the client acknowledgement can arrive late. The refresh message updates state before the deadline, but the timer is what enforces the deadline.
Make grace periods explicit. A short grace period absorbs clock skew and network delay. A grace period that runs too long turns expiry into a suggestion. The server record can hold both values when the product needs them, expiresAt for the credential and closeAfter for the server's final deadline.
ctx.expiresAt = next.expiresAt;
ctx.closeAfter = next.expiresAt + AUTH_GRACE_MS;
ctx.authVersion = next.version;
The refresh code updates the context in one place, and the deadline sweep reads those same fields later. Message handlers can also check ctx.expiresAt for sensitive commands when the API contract wants stricter behavior near expiry.
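A deadline check that reads expiresAt and closeAfter might look like the following. This is a minimal sketch, not a definitive implementation: AUTH_WARN_MS, the warned flag, and the return labels are hypothetical, and the record shape (c.ws, c.auth, c.closingReason) is an assumption about how the application stores connections.

```javascript
const AUTH_WARN_MS = 60_000; // hypothetical lead time for the expiry warning

// Evaluates the auth deadlines of one connection record against `now`.
// Assumes c.auth.expiresAt and c.auth.closeAfter are epoch milliseconds.
function checkAuthDeadline(c, now) {
  if (c.closingReason) return 'closing'; // a close is already in progress
  if (now >= c.auth.closeAfter) {
    c.closingReason = 'auth';
    c.ws.close(4001, 'reauthorization required');
    return 'closed';
  }
  if (!c.warned && now >= c.auth.expiresAt - AUTH_WARN_MS) {
    c.warned = true; // warn once per credential
    c.ws.send(JSON.stringify({ type: 'auth.expiring', expiresAt: c.auth.expiresAt }));
    return 'warned';
  }
  return 'ok';
}
```

In this sketch the refresh path would also need to reset warned to false after a successful refresh, so the next credential gets its own warning.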
Server-driven reauthorization is a good middle ground. The server sends auth.expiring, accepts auth.refresh, and keeps the socket open when validation passes. When validation fails, it closes with a code that tells the client what to do next. Expired auth should send the client to refresh its credentials. Invalid auth should stop the realtime loop until the main application session recovers.
function closeForAuth(c, code) {
  c.closingReason = 'auth';
  c.ws.close(code, 'reauthorization required');
}
The close reason stays internal to the server, and the close code is the part the client sees. Keep both. Internal cleanup reads closingReason, and client reconnect code reads the close code.
The Reconnect Contract
When a connection fails or closes, the client tries to open a new one, again and again until it succeeds. That is the reconnect loop, and each try is a reconnect attempt. The delay between attempts is the backoff, and it usually grows after repeated failures.
General retry theory is in the resilience chapter. A realtime API needs its own small contract on top of that. When the connection drops, the contract decides which close codes retry, which stop, what state the client shows on the next attempt, and when the client has to reload state through an HTTP API.
for (let attempt = 0; ; attempt++) {
  try {
    return await openRealtimeSocket({ resumeToken, cursor });
  } catch {
    await sleep(delayForAttempt(attempt));
  }
}
The loop carries two pieces of state, a resume token and a cursor. The cursor records which events the client has already processed. The resume token names which recent connection session the server is allowed to resume. Both are application-defined, and both expire.
A reconnect attempt should create a fresh transport. It should also carry the last known event cursor when the transport supports replay. SSE has Last-Event-ID, long polling has a poll cursor, and WebSocket needs an application message or query parameter because the protocol has no built-in event cursor field.
ws.on('open', () => {
  ws.send(JSON.stringify({
    type: 'resume',
    token: resumeToken,
    cursor: lastSeenEventId
  }));
});
With that message the client asks the server to attach the new socket to recent session state and replay any events after lastSeenEventId. The server can accept it, reject it, or send the client into a full resync instead.
Reconnect and resume are different events. A reconnect means the transport went away and the client is opening a new one. A resume means the server still has enough session and event history to pick up from recent state without a full reload. A client can reconnect at any time the service is reachable. It can only resume inside a bounded window.
Close codes should drive the reconnect loop. Some closes are retryable, some mean the credentials need a refresh, and some mean the client should stop because it broke the contract. If the client retries on every close without reading the code, it produces reconnect storms and it hides auth failures.
if (close.code === 4001) {
  await refreshCredentials();
  return scheduleReconnect();
}
if (retryableCloseCodes.has(close.code)) scheduleReconnect();
Code 1006 means the connection closed abnormally as seen by the local endpoint, and the value is generated locally rather than sent by the peer. A client can still treat 1006 as retryable, since it usually means the network connection dropped. Codes like 1008 or private auth codes need handling defined by the contract.
Backoff should be small enough to recover quickly and large enough to protect the server after a fleet-wide disconnect. The detailed timing math comes later. What the realtime contract needs is the categories. The first retry is fast, repeated attempts slow down, an auth failure triggers a credential refresh, and a terminal policy failure stops the loop.
Add randomized jitter to every backoff delay. When a deploy drops a whole fleet at once, clients that share the same fixed backoff schedule reconnect in synchronized waves and all hit the server at each interval. Spreading each client's delay across a random range spreads those reconnects out over time, so a fleet-wide reconnect arrives as a gradual rise in load instead of a burst that can take the server down on the first retry.
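One common way to combine growing backoff with jitter is to pick each delay uniformly between zero and an exponentially growing ceiling. The sketch below assumes that approach: the 500 ms base and 30 second cap are placeholder values, and the random parameter exists only so the function can be tested with a fixed source.

```javascript
// Exponential backoff with full jitter.
// attempt 0 draws from [0, baseMs), attempt 1 from [0, 2 * baseMs), and so on,
// with the ceiling never exceeding capMs.
function delayForAttempt(attempt, { baseMs = 500, capMs = 30_000, random = Math.random } = {}) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Because every client draws its own random value, two clients that disconnect at the same instant are unlikely to retry at the same instant, and the cap keeps late attempts from waiting indefinitely.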
The server has its own work during a reconnect storm. When a deploy restarts a whole fleet and every browser reconnects at once, upgrade authentication and room restoration both turn bursty. For a planned restart the server can send a retry hint before it closes. For an application-level rejection it can include a retryAfterMs field in the response.
ws.send(JSON.stringify({
  type: 'server.restarting',
  retryAfterMs: 1500
}));
The hint is advisory, and clients should still apply their own limits. What it buys you is an explicit branch in the protocol. A service restart, an expired-auth close, and a failed resume each lead the client to do something different.
Reconnect needs a fresh auth check too. A resume token identifies recent connection state, but the new transport still has to authenticate on its own. A browser WebSocket might send current cookies plus the resume token. A bearer-token client might send a fresh access token plus the resume token. The server binds the resumed state only after both checks pass.
Resume Windows
The server issues a session resume token, an opaque value the client uses to reconnect to recent connection state. It should point at or protect a server-side resume record, it should expire, and it should be scoped to the authenticated subject that received it.
The server only keeps enough state to resume for a limited time or event range. That range is the resume window. The stored state might include the last connection identity, the room subscription set, a cursor range, and a small replay buffer. Once the window closes, the server rejects the resume and tells the client to resync.
t0  connection accepted, resume token issued
t1  client receives event cursor 418
t2  socket drops
t3  reconnect presents token and cursor 418
t4  server replays 419..current or rejects resume
The timeline has two separate tests. The resume token has to still be valid, and the server has to still hold events after the cursor the client sent. Passing only one of them is not enough to recover.
const record = resumes.get(token);
if (!record || Date.now() > record.expiresAt) {
  return sendFullResyncRequired(ws);
}
The first check covers token lifetime. The next one compares the client's cursor against the replay buffer. If the client's cursor is 418 and the buffer only starts at 500, the server is missing events 419 through 499. The correct response is a full resync, because a partial replay would hide the gap.
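The cursor test against the replay buffer can be sketched as a pure function. This sketch assumes events carry consecutive integer ids and that the buffer is an array ordered oldest first. The name eventsAfter and the null-for-resync convention are hypothetical.

```javascript
// Returns the events to replay after `cursor`, or null when the server
// cannot prove the replay is complete and a full resync is required.
function eventsAfter(buffer, cursor) {
  if (buffer.length === 0) return null; // nothing retained, cannot rule out a gap
  const oldest = buffer[0].id;
  const newest = buffer[buffer.length - 1].id;
  if (cursor > newest) return null;     // cursor is ahead of anything the buffer holds
  if (cursor < oldest - 1) return null; // events between cursor and oldest were evicted
  return buffer.filter(event => event.id > cursor);
}
```

A cursor equal to the newest id yields an empty array, which means the client is already caught up and no resync is needed.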
Resume tokens need a duplicate connection policy too. When a reconnect succeeds while the old socket is still half-open, two connections hold the same session for a short time. The server has to decide what happens: it can replace the old connection, allow both connections, allow both but make one passive, or reject the new connection while the old one is alive.
Each policy behaves differently. Replacing the old connection clears stale sockets quickly and suits single-tab products. Allowing duplicates fits multi-tab or multi-device products, but then presence has to count per connection. A passive duplicate can receive state while it holds back from publishing commands. Rejecting duplicates keeps a single connection per session, but a user on an unstable network can get stuck behind a dead socket until heartbeat cleanup runs.
Store the policy next to the connection state. Most reconnect bugs come from state that was valid for an old socket and is still trusted by a new one.
Room and subscription restoration is part of the resume contract too. The server can restore the previous subscriptions automatically, make the client resubscribe, or restore only a subset. Automatic restore is smoother for the client, but the server has to keep recent subscription state to do it. Client resubscribe is simpler for the server, though the client can miss events for a moment unless it pairs the resubscribe with an event cursor.
The resume response should tell the client which of these happened.
ws.send(JSON.stringify({
  type: 'resume.ok',
  restoredRooms: restored,
  cursor: currentCursor
}));
The resume.ok message gives the client a new baseline. If restoredRooms leaves out a room, the client can send a subscribe command for it. If the server replies with resume.required or resume.rejected instead, the client should fetch state through the regular API and restart from the cursor it gets back.
Treat session resume tokens as short-lived connection-recovery secrets. They can be stolen from memory or logs like any other token. Bind them to the user and tenant, and rotate them when a resumed connection succeeds. Token design is the identity chapter's topic. What follows from it is this chapter's. A resume token should only reopen recent realtime state for the same authenticated subject.
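Binding a resume token to its subject and rotating it on success could look like the following sketch. Everything here is an assumption for illustration: createResumeStore, the one-minute lifetime, the record fields, and the in-memory Map, which a multi-process deployment would replace with shared storage.

```javascript
// Hypothetical resume store. `now` and `newToken` are injectable for tests.
function createResumeStore({
  ttlMs = 60_000,
  now = Date.now,
  newToken = () => crypto.randomUUID()
} = {}) {
  const records = new Map(); // token -> { userId, tenantId, rooms, expiresAt }

  function issue(ctx, rooms) {
    const token = newToken();
    records.set(token, {
      userId: ctx.userId,
      tenantId: ctx.tenantId,
      rooms,
      expiresAt: now() + ttlMs
    });
    return token;
  }

  // `ctx` is the auth context of the new, already authenticated connection.
  function resume(token, ctx) {
    const record = records.get(token);
    if (!record) return null;
    if (now() > record.expiresAt) {
      records.delete(token);
      return null;
    }
    // The token only reopens state for the subject that received it.
    if (record.userId !== ctx.userId || record.tenantId !== ctx.tenantId) return null;
    records.delete(token); // rotate: the presented token stops working
    return { rooms: record.rooms, token: issue(ctx, record.rooms) };
  }

  return { issue, resume };
}
```

The resumed connection receives a new token in the result, so a copy of the old token that leaked into a log cannot be replayed after a successful resume.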
One Lifecycle Record Per Connection
The clearest way to hold everything in this subchapter together is one lifecycle record per accepted connection. That record is the single server-side place that holds the socket reference, the auth context, the liveness timestamps, the presence state, and the cleanup handles. The exact object structure is application code. The rule about what goes where should stay simple.
connections.set(ctx.connectionId, {
  ws,
  auth: ctx,
  lastSeen: Date.now(),
  lastHeartbeatAt: Date.now(),
  presence: 'online'
});
The record is the source of truth for the process. auth is the identity the connection is bound to. lastHeartbeatAt is the last time the application got a heartbeat from the peer. lastSeen is the last time the user did something the server counts as activity. presence is what the product currently publishes about availability, and the socket object carries the transport state.
The event loop only runs callbacks, and the lifecycle record is the shared state those callbacks read and write. A message callback updates lastSeen. A heartbeat callback updates lastHeartbeatAt. A close callback deletes the record. An auth timer closes or reauthorizes the socket. A presence timeout moves a quiet connection to away or offline. Every one of them touches the same record.
The bug to avoid here is split state. Say Map A tracks which connections are authenticated, Map B tracks which users are present, and Map C tracks which heartbeat timers exist. The socket closes abruptly. One cleanup function deletes the Map A entry, another forgets Map B, and presence reads online until a separate sweep catches it, or it never gets caught. Now the process has a state leak, and in the later fanout chapter that same mistake becomes a cross-process leak.
Keep the state transitions explicit.
accepted -> authenticated -> active
active -> refreshing -> active
active -> closing -> closed
active -> stale -> closed
Those labels are application states, and they are separate from the WebSocket ready state. Ready state is where the protocol object is in its open, closing, or closed lifecycle. The application lifecycle is whether the server still trusts the auth context, still gets heartbeats, and still publishes presence.
Timers need the same discipline. A naive version creates a setInterval() for every socket, plus a token-expiry timer, plus a presence timer. That is fine at small counts and gets noisy as the connection count grows. A process holding 50,000 sockets with several timers each is holding a lot of timer handles, and each one is something else to clean up. Node can run that many timers, but the application still has to clear every one on close.
One sweep interval per process avoids most of that. The sweep walks the connection registry and checks each deadline against Date.now(), closing expired auth contexts, marking heartbeat timeouts, and updating stale presence as it goes. That is one interval over one registry instead of thousands of separate timers. Per-connection timers are still worth it for narrow deadlines, but the sweep keeps the lifecycle policy in one place you can read.
setInterval(() => {
  const now = Date.now();
  for (const c of connections.values()) {
    checkAuthDeadline(c, now);
    checkHeartbeatDeadline(c, now);
    checkPresenceDeadline(c, now);
  }
}, 1000);
The interval only evaluates the server's deadlines. The actual network work stays in the functions it calls, things like closing a socket, publishing a presence change, or marking a record stale. Passing now in as an argument also makes the policy easy to test with an injected time.
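One of the functions the sweep calls might be sketched like this. The 45-second timeout and the state field are hypothetical, and the sketch assumes the socket object exposes terminate(), which the ws library provides for dropping a connection without waiting for a close handshake.

```javascript
const HEARTBEAT_TIMEOUT_MS = 45_000; // hypothetical timeout

// Marks a record stale and drops its socket once heartbeats stop arriving.
// Returns true only on the call that performs the transition.
function checkHeartbeatDeadline(c, now) {
  if (c.state === 'stale') return false; // already handled by an earlier sweep
  if (now - c.lastHeartbeatAt <= HEARTBEAT_TIMEOUT_MS) return false;
  c.state = 'stale';
  c.ws.terminate(); // a dead peer will never answer a close frame
  return true;
}
```

Because now is an argument, a test can drive the function with fixed timestamps instead of waiting for real time to pass.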
Lifecycle state should measure elapsed time with a monotonic clock even when it also stores wall-clock timestamps. Date.now() is fine for timestamps you persist or publish, such as lastSeen. To measure how much time has passed inside one process, performance.now() avoids the jumps a wall clock takes when the system time gets adjusted. Plenty of services use Date.now() anyway because their tolerances are wide. When heartbeat timeouts are tight, use the monotonic clock for the elapsed checks and convert to wall time only for the values a user sees.
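Keeping both clocks on the record can be as small as two helpers. This is a sketch with hypothetical names: lastHeartbeatMono is an added field that holds the monotonic reading, while lastHeartbeatAt stays a wall-clock timestamp for anything that gets stored or published.

```javascript
// Records a heartbeat on both clocks.
function touchHeartbeat(c) {
  c.lastHeartbeatMono = performance.now(); // monotonic, for elapsed checks
  c.lastHeartbeatAt = Date.now();          // wall clock, for display and storage
}

// Elapsed milliseconds since the last heartbeat, unaffected by clock changes.
function heartbeatAgeMs(c, nowMono = performance.now()) {
  return nowMono - c.lastHeartbeatMono;
}
```

The monotonic value is only meaningful inside the process that produced it, so it should not be persisted or sent to clients.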
Cleanup needs to be idempotent, meaning it can run more than once on the same record without doing damage. The close callback, the heartbeat sweep, auth expiry, and process shutdown can all reach the same connection, so the cleanup function has to tolerate repeated calls. Delete from the registry first, or mark the record closing first, then clear handles and publish final state. That ordering stops a second callback from treating an already-closing connection as active.
function cleanup(c, reason) {
  if (!connections.delete(c.auth.connectionId)) return;
  c.ws.removeAllListeners();
  updatePresenceAfterClose(c, reason);
}
removeAllListeners() is blunt, and real code usually removes only the listeners it added itself. The point is the structure. One exit function does the registry deletion and the presence update, and every failure case goes through that one function.
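A narrower variant keeps references to the handlers it attached and detaches only those. The sketch below assumes a socket object with EventEmitter-style `on` and `off`, a hypothetical `bindHandlers` helper, and a presence callback passed in as `updatePresence` (standing in for `updatePresenceAfterClose`).

```javascript
const connections = new Map();

// Attach handlers and remember them so cleanup can detach exactly these.
function bindHandlers(c, onMessage, onClose) {
  c.handlers = { message: onMessage, close: onClose };
  c.ws.on('message', onMessage);
  c.ws.on('close', onClose);
}

// Safe to call from the close callback, the sweep, and shutdown alike.
function cleanup(c, reason, updatePresence) {
  // Map#delete returns false when the key is already gone, so a second
  // caller stops here before touching listeners or presence.
  if (!connections.delete(c.auth.connectionId)) return false;
  for (const [event, fn] of Object.entries(c.handlers || {})) {
    c.ws.off(event, fn);
  }
  updatePresence(c, reason);
  return true;
}
```

Listeners added by the library or by other modules stay in place, and the boolean return value lets a caller tell whether it was the one that actually performed the cleanup.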
The lifecycle record also gives reconnect code one place to handle old and new sockets together. When a new socket resumes a session, the server marks the old record closing, moves the resumable state into the new record, and publishes at most one presence change. If the old socket emits close later, idempotent cleanup finds that the record has already moved or is gone.
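As a rough sketch of that handoff, assuming hypothetical record fields `cursor`, `rooms`, and `state`, and a `publish` callback standing in for the presence publisher:

```javascript
// Move a session from an old socket's record to a new one.
// `connections` is keyed by connection id.
function resumeConnection(connections, oldRecord, newRecord, publish) {
  // Later callbacks on the old socket see that it is no longer active.
  oldRecord.state = 'closing';
  connections.delete(oldRecord.auth.connectionId);

  // Carry over what a resume needs; the field names are illustrative.
  newRecord.cursor = oldRecord.cursor;
  newRecord.rooms = oldRecord.rooms;
  const previous = oldRecord.presence;
  newRecord.presence = 'online';
  connections.set(newRecord.auth.connectionId, newRecord);

  // At most one presence change: publish only when the visible state differs.
  if (previous !== 'online') publish(newRecord.auth.userId, 'online');
  return newRecord;
}
```

When the old socket's close event eventually fires, a cleanup guarded by registry deletion finds nothing under the old connection id and returns without publishing.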
Presence, heartbeats, and auth expiry look like three systems on a diagram. Inside a Node process they are callbacks writing to the same connection registry. Handle them as one lifecycle and there are only a handful of edge cases to track.
SSE records are the same with one field changed. The record points at res instead of ws, and cleanup listens for request abort or response close. Long polling records are shorter-lived. A pending poll record usually holds the response, the auth context, a cursor, a deadline, and an abort cleanup function. The rule is the same across all three. One record holds each pending realtime unit.
pendingPolls.set(requestId, {
  res,
  auth: ctx,
  cursor,
  deadline: Date.now() + POLL_TIMEOUT
});
When an event arrives, the server completes the matching polls and deletes their records. A timeout sends an empty response and deletes the record. A client abort just deletes the record. All three end at the same registry.
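The three exits can be sketched as follows. The event shape (`seq`), the JSON body, and the choice of a 200 status with an empty batch on timeout are assumptions for illustration; `res` is expected to behave like a Node `http.ServerResponse`.

```javascript
const pendingPolls = new Map();

// Finish one poll exactly once: the record is removed before the response is written.
function finishPoll(requestId, status, body) {
  const poll = pendingPolls.get(requestId);
  if (!poll) return false; // already completed, timed out, or aborted
  pendingPolls.delete(requestId);
  poll.res.writeHead(status, { 'Content-Type': 'application/json' });
  poll.res.end(JSON.stringify(body));
  return true;
}

// An event completes every poll whose cursor is behind it.
function onEvent(event) {
  for (const [id, poll] of pendingPolls) {
    if (poll.cursor < event.seq) finishPoll(id, 200, { events: [event] });
  }
}

// The sweep answers expired polls with an empty batch.
function sweepPolls(now) {
  for (const [id, poll] of pendingPolls) {
    if (poll.deadline <= now) finishPoll(id, 200, { events: [] });
  }
}

// A client abort only drops the record; there is nobody left to answer.
function onAbort(requestId) {
  return pendingPolls.delete(requestId);
}
```

Deleting the record before writing means a late timeout or a second event cannot answer the same response twice.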
The registry also gives you a place to read local metrics that later chapters build on. Even before any metrics system exists, the process can count active connections, pending polls, stale heartbeats, auth-expiring records, and presence states. Those counts come from the same state the server makes decisions with, so they expose local leaks early.
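A minimal sketch of such a snapshot, assuming hypothetical record fields `lastHeartbeatAt`, `authExpiresAt`, and `presence`, and thresholds supplied by the caller:

```javascript
// Count lifecycle states straight from the registries the server already keeps.
function snapshot(connections, pendingPolls, now, opts) {
  const counts = {
    activeConnections: 0,
    staleHeartbeats: 0,
    authExpiring: 0,
    pendingPolls: pendingPolls.size,
    presence: {},
  };
  for (const c of connections.values()) {
    counts.activeConnections += 1;
    if (now - c.lastHeartbeatAt > opts.heartbeatTimeoutMs) counts.staleHeartbeats += 1;
    if (c.authExpiresAt - now <= opts.authWarnMs) counts.authExpiring += 1;
    counts.presence[c.presence] = (counts.presence[c.presence] || 0) + 1;
  }
  return counts;
}
```

A count that only ever grows, such as pending polls climbing while traffic is flat, is usually the first visible sign of a missed cleanup path.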
Application Heartbeats
A WebSocket ping and pong prove the peer can still answer protocol control frames. TCP keep-alive checks a quiet TCP connection down at the transport layer. An application heartbeat proves something higher up, that the application handler is still receiving bytes, decoding them, getting scheduled by the event loop, and updating the server's state.
A TCP connection can go half-open, where the peer disappears, never sends a FIN, and your close event never fires. The socket sits in the registry looking alive, and presence keeps publishing "online" for a user who left an hour ago. Neither TCP keep-alive nor WebSocket ping/pong on its own keeps the application state correct here. You need an application-level heartbeat with a server-side timeout that closes the connection and runs cleanup when beats stop arriving. Without it, stale presence is the normal result of any ungraceful disconnect.
An application heartbeat is a message in your own realtime protocol, something like { "type": "heartbeat" } or { "type": "pong", "at": 123 }. One side sends it on an interval. The other side waits a timeout before it counts the peer as having missed a beat. Above both is the liveness timeout, the wider server policy for when a connection is dead enough to close and clean up.
function onHeartbeat(c, msg) {
  c.lastHeartbeatAt = performance.now();
  c.lastSeen = Date.now();
  c.clientTime = msg.now;
}
Both timestamps get updated for a reason. lastHeartbeatAt feeds the elapsed-time checks inside the process. lastSeen is the wall-clock value the product can publish or store.
setInterval(() => {
  const now = performance.now();
  for (const c of connections.values()) {
    if (now - c.lastHeartbeatAt > HEARTBEAT_TIMEOUT) {
      c.ws.close(4000, 'heartbeat timeout');
    }
  }
}, HEARTBEAT_INTERVAL);
The interval scans the server state and closes any connection that has missed the timeout. The close callback still runs cleanup afterward. The sweep starts the close. The close callback finishes it.
Heartbeat values are operational policy. If the interval is shorter than a normal mobile network stall, you get false disconnects. A timeout longer than a proxy's idle limit lets the intermediary close the connection before you do. A sane server usually picks a heartbeat interval below the known idle timeout limits and a timeout that tolerates a few missed beats. The exact numbers depend on the client platform, the proxy in between, and how much delayed offline state the product can tolerate.
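One way to turn those two constraints into numbers is to start from the smallest idle limit on the path and work backwards. The function below is a sketch only: `proxyIdleMs`, `toleratedMisses`, and `headroom` are hypothetical inputs the operator has to supply.

```javascript
// Derive a heartbeat contract from the smallest known idle limit.
function heartbeatPolicy({ proxyIdleMs, toleratedMisses, headroom = 0.9 }) {
  // Keep the timeout under the idle limit so the server closes first.
  const timeoutMs = Math.floor(proxyIdleMs * headroom);
  // Fit the tolerated misses plus the beat in flight inside that timeout.
  const intervalMs = Math.floor(timeoutMs / (toleratedMisses + 1));
  return { intervalMs, timeoutMs };
}
```

With a 60-second idle limit and two tolerated misses this gives an interval of about 18 seconds. Whether that is short enough to cause false disconnects on a stalled mobile network is the part the arithmetic cannot answer.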
Application heartbeats can carry a little metadata, but they should stay cheap. A heartbeat that does database writes, permission checks, and fanout on every beat turns into a periodic load spike. The server only needs enough to prove the connection is still alive at the application layer and to update liveness state.
Heartbeat direction is a contract choice. With a server-to-client heartbeat, the server prompts and the client answers. With a client-to-server heartbeat, the client sends on its own schedule. A bidirectional heartbeat does both and costs more messages. WebSocket ping/pong can handle protocol liveness while the application heartbeat handles the app's own state machine.
ws.send(JSON.stringify({
  type: 'heartbeat.ping',
  at: Date.now()
}));
The server sends that message when it wants an application-level acknowledgement. The client answers with an ordinary application message, and the server updates lastHeartbeatAt only once the application handler actually receives that answer.
{
  "type": "heartbeat.pong",
  "seen": 1782
}
The pong payload can carry the last processed event sequence. If the answer says the client has processed only event 1782 while the server has sent through 1840, the connection is alive but falling behind, which is a backpressure signal. Identity is still read from the auth context. The heartbeat payload reports liveness and progress, never identity. Slow-consumer policy is the previous subchapter's topic, and the heartbeat can hand that policy a cheap observation.
SSE works differently. A server can send comment heartbeats to keep the response alive through intermediaries, but the browser's EventSource API only ever receives stream data, never sends it. If the product needs client-to-server liveness over SSE, it has to add a second request, such as a small heartbeat POST or a reconnect that carries the last-seen cursor. Long polling gets its liveness from poll cadence instead, since a client that stops polling has stopped presenting fresh requests.
Browser tabs complicate this. A background tab may throttle its timers, and a mobile app may pause execution entirely. A server that demands a heartbeat every 5 seconds will start closing healthy users the moment the platform delays the client. A better policy names platform classes, things like desktop foreground, browser background, mobile foreground, and mobile background, and gives each class its own heartbeat contract the server can enforce.
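A per-class contract can be a small lookup table. In the sketch below the class names follow the list above, while the numbers, the `platformClass` field, and the fallback choice are placeholders for illustration.

```javascript
// Hypothetical per-class contracts; the numbers are placeholders, not recommendations.
const HEARTBEAT_CONTRACTS = new Map([
  ['desktop-foreground', { intervalMs: 15_000, timeoutMs: 45_000 }],
  ['browser-background', { intervalMs: 60_000, timeoutMs: 180_000 }],
  ['mobile-foreground', { intervalMs: 20_000, timeoutMs: 60_000 }],
  ['mobile-background', { intervalMs: 120_000, timeoutMs: 360_000 }],
]);

// Unknown or missing classes fall back to the strictest contract in this sketch.
const FALLBACK_CLASS = 'desktop-foreground';

function contractFor(c) {
  return HEARTBEAT_CONTRACTS.get(c.platformClass) || HEARTBEAT_CONTRACTS.get(FALLBACK_CLASS);
}

function heartbeatExpired(c, now) {
  return now - c.lastHeartbeatAt > contractFor(c).timeoutMs;
}
```

The class is something the client reports, so the server should accept only the names it knows and decide for itself how much leniency each one earns.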
A missed heartbeat should not turn into an immediate presence claim. All a missed beat tells you is that the server has not heard application liveness recently. From there it can mark the connection stale, close the socket, or move presence toward away or offline after a timeout. Publishing "offline" on the very first missed beat just creates noisy presence churn, while a small presence timeout absorbs brief stalls and still exposes the long disconnects.
Heartbeat timeouts also have to account for outbound queue pressure. When the server sends its heartbeat pings through the same per-connection queue as application messages, a slow consumer gets the heartbeat late. Closing that socket might still be the right move, but the reason is now mixed, since both liveness and send pressure caused the miss. Some systems put heartbeat messages ahead of normal application events to avoid this. Others keep strict ordering and accept that send pressure can trip the liveness check. Either way, make the choice visible where the sending happens.
Timeouts also need server-side cleanup for a close that never finishes. Calling ws.close() only starts a close handshake, and a peer that has already vanished may never answer. Many WebSocket servers follow up with a terminate or destroy after a shorter grace period, and that second step is the local lifecycle policy's responsibility.
function forceCloseLater(c) {
  c.terminateAt = performance.now() + CLOSE_GRACE_MS;
  c.ws.close(4000, 'liveness timeout');
}
The sweep can later destroy any socket whose terminateAt has passed. That keeps dead connections from sitting in the closing state while their presence stays stale.
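The second step might look like the sketch below. It assumes the socket exposes a forced-close method named `terminate()`, as the `ws` library does; other servers expose an equivalent such as destroying the underlying socket.

```javascript
// Destroy sockets whose close handshake did not finish within the grace period.
function sweepTerminations(connections, now) {
  let terminated = 0;
  for (const c of connections.values()) {
    if (c.terminateAt !== undefined && now >= c.terminateAt) {
      c.terminateAt = undefined; // avoid terminating the same socket twice
      c.ws.terminate();
      terminated += 1;
    }
  }
  return terminated;
}
```

The forced close still surfaces as a close event on the server side, so the registry and presence cleanup run through the usual close handler.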
Computing Presence
Presence is the part the UI shows, online or offline or away or idle or connected from another device. The server computes it from the connection state it can observe, and stores it against a user, a device, or a session.
Presence is always computed by the server. The client does not get to assert it. The server watches connections, heartbeats, messages, close events, and timeouts, and derives the published state from all of that. A profile record says who the user is. Presence reports what the realtime system sees about them right now.
function markSeen(c) {
  c.lastSeen = Date.now();
  c.presence = 'online';
  publishPresence(c.auth.userId, c.presence, c.lastSeen);
}
The last-seen timestamp is the wall-clock time of the last activity the server decides to count as meaningful. That might be any message, a heartbeat, a subscribe command, or an explicit active ping from the client. Pick one and stick with it. Products mean different things by seen, and the server needs one definition.
If a connection goes long enough without qualifying activity, its presence state changes, from online to away and later to offline. That cutoff is the presence timeout. The failure case is stale presence, where the published value still reads online or active after the evidence behind it is gone.
function checkPresenceDeadline(c, now) {
  if (now - c.lastSeen <= PRESENCE_TIMEOUT) return;
  if (c.presence === 'online') {
    c.presence = 'away';
    publishPresence(c.auth.userId, 'away', c.lastSeen);
  }
}
The deadline check reads the connection's own lastSeen value. A process-wide user map can aggregate those values later, but the raw observation stays on the local connection record.
Duplicate connections change the presence math. When the same user, session, or resume token shows up more than once, the server needs a duplicate connection policy for what to do. For presence, three policies are common. Single active connection, per-device presence, and aggregate user presence.
Single active connection means the newest connection replaces the old one, and presence reads online while that single connection stays active. Per-device presence tracks each device on its own, so a phone and a laptop can sit in different states at once. Aggregate user presence publishes online as long as any accepted connection for that user is online.
function userPresence(userId) {
  const userConnections = byUser.get(userId) || [];
  if (userConnections.some(c => c.presence === 'online')) {
    return 'online';
  }
  return 'offline';
}
That aggregate rule is simple and common, and it drops detail on purpose. If one device goes offline while another stays online, the user still reads online. A product that wants device-level state needs a different output. Multi-device identity policy is covered elsewhere, but the realtime server still needs some duplicate connection policy, since cleanup and reconnect both depend on it.
Presence updates should be idempotent and monotonic enough that clients can rely on them. If a connection closes and a resumed connection becomes active right after, a client might receive offline followed immediately by online. Sequence numbers from the previous subchapter let the client ignore the stale update, and a last-seen timestamp helps the UI decide whether a late offline update should really replace a newer online one.
Presence can be persisted, but it is still computed from connection state. Many systems store lastSeen so a profile can show "last active at 14:03" after the user disconnects. That stored timestamp is just history. The live presence state should still be computed from current connection evidence and current timeouts.
Presence transitions should be small and named.
offline -> online
online -> away
away -> online
online -> offline
away -> offline
Those transitions cover most realtime products. The jump from offline -> online happens when the first accepted connection for a user becomes active. online -> away fires when the presence timeout passes while the connection still exists, and away -> online fires when fresh activity arrives. Both online -> offline and away -> offline happen when the last relevant connection closes or times out.
Source events should be narrow too.
accepted connection
application heartbeat
application message
close cleanup
presence timeout
resume success
The server can test presence by feeding those source events into the lifecycle record one at a time. That is far easier to follow than presence logic scattered across WebSocket handlers, HTTP routes, and room membership code.
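The transitions and source events fit in one small table. The sketch below makes two assumptions worth stating: heartbeats and messages both count as fresh activity, and "close cleanup" means the last relevant connection for the user has gone.

```javascript
// Transition table: current state -> source event -> next state.
// A missing entry means the event does not change presence.
const TRANSITIONS = {
  offline: { 'accepted connection': 'online', 'resume success': 'online' },
  online: { 'presence timeout': 'away', 'close cleanup': 'offline' },
  away: {
    'application heartbeat': 'online',
    'application message': 'online',
    'close cleanup': 'offline',
  },
};

function nextPresence(state, event) {
  const row = TRANSITIONS[state];
  const known = row && Object.prototype.hasOwnProperty.call(row, event);
  return known ? row[event] : state;
}

// Feed a list of source events through the table, one at a time.
function replay(initial, events) {
  return events.reduce(nextPresence, initial);
}
```

A test then reads like the product rule it checks: replaying an accepted connection, a presence timeout, and a close cleanup from `offline` should pass through `online` and `away` and end at `offline`.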
Presence needs a publication policy too. Publishing every heartbeat as online sends traffic that carries no new information. A better local rule is to publish only on a state change or a change in the last-seen bucket. A product might publish last-seen at one-minute granularity while the connection is active, then publish offline at close. The exact UI policy is up to the application, but the realtime server should not emit the same state transition on every beat.
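A bucketed rule can be sketched like this, assuming a one-minute bucket, a hypothetical `publishedBucket` field on the record, and a `publish` callback standing in for the presence publisher:

```javascript
const LAST_SEEN_BUCKET_MS = 60_000; // one-minute granularity

// Publish only when the state or the last-seen bucket has changed.
function publishIfMeaningful(c, nextState, lastSeen, publish) {
  const bucket = Math.floor(lastSeen / LAST_SEEN_BUCKET_MS);
  const stateChanged = c.presence !== nextState;
  const bucketChanged = c.publishedBucket !== bucket;
  // The record always keeps the precise values, published or not.
  c.presence = nextState;
  c.lastSeen = lastSeen;
  if (!stateChanged && !bucketChanged) return false;
  c.publishedBucket = bucket;
  publish(c.auth.userId, nextState, bucket * LAST_SEEN_BUCKET_MS);
  return true;
}
```

With a heartbeat every few seconds, this emits at most one update per minute while the state is unchanged, and still emits immediately when the state itself flips.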
function publishIfChanged(c, next) {
  if (c.presence === next) return;
  c.presence = next;
  publishPresence(c.auth.userId, next, c.lastSeen);
}
The guard does real work during a reconnect. A new connection can mark the user online while the old connection is still closing. If aggregate presence already reads online, there is no reason to publish it again. When the old connection finally closes, aggregate presence should stay online, because the new connection is active.
Presence privacy sits outside this chapter, but the local data structure should leave room for it. Some products show exact online state to teammates, a coarse last-seen to strangers, and nothing at all to blocked users. The realtime connection record should store the raw observed state, and the fanout or API layer can decide who receives which derived view.
Presence semantics differ by transport because the available evidence differs.
WebSocket gives the server bidirectional application messages over one accepted connection. From there the server can observe heartbeats, commands, close frames, missed liveness checks, and send pressure. Presence can stay tied closely to that connection record, since both directions share the same lifecycle.
SSE gives the server a long-lived response stream. The server can see that the response is open and that writes are succeeding, and it can see reconnects through Last-Event-ID. What it gets weaker evidence on is whether the user is actually reading, because the browser receives events through EventSource and sends anything back over a separate request. For many products, SSE presence should mean "stream connected" and treat "user active in the UI" as a different state. If the product wants a stronger activity signal, the client can send a separate activity request on clicks, focus changes, or a heartbeat cadence.
Long polling gives the server a series of request arrivals. One request ending tells you almost nothing on its own. A steady cadence of authenticated polls tells you the client is still participating. So the presence timeout needs to be longer than the normal gap between one poll completing and the next one arriving. If the long-poll timeout is 25 seconds and the browser fires the next poll immediately, a 30-second presence timeout is already too tight once network delay and tab scheduling get involved. A 60 to 90 second timeout usually represents the product better. The exact value is a product and infrastructure contract.
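The arithmetic behind that range can be written down. The sketch assumes the server counts a poll's arrival as the liveness evidence; `maxGapMs` (worst-case delay before the next poll arrives) and `toleratedMissedPolls` are hypothetical inputs.

```javascript
// Smallest presence timeout that survives `toleratedMissedPolls` lost polls.
function minPresenceTimeoutMs({ pollTimeoutMs, maxGapMs, toleratedMissedPolls }) {
  // One full cycle: the poll is held for its timeout, then the next one arrives.
  const cycleMs = pollTimeoutMs + maxGapMs;
  return cycleMs * (toleratedMissedPolls + 1);
}
```

With a 25-second poll and a 5-second worst-case gap, tolerating no lost polls gives 30 seconds, which leaves no margin at all. Tolerating one gives 60 seconds and tolerating two gives 90.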
A transport change should preserve the meaning of presence. If a client falls back from WebSocket to SSE or long polling, the server should keep publishing the same presence vocabulary even as the evidence source behind it changes. "Online" has to mean the same product state across every transport. The internal proof can be a heartbeat response, an open SSE stream, or an accepted poll cadence, depending on which one the client is using.
Keeping the vocabulary stable is what carries a fallback. A user on a network that blocks WebSockets should see the same presence meaning once the app moves to long polling. The server stores the transport in the lifecycle record, applies the transport-specific liveness checks, and produces the same public presence state from them.
Fallback transitions should look like reconnects from the same subject. If a WebSocket connect fails and the client opens SSE, the server binds the SSE stream to the same authenticated subject and records the new transport. If a WebSocket closes under pressure and the client drops to long polling, the server uses the last known cursor and its duplicate connection policy to decide whether the new transport resumes recent state or starts over with a full resync.
A transport generation number handles this. Each successful connection or stream gets a higher generation number for the same user and session pair. When cleanup runs for an older generation, it can tell that a newer generation is already publishing presence and skip the offline update. Inside one process, the lifecycle registry compares the generation it is cleaning against the generation currently active for that subject.
if (c.generation < latestGeneration(c.auth.userId)) {
  return releaseLocalResources(c);
}
The guard does just one thing. It protects presence from a late cleanup after a fallback or resume. Room membership and send queues still have to be released for the old transport. The newer generation is the one publishing presence now, and the older generation only has its leftover local resources to clean up.
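A minimal in-process version of the generation bookkeeping, keyed by the user and session pair, might look like this. The helper names, the `sessionId` field, and the callbacks are assumptions for illustration.

```javascript
// "userId:sessionId" -> highest generation issued so far.
// Entries need eviction when a session ends; that is omitted here.
const latest = new Map();

function subjectKey(auth) {
  return `${auth.userId}:${auth.sessionId}`;
}

// Called once per accepted connection or stream.
function assignGeneration(c) {
  const key = subjectKey(c.auth);
  c.generation = (latest.get(key) || 0) + 1;
  latest.set(key, c.generation);
  return c.generation;
}

function isSuperseded(c) {
  return c.generation < (latest.get(subjectKey(c.auth)) || 0);
}

// Local resources are always released; presence is only touched by the
// newest generation for the subject.
function cleanupWithGeneration(c, releaseLocalResources, publishOffline) {
  releaseLocalResources(c);
  if (isSuperseded(c)) return 'skipped-presence';
  publishOffline(c.auth.userId);
  return 'published-offline';
}
```

The counter only orders transports inside one process. Across processes the same comparison needs a shared source for the generation number.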
Connection Cleanup
Every connection ends eventually. How it ends is what decides whether presence stays correct afterward.
A normal close frame should reach the WebSocket close handler. An abrupt network drop might surface instead as a socket error, or as a close event with no clean close code. A heartbeat timeout starts a server-initiated close, and auth expiry either closes the socket or marks it unauthenticated before close. Process shutdown should stop accepting new connections and run every active record through the one cleanup function. A rejected resume should leave the new socket closed and the old resume state either intact or expired, according to policy.
ws.on('close', (code, reason) => {
  cleanup(connection, { code, reason: String(reason) });
});
ws.on('error', err => {
  connection.lastError = err.code;
});
The error handler only records context. The close handler does the actual cleanup. Many WebSocket libraries emit both events for one failure, so put cleanup in the close handler, or in a shared function guarded by registry deletion.
function expireAuth(c) {
  c.closingReason = 'auth expired';
  c.ws.close(4001, 'auth expired');
}
expireAuth() starts an application-defined close. The cleanup itself still runs when the socket actually closes. If the peer has already vanished, either the close event or the heartbeat sweep should still reach that same cleanup function.
Graceful shutdown adds a process-level reason. The server stops accepting new upgrade requests, tells existing clients to reconnect, and closes their sockets after a deadline. Deployment draining in detail is the deployment chapter's topic. The local connection record still needs a closingReason so presence and reconnect behavior can tell a shutdown apart from an auth failure.
for (const c of connections.values()) {
  c.closingReason = 'server shutdown';
  c.ws.close(1012, 'service restart');
}
1012 is the close code registered for a service restart. Clients can treat it as retryable, and the API contract should say so. A 4001 auth close points the client at a credential refresh. A pressure close might point it at a resync. The reconnect behavior follows from the close reason.
A rejected resume needs the same cleanup discipline. A new socket that fails resume should close cleanly, and it must not publish the user offline when another active connection still exists. The outcome comes from the aggregate state after the failed attempt. The failed socket is only one input to that.
function closeRejectedResume(ws) {
  // Tell the client to resync before the close frame arrives.
  ws.send(JSON.stringify({ type: 'resync.required' }));
  ws.close(4002, 'resync required');
}
The response gives the client one clear branch. Stop trying to resume, fetch fresh state, and reconnect with a current cursor or none at all. The server does not replay a range it no longer has.
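On the client, the close codes used in this chapter (1012, 4000, 4001, 4002) can drive one dispatch function. The plan shape and the default branch below are assumptions; the real contract belongs in the API documentation.

```javascript
// Map a close code to the client's next step.
function reconnectPlan(code) {
  switch (code) {
    case 1012: // service restart: same credentials, same cursor
      return { retry: true, refreshCredential: false, resync: false };
    case 4000: // heartbeat or liveness timeout
      return { retry: true, refreshCredential: false, resync: false };
    case 4001: // auth expired: get a fresh credential first
      return { retry: true, refreshCredential: true, resync: false };
    case 4002: // resume rejected: fetch fresh state, drop the old cursor
      return { retry: true, refreshCredential: false, resync: true };
    default: // unknown codes: plain retry is an assumption of this sketch
      return { retry: true, refreshCredential: false, resync: false };
  }
}
```

Keeping the mapping in one function gives client and server authors a single table to compare when a new close code is added.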
Lifecycle cleanup is simple when one place is responsible for it. Remove the connection from the registry, clear its timers or just let the sweep ignore the now-missing record, and update presence according to the duplicate policy. Then release room memberships, drop send-queue references, and let resume state expire at its own deadline. Publish one final state change, but only if the aggregate state actually changed.
Room cleanup happens here too, even though fanout is the next subchapter. A closing connection should leave its local room membership before the presence update fans out, or the closed connection can receive its own offline update through a stale membership list. Cross-process membership comes later. Inside one process, deleting memberships is part of connection cleanup.
function releaseConnection(c) {
  // Use the same key the registry uses in cleanup().
  leaveAllRooms(c.auth.connectionId);
  dropSendQueue(c.auth.connectionId);
  cleanup(c, c.closingReason);
}
The wrapper puts transport cleanup and application cleanup in one place. leaveAllRooms() drops the routing state, dropSendQueue() releases buffered application messages, and cleanup() updates the lifecycle registry and presence.
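The rule "publish one final state change, but only if the aggregate state actually changed" can be made explicit by computing the aggregate before and after the record is removed. The sketch below assumes aggregate user presence, a linear scan of the registry (a per-user index would avoid that), and a `hooks` object standing in for the room, queue, and publish functions.

```javascript
function aggregatePresence(connections, userId) {
  for (const c of connections.values()) {
    if (c.auth.userId === userId && c.presence === 'online') return 'online';
  }
  return 'offline';
}

// Ordered, idempotent close for one record.
function closeRecord(connections, c, hooks) {
  if (!connections.has(c.auth.connectionId)) return false; // already cleaned up
  const before = aggregatePresence(connections, c.auth.userId);
  connections.delete(c.auth.connectionId);
  hooks.leaveAllRooms(c); // before any presence fanout
  hooks.dropSendQueue(c);
  const after = aggregatePresence(connections, c.auth.userId);
  if (after !== before) hooks.publish(c.auth.userId, after);
  return true;
}
```

Closing one of two online devices publishes nothing, closing the second publishes a single offline update, and a repeated call for either record returns early.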
SSE cleanup runs on the response close event.
res.on('close', () => {
  releaseConnection(connection);
});
Long polling cleanup runs on request close, timeout completion, and event completion. Each one removes the pending poll record. Presence changes only when the accepted poll cadence goes stale, at whatever timeout the product contract sets.
The last failure mode is split cleanup during a deploy. One subsystem closes sockets for graceful shutdown, another marks presence offline, and a third removes room membership. When each runs its own partial cleanup, ordering bugs appear. Put the order in one cleanup function and let callers pass in a reason.
Realtime lifecycle bugs almost always come back to split state. Keep one accepted connection record, one auth context, one liveness policy, one cleanup function, and presence that is always computed from observed state.