Real-time features are easy to demo and hard to operate.
A chat prototype can work in an afternoon. A live dashboard can update beautifully on localhost. A collaboration canvas can feel magical with two users in the same room. Then production arrives: mobile networks drop, users open the same account on three devices, events arrive out of order, servers restart, Redis has a hiccup, and nobody knows whether the UI is wrong or the backend is late.
I have built real-time features for wallet updates, chat, dashboards, and collaboration products. The recurring lesson is simple: real-time architecture is not about WebSockets. It is about state, delivery guarantees, recovery, and observability.
This article lays out a practical approach for building real-time systems with Node.js, Socket.io/WebSockets, queues, and event-driven architecture.
First Question: Does It Need To Be Real-Time?
Before adding WebSockets, ask what product problem you are solving.
Use real-time when:
- Users are collaborating on the same object.
- Financial or operational state changes need immediate visibility.
- A support or admin team needs live monitoring.
- Latency changes user trust or decision quality.
Avoid real-time when:
- A 10-30 second delay is acceptable.
- The UI is mostly read-only.
- Updates are rare and user-triggered.
- Polling would be simpler and cheaper.
Real-time infrastructure creates operational responsibility. If the product does not need it, polling may be the senior decision.
Choosing The Transport
flowchart TD
Need[Need live updates?] --> Critical{Bi-directional?}
Critical -->|Yes| WS[WebSockets / Socket.io]
Critical -->|No| SSE[Server-Sent Events]
Need --> Rare{Updates rare?}
Rare -->|Yes| Poll[Polling]
Rare -->|No| Push[Push Notifications]
Common options:
- Polling: simple, reliable, cache-friendly, often enough.
- Server-Sent Events: good for one-way updates from server to browser.
- WebSockets: best for bidirectional interaction, presence, collaboration, and rooms.
- Push notifications: useful when the app is backgrounded or closed.
For Node.js products, Socket.io is often pragmatic because it gives rooms, reconnection behavior, and adapters. Raw WebSockets can be better for lower-level control, but you will build more yourself.
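The options above can be written down as a rough decision helper. This is only a sketch: the `Requirements` fields, the 10-second threshold (taken from the "avoid real-time" list), and the order of the checks are assumptions, not fixed rules.

```typescript
type TransportChoice = "polling" | "sse" | "websocket" | "push";

type Requirements = {
  acceptableDelaySeconds: number; // how stale the UI is allowed to be
  bidirectional: boolean; // does the client also send frequent messages?
  mustReachBackgroundedApp: boolean; // should updates arrive when the app is closed?
};

// Rules of thumb only; real products often combine two transports.
function chooseTransport(req: Requirements): TransportChoice {
  if (req.mustReachBackgroundedApp) return "push";
  if (req.acceptableDelaySeconds >= 10) return "polling";
  if (req.bidirectional) return "websocket";
  return "sse";
}

const dashboardChoice = chooseTransport({
  acceptableDelaySeconds: 1,
  bidirectional: false,
  mustReachBackgroundedApp: false,
});
```

A one-way live dashboard lands on Server-Sent Events here, while a collaboration canvas with the same latency needs would land on WebSockets.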
Architecture That Does Not Collapse
A healthy real-time system separates writes, events, and delivery.
flowchart LR
Client[Client] --> API[HTTP API]
API --> DB[(Database)]
API --> Bus[Event Bus / Queue]
Worker[Worker] --> Bus
Bus --> Gateway[WebSocket Gateway]
Gateway --> Client
Gateway --> Presence[(Presence Store)]
Gateway --> Metrics[Metrics and Logs]
Key rule: the WebSocket gateway should not become your entire backend.
The gateway should:
- Authenticate socket connections.
- Join users to rooms.
- Deliver events.
- Track presence if needed.
- Emit metrics.
It should not own critical business state. Business state belongs in your database and domain services.
Event Contracts
Messy real-time systems often start with messy event names.
Use stable, versionable event contracts:
type RealtimeEvent<TPayload> = {
  id: string;
  type: string;
  version: number;
  occurredAt: string;
  actorId?: string;
  resourceId: string;
  payload: TPayload;
};
Example event names:
- transaction.updated
- chat.message.created
- canvas.node.moved
- document.presence.updated
- notification.created
Avoid vague names like update, refresh, or message across the whole system. They become impossible to reason about once the product grows.
Each event should answer:
- What changed?
- Which resource changed?
- Who should receive it?
- Can it be processed twice safely?
- Does it replace state or describe a patch?
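One way to keep event names and payloads from drifting is a single map from event type to payload shape, so the compiler rejects a mismatched payload. The sketch below is illustrative: the `EventPayloads` map, its payload fields, and the `createEvent` helper are assumptions, not part of any library.

```typescript
// Hypothetical payload shapes; replace with your own domain types.
type EventPayloads = {
  "transaction.updated": { transactionId: string; status: "pending" | "confirmed" | "failed" };
  "chat.message.created": { messageId: string; text: string };
};

type EventType = keyof EventPayloads;

type RealtimeEvent<T extends EventType> = {
  id: string;
  type: T;
  version: number;
  occurredAt: string; // ISO 8601 timestamp set by the server
  actorId?: string;
  resourceId: string;
  payload: EventPayloads[T];
};

// Builds an event whose payload is checked against its type at compile time.
function createEvent<T extends EventType>(
  type: T,
  fields: { id: string; resourceId: string; payload: EventPayloads[T]; actorId?: string; version?: number },
  now: Date = new Date(),
): RealtimeEvent<T> {
  const created: RealtimeEvent<T> = {
    id: fields.id,
    type,
    version: fields.version ?? 1,
    occurredAt: now.toISOString(),
    resourceId: fields.resourceId,
    payload: fields.payload,
  };
  if (fields.actorId !== undefined) created.actorId = fields.actorId;
  return created;
}

const txEvent = createEvent("transaction.updated", {
  id: "evt_1",
  resourceId: "tx_42",
  payload: { transactionId: "tx_42", status: "confirmed" },
});
```

Passing a chat payload to `createEvent("transaction.updated", ...)` fails to compile, which is the point of the map.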
Implementing A Socket.io Gateway In NestJS
In NestJS, keep socket connection logic inside a gateway and business logic inside services.
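import {
  OnGatewayConnection,
  SubscribeMessage,
  WebSocketGateway,
  WebSocketServer,
} from "@nestjs/websockets";
import { Server, Socket } from "socket.io";
// SocketAuthService and PermissionService are application services from your own modules.

@WebSocketGateway({
  cors: { origin: process.env.APP_ORIGIN },
  namespace: "realtime",
})
export class RealtimeGateway implements OnGatewayConnection {
  @WebSocketServer()
  server: Server;

  constructor(
    private readonly auth: SocketAuthService,
    private readonly permissions: PermissionService,
  ) {}

  async handleConnection(socket: Socket) {
    try {
      const user = await this.auth.authenticate(socket.handshake.auth?.token);
      socket.data.user = user;
      await socket.join(`user:${user.id}`);
    } catch {
      // Reject sockets that fail authentication instead of leaving them connected.
      socket.disconnect(true);
    }
  }

  @SubscribeMessage("workspace.join")
  async joinWorkspace(socket: Socket, payload: { workspaceId: string }) {
    const user = socket.data.user;
    // Same permission model as the HTTP API; throws if access is denied.
    await this.permissions.assertWorkspaceAccess(user.id, payload.workspaceId);
    await socket.join(`workspace:${payload.workspaceId}`);
  }
}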
The gateway does three things:
- Authenticates the socket.
- Joins authorized rooms.
- Emits events.
It should not decide how transactions, chat messages, or canvas nodes work. That belongs in services that can also be tested without a socket connection.
To publish from a domain service, expose a small method on the gateway and call it once the state change has been saved:
publishToUser(userId: string, event: RealtimeEvent<unknown>) {
  this.server.to(`user:${userId}`).emit(event.type, event);
}
This keeps delivery separate from business state.
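If domain services should not import the gateway at all, one option is to have them depend on a small publisher interface that the gateway (or a queue-backed adapter) implements. The names `RealtimePublisher`, `RecordingPublisher`, and `TransactionService` below are hypothetical, and the in-memory map stands in for a database write.

```typescript
type RealtimeEvent<TPayload> = {
  id: string;
  type: string;
  version: number;
  occurredAt: string;
  resourceId: string;
  payload: TPayload;
};

// The only thing a domain service needs to know about delivery.
interface RealtimePublisher {
  publishToUser(userId: string, event: RealtimeEvent<unknown>): void;
}

// Test double that records what would have been emitted to each room.
class RecordingPublisher implements RealtimePublisher {
  readonly sent: Array<{ room: string; event: RealtimeEvent<unknown> }> = [];

  publishToUser(userId: string, event: RealtimeEvent<unknown>): void {
    this.sent.push({ room: `user:${userId}`, event });
  }
}

// Hypothetical domain service: it changes state first and publishes second.
class TransactionService {
  private readonly statuses = new Map<string, string>();
  private readonly publisher: RealtimePublisher;

  constructor(publisher: RealtimePublisher) {
    this.publisher = publisher;
  }

  confirm(transactionId: string, userId: string): void {
    this.statuses.set(transactionId, "confirmed"); // stand-in for a database write
    this.publisher.publishToUser(userId, {
      id: `${transactionId}:confirmed`,
      type: "transaction.updated",
      version: 1,
      occurredAt: new Date().toISOString(),
      resourceId: transactionId,
      payload: { status: "confirmed" },
    });
  }

  statusOf(transactionId: string): string | undefined {
    return this.statuses.get(transactionId);
  }
}
```

Because the service only sees the interface, it can be unit tested with the recording double and no socket connection.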
Rooms, Permissions, And Multi-Device Sessions
Rooms are not authorization. They are delivery channels.
For example:
- user:{userId} for personal notifications.
- wallet:{walletId} for wallet updates.
- workspace:{workspaceId} for collaboration.
- document:{documentId} for active editors.
Before joining a room, the gateway should verify access using the same permission model as your API.
Multi-device behavior matters too. A user may be connected from mobile, desktop, and tablet. If one device changes a resource, should every device receive the event? Usually yes, but the UI may treat the originating device differently for optimistic updates.
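A small helper can make "check access, then join" the only way into a room. This is a sketch under assumptions: `JoinableSocket` is a minimal stand-in for the part of a Socket.io socket that is used here, and `AccessCheck` is a hypothetical hook that would call the same permission service as your API.

```typescript
type RoomKind = "user" | "wallet" | "workspace" | "document";

// Centralizing room names avoids typos such as "workspaces:123".
function roomName(kind: RoomKind, id: string): string {
  return `${kind}:${id}`;
}

// Minimal slice of a socket; only join() is needed for this sketch.
interface JoinableSocket {
  join(room: string): void | Promise<void>;
}

// Hypothetical permission hook, expected to reuse the API's permission model.
type AccessCheck = (userId: string, kind: RoomKind, resourceId: string) => Promise<boolean>;

// Joins the room only when the access check passes; returns whether it joined.
async function joinIfAllowed(
  socket: JoinableSocket,
  userId: string,
  kind: RoomKind,
  resourceId: string,
  canAccess: AccessCheck,
): Promise<boolean> {
  if (!(await canAccess(userId, kind, resourceId))) return false;
  await socket.join(roomName(kind, resourceId));
  return true;
}
```

The same check should run again when permissions change, since a socket that joined earlier stays in the room until it is removed.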
Scaling Socket.io
A single Node.js process can handle many connections, but production systems usually need multiple instances.
For Socket.io, use a shared adapter such as Redis:
flowchart LR
C1[Client A] --> G1[Gateway 1]
C2[Client B] --> G2[Gateway 2]
G1 <--> Redis[(Redis Adapter)]
G2 <--> Redis
API[API / Worker] --> Redis
This lets an event published from one instance reach sockets connected to another instance.
Operational concerns:
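- Use sticky sessions when HTTP long-polling is enabled. Socket.io needs them across multiple instances unless clients are restricted to the WebSocket transport.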
- Set connection limits per user/IP.
- Monitor connection count per instance.
- Gracefully drain connections during deployment.
- Keep payloads small.
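Per-user or per-IP connection limits can start as a small counter. The `ConnectionLimiter` class below is a hypothetical, in-memory sketch, so it only limits connections on one instance; a multi-instance deployment would need a shared store such as Redis for the counts.

```typescript
class ConnectionLimiter {
  private readonly counts = new Map<string, number>();
  private readonly maxPerKey: number;

  constructor(maxPerKey: number) {
    this.maxPerKey = maxPerKey;
  }

  // Returns true and records the connection if the key is under its limit.
  tryAcquire(key: string): boolean {
    const current = this.counts.get(key) ?? 0;
    if (current >= this.maxPerKey) return false;
    this.counts.set(key, current + 1);
    return true;
  }

  // Call from the disconnect handler so the slot is freed again.
  release(key: string): void {
    const current = this.counts.get(key) ?? 0;
    if (current <= 1) this.counts.delete(key);
    else this.counts.set(key, current - 1);
  }

  activeFor(key: string): number {
    return this.counts.get(key) ?? 0;
  }
}

const limiter = new ConnectionLimiter(3); // e.g. mobile, desktop, and tablet
```

In a gateway, `tryAcquire` would run after authentication, with the socket disconnected when it returns false.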
Reconnection And Missed Events
Real-time delivery is never guaranteed on the client side. Browsers sleep. Mobile apps background. Networks break.
Design for recovery:
- Client stores lastSeenEventId or lastSyncedAt.
- On reconnect, client calls an HTTP sync endpoint.
- API returns missed changes or current resource state.
- Client reconciles local optimistic state.
Do not rely on the socket stream as the only way to get correct state.
For wallet transaction updates, the UI can reconnect and fetch /transactions/:id. For a collaboration canvas, the client may fetch a document snapshot plus recent operations.
Implementing Missed Event Recovery
A simple recovery endpoint can solve many real-time consistency bugs.
@Get("/workspaces/:id/events")
async getEventsSince(
  @Param("id") workspaceId: string,
  @Query("after") afterEventId: string,
  @CurrentUser() user: User,
) {
  await this.permissions.assertWorkspaceAccess(user.id, workspaceId);
  return this.events.findAfter({
    workspaceId,
    afterEventId,
    limit: 500,
  });
}
On the client:
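socket.on("connect", async () => {
  // Runs on the first connection and on every reconnect.
  const missed = await api.getEventsSince(workspaceId, lastSeenEventId);
  if (missed.length >= 500) {
    // A full page means the gap may exceed one page, so start from a snapshot.
    await refetchWorkspaceSnapshot();
    return;
  }
  for (const event of missed) {
    applyEvent(event);
    lastSeenEventId = event.id; // advance the cursor so the next sync resumes here
  }
});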
This gives you two recovery paths:
- Replay missed events when the gap is small.
- Refetch a snapshot when the gap is too large.
That is much safer than hoping the socket connection never drops.
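The query behind the sync endpoint can be sketched against an in-memory log. Everything here is an assumption for illustration: the `StoredEvent` shape, the per-workspace `sequence` number, and the choice to return `null` when the cursor is unknown so the caller falls back to a snapshot.

```typescript
type StoredEvent = { id: string; sequence: number; workspaceId: string; type: string };

// Returns events strictly after `afterEventId`, oldest first, capped at `limit`.
// Returns null when the cursor is not in the log (for example, it was pruned),
// which the caller should treat like a gap that is too large to replay.
function findEventsAfter(
  log: StoredEvent[],
  query: { workspaceId: string; afterEventId: string; limit: number },
): StoredEvent[] | null {
  const scoped = log
    .filter((e) => e.workspaceId === query.workspaceId)
    .sort((a, b) => a.sequence - b.sequence);
  const cursor = scoped.findIndex((e) => e.id === query.afterEventId);
  if (cursor === -1) return null;
  return scoped.slice(cursor + 1, cursor + 1 + query.limit);
}

const eventLog: StoredEvent[] = [
  { id: "e1", sequence: 1, workspaceId: "ws1", type: "canvas.node.moved" },
  { id: "e3", sequence: 3, workspaceId: "ws1", type: "canvas.node.moved" },
  { id: "x1", sequence: 1, workspaceId: "ws2", type: "chat.message.created" },
  { id: "e2", sequence: 2, workspaceId: "ws1", type: "chat.message.created" },
  { id: "e4", sequence: 4, workspaceId: "ws1", type: "canvas.node.moved" },
];
```

In a database, the same idea is usually an indexed query on a workspace ID plus a monotonically increasing sequence column.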
Ordering, Idempotency, And Duplicate Events
Real-time bugs often come from assuming events arrive exactly once and in order. They do not.
A user can receive:
- Event 10 before event 9.
- The same event twice after reconnect.
- An optimistic local update followed by a server update.
- A stale event from a previous connection.
- A room event after permissions changed.
Design for this explicitly.
Useful techniques:
- Add event IDs.
- Add resource versions.
- Include occurredAt, but do not rely on client clocks.
- Make handlers idempotent.
- Ignore events older than the current resource version.
- Refetch state when the client detects a version gap.
For example, a canvas node update can include a version:
type NodeMovedEvent = {
  id: string;
  type: "canvas.node.moved";
  canvasId: string;
  nodeId: string;
  version: number;
  payload: {
    x: number;
    y: number;
  };
};
If the client has node version 42 and receives version 41, it can ignore it. If it receives version 44 while expecting 43, it can request a snapshot or a missed-event range.
This sounds like extra work until you debug a collaboration product where users see different versions of the same object.
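The version rule can be expressed as a small client-side handler. This sketch assumes each node's version increases by exactly one per saved change and that unknown nodes count as version 0; `applyNodeMoved`, `CanvasNodeState`, and `ApplyResult` are hypothetical names.

```typescript
type NodeMovedEvent = {
  id: string;
  type: "canvas.node.moved";
  canvasId: string;
  nodeId: string;
  version: number;
  payload: { x: number; y: number };
};

type CanvasNodeState = { x: number; y: number; version: number };
type ApplyResult = "applied" | "duplicate" | "stale" | "gap";

// Applies the event only when it is the next version for that node.
function applyNodeMoved(
  nodes: Map<string, CanvasNodeState>,
  seenEventIds: Set<string>,
  evt: NodeMovedEvent,
): ApplyResult {
  if (seenEventIds.has(evt.id)) return "duplicate"; // redelivered after reconnect
  const currentVersion = nodes.get(evt.nodeId)?.version ?? 0;
  if (evt.version <= currentVersion) return "stale"; // older than local state
  if (evt.version > currentVersion + 1) return "gap"; // caller should resync
  seenEventIds.add(evt.id);
  nodes.set(evt.nodeId, { x: evt.payload.x, y: evt.payload.y, version: evt.version });
  return "applied";
}

function moved(id: string, version: number, x: number, y: number): NodeMovedEvent {
  return { id, type: "canvas.node.moved", canvasId: "c1", nodeId: "n1", version, payload: { x, y } };
}
```

On "gap" the caller would request a snapshot or a missed-event range. In a long-lived session the set of seen IDs should be bounded, for example by keeping only recent ones.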
Optimistic UI Without Lying To The User
Optimistic UI is necessary in many real-time products. Users expect a chat message to appear immediately and a canvas node to move as they drag it. But optimistic UI becomes dangerous when the product treats local intent as confirmed truth.
A better pattern:
- Apply local optimistic state.
- Mark it as pending.
- Send mutation to API.
- Receive authoritative event or response.
- Reconcile local state.
- Show failure and rollback if needed.
For financial products, optimism should be more conservative. A wallet can show “withdrawal requested” immediately, but it should not show “withdrawal confirmed” until settlement rules are satisfied.
For collaboration tools, optimism can be more aggressive, but the system still needs conflict rules. If two users edit the same object, does last-write-wins apply? Are operations merged? Does one edit create a conflict state?
The user experience should match the certainty of the backend.
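For chat, the pending/confirmed/failed lifecycle can be sketched as pure functions over a message list. The `ChatMessage` shape and the helper names are assumptions; `clientId` is an ID generated on the device so the server's response can be matched back to the optimistic entry.

```typescript
type MessageStatus = "pending" | "confirmed" | "failed";

type ChatMessage = {
  clientId: string; // generated locally before the request is sent
  serverId?: string; // known only after the server confirms
  text: string;
  status: MessageStatus;
};

// Step 1 and 2: show the message immediately, marked as pending.
function addOptimistic(messages: ChatMessage[], clientId: string, text: string): ChatMessage[] {
  return [...messages, { clientId, text, status: "pending" }];
}

// Step 4 and 5: reconcile with the authoritative response or event.
// Safe to call twice, for example once for the HTTP response and once for the socket event.
function confirmMessage(messages: ChatMessage[], clientId: string, serverId: string): ChatMessage[] {
  return messages.map((m): ChatMessage =>
    m.clientId === clientId ? { ...m, serverId, status: "confirmed" } : m,
  );
}

// Step 6: keep the text so the UI can offer a retry instead of silently dropping it.
function failMessage(messages: ChatMessage[], clientId: string): ChatMessage[] {
  return messages.map((m): ChatMessage =>
    m.clientId === clientId && m.status === "pending" ? { ...m, status: "failed" } : m,
  );
}
```

The UI can render pending messages in a muted style, which keeps the display honest about what the backend has actually accepted.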
Testing Real-Time Systems
Real-time features need tests beyond normal request/response cases.
Test scenarios:
- Client reconnects after missing events.
- Two devices use the same account.
- User loses permission while connected.
- Gateway restarts during active sessions.
- Events arrive out of order.
- Queue worker retries an event.
- Redis adapter is unavailable.
- Mobile client backgrounds and returns.
For local testing, create scripts that simulate multiple clients. For staging, test deployments while sockets are connected. For production, track whether clients reconnect successfully after deploys.
If a real-time feature cannot survive reconnect, it is not production-ready.
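Out-of-order and retried deliveries are easy to simulate without any network. The sketch below scrambles an event list with a seeded random generator, so a failing seed can be replayed, and checks that a handler ends in the same state. The `VersionedEvent` shape, `scramble`, and the last-version-wins `applyLatest` handler are illustrative assumptions.

```typescript
type VersionedEvent = { id: string; resourceId: string; version: number; value: string };

// Small deterministic PRNG so that a failing seed is reproducible.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Duplicates roughly 30% of events, then shuffles (Fisher-Yates).
function scramble<T>(events: T[], seed: number): T[] {
  const rand = seededRandom(seed);
  const out = [...events, ...events.filter(() => rand() < 0.3)];
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    const tmp = out[i]!;
    out[i] = out[j]!;
    out[j] = tmp;
  }
  return out;
}

// Handler under test: keeps the highest version seen per resource.
function applyLatest(state: Map<string, VersionedEvent>, evt: VersionedEvent): void {
  const current = state.get(evt.resourceId);
  if (current && current.version >= evt.version) return;
  state.set(evt.resourceId, evt);
}

function replay(events: VersionedEvent[]): Map<string, VersionedEvent> {
  const state = new Map<string, VersionedEvent>();
  for (const evt of events) applyLatest(state, evt);
  return state;
}
```

A test then loops over many seeds and asserts that `replay(scramble(events, seed))` matches the in-order result.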
Observability
If you cannot observe real-time behavior, you cannot operate it.
Track:
- Active connections.
- Connections by user, workspace, region, or app version.
- Event publish latency.
- Event delivery errors.
- Reconnection frequency.
- Dropped connection reasons.
- Queue lag.
- Payload size.
Log event IDs across services. When support asks why a user did not see an update, you need to trace the event from domain service to queue to gateway to socket room.
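Event delivery latency can be measured from the event's `occurredAt` to the moment the gateway emits it. The `DeliveryMetrics` class below is a hypothetical in-process sketch; it assumes both timestamps come from server clocks that are reasonably in sync, and in production the numbers would normally go to a metrics system rather than an array.

```typescript
class DeliveryMetrics {
  private readonly latenciesMs: number[] = [];
  private errors = 0;

  // occurredAt is the ISO timestamp on the event; deliveredAtMs is epoch milliseconds.
  recordDelivered(occurredAt: string, deliveredAtMs: number): void {
    this.latenciesMs.push(Math.max(0, deliveredAtMs - Date.parse(occurredAt)));
  }

  recordError(): void {
    this.errors += 1;
  }

  get errorCount(): number {
    return this.errors;
  }

  // Nearest-rank percentile; undefined when nothing has been recorded yet.
  percentile(p: number): number | undefined {
    if (this.latenciesMs.length === 0) return undefined;
    const sorted = [...this.latenciesMs].sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length);
    return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
  }
}

const deliveryMetrics = new DeliveryMetrics();
deliveryMetrics.recordDelivered("1970-01-01T00:00:00.000Z", 40);
deliveryMetrics.recordDelivered("1970-01-01T00:00:00.000Z", 10);
```

Tracking a high percentile alongside the median tends to surface queue lag that an average would hide.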
Example: Wallet Transaction Updates
For a crypto wallet transaction:
- Blockchain listener detects confirmation.
- Worker normalizes provider event.
- Ledger service updates transaction state.
- Domain event transaction.updated is published.
- Gateway emits to user:{userId} and wallet:{walletId} rooms.
- Mobile app updates the transaction row and optionally refetches details.
This keeps blockchain complexity away from the client and keeps the gateway focused on delivery.
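The worker and ledger steps can be sketched as one function that turns a provider event into a domain event plus target rooms. The `ProviderEvent` shape, the three-confirmation threshold, and the in-memory `Ledger` are assumptions for illustration; a real ledger would be a database-backed service.

```typescript
type ProviderEvent = { txHash: string; confirmations: number; walletId: string; userId: string };
type TransactionStatus = "pending" | "confirmed";
type LedgerEntry = { status: TransactionStatus; confirmations: number; version: number };

type TransactionUpdatedEvent = {
  id: string;
  type: "transaction.updated";
  version: number; // schema version of the event contract
  occurredAt: string;
  resourceId: string;
  payload: { status: TransactionStatus; confirmations: number; ledgerVersion: number };
};

const REQUIRED_CONFIRMATIONS = 3; // illustrative threshold, not a recommendation

class Ledger {
  private readonly entries = new Map<string, LedgerEntry>();

  // Returns the new entry, or null when nothing changed (for example, a retried job).
  apply(transactionId: string, status: TransactionStatus, confirmations: number): LedgerEntry | null {
    const current = this.entries.get(transactionId);
    if (current && current.status === status && current.confirmations === confirmations) return null;
    const next: LedgerEntry = { status, confirmations, version: (current?.version ?? 0) + 1 };
    this.entries.set(transactionId, next);
    return next;
  }
}

function handleProviderEvent(
  ledger: Ledger,
  input: ProviderEvent,
  now: Date,
): { event: TransactionUpdatedEvent; rooms: string[] } | null {
  const status: TransactionStatus = input.confirmations >= REQUIRED_CONFIRMATIONS ? "confirmed" : "pending";
  const entry = ledger.apply(input.txHash, status, input.confirmations);
  if (entry === null) return null; // no state change, so nothing to publish
  return {
    event: {
      id: `${input.txHash}:${entry.version}`, // deterministic, so retries do not create new IDs
      type: "transaction.updated",
      version: 1,
      occurredAt: now.toISOString(),
      resourceId: input.txHash,
      payload: { status: entry.status, confirmations: entry.confirmations, ledgerVersion: entry.version },
    },
    rooms: [`user:${input.userId}`, `wallet:${input.walletId}`],
  };
}
```

The gateway only receives the finished event and the room names, so it never needs to know what a confirmation is.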
Example: Collaboration Canvas Updates
A collaboration canvas has a different shape from wallet updates because users are changing the same shared object at the same time.
A practical flow:
- User drags a node locally.
- UI updates position immediately for responsiveness.
- Client debounces persistence while drag is active.
- API saves the final position with an expected version.
- Backend publishes canvas.node.moved.
- Other clients apply the update if the version is newer.
- If a client detects a version gap, it refetches the canvas snapshot.
This solves two problems at once. The local user gets a smooth experience, and other users receive durable updates without being spammed by every pixel movement.
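Saving "with an expected version" is a compare-and-set. Here is a minimal in-memory sketch; `CanvasNode`, `SaveResult`, and `saveNodePosition` are hypothetical names, and with a real database the same check would typically be a conditional update that only matches the row when the stored version equals the expected one.

```typescript
type CanvasNode = { id: string; x: number; y: number; version: number };

type SaveResult =
  | { kind: "saved"; node: CanvasNode }
  | { kind: "not_found" }
  | { kind: "version_conflict"; current: CanvasNode };

// Writes the new position only if the caller saw the latest version.
function saveNodePosition(
  store: Map<string, CanvasNode>,
  nodeId: string,
  expectedVersion: number,
  x: number,
  y: number,
): SaveResult {
  const current = store.get(nodeId);
  if (!current) return { kind: "not_found" };
  if (current.version !== expectedVersion) return { kind: "version_conflict", current };
  const node: CanvasNode = { ...current, x, y, version: current.version + 1 };
  store.set(nodeId, node);
  return { kind: "saved", node };
}

const canvasStore = new Map<string, CanvasNode>([["n1", { id: "n1", x: 0, y: 0, version: 1 }]]);
const savedResult = saveNodePosition(canvasStore, "n1", 1, 120, 80);
```

On "saved" the backend would publish canvas.node.moved with the new version. On "version_conflict" the client receives the current node and can decide whether to retry or accept the other user's change.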
Presence should use a separate path. Cursor location, active selection, and “user is viewing this document” signals are useful, but they do not need the same durability as saved content. If a cursor event is dropped, the product is still correct. If a node move is dropped forever, the product is wrong.
That distinction is the kind of decision that keeps real-time systems maintainable.
Deployment Strategy
Real-time deployments need more care than normal API deployments because users can be actively connected during releases.
Recommended deployment behavior:
- Stop accepting new connections on an instance being replaced.
- Give existing sockets a short drain period.
- Tell clients to reconnect when a server is shutting down.
- Keep event payloads backward compatible during rolling deploys.
- Avoid changing event names without supporting the old name temporarily.
If the frontend and backend deploy separately, event contracts become public interfaces. A new backend may talk to an old frontend for several minutes or hours. Version fields and additive payload changes help avoid breaking active sessions.
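On the client, additive changes are easiest to survive with a tolerant reader: take the fields you know, default the optional ones, and ignore the rest. The payload history here is hypothetical (version 1 of chat.message.created has `messageId` and `text`, version 2 adds an `attachments` array), as are the `IncomingEvent` and `MessageView` names.

```typescript
type IncomingEvent = { type: string; version: number; payload: Record<string, unknown> };
type MessageView = { messageId: string; text: string; attachments: string[] };

// Reads a chat.message.created event from any payload version that keeps the
// required fields. Returns null when the event cannot be understood safely.
function readMessageCreated(evt: IncomingEvent): MessageView | null {
  if (evt.type !== "chat.message.created") return null;
  const messageId = evt.payload["messageId"];
  const text = evt.payload["text"];
  if (typeof messageId !== "string" || typeof text !== "string") return null; // contract broken
  const rawAttachments = evt.payload["attachments"];
  const attachments = Array.isArray(rawAttachments)
    ? rawAttachments.filter((a): a is string => typeof a === "string")
    : []; // field absent in version 1
  return { messageId, text, attachments };
}

const fromOldBackend = readMessageCreated({
  type: "chat.message.created",
  version: 1,
  payload: { messageId: "m1", text: "hello" },
});
```

An old frontend using this reader keeps working when a newer backend adds fields, and returning null gives it a clear signal to fall back to an HTTP refetch when a required field disappears.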
How This Solves The Chaos Problem
The architecture prevents backend chaos by separating responsibilities.
The API owns validation and mutations. The database owns durable state. The queue/event bus owns asynchronous delivery. The WebSocket gateway owns live fanout. The client owns presentation and recovery behavior. Observability ties the path together.
When those responsibilities blur, real-time systems become fragile. When they stay separate, teams can reason about bugs: was the mutation saved, was the event published, was it delivered, did the client reconcile it?
That clarity is the difference between a feature that demos well and a system that survives production.
Production Checklist
- Use WebSockets only where product latency matters.
- Keep domain state outside the socket gateway.
- Define stable event names and payload versions.
- Verify permissions before joining rooms.
- Use Redis/pub-sub for multi-instance delivery.
- Implement reconnect recovery through HTTP sync.
- Make events idempotent where possible.
- Track connection count, queue lag, and delivery latency.
- Keep payloads small and resource-specific.
- Test deployments while clients are connected.
Closing Thought
Real-time systems should feel instant to users and boring to operate. That only happens when the architecture treats WebSockets as one delivery mechanism inside a larger state system.
The goal is not to send events quickly. The goal is to keep users correctly informed while the network, backend, and product state keep changing.