Debugging Node.js in Production

Most Node.js tutorials stop at "it works on my machine." Production is where the interesting failures live, and they rarely show up in your logs. Memory climbs for no visible reason. One endpoint pegs a CPU core. The app is fine locally but slow under real traffic. You stare at a graph going up and to the right with no idea what is causing it.

I spent some time this week running N|Solid locally to see where a dedicated Node monitoring tool fits, and it sent me back through the whole toolbox for diagnosing this class of problem. So this post lays it out in plain terms: the real problems you hit running Node in production, how to diagnose each one, and where the free tools end and a paid tool like N|Solid starts to earn its place.

If you only take one thing away: monitoring tells you THAT something broke. Introspection tells you WHY. They are different jobs, and most of the confusion around these tools comes from treating them as the same thing.

Problem 1: A memory leak you cannot catch

This is the classic one. Your process memory creeps up over hours or days, garbage collection cannot keep up, and eventually the process either gets OOM-killed or is restarted by your orchestrator. Your logs show nothing because nothing is technically erroring. The app is just slowly eating itself.

The reason this is hard is that a metric showing "memory is high" tells you the symptom, not the cause. To find the cause you need a heap snapshot: a dump of every object alive in the V8 heap and what is keeping it there.

The free way to get one is the built-in inspector. Start your process with the inspector enabled:

bash

node --inspect dist/main.js

Then open chrome://inspect in Chrome, click your target, go to the Memory tab, and take a heap snapshot. Take a second one a few minutes later under load, then use the "Comparison" view to see which object types grew. Growing object counts that never come back down are your leak. The retainers panel shows you exactly what reference is holding them in memory, which is usually a cache that never evicts, an array you keep pushing to, or an event listener you never removed.
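To make the "cache that never evicts" case concrete, here is a minimal sketch of that retainer pattern next to a bounded alternative. The names (leakyCache, rememberForever, BoundedCache, maxEntries) are made up for illustration and do not come from any library, and the eviction policy is just the simplest one that works (drop the oldest insertion), not a recommendation.

```javascript
// Hypothetical example: all names here are illustrative.

// Leaky: every distinct key stays reachable forever, so the Map only grows.
const leakyCache = new Map();
function rememberForever(key, value) {
  leakyCache.set(key, value);
}

// Bounded: a Map iterates in insertion order, so the first key is the oldest.
class BoundedCache {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.entries = new Map();
  }

  set(key, value) {
    // Delete first so a re-inserted key moves to the "newest" end.
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      const oldestKey = this.entries.keys().next().value;
      this.entries.delete(oldestKey);
    }
  }

  get(key) {
    return this.entries.get(key);
  }

  get size() {
    return this.entries.size;
  }
}
```

In a snapshot comparison, the leaky version should show up as entries whose count keeps growing, with the module-level Map as their retainer. The bounded version should level off at maxEntries.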

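You can also see where memory is being allocated with clinic's heap profiler. It does not take snapshots; it records allocations by function and renders them as a flame graph: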

bash

npm install -g clinic
clinic heapprofiler -- node dist/main.js

So where does N|Solid come in? The friction with the free approach is that you usually did not start the process with --inspect, and turning it on in production typically means a restart or redeploy and hoping the leak comes back. (On Linux and macOS you can send SIGUSR1 to a running process to activate the inspector, but you still have to reach the debug port on a production box, and exposing that port is a risk of its own.) N|Solid's agent runs inside the runtime on a separate thread, so you can pull a heap snapshot from a live production process on demand, from a console, with no redeploy and no debugger to attach. The snapshot data is the same V8 data either way. You are paying for the convenience of getting it from a running prod box without disrupting anything.
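There is also a free middle ground, sketched here under the assumption that you can ship a small code change before the leak appears: have the process write its own snapshot when it receives a signal. v8.writeHeapSnapshot() is part of Node's built-in v8 module. The choice of SIGUSR2, the output directory, and the dumpHeapSnapshot helper name are my assumptions. Writing a snapshot blocks the event loop while it runs and needs a good deal of extra memory on a large heap, so treat this as a sketch to adapt, not something to wire up blindly.

```javascript
const v8 = require('node:v8');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: writes a .heapsnapshot file and returns its path.
function dumpHeapSnapshot(dir = os.tmpdir()) {
  const file = path.join(dir, `heap-${process.pid}-${Date.now()}.heapsnapshot`);
  // Synchronous: the event loop is blocked until the file is written.
  return v8.writeHeapSnapshot(file);
}

// Assumption: SIGUSR2 is free in your setup (some process managers use it).
// These signals are not available on Windows.
if (process.platform !== 'win32') {
  process.on('SIGUSR2', () => {
    const file = dumpHeapSnapshot();
    console.log('heap snapshot written to', file);
  });
}
```

Trigger it with kill -USR2 <pid>, copy the file off the box, and load it in the Memory tab of Chrome DevTools. Node also has a --heapsnapshot-signal startup flag that does much the same without a code change, but that is again something you have to set before the process starts.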

Problem 2: A CPU spike with no obvious cause

One route starts pinning a CPU core. Latency on that path goes through the roof. You suspect a specific endpoint but you cannot prove which function is actually burning the cycles.

What you need here is a CPU profile, which samples the call stack many times per second and tells you where the time actually went. The free option is the built-in profiler:

bash

node --prof dist/main.js
# run your load, stop the process, then:
node --prof-process isolate-*.log > processed.txt

The processed output is dense but it shows you the hottest functions by tick count. For something far more readable, clinic flame generates a flame graph:

bash

clinic flame -- node dist/main.js

A flame graph makes the hot path obvious at a glance. Wide bars are where your time goes. Most of the time the culprit is something unglamorous: a regex run on every request, JSON parsing of a large payload, or a synchronous crypto call.

N|Solid does the same CPU profiling, captured from a live process through its console rather than via a flag you set at startup. Again, same underlying V8 profiler, different delivery. Useful at scale when redeploying with a profiling flag is not practical.
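If you would rather not restart with a flag, Node's built-in inspector module can drive the same V8 profiler from inside the process. The sketch below is one way to do that, assuming you expose it behind something you control (an admin-only route, a signal handler). The captureCpuProfile name, the duration, and the output path are placeholders of mine.

```javascript
const inspector = require('node:inspector');
const fs = require('node:fs');

// Hypothetical helper: profiles this process for durationMs milliseconds
// and writes the result to outFile as JSON.
function captureCpuProfile(durationMs, outFile) {
  return new Promise((resolve, reject) => {
    const session = new inspector.Session();
    session.connect();

    session.post('Profiler.enable', (enableErr) => {
      if (enableErr) return reject(enableErr);

      session.post('Profiler.start', (startErr) => {
        if (startErr) return reject(startErr);

        // Let the profiler sample for the requested window, then stop.
        setTimeout(() => {
          session.post('Profiler.stop', (stopErr, { profile } = {}) => {
            session.disconnect();
            if (stopErr) return reject(stopErr);
            fs.writeFileSync(outFile, JSON.stringify(profile));
            resolve(outFile);
          });
        }, durationMs);
      });
    });
  });
}
```

A call such as captureCpuProfile(10000, '/tmp/spike.cpuprofile') during a spike would produce a file you can open in Chrome DevTools. Profiling adds some overhead while it runs, so keep the window short.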

Problem 3: Slowness in production you cannot reproduce locally

Everything is fast on your laptop, then under real traffic requests start hanging. The usual offender is the event loop. Node runs your JavaScript on a single thread, so one synchronous operation that takes too long blocks every other request behind it. A synchronous file read, a heavy JSON.parse, a tight loop over a big array: any of these can quietly stall the whole process.
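Here is a minimal sketch of one common fix for the tight-loop case, assuming the work can be split into independent items: process the array in slices and yield to the event loop between slices with setImmediate, so other requests get a turn. processInChunks, chunkSize, and handleItem are hypothetical names. For genuinely CPU-heavy work a worker thread is usually the better tool.

```javascript
// Hypothetical helper: runs handleItem over items without holding the
// event loop for the whole array.
function processInChunks(items, handleItem, chunkSize = 1000) {
  return new Promise((resolve, reject) => {
    let index = 0;

    function runChunk() {
      try {
        const end = Math.min(index + chunkSize, items.length);
        for (; index < end; index++) {
          handleItem(items[index], index);
        }
      } catch (err) {
        reject(err);
        return;
      }

      if (index < items.length) {
        // Yield so pending I/O callbacks and timers can run before the next slice.
        setImmediate(runChunk);
      } else {
        resolve();
      }
    }

    runChunk();
  });
}
```

The total work is the same. What changes is the longest stretch during which nothing else can run, which is now one slice instead of the whole array.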

You can measure event loop delay yourself with the built-in perf_hooks module:

javascript

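const { monitorEventLoopDelay } = require('perf_hooks');

// Sample event loop delay every 20 ms into a histogram.
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

setInterval(() => {
  // percentile() returns nanoseconds; divide by 1e6 for milliseconds.
  console.log('event loop delay p99 (ms):', histogram.percentile(99) / 1e6);
  // Reset so each report covers only the last 5 seconds.
  histogram.reset();
}, 5000);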

If that p99 number is climbing into tens or hundreds of milliseconds, something is blocking the loop. clinic doctor will diagnose this automatically and even point at whether the problem is the event loop, GC, or I/O:

bash

clinic doctor -- node dist/main.js

This is one area where a tool like N|Solid is genuinely convenient: it tracks event loop lag continuously in production and alerts on it, so you catch the blocking behavior as it happens rather than after a user complains. You can build the same alerting yourself with the snippet above plus your metrics pipeline, but you have to build it.
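As a rough sketch of that do-it-yourself version: wrap the histogram in a periodic check and call into whatever your metrics pipeline exposes. Everything outside perf_hooks here is an assumption. checkLoopLag, startLoopLagAlerts, thresholdMs, and the onAlert callback are hypothetical stand-ins for your own alerting hook.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');

// Pure check, separated out so it can be unit tested with a fake histogram.
// percentile() reports nanoseconds, so convert before comparing.
function checkLoopLag(histogram, thresholdMs) {
  const p99Ms = histogram.percentile(99) / 1e6;
  return { p99Ms, breached: p99Ms > thresholdMs };
}

// Hypothetical wiring: onAlert stands in for your metrics or alerting client.
function startLoopLagAlerts({ thresholdMs = 100, intervalMs = 5000, onAlert }) {
  const histogram = monitorEventLoopDelay({ resolution: 20 });
  histogram.enable();

  const timer = setInterval(() => {
    const result = checkLoopLag(histogram, thresholdMs);
    if (result.breached) onAlert(result);
    histogram.reset();
  }, intervalMs);

  // Do not keep the process alive just for this monitoring timer.
  timer.unref();

  // Return a stop function for clean shutdown.
  return () => {
    clearInterval(timer);
    histogram.disable();
  };
}
```

The 100 ms default threshold is a guess, not a recommendation. Pick a value from what your p99 looks like on a healthy day.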

Problem 4: Not knowing which loaded dependency carries a known vulnerability

Your package.json lists what you intended to install. What actually runs is the full resolved tree, including transitive dependencies you never chose. A known CVE three levels deep is easy to miss.

The free baseline is built into npm:

bash

npm audit

This checks your dependency tree against the advisory database. It is good, but it works off your lockfile, not off what your running process has actually loaded into memory. N|Solid scans the modules actually loaded at runtime and flags known vulnerabilities against them, which catches the gap between "what is installed" and "what is executing." For most teams npm audit in CI plus Dependabot covers this well enough; the runtime angle matters more in locked down enterprise environments with compliance requirements.
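To see that gap for yourself, a rough sketch is to list the packages the process has actually loaded. This is only an illustration and not how N|Solid does it: it reads require.cache, so it covers CommonJS modules only (not ES modules), and the packageNameFromPath and loadedPackages helpers are names I made up.

```javascript
// Hypothetical helper: extract the package name from a resolved file path,
// e.g. ".../node_modules/@scope/pkg/lib/x.js" -> "@scope/pkg".
function packageNameFromPath(filePath) {
  const parts = filePath.split(/[\\/]/);
  // Use the last node_modules segment so nested dependencies resolve correctly.
  const i = parts.lastIndexOf('node_modules');
  if (i === -1 || i + 1 >= parts.length) return null;

  const first = parts[i + 1];
  if (first.startsWith('@')) {
    // Scoped packages use two path segments: @scope/name.
    return i + 2 < parts.length ? `${first}/${parts[i + 2]}` : null;
  }
  return first;
}

// Packages with at least one CommonJS module loaded in this process.
function loadedPackages(paths = Object.keys(require.cache)) {
  const names = new Set();
  for (const p of paths) {
    const name = packageNameFromPath(p);
    if (name) names.add(name);
  }
  return [...names].sort();
}
```

Calling loadedPackages() after your app has served some traffic and comparing the result with npm ls --all gives a feel for how much of the installed tree actually executes. It reports names only, so mapping them to versions and advisories is still on you.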

So do you actually need N|Solid?

Here is the honest summary, because it is easy to get talked into tooling you do not need.

Everything above can be done for free. node --inspect with Chrome DevTools, node --prof, clinic.js, perf_hooks, and npm audit cover memory snapshots, CPU profiles, event loop analysis, and vulnerability scanning. If you are a solo developer or a small team, that toolbox is genuinely enough, and learning it makes you a better engineer than leaning on a dashboard ever will.

What N|Solid sells is not a capability you are missing. It is the removal of friction at scale. Pulling a heap snapshot from a live production process without redeploying, profiling a running box from a browser, continuous event loop monitoring with alerting, all aggregated across a fleet. That friction is trivial when you have one process on one box and genuinely painful when you have dozens of processes scaling up and down across an autoscaling group at 2am during an incident.

The deciding question is simple. Can you name a specific, recurring problem your current setup has failed to solve? A memory leak you keep losing because you cannot catch a snapshot at the right moment. CPU spikes you cannot profile because the box is gone by the time you SSH in. If yes, a tool in this category earns its place. If you cannot name one, you are buying a dashboard for a problem you do not have, and the free tools plus a bit of discipline will serve you better.

The takeaway

Production debugging is less about tools and more about knowing which question you are asking. Is memory the problem, or CPU, or the event loop? Do you need to know THAT something is wrong, or WHY? Get that straight first, reach for the cheapest tool that answers it, and only pay for convenience once you have felt the pain that convenience removes.

What do you reach for when you need to debug Node in production? I am curious whether people lean on the built-in tooling or pay for something on top.
