
It’s always the same story: your CI/CD pipeline breaks at the worst possible time — right before a demo, a release, or a crucial UAT handoff. That’s exactly what happened when a client approached us at Sygnls with a frustrating issue: their GitHub Actions pipeline had started randomly failing during Docker builds. We hadn’t set up their CI/CD stack — it had evolved internally over time — but we were brought in to investigate and stabilize the situation.
What looked like a minor flake turned out to be a deeply embedded problem with how their Docker layers and package installation steps were cached. The worst part? The failure wasn’t consistent. Builds would sometimes hang silently, pass falsely, or explode mid-way without leaving a clear trail.
The kicker? It wasn’t code.
It wasn’t infra drift.
It was a stuck Docker layer cache combined with a weird dependency install race condition.
Let’s break down how we found it — and the stupidly simple one-script hack that now lives in their repo.
The Setup
Client’s architecture:
- Monorepo
- GitHub Actions for CI/CD
- AWS ECS for container deployment
- Self-hosted runners behind the scenes
- Docker builds per service using docker-compose
- NPM + Yarn (because… reasons)
The pipeline looked fine. Green on smaller PRs. But every so often it would choke hard — hanging at the “Build & Push Docker Image” stage for 8–10 minutes, then failing the Yarn install due to missing binaries.
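To ground the rest of this, here’s roughly the shape of the failing job: a minimal sketch of a docker-compose build-and-push step, not the client’s actual workflow (runner labels, branch names, and registry details are illustrative).
name: build-and-push
on:
  push:
    branches: [main]
jobs:
  build-and-push:
    runs-on: [self-hosted, linux]   # the client ran self-hosted runners
    steps:
      - uses: actions/checkout@v4
      # One image per service, as defined in docker-compose.yml
      - name: Build & Push Docker Image
        run: |
          docker compose build
          docker compose push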
We added more logging.
We added --verbose flags everywhere.
We even swapped out runners.
Still broken — and inconsistently so.
Then we noticed something subtle:
Builds that followed a PR where the Dockerfile changed were fine.
Builds after JS-only commits? 🧊 Frozen.
The Diagnosis
What made this issue particularly painful was how invisible it was. Logs didn’t scream errors — they just hung. Sometimes the build passed. Sometimes it took forever. Sometimes it failed on step 3 out of 7 with cryptic Yarn/NPM issues that didn’t make sense given the clean state of the repo.
We ran side-by-side builds with identical commits across multiple branches — same result: complete inconsistency. One would fly through the install phase; the other would choke or randomly fail to build native dependencies like node-sass, sharp, or bcrypt.
At first, we assumed it was a flaky runner or resource starvation. We switched between GitHub-hosted and self-hosted runners. We tried disabling concurrency. We pinned Node.js versions. We even dumped entire dependency trees for diffing across commits — no dice.
Then we noticed a pattern:
When someone changed the Dockerfile, the builds worked.
When the code changed but the Dockerfile didn’t, things got weird.
That’s when it clicked — the Docker cache.
Classic: cache is fast until it kills you.
Because node_modules was cached inside a broken layer, the next build skipped yarn install entirely and continued. It had no idea it was shipping broken binaries inside the container. And the weird part? This broken state persisted across branches, because GitHub’s layer cache doesn’t scope cleanly across runner jobs unless you explicitly configure cache keys per job context.
Even worse: Yarn didn’t throw an error because it wasn’t even running.
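To see why, picture the common Dockerfile layout sketched below. This is a simplified illustration of the pattern, not the client’s exact file; the base image and build steps are placeholders.
FROM node:18
WORKDIR /app

# Only the manifests go into this layer, so its cache key ignores the rest of the repo...
COPY package.json yarn.lock ./

# ...which means JS-only commits reuse this cached layer, including any half-built
# native binaries (node-sass, sharp, bcrypt) that got baked into it.
RUN yarn install --frozen-lockfile

# The application code then lands on top of the possibly-poisoned node_modules.
COPY . .
RUN yarn build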
The Fix: A Dumb Little Script
Here’s what saved the day:
#!/bin/bash
# Blow away whatever a stale cached layer may have left behind...
if [ -d node_modules ]; then
  rm -rf node_modules
fi

# ...then do a fresh, lockfile-pinned install so native binaries get rebuilt.
if [ -f yarn.lock ]; then
  yarn install --frozen-lockfile
else
  echo "No yarn.lock found. Skipping install."
fi
We dropped that as prebuild.sh and updated every Dockerfile to use:
COPY prebuild.sh .
RUN chmod +x prebuild.sh && ./prebuild.sh
That’s it.
No fancy cache keys. No third-party actions.
Just remove the thing that lies and re-do the thing that matters.
📈 The Impact
- Pipeline time went down 30% (because no silent retries)
- Zero install-related build errors in the three weeks since
- Faster rollback cycles because we could trust build consistency
We eventually wrapped the logic into a reusable step and now inject it via a GitHub Actions reusable workflow. But honestly? Just having this in your Dockerfile is enough.
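If you do want the workflow-level version, a reusable workflow along these lines gets you there. This is a minimal sketch under our own naming; the file path, input, and runner label are assumptions, not the client’s actual config.
# .github/workflows/prebuild.yml (hypothetical path)
name: prebuild
on:
  workflow_call:
    inputs:
      working-directory:
        required: false
        type: string
        default: "."
jobs:
  clean-install:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Nuke stale node_modules and reinstall
        working-directory: ${{ inputs.working-directory }}
        run: |
          rm -rf node_modules
          if [ -f yarn.lock ]; then
            yarn install --frozen-lockfile
          else
            echo "No yarn.lock found. Skipping install."
          fi
Callers then pull it in with a job-level uses: pointing at that workflow file.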
In DevOps, it’s usually not the big things that kill you — it’s the flaky little ghosts in your build chain. And when time-to-restore matters more than elegance, a “dirty but deterministic” script beats an overengineered abstraction 10/10 times.
We’ve since ported this logic into Python and Go repos too — same story, different venv or vendor folder.
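The ported versions are just as dumb. A rough sketch, with folder names following common conventions rather than anything lifted verbatim from those repos:
#!/bin/bash
# Python: drop the stale virtualenv and reinstall from the pinned requirements.
if [ -d venv ]; then
  rm -rf venv
fi
if [ -f requirements.txt ]; then
  python -m venv venv
  venv/bin/pip install -r requirements.txt
fi

# Go: drop stale vendored dependencies and re-vendor from go.mod/go.sum.
if [ -d vendor ]; then
  rm -rf vendor
fi
if [ -f go.mod ]; then
  go mod vendor
fi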
If you’ve got mystery CI flakes, start looking at your layer cache lies. And maybe… just maybe… run a dumb little cleanup script before you build again.
Conclusion
In the end, it wasn’t some obscure edge case or a broken dependency — it was a cached lie hiding in plain sight. The client’s CI pipeline had become fragile because of small oversights stacked over time: assumptions about layer reuse, missing validation, and the lack of an enforced clean build step.
The fix was stupidly simple — a single script that nuked stale node_modules and ensured fresh installs. But the process to uncover it involved careful debugging, pattern recognition, and a deep understanding of how Docker, Yarn, and GitHub Actions interact during builds.
At Sygnls, this is exactly the kind of work we thrive on — stepping into complex systems, identifying subtle failure points, and delivering lightweight, maintainable solutions that make pipelines fast, stable, and boring (in the best way).
Because in DevOps, boring is beautiful.