8 Reasons why you want to avoid GitHub Actions in big monorepo projects in 2021/22

tomascubeek
Oct 14, 2021

This article is based on several months of hands-on experience on a big project. Its purpose is not to bash GitHub Actions, but rather to offer feedback and advice that may save someone else several days of work.

I think GitHub Actions is a handy platform, and not only for the open-source community. Many tools, libraries, and well-known projects benefit from it. Still, you may find deal-breaking limitations on enterprise-level, large-scale projects (especially once it becomes a highly paid service), and it is good to know about them as soon as possible.

Also, be aware that the limitations described here will very likely apply to similar services such as CircleCI, Azure DevOps, etc.

Let me share some metrics of the project we are working on (to show what “big” really means in the context of this article).

Monorepo size: ~1 GB of various sources in git
Monorepo tech stack: Node.js, C++, PowerShell, and .NET, with hundreds of projects and tests
Pipeline: ~35 separate jobs/configurations connected together, with the ultimate goal of producing a single installer for the customer
Pipeline times: ~1.5 hours to the installer, including tests (billable up to 3–4 hours, because we run many things in parallel and utilize cache heavily)
Engineering team: ~50 people producing PRs daily
Other metrics: ~100 PRs in the queue at peak, 50% managed agents + 50% self-hosted runners

The pipeline structure

We’re talking about an on-prem product, so the requirements are slightly different from what you may expect for SaaS applications (where microservices may be released separately, and thus splitting into many repositories may sound more natural, etc.).

The road to failure

We started simple: we picked GitHub Actions as it seemed to be a popular choice, offering everything we needed (at first glance). We always knew we would migrate to something else eventually (because of security requirements), but we didn’t expect it to be so soon. However, once we got further with the project (once we reached the numbers above), the GHA pipeline basically stopped working for us. Why?

  1. Limited caching — if you have a huge project, you don’t want to rebuild everything again and again (when only a trivial change is introduced). You don’t want to pay for it either. So you utilize caching capabilities (pretty standard in GHA, CircleCI, and Azure DevOps). That’s a powerful concept: you can cache what you need, when you need it (see the caching sketch after this list). As you can probably imagine, the more extensive your pipeline, the more places where you need to utilize the cache. Unfortunately, GHA offers only 5 GB of cache per repository, and we grew so much that we hit this limit several times per day. So cache utilization was minimal, and the bigger our project/pipeline got, the less cache we could utilize. This is a big deal, because it makes your project rebuild very often without real changes, and it costs you time and money.
  2. Managed agents with at most 12–16 GB of HDD space and limited tooling installed. Not all jobs actually hit this issue, but there was still a number of situations (e.g., VB6, some C++) where it was a real problem, and it basically forced us to use self-hosted runners as well. Unfortunately, when you start using self-hosted runners, GitHub has no option to make them “short-living”, i.e. killed right after the job and re-created later (a security requirement we can’t avoid). Azure DevOps has “scale set agents”, which would probably help us meet this requirement, but it shares the other limitations (plus both products are driven by one company now, so I would expect them to converge soon).
  3. Regularly reaching API call limits (caused by many jobs multiplied by the number of open PRs), even with an enterprise account (see the documented usage limits). That would not be so bad if those builds were at least postponed instead of canceled/failed (more about cancellation below). Our project consumed most of the resources dedicated to the enterprise account within the first few days.
  4. Terrible stability — when we started to use GitHub Actions heavily, we ran into API usage limit issues. We got frozen builds (e.g., while downloading an artifact/cache) and paid for them until we realized they had hung and killed them. We got to the point where the PR count only grew over time and we could not process them. There were also some outages (but that’s kind of expected). And when we had to cancel more than one workflow, we got errors and exceptions saying it wasn’t even possible to cancel those scheduled workflows.
  5. As we hit some limits (API calls) and builds got canceled, people were able to merge their PRs anyway: even if you set “required checks” on the proper jobs, a job is considered “ok” when it is skipped or canceled. The only way to avoid this is to implement your own custom check (see the guard-job sketch after this list). This caused us many incorrect merges and made the “auto-merge” feature unusable. If you use GH Actions, try it yourself: cancel a running job and watch the PR become mergeable — cool feature.
  6. Inability to re-run one particular job (for some reason, only the “re-run all jobs” button is available when you need to deal with an environmental issue or a flaky test in one particular job) — not a big deal if you can rely on caching… but see point 1.
  7. Inability to re-use a cache in a different job (on a different OS). You can work around this with artifacts (see the artifact-passing sketch after this list), but it means you sometimes need to unpack a cache only to pack it again into an artifact and pass it further… Implementing a cross-platform cache is not a big deal.
  8. YAML maintenance — we had over 2,100 lines of code in a single file defining our workflow (utilizing custom actions where we could). There is no simple way to split it into more files effectively without creating more workflows. Custom actions are currently limited to composite actions, which do not allow other (custom) actions to be reused inside (see the composite action sketch after this list). It really complicates whatever abstraction/simplification you would like to introduce.
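
To make point 1 concrete, here is a minimal sketch of the standard caching setup (job names and paths are hypothetical, not our actual pipeline). One such step is cheap, but multiply it across ~35 jobs with multi-gigabyte build outputs and the shared 5 GB per-repository quota is exhausted within hours:

```yaml
# Minimal caching sketch; job name and paths are hypothetical.
jobs:
  build-node:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      # Restore the npm cache keyed by the lockfile hash; any lockfile
      # change invalidates the key and falls back to a partial match.
      - uses: actions/cache@v2
        with:
          path: ~/.npm
          key: npm-${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            npm-${{ runner.os }}-
      - run: npm ci
      - run: npm run build
```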
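For point 5, the custom check we mean is a guard job like the following sketch (job names are hypothetical). You mark only the guard as “required” in branch protection, and it explicitly fails when any upstream job did not succeed. Note it only mitigates skipped jobs; if the whole run is cancelled, the guard itself can be cancelled with it:

```yaml
  # Hypothetical guard job: mark only this one as "required" in branch
  # protection. It runs even when upstream jobs are skipped and fails
  # unless every dependency actually finished with "success".
  verify-all:
    if: always()
    needs: [build-node, run-tests]
    runs-on: ubuntu-latest
    steps:
      - run: |
          if [ "${{ needs.build-node.result }}" != "success" ] || \
             [ "${{ needs.run-tests.result }}" != "success" ]; then
            echo "A required job was skipped, cancelled, or failed."
            exit 1
          fi
```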
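The artifact workaround from point 7 looks roughly like this (script names are hypothetical): a Windows job publishes its output as an artifact so a Linux job can consume it, since the cache itself cannot cross the OS boundary:

```yaml
  # Fragment of the jobs: section; build/package scripts are hypothetical.
  build-win:
    runs-on: windows-latest
    steps:
      - uses: actions/checkout@v2
      - run: ./build.ps1            # hypothetical build script
      # Publish the output as an artifact so non-Windows jobs can use it.
      - uses: actions/upload-artifact@v2
        with:
          name: win-binaries
          path: out/

  package-linux:
    runs-on: ubuntu-latest
    needs: build-win
    steps:
      - uses: actions/download-artifact@v2
        with:
          name: win-binaries
          path: out/
      - run: ./package.sh           # hypothetical packaging step
```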
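And finally the composition limit from point 8: a composite action can only bundle run steps, so you cannot call another action (e.g. actions/cache) from inside it. A hypothetical composite action looks like this:

```yaml
# .github/actions/setup-build/action.yml (hypothetical)
name: setup-build
description: Shared setup steps reused across build jobs
runs:
  using: composite
  steps:
    # Only `run` steps are allowed here; a nested `uses:` step
    # (another action) is not supported, which blocks deeper reuse.
    - run: npm ci
      shell: bash
    - run: npm run bootstrap
      shell: bash
```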

I could go on (e.g., there are almost no “native” test reporters, and the community ones are of poor quality), but this is not just about GitHub Actions. Azure DevOps has similar policies and limits; it basically utilizes the same infrastructure, and I would expect some “merge” in the near future. I believe CircleCI does not support short-living self-hosted runners either, and its recommended cache size is around 500 MB per job, which is still not enough (I’m not sure about the limit per repo). The fact is that, after analyzing our requirements, the ideal cache size for our repository was around ~1 TB (keeping data for 14 days) = 200x more than GHA can offer (AFAIK you can’t even buy more from GitHub). And 1 TB of data is nothing today…

I consider it necessary to repeat that if you have a product consisting of a few services (like web + backend, or just a library and its tests), GitHub Actions will serve you well (and you will survive even on the free tier). You may never hit the limits described here. The focus here is really on bigger-scale projects, describing the long-standing state of GitHub Actions (as of 9/2021).

If I had to use an alternative public CI/CD-as-a-service for bigger projects, I would probably go with Azure DevOps (as its scale set agents look really promising for our use case), or I would try TeamCity Cloud. In the meantime, I hope GitHub Actions will join forces with Azure DevOps and build a next-generation platform. But, unfortunately, that is not the case now.

In the next article, I will describe how we effectively utilized TeamCity for big projects like this and how you can tune it up for a monorepo setup.
