What you think about ArgoCD might be wrong

Unless you've been using ArgoCD for quite some time, let's say a year or more, what you think about ArgoCD might be wrong.

It's like the other CI/CD tools, right?

ArgoCD brings a wealth of capability to Kubernetes-oriented delivery flows. It's arguably the leading tool in what we sometimes call the GitOps space. At a glance, its value proposition is often misread, which leads people to both underestimate and overestimate it, depending on the context. Let me explain.

Certain tools in the continuous integration and build automation space routinely position themselves as Continuous Delivery tools. Historically, CI and CD were only loosely related. That's probably why we get the CI/CD split in the name instead of, say, a single concept that describes both.

Over time, pioneering CI-focused tools like Jenkins evolved their capabilities, or at least their messaging, to include CD. Eventually, from a market perspective, the "CI/CD tool" label was applied to all of the existing tooling, regardless of whether it was actually being used for CD or just CI.

Because of this background, I believe it's reasonable that one's first impression of ArgoCD is that it's an alternative to the existing tools, such as Jenkins or GitHub Actions. This is arguably half true, but it doesn't paint the full picture and, I believe, leads to a warped view of what ArgoCD actually does.

ArgoCD is not a CI tool. You'll need a separate tool for that.

GitOps continuous delivery tool for Kubernetes

To its credit ArgoCD has never claimed to be more than it is. At the time of writing this, if you go to the ArgoCD website, it clearly states that it is a

Declarative, GitOps continuous delivery tool for Kubernetes

Pretty specific. It's just that we engineers are so used to vendors claiming to solve more problems than they should, rather than staying in their lane and doing one thing well.

Back to Jenkins and GitHub Actions. If you asked someone from CloudBees whether Jenkins is a Continuous Delivery tool, I'm sure they would say yes, and they wouldn't be wrong. After all, you can use Jenkins to continuously deliver your code to production. The same is true of GitHub Actions.

But there is a very important nuance here. Let's say you deployed to Kubernetes purely with Jenkins or GitHub Actions.

How do you ensure it was successful?
If things go wrong, how do you debug it?
What happens if someone tampers with the environment afterwards? Will it go unnoticed?
If a change breaks something, will it be remediated quickly? ...and automatically?

It's important to ask these questions when you are working with Kubernetes, and it's not particularly easy to answer them without a lot of bespoke scripting around Jenkins, GitHub Actions or whatever other general-purpose CI/CD tool you use.

ArgoCD is complementary: it brings a beautiful UI with rich Kubernetes awareness, and it complements existing CI/CD pipeline tools such as GitHub Actions rather than replacing them.

Dude, what's my rollout status?

If this question sounds strange or trivial to you, you may not be grasping one of the fundamental concepts behind Kubernetes configuration management.

Unlike terraform apply, kubectl apply guarantees neither that a change has completed nor that it was successful. "Huh?" you may say.

This is because kubectl apply only asks the API server to accept and store the desired state, and that's just about it. Controllers and the scheduler then act on that state in the background, so there is no guarantee that the change actually completed, let alone successfully.

OK, sure, it's asynchronous, you might say. But it eventually works, right? Well, sometimes, but definitely not always. With great power and flexibility comes great complexity. Here are some examples of the things that could go wrong for a Deployment in Kubernetes:

❌ There isn't enough cluster capacity to schedule the new pods
❌ The request was valid but the pod didn't start due to a code or config bug
❌ The memory limits were too low, so the container keeps getting OOM-killed (and rightly so)
❌ The resource depends on other resources that don't exist (e.g. a missing ConfigMap or Secret)

Wow! That's quite a lot of ways for a CI/CD pipeline that runs kubectl apply to complete successfully, only for developers to find out later that the rollout never succeeded (and for good reason).

OK, smarty pants, you might be thinking: that's what the kubectl wait and kubectl rollout status commands are for.

Well, yes, that's right. So you could update your pipeline to something like this:

# Stop straight away if the API server rejects the manifests
kubectl apply -f ./manifests/ || exit 1

# Wait for the deployment to be ready.
# rollout status exits non-zero if the rollout fails or the timeout is reached.
echo "Waiting for deployment $DEPLOYMENT_NAME to be ready..."
if kubectl rollout status "deployment/$DEPLOYMENT_NAME" -n "$NAMESPACE" --timeout=300s; then
   echo "Deployment $DEPLOYMENT_NAME is ready!"
else
   echo "Deployment $DEPLOYMENT_NAME did not become ready (the rollout failed or timed out)."
   exit 1
fi

But do you really want to write bash scripts endlessly? And what about the other resource kinds?

We've given an example for a Deployment, but once you count custom resources, Kubernetes has 100s if not 1000s of kinds. Each of these has different failure modes, and ArgoCD is capable of visualising them well.

For example, Deployments predominantly fail because of the app itself or resource allocation, as mentioned above, but what about a Certificate?

A Certificate can fail because of provisioning issues such as missing DNS records. In any case, these resources fail asynchronously. That's why the ArgoCD application view is so useful: it gives you real-time, application-centric insight into the current state and health.
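To make the contrast concrete, here is a minimal sketch of where the bespoke scripting route leads once you cover more than Deployments. The function name `wait_for_resource` is hypothetical, the Certificate branch assumes cert-manager's Certificate resource, and every kind you add needs yet another branch.

```bash
# Hypothetical helper: wait for a resource using the readiness check that suits its kind.
# Every new kind you deploy means another branch (and another way to get it wrong).
wait_for_resource() {
  local kind=$1 name=$2 namespace=$3 timeout=${4:-300s} lower
  case "$kind" in
    Deployment|StatefulSet|DaemonSet)
      # Workload kinds report rollout progress via `kubectl rollout status`.
      lower=$(printf '%s' "$kind" | tr '[:upper:]' '[:lower:]')
      kubectl rollout status "$lower/$name" -n "$namespace" --timeout="$timeout"
      ;;
    Job)
      # A Job is done once its Complete condition is true.
      kubectl wait --for=condition=complete "job/$name" -n "$namespace" --timeout="$timeout"
      ;;
    Certificate)
      # cert-manager Certificates set a Ready condition once the certificate is issued.
      kubectl wait --for=condition=Ready "certificate/$name" -n "$namespace" --timeout="$timeout"
      ;;
    *)
      echo "No readiness check defined for kind: $kind" >&2
      return 2
      ;;
  esac
}
```

Note that a Job that fails never reaches the complete condition, so this sketch only reports the failure once the timeout expires: the kind of per-kind subtlety you would otherwise keep discovering yourself.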

ArgoCD has native support for synchronising resources on a Kubernetes cluster.

You can sync with the argocd app sync command or hit that big arse "SYNC" button.

If you trigger ArgoCD from your CI/CD pipeline of choice¹ and then wait for the result (for example with argocd app wait), the pipeline is guaranteed not to complete until it gets a result (i.e. success or failure). This gives you a reusable pattern that works for any kind of resource.
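A minimal sketch of that pattern as a pipeline step, assuming the argocd CLI is installed and already logged in to your ArgoCD server; `argocd_deploy` and the app name are hypothetical. `argocd app wait --health` blocks until the app reports Healthy, whatever kinds of resources it contains, and exits non-zero if that doesn't happen in time.

```bash
# Hypothetical pipeline step: sync an ArgoCD app, then block until it is healthy.
# Assumes the argocd CLI is installed and already logged in (e.g. via `argocd login`).
argocd_deploy() {
  local app=$1 timeout=${2:-300}
  # Start the sync; a sync operation that errors stops the step here.
  argocd app sync "$app" --timeout "$timeout" || return 1
  # Wait for every resource in the app to report Healthy, whatever its kind.
  if argocd app wait "$app" --health --timeout "$timeout"; then
    echo "Application $app is synced and healthy"
  else
    echo "Application $app did not become healthy (failed or timed out after ${timeout}s)" >&2
    # Print the resource tree with health status so the failing resource shows up in the log.
    argocd app get "$app" >&2
    return 1
  fi
}
```

Calling `argocd_deploy guestbook` from a Jenkins or GitHub Actions step turns the job red when the app doesn't become healthy, and the `argocd app get` output in the log points at the failing resource.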

And so, we get a solid CI/CD approach for Kubernetes. The pipeline shows green when we are happy and red when we have things to fix. And we can always rely on the app view in ArgoCD to drill down into why an application may be degraded. From ArgoCD, we can even view pod logs directly or, if the web terminal feature is enabled, exec into the pods.

When things go wrong, how do you debug it?

In case you are still wondering what a world without ArgoCD might look like, consider this.

If things go wrong, your developers are stuck. If they've never read the Kubernetes manual, they are frankly in for a bad time. They'll either throw the problem over the fence to Platform/SRE/DevOps (oh, the irony) or just be unhappy. It's a terrible developer experience (DX), and it's why we see ArgoCD popping up in Kubernetes clusters everywhere!
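For a sense of what "being stuck" means in practice, this is roughly the manual triage a developer ends up doing for a single failed Deployment without ArgoCD. It is a hypothetical sketch: `triage_deployment` is an illustrative name, and the `app=<name>` pod label is an assumption about how your manifests are labelled.

```bash
# Hypothetical triage for one failed Deployment, done by hand without ArgoCD.
# Assumes the Deployment's pods carry an `app=<name>` label (a common convention, not a guarantee).
triage_deployment() {
  local name=$1 namespace=$2
  echo "== Deployment conditions and rollout state =="
  kubectl describe deployment "$name" -n "$namespace"
  echo "== Pods (look for Pending, CrashLoopBackOff, OOMKilled) =="
  kubectl get pods -n "$namespace" -l "app=$name" -o wide
  echo "== Recent events in the namespace (e.g. FailedScheduling, BackOff) =="
  kubectl get events -n "$namespace" --sort-by=.lastTimestamp
  echo "== Logs from one pod: the previous container if it crashed, else the current one =="
  kubectl logs "deployment/$name" -n "$namespace" --tail=50 --previous \
    || kubectl logs "deployment/$name" -n "$namespace" --tail=50
}
```

And that is one kind, in one namespace, for someone who already knows which commands to run.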

With modern CI/CD pipelines, you can link the pipeline to the related app² in ArgoCD. This makes for a very good DX: the developer can simply open the ArgoCD app and drill down to where it might have failed. In the UI, failing resources show as "Degraded", and the reason and the logs are only a click or two away.

The power of "Out of Sync" status

And maybe you're still not convinced that putting another tool in the middle of deployment is going to be worth it. Let me give it one last shot.

It's not uncommon for engineers to change things on the fly, manually, via web consoles, and then forget to go back and automate the change, or have no intention of doing so. This can be pretty problematic if the original setup was done by automation. The next time the automation runs, it might undo the manual change, or it might fail and annoy everyone while they waste time bringing the automation back under control³.

With ArgoCD, there are some very good built-in controls to handle these cases elegantly.

When your code is newer than the instance it manages, ArgoCD gives you a beautiful little "Out of Sync" status. And not just that: it gives you a "Sync" button that, when pressed, will remediate the environment. This also works in reverse. If your code is out of date with your instance, it will also show as "Out of Sync"⁴. This is where the "App Diff" function becomes especially useful.
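If you want the same "Out of Sync" signal outside the UI, for example from a scheduled CI job, a minimal sketch can lean on `argocd app diff`. It assumes the argocd CLI is logged in and that `argocd app diff` exits 1 when live state and git differ (the behaviour documented for recent CLI versions, but check yours); `check_drift` is a hypothetical name.

```bash
# Hypothetical drift check, e.g. run on a schedule from CI.
# Assumes `argocd app diff` exits 0 when live state matches git, 1 when they differ,
# and some other non-zero code on errors.
check_drift() {
  local app=$1 rc=0
  # Capture the exit code without tripping `set -e` in the calling script.
  argocd app diff "$app" >/dev/null || rc=$?
  case $rc in
    0) echo "$app: in sync" ;;
    1) echo "$app: OUT OF SYNC (run 'argocd app diff $app' to review)"; return 1 ;;
    *) echo "$app: could not compute diff (exit code $rc)" >&2; return 2 ;;
  esac
}
```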

Diff View

With the application difference view, we can see what is represented in our code and what is actually there. So, how do we interpret this?

Well, the simplest and most common interpretation is that we have newer code and we want to sync it. This could be a change as small as a new Docker image version, or it could be the inclusion of whole new resources, like new Deployments, Services or Ingresses.

A more complicated type of diff is one where the environment itself appears to be ahead of our representation of it in code.⁵ A real-world example I've seen is an Operator that attaches labels or sidecars directly to Deployments, for instance for a service mesh or for observability purposes. This is one of the things that can pose a problem for the ArgoCD/GitOps model.

What to do when your code is out of date

When your code is behind what is actually running, you have to choose between two options.

Option 1: Disable the Operator and decide to make the changes through the code
Option 2: Keep the Operator running but align your code with the changes being inflicted/made by the Operator.

There is also a third option, which is to do nothing and deal with the war between ArgoCD and the Operator. I would not recommend this. If you do, it will basically look like this...

War of the Operators! ⚔️💥🪖

ArgoCD applies a change. App is "Synchronised".
Some time passes and the non-ArgoCD Operator changes resources. App is now "Out of Sync"
The user (or auto-sync) triggers a sync again. App is "Synchronised"
Some time passes and the non-ArgoCD Operator changes resources. App is now "Out of Sync"
Repeat indefinitely...

Of course, the fastest way to break the above cycle is to choose from option 1 or 2 :) ⁶
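For completeness, here is what the masking approach from footnote ⁶ can look like. It's a hypothetical sketch using a well-known example of the same tug-of-war: a HorizontalPodAutoscaler changing a Deployment's replica count. It assumes your Applications live in the `argocd` namespace; `ignore_replica_drift` and the app name are illustrative.

```bash
# Hypothetical: stop ArgoCD reporting drift on a field another controller owns.
# /spec/replicas is the classic case when a HorizontalPodAutoscaler scales a Deployment.
# Assumes Applications live in the `argocd` namespace (the default install).
ignore_replica_drift() {
  local app=$1
  # A JSON merge patch replaces the whole ignoreDifferences list, so include every entry you need.
  kubectl patch application "$app" -n argocd --type merge -p \
    '{"spec":{"ignoreDifferences":[{"group":"apps","kind":"Deployment","jsonPointers":["/spec/replicas"]}]}}'
}
```

In a GitOps setup you would normally commit the same `ignoreDifferences` entry to the Application manifest in git rather than patching the live object. Either way, it only hides the difference; it doesn't settle who owns the field.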

The Three ArgoCD Modes

Now that I've explained a bit about ArgoCD, other operators and their occasional battles, you might be wondering what the optimal way to get started is.

To that, I would say that at a high level there are really only three usage modes for ArgoCD. This is not a concept you'll find in any manual; it's something I'd like to coin based on the usage patterns I've seen in the industry.

Hands-on Mode
Hands-off Mode
Controlled Mode

1. Hands-on Mode

This is the simplest and most common mode of usage for ArgoCD. With hands-on mode, you commit a change to git, and then a user goes into ArgoCD and synchronises the change. This approach is nice and simple and gives a lot of control over when changes go out. Basically, after making the change in git (or being informed that someone else has made it), the user goes to each environment, presses the sync button and waits to make sure the sync was successful.
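Hands-on mode boils down to two things: no automated sync policy, and a human who reviews and syncs. Here is a hypothetical sketch of both with the argocd CLI (assumed installed and logged in); the function names are illustrative.

```bash
# Hypothetical hands-on helpers. Assumes the argocd CLI is installed and logged in.

# One-off: with no automated sync policy, git changes wait until a human syncs them.
make_hands_on() {
  argocd app set "$1" --sync-policy none
}

# Per environment: review what would change, then sync only on an explicit "y".
review_and_sync() {
  local app=$1 answer
  # `argocd app diff` exits non-zero when there is a diff, so don't treat that as a failure here.
  argocd app diff "$app" || true
  printf 'Sync %s now? [y/N] ' "$app"
  read -r answer
  if [ "$answer" = "y" ]; then
    argocd app sync "$app"
  else
    echo "Skipped $app"
  fi
}
```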

The user can manage the promotion through the various environments. This approach is very useful especially in controlled or heavily regulated industries where a change has to be manually reviewed, approved, applied and of course, tracked for all non-production and production environments.

This is the mode I would strongly recommend you adopt when you are getting started. Yes, it's manual, but it keeps you in control, which is important at that stage.

So what are the downsides of this approach? Well, perhaps one main one: if you are managing changes to your application (i.e. the Docker image) this way, you will need to keep track of the version you want to deploy and change it manually in your manifest before you sync. Either that, or you use the latest tag, which is OK for getting started but is not really production-grade in certain contexts⁷.
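Keeping track of the version usually means editing the image tag in git before you sync. A minimal sketch, assuming plain YAML manifests with an `image: <repository>:<tag>` line (optionally written as `- image:`); `set_image_tag`, the file path and the repository name are hypothetical.

```bash
# Hypothetical helper: pin the image tag in a plain YAML manifest before syncing.
# Only lines of the form `image: <repo>:<anything>` for the given repo are rewritten.
set_image_tag() {
  local file=$1 repo=$2 tag=$3 tmp
  tmp=$(mktemp) || return 1
  # Keep the indentation and optional "- " prefix (group 1), replace everything after the repo.
  sed "s|^\([[:space:]]*\(- \)\{0,1\}image:[[:space:]]*\)$repo:.*\$|\1$repo:$tag|" "$file" > "$tmp" \
    && mv "$tmp" "$file"
}

# Example (hypothetical paths and names):
#   set_image_tag deploy/web/deployment.yaml ghcr.io/acme/web 1.4.2
#   git commit -am "Deploy web 1.4.2" && git push   # then press SYNC in ArgoCD
```

If your manifests use Kustomize, `kustomize edit set image` does the same job without a regular expression.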

Don't fall for fool's mode

Before we move on to the other modes, I just wanted to call out a variant of the hands-on mode that you should completely avoid. To help us avoid this, I am going to call it "Fool's mode".

Fool's mode is when you get too excited about git. It's when an engineer goes, "Hey, why don't we just have a branch per environment? It seems like a great idea. That way, we can manage changes to environments via pull requests in GitHub. Let's go!"

There was probably a moment when I also thought this was a good idea. Or, at least, thought I would entertain doing it.

So, what's so bad about fool's mode? Well, it assumes two dangerous things:

Dangerous assumption 1: That every change to every environment needs a pull request. This is dangerous because it is a way to guarantee inefficiency. It's OK if you want every code change to have a pull request, but having a PR for every environment deploy mixes code change with code distribution, and they aren't the same thing.
Dangerous assumption 2: That promoting changes through branches will be easy for all team members. It won't. Unless you like dealing with merge pain and trouble, you just don't want to entertain this. Push-button promotion is not a pipe dream, but introducing a branch per environment will guarantee that it becomes one.

If you see this happening in your organisation, please help your friend out. Take them aside and tell them there are more important things in life than overcomplicating things for everyone.

2. Hands-off Mode

Hands-off mode is really a variant of hands-on mode. It's just that the sync step doesn't involve a human, which puts it at the complete other end of the spectrum. You commit a code change, and ArgoCD picks it up and goes ahead and applies it. This is a pretty good mode for a lean startup. It will generally mean that your latest code is always shipped. It prioritises efficiency over control. But there is even a variant of this that gives pretty good control.
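Switching an app to hands-off mode is a one-line policy change. A sketch assuming the argocd CLI is logged in; `make_hands_off` is a hypothetical wrapper.

```bash
# Hypothetical hands-off setup. Assumes the argocd CLI is installed and logged in.
make_hands_off() {
  # automated: sync whenever the tracked git revision and the cluster differ, no human needed.
  # --self-heal: also revert live changes made outside git (manual edits, other operators).
  # --auto-prune: delete cluster resources that were removed from git.
  argocd app set "$1" --sync-policy automated --self-heal --auto-prune
}
```

Be aware that `--self-heal` is exactly what turns the War of the Operators into a tight, automatic loop if another operator keeps changing fields that ArgoCD manages.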

A/B, Blue/Green, Canary you name it

Starting in hands-off mode is a surefire way to get your code out ASAP. Of course, it's also a surefire way to break production very quickly, especially if you (or associated developers) don't have good coding practices. This is where a more mature variant of hands-off mode can be very helpful. You deploy your change to a subset of the cluster, perhaps serving only some users or only internal users, and then you run automated tests or at least check that health is still good (e.g. no increase in 4xx or 5xx errors). When the coast is clear, the change is automatically promoted to all users. So this mode is still completely hands-off, but it has some extra safety controls. I wouldn't start with this approach, but it is certainly where hands-off mode can evolve.
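The "no increase in errors" check at the heart of that variant can be sketched as a small gate. This is hypothetical: `canary_is_healthy` is an illustrative name, the request and 5xx counts would come from your metrics system (not shown), and comparing the canary's error rate with the baseline's is only one possible promotion rule.

```bash
# Hypothetical canary gate: promote only if the canary's error rate is no worse than the baseline's.
# Cross-multiplying avoids floating point: canary_errors/canary_total <= base_errors/base_total.
canary_is_healthy() {
  local base_errors=$1 base_total=$2 canary_errors=$3 canary_total=$4
  if [ "$canary_total" -eq 0 ] || [ "$base_total" -eq 0 ]; then
    echo "Not enough traffic to decide" >&2
    return 2
  fi
  [ $((canary_errors * base_total)) -le $((base_errors * canary_total)) ]
}
```

A real gate would usually allow a small tolerance and require a minimum amount of canary traffic before deciding.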

3. Controlled Mode

The final mode, which is possibly my personal favourite, is controlled mode. This is essentially a hybrid of hands-on and hands-off. Whilst hands-on and hands-off modes don't require a separate CI/CD pipeline, controlled mode does. In controlled mode, you model a CI/CD pipeline using your tool of choice (GitHub Actions, Jenkins, Buildkite, whatever) and then rely on that tool's manual approval gate feature to gate the deployment.
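In controlled mode the pipeline, not ArgoCD, decides when each environment syncs. Here is a hypothetical sketch of the step each pipeline stage could run, assuming one ArgoCD app per environment named `<service>-<env>`, automated sync switched off for those apps, and the argocd CLI logged in. The approval gate itself stays in your CI tool.

```bash
# Hypothetical controlled-mode step. Assumes one ArgoCD app per environment named
# <service>-<env>, with automated sync turned off so only the pipeline syncs it.
deploy_env() {
  local service=$1 env=$2 app
  app="$service-$env"
  echo "Deploying $app"
  argocd app sync "$app" || return 1
  # Block until healthy so the stage goes red on failure instead of green on "applied".
  argocd app wait "$app" --health --timeout 300
}

# Pipeline stages; the approval gate lives in the CI tool, not in this script:
#   deploy_env web dev       # runs on every merge
#   deploy_env web staging   # runs after dev succeeds
#   deploy_env web prod      # runs after a manual approval
```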

The benefit of this approach is that you keep the process as close to your code as possible. For a given commit, you can see whether it is still being applied or is already in production. You can drill down and manually promote it to downstream environments if it is waiting for approval. You can easily turn manual approvals into automated ones as you become comfortable, and you can introduce whatever progressive deployment strategies (canary, blue/green, etc.) you see fit.

Conclusion

ArgoCD is a powerful tool that brings a new level of control and reliability to Kubernetes deployments through its GitOps approach. Unlike traditional CI/CD tools, ArgoCD ensures that your deployments are not only applied but also continuously monitored and synchronized with your desired state. This eliminates many of the common pitfalls associated with manual deployments and ad-hoc changes, providing a more robust and developer-friendly experience.

By leveraging ArgoCD, you can choose from various operational modes—hands-on, hands-off, and controlled—each offering different levels of automation and control to suit your organizational needs. Whether you are just getting started or looking to optimize your existing deployment processes, ArgoCD offers the flexibility and reliability to help you achieve your goals.

In summary, ArgoCD is not just another CI/CD tool; it is a specialized solution designed to handle the complexities of Kubernetes deployments, ensuring that your applications are always in sync with your code. Embracing ArgoCD can lead to more efficient, reliable, and manageable deployment workflows, ultimately enhancing your overall DevOps practices.

Footnotes

1. This assumes you are using an additional tool like Jenkins or GitHub Actions for the CI/CD pipeline itself, rather than ArgoCD directly, as this is the most common usage pattern.
2. You can group multiple apps into one app as your application grows in complexity. This is called the app-of-apps pattern.
3. Bringing manually created things under control is difficult with Terraform. If someone creates a resource manually in a web console and another user then wants to manage it via automation, the apply will often fail with an error saying the resource already exists, unless it is imported first. This is because Terraform relies heavily on its own state.
4. This can be caused either by a manual change or, with Kubernetes, by something you were managing starting to be managed by an automated "Operator".
5. Thought leaders in this space sometimes refer to the code representation and the actual state as "Digital Twins". The idea is that they should be in sync most of the time, but there are times when they will be out of sync for good reason.
6. There is also a third way to deal with the ArgoCD operator wrestling against another operator: you can use ignoreDifferences to mask the synchronisation issue, although this does not really fix the root cause.
7. The latest tag can be applied automatically to the most recently built image. It is a way to avoid managing specific versions and can work OK, but it comes with some challenges. I won't go into these, as there are plenty of existing resources online that discuss them.