"DevOps" ramblings and observations

Image credit: https://www.pexels.com/photo/black-and-gray-laptop-computer-546819/

The “DevOps” movement started “publicly” (the first time we heard the term) circa 2007. Based on that, we (you?) might think that in 2021 the whole thing is utterly outdated. I totally disagree, so I thought I would share some thoughts about DevOps in 2021 and what it means in real life.

I am quite uncomfortable using the word “DevOps”. I’m serious. I guess it’s because we put so many things behind it that I feel we might have lost track of the real meaning of the term.

This blog post came about after I was invited to speak about CI/CD on a French podcast, “If This Then Dev”: https://ifttd.io/42-integration-continuite-derive-paul-amar/ (notice the episode number: did I give the “Accurate Answer to Life”?)

In this post, I will share some observations as well as conversations I’ve had with friends from different companies. Therefore, any resemblance to specific events, locales or persons is (almost) entirely coincidental.

Context is key

If there’s one golden rule, and one way to summarize this article, it is: context is key. What I have observed or succeeded with might not work for you, because of your context. Treat this post as food for thought, not as rules that you have to apply.

As an analogy, I remember those companies that wanted to replicate the “Spotify model” for Agility identically. Guess what? Most of them mimicked it without getting it to work, and ultimately failed.

I recommend this article: Spotify’s Failed #SquadGoals.

Speaking of Agility, I think that, again, it pretty much depends on your team and your company. Claiming that you have to do Scrum because every other company has implemented it and it worked for them is a dead end. For about two years, we adopted a “trial and error” approach, iterating (through Scrum, …) until we found a method (some kind of Kanban) that fitted the whole team. Again, what fitted our team will not necessarily work for others, even in the same company.

This leads to extremely hard trade-offs (speaking of which, I think life is all about trade-offs in some way):

  • Should we do something utterly specific to us and not care about the other teams? (this might lead to a total disconnection from the others)
  • Should we stick with procedures which do not fit the team and are impacting our production?

My opinion here is to strike a balance. Balance everything. Don’t go too specific, but do what’s best for your team to move forward. Moreover, move at your company’s pace.

I will finish this section with a personal anecdote: going from a fully technical role to a team lead position “got me away” from code. At first, I thought I was literally useless to the organisation, since I was no longer developing new capabilities myself, using my own skills. I was right on that point, but pretty wrong on others. If you follow a similar path (and eventually feel the same way), I encourage you to read this great article: Maker’s Schedule, Manager’s Schedule.

A team composed of Devs. And Ops.

Again, if we take the term DevOps literally, I (and again, it’s personal) read it as:

  • Dev for Developers
  • and Ops for operations

Those two families of skills are essential in today’s IS/IT teams.

This section is a tale about a fully-featured team. Again, I will use an analogy, this time with soccer. To have a good soccer team, you need multidisciplinary people who are good in their positions (a goalkeeper, sweepers, strikers) and who have complementary skills.

The mistake here would be to hire only strikers and think that you have the best team. By doing so, you will be able to act rapidly, move fast and create proofs of concept in no time, and that’s fine, but is that the overall goal for your team and, more globally, for your company? This is pretty much up to your judgement and your manager’s.

Back to team organisation: when we speak about “DevOps”, my opinion is that you need people with Ops skills, people with Dev skills, and the ability to make them work together on specific facets.

Back in the day, developers developed on their own machines, made sure things worked locally and sent their packages (sometimes by e-mail or other channels) to the people responsible for deploying them. I guess you see where this is going: ultimately, this fails.

I understand that application deployments were a big deal a couple of years back and needed niche skills, but the operations world has evolved and we finally got “developer-friendly” tools to deploy apps, such as Jenkins, GitLab CI, … These are now part of the developer ecosystem, and I can’t stress that point enough.

Without intensive collaboration between those “two teams” (shouldn’t they be the same team after all?), you might end up with silos within your organisation and nonsensical remarks like:

  • “But it works on my machine! It should work in production!”
  • “Did you add new libraries in your project? No? Yes? Okay, which ones?”

which leads me to this specific point:

You build it, you run it!

The motto is strong but utterly important. Again, this is a personal view about that matter but I think that devs should deploy their own applications to production and be responsible for the run activities.

By not doing so, developers will never get the sense of impact if an outage occurs on their application - “Alright, it crashed. So what?”

I am truly an over-optimistic person, but I have seen Murphy’s law in action so many times that I don’t take a big risk by saying that what you bring up today will (ultimately) go down tomorrow.

Devs need to understand the outcomes of their actions. If they introduce instability (bugs, performance outages…) into the code base (which will happen) and it has an impact, they need to face it. I don’t see this as a bad thing: it just brings more responsibility to their actions and their thought process.

My advice here is: (if applicable) make your devs deploy their applications. This will totally bring a new sense to their work.

Automation

Automation is key for today’s Information Systems development: we need solid foundations to move forward and on which we can build bigger systems. Automation is part of the answer, thanks to its determinism and because it avoids manual actions on servers, which will eventually go wrong.

Based on the pinned tweet, don’t get me wrong! I am a true believer in automation, but I think it’s also important to acknowledge and mitigate its demons. After some time, some people lose sight of the (now hidden) complexity, which often (if not dealt with) ends up as a “magic click thingy” on Jenkins, GitLab, GitHub Actions, …: “Click there” and it will deploy your application.

If you let your CI/CD pipelines rot, untouched for months (if not years), it will ultimately end like this:

  • “How does it deploy your application?” “- No idea.”
  • “What does it rely on?” “- No idea.”
  • “Does it use an outdated library?” “- No idea.”

Here comes the nightmare.
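One cheap way to notice that rot before the “No idea” answers show up is to keep track of when each pipeline definition was last reviewed. Below is a minimal sketch, not a real tool: the inventory, the pipeline names and the 180-day threshold are all assumptions made up for the illustration.

```python
from datetime import date, timedelta

# Hypothetical inventory: pipeline name -> date its definition was last reviewed.
PIPELINES = {
    "billing-api": date(2021, 1, 10),
    "legacy-batch": date(2018, 6, 2),
}


def stale_pipelines(last_reviewed, today, max_age_days=180):
    """Return the names of pipelines not reviewed within max_age_days, oldest first."""
    limit = today - timedelta(days=max_age_days)
    stale = [name for name, reviewed in last_reviewed.items() if reviewed < limit]
    return sorted(stale, key=lambda name: last_reviewed[name])


if __name__ == "__main__":
    for name in stale_pipelines(PIPELINES, today=date(2021, 3, 1)):
        print(f"{name}: does anyone still know how this deploys?")
```

In practice the dates could come from your version control history rather than a hand-written dictionary. The point is only that someone gets reminded to look.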

Speaking about “automation” opens up wide areas of improvement, such as:

  • Data referential: where is the valuable information I need stored? E.g. where do we store VLAN information? How is it accessible?
  • Standardization, to avoid specificities which, in the long term, cost a lot of money
  • Naming convention, to make sure that all your assets can be identified in one consistent and unique way
  • … and the list goes on
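To make the naming convention point concrete, here is a small sketch of what enforcing one can look like. The convention itself (`<env>-<site>-<role>-<index>`, with three allowed environments) is purely hypothetical. Yours will differ, and that is fine as long as it can be checked by a machine.

```python
import re

# Hypothetical convention: <env>-<site>-<role>-<two-digit index>, e.g. "prd-par-web-01".
ASSET_NAME = re.compile(
    r"(?P<env>dev|stg|prd)-(?P<site>[a-z]{3})-(?P<role>[a-z]+)-(?P<index>\d{2})"
)


def parse_asset_name(name):
    """Return the components of the name as a dict, or None if it breaks the convention."""
    match = ASSET_NAME.fullmatch(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for candidate in ("prd-par-web-01", "Bob-test-server"):
        print(candidate, "->", parse_asset_name(candidate))
```

Once names can be parsed like this, they can feed your data referential and your automation, instead of living only in people’s heads.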

As soon as you scratch the surface, you will also need to consider whether your services are available through APIs and can be wrapped in modules for complex workflows. This is why I can’t stress enough the need for APIs (again, how do I make them accessible within my organisation? What naming convention should I use?): they keep you from being gated by other teams, because you can collaborate with them by consuming their services directly. This is one simple sentence, but it speaks a thousand words.

Last but not least, don’t rely on a single technology for your organisation. Behind that specific technology X or Y there is a person or an organisation, and you need to understand the risks of using it widely within your organisation. Here is an example with Jeff Geerling, a former Red Hat employee who worked on Ansible.

As of 2020/2021, Ansible and Terraform are the go-to tools for Infrastructure as Code and some automation tasks. Bear in mind that the tech landscape around automation is boiling and will change in the future. That’s why, when bringing tech X into your org, you should always ask yourself: how will we maintain it? Will we still have the required skills in 5 years’ time? 10 years’ time? Interesting question, right?

Know your people

A friend told me this story and it frightened me. They had an utterly important release coming at the end of December, and the project manager didn’t know anything about their DevOps people: neither their names nor where they were located.

DevOps people are not magic wands that you can use to unblock a critical project. We’re talking here about people, human beings with skills that are mandatory in today’s world.

If you plan to get successful results, invest in the people you are collaborating with. I mean, really.

This brings me to:

Aggressive Roadmap, tight schedule but most of all: amazing friendship

Context keeps moving and we need to adapt to it really fast. We often come up with roadmaps which evolve over time, sometimes within a matter of hours. Some releases will eventually fail, and stress will hit on the worst day. You might have budgetary constraints because of X or Y and need to keep to an extremely tight schedule. For those reasons (and there are plenty more), I think that building trust within your team is the most important thing.

I realised that being a team where friendship was one of the core values allowed us to move mountains.

Speaking of which, soft skills are as important as technical ones. Every time I interviewed someone for a role in the team, I asked myself: “Will it make the boat go faster?”, aka “Will that person fit in the team?”. You might hire a tech guru into the team, but if that person is toxic, the whole team’s output will decrease. What about hiring someone who is kind to their co-workers and bringing them up to speed on technical subjects instead?

Again, it is a matter of balance, but honestly, don’t underestimate people’s toxicity and the negative impact it can have on a single person as well as on the entire team.

I consider myself a social person and I need those bonds with my colleagues. Without them, it’s… problematic.

Security by default

I couldn’t write an article about “DevOps” without speaking about “DevSecOps”. Just as the operations world evolved and started offering “developer-friendly” tools to deploy apps, I think that security has to evolve in the same way (as of 2021, I think that this is still only the beginning).

I might be biased, but I covered some of this topic at the DEFCON 28 AppSec Village, in a talk called “Our journey into turning offsec mindset to developers’ toolset”, and I still mean every word of that presentation. I will go back over each lesson learnt from the talk:

  1. Even though the security community has “boomed” (and still is), I think that security has to understand its internal users. Devs don’t necessarily get security things, and the same goes the other way around. As security professionals, do you understand the major issues your teams are facing? Have you tried helping them?

  2. “SSTI, XSS, XXE, SQLi, …” but also “CVE-2014-6271, CVE-2017-0147, …” but also “MS08-067, MS17-010, …” but also “Heartbleed, EternalBlue, BlueBorne, …” We (sec folks) are already losing track of all that. How can devs keep up with that? But more importantly, should devs keep up? Maybe not. And actually, they should not.

I recommend Alex Ionescu’s keynote at SSTIC in 2019.

  3. Security vocabulary is one thing. Remediation is another. Bringing assistance without expecting a security background is key.

Again, bringing actionable processes will help people understand where you want to lead them.

  4. Contextualisation is (also) key.

Should we consider the same criticality for a hit on either:

  • An internet exposed host
  • An air-gapped host

Answering “Yes” to that question might send too many red flags to devs. Trade-offs are now part of the game.
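As an illustration of that contextualisation, here is a minimal sketch that adjusts the severity reported by a scanner using what you know about the host. The severity scale, the `Finding` fields and the “drop one level when not internet exposed” rule are assumptions made up for this example, not a recommendation. Your own trade-offs belong here.

```python
from dataclasses import dataclass

# Assumed scale, from least to most severe.
SEVERITIES = ["low", "medium", "high", "critical"]


@dataclass
class Finding:
    title: str
    base_severity: str      # severity as reported by the scanner
    internet_exposed: bool  # context coming from your own asset inventory


def contextual_severity(finding):
    """Lower the base severity by one level when the host is not internet exposed."""
    level = SEVERITIES.index(finding.base_severity)
    if not finding.internet_exposed:
        level = max(level - 1, 0)
    return SEVERITIES[level]


if __name__ == "__main__":
    findings = [
        Finding("Outdated TLS library", "critical", internet_exposed=True),
        Finding("Outdated TLS library", "critical", internet_exposed=False),
    ]
    for finding in findings:
        print(finding.title, finding.internet_exposed, "->", contextual_severity(finding))
```

The same hit ends up with two different priorities, which is exactly the kind of signal devs can act on without drowning in red flags.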

  5. Collaboration will empower your organisation

Find the best tools to do that (they might be tailor-made). Make sure the learning curve/ramp-up is not too steep. Do your devs use YAML and like it? Go for it. Do they stick with XML and like it? Go for it. Drive your decisions with discernment and clear thinking. Change posture from “No, you can’t” to “Yes, let’s see how to make it

On a side note, check out the movie Yes Man. If you find those items interesting (shameless promotion), feel free to check out our DEFCON talk.

Reactive vs Proactive

Reacting to incidents used to be the standard way to fix them, and this has evolved over the last couple of months/years. To meet today’s needs, such as Quality of Service (QoS) and the overall stability of solutions, moving from the reactive world (I have an incident and I need to fix it) to a more proactive one is a necessity. Most websites/platforms from big players are up most of the time, even during extensive peaks of activity, e.g. Amazon on Black Friday. This was not the case a couple of years back, and it is driving the overall industry by making always-available platforms the new norm.

We need to understand that avoiding “disasters” is the way to go, instead of just stopping them. By just stopping the side effects, we’re not dealing with the root cause, where the real challenges are. Make sure to have a Problem Management process in place within your organisation and be ready to tackle the real issues. Don’t hesitate to deep dive into those production issues to bring the real problems to light and avoid bias.

I also recommend some posts on mental models and the biases that might come up while solving problems.

Observability

In order to move from a reactive space to a proactive one (notice that I didn’t use the expression “move away” because I think that both are needed), we need to use appropriate tooling but also a new set of skills (there are even virtual schools which are totally free: https://linkedin.github.io/school-of-sre/ ).

This is why we see new job titles flourish, such as SRE (Site Reliability Engineering), but also new tech for real-time monitoring coupled with alerting, such as Prometheus and Grafana, as well as other tech like eBPF, which is getting me soooo excited!

In case you want to know more about eBPF, I recommend the excellent keynote from Brendan Gregg, who has worked tirelessly on this topic.

Again, what we are seeing here is that this space is boiling right now, and the operability tech landscape is slowly getting into developers’ hands. For example, we now have proper Prometheus bindings directly from Spring Boot, and this is just the beginning. This will also bring way more interaction between traditional support teams and project teams.
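To give an idea of what “getting into developers’ hands” means, here is a tiny sketch of what an application exposes for Prometheus to scrape: plain text, one line per sample. It is a hand-rolled illustration using only the standard library, and the metric and label names are made up. In real life you would rely on a client library (Micrometer on the Spring Boot side, or `prometheus_client` in Python), which also handles label escaping and the other metric types.

```python
def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to the current counter value.
    Label values are not escaped here, which a real client library would do.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        suffix = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{suffix} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric: HTTP requests served, split by method and status code.
    requests_total = {
        (("method", "get"), ("code", "200")): 1027,
        (("method", "post"), ("code", "500")): 3,
    }
    print(render_counter("http_requests_total", "Total HTTP requests.", requests_total), end="")
```

Serve that text on an HTTP endpoint, point Prometheus at it, and devs can start graphing and alerting on their own application in Grafana.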

Ending

I hope you had as much fun reading this post as I had writing it. It’s quite different from the other posts I’ve written on here, but I think it was necessary and I am happy I wrote about all those topics I am passionate about.

If you have suggestions or want to chat about any of this, feel free to ping me on Twitter.

Anyhow, have a good day and stay safe you all!