Discussion:
A meditation on the Antithesis of the VMS Ethos
Subcommandante XDelta
2024-07-21 09:41:06 UTC
A meditation on the Antithesis of the VMS Ethos, and the DEC way.

A freshly minted neologism: "CloudStrucked" (six ways to Sunday, and
then some)

https://www.wheresyoured.at/crowdstruck-2/


Ed Zitron's Where's Your Ed At

CrowdStruck

Edward Zitron

Jul 19, 2024

Soundtrack: EL-P - Tasmanian Pain Coaster (feat. Omar Rodriguez-Lopez
& Cedric Bixler-Zavala)

When I first began writing this newsletter, I didn't really have a
goal, or a "theme," or anything that could neatly characterize what I
was going to write about other than that I was on the computer and
that I was typing words.

As it grew, I wrote the Rot Economy, and the Shareholder Supremacy,
and many other pieces that speak to a larger problem in the tech
industry — a complete misalignment in the incentives of most of the
major tech companies, which have become less about building new
technologies and selling them to people and more about capturing
monopolies and gearing organizations to extract things through them.

Every problem you see is a result of a tech industry — from the people
funding the earliest startups to the trillion-dollar juggernauts that
dominate our lives — that is no longer focused on the creation of
technology with a purpose, or on organizations driven toward one.
Everything is about expressing growth, about showing how you will
dominate an industry rather than serve it, about providing metrics
that speak to the paradoxical notion that you'll grow forever without
any consideration of how you'll live forever. Legacies are now
subordinate to monopolies, current customers are subordinate to new
customers, and "products" are considered a means to introduce a
customer to a form of parasite designed to punish the user for even
considering moving to a competitor.

What's happened today with Crowdstrike is completely unprecedented
(and I'll get to why shortly), and on the scale of the much-feared Y2K
bug that threatened to ground the entirety of the world's
computer-based infrastructure once the Year 2000 began.

You'll note that I didn't write "over-hyped" or anything dismissive of
Y2K's scale, because Y2K was a huge, society-threatening calamity
waiting to happen, and said calamity was averted through a remarkable,
$500 billion industrial effort that took a decade to manifest, because
such a significant single point of failure would likely have crippled
governments, banks and airlines.

People laughed when nothing happened on January 1, 2000, assuming that
all that money and time had been wasted, rather than being grateful
that an infrastructural weakness was taken seriously, that a single
point of failure was identified, and that a crisis was averted by
investing in stopping bad stuff happening before it does.

As we speak, millions — or even hundreds of millions — of different
Windows-based computers are now stuck in a doom-loop, repeatedly
showing users the famed "Blue Screen of Death" thanks to a single
point of failure in a company called Crowdstrike, the developer of a
globally-adopted cyber-security product designed, ironically, to
prevent the kinds of disruption that we’ve witnessed today. And for
reasons we’ll get to shortly, this nightmare is going to drag on for
several days (if not weeks) to come.

The product — called Crowdstrike Falcon Sensor — is an EDR system
(which stands for Endpoint Detection and Response). If you aren’t a
security professional and your eyes have glazed over, I’ll keep this
brief. An EDR system is designed to identify hacking attempts,
remediate them, and prevent them. They’re big, sophisticated, and
complicated products, and they do a lot of things that are hard to build
with the standard tools available to Windows developers.

And so, to make Falcon Sensor work, Crowdstrike had to build its own
kernel driver. Now, kernel drivers operate at the lowest level of the
computer. They have the highest possible permissions, but they operate
with the fewest guardrails. If you’ve ever built your own
computer — or you remember what computers were like in the dark days
of Windows 98 — you know that a single faulty kernel driver can wreak
havoc on the stability of your system.

The problem here is that Crowdstrike pushed out an evidently broken
kernel driver that locked whatever system installed it into a
permanent boot loop. The system would start loading Windows, encounter
a fatal error, and reboot. And reboot. Again and again. It, in
essence, rendered those machines useless.

It's convenient to blame Crowdstrike here, and perhaps that's fair.
This should not have happened. On a basic level, whenever you write
(or update) a kernel driver, you need to know it’s actually robust and
won’t shit the bed immediately. Regrettably, Crowdstrike seemingly
borrowed Boeing’s approach to quality control, except instead of
building planes where the doors fly off at the most inopportune times
(specifically, when you’re cruising at 35,000ft), it released a piece
of software that blew up the transportation and banking sectors, to
name just a few.

It created a global IT outage that has grounded flights and broken
banking services. It took down the BBC’s flagship kids TV channel,
infuriating parents across the British Isles, as well as Sky News,
which, when it was able to resume live broadcasts, was forced to do so
without graphics. In essence, it was forced back to the 1950s — giving
it an aesthetic that matches the politics of its owner, Rupert
Murdoch. By no means is this an exhaustive list of those affected,
either.

The scale and disruption caused by this incident is unlike anything
we’ve ever seen before. Previous incidents — particularly ransomware
outbreaks like WannaCry — simply can’t compare to this,
especially when we’re looking at the disruption and the sheer scale of
the problem.

Still, if your day was ruined by this outage, at least spare a thought
for those who’ll have to actually fix it. Because those machines
affected are now locked in a perpetual boot loop, it’s not like
Crowdstrike can release a software patch and call it a day. Undoing
this update requires someone to go to each affected computer
individually, load up safe mode (a limited version of Windows with
most non-essential software and drivers disabled), and manually remove
the faulty code. And if you’ve encrypted your computer, that process
gets a lot harder. Servers running on cloud services like Amazon Web
Services and Microsoft Azure — you know, the way most of the
internet's infrastructure works — require an entirely separate series
of actions.
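The manual fix that circulated publicly amounted to booting each machine into safe mode and deleting the faulty update files by hand. A minimal sketch of that cleanup step, assuming the widely reported install path and file pattern (C-00000291*.sys); verify against the vendor's own guidance before deleting anything:

```python
# Sketch of the widely reported manual workaround, run from Safe Mode.
# The directory and file pattern below are as reported in public
# advisories, not taken from vendor documentation; treat as assumptions.
import glob
import os

DRIVER_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
BAD_PATTERN = "C-00000291*.sys"  # the faulty channel-file series

def remove_faulty_channel_files(directory: str = DRIVER_DIR) -> list[str]:
    """Delete files matching the faulty pattern; return what was removed."""
    removed = []
    for path in glob.glob(os.path.join(directory, BAD_PATTERN)):
        os.remove(path)
        removed.append(path)
    return removed

if __name__ == "__main__":
    for path in remove_faulty_channel_files():
        print("removed:", path)
```

The point of the sketch is the shape of the problem: a trivially scriptable fix that nonetheless has to be performed on each machine, because the machine can't stay up long enough to run anything remotely.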

If you’re on a small IT team and you’re supporting hundreds of
workstations across several far-flung locations — which isn’t unusual,
especially in sectors like retail and social care — you’re especially
fucked. Say goodbye to your weekend. Your evenings. Say goodbye to
your spouse and kids. You won’t be seeing them for a while. Your life
will be driving from site to site, applying the fix and moving on.
Forget about sleeping in your own bed, or eating a meal that wasn’t
bought from a fast food restaurant. Good luck, godspeed, and God
bless. I do not envy you.

The significance of this failure — which isn't a breach, by the way,
and in many respects is far worse, at least in the disruption caused —
is not in its damage to individual users, but in its damage to the
technical infrastructure that runs on Windows, and in the fact that so
much of our global infrastructure relies on automated enterprise
software that, when it goes wrong, breaks everything.

It isn't about the number of computers, but the number of them that
underpin things like the security checkpoints or systems that run
airlines, or banks, or hospitals, all running as much automated
software as possible so that costs can be kept down.

The problem here is systemic: a company that the majority of people
affected by this outage had no idea existed until today was trusted by
Microsoft to the extent that it was able to push an update that broke
the back of a huge chunk of the world's digital infrastructure.

Microsoft, as a company, never built the kind of rigorous security
protocols that would, say, rigorously test something that connects to
what seems to be a huge proportion of Windows computers. The company
really screwed up here. As pointed out by
Wired, the company vets and cryptographically signs all kernel drivers
— which is sensible and good, because kernel drivers have an
incredible amount of access, and thus can be used to inflict serious
harm — with this testing process usually taking several weeks.

How then did this slip through its fingers? For this to have happened,
two companies needed to screw up epically. And boy, they did.

What we're seeing today isn't just a major fuckup, but the first of
what will be many systematic failures — some small, some potentially
larger — that are the natural byproduct of the growth-at-all-costs
ecosystem where any attempt to save money by outsourcing major systems
is one that simply must be taken to please the shareholder.

The problem with the digitization of society — or, more specifically,
the automation of once-manual tasks — is that it introduces a single
point of failure. Or, rather, multiple single points of failure. Our
world, our lifestyle and our economy, is dependent on automation and
computerization, with these systems, in turn, dependent on other
systems to work. And if one of those systems breaks, the effects
ricochet outwards, like ripples when you cast a rock into a lake.

Today’s Crowdstrike cock-up is just the latest example of this, but it
isn’t the only one. Remember the SolarWinds hack in 2020, when Russian
state-linked hackers gained access to an estimated 18,000 companies
and public sector organizations — including NATO, the European
Parliament, the US Treasury Department, and the UK’s National Health
Service — by compromising just one service — SolarWinds Orion?

Remember when Okta — a company that makes software that handles
authentication for a bunch of websites, governments, and businesses —
got hacked in 2023, and then lied about the scale of the breach? And
then do you remember how those hackers leapfrogged from Okta to a
bunch of other companies, most notably Cloudflare, which provides CDN
and DDOS protection services for pretty much the entire internet?

That whole John Donne quote — “No man is an island” — is especially
true when we’re talking about tech, because when you scratch beneath
the surface, every system that looks like it’s independent is actually
heavily, heavily dependent on services and software provided by a very
small number of companies, many of whom are not particularly good.
This is as much a cultural failing as it is a technological one, the
result of management geared toward value extraction — building systems
that build monopolies by attaching themselves to other monopolies.
Crowdstrike went public in 2019, and its stock immediately popped on
its first day of trading thanks to Wall Street's appreciation of
Crowdstrike moving away from a focused approach to serving large
enterprise clients and toward building products for small and
medium-sized businesses sold through channel partners — in effect
outsourcing both product sales and the client relationship that would
tailor a solution to a particular business need.

Crowdstrike's culture also appears to fucking suck. A recent Glassdoor
entry referred to Crowdstrike as "great tech [with] terrible culture"
with no work life balance, with "leadership that does not care about
employee well being." Another from June claimed that Crowdstrike was
"changing culture for the street,” with KPIs (as in metrics related to
your “success” at the company) “driving behavior more than building
relationships” with a serious lack of experience in the public sector
in senior management. Others complain of micromanagement, with one
claiming that “management is the biggest issue,” with managers
“ask[ing] way too much of you…and it doesn’t matter if you do what
they ask since they’re not even around to check on you,” and another
saying that “management are arrogant” and need to “stop lying to the
market on product capability.”

While I can’t say for sure, I’d imagine an organization with such
powerful signs of growth-at-all-costs thinking — a place where you
“have to get used to the pressure” that’s a “clique that you’re not
in” — likely isn’t giving its quality assurance teams the time and
space to make sure that there aren’t any Kaiju-level security threats
baked into an update. And that assumes it actually has a significant
QA team in-house, and hasn’t just (as with many companies) outsourced
the work to a “bodyshop” like Wipro or Infosys or Tata.

And don’t think I’m letting Microsoft off the hook, either. Assuming
the kernel driver testing roles are still being done in-house, do you
think that these testers — who have likely seen their friends laid off
at a time when the company was highly profitable, and denied raises
when their well-fed CEO took home hundreds of millions of dollars for
doing a job he’s eminently bad at — are motivated to do their best
work?

And this is the culture that’s poisoned almost the entirety of Silicon
Valley. What we’re seeing is the societal cost of moving fast and
breaking things, of Marc Andreessen considering “risk management the
enemy,” of hiring and firing tens of thousands of people to please
Wall Street, of seeking as many possible ways to make as much money as
possible to show shareholders that you’ll grow, even if doing so means
growing at a pace that makes it impossible to sustain organizational
and cultural stability. When you aren’t intentional in the people you
hire, the people you fire, the things you build and the way that
they’re deployed, you’re going to lose the people that understand the
problems they’re solving, and thus lack the organizational ability to
understand the ways that they might be solved in the future.

This is dangerous, and also a dark warning for the future. Do you
think that Facebook, or Microsoft, or Google — all of whom have laid
off over 10,000 people in the last year — have done so in a
conscientious way that means that the people left understand how their
systems run and their inherent issues? Do you think that the
management-types obsessed with the unsustainable AI boom are investing
heavily in making sure their organizations are rigorously protected
against, say, one bad line of code? Do they even know who wrote the
code of their current systems? Is that person still there? If not, is
that person at least contracted to make sure that something nuanced
about the system in question isn’t mistakenly removed?

They’re not. They’re not there anymore. Only a few months ago Google
laid off 200 employees from the core of its organization, outsourcing
their roles to Mexico and India in a cost-cutting measure the quarter
after the company made over $23 billion in profit. Silicon Valley —
and big tech writ large — is not built to protect against situations
like the one we’re seeing today, because their culture is cancerous.
It values growth at all costs, with no respect for the human capital
that empowers organizations or the value of building rigorous,
quality-focused products.

This is just the beginning. Big tech is in the throes of perdition,
teetering over the edge of the abyss, finally paying the harsh cost of
building systems as fast as possible. This isn’t simply moving fast
and breaking things, but doing so without any regard for the speed at
which you’re moving, and then firing the people that broke them, the
people who know what’s broken, and possibly the people that know how
to fix them.

And it’s not just tech! Boeing — a company I’ve already shat on in
this post, and one I’ll likely return to in future newsletters,
largely because it exemplifies the short-sightedness of today’s
managerial class — has, over the past 20 years or so, spun off huge
parts of the company (parts that, at one point, were vitally
important) into separate companies, laid off thousands of employees at
a time, and outsourced software dev work to $9-an-hour bodyshop
engineers. It hollowed itself out until there was nothing left.

And tell me, knowing what you know about Boeing today, would you
rather get into a 737 Max or an Airbus A320neo? Enough said.

As these organizations push their engineers harder, said engineers
will turn to AI-generated code, poisoning codebases with insecure and
buggy code as companies shed staff to keep up with Wall Street’s
demands in ways that I’m not sure people are capable of understanding.
The companies that run the critical parts of our digital lives do not
invest in maintenance or infrastructure with the intentionality that’s
required to prevent the kinds of massive systemic failures you see
today, and I need you all to be ready for this to happen again.

This is the cost of the Rot Economy — systems used by billions of
people held up by flimsy cultures and brittle infrastructure
maintained with the diligence of an absentee parent. This is the cost
of arrogance, of rewarding managerial malpractice, of promoting speed
over safety and profit over people.

Every single major tech organization should see today as a wakeup call
— a time to reevaluate the fundamental infrastructure behind every
single tech stack.

What I fear is that they’ll simply see it as someone else’s problem -
which is exactly how we got here in the first place.
Henry Crun
2024-07-21 12:37:06 UTC
Post by Subcommandante XDelta
A meditation on the Antithesis of the VMS Ethos, and the DEC way.
A freshly minted neologism: "CloudStrucked" (six ways Sundays, and
then some)
https://www.wheresyoured.at/crowdstruck-2/
<Content snipped>

Thanks for the pointer!

Mike
--
No Micro$oft products were used in the URLs above, or in preparing this message. Recommended reading:
http://www.catb.org/~esr/faqs/smart-questions.html#befor
Craig A. Berry
2024-07-21 12:55:18 UTC
Post by Subcommandante XDelta
The problem here is that Crowdstrike pushed out an evidently broken
kernel driver that locked whatever system that installed it in a
permanent boot loop. The system would start loading Windows, encounter
a fatal error, and reboot. And reboot. Again and again. It, in
essence, rendered those machines useless.
It was not a kernel driver. It was a bad configuration file that
normally gets updated several times a day:

https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/

The bad file was only in the wild for about an hour and a half. Folks
in the US who powered off Thursday evening and didn't get up too early
Friday would've been fine. Of course Europe was well into their work
day, and a lot of computers stay on overnight.

The boot loop may or may not be permanent -- lots of systems have
eventually managed to get the corrected file by doing nothing other than
repeated reboots. No, that doesn't always work.

The update was "designed to target newly observed, malicious named pipes
being used by common C2 frameworks in cyberattacks."
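The "configuration file driving resident code" idea Craig describes can be made concrete with a toy user-space sketch: the detection code stays fixed, while the indicator rules (here, suspicious named-pipe patterns) arrive as data updates several times a day. The rule format below is invented for illustration; the real channel files are proprietary and are parsed in kernel mode.

```python
# Toy illustration of configuration-driven detection: fixed code whose
# behavior is reshaped by frequently delivered rule data, analogous to
# the "channel files" described above. The one-glob-per-line rule format
# is a made-up stand-in for the real, proprietary format.
import fnmatch

class PipeIndicatorChecker:
    """Flags named-pipe names that match any loaded indicator pattern."""

    def __init__(self) -> None:
        self.patterns: list[str] = []

    def load_update(self, channel_data: str) -> None:
        """Replace the rule set from a newly delivered update."""
        self.patterns = [ln.strip()
                         for ln in channel_data.splitlines() if ln.strip()]

    def is_suspicious(self, pipe_name: str) -> bool:
        return any(fnmatch.fnmatch(pipe_name, p) for p in self.patterns)
```

The takeaway: shipping new data changes behavior just as surely as shipping new code, so it arguably deserves the same deployment caution.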

Most likely what makes CrowdStrike popular is that they are continuously
updating countermeasures as threats are observed, but that flies in the
face of normal deployment practices where you don't bet the farm on a
single update that affects all systems all at once. For example, in
Microsoft Azure, you can set up redundancy for your PaaS and SaaS
offerings so that if an update breaks all the servers in one data
center, your services are still up and running in another. Most
enterprises will have similar planning for private data centers.

CrowdStrike thought updating the entire world in an instant was a good
idea. While no one wants to sit there vulnerable to a known threat for
any length of time, I suspect that idea will get revisited. If they had
simply staggered the update over a few hours, the catastrophe would have
been much smaller. Customers will likely be asking for more control
over when they get updates, and, for example, wanting to set up
different update channels for servers and PCs.
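The staggered-rollout idea above can be sketched as a simple ring deployment: push to a small canary group first, check health, then widen, so a bad update bricks one ring rather than the whole fleet. This is an illustrative sketch, not any vendor's actual mechanism; all names are invented.

```python
# Illustrative sketch of a staggered ("ring") rollout. Not CrowdStrike's
# or Azure's actual mechanism; ring sizes and names are hypothetical.
from typing import Callable, Sequence

def staged_rollout(hosts: Sequence[str],
                   apply_update: Callable[[str], bool],
                   ring_sizes: Sequence[int] = (1, 10, 100)) -> list[str]:
    """Update hosts in growing rings; stop at the first ring with a failure.

    apply_update returns True if the host is healthy after the update.
    Returns the hosts that were successfully updated.
    """
    updated: list[str] = []
    start = 0
    for size in ring_sizes:
        ring = hosts[start:start + size]
        if not ring:
            break
        results = [apply_update(h) for h in ring]
        updated.extend(h for h, ok in zip(ring, results) if ok)
        if not all(results):
            # A bad update takes out one ring, not the whole fleet.
            return updated
        start += size
    # Remaining hosts get the update only after every ring passed.
    updated.extend(h for h in hosts[start:] if apply_update(h))
    return updated
```

Even a few hours between rings would have turned Friday's catastrophe into a contained incident.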
Arne Vajhøj
2024-07-21 13:50:36 UTC
Post by Subcommandante XDelta
The problem here is that Crowdstrike pushed out an evidently broken
kernel driver that locked whatever system that installed it in a
permanent boot loop. The system would start loading Windows, encounter
a fatal error, and reboot. And reboot. Again and again. It, in
essence, rendered those machines useless.
It was not a kernel driver. It was a bad configuration file that
normally gets updated several times a day:
https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
So not a driver.

But I will not blame anyone for assuming that a .SYS file under
C:\Windows\System32\drivers was a driver.
The bad file was only in the wild for about an hour and a half.  Folks
in the US who powered off Thursday evening and didn't get up too early
Friday would've been fine.  Of course Europe was well into their work
day, and lot of computers stay on overnight.
The impact was pretty huge.
<Content snipped>
CrowdStrike thought updating the entire world in an instant was a good
idea. While no one wants to sit there vulnerable to a known threat for
any length of time, I suspect that idea will get revisited. If they had
simply staggered the update over a few hours, the catastrophe would have
been much smaller.  Customers will likely be asking for more control
over when they get updates, and, for example, wanting to set up
different update channels for servers and PCs.
I have already seen speculation that IT security will decrease because
patch deployment speed will slow down.

Arne

PS: I don't like the product!
Craig A. Berry
2024-07-21 17:57:06 UTC
Post by Arne Vajhøj
It was not a kernel driver. It was a bad configuration file that
normally gets updated several times a day:
https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
So not a driver.
But I will not blame anyone for assuming that a .SYS file under
C:\Windows\System32\drivers was a driver.
It was a reasonable guess, but the OP claimed that Microsoft's kernel
driver approval process was somehow involved, which doesn't seem to be
the case. On the other hand, a kernel driver that can reconfigure
itself multiple times a day from data obtained over the network may
avoid some kinds of problems, but clearly it can cause others.
Post by Arne Vajhøj
CrowdStrike thought updating the entire world in an instant was a good
idea. While no one wants to sit there vulnerable to a known threat for
any length of time, I suspect that idea will get revisited.
I have already seen speculation that IT security will decrease because
patch deployment speed will slow down.
If you update too slowly, you are vulnerable. If you update everything
immediately all at once world-wide, you risk catastrophic failure. There
is no free lunch.
Post by Arne Vajhøj
Arne
PS: I don't like the product!
Since Friday you probably have a lot of company :-).
Lawrence D'Oliveiro
2024-07-21 21:37:54 UTC
Post by Arne Vajhøj
I have already seen speculation that IT security will decrease because
patch deployment speed will slow down.
Consider that non-CrowdStrike customers, and even non-Windows-using
CrowdStrike customers, were not affected.

Therefore, would not a more logical conclusion be: “don’t put all your
eggs in one basket”? Spread your Windows systems around different security
providers, and perhaps make more use of non-Windows systems?
Gary R. Schmidt
2024-07-22 05:38:37 UTC
Post by Lawrence D'Oliveiro
Post by Arne Vajhøj
I have already seen speculation that IT security will decrease because
patch deployment speed will slow down.
Consider that non-CrowdStrike customers, and even non-Windows-using
CrowdStrike customers, were not affected.
Therefore, would not a more logical conclusion be: “don’t put all your
eggs in one basket”? Spread your Windows systems around different security
providers, and perhaps make more use of non-Windows systems?
But do be aware that a few months ago Crowdstrike bricked a bunch of
Linux and Mac boxes.

The problem, methinks, is not in the OS...

Cheers,
Gary B-)
Lawrence D'Oliveiro
2024-07-22 06:03:17 UTC
Post by Gary R. Schmidt
But do be aware that a few months ago Crowdstrike bricked a bunch of
Linux and Mac boxes.
Any details?
Gary R. Schmidt
2024-07-22 08:24:14 UTC
Post by Lawrence D'Oliveiro
Post by Gary R. Schmidt
But do be aware that a few months ago Crowdstrike bricked a bunch of
Linux and Mac boxes.
Any details?
https://www.theregister.com/2024/07/21/crowdstrike_linux_crashes_restoration_tools/

https://www.neowin.net/news/crowdstrike-broke-debian-and-rocky-linux-months-ago-but-no-one-noticed/
Arne Vajhøj
2024-07-22 12:58:08 UTC
Post by Gary R. Schmidt
Post by Lawrence D'Oliveiro
Post by Gary R. Schmidt
But do be aware that a few months ago Crowdstrike bricked a bunch of
Linux and Mac boxes.
Any details?
https://www.theregister.com/2024/07/21/crowdstrike_linux_crashes_restoration_tools/
https://www.neowin.net/news/crowdstrike-broke-debian-and-rocky-linux-months-ago-but-no-one-noticed/
They do not have a good track record.

A few weeks ago:

https://www.thestack.technology/crowdstrike-bug-maxes-out-100-of-cpu-requires-windows-reboots/

That is back on Windows, but I don't think MS can take any
responsibility for that.

Arne
Lawrence D'Oliveiro
2024-07-23 00:13:46 UTC
Post by Gary R. Schmidt
https://www.neowin.net/news/crowdstrike-broke-debian-and-rocky-linux-months-ago-but-no-one-noticed/
“No one noticed” ... perhaps because hardly anybody is using these
sorts of intrusive EDR products on Linux?

Another thing that reduces the chance of screwups: a poster in another
group gave this link to a comment by long-time Linux contributor
Matthew Garrett: on Windows, CrowdStrike has to load its own
proprietary kernel driver to do its anti-malware checks, but on Linux
they just rely on the standard configurable eBPF facility. This helps
to reduce the chance of things going wrong.

<https://nondeterministic.computer/@mjg59/112816011370924959>
Simon Clubley
2024-07-22 12:34:28 UTC
Post by Craig A. Berry
Post by Subcommandante XDelta
The problem here is that Crowdstrike pushed out an evidently broken
kernel driver that locked whatever system that installed it in a
permanent boot loop. The system would start loading Windows, encounter
a fatal error, and reboot. And reboot. Again and again. It, in
essence, rendered those machines useless.
It was not a kernel driver. It was a bad configuration file that
normally gets updated several times a day:
https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
If it's something that can stop the system from booting, then it _should_
be treated as if it _was_ a kernel driver.

IOW, what on earth happened to the concept of a Last Known Good boot to
automatically recover from such screwups ? Windows 2000, over 2 decades
ago, had an early version of the LKG boot concept for goodness sake.

What _should_ have happened, and what should have been built into Windows
years ago as part of the standard procedures for updating system components,
is that the original version of files that were used during the last good
boot were preserved in a backup until the next successful boot.

After that, the preserved files would be overwritten with the updated
versions. OTOH, if the next boot fails, the last known good configuration
is restored and another reboot done, but exactly _once_ only. (If the LKG
boot fails, then it's probably some hardware failure or other external
factor).
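The scheme described above can be sketched as a small state machine: boot with the new files, and on failure restore the preserved last-known-good set and retry exactly once. This is purely an illustration of the proposal in this post, not of how Windows actually implements LKG.

```python
# Sketch of the Last Known Good (LKG) scheme described above: try the
# new file set; on a failed boot, restore the preserved set and retry
# once only. Illustrative only -- not Windows' actual implementation.
from typing import Callable, Dict

def boot_with_lkg(new_files: Dict[str, bytes],
                  lkg_files: Dict[str, bytes],
                  try_boot: Callable[[Dict[str, bytes]], bool]) -> str:
    """Return 'ok', 'ok-after-rollback', or 'hardware-or-external-failure'."""
    if try_boot(new_files):
        # Successful boot: the new set becomes the preserved LKG set.
        lkg_files.clear()
        lkg_files.update(new_files)
        return "ok"
    # Failed boot: restore the last known good set and retry exactly once.
    if try_boot(dict(lkg_files)):
        return "ok-after-rollback"
    # If even the LKG set fails, suspect hardware or an external factor.
    return "hardware-or-external-failure"
```

The hard part, as noted later in the thread, is deciding what counts as a "good boot" when the crash happens late in startup.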
Post by Craig A. Berry
<Content snipped>
CrowdStrike thought updating the entire world in an instant was a good
idea. While no one wants to sit there vulnerable to a known threat for
any length of time, I suspect that idea will get revisited. If they had
simply staggered the update over a few hours, the catastrophe would have
been much smaller. Customers will likely be asking for more control
over when they get updates, and, for example, wanting to set up
different update channels for servers and PCs.
Or modern Windows could simply fully implement the LKG boot concept.

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
Arne Vajhøj
2024-07-22 12:54:36 UTC
Post by Simon Clubley
Post by Craig A. Berry
Post by Subcommandante XDelta
The problem here is that Crowdstrike pushed out an evidently broken
kernel driver that locked whatever system that installed it in a
permanent boot loop. The system would start loading Windows, encounter
a fatal error, and reboot. And reboot. Again and again. It, in
essence, rendered those machines useless.
It was not a kernel driver. It was a bad configuration file that
normally gets updated several times a day:
https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
If it's something that can stop the system from booting, then it _should_
be treated as if it _was_ a kernel driver.
It was config for and impacting behavior of kernel code.

So yes.
Post by Simon Clubley
IOW, what on earth happened to the concept of a Last Known Good boot to
automatically recover from such screwups ? Windows 2000, over 2 decades
ago, had an early version of the LKG boot concept for goodness sake.
What _should_ have happened, and what should have been built into Windows
years ago as part of the standard procedures for updating system components,
is that the original version of files that were used during the last good
boot were preserved in a backup until the next successful boot.
After that, the preserved files would be overwritten with the updated
versions. OTOH, if the next boot fails, the last known good configuration
is restored and another reboot done, but exactly _once_ only. (If the LKG
boot fails, then it's probably some hardware failure or other external
factor).
Definitely a good concept.

Note though that it would require a smart definition of a good boot.

The problem happened rather late in startup, and Windows may very well
have considered the startup to have completed successfully.

Arne
Lawrence D'Oliveiro
2024-07-23 00:25:58 UTC
Permalink
Post by Arne Vajhøj
It was config for and impacting behavior of kernel code.
And it was not subject to the configuration option for turning off
automatic updates. Updates for these files were forced through anyway.
Simon Clubley
2024-07-23 12:20:52 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Arne Vajhøj
It was config for and impacting behavior of kernel code.
And it was not subject to the configuration option for turning off
automatic updates. Updates for these files were forced through anyway.
Un-bloody-believable. :-( :-(

I hope this doesn't turn out to be some clueless cretin who thought
they knew better than anyone else when creating an update and has now
just discovered the hard way that they did not.

I read somewhere that the file that got pushed had nulls in it and
was not identical to the one that was tested. :-(

It also turns out their fully-privileged kernel mode driver didn't do
the proper level of validation on this file. (So once again, we are
back to a clueless cretin who thought they knew better than anyone
else, but this time we are talking about the kernel-mode driver
writer. :-( )

And no, that is _NOT_ with the benefit of hindsight. When you are writing
this kind of code, you don't trust _anything_ external (or even your own
code :-) ), and you instead validate and perform cross-checks accordingly.
And yes, this _is_ standard practice for any code I write.
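That validate-everything discipline can be illustrated in a few lines.
The header layout below is entirely invented (the real channel-file
format is not public); the point is simply that privileged code refuses
to act on bytes that fail cheap cross-checks, including the all-nulls
case reported here:

```python
import struct

MAGIC = b"CFG1"  # hypothetical file-type tag, not the real format

def validate_channel_file(blob: bytes) -> bool:
    """Defensive checks to run before any privileged code ever
    dereferences the contents of an externally supplied file."""
    if len(blob) < 12:
        return False                   # too short to hold a header
    if blob[:4] != MAGIC:
        return False                   # wrong file type entirely
    count, payload_len = struct.unpack_from("<II", blob, 4)
    if payload_len != len(blob) - 12:
        return False                   # declared size disagrees with actual
    if count == 0 or blob[12:] == b"\x00" * payload_len:
        return False                   # empty or all-null payload is garbage
    return True
```

A checksum over the payload would normally round this out; the cheap
checks alone already reject a file of nothing but nulls.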

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
Lawrence D'Oliveiro
2024-07-21 21:39:29 UTC
Permalink
As we speak, millions -- or even hundreds of millions -- of different
Windows-based computers are now stuck in a doom-loop ...
Microsoft’s count was 8.5 million. Not that huge a number, really.

Maybe it’s a sign that not as many people depend on Windows for such
mission-critical systems as you might expect.
Simon Clubley
2024-07-22 12:48:54 UTC
Permalink
As we speak, millions -- or even hundreds of millions -- of different
Windows-based computers are now stuck in a doom-loop ...
Microsoft's count was 8.5 million. Not that huge a number, really.
It's used mainly in business/professional environments.

That makes 8.5 million a _huge_ number.

BTW, I found this while trying to find out more about the company and
I wonder if they are planning to update it anytime soon to tone it down:

https://www.crowdstrike.com/careers/diversity-equity-and-inclusion/

They talk a lot about how they make people feel good about themselves,
but nothing about how they cultivate people to produce robust reliable
software.

That page above seems seriously OTT, so I just hope their development
processes are engineering-based instead of feeling-based, given how
critical a company they have become.

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
Arne Vajhøj
2024-07-22 13:24:38 UTC
Permalink
Post by Simon Clubley
BTW, I found this while trying to find out more about the company and
https://www.crowdstrike.com/careers/diversity-equity-and-inclusion/
They talk a lot about how they make people feel good about themselves,
but nothing about how they cultivate people to produce robust reliable
software.
That page above seems seriously OTT, so I just hope their development
processes are engineering-based instead of feeling-based, given how
critical a company they have become.
I suspect that page was created by people from either HR or
a dedicated DEI team who are not able to distinguish between
a C program and a Java program.

:-)

But if you combine "the big problem", "the Linux problem"
and "the Windows CPU usage problem" which are 3 big problems
within a few months, then I would say that they have
"room for improvements in software quality".

:-)

Arne
Arne Vajhøj
2024-07-22 13:29:11 UTC
Permalink
Post by Arne Vajhøj
But if you combine "the big problem", "the Linux problem"
and "the Windows CPU usage problem" which are 3 big problems
within a few months, then I would say that they have
"room for improvements in software quality".
:-)
Which is not unique in any way.

Old joke:

<joke>
At a software development conference in a session about
software quality the speaker looked out at the audience and
asked "You have just boarded a plane and you realize that the
software of that plane has been developed by your team. Who
would stay on the plane?". Only one guy raised his hand. So
the speaker asked him "How do you ensure such high quality
that you feel safe staying on the plane?". And the guy
answered "Our quality sucks - there is no chance that
the plane would make it from the gate out to the takeoff
runway, so no need to get off.".
</joke>

Arne
Craig A. Berry
2024-07-22 13:27:22 UTC
Permalink
Post by Simon Clubley
BTW, I found this while trying to find out more about the company and
https://www.crowdstrike.com/careers/diversity-equity-and-inclusion/
They talk a lot about how they make people feel good about themselves,
but nothing about how they cultivate people to produce robust reliable
software.
I think it's the other way around. They have a bad reputation for how
they treat their employees so have made efforts to correct that image.

None of which is relevant to policies around testing a new configuration
before deploying to the entire world all at once.
Simon Clubley
2024-07-22 18:02:39 UTC
Permalink
Post by Craig A. Berry
Post by Simon Clubley
BTW, I found this while trying to find out more about the company and
https://www.crowdstrike.com/careers/diversity-equity-and-inclusion/
They talk a lot about how they make people feel good about themselves,
but nothing about how they cultivate people to produce robust reliable
software.
I think it's the other way around. They have a bad reputation for how
they treat their employees so have made efforts to correct that image.
None of which is relevant to policies around testing a new configuration
before deploying to the entire world all at once.
If they have an engineering culture, you are correct.

If they have a feelings-based culture, the situation is more complicated.

Part of an engineering approach is that you push back on shortcuts and
daft ideas. In a feelings-based culture, you may be afraid to push back
on something because you don't want to be accused of "causing offence"
or some other nonsense because the people you are addressing don't know
how to handle negative feedback.

One of the things I say to people, when I take part in discussions about
some new thing or idea I am responsible for, is to push back on me if
they think the idea is daft or if they can see something I have missed.
I tell people I will not be annoyed if they actively disagree with me,
but I will be annoyed if they see something and don't say anything.

And yes, sometimes people see something I have missed or see a better way
of doing something and _that_ is part of the engineering approach.

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
Arne Vajhøj
2024-07-24 01:14:40 UTC
Permalink
Post by Simon Clubley
BTW, I found this while trying to find out more about the company and
https://www.crowdstrike.com/careers/diversity-equity-and-inclusion/
They talk a lot about how they make people feel good about themselves,
but nothing about how they cultivate people to produce robust reliable
software.
That page above seems seriously OTT, so I just hope their development
processes are engineering-based instead of feeling-based, given how
critical a company they have become.
Regarding their culture, it has been mentioned in the press
that their CEO was once CTO at McAfee.

https://www.businessinsider.com/crowdstrike-ceo-george-kurtz-tech-outage-microsoft-mcafee-2024-7

I guess that CrowdStrike's and McAfee's markets are close enough
for that to make business sense.

But I would not want to duplicate McAfee's approach to
engineering quality.

Arne
Arne Vajhøj
2024-07-22 13:04:12 UTC
Permalink
Post by Lawrence D'Oliveiro
As we speak, millions -- or even hundreds of millions -- of different
Windows-based computers are now stuck in a doom-loop ...
Microsoft’s count was 8.5 million. Not that huge a number, really.
Maybe it’s a sign that not as many people depend on Windows for such
mission-critical systems as you might expect.
The number is about as high as was to be expected.

Math like:

1000 million Windows PC's & 1/2 business 1/2 private
=>
500 million business Windows PC's

500 million business Windows PC's & CrowdStrike on 1/3
=>
roughly 167 million business Windows PC's with CrowdStrike

167 million business Windows PC's with CrowdStrike & bad file available
for 1.5 hours & assumption about restarts evenly distributed over 24
hours (which is not that unrealistic if looking worldwide)
=>
expected roughly 10 million PC's to be impacted
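Redone in code, and noting that 500/3 is nearer 167 million than 133,
the same rough inputs land at about 10 million, still the same order of
magnitude as Microsoft's reported 8.5 million:

```python
# Back-of-envelope estimate using the rough assumptions above.
total_windows_pcs = 1_000_000_000    # ~1000 million Windows PCs
business_share = 1 / 2               # half business, half private
crowdstrike_share = 1 / 3            # CrowdStrike on a third of those
window_hours = 1.5                   # bad file was live for 1.5 hours
hours_per_day = 24                   # restarts assumed uniform over 24h

with_crowdstrike = total_windows_pcs * business_share * crowdstrike_share
impacted = with_crowdstrike * (window_hours / hours_per_day)
print(f"{with_crowdstrike / 1e6:.0f} million with CrowdStrike, "
      f"~{impacted / 1e6:.1f} million impacted")
```

This prints "167 million with CrowdStrike, ~10.4 million impacted";
every input is, of course, only a guess.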

:-)

Arne
Arne Vajhøj
2024-07-22 13:09:18 UTC
Permalink
Post by Lawrence D'Oliveiro
As we speak, millions -- or even hundreds of millions -- of different
Windows-based computers are now stuck in a doom-loop ...
Microsoft’s count was 8.5 million. Not that huge a number, really.
Maybe it’s a sign that not as many people depend on Windows for such
mission-critical systems as you might expect.
It may also be worth noting that a lot of the problems were caused
by the issue hitting "individually non-critical but group critical PC's".

Meaning that you may have 1000 desktop PC's running some
business GUI - if 1 or 10 or even 25 of these go down it has
very little impact, but if all 1000 are down (because they had
Crowdstrike and they were all rebooted in the bad 1.5 hour window)
then it has a huge impact.
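A little binomial arithmetic makes the group-criticality point concrete
(fleet size and reboot window are the figures above; the independence
assumption is the interesting part, since a coordinated fleet-wide
reboot breaks it and takes everything down together):

```python
import math

def prob_at_least(n: int, p: float, k: int) -> float:
    """P[X >= k] for X ~ Binomial(n, p): the chance that at least k
    of n independently rebooting PCs caught the bad update window."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

fleet = 1000          # desktop PCs running the business GUI
p_bad = 1.5 / 24      # chance a given PC rebooted in the bad window

# Uncorrelated reboots: around 62 of 1000 machines down - an annoyance.
expected_down = fleet * p_bad

# All 1000 down at once is vanishingly unlikely under independence...
p_all_down = prob_at_least(fleet, p_bad, fleet)
# ...but a scheduled overnight restart makes the reboots perfectly
# correlated, and then the whole group fails as a unit.
```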

Arne
Stephen Hoffman
2024-07-29 16:58:51 UTC
Permalink
Post by Subcommandante XDelta
A meditation on the Antithesis of the VMS Ethos, and the DEC way.
A heady mix of entertainment and omissions and economically-problematic
hopes and dreams, that.

Brandolini's Law is always in scope, of course. The bulk of the
citations first:

CrowdStrike-related:
https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/
https://forums.rockylinux.org/t/crowdstrike-freezing-rockylinux-after-9-4-upgrade/14041

https://www.thestack.technology/crowdstrike-bug-maxes-out-100-of-cpu-requires-windows-reboots/


Microsoft has had legal entanglements here:
https://www.techtarget.com/searchsecurity/news/450420491/Microsoft-accused-of-blocking-independent-antivirus-competition

https://www.theregister.com/2024/07/22/windows_crowdstrike_kernel_eu/

Microsoft has been working on security here:
https://www.microsoft.com/en-us/security/blog/2021/12/08/improve-kernel-security-with-the-new-microsoft-vulnerable-and-malicious-driver-reporting-center/

https://learn.microsoft.com/en-us/windows/win32/services/protecting-anti-malware-services-

https://opensource.microsoft.com/blog/2021/05/10/making-ebpf-work-on-windows/

Other vendors have been moving kernel code to user mode, and reducing
the apps that can load extensions, which is somewhat helpful for
security and definitely helpful for avoiding kernel crashes, but then
attacks against user-mode code with access to kernel APIs can be bad,
too.
https://developer.apple.com/support/kernel-extensions/
https://www.sweetwater.com/sweetcare/articles/kernel-extensions-on-mac-with-apple-silicon/

https://ebpf.io on Linux
https://developer.apple.com/documentation/coreservices/file_system_events
(and
https://www.crowdstrike.com/blog/using-os-x-fsevents-discover-deleted-malicious-artifact/
and
https://www.crowdstrike.com/blog/i-know-what-you-did-last-month-a-new-artifact-of-execution-on-macos-10-13/
)
https://support.apple.com/guide/security/welcome/web
https://developer.apple.com/documentation/endpointsecurity


As for kernel mode APIs and design more generally, OpenVMS has gaps
here too, with VCI being the not-really-equivalent and
not-generally-documented API for network interface. And it's a kernel
API, with all that entails. The closest analog to the file change
notification API (FSEvents-like) is parsing security alarms arriving
via an app-declared mailbox, something which I've encountered in only a
handful of apps. An approach which gets scruffy. The only
kernel-code-accessing-user-mode mechanism in OpenVMS is the
ill-documented ACP mechanism, which really isn't an isolation mechanism
given it's passing around kernel data structure pointers such as I/O
request packets. Having written various ACPs, that all works pretty
well, but the APIs are very much set up for mounting and dismounting
file systems, and areas such as mount and dismount are completely
lacking customizations, which usually means writing up your own $mount
and $dismou analog. ACPs aren't a great way to avoid kernel code, and
are more intended for allowing kernel code to call outer-mode APIs.
Which is definitely scruffy. IIRC, the TCP/IP Services package — why
that's still separately installed, a packaging decision straight out of
the last millennium — has a kernel callout for packet filtering too,
but that's still not documented AFAIK.

In short, there's no good place to tie in endpoint security, or tools
akin to CrowdStrike. There are no endpoint security APIs.

Outside of legal entanglements, biggest issue with APIs and API-level
changes for Microsoft is app and API compatibility, and there's a
lineage there from Microsoft back through MICA to OpenVMS and the goal
of OpenVMS compatibility, too. A laudable goal, with
occasionally-intractable results. Such as trying to stuff a modern and
robust password hash into an eight-byte field.

As for the referenced mess, CrowdStrike was basically testing in
production, and seemingly lacked any sort of continuous integration
(whatever validation they did have reportedly returned a "yep" when the
content wasn't actually tested), and given that vendor's other recent
issues on other platforms, they haven't particularly been learning how
to deal with and reduce the damage arising from their own errors.
Maybe hiring a billionaire former CTO of McAfee as your CEO didn't work
out?
https://en.wikipedia.org/wiki/Continuous_integration
https://www.businessinsider.com/crowdstrike-ceo-george-kurtz-tech-outage-microsoft-mcafee-2024-7?op=1


Alternatives to CrowdStrike exist with some vendors, Microsoft has
Defender (whatever its proper product name is now), Apple has XProtect
and XProtect Remediator and the Signed System Volume and App
Notarization. OpenVMS has no analog. (Yeah, I think you can actually
sign stuff with the long-deprecated CDSA, but I've never seen anybody
use that mechanism outside of OpenVMS Secure Delivery, which itself
moved away from CDSA.) There have been third-party apps that tried to
manage malware and change control on OpenVMS too, and DEC had
DECinspect.


As for the OpenVMS Ethos, the problems and the systems and the
interconnections are vastly more complex than OpenVMS is, and the pace
of required changes in many environments is necessarily far faster
than OpenVMS has ever managed. Any snarking at billionaires and at
ever-loquacious newsletter texts aside, this ever-increasing complexity
is built upon myriad very difficult problems and dependencies. We
aren't ever going back to the pre-millennial era of simpler and less
interconnected computing, either.

Ever-increasing complexity? Yeah. There are issues with Secure Boot and
with self-bricking Intel Raptor Lake 65W+ processors, among many other
recent problems:
https://arstechnica.com/security/2024/07/secure-boot-is-completely-compromised-on-200-models-from-5-big-device-makers/

https://www.tomshardware.com/pc-components/cpus/intel-cpu-instability-crashing-bug-includes-65w-and-higher-skus-intel-says-damage-is-irreversible-no-planned-recall


Yeah, and CrowdStrike absolutely blew it. I expect Microsoft will use
some of the fallout to push vendors toward supported APIs, though that
push won't be free of vendor complaints, and not without the
possibility and the risks of poorly-secured or poorly-written
user-mode code now causing mayhem.
--
Pure Personal Opinion | HoffmanLabs LLC
Lawrence D'Oliveiro
2024-07-29 21:36:37 UTC
Permalink
... with occasionally-intractable results. Such as trying to stuff a
modern and robust password hash into an eight-byte field.
The Unix tradition of text-based config files (in this case, /etc/shadow)
wins again.
As for the referenced mess, CrowdStrike was basically testing in
production, and seemingly lacked any sort of continuous integration ...
They advertise it as a positive point, that they can respond to new
security threats faster than other companies--certainly faster than
Microsoft.

And yes, they do it by cutting corners on testing. I’ve seen many other
comments raising the hoary old “never implement new system changes on a
Friday” meme ... but what happens if the malware writers release a zero-
day on a Friday?
