Discussion:
A meditation on the Antithesis of the VMS Ethos
Subcommandante XDelta
2024-07-21 09:41:06 UTC
A meditation on the Antithesis of the VMS Ethos, and the DEC way.

A freshly minted neologism: "CloudStrucked" (six ways to Sunday, and
then some)

https://www.wheresyoured.at/crowdstruck-2/


Ed Zitron's Where's Your Ed At

CrowdStruck

Edward Zitron

Jul 19, 2024

Soundtrack: EL-P - Tasmanian Pain Coaster (feat. Omar Rodriguez-Lopez
& Cedric Bixler-Zavala)

When I first began writing this newsletter, I didn't really have a
goal, or a "theme," or anything that could neatly characterize what I
was going to write about other than that I was on the computer and
that I was typing words.

As it grew, I wrote the Rot Economy, and the Shareholder Supremacy,
and many other pieces that speak to a larger problem in the tech
industry — a complete misalignment in the incentives of most of the
major tech companies, which have become less about building new
technologies and selling them to people and more about capturing
monopolies and gearing organizations to extract things through them.

Every problem you see is a result of a tech industry — from the people
funding the earliest startups to the trillion-dollar juggernauts that
dominate our lives — that is no longer focused on the creation of
technology with a purpose, or on organizations driven toward one.
Everything is about expressing growth, about showing how you will
dominate an industry rather than serve it, about providing metrics
that speak to the paradoxical notion that you'll grow forever without
any consideration of how you'll live forever. Legacies are now
subordinate to monopolies, current customers are subordinate to new
customers, and "products" are considered a means to introduce a
customer to a form of parasite designed to punish the user for even
considering moving to a competitor.

What's happened today with Crowdstrike is completely unprecedented
(and I'll get to why shortly), and on the scale of the much-feared Y2K
bug that threatened to ground the entirety of the world's
computer-based infrastructure once the Year 2000 began.

You'll note that I didn't write "over-hyped" or anything dismissive of
Y2K's scale, because Y2K was a huge, society-threatening calamity
waiting to happen, and said calamity was averted through a remarkable,
$500 billion industrial effort that took a decade to manifest, because
such a significant single point of failure would likely have crippled
governments, banks and airlines.

People laughed when nothing happened on January 1, 2000, assuming that
all that money and time had been wasted, rather than being grateful
that an infrastructural weakness was taken seriously, that a single
point of failure was identified, and that a crisis was averted by
investing in stopping bad stuff happening before it does.

As we speak, millions — or even hundreds of millions — of different
Windows-based computers are now stuck in a doom-loop, repeatedly
showing users the famed "Blue Screen of Death" thanks to a single
point of failure in a company called Crowdstrike, the developer of a
globally-adopted cyber-security product designed, ironically, to
prevent the kinds of disruption that we’ve witnessed today. And for
reasons we’ll get to shortly, this nightmare is going to drag on for
several days (if not weeks) to come.

The product — called Crowdstrike Falcon Sensor — is an EDR system
(which stands for Endpoint Detection and Response). If you aren’t a
security professional and your eyes have glazed over, I’ll keep this
brief. An EDR system is designed to identify hacking attempts,
remediate them, and prevent them. They’re big, sophisticated, and
complicated products, and they do a lot of things that are hard to build
with the standard tools available to Windows developers.

And so, to make Falcon Sensor work, Crowdstrike had to build its own
kernel driver. Now, kernel drivers operate at the lowest level of the
computer. They have the highest possible permissions, but they operate
with the fewest guardrails. If you’ve ever built your own
computer — or you remember what computers were like in the dark days
of Windows 98 — you know that a single faulty kernel driver can wreak
havoc on the stability of your system.

The problem here is that Crowdstrike pushed out an evidently broken
kernel driver that locked whatever system installed it into a
permanent boot loop. The system would start loading Windows, encounter
a fatal error, and reboot. And reboot. Again and again. It, in
essence, rendered those machines useless.

It's convenient to blame Crowdstrike here, and perhaps that's fair.
This should not have happened. On a basic level, whenever you write
(or update) a kernel driver, you need to know it’s actually robust and
won’t shit the bed immediately. Regrettably, Crowdstrike seemingly
borrowed Boeing’s approach to quality control, except instead of
building planes where the doors fly off at the most inopportune times
(specifically, when you’re cruising at 35,000ft), it released a piece
of software that blew up the transportation and banking sectors, to
name just a few.

It created a global IT outage that has grounded flights and broken
banking services. It took down the BBC’s flagship kids TV channel,
infuriating parents across the British Isles, as well as Sky News,
which, when it was able to resume live broadcasts, was forced to do so
without graphics. In essence, it was forced back to the 1950s — giving
it an aesthetic that matches the politics of its owner, Rupert
Murdoch. By no means is this an exhaustive list of those affected,
either.

The scale and disruption caused by this incident is unlike anything
we’ve ever seen before. Previous incidents — particularly ransomware
outbreaks like WannaCry — simply can’t compare to this,
especially when we’re looking at the disruption and the sheer scale of
the problem.

Still, if your day was ruined by this outage, at least spare a thought
for those who’ll have to actually fix it. Because those machines
affected are now locked in a perpetual boot loop, it’s not like
Crowdstrike can release a software patch and call it a day. Undoing
this update requires someone to go to each affected computer
individually, load up safe mode (a limited version of Windows with
most non-essential software and drivers disabled), and manually remove
the faulty code. And if you’ve encrypted your computer, that process
gets a lot harder. Servers running on cloud services like Amazon Web
Services and Microsoft Azure — you know, the way most of the
internet's infrastructure works — require an entirely separate series
of actions.
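The manual fix that circulated publicly amounted to booting each machine into safe mode and deleting the faulty update files by hand. A minimal sketch of that cleanup step, assuming the widely reported install path and file pattern (C-00000291*.sys); verify against the vendor's own guidance before deleting anything:

```python
# Sketch of the widely reported manual workaround, run from Safe Mode.
# The directory and file pattern below are as reported in public
# advisories, not taken from vendor documentation; treat as assumptions.
import glob
import os

DRIVER_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
BAD_PATTERN = "C-00000291*.sys"  # the faulty channel-file series

def remove_faulty_channel_files(directory: str = DRIVER_DIR) -> list[str]:
    """Delete files matching the faulty pattern; return what was removed."""
    removed = []
    for path in glob.glob(os.path.join(directory, BAD_PATTERN)):
        os.remove(path)
        removed.append(path)
    return removed

if __name__ == "__main__":
    for path in remove_faulty_channel_files():
        print("removed:", path)
```

The point of the sketch is the shape of the problem: a trivially scriptable fix that nonetheless has to be performed on each machine, because the machine can't stay up long enough to run anything remotely.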

If you’re on a small IT team and you’re supporting hundreds of
workstations across several far-flung locations — which isn’t unusual,
especially in sectors like retail and social care — you’re especially
fucked. Say goodbye to your weekend. Your evenings. Say goodbye to
your spouse and kids. You won’t be seeing them for a while. Your life
will be driving from site to site, applying the fix and moving on.
Forget about sleeping in your own bed, or eating a meal that wasn’t
bought from a fast food restaurant. Good luck, godspeed, and God
bless. I do not envy you.

The significance of this failure — which isn't a breach, by the way,
and in many respects is far worse, at least in the disruption caused —
is not in its damage to individual users, but in its damage to the
technical infrastructure that runs on Windows, and in the fact that so
much of our global infrastructure relies on automated enterprise
software that, when it goes wrong, breaks everything.

It isn't about the number of computers, but the number of them that
underpin things like the security checkpoints or systems that run
airlines, or banks, or hospitals, all running as much automated
software as possible so that costs can be kept down.

The problem here is systemic: a company that the majority of people
affected by this outage had no idea existed until today was trusted by
Microsoft to the extent that it was able to push an update that broke
the back of a huge chunk of the world's digital infrastructure.

Microsoft, as a company, never built the kind of rigorous security
protocols that would, say, rigorously test something that connects to
what seems to be a huge proportion of Windows computers. The company
really screwed up here. As pointed out by
Wired, the company vets and cryptographically signs all kernel drivers
— which is sensible and good, because kernel drivers have an
incredible amount of access, and thus can be used to inflict serious
harm — with this testing process usually taking several weeks.

How then did this slip through its fingers? For this to have happened,
two companies needed to screw up epically. And boy, they did.

What we're seeing today isn't just a major fuckup, but the first of
what will be many systematic failures — some small, some potentially
larger — that are the natural byproduct of the growth-at-all-costs
ecosystem where any attempt to save money by outsourcing major systems
is one that simply must be taken to please the shareholder.

The problem with the digitization of society — or, more specifically,
the automation of once-manual tasks — is that it introduces a single
point of failure. Or, rather, multiple single points of failure. Our
world, our lifestyle and our economy, is dependent on automation and
computerization, with these systems, in turn, dependent on other
systems to work. And if one of those systems breaks, the effects
ricochet outwards, like ripples when you cast a rock into a lake.

Today’s Crowdstrike cock-up is just the latest example of this, but it
isn’t the only one. Remember the SolarWinds hack in 2020, when Russian
state-linked hackers gained access to an estimated 18,000 companies
and public sector organizations — including NATO, the European
Parliament, the US Treasury Department, and the UK’s National Health
Service — by compromising just one service — SolarWinds Orion?

Remember when Okta — a company that makes software that handles
authentication for a bunch of websites, governments, and businesses —
got hacked in 2023, and then lied about the scale of the breach? And
then do you remember how those hackers leapfrogged from Okta to a
bunch of other companies, most notably Cloudflare, which provides CDN
and DDOS protection services for pretty much the entire internet?

That whole John Donne quote — “No man is an island” — is especially
true when we’re talking about tech, because when you scratch beneath
the surface, every system that looks like it’s independent is actually
heavily, heavily dependent on services and software provided by a very
small number of companies, many of whom are not particularly good.
This is as much a cultural failing as it is a technological one, the
result of management geared toward value extraction — building systems
that build monopolies by attaching themselves to other monopolies.
Crowdstrike went public in 2019, and its stock immediately popped on
its first day of trading thanks to Wall Street's appreciation of
Crowdstrike moving away from a focused approach to serving large
enterprise clients and toward building products for small and
medium-sized businesses sold through channel partners — in effect
outsourcing both product sales and the client relationship that would
tailor a solution to a particular business need.

Crowdstrike's culture also appears to fucking suck. A recent Glassdoor
entry referred to Crowdstrike as "great tech [with] terrible culture"
with no work life balance, with "leadership that does not care about
employee well being." Another from June claimed that Crowdstrike was
"changing culture for the street,” with KPIs (as in metrics related to
your “success” at the company) “driving behavior more than building
relationships” with a serious lack of experience in the public sector
in senior management. Others complain of micromanagement, with one
claiming that “management is the biggest issue,” with managers
“ask[ing] way too much of you…and it doesn’t matter if you do what
they ask since they’re not even around to check on you,” and another
saying that “management are arrogant” and need to “stop lying to the
market on product capability.”

While I can’t say for sure, I’d imagine an organization with such
powerful signs of growth-at-all-costs thinking — a place where you
“have to get used to the pressure” that’s a “clique that you’re not
in” — likely isn’t giving its quality assurance teams the time and
space to make sure that there aren’t any Kaiju-level security threats
baked into an update. And that assumes it actually has a significant
QA team in-house, and hasn’t just (as with many companies) outsourced
the work to a “bodyshop” like Wipro or Infosys or Tata.

And don’t think I’m letting Microsoft off the hook, either. Assuming
the kernel driver testing roles are still being done in-house, do you
think that these testers — who have likely seen their friends laid off
at a time when the company was highly profitable, and denied raises
when their well-fed CEO took home hundreds of millions of dollars for
doing a job he’s eminently bad at — are motivated to do their best
work?

And this is the culture that’s poisoned almost the entirety of Silicon
Valley. What we’re seeing is the societal cost of moving fast and
breaking things, of Marc Andreessen considering “risk management the
enemy,” of hiring and firing tens of thousands of people to please
Wall Street, of seeking as many possible ways to make as much money as
possible to show shareholders that you’ll grow, even if doing so means
growing at a pace that makes it impossible to sustain organizational
and cultural stability. When you aren’t intentional in the people you
hire, the people you fire, the things you build and the way that
they’re deployed, you’re going to lose the people that understand the
problems they’re solving, and thus lack the organizational ability to
understand the ways that they might be solved in the future.

This is dangerous, and also a dark warning for the future. Do you
think that Facebook, or Microsoft, or Google — all of whom have laid
off over 10,000 people in the last year — have done so in a
conscientious way that means that the people left understand how their
systems run and their inherent issues? Do you think that the
management-types obsessed with the unsustainable AI boom are investing
heavily in making sure their organizations are rigorously protected
against, say, one bad line of code? Do they even know who wrote the
code of their current systems? Is that person still there? If not, is
that person at least contracted to make sure that something nuanced
about the system in question isn’t mistakenly removed?

They’re not. They’re not there anymore. Only a few months ago Google
laid off 200 employees from the core of its organization, outsourcing
their roles to Mexico and India in a cost-cutting measure the quarter
after the company made over $23 billion in profit. Silicon Valley —
and big tech writ large — is not built to protect against situations
like the one we’re seeing today, because their culture is cancerous.
It values growth at all costs, with no respect for the human capital
that empowers organizations or the value of building rigorous,
quality-focused products.

This is just the beginning. Big tech is in the throes of perdition,
teetering over the edge of the abyss, finally paying the harsh cost of
building systems as fast as possible. This isn’t simply moving fast
and breaking things, but doing so without any regard for the speed at
which you’re moving, and then firing the people that broke them, the
people who know what’s broken, and possibly the people that know how
to fix them.

And it’s not just tech! Boeing — a company I’ve already shat on in
this post, and one I’ll likely return to in future newsletters,
largely because it exemplifies the short-sightedness of today’s
managerial class — has, over the past 20 years or so, spun off huge
parts of the company (parts that, at one point, were vitally
important) into separate companies, laid off thousands of employees at
a time, and outsourced software dev work to $9-an-hour bodyshop
engineers. It hollowed itself out until there was nothing left.

And tell me, knowing what you know about Boeing today, would you
rather get into a 737 Max or an Airbus A320neo? Enough said.

As these organizations push their engineers harder, said engineers
will turn to AI-generated code, poisoning codebases with insecure and
buggy code as companies shed staff to keep up with Wall Street’s
demands in ways that I’m not sure people are capable of understanding.
The companies that run the critical parts of our digital lives do not
invest in maintenance or infrastructure with the intentionality that’s
required to prevent the kinds of massive systemic failures you see
today, and I need you all to be ready for this to happen again.

This is the cost of the Rot Economy — systems used by billions of
people held up by flimsy cultures and brittle infrastructure
maintained with the diligence of an absentee parent. This is the cost
of arrogance, of rewarding managerial malpractice, of promoting speed
over safety and profit over people.

Every single major tech organization should see today as a wakeup call
— a time to reevaluate the fundamental infrastructure behind every
single tech stack.

What I fear is that they’ll simply see it as someone else’s problem -
which is exactly how we got here in the first place.
Henry Crun
2024-07-21 12:37:06 UTC
Post by Subcommandante XDelta
A meditation on the Antithesis of the VMS Ethos, and the DEC way.
A freshly minted neologism: "CloudStrucked" (six ways Sundays, and
then some)
https://www.wheresyoured.at/crowdstruck-2/
<Content snipped>

Thanks for the pointer!

Mike
--
No Micro$oft products were used in the URLs above, or in preparing this message. Recommended reading:
http://www.catb.org/~esr/faqs/smart-questions.html#befor
Craig A. Berry
2024-07-21 12:55:18 UTC
Post by Subcommandante XDelta
The problem here is that Crowdstrike pushed out an evidently broken
kernel driver that locked whatever system that installed it in a
permanent boot loop. The system would start loading Windows, encounter
a fatal error, and reboot. And reboot. Again and again. It, in
essence, rendered those machines useless.
It was not a kernel driver. It was a bad configuration file that
normally gets updated several times a day:

https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/

The bad file was only in the wild for about an hour and a half. Folks
in the US who powered off Thursday evening and didn't get up too early
Friday would've been fine. Of course Europe was well into their work
day, and a lot of computers stay on overnight.

The boot loop may or may not be permanent -- lots of systems have
eventually managed to get the corrected file by doing nothing other than
repeated reboots. No, that doesn't always work.

The update was "designed to target newly observed, malicious named pipes
being used by common C2 frameworks in cyberattacks."
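The "configuration file driving resident code" idea Craig describes can be made concrete with a toy user-space sketch: the detection code stays fixed, while the indicator rules (here, suspicious named-pipe patterns) arrive as data updates several times a day. The rule format below is invented for illustration; the real channel files are proprietary and are parsed in kernel mode.

```python
# Toy illustration of configuration-driven detection: fixed code whose
# behavior is reshaped by frequently delivered rule data, analogous to
# the "channel files" described above. The one-glob-per-line rule format
# is a made-up stand-in for the real, proprietary format.
import fnmatch

class PipeIndicatorChecker:
    """Flags named-pipe names that match any loaded indicator pattern."""

    def __init__(self) -> None:
        self.patterns: list[str] = []

    def load_update(self, channel_data: str) -> None:
        """Replace the rule set from a newly delivered update."""
        self.patterns = [ln.strip()
                         for ln in channel_data.splitlines() if ln.strip()]

    def is_suspicious(self, pipe_name: str) -> bool:
        return any(fnmatch.fnmatch(pipe_name, p) for p in self.patterns)
```

The takeaway: shipping new data changes behavior just as surely as shipping new code, so it arguably deserves the same deployment caution.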

Most likely what makes CrowdStrike popular is that they are continuously
updating countermeasures as threats are observed, but that flies in the
face of normal deployment practices where you don't bet the farm on a
single update that affects all systems all at once. For example, in
Microsoft Azure, you can set up redundancy for your PaaS and SaaS
offerings so that if an update breaks all the servers in one data
center, your services are still up and running in another. Most
enterprises will have similar planning for private data centers.

CrowdStrike thought updating the entire world in an instant was a good
idea. While no one wants to sit there vulnerable to a known threat for
any length of time, I suspect that idea will get revisited. If they had
simply staggered the update over a few hours, the catastrophe would have
been much smaller. Customers will likely be asking for more control
over when they get updates, and, for example, wanting to set up
different update channels for servers and PCs.
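The staggered-rollout idea above can be sketched as a simple ring deployment: push to a small canary group first, check health, then widen, so a bad update bricks one ring rather than the whole fleet. This is an illustrative sketch, not any vendor's actual mechanism; all names are invented.

```python
# Illustrative sketch of a staggered ("ring") rollout. Not CrowdStrike's
# or Azure's actual mechanism; ring sizes and names are hypothetical.
from typing import Callable, Sequence

def staged_rollout(hosts: Sequence[str],
                   apply_update: Callable[[str], bool],
                   ring_sizes: Sequence[int] = (1, 10, 100)) -> list[str]:
    """Update hosts in growing rings; stop at the first ring with a failure.

    apply_update returns True if the host is healthy after the update.
    Returns the hosts that were successfully updated.
    """
    updated: list[str] = []
    start = 0
    for size in ring_sizes:
        ring = hosts[start:start + size]
        if not ring:
            break
        results = [apply_update(h) for h in ring]
        updated.extend(h for h, ok in zip(ring, results) if ok)
        if not all(results):
            # A bad update takes out one ring, not the whole fleet.
            return updated
        start += size
    # Remaining hosts get the update only after every ring passed.
    updated.extend(h for h in hosts[start:] if apply_update(h))
    return updated
```

Even a few hours between rings would have turned Friday's catastrophe into a contained incident.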
Arne Vajhøj
2024-07-21 13:50:36 UTC
Post by Subcommandante XDelta
The problem here is that Crowdstrike pushed out an evidently broken
kernel driver that locked whatever system that installed it in a
permanent boot loop. The system would start loading Windows, encounter
a fatal error, and reboot. And reboot. Again and again. It, in
essence, rendered those machines useless.
It was not a kernel driver. It was a bad configuration file that
normally gets updated several times a day:
https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
So not a driver.

But I will not blame anyone for assuming that a .SYS file under
C:\Windows\System32\drivers was a driver.
The bad file was only in the wild for about an hour and a half.  Folks
in the US who powered off Thursday evening and didn't get up too early
Friday would've been fine.  Of course Europe was well into their work
day, and lot of computers stay on overnight.
The impact was pretty huge.
<Content snipped>
CrowdStrike thought updating the entire world in an instant was a good
idea. While no one wants to sit there vulnerable to a known threat for
any length of time, I suspect that idea will get revisited. If they had
simply staggered the update over a few hours, the catastrophe would have
been much smaller.  Customers will likely be asking for more control
over when they get updates, and, for example, wanting to set up
different update channels for servers and PCs.
I have already seen speculation that IT security will decrease because
patch deployment speed will slow down.

Arne

PS: I don't like the product!
Craig A. Berry
2024-07-21 17:57:06 UTC
Post by Arne Vajhøj
It was not a kernel driver. It was a bad configuration file that
normally gets updated several times a day:
https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
So not a driver.
But I will not blame anyone for assuming that a .SYS file under
C:\Windows\System32\drivers was a driver.
It was a reasonable guess, but the OP claimed that Microsoft's kernel
driver approval process was somehow involved, which doesn't seem to be
the case. On the other hand, a kernel driver that can reconfigure
itself multiple times a day from data obtained over the network may
avoid some kinds of problems, but clearly it can cause others.
Post by Arne Vajhøj
CrowdStrike thought updating the entire world in an instant was a good
idea. While no one wants to sit there vulnerable to a known threat for
any length of time, I suspect that idea will get revisited.
I have already seen speculation that IT security will decrease because
patch deployment speed will slow down.
If you update too slowly, you are vulnerable. If you update everything
immediately all at once world-wide, you risk catastrophic failure. There
is no free lunch.
Post by Arne Vajhøj
Arne
PS: I don't like the product!
Since Friday you probably have a lot of company :-).
Lawrence D'Oliveiro
2024-07-21 21:37:54 UTC
Post by Arne Vajhøj
I have already seen speculation that IT security will decrease because
patch deployment speed will slow down.
Consider that non-CrowdStrike customers, and even non-Windows-using
CrowdStrike customers, were not affected.

Therefore, would not a more logical conclusion be: “don’t put all your
eggs in one basket”? Spread your Windows systems around different security
providers, and perhaps make more use of non-Windows systems?
Gary R. Schmidt
2024-07-22 05:38:37 UTC
Post by Lawrence D'Oliveiro
Post by Arne Vajhøj
I have already seen speculation that IT security will decrease because
patch deployment speed will slow down.
Consider that non-CrowdStrike customers, and even non-Windows-using
CrowdStrike customers, were not affected.
Therefore, would not a more logical conclusion be: “don’t put all your
eggs in one basket”? Spread your Windows systems around different security
providers, and perhaps make more use of non-Windows systems?
But do be aware that a few months ago Crowdstrike bricked a bunch of
Linux and Mac boxes.

The problem, methinks, is not in the OS...

Cheers,
Gary B-)
Lawrence D'Oliveiro
2024-07-22 06:03:17 UTC
Post by Gary R. Schmidt
But do be aware that a few months ago Crowdstrike bricked a bunch of
Linux and Mac boxes.
Any details?
Gary R. Schmidt
2024-07-22 08:24:14 UTC
Post by Lawrence D'Oliveiro
Post by Gary R. Schmidt
But do be aware that a few months ago Crowdstrike bricked a bunch of
Linux and Mac boxes.
Any details?
https://www.theregister.com/2024/07/21/crowdstrike_linux_crashes_restoration_tools/

https://www.neowin.net/news/crowdstrike-broke-debian-and-rocky-linux-months-ago-but-no-one-noticed/
Arne Vajhøj
2024-07-22 12:58:08 UTC
Post by Gary R. Schmidt
Post by Lawrence D'Oliveiro
Post by Gary R. Schmidt
But do be aware that a few months ago Crowdstrike bricked a bunch of
Linux and Mac boxes.
Any details?
https://www.theregister.com/2024/07/21/crowdstrike_linux_crashes_restoration_tools/
https://www.neowin.net/news/crowdstrike-broke-debian-and-rocky-linux-months-ago-but-no-one-noticed/
They do not have a good track record.

A few weeks ago:

https://www.thestack.technology/crowdstrike-bug-maxes-out-100-of-cpu-requires-windows-reboots/

That is back on Windows, but I don't think MS can take any
responsibility for that.

Arne
Lawrence D'Oliveiro
2024-07-23 00:13:46 UTC
Post by Gary R. Schmidt
https://www.neowin.net/news/crowdstrike-broke-debian-and-rocky-linux-months-ago-but-no-one-noticed/
“No one noticed” ... perhaps because hardly anybody is using these
sorts of intrusive EDR products on Linux?

Another thing that reduces the chance of screwups: a poster in another
group gave this link to a comment by long-time Linux contributor
Matthew Garrett: on Windows, CrowdStrike has to load its own
proprietary kernel driver to do its anti-malware checks, but on Linux
they just rely on the standard configurable eBPF facility. This helps
to reduce the chance of things going wrong.

<https://nondeterministic.computer/@mjg59/112816011370924959>
Simon Clubley
2024-07-22 12:34:28 UTC
Post by Craig A. Berry
Post by Subcommandante XDelta
The problem here is that Crowdstrike pushed out an evidently broken
kernel driver that locked whatever system that installed it in a
permanent boot loop. The system would start loading Windows, encounter
a fatal error, and reboot. And reboot. Again and again. It, in
essence, rendered those machines useless.
It was not a kernel driver. It was a bad configuration file that
normally gets updated several times a day:
https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
If it's something that can stop the system from booting, then it _should_
be treated as if it _was_ a kernel driver.

IOW, what on earth happened to the concept of a Last Known Good boot to
automatically recover from such screwups ? Windows 2000, over 2 decades
ago, had an early version of the LKG boot concept for goodness sake.

What _should_ have happened, and what should have been built into Windows
years ago as part of the standard procedures for updating system components,
is that the original version of files that were used during the last good
boot were preserved in a backup until the next successful boot.

After that, the preserved files would be overwritten with the updated
versions. OTOH, if the next boot fails, the last known good configuration
is restored and another reboot done, but exactly _once_ only. (If the LKG
boot fails, then it's probably some hardware failure or other external
factor).
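The scheme described above can be sketched as a small state machine: boot with the new files, and on failure restore the preserved last-known-good set and retry exactly once. This is purely an illustration of the proposal in this post, not of how Windows actually implements LKG.

```python
# Sketch of the Last Known Good (LKG) scheme described above: try the
# new file set; on a failed boot, restore the preserved set and retry
# once only. Illustrative only -- not Windows' actual implementation.
from typing import Callable, Dict

def boot_with_lkg(new_files: Dict[str, bytes],
                  lkg_files: Dict[str, bytes],
                  try_boot: Callable[[Dict[str, bytes]], bool]) -> str:
    """Return 'ok', 'ok-after-rollback', or 'hardware-or-external-failure'."""
    if try_boot(new_files):
        # Successful boot: the new set becomes the preserved LKG set.
        lkg_files.clear()
        lkg_files.update(new_files)
        return "ok"
    # Failed boot: restore the last known good set and retry exactly once.
    if try_boot(dict(lkg_files)):
        return "ok-after-rollback"
    # If even the LKG set fails, suspect hardware or an external factor.
    return "hardware-or-external-failure"
```

The hard part, as noted later in the thread, is deciding what counts as a "good boot" when the crash happens late in startup.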
Post by Craig A. Berry
<Content snipped>
CrowdStrike thought updating the entire world in an instant was a good
idea. While no one wants to sit there vulnerable to a known threat for
any length of time, I suspect that idea will get revisited. If they had
simply staggered the update over a few hours, the catastrophe would have
been much smaller. Customers will likely be asking for more control
over when they get updates, and, for example, wanting to set up
different update channels for servers and PCs.
Or modern Windows could simply fully implement the LKG boot concept.

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
Arne Vajhøj
2024-07-22 12:54:36 UTC
Post by Simon Clubley
Post by Craig A. Berry
Post by Subcommandante XDelta
The problem here is that Crowdstrike pushed out an evidently broken
kernel driver that locked whatever system that installed it in a
permanent boot loop. The system would start loading Windows, encounter
a fatal error, and reboot. And reboot. Again and again. It, in
essence, rendered those machines useless.
It was not a kernel driver. It was a bad configuration file that
normally gets updated several times a day:
https://www.crowdstrike.com/blog/falcon-update-for-windows-hosts-technical-details/
If it's something that can stop the system from booting, then it _should_
be treated as if it _was_ a kernel driver.
It was config for and impacting behavior of kernel code.

So yes.
Post by Simon Clubley
IOW, what on earth happened to the concept of a Last Known Good boot to
automatically recover from such screwups ? Windows 2000, over 2 decades
ago, had an early version of the LKG boot concept for goodness sake.
What _should_ have happened, and what should have been built into Windows
years ago as part of the standard procedures for updating system components,
is that the original version of files that were used during the last good
boot were preserved in a backup until the next successful boot.
After that, the preserved files would be overwritten with the updated
versions. OTOH, if the next boot fails, the last known good configuration
is restored and another reboot done, but exactly _once_ only. (If the LKG
boot fails, then it's probably some hardware failure or other external
factor).
Definitely a good concept.

Note though that it would require a smart definition of a good boot.

The problem happened rather late in startup, and Windows may very well
have considered the startup to have completed successfully.

Arne
Lawrence D'Oliveiro
2024-07-23 00:25:58 UTC
Permalink
Post by Arne Vajhøj
It was config for and impacting behavior of kernel code.
And it was not subject to the configuration option for turning off
automatic updates. Updates for these files were forced through anyway.
Simon Clubley
2024-07-23 12:20:52 UTC
Permalink
Post by Lawrence D'Oliveiro
Post by Arne Vajhøj
It was config for and impacting behavior of kernel code.
And it was not subject to the configuration option for turning off
automatic updates. Updates for these files were forced through anyway.
Un-bloody-believable. :-( :-(

I hope this doesn't turn out to be some clueless cretin who thought
they knew better than anyone else when creating an update and has now
just discovered the hard way that they did not.

I read somewhere that the file that got pushed had nulls in it and
was not identical to the one that was tested. :-(

It also turns out their fully-privileged kernel mode driver didn't do
the proper level of validation on this file. (So once again, we are
back to a clueless cretin who thought they knew better than anyone
else, but this time we are talking about the kernel-mode driver
writer. :-( )

And no, that is _NOT_ with the benefit of hindsight. When you are writing
this kind of code, you don't trust _anything_ external (or even your own
code :-) ), and you instead validate and perform cross-checks accordingly.
And yes, this _is_ standard practice for any code I write.
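That validate-everything discipline can be illustrated in a few lines.
The header layout below is entirely invented (the real channel-file
format is not public); the point is simply that privileged code refuses
to act on bytes that fail cheap cross-checks, including the all-nulls
case reported here:

```python
import struct

MAGIC = b"CFG1"  # hypothetical file-type tag, not the real format

def validate_channel_file(blob: bytes) -> bool:
    """Defensive checks to run before any privileged code ever
    dereferences the contents of an externally supplied file."""
    if len(blob) < 12:
        return False                   # too short to hold a header
    if blob[:4] != MAGIC:
        return False                   # wrong file type entirely
    count, payload_len = struct.unpack_from("<II", blob, 4)
    if payload_len != len(blob) - 12:
        return False                   # declared size disagrees with actual
    if count == 0 or blob[12:] == b"\x00" * payload_len:
        return False                   # empty or all-null payload is garbage
    return True
```

A checksum over the payload would normally round this out; the cheap
checks alone already reject a file of nothing but nulls.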

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
Lawrence D'Oliveiro
2024-07-21 21:39:29 UTC
Permalink
As we speak, millions -- or even hundreds of millions -- of different
Windows-based computers are now stuck in a doom-loop ...
Microsoft’s count was 8.5 million. Not that huge a number, really.

Maybe it’s a sign that not as many people depend on Windows for such
mission-critical systems as you might expect.
Simon Clubley
2024-07-22 12:48:54 UTC
Permalink
As we speak, millions -- or even hundreds of millions -- of different
Windows-based computers are now stuck in a doom-loop ...
Microsoft's count was 8.5 million. Not that huge a number, really.
It's used mainly in business/professional environments.

That makes 8.5 million a _huge_ number.

BTW, I found this while trying to find out more about the company and
I wonder if they are planning to update it anytime soon to tone it down:

https://www.crowdstrike.com/careers/diversity-equity-and-inclusion/

They talk a lot about how they make people feel good about themselves,
but nothing about how they cultivate people to produce robust reliable
software.

That page above seems seriously OTT, so I just hope their development
processes are engineering-based instead of feeling-based, given how
critical a company they have become.

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
Arne Vajhøj
2024-07-22 13:24:38 UTC
Permalink
Post by Simon Clubley
BTW, I found this while trying to find out more about the company and
https://www.crowdstrike.com/careers/diversity-equity-and-inclusion/
They talk a lot about how they make people feel good about themselves,
but nothing about how they cultivate people to produce robust reliable
software.
That page above seems seriously OTT, so I just hope their development
processes are engineering-based instead of feeling-based, given how
critical a company they have become.
I suspect that page was created by people from either HR or
a dedicated DEI team who are not able to distinguish between
a C program and a Java program.

:-)

But if you combine "the big problem", "the Linux problem"
and "the Windows CPU usage problem" which are 3 big problems
within a few months, then I would say that they have
"room for improvements in software quality".

:-)

Arne
Arne Vajhøj
2024-07-22 13:29:11 UTC
Permalink
Post by Arne Vajhøj
But if you combine "the big problem", "the Linux problem"
and "the Windows CPU usage problem" which are 3 big problems
within a few months, then I would say that they have
"room for improvements in software quality".
:-)
Which is not unique in any way.

Old joke:

<joke>
At a software development conference in a session about
software quality the speaker looked out at the audience and
asked "You have just boarded a plane and you realize that the
software of that plane has been developed by your team. Who
would stay on the plane?". Only one guy raised his hand. So
the speaker asked him "How do you ensure such high quality
that you feel safe staying on the plane?". And the guy
answered "Our quality sucks - there is no chance that
the plane would make it from the gate out to the takeoff
runway, so no need to get off.".
</joke>

Arne
Craig A. Berry
2024-07-22 13:27:22 UTC
Permalink
Post by Simon Clubley
BTW, I found this while trying to find out more about the company and
https://www.crowdstrike.com/careers/diversity-equity-and-inclusion/
They talk a lot about how they make people feel good about themselves,
but nothing about how they cultivate people to produce robust reliable
software.
I think it's the other way around. They have a bad reputation for how
they treat their employees so have made efforts to correct that image.

None of which is relevant to policies around testing a new configuration
before deploying to the entire world all at once.
Simon Clubley
2024-07-22 18:02:39 UTC
Permalink
Post by Craig A. Berry
Post by Simon Clubley
BTW, I found this while trying to find out more about the company and
https://www.crowdstrike.com/careers/diversity-equity-and-inclusion/
They talk a lot about how they make people feel good about themselves,
but nothing about how they cultivate people to produce robust reliable
software.
I think it's the other way around. They have a bad reputation for how
they treat their employees so have made efforts to correct that image.
None of which is relevant to policies around testing a new configuration
before deploying to the entire world all at once.
If they have an engineering culture, you are correct.

If they have a feelings-based culture, the situation is more complicated.

Part of an engineering approach is that you push back on shortcuts and
daft ideas. In a feelings-based culture, you may be afraid to push back
on something because you don't want to be accused of "causing offence"
or some other nonsense because the people you are addressing don't know
how to handle negative feedback.

One of the things I say to people, when I take part in discussions about
some new thing or idea I am responsible for, is to push back on me if
they think the idea is daft or if they can see something I have missed.
I tell people I will not be annoyed if they actively disagree with me,
but I will be annoyed if they see something and don't say anything.

And yes, sometimes people see something I have missed or see a better way
of doing something and _that_ is part of the engineering approach.

Simon.
--
Simon Clubley, ***@remove_me.eisner.decus.org-Earth.UFP
Walking destinations on a map are further away than they appear.
Arne Vajhøj
2024-07-24 01:14:40 UTC
Permalink
Post by Simon Clubley
BTW, I found this while trying to find out more about the company and
https://www.crowdstrike.com/careers/diversity-equity-and-inclusion/
They talk a lot about how they make people feel good about themselves,
but nothing about how they cultivate people to produce robust reliable
software.
That page above seems seriously OTT, so I just hope their development
processes are engineering-based instead of feeling-based, given how
critical a company they have become.
Regarding their culture, it has been mentioned in the press
that their CEO was once CTO at McAfee.

https://www.businessinsider.com/crowdstrike-ceo-george-kurtz-tech-outage-microsoft-mcafee-2024-7

I guess that CrowdStrike's and McAfee's markets are close enough
for that to make business sense.

But I would not want to duplicate McAfee's approach to
engineering quality.

Arne
Arne Vajhøj
2024-07-22 13:04:12 UTC
Permalink
Post by Lawrence D'Oliveiro
As we speak, millions -- or even hundreds of millions -- of different
Windows-based computers are now stuck in a doom-loop ...
Microsoft’s count was 8.5 million. Not that huge a number, really.
Maybe it’s a sign that not as many people depend on Windows for such
mission-critical systems as you might expect.
The number is about as high as was to be expected.

Math like:

1000 million Windows PC's & 1/2 business 1/2 private
=>
500 million business Windows PC's

500 million business Windows PC's & CrowdStrike on 1/3
=>
roughly 167 million business Windows PC's with CrowdStrike

167 million business Windows PC's with CrowdStrike & bad file available
for 1.5 hours & assumption about restarts evenly distributed over 24
hours (which is not that unrealistic if looking worldwide)
=>
expected roughly 10 million PC's to be impacted
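Redone in code, and noting that 500/3 is nearer 167 million than 133,
the same rough inputs land at about 10 million, still the same order of
magnitude as Microsoft's reported 8.5 million:

```python
# Back-of-envelope estimate using the rough assumptions above.
total_windows_pcs = 1_000_000_000    # ~1000 million Windows PCs
business_share = 1 / 2               # half business, half private
crowdstrike_share = 1 / 3            # CrowdStrike on a third of those
window_hours = 1.5                   # bad file was live for 1.5 hours
hours_per_day = 24                   # restarts assumed uniform over 24h

with_crowdstrike = total_windows_pcs * business_share * crowdstrike_share
impacted = with_crowdstrike * (window_hours / hours_per_day)
print(f"{with_crowdstrike / 1e6:.0f} million with CrowdStrike, "
      f"~{impacted / 1e6:.1f} million impacted")
```

This prints "167 million with CrowdStrike, ~10.4 million impacted";
every input is, of course, only a guess.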

:-)

Arne
Arne Vajhøj
2024-07-22 13:09:18 UTC
Permalink
Post by Lawrence D'Oliveiro
As we speak, millions -- or even hundreds of millions -- of different
Windows-based computers are now stuck in a doom-loop ...
Microsoft’s count was 8.5 million. Not that huge a number, really.
Maybe it’s a sign that not as many people depend on Windows for such
mission-critical systems as you might expect.
It may also be worth noting that a lot of the problems were caused
by the issue hitting "individually non-critical but group critical PC's".

Meaning that you may have 1000 desktop PC's running some
business GUI - if 1 or 10 or even 25 of these go down it has
very little impact, but if all 1000 are down (because they had
Crowdstrike and they were all rebooted in the bad 1.5 hour window)
then it has a huge impact.
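A little binomial arithmetic makes the group-criticality point concrete
(fleet size and reboot window are the figures above; the independence
assumption is the interesting part, since a coordinated fleet-wide
reboot breaks it and takes everything down together):

```python
import math

def prob_at_least(n: int, p: float, k: int) -> float:
    """P[X >= k] for X ~ Binomial(n, p): the chance that at least k
    of n independently rebooting PCs caught the bad update window."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

fleet = 1000          # desktop PCs running the business GUI
p_bad = 1.5 / 24      # chance a given PC rebooted in the bad window

# Uncorrelated reboots: around 62 of 1000 machines down - an annoyance.
expected_down = fleet * p_bad

# All 1000 down at once is vanishingly unlikely under independence...
p_all_down = prob_at_least(fleet, p_bad, fleet)
# ...but a scheduled overnight restart makes the reboots perfectly
# correlated, and then the whole group fails as a unit.
```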

Arne
Stephen Hoffman
2024-07-29 16:58:51 UTC
Permalink
Post by Subcommandante XDelta
A meditation on the Antithesis of the VMS Ethos, and the DEC way.
A heady mix of entertainment and omissions and economically-problematic
hopes and dreams, that.

Brandolini's Law is always in scope, of course. The bulk of the
citations first:

CrowdStrike-related:
https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/
https://forums.rockylinux.org/t/crowdstrike-freezing-rockylinux-after-9-4-upgrade/14041

https://www.thestack.technology/crowdstrike-bug-maxes-out-100-of-cpu-requires-windows-reboots/


Microsoft has had legal entanglements here:
https://www.techtarget.com/searchsecurity/news/450420491/Microsoft-accused-of-blocking-independent-antivirus-competition

https://www.theregister.com/2024/07/22/windows_crowdstrike_kernel_eu/

Microsoft has been working on security here:
https://www.microsoft.com/en-us/security/blog/2021/12/08/improve-kernel-security-with-the-new-microsoft-vulnerable-and-malicious-driver-reporting-center/

https://learn.microsoft.com/en-us/windows/win32/services/protecting-anti-malware-services-

https://opensource.microsoft.com/blog/2021/05/10/making-ebpf-work-on-windows/

Other vendors have been moving kernel code to user mode, and reducing
the apps that can load extensions, which is somewhat helpful for
security and definitely helpful for avoiding kernel crashes, but then
attacks against user-mode code with access to kernel APIs can be bad,
too.
https://developer.apple.com/support/kernel-extensions/
https://www.sweetwater.com/sweetcare/articles/kernel-extensions-on-mac-with-apple-silicon/

https://ebpf.io on Linux
https://developer.apple.com/documentation/coreservices/file_system_events
(and
https://www.crowdstrike.com/blog/using-os-x-fsevents-discover-deleted-malicious-artifact/
and
https://www.crowdstrike.com/blog/i-know-what-you-did-last-month-a-new-artifact-of-execution-on-macos-10-13/
)
https://support.apple.com/guide/security/welcome/web
https://developer.apple.com/documentation/endpointsecurity


As for kernel mode APIs and design more generally, OpenVMS has gaps
here too, with VCI being the not-really-equivalent and
not-generally-documented API for network interface. And it's a kernel
API, with all that entails. The closest analog to the file change
notification API (FSEvents-like) is parsing security alarms arriving
via an app-declared mailbox, something which I've encountered in only a
handful of apps. An approach which gets scruffy. The only
kernel-code-accessing-user-mode mechanism in OpenVMS is the
ill-documented ACP mechanism, which really isn't an isolation mechanism
given it's passing around kernel data structure pointers such as I/O
request packets. Having written various ACPs, that all works pretty
well, but the APIs are very much set up for mounting and dismounting
file systems, and areas such as mount and dismount are completely
lacking customizations, which usually means writing up your own $mount
and $dismou analog. ACPs aren't a great way to avoid kernel code, and
are more intended for allowing kernel code to call outer-mode APIs.
Which is definitely scruffy. IIRC, the TCP/IP Services package — why
that's still separately installed, a packaging decision straight out of
the last millennium — has a kernel callout for packet filtering too,
but that's still not documented AFAIK.

In short, there's no good place to tie in endpoint security, or tools
akin to CrowdStrike. There are no endpoint security APIs.

Outside of legal entanglements, biggest issue with APIs and API-level
changes for Microsoft is app and API compatibility, and there's a
lineage there from Microsoft back through MICA to OpenVMS and the goal
of OpenVMS compatibility, too. A laudable goal, with
occasionally-intractable results. Such as trying to stuff a modern and
robust password hash into an eight-byte field.

As for the referenced mess, CrowdStrike was basically testing in
production, and seemingly lacked any sort of continuous integration
(whatever validation they did have reportedly returned a "yep" when the
content wasn't actually tested), and given that vendor's other recent
issues on other platforms, they haven't particularly been learning how
to deal with and reduce the damage arising from their own errors.
Maybe hiring a billionaire former CTO of McAfee as your CEO didn't work
out?
https://en.wikipedia.org/wiki/Continuous_integration
https://www.businessinsider.com/crowdstrike-ceo-george-kurtz-tech-outage-microsoft-mcafee-2024-7?op=1


Alternatives to CrowdStrike exist with some vendors, Microsoft has
Defender (whatever its proper product name is now), Apple has XProtect
and XProtect Remediator and the Signed System Volume and App
Notarization. OpenVMS has no analog. (Yeah, I think you can actually
sign stuff with the long-deprecated CDSA, but I've never seen anybody
use that mechanism outside of OpenVMS Secure Delivery, which itself
moved away from CDSA.) There have been third-party apps that tried to
manage malware and change control on OpenVMS too, and DEC had
DECinspect.


As for the OpenVMS Ethos, the problems and the systems and the
interconnections are vastly more complex than OpenVMS is, and the pace
of required changes in many environments is necessarily far faster
than OpenVMS has ever managed. Any snarking at billionaires and at
ever-loquacious newsletter texts aside, this ever-increasing complexity
is built upon myriad very difficult problems and dependencies. We
aren't ever going back to the pre-millennial era of simpler and less
interconnected computing, either.

Ever-increasing complexity? Yeah. There are issues with Secure Boot and
with self-bricking Intel Raptor Lake 65W+ processors, among many other
recent problems:
https://arstechnica.com/security/2024/07/secure-boot-is-completely-compromised-on-200-models-from-5-big-device-makers/

https://www.tomshardware.com/pc-components/cpus/intel-cpu-instability-crashing-bug-includes-65w-and-higher-skus-intel-says-damage-is-irreversible-no-planned-recall


Yeah, and CrowdStrike absolutely blew it. I expect Microsoft will use
some of the fallout to push vendors toward supported APIs, though that
push won't be free of vendor complaints, and not without the
possibility and the risks of poorly-secured or poorly-written
user-mode code now causing mayhem.
--
Pure Personal Opinion | HoffmanLabs LLC
Lawrence D'Oliveiro
2024-07-29 21:36:37 UTC
Permalink
... with occasionally-intractable results. Such as trying to stuff a
modern and robust password hash into an eight-byte field.
The Unix tradition of text-based config files (in this case, /etc/shadow)
wins again.
As for the referenced mess, CrowdStrike was basically testing in
production, and seemingly lacked any sort of continuous integration ...
They advertise it as a positive point, that they can respond to new
security threats faster than other companies--certainly faster than
Microsoft.

And yes, they do it by cutting corners on testing. I’ve seen many other
comments raising the hoary old “never implement new system changes on a
Friday” meme ... but what happens if the malware writers release a zero-
day on a Friday?
