We decided to move 90% of our workload from the cloud to on-prem infrastructure

manigandham · on April 12, 2022

> "Starting a web-based or SaaS (Software as a Service) business was virtually unheard of before the age of IaaS"

Nonsense. There were plenty of SaaS startups. There was even a little event called the dotcom boom all about internet companies. This lack of history and experience is why new companies get into this cloud-first mess in the first place.

Cloud is primarily for flexibility in iteration, dynamic scaling, or complex configurations that would be otherwise hard to do. If you have steady-state load like this then a few servers in your colo is pretty simple and far cheaper.

Companies also vastly overestimate their scale when their entire business could probably fit on a single commodity server.

the_duke · on April 12, 2022

In addition: long before AWS you could easily rent virtualized or dedicated servers.

It took longer to provision than EC2 and all you had for storage is a fixed amount of disk, but that was absolutely sufficient for many businesses.

vmception · on April 12, 2022

> but that was absolutely sufficient for many businesses.

yes, but I am glad about how many job titles are obsoleted for many businesses due to how compute instances are managed.

much smaller organizations used to need a full blown database administrator or two, and other personnel dedicated to keeping the server up. or you were doing it all yourself and spending your time on that.

much higher barrier of entry than today where you have an untold number of computers spun up for you in an instant and a bunch of cached versions on yet more computers in the CDN and another process keeping those caches updated, with you just thinking its one single instance used because you're on the hobby plan.

mywittyname · on April 12, 2022

Honestly, it didn't seem to be that different back then than it is now.

When I worked at smaller places, developers handled infrastructure and development. As the company grew, dedicated specialists came onboard to help.

Today, at smaller places, developers handle the cloud infrastructure, and as the company grows, they bring on dedicated specialists to help.

The biggest difference, I think, is that we have so many specialized products. We are no longer trying to figure out how to make a shoehorned relational DB scale, instead we start with a database designed for specific workloads.

simonebrunozzi · on April 12, 2022

> In addition: long before AWS you could easily rent virtualized or dedicated servers.

I disagree with this statement. Yes, you could rent, but not by the hour and based on compute power, and couldn't rent extra storage again by the hour and by the GB. Plus, you couldn't interact with these "virtual servers" through APIs.

I was at AWS 2008-2014 (early days!), and I think you should consider the impact of the "on-demand", API-by-default, nature of the AWS offering. Oh, and don't forget that with a valid credit card you could be up and running in literally minutes, not weeks.

Back then AWS had decent performance, but it was pretty bad when compared to more traditional Colo offerings; but in regards to the above aspects it dominated the scene, undisputed. IMHO, that's what gave AWS most of the initial traction.

GordonS · on April 13, 2022

> Yes, you could rent, but not by the hour and based on compute power, and couldn't rent extra storage again by the hour and by the GB

Right - because it's so cheap you don't need to rent by the hour.

robertlagrant · on April 25, 2022

Think of the case we might have at my company: we occasionally need a load of GPUs to train a new ML model. We spin up some beefy VMs with stonking great GPUs, churn away for a while, and then spin them down with our new ML model ready to go.

lsaferite · on April 12, 2022

>> In addition: long before AWS you could easily rent virtualized or dedicated servers. > I disagree with this statement.

Your disagreement here is without merit. You are talking about facets that were never even mentioned by the GP. If you wanted to list those things off as why the previous situation was suboptimal, fine, do that. But how can you disagree with a (presumably) completely factual statement?

manigandham · on April 12, 2022

Definitely more process involved, but the requirements were also simpler compared to sprawling modern architectures. I think the overall effort is similar, but was slower due to the speed of communications and paperwork at that time.

Getting hard assets is a core part of business in basically every industry, so I found it strange to claim that it was some major obstacle just because it happened to be servers instead of trucks or factory equipment.

wongarsu · on April 12, 2022

I can only speak for the last 15 years or so, but any time I rented virtual or dedicated servers the experience was basically the same as ordering a book from Amazon: you create an account, select what you want and how you want to pay, the next day you have an email with IP and credentials.

Of course since then things have improved and you can now expect your sever to be provisioned within minutes, along with a nice dashboard to manage it. It's a bit more involved if you want to build your own server and put it in colocation somewhere, mostly because that involves being physically present.

emteycz · on April 12, 2022

Well speaking about the last 15 years, so 2007 to today, for me it was the same as modern clouds - register, add a credit card, order a server, few minutes later the root password arrives in e-mail. And oldschool LAMP webhosting is basically serverless cloud.

Edit: Actually I remember using actual cloud provider back in 2008 - it was called Virtualmaster, one of the first cloud providers in central EU. They offered a free (!) VPS with 256 MB RAM and 1/4 CPU - I had a minimalist Debian image that allowed me to run full LAMP stack on it and an IRC client in tmux session.

Another cool central EU provider was 4smart. Their offering used to be that you only paid for resources you actually used - but for VMs. If you consumed only 10MB RAM, you only paid for that. I had servers cheaper than 1 EUR/month running continuosly with their own IPv4 there. They changed the pricing structure after few years.

chromaton · on April 12, 2022

Linode was founded in 2003.

yellowapple · on April 12, 2022

Hell, it still is sufficient for many businesses.

_hcuq · on April 12, 2022

I still use them. Easier than fiddling with AWS.

post-it · on April 12, 2022

The biggest thing for me is no surprise bills. Sure, it's pretty unlikely that a t2.small instance and its associated storage and network subscriptions are going to produce a $1000 bill one day, but there's literally no way to set a hard cap on billing so clearly Amazon thinks it might be possible.

phonon · on April 12, 2022

Use https://aws.amazon.com/lightsail/pricing/ then...

ludjer · on April 12, 2022

At that point you might as well go with vps. Lightsale is way more expensive and the underlying architecture is way more unstable I have a 1% failure rate for my fleet of instances. Ever two weeks I have a instance is not responding and needs to be rebooted alert from aws.

whiplash451 · on April 12, 2022

OVH has a great dedicated server offering. Definitely a great bang for the buck compared to AWS (if you can handle the downsides of course: security, backup, setup, handled by yourself).

post-it · on April 12, 2022

> OVH

Emphasis on backup.

KptMarchewa · on April 12, 2022

And definitely keep it outside of OVH.

whiplash451 · on April 13, 2022

Indeed. You can use glacier for that. Cheap and efficient.

xuki · on April 12, 2022

I'd say don't even consider colo unless you have a specific use case. Rented dedicated servers are cheap, let someone else take care of the hardware.

stephbu · on April 12, 2022

To add to this statement - don’t make any investment unless you have a solid plan to fully utilize it for the full length of the purchase terms. This applies as much to AWS @ minute terms, as it does to a rental @ daily terms, as it does to co-lo capital investment @ yearly terms.

marcosdumay · on April 12, 2022

> unless you have a solid plan to fully utilize it for the full length of the purchase

Hum... I'd say it's much more reasonable to look at the ROI. Making investments to supply peak demand or to hedge against rare risks is perfectly ok.

stephbu · on April 12, 2022

Fair point, I guess the point is that unallocated resources - space/power/servers etc. can become huge stealth money sinks, eating budget every hour of every day. Being cognizant of the consumption economics before you stump up for resources is important, as is fitting the investment model to match those economics. Setting utilization/allocation targets are just one way of measuring if those models efficiently match. This is true for any service or resource consumed.

bsenftner · on April 12, 2022

Yes, they are cheap. Running one's own server is also easy peasy; far too many think it is difficult, it is not. The most expensive part is the electricity.

briffle · on April 12, 2022

Been a sysadmin for many years. Your right, for a few computers. Once you start getting into more than a quarter rack, you also need to start worrying about cooling (which also is lots of electricity) and usually things like ensuring the electricity stays on. (UPS, generator, etc). Don't forget to monitor and service all this stuff regularly.

Once your past a few racks of equipment, you have generator tests and service appointments, redundant AC, Redundant UPS, dual power to each rack, etc. Dual Internet connection, and a link to your other server room that you use for DR, etc. The costs and complexity quickly escalate after a server or two.

BryanRamirez · on April 12, 2022

(I created an account just to upvote and comment on this)

Absolutely, nevermind managing OS upgrades and needing to use configuration management to mitigate against drift. It took a lot of effort to make sure that each server was not a special snowflake that could not be reliably reproduced.

Also, dealing with vendor warranties, and being on hold with HP (or whoever), then assuring them you're running the latest firmware.. please for the love of god just replace the failed memory/disk/cpu.

I found the sheer physicality of computing infrastructure to be a source of exhaustion and burnout. It's a big part of the reason why I'm a software developer now :)

secabeen · on April 12, 2022

> Once your past a few racks of equipment, you have generator tests and service appointments, redundant AC, Redundant UPS, dual power to each rack, etc. Dual Internet connection, and a link to your other server room that you use for DR, etc. The costs and complexity quickly escalate after a server or two.

This is not always the case. For computationally-focused workloads like the OP describes, without direct customer interaction, it may be reasonable to accept risk of downtime in the event of failure. If you are doing computations that take weeks to complete, and you checkpoint regularly, does it really matter if your computation finishes on Saturday morning or Monday morning? If not, you can probably accept 6 hours of downtime once every couple of years and eliminate all of the redundancy overhead described above.

In HPC, the general rule of thumb is buy your hardware if you can be sure you'll run compute on it more than ~40% of the time.

dhzhzjsbevs · on April 12, 2022

In the bad old days we bought our own servers and rented rack space in a datacenter they took care of all the hardware management, we looked after the software.

Mind you this was only about 4U worth.

vidarh · on April 13, 2022

For on prem. Or you put it in a colo, and let someone else deal with those things, and you still pretty much the same cost savings vs. cloud - sometimes more (e.g. if you can pick a colo somewhere with cheaper electricity you can save; if you can pick one somewhere with cheaper land, you can save). Or you move to managed hosting and still bare metal, and stop worrying about the physical servers at all, at loss of some freedom, and still get around 90% of the cost savings vs. cloud.

bsenftner · on April 13, 2022

I had a cabinet at the colo data center that was formerly Enron's data center - super thick pipe to the public Internet, in a data center built to survive nuclear war. I had 17 servers and a Federal Reserve quality hardware firewall, total cost of hardware was about $55K, the colo expense of $600 a month, and my team of two spent perhaps a day a month on maintenance. This was in contrast to $96K per month for the same setup at Amazon.

maccard · on April 12, 2022

> The most expensive part is the electricity.

No, the most expensive part is the persons time for managing it. I can rent a monstrous Dedicated Server for $400/month from OVH, but even with a UK salary, if I have to spend more than 1 day a month on it in any shape or form (and that includes the initial setup), it's cheaper to use "the cloud" or some form of a managed service.

tiew9Vii · on April 12, 2022

I hear that said a lot. I don’t think it stands.

A ec2 instance or other vps requires the exact same maintenance as a bare metal server. They are essentially the same except one is virtualised and the other isn’t.

The cloud actually requires more investment for large organisations. Previously you might of only had a handful of sysadmins but now you have a large dedicated platform team doing devops type work building your own abstractions/paas on top of your cloud.

The advantage the cloud has is flexibility. You don’t need to go through a lengthy process to acquire hardware. Likewise cloud services are disposable, no longer need something? Hit the delete button.

I think the more PaaS like services such as Heroku, Lambda, Fargate, Google Cloud Run do better realise the less maintenance story but not cloud generally.

maccard · on April 12, 2022

> I think the more PaaS like services such as Heroku, Lambda, Fargate, Google Cloud Run do better realise the less maintenance story but not cloud generally.

Completely agree here. Running an EC2 Spot instance + ebs volumes for 730 hours a month is a total waste of money, but running RDS behind fargate and ECS with an ALB is likely to save you time and money by month 2.

stephbu · on April 12, 2022

I broadly agree with the statement that owning an OS in general is toil that applies any of these IaaS/VPS/BM scenarios.

Owning a BM server different toil - server parts fail, the network it attaches to needs control and it fails too, firmware needs updating more regularly than ever, DC space needs managing over time etc. For a small number of machines maybe this is NBD. For thousands of machines this is just grunt work which while automated, still needs change management and control - rebooting the whole fleet in the middle of the day definitely opens doors in your career.

Doing everything you did in the DC in the Cloud is absolutely the worst way to adopt Cloud. Owning an OS is a non-goal, you’ve gotta climb to a higher abstraction - workloads, and quit caring about machines. This is where most companies fail.

philliphaydon · on April 12, 2022

> A ec2 instance or other vps requires the exact same maintenance as a bare metal server. They are essentially the same except one is virtualised and the other isn’t.

Ani. If the hardware you’re running on is dying. In ec2 you stop it and start it. It’s on new hardware. If you run bare metal you’re screwed.

Disk is dying ? Don’t matter cos your data exists multiple times over in AWS elastic storage. With bare metal you got to shut down and replace.

The cost of managing hardware is gone with ec2.

Software on the other hand yes that is the same amount of effort.

dhzhzjsbevs · on April 12, 2022

The cost isn't gone. It's just included. Fact is, at scale, it's still cheaper to do it yourself. You just need to reach the scale it's worth paying people to do the managing.

BryanRamirez · on April 12, 2022

I agree, with the caveat that at scale it's cheaper to do it well yourself. If that weren't true, then Amazon would not be making money on EC2.

If you don't have sysadmins and network admins with the right experience, you can easily find yourself in a bad spot with single points of failure, servers that can't be easily replaced, oversubscribed PDUs, misconfigured switches/routers... and any number of other problems that aren't occurring to me right now.

vidarh · on April 13, 2022

If you don't have sysadmins with the right experience, there is a vast number of companies that can do this for you on a fractional basis on retainer. I was one of them.

The crossover point where cloud is more expensive is really low even if you have zero in-house experience. Exactly where it is depends on your amount of egress, as that is where AWS in particular really takes advantage of you.

dhzhzjsbevs · on April 12, 2022

This is very true. You also probably need the scale having multiple install locations makes sense.

We went dedicated early and while it didn't make sense at the time we now run at way lower cost than the competition could dream of.

marcosdumay · on April 12, 2022

> In ec2 you stop it and start it. It’s on new hardware. If you run bare metal you’re screwed.

You mean, you reimage? That is the slow step, you reimage, and plug the new server. Wait a bit, and your service has one more server.

> With bare metal you got to shut down and replace.

You take the disk out and plug a new one. You don't turn things off because of a disk.

No doubt, those are costly. They are also rare (disk failure is less rare, but still rare).

philliphaydon · on April 13, 2022

> You mean, you reimage? That is the slow step, you reimage, and plug the new server. Wait a bit, and your service has one more server.

No. When you /stop/ an EC2 instance, and /start/ it again, it moves. You do not need to reimage. This is even requested from AWS when they are having hardware failures and need to move customers off so they can decomission the hardware. They request you stop / start the instance, if you do not do it by the due date they do it for you.

> You take the disk out and plug a new one. You don't turn things off because of a disk.

If you have a storage array sure. But if you're getting bare metal hosting from a provider, you're not always getting hot swappable storage arrays.

> No doubt, those are costly. They are also rare (disk failure is less rare, but still rare).

It was 1 example, obviously there's many different hardware issues that can go wrong.

vidarh · on April 13, 2022

> If you have a storage array sure. But if you're getting bare metal hosting from a provider, you're not always getting hot swappable storage arrays.

If you have any server-level hardware bought in the last 20 years or so, it will have the drives in hot-swappable bays. If you then choose to not set it up in RAID, it's just incompetence.

> If you have a storage array sure. But if you're getting bare metal hosting from a provider, you're not always getting hot swappable storage arrays.

If you're getting bare metal hosting from anywhere including your own colo, you have failover and the ability to order replacements while your system is still running. This is only an issue if you're architecture is fundamentally flawed, in which case you're likely to mess things up whether you're on bare metal or in a cloud.

ludjer · on April 12, 2022

Not really If you running bare metal and have a San you can easily change what it is pointing to. Also most bare metal servers have redundancy and disks can be changed with 0 down time.

philliphaydon · on April 13, 2022

So now you're throwing a ton more money at something, costing you in purchasing the bare metal, but also the cost to maintain it. If you want to build compariable redundancy as you get with EC2 it will cost you. It wont be cheaper.

vidarh · on April 13, 2022

I have no idea what you think the costs of this is. I have managed setups like that. Every year we priced out what a cost to EC2 would cost us, and every year it was about 3x the cost of running our own, with my time - accounted for to the hour - of running the system added in. Every year we also priced out Hetzner and a few other options. After a years Hetzner eventually won out (colo space in Germany was cheaper than where we were in London). So we tied Hetzner servers into our private cloud layer, and migrated containers and shut down servers as it fit into our schedule. Not having to physically go to the colo's to swap drives now and again saved me an average of maybe 2 days a month to deal with several racks worth of hardware.

maccard · on April 13, 2022

> Every year we priced out what a cost to EC2 would cost us, and every year it was about 3x the cost of running our own

> we tied Hetzner servers into our private cloud layer, and migrated containers and shut down servers as it fit

Building your own services on top of AWS is always going to come out more expensive. EC2 + EBS volumes alone are going to be more expensive than going with hetzner (particularly if you're not looking at reserved instances, and not utilising spot for burst). You mentioned that you are building your own private cloud layer and migrated containers; the cost of building that out in the first place is likely enormous compared to building and running on top of fargate.

vidarh · on April 14, 2022

The cost of building out our private setup was my time for about a month.

At the time we didn't have a choice, as nothing like Fargate existed, but today it's also easier to do setups like the one we did. It mostly involved rsyncing base images over, rsync and a super simple storage service for backups, a LDAP based directory service, and and a thin layer over vzctl (first) and docker when that became an option, coupled with a VPN setup to tie our locations together, and a reverse proxy setup that did dynamic lookups in our private DNS fed from LDAP.

It is hard to do as a multitenant public service, it's trivially easy to do as an internal tool that needs to support only exactly what you need.

I've built out setups like this for a number of clients since, and it's typically 1-3 months of work to automate pretty much everything depending on complexity, and so it pays for itself quickly from a very low scale.

The first company I did this at wouldn't have been profitable at all if we'd relied on AWS

GTP · on April 12, 2022

>Disk is dying ? Don’t matter cos your data exists multiple times over in AWS elastic storage. With bare metal you got to shut down and replace.

Not really, this is what RAID is for

philliphaydon · on April 13, 2022

Sure, raids some drives together, you still have downtime when you switch it off replace the drive switch it on wait for it to rebuild.

Or you can fork out the extra cost to have hot swappable.

If you're saying: "I dont want to pay $300 / month for EC2 instances at AWS when I can get the same hardware specs for $150 at X"

Chances are you're getting a shitty old desktop or barebones rack that lacks features like hot swap and raids.

Do you believe you're getting redundancy when you go with hetzner and a cheap desktop grade processor, ddr4 non ecc memory, and a consumer grade SSD?

vidarh · on April 13, 2022

There is no extra cost to have hot swappable unless you're considering buying consumer grade hardware, but consumer grade hardware is more expensive to host because it won't fit in 1U bays.

> Do you believe you're getting redundancy when you go with hetzner and a cheap desktop grade processor, ddr4 non ecc memory, and a consumer grade SSD?

Irrespective of how much I'd skimp on the hardware, I always have a HA setup, including on EC2, so it doesn't matter in any case.

yjftsjthsd-h · on April 12, 2022

Depending somewhat on the quality of the hardware in question. Hot swap capable isn't rare, but it's not going to happen at the bottom of the price continuum.

whiplash451 · on April 12, 2022

> A ec2 instance or other vps requires the exact same maintenance as a bare metal server. They are essentially the same except one is virtualised and the other isn’t.

Definitely not true. With a dedicated server, you need to handle backup and security yourself.

greedo · on April 12, 2022

With ec2 you still need to backup. You still need to validate the backups. Security it still something you need to do since an instance is still just a VM. Same with s3 buckets etc. Google for public s3 bucket "breaches." You still need to apply patches, to configure access, to expand volumes, to configure VPCs and security groups.

whiplash451 · on April 13, 2022

As far as I understand, when you use an EC2 volume, it is already backed up for you. It is not the case for an OVH dedicated server. For instance, backing up your OS image is a lot more work with a dedicated server than with an EC2 instance.

vidarh · on April 13, 2022

That addresses one possible set of failure scenarios. If you think that means you have a backup that is guaranteed to be accessible to you whenever you need it, you don't have a backup. If your only backup is in the same cloud provider where your primary system is, you don't have a backup, you only think you do.

Any reasonable dedicated setup will involve imaging your server, and so the OS image is not something you need to back up - if it fails you reimage. If you even store the OS image on the server at all rather than network boot.

whiplash451 · on April 13, 2022

That's still a lot better than what you have with a dedicated server.

For you to lose your OS image with a dedicated server only takes your HDD to die.

For you to lose your OS image on EC2 (where you made a snapshot of your volume in one-click) would take a lot of shitstorm to happen at AWS -- as I presume that they backup across sites.

vidarh · on April 13, 2022

> For you to lose your OS image with a dedicated server only takes your HDD to die.

Only if you don't have a backup.

Why in the world do you think anyone would store their only copy of an OS image on a single server?

For systems I set up, to start with, the OS is mostly immutable, booted and updated transparently to match a master image. If it gets destroyed, we just image a new server. The applications all run in containers, based on images stored on replicated file servers. If they get destroyed, we just re-deploy on a different server (in fact, automatically redeploying is trivial).

Only the application data is unique to running servers, and that needs to be backed up just as much whether those containers run in a cloud environment or locally, and again it's trivial to have automation in place for the backup and re-deployment of that. Been there, done that many times.

> For you to lose your OS image on EC2 (where you made a snapshot of your volume in one-click) would take a lot of shitstorm to happen at AWS -- as I presume that they backup across sites.

For me to lose my data on any bare metal system I've run, multiple servers in at least two different data centres operated by at different companies would need to fail at the same time. This is not hard to set up, and it's a one of setup. You then need to test your backups, just as you need to with AWS - an untested backup is not a backup.

But your assumptions of failure scenarios is also flawed. You need to protect against e.g. disgruntled employees, hackers, bugs as well. If you rely on the same security to protect your backups as your main setup, you don't have a backup.

EC2 is great when you can justify the cost, but it does not remove the need for a proper backup policy and processes to test them.

vidarh · on April 12, 2022

If you have to use more than one a day a month managing the server, you are doing something wrong.

I used to manage multiple racks worth of servers on top of managing the 1k containers running on them, maintaining the (pre-kubernetes) orchestration software I had written to deploy containers to our servers, and still had time left over to spend the majority of my time on the architecture and project management of new projects for clients.

treis · on April 12, 2022

I'd say if you're not spending one day a month you're doing something wrong. Namely, you're not testing your back ups and disaster recovery often enough.

brimble · on April 12, 2022

You're right, but given the context of the thread, it's worth pointing out that's not extra work that one doesn't also need to do with cloud hosting. Ought to test your backups and disaster recovery regularly, either way.

vidarh · on April 13, 2022

That's not managing the server. That's managing your application setup and not something that takes additional time for managed servers over cloud setups, because you still need all of those things for a cloud setup.

It also should not take anywhere near a day per server per month - if it does, then in a disaster scenario it means you're unable to recover at a reasonable pace.

treis · on April 13, 2022

>That's not managing the server. That's managing your application setup and not something that takes additional time for managed servers over cloud setups, because you still need all of those things for a cloud setup.

It is additional time because the cloud handles the whole class of "your hardware died" problems.

vidarh · on April 13, 2022

Most decent bare metal setups abstracts away the same class of "your hardware died" problems. E.g. my first rule on this is everything runs in containers whether I run in a cloud environment or on bare metal. The OS image is identical, and include tying into service discovery and a suitable orchestration mechanism (which can range from something trivially simple to, say, Kubernetes). Any modern server hardware has IPMI or something equivalent, which means you plug it in, configure the IPMI, configure network boot from a tftp/bootp or similar server holding your installation image, and from there on out you're deploying containers the same way as you would in a cloud environment, and back them up and arrange for failover the same way as in a cloud environment.

I've set up more than one hybrid setup where you didn't need to know if your workload was running in AWS or Hetzner or somewhere else, so we could use AWS for elasticity and Hetzner to keep cost down.

This is not a hard problem. And if people don't have the right skills in house, it's easy to outsource this (I for one used to make my living of automating setups like this and operating them on a retainer basis).

treis · on April 13, 2022

Containers running some work load dying is a lot different from a HDD going kaput on your production DB server. Or someone compromising your file system and encrypting all of your user uploads.

I don't think anyone is claiming it's a particularly hard problem. It's the opposite. We're talking about 1 hour a month. Even that little effort still ~$200 a month and a $200 a month managed DB instance is pretty beefy.

vidarh · on April 14, 2022

None of this is any different I a cloud setup. You still need to check the integrity of your backups. You still need to have appropriate failover. The effort is pretty much the same. In a competent setup it's automated, so you're not spending an hour a month per server. Not even for the production DB server, because you have replicas and setups to automate the switch, and setups to bring back snapshots from backups.

Been there, done that, many times, and I know exactly what it costs people, because I billed by the hour for it, and your cost assumptions are just way off.

maccard · on April 12, 2022

Broadly speaking, managing 100 "cloud" servers is roughly the same amount of work as managing 2. Moving from managing 2 cloud servers to 100 is trivial. Moving from 1 server to 2 is an architecture problem. That architecture problem sometimes comes up when it comes to scaling too. it's the cattle vs pets problem. The last thing you want to find out is that someone ssh'ed in and installed a package that's required for your service to run, _during_ vertical resizing, and that sort of thing is far more common in systems with 1 server instead of many. If you're running containers, you would likely have even less overhead (and save money) by running on ECS/DigitalOcean/Azure App Service, and if you're running a big old monolith, you likely need > 1 instance for some sort of redundancy anyway.

vidarh · on April 13, 2022

> Broadly speaking, managing 100 "cloud" servers is roughly the same amount of work as managing 2

If this was the case, my billable hours when I was doing contracting would be about 1/10th of what they were. I lived very comfortably of troubleshooting for teams who had gotten themselves into a thorough mess with this attitude. In fact, I earned more from the teams who insisted on cloud setups because they rarely understood the operational issues with it, whereas teams who chose dedicated servers generally thought about operational concerns more.

> Moving from 1 server to 2 is an architecture problem.

Moving from 1 container to 2 is an architecture problem.

Moving from 1 server to 2 running those containers is an architecture problem with well established known solutions.

E.g. for starters you're assuming no containers. But putting the application in containers and leaving the host OS only for basic infrastructure setup is basic practice today if you're running your own servers.

Once you've set up a network boot source (tftp etc.) to network boot of an installer with a suitable setup script that you can trigger via IPMI, and a directory service and a basic orchestrator (be it Kubernetes or something else) on your network, it doesn't matter much if you have 1 server or a 100 - they come up and you put containers on them, and they look no different than a cloud service to the devs.

You're right that it's easier to take shortcuts if you have just a few servers, but it's just as easy to take shortcuts with just a few cloud instances - the number of pet containers I've seen over the years is terrifying.

> The last thing you want to find out is that someone ssh'ed in and installed a package that's required for your service to run

Which is why you don't provide ssh access to the host servers to anyone without an understanding of ops concerns, log everything, and do all updates via an automated setup of your preference, and why you regularly recycle the containers whether you run a cloud environment or dedicated servers.

The reality is that cloud systems do not at all make you immune to this - I've done year long projects to regularise AWS setups that were full of undocumented manual changes to bring everything into a terraform config for example. Often they're worse, because there are a whole lot of unobvious places to look for extra bits and pieces.

> and if you're running a big old monolith, you likely need > 1 instance for some sort of redundancy anyway.

Nobody here suggested a monolith. Nobody is suggesting you forgo redundancy. The. The point of this is that often you can put clear upper limits on the computation you will need to do for either your system as a whole or for a given subsystem, and you can guarantee that you will never need more than one server for a given part of the system.

E.g. a real example: I've worked on a system that did some processing of data about companies. We know this will always fit on a single system because the population growth of humanity is slower than the performance growth of a server and the total number of companies worldwide fits on a single system with several magnitudes to spare today, and the type of companies we were interested in is just a subset. At this point, if you architect a system like that on the basis of assuming you will need to resize, you will risk making choices that makes you far more likely to have to resize. E.g. all of the data I'm talking about can easily fit in RAM on a relatively moderate server now and forever, but the moment you start planning for partitioning the data you have added orders of magnitude of performance overhead for communication.

Properly assessing which parts of a system needs to be able to scale is at the core of architecture, and architects (or devs; it's terrifying how many places lack anyone with architecture experience) who are just planning for infinite scalability for everything is huge red flag to me. It usually means they don't understand their system. For a startup, especially, this is an existential question - preparing for unnecessary scaling has killed many startups.

whiplash451 · on April 12, 2022

I think that depends a lot on which stage your startup is at, what your level of funding is, etc.

Definitely not as clear-cut as you seem to imply.

stephbu · on April 12, 2022

It’s pretty hard to generalize this without qualifiers. Electricity can be the most expensive problem, but it requires carefully planned control of the other factors - e.g large scale with highly automated servers, network, and meat-reducing control planes to become true. Otherwise factors such as people and under-utilization can especially dominate smaller and/or less efficient facilities.

Thinking thru the factors, many seem obvious, but are often forgotten/ignored when comparing rental or IaaS costs.:

Space is a fixed cost driven by market rates and maximum Server Capital Costs i.e. floor space. Failing to fill the room increases your cost/server efficiency. Pretty common to run out of thermal/power before you run out space, as equipment efficiency increases through the lifespan of the facility.

Server and Power costs scale together, carry a minimum cost for keeping machines on, and vary based on utilization. Again if the servers aren’t doing work, your efficiency ratio will drop. Larger space typically have pre negotiated power commitments too - failing to consume that carries fiscal penalties. Servers unit costs are fairly cheap, storage not so much. Full utilization throughout capital/lease lifespan is the goal - anything less increases relative cost/server.

Network costs scale with Server Costs, and vary again by utilization - minimum invest rules apply, all servers need at least one network port, as well as upstream Core/TOR/Miniswitch gear. The network gear lifespan is typically longer than servers, but shorter than facilities. It usually incurs annual support/maintenance charges too. Bandwidth charges are variable as expected.

People costs scale with a step-function and numbers driven by minimum coverage requirements, task complexities, and level of human toil. Performing any task on a device by-hand is expensive in most markets - touches on tickets, change management, task time etc. Fully burdened S+R in Western cultures is typically ~2x the salary - a $80K employee probably costs around $150K by the time all the workplace costs, taxes, and benefits are paid. Network folk are typically premium resources compared to DC Ops. Sustainable 24x7 coverage looks like a staff of 3-4 people.

vidarh · on April 13, 2022

The thing is, you can buy your way out of most of these considerations by either renting colo space or renting managed servers until/unless you're at a scale where there are savings to doing it yourself. This is a commodity service with margins a tiny fraction of AWS' margins. Buying services at those levels instead of doing hosting on prem still nets you 80%-90% of the savings vs. cloud. Sometimes it nets you greater savings because it makes it easier to move your workloads to cheaper locations (e.g. for a small company in London operating servers somewhere with low energy prices and low property prices is hard; colo space reachable from London in often 30%-50% above what it can be like many other places in Europe, and so you'll find managed servers from providers like Hetzner is often cheaper than on prem if your company is in a location like that)

stephbu · on April 13, 2022

I think I covered those poin in calling out the cost model elements - renting the “bottom third” of the costs via managed facilities, BMaaS etc. helps in terms of reducing or eliminating capital expenses and some human toil especially in a more stable business that doesn’t benefit from per-minute lease terms or multi-year. As I called out elsewhere, sizing the hosting model to the economic model is really important.

Even in the pure rental or managed BMaaS, the human cost can quickly dominate the economic model. Owning machines and OS’s is expensive at anything more than a couple of racks of machines. Eliminating people and human change/release from touching things in the datacenter is probably the first priority. Otherwise it is hard to consistently drive that human number down and meet service quality expectations for 24x7.

vidarh · on April 14, 2022

If human costs dominate, you're doing something very wrong, and in any case renting the services only obscure this, and you can externalise that just as easily with managed servers by outsourcing the management.

What is truly expensive is buying a bundled service with massive margins.

Sometimes you have the luxury of not caring, e.g. when building very high margin, low scale tools where human costs dominate due to dev or other parts of the business, but for anything that starts requiring hosting costs beyond even as low as a couple of k a month, you're leaving money on the table.

koolba · on April 12, 2022

> The most expensive part is the electricity.

There’s more than a kWh rate for that though. There’s battery backup power, diesel backup, and failover hot switching. It’s not like just plugging in your phone to charge.

m348e912 · on April 12, 2022

Easy or not, it's a job. Configuring, patching, upgrading, troubleshooting, securing, monitoring. It's only not expensive if your time is worth nothing.

greedo · on April 12, 2022

And you still need to do this with an ec2 instance. Sure you can automate a lot of it via config mgmt/orchestration, but unless you're using serverless stuff on AWS, it's an awful lot like having your own VMs.

manigandham · on April 12, 2022

True, there's a wide spectrum in the middle from rented colo to managed servers.

vidarh · on April 12, 2022

I agree fully with this with one caveat: if you're in an location then managed servers can be cheaper than your own colo. E.g. I'm in London, and it's hard to beat Hetzner with colos near enough to me to be practical.

Still sometimes reasons to use colos, but I think it's important people consider that the choice isn't just cloud or your own equipment on prem or in a colo - managed servers can get you most or all of the savings too.

Agree about scale. Most software devs have no idea what can fit on a single server, and sometimes tend to start wanting complex scaling solutions for things where every possible customer they could ever get could fit on a single server

mbostleman · on April 12, 2022

Of course. I stumbled on that sentence too. Maybe he meant something else or he is very young. The only difference was that they were called ASPs, not SaaSs.

croes · on April 12, 2022

The dot-com bubble wasn't about SaaS companies but internet companies as such, basically everything with a .com address. Hence the name. It was more about e-commerce than SaaS. Especially because of the internet data rate in the late 90s.

theta_d · on April 12, 2022

In the 90s we didn’t use the term SaaS. It was ASP back then, Application Service Provider.

edanm · on April 12, 2022

You're mixing two different things. SaaS is a business model. ASP is a technology.

manigandham · on April 12, 2022

No, that's a different ASP. Application Service Providers are a business type: https://en.wikipedia.org/wiki/Application_service_provider

edanm · on April 12, 2022

Oh! I stand corrected. Very interesting, I'll be reading up on this.

vidarh · on April 12, 2022

What it was "more about" is irrelevant the point is there was plenty of SaaS, and plenty of options to host them without owning servers. Many of us here we're around and running internet companies at the time.

manigandham · on April 12, 2022

The quote also says "web-based". Anyways the point is that web/internet businesses were common and growing rapidly way before any modern IaaS provider.

mschuster91 · on April 12, 2022

> Companies also vastly overestimate their scale when their entire business could probably fit on a single commodity server.

Indeed, but that's completely leaving out the single most important thing: backups. With all of the major clouds, snapshots are easy to do both at a VM level and data level (e.g. RDS), and the cloud provider takes care that the backups are sufficiently spread to be disaster tolerant.

In contrast, when you co-locate you have to take care of backups completely on your own, and with many hosters you can't even influence in which of their DCs your servers will be.

If anything, OVHs SBG fire incident should have shown everyone how hard it is to build resilient systems.

marcosdumay · on April 12, 2022

> and the cloud provider takes care that the backups are sufficiently spread to be disaster tolerant

Except against the disaster of the cloud deciding it's not worth to keep you as a customer, or the cloud having a distributed failure, or the cloud getting out of business...

BackBlast · on April 13, 2022

You need your own backups even if you're running EC2. If for no other reason than the vendor disappears, and you want to recreate the system elsewhere.

It's the same problem if you run your own server or on someone else's systems. You need a backup and restore plan either way.

300bps · on April 12, 2022

Nonsense. There were plenty of SaaS startups.

I started an ISP in 1996 and had the same reaction you did to that statement in the article.

The only thing I’d add to what you said is the article said what drew them to the cloud - hefty free credits.

SpaceMartini · on April 12, 2022

I work in HPC for a cloud provider, and fully endorse this move. Anonymously, of course.

You can make an economic argument for or against cloud in practically every IT domain, but in HPC the case for on-prem is really compelling; none of the cloud networking/resiliency value-add is relevant to batch workflows, and costs per core-hour are only remotely comparable if you use spot - which is itself a major compromise.

The only real advantage cloud has for science is object storage, which is genuinely a much better idea than trying to manage your own long-term archival storage.

If I were independent I would recommend people buy and build on-prem clusters and shuffle data out of fast scratch into Glacier, but other than that just don't worry about cloud until price pressure kicks in and we are down to 1-2 cents per core-hour on-demand.

I'd love a role where I can say these things non-anonymously, but the salary for such a position would be at least 50% lower than working for a cloud provider. Keep that in mind when talking to your supplier - we may not believe the pitch ourselves, but making it is just part of the job.

dagw · on April 12, 2022

The only real advantage cloud has for science is object storage

As someone who has done a fair bit of HPC I consider the real advantage to be temporary scalability. If my 'normal' compute notes have 128 GB of RAM and all of a sudden I have job that need 300 GB or RAM, with cloud I can just change a line in a config file and run that calculation on a machine with 300 GB of RAM. Or if I have a job that will optimally run on 100s of 1-core machines with only 4 GB of RAM I can set up a cluster of such machines with in minutes.

That being said I 100% agree that if you have a normal baseline workload that should absolutely be done on in house hardware.

SpaceMartini · on April 12, 2022

Those are fair points - I have seen truly spiky workloads like that very occasionally, but more often those spikes are a precursor to more sustained usage in a similar manner and so would quickly warrant hardware purchases.

SpaceMartini · on April 12, 2022

As an addendum to this: if you absolutely must use cloud, stick with AWS. Using Azure is (IMO) a fucking miserable experience and their only advantage (InfiniBand) is better served by buying your own hardware. GCP and OCI might be fine if you are getting a lot of credits, but the skills will not be useful down the line - while AWS is expensive, you will at least learn a bunch of in-demand operational skills.

smartbit · on April 12, 2022

> you will at least learn a bunch of in-demand operational skills.

Are you saying we should stick with AWS because most stick with AWS?

My personal experience with GKE/GCP was quite good, except as expected for their support.

SpaceMartini · on April 12, 2022

> Are you saying we should stick with AWS because most stick with AWS?

Broadly speaking yes - there is a lot of value in having a deeper pool of skilled people to hire from, and there are enough differences between cloud offerings to knock at least a couple of "effective years" of experience off someone who changes provider.

briffle · on April 12, 2022

So nobody got fired for buying IBM?

smm11 · on April 12, 2022

I can hang 32 terminals off just one PC. You're still blowing your budget on standalones.

earleybird · on April 12, 2022

Dang it all, we had 20+ teletypes hanging off a 32Kb PDP-8 (with DECTape for random access storage), "and we liked it!" :-)

lnwlebjel · on April 12, 2022

> The only real advantage cloud has for science is object storage, which is genuinely a much better idea than trying to manage your own long-term archival storage.

I work with an academic HPC group, and because researchers generally pay only for the hardware, and maybe some recharge rate for occasional maintenance, the cost per TB per month for 100's of TB and larger systems works out to the same for Glacier Deep (about $1/TB/mo) - except there is no 180 day requirement, no egress fees, and no transfer fees. And disk just keeps getting cheaper.

I'm told that big part of the solution is their use of ZFS.

Tsarbomb · on April 12, 2022

If you can setup HPC at scale you can setup Ceph for object storage. It will save you so much money in the long run.

SpaceMartini · on April 12, 2022

I have heard too many horror stories about Ceph (and OpenStack) to be confident about that. I certainly don't think I can truly beat S3 on cost or performance at the terabyte scale for household data - and while larger scale would give on-prem savings there are also higher expectations (in terms of availability and performance) of a multi-perabyte storage array.

secabeen · on April 12, 2022

Really depends on your scale. At the Terabyte to 100s of TB level, you can solve most storage problems at minimum cost with NAS or ZFS on commodity hardware.

Ceph/Object storage comes into its own at the multi-petabyte and higher levels, which is not very many groups or institutions.

SpaceMartini · on April 13, 2022

Solving storage at the tens of TB scale with commodity hardware is fine to a point (I have a ZFS NAS at home) but has much more ongoing maintenance burden than S3 and you need at least 2 copies for it to be a remotely comparable solution in terms of durability.

Ultimately you just have to design for what is important to you; I don't want to spend time managing this stuff any more, so keep a local NAS for my partner to access and put the bulk of my "cold" data into 2 different cloud object storage providers. Note that neither of these is actually S3; for business use I would absolutely use AWS but for personal files I can manage with the reduced capabilities and lower prices others offer.

lmeyerov · on April 12, 2022

Yep it is quite an arbitrage going on. Unless startup credits at the highest tier ($100K), way too expensive for most startups.

For a lot of our customers, cloud is impossible now for HPC: shortages are so bad that you have to know someone high up at top cloud providers to get access to right-sized GPUs. (T4? Forget it -- one of our tickets is open since ~Christmas.)

We have gone hybrid, and for growing compute, going multi-cloud, with main stuff on top 3 clouds (CPU, light minimal GPU...), and GPUs elasticity on other ones. And for a lot... Yep, just buy GPUs for local dev.

duxup · on April 12, 2022

I'm curious what people are spending / over spending all their money on in the cloud?

My exposure to the actual granular costs and billing have only been limited to a small company and in that case the costs were pretty appealing compared to running everything yourself. Granted this was also a bit of a hybrid with some services local and others in the cloud.

I've not had much exposure to where the deep costs start to pile up as far as cloud services goes. I wonder where those pop up?

bsder · on April 12, 2022

> I'm curious what people are spending / over spending all their money on in the cloud?

Outbound data.

Cloud companies generally make inbound data close to free, but outbound data incredibly expensive.

simonebrunozzi · on April 12, 2022

> If I were independent I would recommend people buy and build on-prem clusters and shuffle data out of fast scratch into Glacier,

Pretty much 100% agree with this statement. Only exception is maybe at the beginning of a long series of computation, you might want to start on-demand to fully understand and size your exact needs, and then provision off-prem.

zozbot234 · on April 12, 2022

Scalability might still be a major advantage if you have infrequent enough needs for HPC-scale compute. Basically, use the cloud service as a pure computational "grid". But the pitfalls of cloud business models (including sky-high costs for data egress) often make that unworkable.

fxtentacle · on April 12, 2022

Wholeheartedly agree. After my cloud storage costs exploded (mostly S3 egress over to Hetzner/OVH), I noticed that renting a 1GBit/s fiber connection to my office is actually quite affordable at $80 per month (in northern Germany).

Of course, it's not globally distributed and there is no fail-safe, but for the work we're doing, that is no issue. If we employees are offline, it doesn't matter if our tools are offline, too. And now that all AI storage is local anyway, building a GPU compute node is easy. I'm still waiting for 3090 prices to drop further, though, in contrast to the article. But I also went with Ryzon 5950 and Linux. I was positively surprised that 10G fiber networking is now down to $70 for a PCIe card + 20m cable kit. My workstation now has 1010MB/s 4k random write on the network filesystem. (We used SAMBA 3.1 and CIFS mounts)

I also grabbed the python/ubuntu package lists off Google Colab and created my own Docker to imitate it, and now data processing and AI training is fast (I always get the good GPU, no luck involved) and dirt cheap. Originally the idea was to run it on OVH, but I'm now also running it locally.

https://github.com/fxtentacle/ovh-colab-sagemaker-compatibil...

GordonS · on April 12, 2022

> I was positively surprised that 10G fiber networking is now down to $70 for a PCIe card + 20m cable kit.

Wow, that is surprising, maybe it's time I started upgrading my LAN...

v-yadli · on April 12, 2022

Actually if you have cat5 or cat6 base-T setup, you can get second hand dual port 10gbe pcie cards for 30 dollars each(!)

10GbE baseT switches are more expensive than fiber equivalent though.

fxtentacle · on April 12, 2022

In my case, cat6 cables and the added price of a 10G RJ-45 switch would have been more expensive than a 10G SPF+ switch and some twinax & fiber cables. Amazon has finished SPF-to-SPF assemblies for €10=$12. And for TP-Link, RJ-45 is like 1.5x the price of SPF+ equipment. Also, RJ-45 has a fixed minimum latency, due to it needing to support backwards-compatibility with 1G 100M etc. That means you need much larger send/transmit buffers to saturate the link as ping goes up. Especially if you need multiple hops, fiber is just faster. In my tests, 0.2ms vs 3ms roundtrip time for 10G SPF vs. 10G RJ-45.

hansel_der · on April 13, 2022

> Especially if you need multiple hops, fiber is just faster. In my tests, 0.2ms vs 3ms roundtrip time for 10G SPF vs. 10G RJ-45

ime if your unloaded latency on copper ethernet exceeds 0.2ms per hop there is some form of powersaving involved

fxtentacle · on April 13, 2022

If I remember correctly, the IEEE 802.3an line coding overhead is around 2.6ms for each time you switch between SPF+ and RJ-45. That's in line with my measurements.

SPF+ switch to OM3 fiber to SPF+ PCIe card => 0.2ms

SPF+ switch to RJ-45 cable to RJ-45 PCIe card => around 3ms

v-yadli · on April 14, 2022

NAS-[10GbE]->switch-[1GbE]->asus router, 2 hops ping test: ``` yatli@yatao-nas ~ % ping 192.168.50.1 PING 192.168.50.1 (192.168.50.1) 56(84) bytes of data. 64 bytes from 192.168.50.1: icmp_seq=1 ttl=64 time=0.236 ms 64 bytes from 192.168.50.1: icmp_seq=2 ttl=64 time=0.195 ms 64 bytes from 192.168.50.1: icmp_seq=3 ttl=64 time=0.300 ms 64 bytes from 192.168.50.1: icmp_seq=4 ttl=64 time=0.271 ms 64 bytes from 192.168.50.1: icmp_seq=5 ttl=64 time=0.153 ms 64 bytes from 192.168.50.1: icmp_seq=6 ttl=64 time=0.216 ms 64 bytes from 192.168.50.1: icmp_seq=7 ttl=64 time=0.300 ms 64 bytes from 192.168.50.1: icmp_seq=8 ttl=64 time=0.165 ms 64 bytes from 192.168.50.1: icmp_seq=9 ttl=64 time=0.163 ms 64 bytes from 192.168.50.1: icmp_seq=10 ttl=64 time=0.265 ms 64 bytes from 192.168.50.1: icmp_seq=11 ttl=64 time=0.174 ms 64 bytes from 192.168.50.1: icmp_seq=12 ttl=64 time=0.272 ms 64 bytes from 192.168.50.1: icmp_seq=13 ttl=64 time=0.397 ms 64 bytes from 192.168.50.1: icmp_seq=14 ttl=64 time=0.256 ms ^C --- 192.168.50.1 ping statistics --- 14 packets transmitted, 14 received, 0% packet loss, time 13168ms rtt min/avg/max/mdev = 0.153/0.240/0.397/0.065 ms ```

fxtentacle · on April 14, 2022

OK it seems mistake was to put a RJ-45 transceiver into an SPF port, if your direct RJ-45 line is so much faster.

v-yadli · on April 14, 2022

Yeah I think so. There might also be better converter modules. Anyway, my setup is caused by an oversight of not burying light pipes into the floor. Otherwise I guess I'll go with SFP+ too. I'm using TP-Link TL-ST1005 which is the cheapest 10GbE switch out there and it cost like $150. Also RJ-45 adapters/switches run anecdotally hotter than SFP+ ones.

hansel_der · on April 20, 2022

either that, or the equipment is in awe because you mixed up its acronym

hansel_der · on April 13, 2022

pretty wild, considering my phone averages at 2.2ms ROUND trip time when pinging the router.

theandrewbailey · on April 12, 2022

That's what I did. A few months ago, I bought 2 10GbE NICs off Ebay to connect my main desktop and basement server directly. I already had cat 6a running between them.

Some advice: make sure you know the form factor of your NICs. I accidentally bought FlexibleLOM cards. They look suspiciously like PCIe x8, but won't quite fit. FlexibleLOM to PCIe x8 adapters are cheap though.

fxtentacle · on April 12, 2022

I use Intel 82599ES SFP+ and a TL-SX3008F router. But let me warn you: Things are affordable, but NOT consumer-friendly. I needed to study the 500 page PDF manual and do basic link configuration through telnet via USB before I could connect to the router via Ethernet and use its web-GUI to finish the setup.

sofixa · on April 12, 2022

How's the routing performance on that switch? I'm really struggling finding a device to fit my needs ( routing, 6-8 ports, Ethernet, at least two 2.5/5/10G).

fxtentacle · on April 12, 2022

8x 10G SFP+ with a 160GB/s full duplex backplane. So far, it has passed all of my testing with flying colors.

sofixa · on April 12, 2022

Isn't that only for switching? Do you use it to route, cross-vlan or similar?

lioeters · on April 12, 2022

Just want to highlight how futuristic the author's title is: "Computational Biologist, Head of Protein Design @ Enzymit".

Why they moved to on-prem: lower and more predictable cost. At a public cloud provider, they lost thousands of dollars (of free credit they had) through "architectural blunders". And the running cost of GPU, CPU, storage, and data transfer summed up to $10K a month - at which point they figured they might as well purchase their own compute servers.

tonyedgecombe · on April 12, 2022

>"architectural blunders"

I wonder how much of this is by design. It seems it is in Amazon's interest to keep their systems and pricing as opaque as possible.

mathattack · on April 12, 2022

At MSFT the Azure solutions architects are part of the sales organization. Their commissions are tied to usage (revenue) which tells you everything about their skill sets. And their lack of standard clear pricing means you have to estimate everything yourself.

maccard · on April 12, 2022

I've spent a lot of time with Amazon's Architects who have given us a huge amount of advice on how we could reduce our costs. Maybe we've gotten lucky with our contact, but their approach for keeping us on AWS seems to be "provide enough value on top of AWS that the extra pricing doesn't really matter" mostly in the form of handing us architecture templates for common workflows that we've never set up.

sokoloff · on April 12, 2022

Now, when they make blunders, they don’t see an overt bill for them and are thereby happier.

emteycz · on April 12, 2022

Yeah, but when they do a blunder, they see they're not getting the results they expected and fix it, without any fear of 10x costs.

jdrc · on April 12, 2022

They also have free heating

motoboi · on April 12, 2022

Nice to see how this move can be an architectural blunder by itself.

reacharavindh · on April 12, 2022

The flexibility and scalability of the cloud comes at a cost. Scientific, static and non web but IO heavy workloads are almost always better off run on-prem or in a co-located data centre with servers paid up front. As is most things in businesses, it’s a trade-off one needs to think about and make. The smart ones will come out ahead, and those that blindly follow FAANG and drink their kool-aid may or may not come ahead, albeit with a heavy hit to their(or their VC’s) pocket.

If you’re a startup that simply have a bunch of web apps and APIs where uptime and network are your major costs, on-prem is only going to become a worthless headache.

A good Systems Engineer should help to figure out such choices. Anybody need one?

lvl102 · on April 12, 2022

I hate the fact that hiring now basically requires cloud experience specific to vendors. This is basically going to force people into one of the three major cloud platforms.

pzduniak · on April 12, 2022

Eh, it depends on the features that you're using. As long as you stick to the basic stack of Terraform + Kubernetes / IaaS with cloudinit + networking + S3-compatible storage API, you can quite easily jump between clouds. Sure, the logic that sets them up is different, but the concepts are generally roughly the same. Even if I end up choosing a managed service, I always implement a second OSS backend that gets regularly tested.

Every day I deal with AWS, Google Cloud and Oracle Cloud. Previously I used DigitalOcean and OVH. I have no issues onboarding people to work with the less popular options - as long as they get how Kubernetes / Linux / containers work, it's pretty good.

formerkrogemp · on April 12, 2022

Y'all are always recommending "if you stick to tech stack a, b, and c , then you can always have good work," but a year or two later you'll recommend learning or expertise in "x, y, and c." You're the first person I've seen recommend a vague, general-ish skillset.

lillecarl · on April 12, 2022

This is what I've come to realise too, as long as you can and do stick your workloads in Kubernetes it doesn't really matter what he logo says.

EKS, AKE, GKE, LKE, DOKS, OKD, Rancher... Whatever they're all compatible with what you want to do. There are definitely upsides to the cloud, but Kubernetes is the common denominator everywhere.

Wanna run GPU workloads on-prem? Buy some servers and do so. The only hairy thing is managing your own storage, quite the responsibility. (Look at Atlassian right now).

lillecarl · on April 12, 2022

GCE https://kubernetes.io/docs/concepts/cluster-administration/c...

GKE https://cloud.google.com/container-engine/docs/cluster-autos...

AWS https://github.com/kubernetes/autoscaler/blob/master/cluster...

Azure https://github.com/kubernetes/autoscaler/blob/master/cluster...

Alibaba Cloud https://github.com/kubernetes/autoscaler/blob/master/cluster...

Brightbox https://github.com/kubernetes/autoscaler/blob/master/cluster...

OpenStack Magnum https://github.com/kubernetes/autoscaler/blob/master/cluster...

DigitalOcean https://github.com/kubernetes/autoscaler/blob/master/cluster...

CloudStack https://github.com/kubernetes/autoscaler/blob/master/cluster...

Exoscale https://github.com/kubernetes/autoscaler/blob/master/cluster...

Equinix Metal https://github.com/kubernetes/autoscaler/blob/master/cluster...

OVHcloud https://github.com/kubernetes/autoscaler/blob/master/cluster...

Linode https://github.com/kubernetes/autoscaler/blob/master/cluster...

OCI https://github.com/kubernetes/autoscaler/blob/master/cluster...

Hetzner https://github.com/kubernetes/autoscaler/blob/master/cluster...

Cluster API https://github.com/kubernetes/autoscaler/blob/master/cluster...

Vultr https://github.com/kubernetes/autoscaler/blob/master/cluster...

TencentCloud https://github.com/kubernetes/autoscaler/blob/master/cluster...

These are all cloud providers that invested into their own managed Kubernetes, I'm certain all of them aren't as sleek as the big three, but it shows that there's definitely momentum behind sticking your workloads into Kubernetes.

CodesInChaos · on April 12, 2022

Most of those seem to be kubernetes autoscaler plugins that allow your cluster to manage resources in that cloud. That's quite different from cloud providers managing the cluster.

pjmlp · on April 12, 2022

Sure, but wait for certain HR departments or RFP documents requiring specific certificates as gatekeeping job or specific product deals, and so the business of cloud architect certification goes on.

goodpoint · on April 12, 2022

AKA lock-in. It was their strategy all along and the developer community fell for the trap once again.

anonymousDan · on April 12, 2022

For someone with a good general understanding of systems (distributed systems, operating systems etc), but perhaps lacking in hands on experience of deploying in a specific cloud, what would you say the key things are to learn? For me they almost seem like trivial things you could pick up in no time on the job.

Nextgrid · on April 12, 2022

Product-specific functionality & configuration. Every cloud provider has their own product, features & configuration for the same basic thing.

In the past, if you needed a load-balancer & reverse proxy you'd use Nginx or HAProxy regardless of the underlying machine. Now in the cloud, although you can technically run it on a VM, it's not "best practice" and you should instead reimplement it using your cloud vendor's proprietary equivalent, whether it's AWS ELB/ALB or something else, and that experience isn't portable across competing clouds.

pjerem · on April 12, 2022

"In the past" ? My current employer, a high growth company uses HAProxy and Nginx in production with no plan to change anytime soon.

Those are not tools from the past. They still work really well and there is no law requiring your brand new startup to be on AWS.

Most companies (even more when they are B2B) have very predictable workloads.

Nextgrid · on April 12, 2022

> there is no law requiring your brand new startup to be on AWS

See https://news.ycombinator.com/item?id=30272588.

criddell · on April 12, 2022

I'm pretty confident that somebody who has been able to master one provider is going to be able to figure out a different one relatively quickly. I think that breadth of knowledge could ultimately be pretty valuable.

rcarmo · on April 12, 2022

Most of the skills are transferable. Yes, APIs and provisioning tools can be different, but Terraform and Kubernetes abstract a lot away unless you really need to use PaaS (which can be a lot more useful than VM-centric folk usually care for).

But anyone who’s done infra at scale can easily get up to speed on any of the big three—-as long as you take the time to understand the differences, and actually model costs before starting to play around.

paxys · on April 12, 2022

Hiring for what? 95% of software roles out there don't need you to know what a cloud or server even is. If you are specifically going into ops/DBA/sysadmin type roles, then knowing the basics about the top service providers out there isn't too much to ask for.

bsenftner · on April 12, 2022

Unless, like OP, you work in a high compute field. High compute and the cloud simply do not mix, because the cloud is optimized for web apps, which require practically no real compute power. The cloud has it's purposes, as does the web, and they are not the only game in town.

Nextgrid · on April 12, 2022

Or high-bandwidth. Egress bandwidth charges on the cloud are robbery.

vsareto · on April 12, 2022

There's enough jobs where they're okay with any cloud experience. You might lose access to thousands of jobs by not having Azure on your resume, but there's tens of thousands where it isn't a strict requirement.

bob1029 · on April 12, 2022

> We do not have (yet) any public-facing applications that need to scale across multiple geographical zones and handle millions of requests per minute.

Most don't. 1mm requests per minute is very pedestrian for a single vm in virtually all cases. 1mm per second is totally reasonable too if you are careful with a few things...

I genuinely believe you could put the literal public Netflix biz experience on a single VM. Account management, billing, preferences, view history, etc. The only pieces that need cloud scale are ddos mitigation, 4k video streams and media-dense static web content. Most businesses do not have strong need for these things.

recuter · on April 12, 2022

Correctish in a sense: https://openconnect.netflix.com/en/appliances/#the-hardware

They throw one of these boxes at an ISP and interconnect. 4-6 year no touch reliability, couple hundred TB storage.

Modern hardware is quite something. They will saturate 2x100GE. That's in the thousands of concurrent streams per box.

Nextgrid · on April 12, 2022

> Modern hardware is quite something.

The one weird trick that cloud providers hate.

karmakaze · on April 12, 2022

Here's more of that context that paints a better picture.

> First, it’s important to note that Enzymit’s use of cloud computing mainly entailed computationally intensive calculations for protein design. We do not have (yet) any public-facing applications that need to scale across multiple geographical zones and handle millions of requests per minute. Our primary use case is running CPU and GPU heavy analyses, and for that use case, we have found the IaaS/public cloud solution to be far from cost-effective in the long term.

rhplus · on April 12, 2022

1 million/sec is basically line speed on a 10Gbps link if each request is coming in at MTU of 1500 bytes. Sure, you might be able to push that much data through a VM on a test bench with well-behaved local clients, but you ain’t gonna be doing that rate once you add TLS, authz, logging, throttling, non-trivial serialization, non-trivial database access, A/B tests, metrics, fraud detection, recommendations, and everything else that makes an API like Netflix tick. Whether you do that all on one VM or split across service roles, you’re gonna be much more realistically in the range of 1000 rps per CPU core.

bob1029 · on April 12, 2022

1500 bytes is a pretty big average payload size when you consider information theory and what actually must be communicated for this kind of business (on average).

A user clicking "Watch later" on a video could theoretically be communicated in something as small as 64-bit integer for the user/session id, one for the command type, and another for the identity of the actual video. With serialization, padding, etc., you are still probably well under 50 bytes for this one event.

Being sloppy with data throughout is certainly a good reason to need more pipes and servers. With enough discipline, you can process events at rates far exceeding 1 million per second with a single box and non-exotic network stack.

TheCoelacanth · on April 12, 2022

1500 bytes isn't even enough to send a list of video titles and thumbnail images for a single page of videos.

If you imagine that the client already has a full database of all the available videos and metadata about them, then you could get by with tiny amounts of data, but that's not even close to the actual circumstances that Netflix operates under.

charcircuit · on April 12, 2022

What about Netflix's recommendation system and analytics?

bob1029 · on April 12, 2022

Certainly not going to fit on the real-time production instance, but two computers isn't a whole lot more than one considering how many extra features we just added.

There is a lot of elegance with this type of setup too. You can have your analytics system receive a synchronous replicated log from the production system (what else is it gonna analyze?), so it can also double as a manual failover site.

gnarcoregrizz · on April 12, 2022

<tinfoil hat> I always got the impression that there has been a lot of cloud propaganda/astroturfing, even on HN.

I'm seeing more of these "on-prem infrastructure" posts, citing costs, efficiency, and cloud complexity. We run most of our infra on-prem, and have looked to moving a few bits and pieces to the cloud, but the math almost always works out to buying hardware. Meanwhile, I talk to some <cloud stack> friends, and their opex costs are astronomical for the traffic and size of their products.

The cloud is extremely convenient, and I would choose it if I were launching a new product, but past certain sizes and expenses I would start to do some math. It's not terribly difficult to run these cloud "shrinkwrapped" products (such as load balancers) on-prem. Things like object storage seem more difficult to me. I'm also hesitant to admin a database, they intimidate me :)

thefunnyman · on April 12, 2022

I think the reality is that most companies don’t have the skill set needed to maintain on-prem infra. A lot of us here take this kind of knowledge for granted, but most businesses don’t have the time or resources to build some of this stuff themselves so they reach for the cloud. You also see a lot of VC backed companies splurge on cloud as a trade off of money for rapid growth.

zmmmmm · on April 12, 2022

Hope he's prepared for a letter from nVidia's lawyers for breaking their license agreement for the RTX3090's.

piker · on April 12, 2022

Is all commercial use of the consumer grade nVidia products prohibited under that license? I thought (and this seems to agree: https://www.nvidia.com/en-gb/drivers/geforce-license/) that it was just use in a datacenter that was prohibited.

zmmmmm · on April 12, 2022

I reckon they will consider what he has constructed to be a data center.

A lot of places do it and just fly under the radar, but if you're going to publish a blog post bragging about it and how much money you are saving ...

athorax · on April 12, 2022

It doesn't matter if they consider these 3 servers to be a "datacenter." There are legal definitions[0] that this usage doesn't fit in at all. Unless nvidia provides a different definition in their license (which they don't)

[0]https://www.law.cornell.edu/uscode/text/42/17112#a_1

TheGuyWhoCodes · on April 12, 2022

I don't think so. These are just workstations. It's not like they have a a rack full of 4u servers with RTX3900s and even then it's for their own usage.

The RTX datacenter restriction as far as read, but not a lawyer, is for data center providers like aws, ovh, hetzner etc to provide servers with these gpu and rent them.

penultimatename · on April 12, 2022

They run two GPUs. I highly doubt NVIDIA, their lawyers, or a judge would consider that a datacenter. The license is clearly directed at a different crowd.

Protostome · on April 12, 2022

https://lambdalabs.com/gpu-workstations/vector

Lambda sells "GPU workstation built for Deep Learning" with RTX 3090.

kuon · on April 12, 2022

On premise can be a lot cheaper for bandwidth intensive app. We have a 10gbit/s dedicated fiber we use at like 80%, it cost us 1500$ per month (power+fiber). This kind of bandwidth is 10 to 100 times more expensive on the cloud depending on the service. Of course, no CDN, but as our customers are mostly local, we don't need global presence.

Also, we serve this from a single epyc based server, using elixir/phoenix. It's at about 6gbit/s outbound traffic. I realize this is not uber redundant, but it works and keep the costs low.

Maxburn · on April 12, 2022

We came to the same conclusion.

We do hosing for commercial HVAC systems and due to software requirements and the human factor of training the VM's we had in azure cost us significantly more than on premises servers. Add to that we generally keep those servers a little longer than some others would just increases the savings. This is still true even though we pay a colocation to host those servers.

The cloud flexibility is completely irrelevant for us in light of multi year service contracts from each customer we work with.

AtlasBarfed · on April 12, 2022

Why did openstack fail? Or did it, was it just not adopted?

I think there is still a lot of potential for open source management of core EC2/S3/networking capabilities (aka "core AWS IAAS service"). We have a fair number of cloud abstraction layers now, and obviously kubernetes, you'd think we could as an industry produce core apis for doing resource listing, availability, etc.

Maybe some of the problem is that devs have a LOT of experience with the "ask" side of IaaS: gimme storage, gimme vms, set. But they have no experience with the "provide" side, and the sort of one-off manual nature of installing networking and machines doesn't have good standardization for "reporting available resources".

At this point, aws apis are somewhat stable. (I would bitch about the error codes and documentation... but anyway). It's obviously "good enough" after 10-15 years of them.

Are there projects that try to marry an aws-ish api, which really is a reporting and request api, with a "available resources" reporting api? Are some of these things out there?

AWS ten years ago was liberating. It was progress. It was a good thing. But Amazon is not a "do no evil" corporation, much the opposite. And you see this in AWS with its treatment of startups, open source projects, and other manipulations. They are a monopoly now, or at a minimum a dangerous cartel.

A real open source alternative would be a good thing. It would be good for the rest of FAANG, it would encourage competition by allowing lesser clouds to offer core competencies that are drop-in.

bmj · on April 12, 2022

Starting a web-based or SaaS (Software as a Service) business was virtually unheard of before the age of IaaS (Infrastructure as a Service) companies. There were simply too many hurdles, and the expenses were too high — you’d have to purchase dedicated servers and a high bandwidth connection to handle a load of incoming visitors to your website, hire engineers to build a scalable system, and if you planned to go international, you would have to purchase servers in other geographical locations.

The first internet boom was exactly what the author describes. I worked for an internet e-commerce start-up in 1999, and guess what? We had dedicated hosting bandwidth coming into our office, and the hardware and software in place to serve our application. Of course, one of the founders was a sysadmin, but I had friends working for similar start-ups, and they leveraged one of the many co-location providers in the city to manage all the details. Yet their employer still owned actual hardware that they could touch when necessary.

vidarh · on April 12, 2022

I cofounded an ISP in '95. There were plenty of alternatives offered by ISPs like mine. People often chose to host themselves when they had the skills, sure.

recuter · on April 12, 2022

  After spending $100k in a year.. After doing the math, We decided to purchase three workstations at a total cost of $17k. 

  One is a GPU-based workstation with two RTX 3090s and an Intel i9–12900 CPU, and another two workstations with 16 cores AMD Ryzen 5950X CPUs.

  It took us a few FTE days to set those up to our satisfaction with slurm, NFS, backups, and several other services.

  We noticed that our RTXs, although considered gaming cards, are comparable (if not better) in performance to Tesla V100, which some cloud providers rent at the staggering price of $3.06 an hour.

tl;dr - They did the math.

ldoughty · on April 12, 2022

They also had a static workload, with no changing requirements, or really any need for scalability.

They also had FTEs with experience to configure the systems (which for a lot of us technical people is a no-brainer, but if you had data scientists with no hardware experience, that might be different)

This worked for them, and I'm happy for them.. the cloud does not solve all problems... It's simply one hardware strategy you can pick.. it's important to review the options!

Spooky23 · on April 12, 2022

That’s ok.

If you need a pickup truck two weekends a year, rent. If you need to carry tools and stuff every day, buy!

justsomehnguy · on April 12, 2022

> They did the math.

And bought 3 workstations instead of 3 servers.

IMHO the headline of "Why Enzymit Decided to Build its Own On-Prem HPC Infrastructure" is a bit... stretched.

uniqueuid · on April 12, 2022

Well, they do use slurm, so it's technically a HPC stack.

raelmebrand · on April 12, 2022

This is how universities (or people from academic background) interpret...

HelloNurse · on April 12, 2022

They have a workload that is entirely dictated by their protein design etc. projects, not by external access to a web app. So "high performance" means doing a single definite task quickly (so they can *proceed to the next run), not scaling to many users and requests. If they need to scale, they first hire protein designers or the like and then they set up some new workstation, probably something different from existing ones for diversification and obsolescence reasons. No cattle.

dagw · on April 12, 2022

We noticed that our RTXs, although considered gaming cards, are comparable (if not better) in performance to Tesla V100

You don't buy Tesla cards because you care about maximizing operations pr second, you buy them because you care about operations pr KWh.

adgjlsfhk1 · on April 12, 2022

No, you buy tesla cards so Nvidia doesn't sue you and stop providing firmware.

lostmsu · on April 13, 2022

That's a good point. 3090 RTX may be 1.5x faster than my Titan V, but Titan V consumes 2x less energy on the same load while being much older generation card.

dagw · on April 13, 2022

Plus less energy consumption means less heat generated which means you can pack them tighter and you can spend less on cooling etc etc. At scale these things really add app.

chrxr · on April 12, 2022

Worth noting that using a gaming card for workloads like this would likely void the warranty.

charcircuit · on April 12, 2022

No it doesn't

Cpoll · on April 12, 2022

Isn't it right there in the fine print?

> Warranted Product is intended for consumer end user purposes only, and is not intended for datacenter use and/or GPU cluster commercial deployments ("Enterprise Use"). Any use of Warranted Product for Enterprise Use shall void this warranty. https://www.nvidia.com/en-us/support/warranty/

charcircuit · on April 12, 2022

I must have read over that

stephen_g · on April 12, 2022

As a random aside, I’m glad to see ‘on-prem’ emerge as a common shorthand for this, only because it always grates me to see people make the extremely common mistake of saying ‘on-premise’. Premise, of course, only ever means “an idea or theory on which a statement or action is based”, whereas the actual term is premises (“the land and buildings owned by someone, especially by a company or organisation”, as in “The security guards escorted the protesters off the premises”).

ldoughty · on April 12, 2022

This is a good explanation of cloud issues for a company with resources and consistent workload...

IaaS is not really competitive in this space, I don't think... If you have access to system admins, and have a consistent work load, you could avoid the cloud, trading the cloud premium for more employees and skills in your team. This is fine, if that is what you're team needs.

The cloud is not a silver bullet that solves all companies infrastructure... But they have a very profitable space, especially in small businesses, or businesses that benefit from multiple data centers. AWS simply made it easy to scale up and down, as well as scale around the world... if you don't need to dynamically scale or easy access to multiple data centers, the cloud begins to lose it's best (cost-effective) competitive edge to self hosting... Though at the micro scale, the cloud can do dead simple basics for free -- or near enough (e.g. static websites), which is fun for personal projects

Nextgrid · on April 12, 2022

> If you have access to system admins

Clouds also require sysadmins, they're just called "DevOps engineers" now. Those YAML & Terraform files aren't going to write themselves.

dahfizz · on April 12, 2022

It takes just as much, if not more, IT work to maintain cloud infrastructure as local infra.

People who have never managed servers imagine them blowing up once a week or something.

nijave · on April 12, 2022

>Adding storage, CPU time, and many other costs makes understanding and verifying the cost structure a task suitable for certified experts

Is that easier on-prem? I was under the impression it was even more difficult--especially with shared tenancy (how much incremental cost does App A auth add to our Active Directory deployment?)

You're going to need to know server power utilization under load to calculate power/cooling costs and probably some additional data on network utilization to figure out incremental costs for that

Maybe if it's a colo or managed data center that gets rolled up for you, but if you're managing yourself, you still have to figure it out

Blog post also doesn't mention cost of downtime (maybe not an issue for them) or a metrics solution (you usually get basic machine and service metrics for free on the big cloud providers)

nikanj · on April 12, 2022

1) You save a metric boatload of money this way

2) Onprem experience looks bad on your resume, compared to cloud experience

3) Incentives work for people, and resume-driven development is key, as our industry is very stingy in passing the savings from 1) to the developers

dgb23 · on April 12, 2022

Many care more about cs and engineering fundamentals rather than bespoke products, even more people only care about having their problem solved and how much it costs.

StreamBright · on April 12, 2022

"First, it’s important to note that Enzymit’s use of cloud computing mainly entailed computationally intensive calculations for protein design"

So they got a very specific workload that is easy to run on any infra including on-prem.

_0w8t · on April 12, 2022

5 years ago we evaluated cold storage for 100 TB of data. Even at that relatively small amount magnetic tapes were much more cheaper even after accounting for an extra copy to ship to a remote location.