
This is a great read and a primer for anyone looking to shard who knows they need to scale. Having helped a ton of people shard their data, my advice still remains: don't do it until you know you have to. The old Citus data blog is a great resource for guidance around best practices and how-tos, even if you're not using Citus.

That said, some of the things you can do from day 1 make things way easier when the time comes later. Workspace ID/Tenant ID/Customer ID: having your data grouped that way makes it very shardable and saves you a lot of work later on compared to retrofitting it into your data model.
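
For concreteness, a minimal sketch of that kind of day-1 data model (hypothetical names; SQLite only so the snippet runs anywhere): every child table carries the tenant discriminator, so a future shard key is already on every row.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE workspaces (
        workspace_id INTEGER PRIMARY KEY,
        name         TEXT NOT NULL
    );
    CREATE TABLE pages (
        page_id      INTEGER PRIMARY KEY,
        workspace_id INTEGER NOT NULL REFERENCES workspaces(workspace_id),
        title        TEXT
    );
    -- blocks repeats workspace_id instead of relying on the join path
    -- through pages, so every row already knows its future shard key.
    CREATE TABLE blocks (
        block_id     INTEGER PRIMARY KEY,
        page_id      INTEGER NOT NULL REFERENCES pages(page_id),
        workspace_id INTEGER NOT NULL REFERENCES workspaces(workspace_id),
        content      TEXT
    );
    """)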

I'm not sure I buy that Citus/Vitess are magic; both are reasonably clear about how they work, and you can dig into it. At the same time, I'd weigh the downsides of Citus (can't speak to Vitess): the online rebalancer isn't open source, so at that point it's a proprietary product.



Vitess is pretty close to magic for a lot of use cases. The downside of its rebalancer is that the re-sharding process requires a full expansion of the shards being re-balanced. That is to say, if you have two shards [-80] and [80-] and you want to split out part of the first shard, you need to provision two new shards [-40] and [40-80], let the data copy into them, and then you can mark them as active and delete the old [-80] shard. In this brave new cloud world, that's not as much of an issue as it would have been in the old days of running your own datacenter.
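
To make the keyrange mechanics concrete, here is a toy sketch of that split (plain Python, not Vitess code; treating keyspace IDs as 64-bit integers is an assumption of the sketch): shard "-80" covers the lower half of the keyspace, "80-" the upper half, and the split replaces "-80" with "-40" and "40-80".

    def shard_for(keyspace_id: int, ranges: list[tuple[int, int]]) -> int:
        """Return the index of the shard whose [lo, hi) range holds keyspace_id."""
        for i, (lo, hi) in enumerate(ranges):
            if lo <= keyspace_id < hi:
                return i
        raise ValueError("keyspace_id out of range")

    TOP = 1 << 64  # 64-bit keyspace IDs, an assumption for this sketch

    before = [(0x00, 0x80 << 56), (0x80 << 56, TOP)]            # [-80], [80-]
    after  = [(0x00, 0x40 << 56), (0x40 << 56, 0x80 << 56),     # [-40], [40-80]
              (0x80 << 56, TOP)]                                # [80-]

    ksid = 0x5A << 56
    assert shard_for(ksid, before) == 0   # lived in [-80]
    assert shard_for(ksid, after) == 1    # now lives in [40-80]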

The big problem everyone I have talked to about sharding runs into is managing the shards as you expand. In this case it looks like Notion over-sharded so they can spin out up to 480 physical nodes, but when they need the 481st it is going to be a nightmare. That's what Vitess gets you for free: expand to any number of shards and just never worry about it again.
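
A toy illustration of why that 481st shard hurts (generic modulo sharding, not Notion's actual scheme): changing the divisor remaps almost every key, which is why pre-allocating many logical shards and only remapping logical shards onto physical nodes is the usual escape hatch.

    moved = sum(1 for key in range(100_000) if key % 480 != key % 481)
    print(f"{moved / 100_000:.1%} of keys would change shards")  # 99.5%

    # The usual workaround: fix a large number of *logical* shards up front
    # and only remap logical shards onto physical nodes as you grow.
    N_LOGICAL = 480

    def logical_shard(tenant_id: int) -> int:
        return tenant_id % N_LOGICAL

    def physical_node(tenant_id: int, n_nodes: int) -> int:
        # Growing n_nodes moves whole logical shards, never single rows.
        return logical_shard(tenant_id) % n_nodes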


Resharding in Vitess is not quite as magical as this post implies. My infrastructure team has tried resharding the cluster powering my app several times, and each time it's failed.

It's been a while since the last attempt so I forget the details but I think the reshard replication lag was growing no matter what resources we threw at it. We've since scaled vertically instead and are working on architecture changes to reduce our MySQL load (which admittedly is extremely high).


I would encourage you to join the Vitess Slack channel and let them know of any issues you run into there. Resharding has successfully been done at massive scale, so it's unlikely that the system can't handle the load. Here is the blog post from when GitHub re-sharded live under full GitHub load: https://github.blog/2021-09-27-partitioning-githubs-relation...


Thanks, I'll pass that along to my infrastructure team. Glancing through the blog post, it's quite light on details, and I wonder if we just have a very different data load pattern.

Our dataset size is actually reasonably small (10s of TB), but our transaction throughput is very high. Glancing at our dashboard, baseline is 250k transactions/s, with sustained daytime load in the 500k/s region.


Agreed. Just to add to this, we've seen reshards on live datasets that are over 10 petabytes in size.


> don't do it until you know you have to

If you wait until you have to, then isn't it too late to plan and test your solution? So you suffer from degraded performance and/or outages while you rush out a solution.


They said wait until _you know you have to_, not wait until _you have to_. If you're monitoring your systems appropriately, you should know well before you have to (modulo inorganic growth explosions, which are hard to predict).

This is just a specific case of premature optimization.


Agreed - at a previous job someone tried to force me into implementing some horrible sharding solution they'd dreamt up. I didn't: three years later the project was nearly dead and the database had never gotten anywhere close to maxing out a single machine.


(Ozgun, Citus co-founder)

That's a good point. Sharding has an interesting trade-off. Early on, you don't need it and there's an overhead to distributing work across multiple machines (you're taking an additional network hop). Later when/if you need it, sharding becomes painful to introduce. You may need to change your data model for performance and move data.

For years, I also cautioned against introducing sharding as a premature optimization. Recently, we changed our approach and also introduced sharding on a single VM. I think that offers a pretty good trade-off. If you're interested in this topic, I’d be curious to hear what you think.

https://www.citusdata.com/blog/2021/03/20/sharding-postgres-...


Yeah, Notion waited too long: their performance was atrocious for a long time.


I'm not sure what you meant by "online rebalancer" above, but the shard rebalancer in Citus has been open source since March 2021: https://www.citusdata.com/blog/2021/03/13/scaling-out-postgr....


Ex-Citus here. The open-sourced shard rebalancer blocks writes to the shard being moved. The online rebalancer (closed source) uses logical replication and doesn't block writes to shards being moved, except for a brief period. Everything is the same except how shard moves are implemented.


Yeah, echoing here to be extra clear. What is open-sourced holds a write lock while rebalancing. So it rebalances, sure, but it's only marginally better than a dump/restore. You can still read, yes, but I'm not sure many applications can be okay with no writes flowing to a table for hours while rebalancing is happening.


Got it, thank you! I'm just wondering whether this limitation can be alleviated or worked around by combining sharding and replication. In that case, I would expect the primary DB cluster to maintain the write lock during shard rebalancing while allowing writes to the replicas (once the primary cluster's rebalancing finishes, the roles would reverse and rebalancing would be applied to the replicas, while the primary is already free of the write lock).


Understood. Thank you for the clarification.


> "Workspace ID/Tenant ID/Customer ID, having your data grouped that way makes it very shardable"

Dumb question: if someone starts off by grouping their data only by Customer ID and then later needs to shard, couldn't all of the sharding problems go away if they simply created a new Customer_Tenant table to map customer_id to tenant_group_id?
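
For concreteness, the idea seems to be a directory table along these lines (all names hypothetical; SQLite just so the snippet runs):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    -- Hypothetical directory table: look up the tenant group (and hence
    -- the shard) for a customer instead of deriving it from the data model.
    CREATE TABLE customer_tenant (
        customer_id     INTEGER PRIMARY KEY,
        tenant_group_id INTEGER NOT NULL
    );
    """)
    conn.execute("INSERT INTO customer_tenant VALUES (?, ?)", (42, 7))
    row = conn.execute(
        "SELECT tenant_group_id FROM customer_tenant WHERE customer_id = ?",
        (42,),
    ).fetchone()
    print(row[0])  # 7 -> route customer 42's queries to tenant group 7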


Sorry, wasn't clear enough: tenant ID = customer ID = workspace ID. There is often a benefit, though, to materializing that ID onto all child tables. If you take a salesforce.com example, every customer has an "orgID"; from there you have accounts, and accounts have opportunities. It's not common for people to put the "orgID" on the opportunity from a data modeling perspective, but when it comes to sharding it gives you a big win.

If you have that tenant discriminator on all tables, then it's easy to route a query to the appropriate physical and logical shard right away, vs. having to do some DB join first in your request to figure out where it goes.
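
A hedged sketch of that routing win (hypothetical helper functions, not Citus or Vitess internals; N_SHARDS is an assumption): because org_id is materialized on the opportunities table, the router can pick a shard from the query alone, with no join through accounts first.

    import hashlib

    N_SHARDS = 16  # an assumption for this sketch

    def shard_for_tenant(org_id: str) -> int:
        # Hash the tenant discriminator to a stable shard number.
        digest = hashlib.md5(org_id.encode()).digest()
        return int.from_bytes(digest[:8], "big") % N_SHARDS

    def route(org_id: str) -> tuple[int, str]:
        # org_id is on the opportunities rows themselves, so no lookup
        # through accounts is needed to decide where this query runs.
        shard = shard_for_tenant(org_id)
        sql = "SELECT * FROM opportunities WHERE org_id = ? AND stage = 'open'"
        return shard, sql

    print(route("org_42"))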


Are you suggesting creating entire database schemas specific to a customer and putting the customer/orgID in all the table names themselves?


No - you put a customer ID column in every table


Yep, exactly this. Having the customer ID on every table lets you know how to route queries appropriately, and how to construct them as well.


I've stumbled across this in databases I've inherited before, and while I appreciated that I could easily do filtering and such without joining across tables (and, based on the comments here, more easily shard when things scale), it seems to denormalize the data.

Is there a specific term for the trade-off of denormalizing vs. being able to more easily query/shard/etc?


> Is there a specific term for the trade-off of denormalizing vs. being able to more easily query/shard/etc?

No, it's pretty much just called that. A good DBA will be able to strike the right balance between normalization and performance using their intuition and experience, which is one of the reasons they're often paid very well despite working in relatively ancient ecosystems.


Generally, the downside of denormalizing is that you risk inconsistency, plus some extra storage. The inherent assumption in this case is that the customer ID (or whatever the key is) doesn't change for any associated table, and in that world, inconsistency is not a big risk. So yeah, you store some extra data, but if you've designed your IDs well, it's not that big of a cost.
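
One way to keep that inconsistency risk low in SQL (a sketch with hypothetical tables; SQLite syntax, but the same works in Postgres/MySQL): a composite foreign key forces the denormalized org_id on the child row to match its parent, so the copies can't drift.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript("""
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        org_id     INTEGER NOT NULL,
        UNIQUE (account_id, org_id)
    );
    CREATE TABLE opportunities (
        opportunity_id INTEGER PRIMARY KEY,
        account_id     INTEGER NOT NULL,
        org_id         INTEGER NOT NULL,  -- denormalized copy of the owner
        FOREIGN KEY (account_id, org_id)
            REFERENCES accounts (account_id, org_id)
    );
    """)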


If you have the customer ID in the data, it basically represents the "owner" of the data. Typically the data of one owner does not refer to the data of other owners. Also, the ownership of given data probably never changes, so updating it is not a problem.

I wonder if this could or should be a built-in feature in databases. It is "metadata", meaning data about the data: who owns it.



