What were the limitations that required you to move all customers to a sharded system at once?
Could you have selected some workspaces with lower traffic to migrate first? That would have decreased the load on the primary, potentially speeding up the replication, which is a flywheel to enable more customers to migrate to shards.
Good question, that was an option. The main motivating factor here was that vacuums were beginning to take dangerously long: O(weeks) to complete, independent of the load on the database. While migrating spaces in segments would have reduced the number of records future vacuums need to scan, we were already racing against the clock to complete one vacuum prior to TXID wraparound[0]. To kick off replication for specific spaces we would have needed to write our shard key to all data owned by those spaces. Those writes would have consumed even more transaction IDs, and that was not something we were comfortable doing.
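To make the race concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function names are hypothetical, the numbers are made up rather than taken from our system, and it treats the wraparound horizon as a flat 2^31 transaction IDs consumed at a constant rate.

```python
# Illustrative sketch only: names and numbers are hypothetical.
# PostgreSQL transaction IDs are 32-bit, so roughly 2^31 XIDs can be
# consumed after the oldest unfrozen XID before wraparound becomes a problem.
XID_HORIZON = 2**31


def days_until_wraparound(oldest_xid_age: int, xids_per_day: float) -> float:
    """Days left before the oldest unfrozen XID's age reaches the horizon,
    assuming XIDs are consumed at a constant daily rate."""
    if xids_per_day <= 0:
        raise ValueError("xids_per_day must be positive")
    return (XID_HORIZON - oldest_xid_age) / xids_per_day


def vacuum_finishes_in_time(
    oldest_xid_age: int,
    xids_per_day: float,
    vacuum_days: float,
    extra_xids_per_day: float = 0.0,
) -> bool:
    """True if a vacuum taking `vacuum_days` completes before wraparound.
    `extra_xids_per_day` models additional write traffic, such as a
    backfill that writes a shard key to existing rows."""
    burn_rate = xids_per_day + extra_xids_per_day
    return days_until_wraparound(oldest_xid_age, burn_rate) > vacuum_days


if __name__ == "__main__":
    age = 1_500_000_000   # hypothetical age of the oldest unfrozen XID
    rate = 20_000_000     # hypothetical normal XID consumption per day
    print(vacuum_finishes_in_time(age, rate, vacuum_days=21))
    print(vacuum_finishes_in_time(age, rate, vacuum_days=21,
                                  extra_xids_per_day=15_000_000))
```

With these made-up inputs, a three-week vacuum finishes in time under normal traffic but not once the backfill's extra writes are added, which is the shape of the tradeoff described above.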
At the end of the day, this is something we could have explored in more depth, but we were ultimately comfortable with the risk tradeoff of migrating all users at once vs. the consequences of depending on the monolith for longer, largely thanks to the effort we put into validating our migration strategy.
Seems like something that might still be worth exploring. If I'm thinking about this correctly, it would allow you to create new shards on the fly and to migrate workspaces between shards while locking only one workspace at a time, and only for as long as it takes that single workspace to catch up.