[flagged] Everything You Know About MongoDB Is Wrong (mongodb.com)
41 points by scapbi on Nov 26, 2020 | 61 comments


> But this is not the case! Some of these parts have never been the case - MongoDB has never been "eventually consistent" - updates are streamed and applied sequentially to secondary nodes, so although they might be behind, they're never inconsistent.

This is what eventually consistent means! If I wrote (key=X, value=Y) to a primary, then read X and saw a value that's not Y (because the secondary node hasn't caught up yet), that is inconsistency. Strong consistency (e.g. in a single-node SQL database) would mean it's impossible to read stale values.
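A minimal pymongo sketch of that scenario, assuming a hypothetical replica set "rs0" and a made-up "kv" collection: the write is acknowledged by the primary, yet a read routed to a lagging secondary can still return the old value.

  from pymongo import MongoClient, ReadPreference

  client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
  kv = client.test.kv

  # Write key X = Y; the primary acknowledges it.
  kv.update_one({"_id": "X"}, {"$set": {"value": "Y"}}, upsert=True)

  # A read served by a secondary may still see the previous value
  # until replication catches up -- that window is exactly what
  # "eventually consistent" describes.
  stale_reader = kv.with_options(read_preference=ReadPreference.SECONDARY)
  print(stale_reader.find_one({"_id": "X"}))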


It's par for the course for MongoDB. 10 years later and they still use sales tactics to lie about the capabilities of their database. MongoDB might be a solid database when properly used, but their marketing material, again, promises things that the software just doesn't support.

This is literally their "we are the fastest database just turn off commits" claim, but now in a distributed manner.


> This is literally their "we are the fastest database just turn off commits" claim

I don't recall them ever saying this.


They said they were the fastest.

They shipped with syncs turned off by default.

EDIT: actually it was journaling and the default write concern. Later they backtracked and enabled it by default, but by then their reputation was cemented (or bifurcated into fast, or terrible, I suppose).

https://shekhargulati.com/2011/12/08/how-mongodb-different-w...


It seems quite similar to what MySQL/PostgreSQL do with their WAL.

Those don't sync to disk on every write.


PG fsyncs on every commit. By default.

It will let you delay commits — if you ask it to — so it may batch multiple commits together on a single fsync, but it will NOT report to you that it has saved data durably when it hasn't. Because this is a safe operation, PG will let you do this easily: even on a per-session or per-transaction basis.

Similarly, you can ASK it to commit eagerly without waiting for fsync, again on a per-session or per-transaction basis. But it will never turn on by default, and the documentation of this option clearly calls out the risk: "an operating system or database crash might result in some recent allegedly-committed transactions being lost"

Or more too: you can ask it to turn off fsync (and all other kinds of sync) entirely and risk total database inconsistency — say, because you are running it on a filesystem that guarantees consistency, like ZFS — but it strongly recommends against it just on the off-chance that some over-eager user carelessly turns it off. Because this is generally not a safe thing to do, PG will NOT let you do this easily: you MUST change the server config on disk and restart the cluster.
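As a rough illustration (assuming psycopg2 and a made-up "events" table): synchronous_commit can be relaxed for a single transaction from a client session, whereas fsync can only be changed in the server configuration.

  import psycopg2

  conn = psycopg2.connect("dbname=app")  # placeholder DSN
  with conn, conn.cursor() as cur:
      # Per-transaction opt-in: this commit may be reported as done
      # before its WAL has been flushed to disk.
      cur.execute("SET LOCAL synchronous_commit = off")
      cur.execute("INSERT INTO events (payload) VALUES (%s)", ("hello",))
  # fsync, by contrast, is not settable from a client session at all;
  # it has to be changed in postgresql.conf on the server.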

I don't know about MySQL, but I'd be greatly surprised if they did anything like what MongoDB used to do. There's a reason MongoDB got that reputation, in a world where MySQL — and its surprising data-eating behaviours — already were popular.


I mean to be fair, in a single node MongoDB database (one primary, no secondaries), it would be impossible to read stale values as well. Eventual consistency only comes in when you introduce multiple nodes / secondaries. As long as you connect to the primary for reads, you'll never get stale data.


> I mean to be fair, in a single node MongoDB database (one primary, no secondaries), it would be impossible to read stale values as well.

Yes, but the product has been advertised from the beginning on the basis of how easy it is to grow horizontally; the whole "It's webscale!" argument.

----

> Eventual consistency only comes in when you introduce multiple nodes / secondaries. As long as you connect to the primary for reads, you'll never get stale data.

1. Writes can be lost if the primary dies and a secondary is promoted before data from primary has been replicated to it.

2. The whole point of adding secondaries — again, the "webscale!" argument — is to scale read capacities


> 1. Writes can be lost if the primary dies and a secondary is promoted before data from primary has been replicated to it.

Is this any different than how things work in eg Postgres?

> 2. The whole point of adding secondaries — again, the "webscale!" argument — is to scale read capacities

The main driver behind secondaries is redundancy/durability. If the primary goes down, there's no/very little overall service interruption.

If you need the latest data, read from the primary. If you care more about faster reads than perfect data accuracy, read from a secondary.


>> 1. Writes can be lost if the primary dies and a secondary is promoted before data from primary has been replicated to it.

> Is this any different than how things work in eg Postgres?

1. The comment of mine you are replying to applies to any DB that uses Eventual Consistency, as can be seen above it.

2. Yes, this is different in PG simply because PG doesn't do Eventual Consistency by default.

>> 2. The whole point of adding secondaries — again, the "webscale!" argument — is to scale read capacities

> The main driver behind secondaries is redundancy/durability.

In general, maybe — or even maybe in your specific use-cases. But Mongo's prescribed use-cases — the ones they marketed, heavily, even going so far as to make up terms like "webscale" — are about 'performance' and 'scale'. I specifically call this out in my comment.


Yeah, I don't know much about MongoDB's past transgressions as a company or have any experience using it, so it wasn't intended as a criticism of the company. I just spent a lot of time learning exactly what all these DB terms mean and wanted to point out an inaccuracy so it's clearer to people reading.

As you say, eventual consistency only comes in with multiple nodes. And unless you are willing to sacrifice availability, eventual consistency is the only game in town in a multi-node scenario.


The definition of "eventually consistent" is somewhat vague in this respect, or at least easily misunderstood. When updates are streamed and applied sequentially, "eventual" is easily determined as the number of messages that the stream is behind. I'd say that this model is "deterministically consistent", because we know at which point the replicas will all agree with each other without having to perform a full comparison.


Sequential Consistency is the name for this model. Writes don't get lost, but it is weaker than linearizability: you can read after a commit and not get the latest value, though eventually (not the same "eventual" as in eventual consistency) you will. Typically data systems provide this to optimize for read scale-out; otherwise all reads have to go to the leader.

In “eventual consistency” you could read the committed value at some time, no value, or another committed value depending on the exact implementation. It’s a much, much weaker consistency model than Sequential Consistency and Session/Causal Consistency is the closest you can even get to sniffing a stricter model.

I had a longer comment about this new page and MongoDB as a whole but I’ll just swallow my words. They know what they’ve done and don’t really care.


> Sequential Consistency is the name for this model. Writes don’t get lost but ...

Writes can get lost if the primary dies and the secondary is promoted in the meanwhile.


Yes but this is dependent upon the specific replication protocol. Primary-Backup has this problem but Majority Quorums and Chain Replication do not.


The parent is talking specifically about 'Sequential Consistency'.


My point is that Sequential Consistency isn’t a replication protocol, it’s a consistency model that you can have under various different replication protocols.


> Eventual consistency is a consistency model used in distributed computing to achieve high availability that informally guarantees that, if no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. [1]

That is literally the definition of eventual consistency. How you achieve it doesn't matter to being eventually consistent, just that it is.

[1] - https://en.wikipedia.org/wiki/Eventual_consistency


This makes me wonder how it's even possible to replicate data without being eventually consistent. I can't transmit data faster than the speed of light. It seems that the definitions here need more clarity, because eventual consistency in the broadest sense applies to everything. Is there a good example of replication consistency which isn't "eventual", but is instead immediate?


It's all about when you acknowledge back to the client, whether you accept writes at a single writer or at multiple writers, and whether reads can be served from any node. Most EC systems tend to be multi-writer with async replication. This means clients get acks immediately, and reads of those values at other nodes can return different values. The speed of the network doesn't matter from a theoretical perspective. Network speed only affects potential convergence times. Time is time regardless of the unit of measurement.

See my other comment above for distinguishing between EC and Sequential.
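For concreteness, a hedged pymongo sketch of how the acknowledgement point is chosen (collection and field names are invented): w=1 acknowledges as soon as the primary has the write, while w="majority" waits until a majority of the replica set has replicated it.

  from pymongo import MongoClient
  from pymongo.write_concern import WriteConcern

  client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
  orders = client.shop.orders

  # Acknowledged by the primary alone: fast, but the write can be
  # rolled back if the primary fails before replicating it.
  fast = orders.with_options(write_concern=WriteConcern(w=1))
  fast.insert_one({"sku": "abc", "qty": 1})

  # Acknowledged only once a majority of members have the write,
  # so it survives a primary failover.
  safe = orders.with_options(write_concern=WriteConcern(w="majority"))
  safe.insert_one({"sku": "abc", "qty": 2})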


I don't think the OP is arguing that replicated data is not "eventually consistent" because it is as you say.

OP is arguing the blog post is trying to change the definition of "eventually consistent"... the author is saying the delay between the master and slave is not "eventually consistent"... when, exactly as you say, it is.


SQLite, appropriately configured, is trivially linearizable — everything is protected under one lock on one machine.



Here's an overly simple, impractically slow, but immediately consistent model for you: for every request from a client, ask all other nodes in the network the same query. If you get 100% consensus, return the answer, otherwise, ask again.
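A toy sketch of that deliberately impractical protocol, with the node list and the per-node read function left as stand-ins:

  import time

  def unanimous_read(nodes, key, read_from, retry_delay=0.05):
      """Return a value only once every node reports the same one."""
      while True:
          answers = {read_from(node, key) for node in nodes}
          if len(answers) == 1:
              return answers.pop()
          time.sleep(retry_delay)  # some node is behind; ask again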


That is a kind of strange thing to title a post hosted by the corporate blog itself.

1) If that's actually true, then both Mongo's marketing department and its technical writing staff have been doing a really, REALLY poor job for years.

2) "You don't know nuthin', dummy!" is not a great lead-in to correcting people's misconceptions.

It also appears to be the case that what I knew about MongoDB was actually true all along, and most of these "myths" are just Mongo trying to redefine the terms or problems to be something they're not. :-/


> "You don't know nuthin', dummy!" is not a great lead-in to correcting people's misconceptions.

I actually disagree. Lots of companies have to rebrand, especially when their leadership and/or product have changed.

The core problem here is that the article seems to be an attempt to fix Mongo's reputation through misleading PR (e.g. using different definitions of words than everyone else in the industry), rather than technical solutions.

It's typical of Mongo as a company, though. They don't have technical merit and try to make up for it by trying to convince us all that the sky is green.


> 2) "You don't know nuthin', dummy!" is not a great lead-in to correcting people's misconceptions.

On the other hand if you look for reasons why mongo db is bad by searching "mongodb wrong choice" or "mongodb bad" and you find this post, by the company itself, it'll likely be kinder to the product than the posts from other blogs.


Interesting that they referenced the Jepsen report, even though it was a damning indictment of MongoDB's unreliability... and then Mongo cited the report on their website as a positive, despite its negative result. That was a bit of a scandal not that long ago, so it's odd to see them still talking about it.

It's like those supplements that say "scientifically tested!", which is a true statement even if the scientific test found that the supplement is ineffective and does nothing.


The title should rather be "Everything MongoDB knows about databases is wrong".

They're trying to establish a narrative around their product by redefining terminology that is already accepted and well defined, and that has existed for decades in the Computer Science community.

Eventual consistency doesn't need to be redefined.

If the general public assumes a database to have relational features, maybe it's time to rebrand mongodb into mongostore or something?

If MongoDB tries to be the "eierlegende Wollmilchsau" (egg-laying wool-milk-sow, i.e. jack of all trades) of databases, they are trying to serve far too broad a market. MongoDB isn't made for ERM-based scenarios; maybe stop trying to push it into that. If people had wanted that in the first place, they'd likely have chosen ArangoDB or an SQL-based database anyway.

To be blunt, I don't understand why they are seemingly so irrational as to keep pushing this narrative.

Maybe they've seen that most of their customers stop using their product after a while?

If so, then it might be time to stop promising things you conceptually cannot and should not want to deliver.

Also, dear MongoDB folks: web scale is definitely not MongoDB as a product. For decades devs have used memcached with SQL-based databases and it scaled beyond imagination, and beyond what a non-DDoS testing environment can reflect.

If your product isn't webscale because you do not want to be held responsible for adapters that you yourself refer to in your own docs, maybe it's time to take responsibility and introduce a QA step with a minimum performance standard that every adapter has to fulfill to be recommended?

Regarding the Debian package version: I personally am gonna stop here, because I hate people complaining about shit, not taking responsibility, and expecting others to do work for free. If Debian's policies annoy you, host your own damn PPA. Or sponsor them to help them work on that. But this... this is not okay.


this. and all the bleating over precisely which sort of consistency mongo has when it's a document store that can't guarantee shit about what version of some entity was serialized into the document you just fetched. calling themselves a database has always been a scam.


> “There remain some differences of opinion with Jepsen about MongoDB's defaults, but if you read the manual, you can rely on MongoDB to store and maintain your data consistently and robustly.”

My personal opinion is that a database should not lose data _by default_.


That sentence does seem easy to attack if you assume that if you don’t read the manual, MongoDB will not be consistent or robust.

If defaults are touchy, it sounds like they would still benefit from closer cooperation with Linux distro packagers, perhaps even taking on that task themselves.

MariaDB has shipped as “UNIX sockets only” and “Only listen to localhost, no default auth”. This increases the difficulty of trying out the database, but for a mature product for mature people, maybe there is no excuse for bad defaults anywhere.


If you want to tell your boss you're storing the data, but plan to quit before anyone tries to actually use said data before seed funding runs out, mongodb is the perfect choice.

The nice thing is, there is an API compatible version of mongodb distributed with most Unix systems, installed by default in /dev/null. I applaud the authors of mongodb for achieving this level of market penetration in such a short time, defying years of experience from the real world.


Mongodb is Webscale - https://youtu.be/b2F-DItXtZs


> Does /dev/null support sharding?

Timeless.


Ouch


I used to work in a managed investment department of a large investing company in the midwest. We used MongoDB as a single source of truth for daily market analysis data. We used C# .NET Core. I loved working with Mongo; although it gets slow over time, it was enjoyable. One thing I really hated was the C# driver. I just hated it. It fails translating complex LINQ expressions into Mongo queries and its BSON implementation is extremely slow. We literally wrote the data access layer in Python, wrapped it as an API, and we got a 10x speed boost. Eventually, we replaced it with Postgres, utilized its JSON columns, and retired MongoDB. Good memories learning its query syntax and investigating where the slowness was coming from.


  - Why did you move to Postgres?
  - What was perf impact/improvement after moving to Postgres?
Thank you in advance!


"Relational" databases are much cheaper comparing to MongoDB atlas.I mean $50k per month cheaper. Also AWS documentdb doesnt natively support 128 bit decimal ...

The performance boost was at least 10x, and even more with complex LINQ queries, all thanks to Entity Framework. At that job, we did a lot of risk analysis, bucketing bonds and creating composites. Entity Framework Core and its LINQ-to-SQL translation were key to the performance boost. Not to mention Entity Framework provides multiple layers of caching, which we used since our risk analysis algorithms were basically a custom version of bin-packing ...


Thank you very much for sharing this.

Our experience was about the same - both cost- and perf-wise. It's just that our perf boost was an order of magnitude bigger on some tasks :).


So what did you enjoy?


I enjoyed debugging the MongoDB C# driver, sending patches for performance boosts and a better LINQ-to-query feature. Debugging expression trees in C# piqued my interest, and now I am a third-year PhD student in compilers.


Oh that's really cool :)


> You can do joins with queries that we call aggregation pipelines. They're super-powerful,

And hot garbage for performance. Unlike relational indices on foreign keys, Mongo simply... does a lookup for each doc in the pipeline step. Indexing the looked-up collection does nothing extra in aggregation; you're just doing a repetitive manual join. A simple aggregation query I wrote, which added a value from a second collection based on its timestamp relative to the original document's timestamp, took at least an order of magnitude more time than the same query without the lookup.

Aggregations on a single collection are very performant. But never try to lookup or join anything.
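To make the shape of that concrete, a hedged pymongo sketch (database, collection, and field names are invented); the $lookup stage is the per-document join being criticised above:

  from pymongo import MongoClient

  db = MongoClient().telemetry  # hypothetical database

  pipeline = [
      {"$match": {"sensor": "s-42"}},
      # For each matched document, run a lookup against the
      # "calibrations" collection -- effectively a nested-loop join.
      {"$lookup": {
          "from": "calibrations",
          "localField": "sensor",
          "foreignField": "sensor",
          "as": "calibration",
      }},
      {"$unwind": "$calibration"},
  ]
  results = list(db.readings.aggregate(pipeline))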

> Scaling data is mostly about RAM, so if you can, buy more RAM. If CPU is your bottleneck, upgrade your CPU, or buy a bigger disk, if that's your issue.

Never listen to this person about anything.

The mongo I'm dealing with now scales by database. Each entity has between 20-100GB of data in its own database; we're adding entities continually. If I try to replicate for performance, I'll be replicating everything--there's no selective replication. If I shard, I'll be sharding within a collection, which is the equivalent of striped RAID--great if that's what you need. I don't. I need to shard at the database layer. I need my queries routed according to the database at which they're aimed, not by the sharding key. Can I? Not a chance in hell with any of the existing scaling mechanisms from Mongo. My current mongo VM is already the largest Azure offers. How do I add more RAM to that?


Why do all your databases have to be on the same shard? Isn't the idea of sharding that you can have different dbs on different shards?


Not the way Mongo does it.

The way sharding works in general is that your data gets an additional key (unless it has one that works for sharding already), and a router in the stack does traffic control on the query to send it to the correct shard; on the way back, the data is reassembled into a single result. This enables parallelism in your query, boosting performance by using something like map-reduce. Secondarily, your shard management layer can do a lot with shard-level redundancy and dynamic sharding to spread the data evenly across shards.

The scaling axis here is the size of your data--in Mongo's case, the number of documents in a collection. As the number of documents grows, you shard the collection to keep queries on it fast. Mongos, the sharding router, only manages sharding with a shard key on a document, so the only sharding possible is spreading documents from a single collection across multiple shards. If you have 10 databases that each have 100 collections, you get all of them on every shard, but each collection only has a subset of that collection's docs.
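Roughly what that looks like when declared (names are placeholders, commands go through mongos): sharding is enabled per database, but the thing that is actually split across shards is a collection, keyed by a document field.

  from pymongo import MongoClient

  client = MongoClient("mongodb://mongos-host:27017")  # the query router

  # Enable sharding for a database...
  client.admin.command("enableSharding", "telemetry")

  # ...but the unit that gets distributed is a collection, by shard key.
  # There is no built-in "route this whole database to shard N".
  client.admin.command(
      "shardCollection", "telemetry.readings", key={"deviceId": "hashed"}
  )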

I haven't looked deeply into the mechanism, but I imagine this floats on top of mongo replication, where the replication layer cooperates with the sharding manager to replicate only the shard's docs (as Mongo replicates by tailing the oplog, all a shard has to do is ignore oplog entries for docs without a shard key in the correct range).

It's the fact that it works at the collection level that makes it useless for us. Each database, for a single entity, is a set timespan of timestream data, with one collection per timestream. Entities vary in the number of timestreams/collections they have, but as they're all the same type of entity, their maximum size is pretty consistent and we don't have problems querying a single collection. We don't need collection-level query performance.

Our axis of scaling is the number of databases, not the number of documents within a collection. What would be literally perfect for us is a sharding manager that routes by database. We could put each entity's database on its own mongo instance/cluster, or an instance holding X databases, and scale horizontally almost indefinitely. We're lucky in that per-entity data falls into a clear range of sizes; we're unlucky in that no such router exists for such a common scenario, which is bizarre to me.


That's... not the kinda blog post you want on your own blog.

"We're really bad at explaining Mongo and nobody knows how to use it."


I will leave this here for anyone wanting to validate the original article - http://jepsen.io/analyses/mongodb-4.2.6


“Some of the tests found bugs - they were meant to - and we fixed them. There remain some differences of opinion with Jepsen about MongoDB's defaults, but..”

It's just a difference of, like, opinions, man. Oh, and in 6 months they can claim "that version of MongoDB was version 4.2.6, we're on version 6.11.42, that's an oooold version they did that report on". It's always the same tactics.


Back in ~2012 I interviewed at a company that chose Mongo as its database, and my whiteboarding question was to implement joins in NoSQL, as that was a recent problem they were solving. After doing the problem I asked innocently, "but all your data is relational - why use Mongo at all?" And the CTO went red in the face and exploded, actually yelling at me about Mongo's many benefits (just not, ya know, anything a normal database provides). Needless to say I didn't get the job!


"I chose MongoDB" has been (and apparently should continue to be) an indicator that someone is either new to software -- which is fine, a lot of bootcamps teach Mongo -- or totally incompetent when it comes to databases.


I'd say "name and shame", but I'm guessing they're not around any more. Am I right?


They seem to still exist but they recently closed down a major product. Let’s just say that if I joined them and lasted 8 years my stock options would be worthless.


Oh, would love to have been a fly on that wall. I wonder what they did for 'reporting'. mongo/document dbs are great for niche uses. I don't disagree on that. But for an overall store I don't see it.


The only thing I need to know is not to use it.


I was trying to type out some kind of rebuttal, but yeah, just use Postgres with jsonb if you really need "document" behavior. Mongo is so fraught with issues that, given the alternatives that exist now (CouchDB, Postgres with jsonb, etc.), you should just avoid it.
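For reference, the kind of setup people mean by that (table and column names are made up); jsonb columns can be GIN-indexed and queried with containment operators:

  import psycopg2

  conn = psycopg2.connect("dbname=app")  # placeholder DSN
  with conn, conn.cursor() as cur:
      cur.execute("""
          CREATE TABLE IF NOT EXISTS docs (
              id   bigserial PRIMARY KEY,
              body jsonb NOT NULL
          )""")
      cur.execute(
          "CREATE INDEX IF NOT EXISTS docs_body_gin ON docs USING gin (body)")
      cur.execute("INSERT INTO docs (body) VALUES (%s::jsonb)",
                  ('{"user": "ada", "tags": ["db", "json"]}',))
      # Containment query: documents whose body contains this fragment.
      cur.execute("SELECT body FROM docs WHERE body @> %s::jsonb",
                  ('{"user": "ada"}',))
      print(cur.fetchall())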


I had a good experience using it for rapidly-changing map data that needs to be queried based on location. Indexing over coordinates on the surface of a sphere is supported out of the box, which makes it simple to use, with relatively low latency, on databases of GeoJSON documents on the order of a few terabytes, although I'll admit to not knowing if that's a feature supported out of the box in other document databases.
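A hedged pymongo sketch of that out-of-the-box piece (collection and field names invented): a 2dsphere index over GeoJSON points, queried with $near.

  from pymongo import MongoClient, GEOSPHERE

  places = MongoClient().maps.places  # hypothetical collection

  # Index coordinates on the surface of a sphere.
  places.create_index([("location", GEOSPHERE)])

  places.insert_one({
      "name": "depot-7",
      "location": {"type": "Point", "coordinates": [13.4050, 52.5200]},
  })

  # Documents within roughly 1 km of a query point, nearest first.
  nearby = places.find({
      "location": {
          "$near": {
              "$geometry": {"type": "Point", "coordinates": [13.40, 52.52]},
              "$maxDistance": 1000,
          }
      }
  })
  print(list(nearby))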


> MongoDB is an ACID database. It supports atomicity, consistency, isolation, and durability.

Yeah. If you dismiss the fact that it took them 4 versions and 9 years to be able to make that claim.

ACID by compliance. Definitely not ACID by design, and hence not trustworthy for transactions. Even this article says that MongoDB shouldn't be used for transactions.

MongoDB seems great for when you need to capture large streams of data in a create-once/read-forever type of way. Probably why it's popular among large enterprises.


This is such a ridiculous argument.

So if a company doesn't add a feature in version 1.0, then it simply doesn't count?


There’s no question mark but Betteridge’s law of headlines still applies.


Having just (TODAY) finished moving a massive mongodb to postgres (w/ jsonb) I think I can say everything I know about mongodb is correct; it usually is a pain in the ass (particularly for larger databases).

That said, Mongo (Inc.?) has a p good marketing team.



