
Forgive me, but how does this avoid the chicken-and-egg problem here? Without digging through the promo copy: why would one programmatically label training data to train an ML model, if one already has such a program to label the data...


That's a really good question. I took a class with one of the professors who started Snorkel.

The way he broke it down was: you can incorporate rules either into your data or into your model. Because we want the model to be as general purpose as possible, it turns out you can squeeze out some extra performance by adding "bronze/copper" quality data, labeled with handwritten rules, to your dataset.

You can think of the model getting an extra boost from the latent knowledge within the rules.
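To make that concrete, here's a tiny sketch (my own toy example, not Snorkel's API) of "rules in the data": a handwritten rule generates cheap, noisy labels that get mixed in with a small gold-labeled set before training. The rule and task are invented for illustration:

    import re

    def rule_label(text):
        # Hypothetical rule: texts mentioning a dollar amount are
        # "commercial" (1), everything else "non-commercial" (0).
        return 1 if re.search(r"\$\d+", text) else 0

    gold = [("free shipping on orders over $50", 1),
            ("meeting moved to noon", 0)]
    unlabeled = ["sale: $20 off today only", "lunch with the team tomorrow"]

    # "Bronze/copper" quality: rule-generated labels, noisier than gold.
    weak = [(t, rule_label(t)) for t in unlabeled]

    train = gold + weak  # the model can absorb the rule's latent knowledge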


Their paper explains it - https://arxiv.org/abs/1711.10160

Snorkel itself has been an open source package for a while - https://github.com/snorkel-team/snorkel

This new announcement is about Snorkel Flow


Labels are knowledge about data. If you already know some rules that work reasonably well based on your domain experience, then Snorkel lets you capture those as "labeling functions" that may not cover all the data and can be "noisy". Snorkel then builds a model to label your data while accounting for the "noise". Combining that with some "gold" labels (done by humans), you can use the generated labels on a large data set to build a higher quality model that generalizes better. This is similar to how you can take several low quality models that have expertise over different parts of the data and build an "ensemble" model that performs better than any of them.
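For a flavor of what that looks like with the open-source snorkel package: you write small Python functions that vote or abstain on each example, then a label model denoises their votes. The spam/ham task and rules below are invented for illustration:

    import pandas as pd
    from snorkel.labeling import labeling_function, PandasLFApplier
    from snorkel.labeling.model import LabelModel

    ABSTAIN, HAM, SPAM = -1, 0, 1

    @labeling_function()
    def lf_keyword(x):
        # Noisy domain rule: "free money" usually signals spam.
        return SPAM if "free money" in x.text.lower() else ABSTAIN

    @labeling_function()
    def lf_short(x):
        # Another weak rule: very short messages are usually ham.
        return HAM if len(x.text.split()) < 4 else ABSTAIN

    df_train = pd.DataFrame({"text": [
        "free money now!!!", "ok thanks", "claim your free money prize",
    ]})
    L_train = PandasLFApplier([lf_keyword, lf_short]).apply(df_train)

    # The label model estimates each function's accuracy and correlations,
    # then combines the noisy votes into one probabilistic label per example.
    label_model = LabelModel(cardinality=2, verbose=False)
    label_model.fit(L_train, n_epochs=100)
    probs = label_model.predict_proba(L_train)

Those probabilistic labels are what you'd then train the final, more general model on.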

Imho, Snorkel-style tools ("weak supervision") are game changers for ML, though the biggies get all the press. So I'm excited to see this end-to-end direction taken by the team.


Hasn't this been done for years, under names like synthetic data generation, simulation, etc.?


Not data generation - label generation. But the charitable interpretation of your question is valid: we've been doing this kind of ensembling to build higher quality models for some time now. What's new, I feel, is that it's getting some good structure, practice, and tooling around it.


Yeah, then advertise it as a tool rather than AI. The problem is that Snorkel is trying to sell snake oil on the name of Stanford and AI. Under the hood it is just a data generation pipeline. Remember, you can't put labels on random data. So "Not data generation. Label generation" makes no sense at all and sounds to me like "brown sugar".


I saw a talk on Snorkel a few years back, so I don't remember it perfectly, but it seemed to be an iterative process. It's a tool for you to build and refine simple rules. If you're labeling recipe ingredients, a simple heuristic like "<number> <units> <ingredient>" will get a lot of them, but there are tons of edge cases. With more heuristics, you might get lots of those, and so on. I think it was a tool to help you explore and iterate on those heuristic labeling functions quickly. Then you can label the stuff that's hard in a more expensive way or something. I thought of it as noisily hand-labeling sets of examples at a time rather than single examples at a time. This is all memory from a random conference talk or paper years ago, so take it with some big grains of salt. I do clearly remember thinking it seemed really cool at the time.
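Something like this, as I remember it (a rough reconstruction - the regex, units, and label names are all made up):

    import re

    # Heuristic for "<number> <units> <ingredient>" lines.
    INGREDIENT_LINE = re.compile(
        r"^\d+(?:\.\d+|/\d+)?\s+(?:cups?|tbsp|tsp|grams?|g|oz)\s+\w+",
        re.IGNORECASE,
    )

    def lf_quantity_unit(line):
        return "INGREDIENT" if INGREDIENT_LINE.match(line.strip()) else "ABSTAIN"

    print(lf_quantity_unit("2 cups flour"))     # INGREDIENT
    print(lf_quantity_unit("1/2 tsp vanilla"))  # INGREDIENT
    print(lf_quantity_unit("a pinch of salt"))  # ABSTAIN - an edge case for the next rule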


A human will label data according to hand-rules or heuristics. What's the difference if a program labels data according to the same hand-rules or heuristics?

The downstream discriminative model's goal is to generalize via supervision.



