
Forgive me, but how does this avoid the chicken-and-egg problem here? Without digging through the promo copy: why would one programmatically label training data to train an ML model, if one already has such a program to label the data...


That's a really good question. I took a class with one of the professors who started Snorkel.

The way he broke it down was: you can incorporate rules either into your data or into your model. Because we want the model to be as general purpose as possible, it turns out you can squeeze out some extra performance by adding "bronze/copper" quality data, labeled with handwritten rules, to your dataset.

You can think of the model getting an extra boost from the latent knowledge within the rules.
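To make that concrete, here's a tiny sketch (my own toy example, not Snorkel's API) of "rules in the data": a handwritten rule generates cheap, noisy labels that get mixed in with a small gold-labeled set before training. The rule and task are invented for illustration:

    import re

    def rule_label(text):
        # Hypothetical rule: texts mentioning a dollar amount are
        # "commercial" (1), everything else "non-commercial" (0).
        return 1 if re.search(r"\$\d+", text) else 0

    gold = [("free shipping on orders over $50", 1),
            ("meeting moved to noon", 0)]
    unlabeled = ["sale: $20 off today only", "lunch with the team tomorrow"]

    # "Bronze/copper" quality: rule-generated labels, noisier than gold.
    weak = [(t, rule_label(t)) for t in unlabeled]

    train = gold + weak  # the model can absorb the rule's latent knowledge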


Their paper explains it - https://arxiv.org/abs/1711.10160

Snorkel itself has been an open source package for a while - https://github.com/snorkel-team/snorkel

This new announcement is about Snorkel Flow


Labels are knowledge about data. If you already know some rules that work reasonably well based on your domain experience, then Snorkel lets you capture those as "labeling functions" that may not cover all the data and can be "noisy". Snorkel then builds a model to label your data while accounting for the "noise". Combining that with some "gold" labels (done by humans), you can use the generated labels on a large data set to build a higher quality model that generalizes better. This is similar to how you can take several low quality models that have expertise over different parts of the data and build an "ensemble" model that performs better than any of them.
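For a flavor of what that looks like with the open-source snorkel package: you write small Python functions that vote or abstain on each example, then a label model denoises their votes. The spam/ham task and rules below are invented for illustration:

    import pandas as pd
    from snorkel.labeling import labeling_function, PandasLFApplier
    from snorkel.labeling.model import LabelModel

    ABSTAIN, HAM, SPAM = -1, 0, 1

    @labeling_function()
    def lf_keyword(x):
        # Noisy domain rule: "free money" usually signals spam.
        return SPAM if "free money" in x.text.lower() else ABSTAIN

    @labeling_function()
    def lf_short(x):
        # Another weak rule: very short messages are usually ham.
        return HAM if len(x.text.split()) < 4 else ABSTAIN

    df_train = pd.DataFrame({"text": [
        "free money now!!!", "ok thanks", "claim your free money prize",
    ]})
    L_train = PandasLFApplier([lf_keyword, lf_short]).apply(df_train)

    # The label model estimates each function's accuracy and correlations,
    # then combines the noisy votes into one probabilistic label per example.
    label_model = LabelModel(cardinality=2, verbose=False)
    label_model.fit(L_train, n_epochs=100)
    probs = label_model.predict_proba(L_train)

Those probabilistic labels are what you'd then train the final, more general model on.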

Imho, Snorkel-style tools ("weak supervision") are game changers for ML, though the biggies get all the press. So I'm excited to see this end-to-end direction taken by the team.


Hasn't this been done for years, under names like synthetic data generation, simulation, etc.?


Not data generation - label generation. But the charitable interpretation of your question is valid: we've been doing this kind of ensembling to build higher quality models for some time now. What's new, I feel, is that it's getting some good structure, practice, and tooling around it.


Yeah, then advertise it as a tool rather than AI. The problem is that Snorkel is trying to sell snake oil on the name of Stanford and AI. Under the hood it is just a data generation pipeline. Remember, you can't put labels on random data. So "Not data generation. Label generation" makes no sense at all and sounds to me like "brown sugar".


I saw a talk on Snorkel a few years back, so I don't remember it perfectly, but it seemed to be an iterative process. It's a tool for you to build and refine simple rules. If you're labeling recipe ingredients, a simple heuristic like "<number> <units> <ingredient>" will get a lot of them, but there are tons of edge cases. With more heuristics, you might get lots of those, and so on. I think it was a tool to help you explore and iterate on those heuristic labeling functions quickly. Then you can label the stuff that's hard in a more expensive way or something. I thought of it as noisily hand-labeling sets of examples at a time rather than single examples at a time. This is all memory from a random conference talk or paper years ago, so take it with some big grains of salt. I do clearly remember thinking it seemed really cool at the time.
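Something like this, as I remember it (a rough reconstruction - the regex, units, and label names are all made up):

    import re

    # Heuristic for "<number> <units> <ingredient>" lines.
    INGREDIENT_LINE = re.compile(
        r"^\d+(?:\.\d+|/\d+)?\s+(?:cups?|tbsp|tsp|grams?|g|oz)\s+\w+",
        re.IGNORECASE,
    )

    def lf_quantity_unit(line):
        return "INGREDIENT" if INGREDIENT_LINE.match(line.strip()) else "ABSTAIN"

    print(lf_quantity_unit("2 cups flour"))     # INGREDIENT
    print(lf_quantity_unit("1/2 tsp vanilla"))  # INGREDIENT
    print(lf_quantity_unit("a pinch of salt"))  # ABSTAIN - an edge case for the next rule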


A human will label data according to hand-rules or heuristics. What's the difference if a program labels data according to the same hand-rules or heuristics?

The downstream discriminative model's goal is to generalize via supervision.



