
I also have problems understanding what exactly it does. I just briefly skimmed the paper, but it seems like the idea is as follows. Assuming you don't have ground truth labels for your data:

1. Generate many different noisy labels for your data by writing functions. These don't need to be correct, but they should make uncorrelated errors. They basically encode domain knowledge you have about your data.

2. Snorkel takes the output of these functions and, based on their (dis)agreement, builds a generative probabilistic model that estimates each function's accuracy and denoises the combined labels, which may have had overlapping errors.

3. You train your final discriminative model on the output of that probabilistic model.

So, the main idea is to create many noisy labels instead of relying on a single high-quality label, and Snorkel does the hard work of figuring out how to smartly combine these labels so you can train on something clean.
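The three steps above can be sketched in plain Python. This is an illustration of the idea, not Snorkel's actual API: the labeling functions and a simple majority vote stand in for step 2's generative label model, which additionally learns how much to trust each function.

```python
# Step 1: labeling functions (LFs) encode noisy domain heuristics.
# Step 2 (simplified): combine their votes -- here, majority vote stands in
# for the generative model Snorkel actually fits over LF (dis)agreements.
# Step 3 would train a discriminative model on the combined labels.

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_contains_great(text):
    # Heuristic: "great" suggests a positive example.
    return POSITIVE if "great" in text.lower() else ABSTAIN

def lf_contains_awful(text):
    # Heuristic: "awful" suggests a negative example.
    return NEGATIVE if "awful" in text.lower() else ABSTAIN

def lf_exclamation(text):
    # Weak heuristic: exclamation marks lean positive. Often wrong -- that's fine,
    # as long as its errors are mostly uncorrelated with the other LFs.
    return POSITIVE if "!" in text else ABSTAIN

def majority_vote(lfs, text):
    # Simplified combiner: ignore abstentions, take the most common vote.
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)

lfs = [lf_contains_great, lf_contains_awful, lf_exclamation]
print(majority_vote(lfs, "This movie was great!"))     # 1 (POSITIVE)
print(majority_vote(lfs, "Awful plot, awful acting"))  # 0 (NEGATIVE)
```

The payoff of the real generative model over this majority vote is that it weights each LF by its estimated accuracy, so one bad heuristic can't outvote two good ones.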



Yup! The LFs can actually also be correlated, just not too correlated (roughly: think of needing at least three mostly uncorrelated cliques).

Part of the high-level description, though, is that a lot of different parts and lines of work are integrated into Snorkel Flow beyond just this original programmatic labeling idea: programmatic operators for data augmentation, "slicing" or partitioning of data, and the overall end-to-end platform (UI + SDK) supporting iterative development of ML models via this paradigm of programmatic training data.
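For a flavor of what a programmatic data-augmentation operator looks like, here is a hypothetical transformation function in the same spirit (this is an illustrative sketch, not the Snorkel Flow API; the synonym table is made up):

```python
# A transformation function (TF) takes one training example and returns a
# perturbed variant, programmatically growing the training set.
import random

# Hypothetical synonym table for illustration only.
SYNONYMS = {"great": ["excellent", "fantastic"], "bad": ["poor", "terrible"]}

def tf_swap_synonym(text, rng=random.Random(0)):
    # Replace the first word that has a known synonym, yielding a new example
    # that should preserve the original label.
    words = text.split()
    for i, w in enumerate(words):
        if w.lower() in SYNONYMS:
            words[i] = rng.choice(SYNONYMS[w.lower()])
            return " ".join(words)
    return text  # no applicable word: return the example unchanged

print(tf_swap_synonym("a great movie"))
```

The design point is the same as with labeling functions: the operator is code, so it can be inspected, versioned, and applied over the whole dataset rather than hand-crafting examples.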


The description reminds me a bit of learning classifier systems.

Edit: and a bit of fuzzy rule systems. Which just goes to suggest that I am probably well out of my depth.



