At the ODSC presentation I went to last year, the team actually used a vision problem as their canonical example. It's hard to grasp without a concrete one, but the way they explained it makes a lot of sense.

For example, let's assume you want to identify something like a lung tumor. You have many MRI images, and they all largely follow the same template. Using traditional image-processing software like OpenCV, it's surprisingly easy to do coarse-grained tasks programmatically, like: search this image for any circle that's brighter than the surrounding tissue and has a radius greater than, say, x.yz mm. The function returns True if it finds one and False otherwise. That x.yz mm number is what you get from the radiologists you work with to help develop the labeling functions, and this is just _one_ of the labeling functions. But basically it turns out that if you construct a few of these functions with the help of domain experts and then combine them all using the information theory research the Snorkel folks do, you get pretty damn good performance!
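
A minimal sketch of what one such labeling function might look like, assuming Snorkel's Python API (the `labeling_function` decorator is the library's real API; the Hough-circle parameters, the pixel spacing, and the `image` field on each data point are illustrative assumptions, not the team's actual code):

    import cv2
    import numpy as np
    from snorkel.labeling import labeling_function

    TUMOR, NOT_TUMOR = 1, 0
    MIN_RADIUS_MM = 4.0   # stand-in for the radiologist-supplied x.yz mm
    MM_PER_PIXEL = 0.5    # assumed scan resolution

    @labeling_function()
    def lf_bright_circle(x):
        """Vote TUMOR if the slice contains a circle brighter than the
        surrounding tissue with radius above the expert threshold."""
        img = x.image  # assumed: a 2D uint8 grayscale slice
        circles = cv2.HoughCircles(
            img, cv2.HOUGH_GRADIENT, dp=1.2, minDist=20,
            param1=100, param2=30,
            minRadius=int(MIN_RADIUS_MM / MM_PER_PIXEL), maxRadius=0)
        if circles is None:
            return NOT_TUMOR
        for cx, cy, r in np.round(circles[0]).astype(int):
            disk = np.zeros(img.shape, np.uint8)
            cv2.circle(disk, (cx, cy), r, 255, -1)
            ring = np.zeros(img.shape, np.uint8)
            cv2.circle(ring, (cx, cy), 2 * r, 255, -1)
            ring[disk == 255] = 0   # annulus of surrounding tissue
            if not ring.any():
                continue
            if img[disk == 255].mean() > img[ring == 255].mean():
                return TUMOR
        return NOT_TUMOR

And the "combine them all" step is roughly this, again using Snorkel's real combinator API (`df_train` with an `image` column is an assumed input):

    from snorkel.labeling import PandasLFApplier
    from snorkel.labeling.model import LabelModel

    lfs = [lf_bright_circle]   # plus the other expert-derived LFs
    L_train = PandasLFApplier(lfs=lfs).apply(df=df_train)
    label_model = LabelModel(cardinality=2)
    label_model.fit(L_train=L_train, n_epochs=500, seed=123)
    probs = label_model.predict_proba(L_train)  # probabilistic labels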

Yeah, but for some tasks, like recognizing a car, you'd have to write pretty sophisticated code to even get a reasonable result. In those cases it's far easier to use supervised learning and have a NN learn the task for you.

Agreed! As noted in another answer, Snorkel certainly does not work for everything :) And indeed, in many cases it may be easier to express what you know extensionally (label examples) than intensionally (write functions). A lot comes down to the unit cost per label over time, and whether it's more economical to hand-label a bunch of data or to write LFs or similar.

That's btw why a lot of examples of ML today involve data that is (i) simple for non-experts to label, (ii) non-private and therefore easy to outsource for labeling, and (iii) slow to change (e.g. images for self-driving, basic NLP stuff for chat bots, etc.). That kind of data can be labeled cheaply and once, so hand-labeled training sets are (barely) economically feasible to build manually. However, most data is not that easy or cheap to label and needs to be relabeled constantly to adapt to change, so the investment in a programmatic approach is often far better, even if it's certainly not push-button!
