We're already using domain-specific LLMs. The only LLM trained lawfully that I know of, KL3M, is also domain-specific. So, the title is already wrong.
The author is correct that intelligence is compounding. That's why domain-specific models are usually general models converted to domain-specific models by continued pretraining. Even general models, like H2O's, have been improved by constraining them to domain-supporting, general knowledge in a second phase of pretraining. But they end up domain-specific eventually.
Outside LLMs, I think most models are domain-specific: genetics, stock prices, ECG/EKG scans, transmission shifting, seismic, climate, etc. LLMs trying to do everything are an exception to the rule that most ML is domain-specific.
> We're already using domain-specific LLMs. The only LLM trained lawfully that I know of, KL3M, is also domain-specific. So, the title is already wrong.
This looks like an "ethical" LLM, but not a domain-specific one. What is the domain here?
> That's why domain-specific models are usually general models converted to domain-specific models by continued pretraining
I've also wondered about this, as with the case of the Codex models. My hunch is that a good general model with an appropriate system prompt trumps a continued-pretrained model. Which is why even OpenAI sort of recommends using GPT-5.4 over any Codex model.
I was testing them on an HP laptop I bought for $200 with 4GB of RAM.
Windows, its default OS, used so much memory that there was not much left for apps.
Ubuntu used about 500MB less than Windows, per its system monitor. I think it was still 1GB or more. It also appeared to run more slowly than it used to on older hardware.
Lubuntu used hundreds of MB less than Ubuntu. It could still run the same apps but had fewer features in the UI (e.g., search). It ran lightning fast, even with more apps open simultaneously.
(Note: that laptop's Wi-Fi card wouldn't work with any Linux using any technique I tried. Sadly, I had to ditch it.)
I also had Lubuntu on a 10+ year old ThinkPad with a 2nd-gen i7. It's been my daily machine for a long time. The newer USB installers wouldn't work with it. While I can't recall the specifics, I finally found a way to load an Ubuntu-like interface, or Ubuntu itself, through the Lubuntu tooling. It's much slower now but still lighter than default Ubuntu or Windows.
(Note: Lubuntu was much lighter and faster on a refurbished Dell laptop I tested it on, too.)
God blessed me recently through a person who outright gave me an Acer Nitro with an RTX GPU and Windows. My next step is to figure out the safest way to dual-boot Windows 11 and Linux for machine learning without destroying the existing filesystem or shrinking it too far.
Consider a dedicated SSD for each OS. You should have a couple of M.2 slots in the laptop. What you can do is remove (or disable) the Windows SSD, install Linux on the second drive, and then add the Windows drive back. At startup, select the drive you want to boot, and make the one you spend most of your time in the default. I did that on my XPS and it was trouble-free. Linux can mount your NTFS partition just fine, without having to deal with it from a boot/GRUB perspective.
> Ubuntu used 500MB less than Windows in the system monitor.
Those numbers mean nothing when compared across OSes. Depending on how each one counts shared memory and how aggressively it caches, the same figure can mean very different things.
A realistic benchmark would be to open two large applications (e.g., Chrome + Firefox with YouTube and Facebook, to jack up the memory usage), switch between them, and see how the system responds when switching tasks.
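On Linux, you can see directly why raw "used memory" figures mislead: MemFree excludes the page cache, while MemAvailable estimates what's actually reclaimable for apps. A minimal sketch, assuming a Linux system with `/proc/meminfo`:

```python
# Sketch: "used memory" depends on how you count the page cache.
# MemFree excludes cache (so "used" looks inflated); MemAvailable
# estimates what's reclaimable, giving a more realistic number.

def meminfo():
    """Parse /proc/meminfo into a dict of values in kB."""
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            fields[key] = int(rest.strip().split()[0])  # first token is kB
    return fields

m = meminfo()
naive_used = m["MemTotal"] - m["MemFree"]           # counts cache as "used"
realistic_used = m["MemTotal"] - m["MemAvailable"]  # cache is reclaimable
print(f"naive used:     {naive_used / 1024:.0f} MB")
print(f"realistic used: {realistic_used / 1024:.0f} MB")
```

The gap between the two printed numbers is mostly cache, which is exactly the kind of accounting difference that makes cross-OS comparisons meaningless.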
Larger models better understand and reproduce what's in their training set.
For example, I used to get verbatim quotes and answers from copyrighted works when I used GPT-3.5. That's what clued me in to the copyright problem. The smallest models, by contrast, often produced nonsense about the same topics, as small models tend to do.
You might need to devise a new test each time to avoid your old ones being scraped into the training sets. Maybe a new one for each model released after your last test, and totally unrelated to the previous ones, too.
They're great at Python and JavaScript, which have lots of tooling. My idea was to make X-to-safe-lang translators, X initially being Python and JavaScript. Let the tools keep generating what they're good at; the simpler translators make it safe and fast.
If translated to C or Java, we can use decades' worth of tools for static analysis and test generation. While it's in Python or JavaScript, it's easier for humans to analyze and live-debug.
> My idea was to make X-to-safe-lang translators, X initially being Python and Javascript.
Both of those languages are already safe. Then you talk about translating to C, so you're actually doing a safe-to-unsafe translation. I'm not sure what properties you're checking with the static analysis at that point. I think what would be more important is that your translator maintains safety.
I wrote that hastily. I probably should've said high-performance systems languages that can be made safe and compiled into a single executable. Preferably with good support for parallelism and concurrency. That's mostly Rust, or safe subsets of C and C++ with static analysis.
Python can do the algorithms. It's quick to develop and debug. There's tons of existing code in the data science and ML fields. It's worse in the other areas I mentioned, though.
So, a transpiler that generated Rust or safe C/C++ from legacy and AI-generated Python could be a potent combination. What do you think about that?
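To make the idea concrete, here's a toy sketch of one corner of such a transpiler: translating a tiny, int-only subset of Python (single-return functions over arithmetic expressions) into C source via the `ast` module. This is an illustration of the shape of the approach, not the real thing; a real translator would need type inference, memory management, and the rest of the language.

```python
import ast

# Map Python AST operator nodes to C operators (toy subset).
BINOPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}

def expr_to_c(node):
    """Translate an int-valued Python expression AST into a C expression."""
    if isinstance(node, ast.BinOp):
        op = BINOPS[type(node.op)]
        return f"({expr_to_c(node.left)} {op} {expr_to_c(node.right)})"
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Constant):
        return str(node.value)
    raise NotImplementedError(ast.dump(node))

def func_to_c(source):
    """Translate `def f(a, b): return <expr>` into a C function (ints only)."""
    fn = ast.parse(source).body[0]
    args = ", ".join(f"int {a.arg}" for a in fn.args.args)
    body = expr_to_c(fn.body[0].value)  # assumes a single `return <expr>`
    return f"int {fn.name}({args}) {{ return {body}; }}"

print(func_to_c("def axpy(a, x, y):\n    return a * x + y"))
# emits: int axpy(int a, int x, int y) { return ((a * x) + y); }
```

The emitted C is then fair game for the decades of static-analysis and test-generation tooling mentioned above, while the original Python stays around for human debugging.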
You've never seen the full power of static analysis, dynamic analysis, and test generation. The best examples were always siloed, academic codebases. If they were combined and matured, the results would be amazing. I wanted to do that back when I was in INFOSEC.
That doesn't even account for lightweight formal methods: SPARK Ada, the Jahob verification system with its many solvers, Design by Contract, LLMs spitting this stuff out from human descriptions, type systems like Rust's, etc. Speed-run (with AI) producing those, with the unsafe parts checked by the combination of tools I already described.
The siloed codebases I was referring to are the verification tools researchers produce. They're used to prevent attacks. Each tool has one or more capabilities the others lack. If combined, they'd catch many problems.
Examples: the KLEE test generator; combinatorial or path-based testing; CPAchecker; race detectors for concurrency; SIF information-flow control; symbolic execution; the Why3 verifier, which commercial tools already build on.
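The combinatorial-testing idea above can be sketched in a few lines: enumerate every combination of a few parameter values and feed each to the system under test. (Real combinatorial tools use pairwise or t-way covering arrays to avoid the exponential blowup; this brute-force version just shows the concept, and the example parameters are made up.)

```python
from itertools import product

def exhaustive_cases(params):
    """Yield one dict per combination of the given parameter values."""
    names = list(params)
    for values in product(*(params[n] for n in names)):
        yield dict(zip(names, values))

# Hypothetical configuration space for some function under test.
params = {"mode": ["r", "w"], "buffered": [True, False], "size": [0, 1, 4096]}
cases = list(exhaustive_cases(params))
print(len(cases))  # 2 * 2 * 3 = 12 combinations
```

Each generated case would then be run against the target, with failures minimized and reported, which is roughly what the mature tools automate at scale.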
Also, synthetic data and templates to help them discover new vulnerabilities, or to make agents work on things they're bad at. They could differentiate with their prompts or specialist models.
Also, like ForAllSecure's Mayhem, I think they can differentiate on automatic patching that's reliable and secure. Maybe test generation that achieves full coverage, too. They'd become drive-by verification-and-validation specialists who also fix your stuff for you.
I feel like this is actually human-like, but like the average human in the pretraining data. Let's look:
1. They reward short or under-developed essays. I'd say most online content fits that, especially posts with high upvote counts next to them. Social media surely does.
2. For longer posts, the system starts nitpicking minor details, like grammar. We see this even on Hacker News, a community valuing quality, on some longer submissions. Nitpicking is also a debate tactic for derailing an opponent's better arguments, and many such discussions are in the pretraining data.
3. Essays with more praise get higher scores, and those with more criticism get lower scores. The "get on the bandwagon" effect. Echo chambers. One person writes a thing, followed by 5-20 people confirming it. That's probably in the pretraining data. It might survive some filtering/cleaning strategies, too.
So, no, I think these AIs are acting way too human. They need to fine-tune them to act like more reasonable humans. That will initially take RLHF data for many types of situations. Given pretraining bias, they might also have to train the models to drop the bad habits the article mentions.
School-type long essays only seem to exist in academia. I took a "business communication" class in college and we didn't write essays. My life experience since then has supported the "no essays" conclusion.
A long comment online now means one of two things: it's written by a crank with strong opinions, usually only tangentially related to the topic, or by someone with deep knowledge of the subject and a lot of detail to provide. It's usually the former.
I agree with you on how their quality is spread out. But, this...
"School-type long essays only seem to exist in academia."
Does an AI know what an essay is? Would it consider any long, descriptive post an essay, especially if the pretraining data has many people describing long posts as essays or "essay-like"? Or only actual essays? And what is an actual essay, again?
I think AIs might have different interpretations due to the above questions. They might also conflate essays with longer, detailed, or argumentative posts. We'd have to put a bunch of posts into a bunch of AIs and ask how they classify them.
That we're building theories on what's left of mostly-discarded data has scientific implications. Most people hearing the LHC proved something probably didn't realize a preprocessor threw away most observations first. That layer of interpretation could introduce errors.
I wonder how much independent review went into that step.
It's a discussion forum. Saying people are all wrong, with no proof, comes off as arrogant and isn't helpful. If you have links to examples, you can simply say, "Here's some prior art or previous work in this area you all might like."
Christians usually believe only in God, angels, humans, and animals. That would mean intelligent UFOs might be angels or demons. While that's speculation, one guy did an interesting test of it.
Non-believers are much more vulnerable to demonic activity than believers. There's also the idea that distracting them from Christ is all it takes to keep them on the road to Hell. So, UFO sightings should be much higher in areas with non-believers than in areas with Christians. He shared his data here:
https://www.kl3m.ai/