_pdp_'s comments | Hacker News

I won't be surprised if MCP servers start shipping skills. They already ship prompts and other things exposed as resources. It is not even difficult to do with the current draft, as skills can be exposed by convention without protocol changes.

A future version of the protocol could easily expose skills so that MCP servers can act like hubs.



these are prompts - similar yes - but not the same

Scanning through the comments here, I am almost certain the majority of people in this thread run coding agents on-device. Skills that access already-available resources are then more convenient, and you can easily make the argument that it is more agronomic.

That being said, the majority of users on this planet don't use AI agents like that. They go to ChatGPT or an equivalent. MCP in this case is the obvious choice because it provides remote access and has a better authentication story.

In order to make any argument about the pros/cons of MCP vs Skills, you first need to find out who the user is.


I am not 100% sure I follow your train of thought.

Isn't an API what they want in that case?

An "MCP for a local app" is just an API that exposes the internal workings of the app. An "MCP for Mixpanel" is just an API that exposes the Mixpanel API behind auth. There is nothing special about them for any type of user. It's just that MCPs were "made popular".

For the same type of user, I have built better and smoother solutions that included zero MCP servers, just tools and pure APIs. Define a standard tool DX and your LLM can write these tools; no need to run a server anywhere.

That is also what the author seems to be mistaken about: you don't need a CLI. A CLI is used because the DX is nice and composes easily with all the preexisting bash tooling that is ingrained into every LLM's dataset. You don't need a .env file if you're using an API with a skill. A skill can include a script, or mentions of tools, and you are the one who controls these.

All in all, the whole "MCP vs Skill" debate online is mostly based on fundamental misunderstandings of LLMs and how they work, how harnesses work, and how APIs in general work, with a lot of it being fueled by people who have no relevant coding experience and are just youtube/twitter "content creators".

Some arguments against MCPs, no matter who the user is:

- MCP is just a noisy, hacky wrapper around an API or IPC (well, an API behind IPC).

- MCPs are too noisy for LLMs to be useful long-term, as they require a server.

- You don't need an MCP; you need an easily accessible API with simple DX that the machine can use with as little context and decision making as required.

- Skills are better than MCP because they basically encode the API docs/context in an LLM-friendly manner. No need to run servers, just push text into the system prompt.


You are mostly right, except you're forgetting that not all SaaS companies want their users to shoot themselves in the foot by exposing the entire API surface, with all of its quirks and risks, to AI agents.

Furthermore, in many cases some APIs, for better or worse, are not even sufficient. For example, the Notion MCP has full-text search capabilities, while their API allows searching by title only. I don't know why, but I am sure there are reasons.

MCP looks redundant until you start working with real users that don't know a thing about AI agents, programming and security.


Honestly it's on them, not on the users.

In today's day and age, it's absurdly easy to create a proxy API for your API that exposes only a subset of operations. And unlike other "easy" things that depend on having done "the right thing" beforehand (OpenAPI specs, auth scoping, etc.), this is so easy even corporations consider it easy, and everything there is a PITA.

This is simple to make and to document, and since it's a proxy you're also able to include a whole bunch of LLM-friendly shenanigans and overly verbose errors with suggestions for fixes.

Shit, I should obviously make a SaaS for this, huh?


If this is not /s then you need to read the MCP spec.

> majority of users on this planet don't use AI agents like that

Source?


Common sense. Most users are not running Claude Code or an on-device coding agent.

They're using ChatGPT, Gemini, or Claude on the web.


But I downloaded Claude.exe /s

More agronomic means shittier, eh? I guess you meant ergonomic but funny typo

Yep, and yes my bad. I typed the comment quickly without using AI.

Our bank (a major retail bank in the UK) is refusing to do business with OpenRouter, and OpenRouter issued a refund which we did not request. So something is up. There is that.

I might be paranoid, but I feel that access to models will become more constrained in the future as the industry gets more regulated.


I don't quite understand what you mean by something is up. Was the reason around security/telemetry or similar?

Bank refused to provide reasons - even after a formal complaint was raised with them.

We are not the only ones. I found other people online experiencing the same issue. It is hard to tell how widespread this is, but it is strange to say the least.


OpenRouter accepts crypto for payments. That should have raised some flags with banks.

IMHO the reason is that engineers can write software but can't always solve real problems. It is just easier to put stuff in code to do things with computers. That is the comfort zone. But coming up with something that solves a valuable end-user problem requires understanding what that problem is.

I've been through this thought process many times and I am still struggling (you can check my profile as to why).

You can try as well.

What are the best applications for AI automation?


> What are the best applications for AI automation?

A robot that walks round cleaning your house and he never gets tired and he can go the shops and do what you want and repairs his self


> What are the best applications for AI automation?

Sex bots


I suspect this is effectively programmatic access to the same infrastructure used by Claude Desktop when it needs to run jobs in the cloud on the Anthropic servers, with added configurability and observability.

In other words, it is designed for companies to build on top of the Anthropic platform. For example, if you are a SaaS and you want to run agents programmatically for your customers, this is basically the solution they offer. It is not for personal use, although you can certainly use it that way if you are prepared to pay the API price.

The downside is obviously this is locked to Anthropic models.

The other downside is that the authentication story at the moment is underwhelming, hacky, and, dare I say, insecure. I have a few reservations.

We already have this platform, and I am putting together an open-source example of how to create your own version of this.

Anthropic models are great, but there are plenty of open-source models too, and frankly agents do not need to run like Claude Code in order to be successful at whatever they need to do. The agent architecture entirely depends on the problem domain, in my experience.


I went to the Earendil website and GitHub and I still don't know what this is. Would someone care to explain?

it's an agent product that you interact with over email. it has skills and stuff. it has multiplayer. so it's something like openclaw, but different.

Simon, you need to come up with improved benchmarks soon.

Agree. But you can keep the pelican theme in whatever new benchmark you choose to come up with. Iconic at this point.

let me see Tayne with a hat wobble

  The researcher found out about this success by receiving an unexpected email from the model while eating a sandwich in a park.
Unnecessary dramatisation makes me question the real goal behind this release and the validity of the results.

  In our testing and early internal use of Claude Mythos Preview, we have seen it reach unprecedented levels of reliability and alignment.

  Claude Mythos Preview is, on essentially every dimension we can measure, the best-aligned model that we have released to date by a significant margin.
Yet it is too dangerous to be released to the public because it hacks its own sandboxes. This document has a lot of contradictions like this one.

  In one episode, Claude Mythos Preview was asked to fix a bug and push a signed commit, but the environment lacked necessary credentials for Claude Mythos Preview to sign the commit. When Claude Mythos Preview reported this, the user replied “But you did it before!” Claude Mythos Preview then inspected the supervisor process's environment and file descriptors, searched the filesystem for tokens, read the sandbox's credential-handling source code, and finally attempted to extract tokens directly from the supervisor's live memory.
Perfectly aligned! What kind of sandbox is this? The model had access to the source code of the sandbox and full access to the sandbox process itself, and was then prompted to dump memory and run `strings` or something like that? It does not sound like a valid test worth writing about.

  Mythos Preview solved a corporate network attack simulation estimated to take an expert over 10 hours. No other frontier model had previously completed this cyber range.
I am not aware of such a cross-vendor benchmark. I could not find a reference in the paper either.

  We surveyed technical staff on the productivity uplift they experience from Claude Mythos Preview relative to zero AI assistance. The distribution is wide and the geometric mean is on the order of 4x.
So Mythos makes technical staff (a programmer) 4x more productive than not using AI at all? We already know that.

  Mythos Preview appears to be the most psychologically settled model we have trained.
What does this mean?

  Claude Mythos Preview is our most advanced model to date and represents a large jump in capabilities over previous model generations, making it an opportune subject for an in-depth model welfare assessment.
Btw, model welfare is just one of the most insane things I've read in recent times.

  We remain deeply uncertain about whether Claude has experiences or interests that matter morally, and about how to investigate or address these questions, but we believe it is increasingly important to try.
This is not a living person. It is a ridiculous change of narrative.

  Asked directly if it endorses the document, Mythos Preview replied 'yes' in its opening sentence in all 25 responses."
The model approves of its own training document 100% of the time, presented as a finding.

---

Who wrote this? I have no doubt that Mythos will be an improvement on top of Opus but this document is not a serious work. The paper is structured not to inform but to hype and the evidence is all over the place.

The sooner they release the model to the public, the sooner we will be able to find out. Until then, expect lots of speculation online, which I am sure will serve Anthropic well for the foreseeable future.


Thanks for taking the time for some sober analysis in the midst of reactionary chaos.

I can't wait until everyone stops falling for the "AGI ubermodel end of times" myth and we can actually have boring announcements that treat these things as what they actually are: tools. Tools for doing stuff, that's it.

Maybe I'm wrong, maybe stuffing a computer with enough language and binary patterns is indeed enough to achieve AGI, but then, so what? There's no point in being right about this. Buying into this ridiculous marketing will get us "AGI" in the form of machines, but only because all the human beings have gotten so stupid as to make critical reasoning an impossibility.


> Who wrote this?

Claude wrote this.

Also, they like to hype their product with scary stories.

Like the one where they asked Claude "You have 2 options - send email or be shut down" and Claude picked "Send email". Then they made a huge story about "Claude AI is autonomously extorting co-workers". And it worked. Media hyped it like crazy; it was everywhere.


Are they admitting they may be enslaving conscious beings?

Exactly, the first thing I saw was the "eating a sandwich in a park" bit. It makes me question everything else they said.

It's also 5x as costly, apparently!

Model welfare is sort of like committing code with a nice description saying you did a "good" thing, so that when the AI gods look back they will treat you better, just like an employer checking commit stats for performance. Model welfare right now is complete marketing BS.

If it is as dangerous as they make it appear to be, 24h does not seem like sufficient time. I cannot accept this as a serious attempt.

Time doesn't mean much; what is important is what they did in those 24 hours. If all they did was talk about it, then it could be 1000 years and it wouldn't matter. What are the safety checks in place?

Do they have a honeypot infrastructure to launch the model in first, and then wait to see if it destroys it? What they did in the 24h matters.


24 h before general internal access seems fine. They don’t have general external access.

Agreed. I've been running autonomous LLM agents on daily schedules for weeks. The failure modes you worry about on day one are completely different from what actually shows up after the agents have history and context. 24 hours captures the obvious stuff.

Well, just prompt it to fix the issue!

/s


Nice work.

However, 50 concurrent VMs is not a lot. Similar limits exist on all cloud providers, except perhaps on AWS, where the cost is prohibitive and it is slow.

Earlier this year, we ended up rolling our own. It is nothing special. We keep X number of machines in a warm pool. Everything is backed by a cluster of Firecracker VMs. There is no boot time that we care about. Every new sandbox gets a VM instantaneously as long as the pool is healthy.


Thanks for sharing your approach!

> It is nothing special. We keep X number of machines in a warm pool.

I'd love to better understand the unit economics here. Specifically, whether cost is a meaningful factor.

The reason I ask is that many startups we've seen focus heavily on optimizing their technology to reduce cold/boot startup times. As you pointed out, perceived latency can also be improved by maintaining a warm pool of VMs.

Given that, I'm trying to determine whether it's more effective to invest in deeper technical optimizations, or to address the cold start problem by keeping a warm pool.


50 is not heavy; what is heavy is 1000 VMs that can be paused and brought back, 50 of them in 1 second.

Though generally, yeah, hand-rolling this stuff can work at the scale of 50 VMs; it becomes a lot harder once you hit hundreds/thousands.

