Hacker News | fwsgonzo's comments

I actually just published a paper about something like this, which I implemented in both libriscv and TinyKVM: "Inter-Process Remote Execution (IPRE): Low-latency IPC/RPC using merged address spaces".

Here is the abstract: This paper introduces Inter-Process Remote Execution (IPRE), whose primary function is enabling gated persistence for per-request isolation architectures with microsecond-latency access to persistent services. IPRE eliminates scheduler dependency for descheduled processes by allowing a virtual machine to directly and safely call and execute functions in a remote virtual machine's address space. Unlike prior approaches requiring hardware modifications (dIPC) or kernel changes (XPC), IPRE works with standard virtualization primitives, making it immediately deployable on commodity systems. We present two implementations: libriscv (12-14ns overhead, emulated execution) and TinyKVM (2-4us overhead, native execution). Both eliminate data serialization through address-space merging. Under realistic scheduler contention from schbench workloads (50-100% CPU utilization), IPRE maintains stable tail latency (p99 < 5us), while a state-of-the-art lock-free IPC framework shows 1,463× p99 degradation (4.1us to 6ms) when all CPU cores are saturated. IPRE thus enables architectural patterns (per-request isolation, fine-grained microservices) that would otherwise incur millisecond-scale tail latency in busy multi-tenant systems using traditional IPC.

Bottom line: If you're doing synchronous calls to a remote party, IPRE wouldn't require any scheduler mediation. The same applies to your repo. Passing allocator-less structures to the remote is probably a landmine waiting to happen. If you structure both parties to use custom allocators, at least for the remote calls, you can track and even steal allocations (using a shared memory area). With IPRE there is extra risk of stale pointers, because the remote party is removed from the caller's memory after it completes. The paper will explain all the details, but for example, since we control the VMM, we can close the remote session if anything bad happens. (The paper is not out yet, but it should be very soon.)

The best part about this kind of architecture, which you immediately mention, is the ability to completely avoid serialization. Passing a complex struct by reference and being able to use the data as-is is a big benefit. It breaks down when you try to do this with something like Deno, unfortunately. But you could do Deno <-> C++, for example.

For libriscv the implementation is simpler: just loan remote-looking pages temporarily so that read/write/execute works, and then let exception handling deal with abnormal disconnection. With libriscv it's also possible for the host to take over the guest's global heap allocator, which makes it possible to free something that was remotely allocated. You can divide the address space among the possible callers and one or more remotes; then, if you give the remote a std::string larger than SSO, the address reveals the source, and since each source tracks its own allocations, we know if something didn't go right. Note that this is only a personal interest of mine: even though libriscv, for example, is used in large codebases, the remote RPC feature is not used at all, and hasn't been attempted. It's a Cool Idea that kinda works out, but not ready for something high stakes.


Looks like you forgot the URL. Interested.

Best I can do is reply to an e-mail if someone asks for the paper, since it's not out yet. The e-mail ends with hotmail.

> I actually just published a paper...

This gives me an impression that the paper has already been published and is available publicly for us to read.


Sorry about that, the conference was on Feb 2, and it's supposed to be out any day/week now. I don't have a date.

There is a blog-style writeup here: https://fwsgonzo.medium.com/an-update-on-tinykvm-7a38518e57e...

Not as rigorous as the paper, but the gist is there.


Thanks! I'll keep an eye out for the paper.

> Low-latency IPC/RPC using merged address spaces".

Can't this be achieved with a small block of shared memory, between processes that are otherwise isolated?


You either have to pause the caller or prevent the caller from trampling the memory while the callee is using it. If you look at previous work like VMRPC (https://ieeexplore.ieee.org/document/5542746/) they make the shared area read-only while the callee is using it.

In IPRE, I am pausing the caller while the remote call is ongoing, which means "just having a shared block" is inferior to sharing everything. It's just so much easier and nicer to be able to pass literally anything you want.

The caveat is that the callee has to wait, but I think the fact that the remote is now running in the SAME THREAD without any scheduling involved makes up for it. It's a true synchronous remote function call with some overhead.


How much work would it be to use the C++ ONNX run-time with this instead of Python? Is it a Claudeable amount of work?

The iOS version is Swift-based.


shouldn't be hard. what backend/hardware are you interested in running this with? i'll add an example for using a C++ onnx model. btw check out the roadmap, our inference engine will be out in 1-2 weeks and it is expected to be faster than onnx.

I want to run it in a website with Wasm and having the browser do the audio playback

I've been playing with running small models in browser tabs for some time, and finally decided to open some of it.

Added kitten (nano only, for now, will move on to mini) to my "web tts thing": https://github.com/idle-intelligence/tts-web

demo: https://idle-intelligence.github.io/tts-web/web/


desktop CPUs running inference on a single background thread would be the ideal case for what I'm considering.

Same here, also on an island. We lost power for ~8 hours during a storm, but that is the longest I've ever experienced. I have this stone fireplace: https://www.norskkleber.no/ovner/marcello/ (Marcello 140), which kept my 75sqm living room heated through the whole thing.

Since that storm, we have decided to buy a second fireplace for upstairs with a cooking top.


Hey, we should all chat a bit. We all live fairly close to each other.


Gladly! These days I hang out here: https://discord.gg/4e3yd5ej


This is true. A multi-tier JIT compiler requires writable, executable memory and the ability to flush the icache. Loading segments dynamically is nice and covers a lot of ground, but it won't be a magic solution for dynamic languages like JavaScript. Modern WASM emulators already implement a full compiler, linker and JIT compiler in one, almost starting to look like v8. I'm not sure if adding in-guest JIT support is going in the right direction.


It's designed to be low-latency enough that calling into the scripting solution is not considered a high cost. With something like Lua you're likely to hold back a lot, as it has a really high entry/exit cost, and the same is true for calling out to the host. libloong has 40x lower latencies.


Hey, and thanks! libloong is a little bit restrained in its design. It's designed specifically to be the lowest latency sandbox. libriscv is more flexible in that it can load dynamic ELFs and run programs with LuaJIT embedded. I actually haven't been able to run Go programs in libloong yet, but I do want to reach that level!


Merry Christmas to all


I would never have had a working LoongArch emulator in 2 weeks at the kind of quality that I desire without it. Not because it writes perfect code, but because it sets everything up according to my will, does some things badly, and then I can take over and do the rest. The first week I was just amending a single commit that set everything up right and got a few programs working. A week after that it runs on multiple platforms with JIT-compilation. I'm not sure what to say, really. I obviously understand the subject matter deeply in this case. I probably wouldn't have had this result if I ventured into the unknown.

Although, I also made it create Rust and Go bindings. Two languages I don't really know that well. Or, at least not well enough for that kind of start-to-finish result.

Another commenter asked a really interesting question: How do you not degrade your abilities? I have to say that I still had to spend days figuring out really hard problems. Who knew that 64-bit MinGW has a different struct layout for gettimeofday than 64-bit Linux? It's obvious in hindsight, but it took me a really long time to figure out that this was the issue, when all I had to go on was something that looked like incorrect instruction emulation. I must have read the LoongArch manual up and down several times and gone through instructions one by one, disabling everything I could think of, before finally landing on the culprit: a mis-emulated, kind-of-legacy system call that tells you the time. ... and if the LLM had found this issue for me, I would have been very happy about it.

There are still unknowns that LLMs cannot help with, like running Golang programs inside the emulator. Golang has a complex run-time that uses signal-based preemption (sysmon) and threads and many other things, which I do emulate, but there is still something missing to pass all the way through to main() even for a simple Hello World. Who knows if it's the ucontext that signals can pass or something with threads or per-state signal state. Progression will require reading the Go system libraries (which are plain source code), the assembly for the given architecture (LA64), and perhaps instrumenting it so that I can see what's going wrong. Another route could be implementing an RSP server for remote GDB via a simple TCP socket.

As a conclusion, I will say that I can only remember two times I ditched everything the LLM did and just did it myself from scratch. It's bound to happen, as programming is an opinionated art. But I've used it a lot just to see what it can dream up, and it has occasionally impressed. Other times I'm in disbelief as it mishandles simple things, like avoiding an extra masking operation by moving something signed into the top bits so that extracting it is a single shift, while sharing space with something else in the lower bits. Overall, I feel like I've spent more time thinking about high-level things (and occasionally low-level optimizations).


Just have a look at r/anthropic. It's well known that you hit the limit after hardly any usage at all with Pro (effectively a demo). A chargeback is the only thing they will understand.


I agree. I also challenge readers to watch TV broadcasts of politicians speaking in the 70s, 80s and even 90s. You won't believe your ears. But the slow takeover of the world by international conglomerates buying up everything else, merging with and bankrupting the competition, just doesn't seem to be on the mind of anyone with the power to deal with it. An acquaintance works at one of these Frankenstein's monsters, and there is a hodgepodge of internal systems. It's hard to believe how many companies they have bought up over the decades.


"There is no America. There is no democracy. There is only IBM, and ITT, and AT&T, and DuPont, Dow, Union Carbide, and Exxon. Those are the nations of the world today. What do you think the Russians talk about in their councils of state, Karl Marx? They get out their linear programming charts, statistical decision theories, minimax solutions, and compute the price-cost probabilities of their transactions and investments, just like we do. We no longer live in a world of nations and ideologies, Mr. Beale. The world is a college of corporations, inexorably determined by the immutable bylaws of business. The world is a business, Mr. Beale. It has been since man crawled out of the slime."

- Network (1976)


