The project relies on Rayon [1] for scheduling parallel tasks and Cranelift [2] to JIT the hot loops.
There are plenty of other interesting features like auto-FFI, bytecode caching (similar to Python's .pyc files), and "daemonize" mode (similar to mod_perl or FastCGI).
The one piece of advice I give new PhD students is to maintain a list of your references for a bibliography ahead of time. For every paper you read, copy the citation in BibTeX format and write a couple of sentences to remind yourself what the paper was about. Do this for every source, even if it doesn't seem important at the time.
Use Zotero and Better BibTeX. By all means type a comment so you know which ideas came from where, but I'm a big advocate of taking notes by hand when you really want to understand something, as opposed to reminding your future self about something you already understand.
Not within a PhD, but as a side project I work on a research project on Wikiversity about grammatical gender in French. It references a bunch of books and academic works, probably a hundred I'd guess. The most tedious work, though, is checking which nouns are used only in a single gender or do have some epicene or specific inflection used in the wild, and giving a reference that attests to that when it's not already so consensual that most general-public dictionaries would document the fact. For that, the research refers to thousands of webpages. I'm glad that most of the time I just need to drop the DOI, ISBN, or page URL and MediaWiki will handle the filing of the most relevant fields. It's not perfect: it currently generates the output with many different templates (some don't have an excerpt field), some required fields might be left blank, URLs to PDFs won't work, and so on. But all in all it makes the process of taking note of the reference quick without getting too much in my way. Creating a structured database out of it can certainly be done later.
Zotero and AI have this covered now. If there's one thing AI is good at, it's summarising crappily formatted papers. I've never understood the 2- and 3-column thing. Horrendous way to format something.
2-column format has narrower columns, which means that your eyes move more vertically than horizontally while reading it. That is considered conducive to “skimming” long texts if you’re a “speed reader”.
Do you mean that you’re using AI as a search engine for your local bibliography? I haven’t seen any AI plugins for Zotero.
I'm severely dyslexic and the columns are a massive hindrance for me, and I also cannot skim-read due to this and Meares-Irlen syndrome. So my dislike is not universally applicable, just personal experience.
On the Zotero front there are a bunch of AI plugins, but I've not used them. The premise is that you can speak to and ask your library questions; some are set up differently though. Personally I can fire a paper into an LLM and get a good idea of the content immediately, and then ask questions about it. It's more interactive and lets me get a better idea of it prior to reading it.
LLMs make too many mistakes when summarizing papers in their current state; I would never trust one to summarize a whole paper.
I only use it on a sentence or paragraph basis, otherwise it misses the point 90% of the time.
I would strongly advise against this use for the moment.
The important part of reading a paper is not only to extract general rules, but to build your own internal model. Without it you cannot effectively do research. The main interesting points are often in the subtleties of the details deep in the paper.
Internal thoughts that come easily to mind when I read:
- 'Oh, they used that equation, but it could also be interpreted totally differently. What happens if we change the point of view, does it make sense from this other perspective?'
- 'I see they claim to achieve better results than SOTA, but actually they compared with other methods that are not solving exactly the same problem. What shortcuts or changes did they have to make to obtain a fair comparison? Is it a fair comparison, and can I trust those numbers?'
- 'oh, the authors didn't realize that they solved this other problem, or did they realize but there was a block somewhere preventing it?'
- 'I like this trick to achieve that result, but at the same time it will prevent solving a whole class of other problems, so their method will not work in those cases.'
...
Also, notice that a paper IS a summary of multiple months or years of work, and researchers have already condensed it to the maximum to stay within the page limit; by summarizing a summary you will always miss many things.
I have found a lot of pearls casually buried in papers that there is no way a summary, either human or AI, would extract. Things like changing a method slightly, or recovering an old method to apply to a current problem, mentioned as if it were unimportant, but actually you have a project blocked in a similar spot.
Fair points, and likely why I'm not suited to academia either. I've just never really grokked the practice. Obviously I only have experience from bachelor's and master's, but it always seemed that you have an idea and the research is finding papers to back it up, and then some that might not. The work you do doesn't really seem to matter; it's secondary to the nonsense around "the literature".
I have a bachelor's of science (first) in computer science, and I'm currently doing a dissertation for a master's in cyber security, en route for a first, though that might change depending on the mark for this dissertation.
My experience with the bachelor's was that despite my project being derailed by the bullshit around formatting the document, doing "research" by searching the library for peer-reviewed papers that backed up my claims, etc.; I got an excellent mark. In short, I set out to make something and, due to the academic processes, failed to make anything, but because I was able to critically reflect on it, I got a good mark. A waste of time, unless all you wanted was a good mark.
For my master's, I know the project doesn't matter; I'm concentrating on the academic nonsense because that's where the marks are.
The work you were given in your undergraduate and master’s was not research, it was homework. The task was critical reflection, which is repeatable and achievable for students; whereas research is expensive, one off, and generally out of reach for undergrads, and requires intensive oversight by an experienced researcher.
The waste of time would be for a professor to train you up to be a researcher before you’ve proven you are ready, hence the homework assignments.
If that's the case, and researching is way above master's level, then how do you get onto a PhD? Genuine question. If everything I've done to date is a pale imitation of the real thing, how can I make a fair assessment as to whether I want to pursue a PhD?
You don’t really, which is why a lot of people become researchers only to discover they hate it. But that’s true of all things.
I think the way to know if you want to be a researcher is more along the lines of: do you like finding the answers to questions no one has thought to ask, let alone answered? If so, then it doesn’t really matter what training you’ve had or how much of the field you’ve experienced; you can focus on that bit as your guiding force.
No, it's not about master's versus PhD; it's about whether you did something new (the novelty aspect). It sounds to me like you did a coursework master's of some kind, which gave you some basic literature-analysis projects. This is like the first month of any research project, and it's there so you understand the context of the project. The actual work is doing the novel thing and dealing with the repeated failures.
My suggestion is do a summer research project, and see if you enjoy it. If no-one will take you on, reflect on why that is (and to me that's a strong reason not to do it).
There are a few. I use ZoteroGPT to extract things (e.g. methods, sample size, species, etc.) from a bunch of papers or a collection. I don't use it for summaries.
I feel like that's true when the font is insanely small, which I guess was good when people would print entire proceedings.
Reading two-column, super-small font on a computer is super annoying though, tbh.
The comments you write into Zotero are not what the paper is about (the abstract covers that well enough); they're about what you found interesting or useful about the paper.
I have had some fun exhuming my old LaTeX skills and assembling a BibTeX bibliography from which I automatically extract the right entries presented in whichever style is needed for a given paper and for my own (HTML) site. I even publish the collection in Zenodo in case useful to others. I use the 'annote' field for the reminder you suggest.
Ha! You just made me remember how much I used JabRef (open source bibtex reference app) back in 2004 when I did my PhD.
It was the best/worst 4 years of my life. I studied overseas (uk), met my future wife and got a PhD that really wasn't useful for much to me. Fortunately it was under a scholarship.
The lack of good tools for keeping research notes with good search is kind of mind-boggling. I have reverted to having a private website for myself that I run on my machine, using MkDocs, which comes close to what I would want.
Presumably the idea is that you put the relevant parts of the list in your thesis. You need to convince your examiner that you understand the background to the original research you did, and a solid reference list (with supporting text in the introductory/background section of your thesis) is part of doing that.
Personally I did the references at the end and didn't feel like I suffered from that decision, but the key references in my particular area were a relatively small and well-known set.
Hmm, yeah. I mean, you often see huge reference lists, which always makes me feel like the person can't possibly be actually well acquainted with the stuff being referenced. So who are you really fooling? It seems all very performative, though I guess I understand the motivation.
> I think it would be the best to start interpreting the query and start compilation in another thread
This technique is known as a "tiered JIT". It's how production virtual machines operate for high-level languages like JavaScript.
There can be many tiers, like an interpreter, baseline compiler, optimizing compiler, etc. The runtime switches into the faster tier once it becomes ready.
It’s also common for JITs to sprout a tier and shed a tier over time, as the last and first tiers shift in cost/benefit. If the first tier works better you delay the other tiers. If the last tier gets faster (in run time or code optimization) you engage it sooner, or strip the middle tier entirely and hand half that budget to the last tier.
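A minimal two-tier scheme along these lines can be sketched in a few lines of Python. Everything here is illustrative: the toy AST, the `HOT_THRESHOLD` value, and the `eval`-based "compile" step are stand-ins for real machinery, not how a production VM works. Tier 0 interprets; once a function is hot, tier 1 swaps in generated code.

```python
# Illustrative two-tier runtime: interpret until hot, then "compile".
HOT_THRESHOLD = 3

def interpret(node, env):
    """Tier 0: walk a tiny expression AST directly."""
    op = node[0]
    if op == "const":
        return node[1]
    if op == "var":
        return env[node[1]]
    if op == "add":
        return interpret(node[1], env) + interpret(node[2], env)
    if op == "mul":
        return interpret(node[1], env) * interpret(node[2], env)
    raise ValueError(op)

def emit(node):
    """Tier 1: lower the same AST to Python source text."""
    op = node[0]
    if op == "const":
        return repr(node[1])
    if op == "var":
        return node[1]
    sym = {"add": "+", "mul": "*"}[op]
    return f"({emit(node[1])} {sym} {emit(node[2])})"

class TieredFunction:
    def __init__(self, params, body):
        self.params, self.body = params, body
        self.calls = 0
        self.compiled = None  # populated once the function becomes hot

    def __call__(self, *args):
        self.calls += 1
        if self.compiled is None and self.calls >= HOT_THRESHOLD:
            src = f"lambda {', '.join(self.params)}: {emit(self.body)}"
            self.compiled = eval(src)  # stand-in for real codegen
        if self.compiled is not None:
            return self.compiled(*args)
        return interpret(self.body, dict(zip(self.params, args)))

# (x + 2) * y
f = TieredFunction(["x", "y"],
                   ("mul", ("add", ("var", "x"), ("const", 2)), ("var", "y")))
```

Both tiers compute the same answers; the only observable difference is which machinery runs, which is exactly the invariant a real tiered JIT has to preserve.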
I first encountered q/kdb+ at a quant job in 2007. I learned so much from the array semantics about how to concisely represent time-series logic that I can't imagine ever using a scalar language for research.
Fun fact: the aj (asof join) function was my inspiration for pandas.merge_asof. I added the extra parameters (direction, tolerance, allow_exact_matches) because of the limitations I kept hitting in kdb.
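For anyone who hasn't used it, a toy `merge_asof` call exercising those three parameters might look like this (integer times and made-up symbols, purely for illustration; both frames must be sorted on the `on` column):

```python
import pandas as pd

trades = pd.DataFrame({"time": [1, 5, 10], "sym": ["a", "a", "a"],
                       "price": [100.0, 101.5, 102.0]})
quotes = pd.DataFrame({"time": [0, 4, 9], "sym": ["a", "a", "a"],
                       "bid": [99.0, 100.0, 101.0]})

# For each trade, take the most recent quote at or before the trade time,
# per symbol, but only if it is no more than 2 time units old.
joined = pd.merge_asof(trades, quotes, on="time", by="sym",
                       direction="backward", tolerance=2,
                       allow_exact_matches=True)
```

Setting `direction="forward"` or `"nearest"` instead changes which side of the timestamp is searched, which is one of the generalizations over kdb's `aj` mentioned above.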
The aj function at its heart is a bin (https://code.kx.com/q/ref/bin/) search between the two tables, on the requested columns, to find the indices of the right table to zip onto the left table.
aj[`sym`time;t;q]
becomes
t,'(`sym`time _q)(`sym`time#q)bin`sym`time#t
The rest of the aj function internals are there to handle edge cases: missing columns and options for filling nulls.
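That bin-style search can be sketched in plain Python with `bisect`. This is a simplified single-column version (time only), whereas the real `aj` matches on several columns at once:

```python
from bisect import bisect_right

def asof_indices(left_times, right_times):
    """For each left time, the index of the last right time <= it,
    or -1 if there is none. right_times must be sorted ascending."""
    return [bisect_right(right_times, t) - 1 for t in left_times]

def asof_join(left, right, key):
    """Zip the matched right rows onto the left rows; left wins on
    key collisions, and unmatched left rows pass through untouched."""
    idx = asof_indices([row[key] for row in left],
                       [row[key] for row in right])
    return [(({**right[i]} if i >= 0 else {}) | row)
            for row, i in zip(left, idx)]
```

Each left row costs one O(log n) search, which is the same shape of work the `bin`-based q expression above performs.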
A lot of the joins can be distilled to the core operators/functions in a similar manner. For example the plus-join is
I couldn't figure out how Arthur's bin matched on symbol though, so I switched to a linear scan on the right table to record the last-seen index for each "by" element. While it worked, my hash table was messy because I relied on Python to handle a whole tuple as a key, which had some issues during initial testing.
The asof join I wrote for Empirical properly categorizes the keys before they are matched. That approach worked far better.
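A sketch of that last-seen scan, simplified to a single "by" column (hypothetical column names; both inputs assumed sorted on the time column):

```python
def grouped_asof(left, right, by, on):
    """Single-pass asof join: walk both tables in time order, keeping
    the most recent right row per 'by' key."""
    last_seen = {}  # by-key -> most recent right row seen so far
    out, j = [], 0
    for row in left:
        # advance the right-table cursor up to this left row's time
        while j < len(right) and right[j][on] <= row[on]:
            last_seen[right[j][by]] = right[j]
            j += 1
        match = last_seen.get(row[by], {})
        out.append({**match, **row})
    return out
```

Because the cursor only moves forward, the whole join is O(n + m) over pre-sorted inputs, trading the per-row binary search for a merge-style sweep.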
The article points out that tools like TLA+ can prove that a system is correct, but can't demonstrate that a system is performant. The author asks for ways to assess latency et al., which is currently handled by simulation. While this has worked for one-off cases, OP requests more generalized tooling.
It's like the quote attributed to Don Knuth: "Beware of bugs in the above code; I have only proved it correct, not tried it."
From my point of view, they cannot even prove that, because in most cases there is no validation that the TLA+ model actually maps to, e.g., the C code that was written.
I only believe in formal methods where there is a machine-validated path from model to implementation.
I had been thinking about this idea for a long time, but I doubt I'll be able to get around to it. After speaking with a friend this evening, I decided to just jot it down for anyone interested.
Basically, use copy-and-patch compilation in a vector language to fuse loops and avoid temporaries. It can be employed for a baseline compiler that will use less memory than an interpreter and will have a much lower startup cost than an optimizing compiler.
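The payoff of fusion can be shown without any of the copy-and-patch machinery. The point is just that the fused form runs one loop and materializes no intermediate vectors; this is a plain-Python stand-in, not the proposed compiler:

```python
# Unfused: each elementwise step materializes a temporary vector.
def unfused(xs):
    t1 = [x * 2 for x in xs]   # temporary #1
    t2 = [x + 1 for x in t1]   # temporary #2
    return sum(t2)

# Fused: one loop, no temporaries. A copy-and-patch baseline compiler
# would stitch pre-compiled stencils for '*2', '+1', and the reduction
# into a single loop body; here a generator stands in for that loop.
def fused(xs):
    return sum(x * 2 + 1 for x in xs)
```

Both compute the same result; the difference is memory traffic, which is exactly what a vector-language interpreter pays for and a fusing baseline compiler avoids.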
It can infer the column names and types from a CSV file at compile time.
Here's an example that misspells the "ask" column as if it were plural:
let quotes = load("quotes.csv")
sort quotes by (asks - bid) / bid
The error is caught before the script is run:
Error: symbol asks was not found
I had to use a lot of computer-science techniques to get this working, like type providers and compile-time function evaluation. I'm really proud of the novelty of it and even won Y Combinator's Startup School grant for it.
Unfortunately, it didn't go anywhere as a project. Turns out that static typing isn't enough of a selling point for people to drop Python. I haven't touched Empirical in four years, but my code and my notes are still publicly available on the website.
I love how you really expanded on the idea of executing code at compile time. You should be proud.
You probably already know this, but for people like me to switch, "all" it would take would be:
1. A plotting library like ggplot2 or plotnine
2. A machine learning library, like scikit
3. A dashboard framework like streamlit or shiny
4. Support for Empirical in my cloud workspace environment, which is Jupyter based, and where I have to execute all the code, because that's where the data is and has to stay due to security
Just like how Polars is written in Rust and has Python bindings, I wonder if there's a market for 1 and 2 written in Rust and then having bindings to Python, Empirical, R, Julia etc. I feel like 4 is just a matter of time if Empirical becomes popular, but I think 3 would have to be implemented specifically for Empirical.
I think the idea of statically typed dataframes is really useful and you were ahead of your time. Maybe one day the time will be right.
The inferencing logic needs to sample the file, so (1) the file path must be determined at compile time and (2) the file must be available to be read at compile time. If either condition fails (for example, the filename is a runtime parameter), then the user must supply the type in advance.
There is no magic here. No language can guess the type of anything without seeing what the thing is.
Yeah, I think that's what limits the utility of such systems. Polars does type checking at query-planning time, so before you really do computation. I don't expect that much can improve over this model, due to the aforementioned limitations.
I think needing network access or file access at compile time is a semi-hard blocker for statically typed dataframes.
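The kind of pre-execution check being discussed can be sketched as a name check against a known schema (the `check_query` helper and schema are hypothetical; real planners check types and expressions too, not just column names):

```python
# Hypothetical planning-time check: reject a query that references
# columns absent from the table's schema, before any data is touched.
schema = {"sym": str, "bid": float, "ask": float}  # made-up table schema

def check_query(columns, schema):
    missing = [c for c in columns if c not in schema]
    if missing:
        raise NameError(f"symbol {missing[0]} was not found")
    return True
```

The catch raised above still applies: this only works once the schema is known, which is exactly what requires either compile-time file access or a user-supplied type.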
> Jobs are being eliminated within the IT function which are routine and mundane, such as reporting, clerical administration.
I've had a similar thought recently, that there is no demand for rote programmers. No employer is going to hand you a completed spec and tell you to code it up.
Software engineers and data scientists today must be innovative, understand the business they operate in, communicate with users, and work cross functionally. You’ve got to create something original and see it through without having to be told what to do at every step.
Personally, I'd classify jobs such as reporting and clerical administration more as 'admin' than IT. You do some of that as a SWE (tickets, SoWs, design docs) but that's not generally the focus of the work.
This op-ed was written by an undergrad and complains that Northeastern's switch to Python (from Racket) for its introductory classes will prevent students from learning fundamentals of computer science.
But that complaint can be made about any language! "This dynamically typed language won't allow students to understand type safety." "This high-level language won't allow students to learn pointers and systems programming." Etc.
I believe that an intro course should get students coding since the first major hurdle is learning how to construct any kind of program at all. The switch to a more "employable" language isn't going to make education worse.
Tell me you haven't read the article (or used racket) without telling me.
> I believe that an intro course should get students coding since the first major hurdle is learning how to construct any kind of program at all. The switch to a more "employable" language isn't going to make education worse.
None of this is the issue at hand. The switch to python is because industry uses it. The article correctly makes the point that racket was intentionally designed to get students coding as easily and quickly as possible. It has multiple steps of teaching languages for exactly that purpose, introducing concepts in ways that let students grapple with them one at a time in an interactive environment.
Meanwhile in python complex topics like duck typing, object oriented methods, exceptions, the distinction between iterables and lists, how to use a command line/terminal or how to configure an IDE, and so on must be covered before people can start writing code for the exercises. Racket is streamlined for beginners.
> Meanwhile in python complex topics like duck typing, object oriented methods, exceptions, the distinction between iterables and lists, how to use a command line/terminal or how to configure an IDE, and so on must be covered before people can start writing code for the exercises.
No, they don't have to be at all. You might as well suggest you need to learn the JVM before writing a line of Java.
Python supports imperative, OO, and functional programming paradigms. And to start you can use any text editor; an IDE is not required. In fact you can start working in the REPL right away, in which case all you need is a terminal and the command "python".
To quote the above person: "tell me you haven't read the article without telling me".
You think that supporting multiple "programming paradigms" is a nice thing, but it's the opposite for teaching beginning students. Experienced programmers want expressivity/customization/choices to do whatever they want. That's not what newbies need when they get stuck on an assignment.
In this case, you can find the same criticisms in published articles and books. I expect this student heard them straight from the source (the author of those articles or books). That does not lessen their impact or correctness, in my opinion.
[1] https://docs.rs/rayon/latest/rayon/
[2] https://cranelift.dev