I'm surprised it took OpenAI this long to launch scheduled tasks, but as we've seen from our users[0], pure LLM-based responses are quite limited in utility.
For context: ~50% of our users use a time-triggered Loop, often with an LLM component.
Simple stuff I've used it for: baby name idea generator, reminder to pay housekeeper, pre-natal notifications, etc.
We're moving away from cron-esque automations as one of our core value props (most new users use us for spinning up APIs really quickly), but the base functionality of LLM+code+cron will still be available in (and migrated to!) the next version of our product.
> None of these require an LLM. It seems like you own this service yet can't find any valuable use for it.
Sorry? My point was that these are the only overlapping features I've personally found useful that could be replaced with the new scheduled tasks from ChatGPT.
Even these shouldn't require an LLM. A simple cron+email would suffice.
The web scraping component is neat, but for my personal use cases (tide tracking) I've had to use LLM-generated code to get the proper results. A pure LLM struggled to follow the rules I wanted (tide less than 1 ft, between sunrise and sunset): sometimes it would get it right, sometimes it would not.
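To be concrete about why code wins here: once the LLM has generated it, the filtering logic is deterministic. A minimal sketch in Python, assuming you already have tide predictions and sunrise/sunset times from whatever source you scrape (all names and values below are illustrative, not my actual setup):

    from datetime import datetime

    # Hypothetical inputs: tide predictions as (time, height_ft) pairs,
    # plus sunrise/sunset for the same day from whatever API or page you scrape.
    def daylight_low_tides(predictions, sunrise, sunset, max_height_ft=1.0):
        # Keep predictions under max_height_ft that fall between sunrise and sunset.
        return [
            (t, h) for (t, h) in predictions
            if h < max_height_ft and sunrise <= t <= sunset
        ]

    predictions = [
        (datetime(2025, 1, 15, 6, 40), 2.3),   # too high
        (datetime(2025, 1, 15, 12, 55), 0.4),  # qualifies: low tide, midday
        (datetime(2025, 1, 15, 19, 10), 0.8),  # low enough, but after sunset
    ]
    sunrise = datetime(2025, 1, 15, 7, 20)
    sunset = datetime(2025, 1, 15, 17, 45)

    print(daylight_low_tides(predictions, sunrise, sunset))

Run on a schedule, the output either triggers a notification or it doesn't; there's no "sometimes" about it.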
For our customers, purely scheduling an LLM call isn't that useful. They require pairing multiple LLM and code execution steps to get repeatable and reliable results.
> ChatGPT tasks will become a powerful tool once incorporated into GPTs.
> Baby name generator: why would this be a scheduled task? Surely you aren't having that many children... :)
So far it's helped name two children :) -- my wife and I like to see the same 10 ideas each day (via text), so that we can discuss what we like/don't like daily. We tried the "sift through 1000 names" approach and it didn't work well for us.
> Reminder to pay, notifications: what value does OpenAI bring to the table here over other apps which provide calendar / reminder functionality?
That's exactly my point. Without further utility (i.e. custom code execution), I don't think this provides a ton of value at present.
I totally remember 482 (Operating Systems, for those reading) being really interesting. One story I remember is from one of the final projects, dealing with locks in the C++ world: I'd get close to a full solution but still have some errors from the locks, then I'd make a change and suddenly the previously failing tests would pass while new ones failed. I didn't realize that could happen.
Great times. And I really liked how we did it all in C++ (other than computer vision, 442, which was in MATLAB) rather than Python, which some places use. Having that lower-level understanding of languages in school makes understanding code so much easier, and it's something I didn't have to learn on my own.
I did a pass of the codebase and it seems they’re just forking processes?
It’s unclear to me where the safety guarantees come from (compared to using e.g. KVM).
Edit: it appears the safety guarantees come from libriscv[0]. As far as I can tell, these sandboxes are essentially RISC-V programs running in an isolated context (“machine”) where all the Linux syscalls are emulated and thus “safe.” Still curious what potential attack vectors may exist?
Well, the boundary between the host and the guest, the system call API, is always going to be the biggest attack vector no matter what solution is used. But if you find a problem and fix it, you're back to being safe again, unlike if you don't have any sandboxing at all. You can also put the whole solution in a jail, which is very common nowadays.
We do something similar internally[0], but specifically for security concerns.
We’ve found that by having the LLM provide a “severity” level (simply low, medium, or high), we’re able to filter out all the nitpicky feedback.
It’s important to note that this severity level should be specified at the end of the LLM’s response, not the beginning or middle.
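The pattern is roughly this (a minimal sketch, not our actual pipeline; the prompt wording, parsing, and threshold are illustrative):

    SEVERITY_INSTRUCTION = (
        "After your review comment, end your response with a single final line "
        "of the form: SEVERITY: low | medium | high"
    )

    def parse_severity(llm_response: str) -> str:
        # Pull the severity from the last line of the response; default to low.
        last_line = llm_response.strip().splitlines()[-1].lower()
        for level in ("high", "medium", "low"):
            if level in last_line:
                return level
        return "low"

    def keep_comment(llm_response: str, keep=("medium", "high")) -> bool:
        # Filter out nitpicky feedback by only keeping medium/high severity.
        return parse_severity(llm_response) in keep

    response = "Consider renaming `tmp` to something descriptive.\nSEVERITY: low"
    print(keep_comment(response))  # False -- dropped as a nitpick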
There’s still an issue of context, where the LLM will provide a false positive due to unseen aspects of the larger system (e.g. make sure to sanitize X input).
We haven’t found the bot to be overbearing, but mostly because we auto-delete past comments when changes are pushed.
Another variation on this is to think about tokens and definitions. Numbers don’t have inherent meaning for your use case, so if you use numbers you need to provide an explicit definition of each rating number in the prompt. Similarly, and more effectively, you can use labels such as low-quality, medium-quality, and high-quality, again providing an explicit definition of each label. One step further is to use explicit, self-describing labels (along with detailed definitions) such as “trivial-observation-on-naming-convention” or “insightful-identification-on-missed-corner-case”.
Effectively, you are turning a somewhat arbitrary numeric “rating” task into a multi-label classification problem with well-defined labels.
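For example (the first two label names echo the ones above; the third label and all definitions are made up for illustration and should be tailored to your own review guidelines):

    # Sketch of a self-describing label set with explicit definitions.
    LABELS = {
        "trivial-observation-on-naming-convention":
            "Comment only concerns naming or stylistic preference.",
        "insightful-identification-on-missed-corner-case":
            "Comment identifies an input, state, or failure mode the change does not handle.",
        "possible-security-issue":  # hypothetical extra label
            "Comment points at injection, unsafe deserialization, missing auth checks, etc.",
    }

    PROMPT_SUFFIX = "Classify your comment with exactly one of these labels:\n" + "\n".join(
        f"- {name}: {definition}" for name, definition in LABELS.items()
    )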
The natural evolution is then to train a BERT-based classifier or similar on the set of labels and comments, which will get you a model judge that is super fast and can achieve good accuracy.
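A minimal sketch of that step with Hugging Face transformers, assuming you’ve accumulated (comment, label) pairs from the LLM judge; the base model, label count, and training data below are placeholders:

    import torch
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    MODEL = "distilbert-base-uncased"   # placeholder base model
    NUM_LABELS = 3                      # one per self-describing label

    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=NUM_LABELS)

    class CommentDataset(torch.utils.data.Dataset):
        # Wraps (comment_text, label_id) pairs collected from the LLM judge.
        def __init__(self, texts, labels):
            self.enc = tokenizer(texts, truncation=True, padding=True)
            self.labels = labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    # Toy data; in practice this would be thousands of judged comments.
    train_ds = CommentDataset(
        ["Consider renaming tmp to something descriptive",
         "Missing bounds check when the list is empty"],
        [0, 1],  # indices into your label set
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="judge-classifier", num_train_epochs=3),
        train_dataset=train_ds,
    )
    trainer.train()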
If they are made by cutting a ring shape out of wood, the grain is too weak for long-term wear.
A more common method for wooden rings is to cut a long, thin rip at 1/16". Soak it in water for 30 minutes. Wrap it around something finger-sized, put a rubber band around it, and let it dry. You can get a good imitation of a glossy epoxy finish with CA/super glue. This gives a lot more strength than a cutout.
I don't think that is true: I build and restore both wooden and fiberglass boats with epoxy, and have used it in almost every possible way. There are different thicknesses of epoxy with different properties, but the ones specifically designed for penetrating deeply into wood, such as clear penetrating epoxy sealer, will indeed penetrate extremely deep; the manufacturer claims 9-16". In practice, almost any epoxy will penetrate at least 1" into wood.
If anything, epoxy often has too much penetration, and I end up doing a first coat or two that disappear fully into the wood, and another thickened one so it actually stays on the surface or joint.
Yes, but that's generally not something you want to be doing the week before a wedding. It's _very_ easy to forget to do, and hard for the best man to run around and fix while you panic.
Curious how these numbers correlate to the estimates of the engineers behind the PRs?
For example, the first PR is estimated at ~15 “hours of work for an expert engineer”.
Looking at the PR, it was opened on Sept 18th and merged on Oct 2nd. That's two weeks, or 10 working days, later.
Between the initial code, the follow up PR feedback, and merging with upstream (8 times), I would wager that this took longer than 15 hours of work on the part of the author.
It doesn't _really_ matter, as long as the metrics are proportional, but it may be better to refer to them as isolated complexity hours, as context-switching doesn't seem to be properly accounted for.
Yeah, maybe "expert engineer" is the wrong framing and it should be "oracle engineer" instead - you're right that we're not accounting for context switching (which, to be fair, is not really productive, right?)
However ultimately the meaning isn't the absolute number but rather the relative difference (e.g. from PR to PR, or from team to team) - that's why we show industry benchmarks and make it easy to compare across teams!