The Kitten Programming Language (kittenlang.org)
113 points by andrenth on April 10, 2021 | 84 comments



Looks like Haskell dreaming of FORTH.

Quite fun and surprisingly readable to this old FORTH programmer. Though List<Char> for strings had me coughing up a fur ball.


Interesting idea! It looks like the Git repo doesn't get many updates, at least not on the master branch. Is it "dead", or is it "stable"?

Edit: it looks like as of 2018, the author was taking time off of working in tech (https://github.com/evincarofautumn/kitten/issues/201#issueco...). Maybe lack of activity means they got another job, or maybe they left the programming/tech world entirely.


I've posted this language on HN before (a different username) and the author responded. I haven't played with it, but found the concept interesting. I'm also not sure if dead or stable are the best terms for what appears to be a hobby project and not destined to run in production. It could be neither.


Their Github avatar is wearing a facemask, which presumably means it was changed at some point in 2020...


I'm imagining an author who found his definition of programming nirvana, encoded it as a language, saw that he'd never actually get paid to work that way, and decided to leave it all behind and move to a cabin in the mountains.


If curious, past threads:

Kitten Programming Language - https://news.ycombinator.com/item?id=17681202 - Aug 2018 (13 comments)

Kitten – A statically typed, stack-based functional programming language - https://news.ycombinator.com/item?id=13345832 - Jan 2017 (49 comments)

Kitten: a statically-typed stack-based functional programming language - https://news.ycombinator.com/item?id=11753145 - May 2016 (1 comment)

Kitten: compile to C, stack-based functional programming language - https://news.ycombinator.com/item?id=11506603 - April 2016 (3 comments)

Kitten - high-performance statically typed concatenative programming language - https://news.ycombinator.com/item?id=4993640 - Jan 2013 (1 comment)


The author’s blog is also good, though it’s been several years since the last post:

http://evincarofautumn.blogspot.com


> Text literals use double-quotes (""), and are syntactic sugar for lists of characters. They have type List<Char>.

Perhaps not the best choice. I believe this is frequently considered a design mistake in Haskell.


Depends on the underlying implementation of `List<T>`. It should be fine as long as it's not a linked list.


It's a mistake to define a string as a collection type, because strings aren't fundamentally about a single kind of element. They have elements at different abstraction levels (bytes, codepoints, composed characters, words) that you sometimes want to enumerate, but mostly you operate on the whole string at once.

I do think this was a mistake in Haskell, but also it's silly that Haskell and Lisp are based on lists which are a bad inefficient data structure. It's doubly silly that lazy languages like Haskell use linked lists (which are bad) as a metaphor to represent generator functions (which are good).


If I'm reading the implementation of `reverse<T>` correctly, it appears to be a linked list.



Why is this an issue, out of interest? :)


It's not a huge issue since you can have other datatypes to deal with strings efficiently. E.g., Haskell has Text and ByteStrings (both lazy and strict versions) for this reason. But it is considered a design mistake that the least efficient way of handling strings was made the default.
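
For anyone who hasn't seen the Haskell side of this, a minimal sketch of what "use Text instead" looks like in practice (assuming the `text` package and `OverloadedStrings` so literals work for both types):

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.Text as T
    import qualified Data.Text.IO as TIO

    -- Default String: a lazy linked list of Char; fine for small values.
    greetString :: String
    greetString = "Hello " ++ "World!"

    -- Text: a packed array of code units; the usual choice for hot paths.
    greetText :: T.Text
    greetText = "Hello " `T.append` "World!"

    main :: IO ()
    main = putStrLn greetString >> TIO.putStrLn greetText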


"Hello World!" now takes 192 bytes in memory, instead of 12. Traversing (for printing, etc) is at least tens of times slower. Changing it to "Hello World?" is now 12 pointer chases rather than 1 assembly instruction. Etc.


It doesn't have to be an issue, but users should be aware that the default string type should be avoided in hot paths, which can be cumbersome. The reason it'll be slow is that iterating linked lists involves a lot of pointer chasing, which implies bad behavior on x86 and worse behavior on CPUs that don't have a data prefetcher.


Random access is pretty common with strings, so using a datastructure that makes it expensive is problematic.


Exactly the opposite... non-random access is extremely common. If you're reading byte X, the odds that you're going to read X+1 are very good. So it makes most sense to pack them tightly in memory in order, not splash them all over RAM and tie them together with pointers. The latter makes certain things easier, but never by enough on modern hardware to beat the array representation.


I’m not really talking about access as the computer sees it: most of the code I write that processes strings ends up doing things like splitting at new lines and stripping off the trailing space of each line. Or getting the length of the string or similar things that require efficient access to a semi-arbitrary offset into the string.

For the type of string processing code I find myself writing, linked lists are the wrong data structure because they tend to turn O(n) algorithms into O(n^2) ones: this can be mitigated by things like skip-lists, and is complicated by things like Unicode. But, personally, I tend to think that the biggest issue is not the performance impact of non-local storage of the string’s contents (which can be mitigated by techniques like cdr-coding). The problem is that string processing code, in my experience, tends to assume getting a predetermined offset into the string is relatively low cost.
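
A toy Haskell sketch of that accidental-quadratic pattern (hypothetical helper names, assuming the default String type):

    -- Quadratic: every (!!) walks the list again from the head, so visiting
    -- all n indices costs roughly 1 + 2 + ... + n pointer hops in total.
    everyCharByIndex :: String -> [Char]
    everyCharByIndex s = [ s !! i | i <- [0 .. length s - 1] ]

    -- Linear: the same result as a single left-to-right traversal.
    everyCharByTraversal :: String -> [Char]
    everyCharByTraversal s = s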


Is random access that common? I usually only iterate over strings (and random access is hard to do with utf-8 and friends).

I think the two biggest reasons to not use a list of characters to represent a string are memory efficiency (not storing a linked list pointer per character) and memory locality (keeping the characters of the string near each other in memory since they are frequently used as a unit).


Not exactly random access, but things like finding the last character, trimming trailing whitespace, etc. are all less efficient with linked lists because you have to copy the head of the list to change the tail.

As far as UTF-8 goes, it does add a whole layer of complexity that I had overlooked: although, I suspect intelligent data structure choices could mitigate the random access cost (something like an array of bytes plus an array of grapheme cluster offsets, or similar).
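
A rough sketch of the kind of structure I mean (the names and layout are made up for illustration, using the bytestring and vector packages; a real implementation would also have to worry about normalization):

    import qualified Data.ByteString as BS
    import qualified Data.Vector.Unboxed as VU

    -- Hypothetical packed string: UTF-8 bytes plus a precomputed table of
    -- the byte offset at which each grapheme cluster starts.
    data PackedText = PackedText
      { ptBytes   :: BS.ByteString   -- the UTF-8 payload
      , ptOffsets :: VU.Vector Int   -- byte offset of cluster i
      }

    -- "Random access" to the i-th grapheme is two O(1) lookups plus an
    -- O(cluster length) slice of the byte array.
    graphemeAt :: PackedText -> Int -> BS.ByteString
    graphemeAt (PackedText bs offs) i =
      let start = offs VU.! i
          end   = if i + 1 < VU.length offs then offs VU.! (i + 1) else BS.length bs
      in  BS.take (end - start) (BS.drop start bs)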

I personally have never found the memory locality concerns relevant to the vast majority of programs I write: when using a language like Python or Ruby or Haskell, you already have a lot of pointer chasing going on because of your non-string data structures such that the memory locality of strings is nearly irrelevant.


I think if your program is string heavy you would definitely notice. There is quite a difference between chasing one pointer per structure and one per character.


Unicode grapheme clusters (letters/ideograms plus accent marks) are inherently variable length, but most devs haven’t read about how that works. We shouldn’t be working with single codepoints or bytes for the same reason we don’t work with US-ASCII one bit at a time, because most slices of a bit string will be nonsense.


Treating strings as arrays of bytes is very reasonable. It is the assumption that each byte is exactly one character that causes trouble.


You should treat them as strings, the backing data store is an implementation detail. What do you want to do with individual bytes if you access them? Whatever you do is likely to be incorrect in some language, even English with emoji mixed in or Zalgo text.

This also lets you implement compressed small strings (eg in tagged pointers) which is a good memory optimization.


Linked lists are slower to iterate through than contiguous arrays, and are much slower at random access. Both are common operations for strings.


What is a proper implementation of strings? I often hear this argument (and that null-terminated strings are even worse), but I never understood what the proper solution is. Can you please explain? Thank you.


I think Java Strings are pretty well accepted as being a good implementation:

1. Primarily, the Java String class stores an array of characters as instance data. The underlying format of arrays in Java is JVM dependent, but most JVMs store arrays as a contiguous block of memory with a 4-byte length field (and String never mutates its array). This allows for very fast access for a number of different operations, at the cost of storing those 4 bytes for each array.

2. String instances themselves are also immutable, which has a number of very nice features.

Back in the day when C was created there was essentially a trade-off between storage (storing the length as an int value for each array) and ease of use. These days I think most people would argue that the teeny amount of storage used for Java arrays is mostly a moot point. Also, since Strings are immutable Java can do some optimizations like representing the same string value as just a single underlying instance even if multiple instances of that String are created in code.


> Also, since Strings are immutable Java can do some optimizations like representing the same string value as just a single underlying instance even if multiple instances of that String are created in code.

I thought the JVM stopped doing that many years ago?


No, that's not correct. String literals in code are always interned, and you can still call String.intern() on any String instance.

What you may be referring to is that interned Strings are stored in the heap, while they used to be stored in the Permanent Generation:

https://stackoverflow.com/a/10579062


> No, that's not correct.

See the commit for yourself! A while ago they removed the sharing of the underlying array between a string and its substrings.

https://github.com/openjdk/jdk/commit/f55750d05a04484b719220...


Sorry, yes, we were talking about different things. Calling String.substring() used to not copy the original char array, and would instead just store an offset and length in the new String, but still point to the char array of the original String. This could cause memory leaks, and the commit you referenced fixes that.

I was referring to String interning, where literals defined in code are only created once even if they are duplicated, and the API provides String.intern() to get the "canonical" instance of a String.


What is a “character” in Java? I’m guessing it’s not a unicode code point? Is it basically a byte? Go also has immutable strings which are basically just byte slices (slice being a window into an array).


Java basically uses UTF-16, thus a char is 2 bytes. All characters in the Unicode Basic Multilingual Plane are encoded with just a single char, while characters in the Supplementary Planes (i.e. those with code points above 64K) require 2 chars (4 bytes) in what are called "surrogate pairs".

Interestingly, code points beyond the BMP used to be quite uncommon (Unicode was originally designed for just 64k code points), but now emoji are in there, which of course are used a lot.


Using UTF-16 is a mistake because it doesn't represent anything well. Even if you use UTF-32 you still can't fit all characters in a single codepoint because some of them (like flag emoji) are composed. So you should expect things to be variable length. But once you do, UTF-8 is more efficient than UTF-16 even if you're processing CJK text because it so often has ASCII characters mixed in.

Cocoa/ObjC has the same problem - actually Java is based on Cocoa so that's probably where it came from.


Even with the surrogate pairs, Java's choice strikes a compromise between the memory efficiency of UTF-8 and the performance of fixed size code points. Most of the string methods work as if all characters are two bytes. So if, for example, you have a string composed entirely of 4 byte emojis, charAt(10) would return the first part of the 6th emoji, not the 11th emoji.

For many cases, strings won't use any characters outside of UCS-2. In these cases, programmers gain the performance advantages of fixed sized characters without worrying about the downsides of surrogate pairs. It doesn't have the small memory footprint of UTF-8, but one could argue that if memory size is really important and fixed size characters are not, you're better off using in-memory compression for your strings, which will beat UTF-8.

If you are using characters outside of UCS-2, then Java's string methods are a potential source of confusion and bugs. There's always a trade-off, right?


From Java 9 onward, if all the characters in a string fit in Latin-1 (ISO-8859-1) then Java will store them in a byte array with one byte per character. This ends up covering a lot of strings.


Java characters are 16 bit. Unfortunately many methods in the String API expect strings to be UCS-2 encoded, even though in practice nowadays they're UTF-16. But it's more complicated than that, especially since Java 9 which introduced compressed storage for strings that could fit into ISO-8859-1[0].

[0] https://openjdk.java.net/jeps/254


Thanks for this! I stand corrected. The actual underlying storage switched from a char array to a byte array in Java 9:

https://dzone.com/articles/going-beyond-java-8-compact-strin...


A text string standalone data type (maybe called Text) that supports traversal/iteration, subsetting/slicing, containment checking, and has a finite known length.

Internally it can be a list or array or whatever.

Traversal could yield Character objects, which would correspond to Unicode code points (the easy but less-good option, like in Python) or extended grapheme clusters (the actually useful option, like in Swift [0]).

It should also be easy to convert to/from a list or array of raw bytes.

[0]: https://docs.swift.org/swift-book/LanguageGuide/StringsAndCh...


I think they just meant that for ergonomic purposes, strings are very fundamental to most types of programs and deserve first-class support on the same level as primitives

Though possibly they just meant a string shouldn't be stored as a linked-list; I'm not familiar enough with Haskell to know whether List implies that underlying representation


String in Haskell is a type synonym for `[Char]`, i.e. a list of Char (Kitten's `List<Char>`). The question of 'primitives' in Haskell is tricky, as usually you don't deal with primitives at all, although this depends on what you mean by that word:

> Primitive (unboxed) types cannot be defined in Haskell, and are therefore built into the language and compiler. Primitive types are always unlifted; that is, a value of a primitive type cannot be bottom. We use the convention (but it is only a convention) that primitive types, values, and operations have a # suffix (see Section 7.3.2, “The magic hash”). For some primitive types we have special syntax for literals, also described in the same section.

So with respect to 'primitives', Strings are no less special in Haskell-land than, say, Integers (which are arbitrary-size and assignable as literals).

Anyway, Strings as Lists of Chars perform poorly both memory-wise and speed-wise, which is why the linked-list implementation is generally considered a mistake.


As with all things, it's a balance of several competing interests.

If the string is stored in a contiguous array of memory, you can maintain a pointer to the start of the string and then do some pointer arithmetic to get char n. This is an O(1) operation.

If you store the string as a linked list of characters, then to access the nth character you have to follow a bunch of pointers until you get to that char. This is an O(n) operation, and it's going to take up more memory (4/8 bytes per pointer, plus a byte for each character). Also it can make your memory fragmented. The upside is that you can grow and shrink the string. Adding chars to the front will be O(1) (and to the end too, if you keep a tail pointer). Adding chars in the middle will be O(n).

In the array case, if you want to modify the string you'd likely have to allocate a new block of memory for the larger string and copy all chars over. That will be an O(n) operation.

Null terminated strings are annoying because you don't know how big the string is without reading it. So getting the length of the string is an O(n) operation. The plus side is you don't have to carry around an extra byte(s) to hold the string length.

So it really depends on what you're doing with the string. Are you reading the string and accessing random characters often? Then an array is better. Are you mutating the string often? Then probably a linked list is better. Do you need to know how large your string is? Then keeping a byte around to indicate that is useful. But if you often read the entire string until the NULL character, then that extra memory dedicated to its length is wasted.
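
A small Haskell illustration of that trade-off, using ByteString as the packed-array stand-in (the helper names are made up):

    import qualified Data.ByteString.Char8 as BC

    -- Linked list of Char: prepending is O(1), indexing is O(n).
    prependList :: Char -> String -> String
    prependList = (:)

    indexList :: String -> Int -> Char
    indexList = (!!)

    -- Packed byte array: indexing is O(1), prepending copies the buffer, O(n).
    prependArray :: Char -> BC.ByteString -> BC.ByteString
    prependArray = BC.cons

    indexArray :: BC.ByteString -> Int -> Char
    indexArray = BC.index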


For cases like strings as lists of chars and null-terminated strings, I think the reason they're generally considered mistakes is that the tradeoffs don't make sense, at least not compared to their downsides. Null-terminated strings save a constant number of bytes, at the expense of expensive length-checking, potential complications, and liability for footgun mistakes (I believe they've been the cause of many a CVE over the years). Most systems aren't generally memory-starved enough that you can't spare the constant number of bytes for keeping track of string lengths.

As for linked lists, I believe the times where they're more performant are fairly rare, and unlikely to occur at the level of a data structure for individual characters, which is reflected in String's poor performance in practice. Pretty much the only advantage you get from using string-as-a-list-of-chars is that you get to use List's Functor and Monad instances, but that probably should be avoided in any case.


> Most systems aren't generally memory-starved enough that you can't spare the constant number of bytes for keeping track of string lengths.

That's the sort of thinking that gave us instant messaging applications which take gigabytes of memory and can saturate several year-old CPU cores trying to render some text...

That said, null termination + length is usually a good string representation. Instant length retrieval with the ability to ignore it when your algorithm needs to read to the end anyway.


> If the string is stored in a contiguous array of memory, you can maintain a pointer to the start of the string and then do some pointer arithmetic to get char n. This is an O(1) operation.

Pretty sure that only works 100% of the time for pure ASCII. With the UTF encodings a single character can span more than 1/2/4 bytes, so char n isn't at a fixed offset.


All of the above! And also there needs to be a way of dealing with variable length characters.

This is the problem that language designers need to solve for.


Similar to how Pascal does it. And certainly NOT how C does it.

The best internal representation (imho) is just an array (see, in Rust):

https://doc.rust-lang.org/src/alloc/string.rs.html#278-280

This allows a lot of efficient operations and the possibility of using views (for substrings).

Where things get nasty is all the Unicode stuff. So even if the internals are simple, providing a good API that allows Unicode to be used correctly is the more complicated part.

I think a "proper implementation of strings" must take into account whether it is proper Unicode, ASCII, or raw bytes. Rust and other modern languages chose UTF-8, and I think this is the best possible option.


> I believe this is frequently considered a design mistake in Haskell.

More accurately, there's an impedance mismatch with how many people use strings. Text is a better fit for them.

The Haskell implementation is not purely functional. It is however well tested, and wicked fast. While it is immensely appealing to view Haskell as a "there is a God!" abstraction, one does well to often consider and exploit what is happening through the looking glass.

One sees this for example using `type ShowS = String -> String` to concatenate many small strings to later output all at once. A project I've been meaning to complete is to translate a fast parser of mine to Text, to see if I can get it to run as fast. I'm sure the response to that blog post will be amusing.
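
For readers who haven't run into the ShowS trick: it's the standard difference-list idiom from the Prelude, roughly like this:

    -- The Prelude defines: type ShowS = String -> String
    -- Each piece means "prepend me to whatever comes next", so assembling the
    -- output is just function composition; the String is built once at the end.
    greeting :: ShowS
    greeting = showString "Hello, " . showString "World" . showChar '!'

    main :: IO ()
    main = putStrLn (greeting "")   -- apply to "" to materialize the String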


Having characters at all is probably a mistake.


How so? Character doesn't necessarily mean ASCII char nowadays, if that's what you mean. And other than that I don't know how it's a mistake to have a concept for the individual parts that make up a String.


> How so? Character doesn't necessarily mean ASCII char nowadays, if that's what you mean. And other than that I don't know how it's a mistake to have a concept for the individual parts that make up a String.

But why? It seems like having a `Digit` type, and having a `NatNum` type that's `List<Digit>`.

(OK, to nerdsnipe myself, that runs into trouble with leading 0's, but hopefully the point is clear. `0 | NonZeroDigit:List<Digit>` is maybe more accurate but less punchy.)


> But why? It seems like having a `Digit` type, and having a `NatNum` type that's `List<Digit>`.

BigInt types are in fact more or less List<Digit>, normal integer types just aren't implemented like that for performance reasons.

A char is the primitive type that makes up a string if chained together. It's more or less a practical naming convention, just like we can call a tuple of floats a vector. Maybe you could get by without chars, but then String would be the only type in your language that doesn't consist of any scalar element but always has an arbitrary length in every situation.


> BigInt types are in fact more or less List<Digit>, normal integer types just aren't implemented like that for performance reasons.

Right. Because you only use BigInts, you don't use isolated BigInt digits.

> but then String would be the only type in your language that doesn't consist of any scalar element but always has an arbitrary length in every situation.

Sure, why not? Every type has something special about it. Integers have overflow, floats have precision, infinities, and NaN, and booleans don't map cleanly to bytes.

There's enough different ways to iterate over a string; you don't need a canonical way to decompose them, which is basically what a char type is. I wouldn't go as far as "mistake", but it's definitely fine to go without it.

> A char is the primitive type that makes up a string if chained together. It's more or less a practical naming convention, just like we can call a tuple of floats a vector.

For chars more complex than a byte, there are multiple practical conventions and it's hard to pick one.

For char=byte there's a very natural convention, and an array of bytes reflects what's actually in memory, but char=byte is something that should be avoided.

So it's either "none of these stand out" or "don't do this". So you might as well keep the language slightly simpler and not have a char type.


Arguably strings these days don’t really have scalar elements, at least not in the way we think. One “character” might be several distinct code points, and character access might not be a constant-time operation.


It still makes sense to expose those codepoint sequences as individual symbols for the sake of the programmer's sanity. In most cases we don't really need to care if it's a single latin letter or a combination of bytes that make up a weird emoji, we just want a simple way to iterate over what a human reader would call a single character.


> we just want a simple way to iterate over what a human reader would call a single character

But that's almost never what a language is giving you with their 'characters'.


What about multiple leading 0's? Like how in some phone systems you need to do double 0's to call to somewhere not within the company. Also: What about negatives?


I'm not sure what the question is. As you say, I can imagine the need for, say, a `PhoneNumber` type that can remember leading 0's, and certainly there's use for an `Integer` type that can store negatives, but neither of those probably belongs in a `NatNum` type.


Oh right, completely forgot about using more advanced data types. Sometimes I surprise myself with how stupid my statements can be.


What is the use case for breaking up a string into parts? What are the individual parts that make up a string in that use case: bytes, code points, code units, extended grapheme clusters? What should happen when a different partitioning is needed, for a different use case?


> Kitten is a statically typed, stack-based functional programming language designed to be simple and fast.

It's amazing how many languages are designed to be simple and fast. You'd think that at some point the design would converge...


Indeed, this is why a Tu-144, a Porsche 911 and the Parker Solar Probe look nigh-identical.


None of those examples is simple.


For those curious, last time it was on HN https://news.ycombinator.com/item?id=13345832

The author answers some questions.


The permissions system for functions looks really neat. Lately my favorite features in a language are the ones that allow you to narrow down a function's contract with the calling code in interesting ways (in fact I'm working on one right now that's built around this idea)


The homepage says that Kitten has (will have?) memory management with no garbage collection, and yet the introduction shows that each item on the stack can represent arbitrarily large lists, trees, etc. How might this be implemented?



The FAQ says:

> In addition, stack-based resource management has predictable runtime performance behavior: when you drop a value from the stack, it gets deallocated, no garbage collector required.

This cannot possibly work for this language. Stack-based memory management only works as long as all references go from "newer" values (higher on the stack) to "older" ones (deeper on the stack). But this language allows you to write a function that rearranges the stack, for example by swapping the two topmost elements.

Consider a scenario where some value x is on top of the stack:

    x
Now we build some sort of heap value that has a reference to x and push it onto the stack:

    box(x)
    x
Now swap these:

    x
    box(x)
Then drop x from the stack, for example by printing it. This deallocates x. The stack is now:

    box(x)
But this value is undefined, since x has been deallocated. You have a use-after-free error.

I wish people stopped approaching language design from the point of view that the people who have been working on garbage collection for the last 60 years are all idiots who don't see how simple it all is.


So the runtime will just randomly segfault or cause memory corruption? That wouldn't make a very useful language, maybe they have a workaround.


Like all other languages voted to the HN front page that claim that they have magically solved memory management, this project's website is simply lying.

Looking at https://github.com/evincarofautumn/kitten/issues/193: "Does kitten require garbage collection?" -- "Nope. [...] The plan is that boxed lists ([…], List<T>) and closures ({…}) are reference-counted" and https://github.com/evincarofautumn/kitten/issues/131: "Explore GC strategies" -- "The C backend currently uses naïve reference counting."

A quick skim of kitten.c confirms that objects have a reference count field and are only really released when it drops to zero. (I didn't find code that increments the reference count, but I find it hard to care.)

The project's website claims:

> Automatic management of memory and resources with no garbage collector.

One can weakly argue that many people mean tracing garbage collection when they speak of garbage collection, and that reference counting is not garbage collection in this narrow sense. This is unhelpful at best.

The FAQ has the claim I quoted above, of which the "when you drop a value from the stack, it gets deallocated" appears to be simply false.

I should add that I'm aware that this project is incomplete, but that doesn't excuse making wildly implausible claims.


So I'd imagine it has "linear-like" ownership?


Just from looking at the hello world example, I don't like how "say" and "ask" are placed at the end. It would be much more readable to start with that.


It's a stack-oriented language, so postfix notation kind of comes with the territory.


Think of it like a unix pipe. That’s what they’re going for with the concatenative paradigm.


It's unfortunate that people are downvoting you because "that's just how it is in stack-based languages". Maybe stack-based languages are indeed less readable than alternatives. Pointing this out should ideally encourage discussion rather than downvotes.

The language's developers themselves seem to agree that postfix syntax isn't always best: The syntax for the type "list of characters" is "List<Char>", which is not postfix. Compare ML-family languages, where this would use postfix syntax, writing this type as "char list".


I am quite sure that if I went to a thread discussing Go or Rust (for example), and I said something along the lines of: "I don't like this, it would be more readable if you did like Forth", I would get downvoted too. And rightly so.

We can have discussions about the convenience or readability of different syntaxes, but if the arguments are going to be of the level of "I don't like it", then a downvote is a valid response.


> Maybe stack-based languages are indeed less readable than alternatives.

That may be the case, but taking a core language feature and commenting "I don't like it, change it" isn't a very effective way to encourage discussion, in my opinion.


It's a form of feedback. Yes, it would have been better feedback if the author had also supplied an example of how they think this could be improved. But I don't believe that that should be required.

Anyway, here's a rough idea. Introduce some special token, like : for example, and add a special syntax rule in the parser that "a: b" is parsed as if it had been written "b a".

Then people who find it more readable can write

    say: "hello"
while others can continue to write

    "hello" say
and surely everyone would be happy!


Postfix is incredibly intuitive. It's much simpler to read than alternatives. Your proposal is wrong.


It's stack based. Written like RPN (Reverse Polish Notation).


At first it was hard to read for me, but then I realized that in their design, the code always flowed from left to right, starting with the data you want to work on (or the data producer).

So it's always: data => process => process => process.

Left to right, no surprise, always the same thing first.

I kinda like the idea, although it's weird to get used to due to unfamiliarity.


They explain this in the tutorial, but anecdotally it reads really well if you speak a SOV language.


I'm disappointed. I expected the syntax to be cuter.



