Hacker News new | past | comments | ask | show | jobs | submit login

Just trying to help, not criticize. As to correctness, if you change the tokenization to:

  while (fgets(buf, 1024, fp)) {
    char *c, word[1024], *w = word;
    for (c = buf; *c; c++) {
      *c = tolower(*c);
      if (*c >= 'a' && *c <= 'z')
        *w++ = *c;
      else {
        *w = '\0';
        count(word, (size_t)(w - word));
        w = word;
      }
    }
    *w = '\0';
    count(word, (size_t)(w - word));
  }
and you have that `count` function skip words with len < 2 then I can match results with the other 3 (Nim, Rust, C++) versions. Also, your C version runs in just 7 ms on Tale Of Two Cities, about half the time of the PGO optimized Nim.



Oh, no, you don't sound critical at all. I'm making fun of myself. Thank you for actually reading it!


Cool. You could probably simultaneously speed it up and simplify it by changing from linked list collision to open addressing with linear probes (but you may need a larger TSZ). Besides being more cache friendly for larger files, at the end you can then just qsort your table instead of first building a flat copy. :-) You could also use the unreduced hash value as a comparison prefix (only calling strcmp on different values). But it's not bad for potato code. ;-)


I didn't do open addressing because I didn't want to write code to resize the whole table. :)

I thought about also keeping the top 10 as I go instead of copying the whole table. But I'm guessing that virtually all the time this program spends is in I/O.


I kind of suspected that might be the reason..but you do still have a scalability problem for very large vocabularies. :-)

I just did a profile and saw about 15% in strcmp in the hot cache run, but sure if it's not in RAM then IO is likely the bottleneck.


(You could also just save the len and use that as a comparison prefix, but the hash value is less likely to yield a false positive.)




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: