
Hi. Nice project! I don't know why there's no BM25 (or at least some TF-IDF implementation) in Postgres FTS, but if you decide you need it (along with support for more languages, highlighting, and lower response times), ping us at contact@manticoresearch.com and we'll help you integrate your Postgres dataset with Manticore Search. 60M docs shouldn't be a problem at all (building the index should take about an hour), and you'll get proper ranking and nice highlighting with just a few lines of new code. Here's an interactive course about indexing from MySQL: https://play.manticoresearch.com/mysql/ . With Postgres it works the same way.


Thank you for your generous offer of help! I look forward to taking it up (may take a while as I'm about to move countries and quarantine).

In particular I love that one of the examples in your comment history is in Latin as that language is not currently supported by Postgres FTS. Are Latin and Ancient Greek supported by Manticore? (dare I hope for Anglo Saxon...)


In terms of advanced NLP (stemming, lemmatization, stopwords, wordforms): no. In terms of general tokenization: I've never dealt with Latin or Ancient Greek characters (if those languages have specific characters at all), but even if they aren't supported by default, it's not a problem to add them in the config (https://mnt.cr/charset_table).


For the character mappings, it might be useful to have a look at the config for https://tatoeba.org (or rather, the PHP script that generates the config): https://github.com/Tatoeba/tatoeba2/blob/dev/src/Shell/Sphin...

There's one big list of mappings for almost every script under the sun, including Greek. (With mappings like 'U+1F08..U+1F0F->U+1F00..U+1F07' turning U+1F08 Ἀ [CAPITAL ALPHA WITH PSILI] into U+1F00 ἀ [SMALL ALPHA WITH PSILI], and the same for seven other accented alphas. I've considered turning them all into unaccented alpha instead, but I don't know enough about Greek orthography to decide that.) https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad2939c691...
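
To make the range syntax concrete, here is a rough Python sketch of what a rule like 'U+1F08..U+1F0F->U+1F00..U+1F07' amounts to: each code point in the source range is paired with the code point at the same offset in the target range. The helper names (`expand_rule`, `fold`, `strip_marks`) are made up for illustration and are not part of Manticore, Sphinx, or the Tatoeba script. `strip_marks` is one possible way to get the "unaccented alpha" alternative, assuming that dropping every combining mark is acceptable for search purposes.

```python
import re
import unicodedata

# Matches one range-mapping rule such as 'U+1F08..U+1F0F->U+1F00..U+1F07'.
RULE = re.compile(
    r"U\+([0-9A-Fa-f]+)\.\.U\+([0-9A-Fa-f]+)"
    r"->U\+([0-9A-Fa-f]+)\.\.U\+([0-9A-Fa-f]+)"
)


def expand_rule(rule: str) -> dict[int, int]:
    """Expand a range rule into a code point -> code point table (hypothetical helper)."""
    match = RULE.fullmatch(rule.strip())
    if match is None:
        raise ValueError(f"not a range mapping: {rule!r}")
    src_lo, src_hi, dst_lo, dst_hi = (int(g, 16) for g in match.groups())
    if src_hi - src_lo != dst_hi - dst_lo:
        raise ValueError("source and target ranges differ in length")
    # Pair each source code point with the target at the same offset.
    return {src_lo + i: dst_lo + i for i in range(src_hi - src_lo + 1)}


def fold(text: str, table: dict[int, int]) -> str:
    """Apply the expanded table to a string."""
    return text.translate(table)


def strip_marks(text: str) -> str:
    """Alternative: decompose, then drop all combining marks (breathings, accents)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


ALPHA_PSILI = expand_rule("U+1F08..U+1F0F->U+1F00..U+1F07")
folded = fold("\u1F08", ALPHA_PSILI)   # capital alpha with psili -> small alpha with psili
unaccented = strip_marks(folded)       # small alpha with psili -> plain small alpha
```

The first approach keeps breathings and accents distinct, so a query has to match them exactly; the second collapses them, which is more forgiving but may merge words that differ only in diacritics.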

For Latin, there are some special exceptions so that "GAIVS IVLIVS CAESAR" and "Gaius Julius Caesar" are treated the same: https://github.com/Tatoeba/tatoeba2/blob/3170f7326ad2939c691...
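
As an illustration of the idea only (not the actual Tatoeba rules, which live behind the link above), one simple way to make those two spellings compare equal is to lowercase everything and then fold j into i and v into u. The table and function names below are hypothetical.

```python
# Assumed folding for classical Latin spelling: J/j -> i, V/v -> u.
LATIN_FOLD = str.maketrans({"j": "i", "v": "u"})


def fold_latin(text: str) -> str:
    """Lowercase, then merge j with i and v with u (illustrative sketch)."""
    return text.lower().translate(LATIN_FOLD)


inscription = fold_latin("GAIVS IVLIVS CAESAR")
modern = fold_latin("Gaius Julius Caesar")
```

Both strings come out identical after folding. The cost is that every v becomes u, including consonantal ones, so a rule like this is only reasonable if it is limited to Latin text or if the extra matches are acceptable.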

It's not beautiful, but it's used in production. People who don't need to support quite as many languages as Tatoeba will probably want a simpler config, but it might still be useful as a reference.



