Data-Parallel Rank-Select Bit-String construction (haskell-works.github.io)
22 points by KirinDave on Aug 8, 2018 | 4 comments



I could see this ending up in my data ingestion pipeline... using SIMD instructions for finding line boundaries right after decompression would give a major performance boost.

Decompressing LZ4 cannot be done in parallel, but right after that what I want is to split the workload across CORE_COUNT threads at correct line boundaries.
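
Something like this is what I have in mind for the boundary search (just a rough C++/SSE2 sketch on my side, not the article's Haskell; the names and splitting policy are made up): scan 16 bytes at a time for '\n' and snap each per-core chunk boundary forward to the next newline.

    // Sketch: split a decompressed buffer into ~equal chunks for CORE_COUNT
    // workers, snapping each split point forward to the next '\n'.
    // SSE2 + GCC/Clang builtins assumed; illustrative only.
    #include <emmintrin.h>
    #include <cstddef>
    #include <utility>
    #include <vector>

    static size_t next_newline(const char* buf, size_t len, size_t from) {
        const __m128i nl = _mm_set1_epi8('\n');
        size_t i = from;
        for (; i + 16 <= len; i += 16) {
            __m128i block = _mm_loadu_si128(reinterpret_cast<const __m128i*>(buf + i));
            int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(block, nl));
            if (mask) return i + __builtin_ctz(mask);   // first '\n' in this block
        }
        for (; i < len; ++i)
            if (buf[i] == '\n') return i;
        return len;                                     // no trailing newline
    }

    // Returns up to `cores` (start, end) ranges, each ending on a line boundary.
    std::vector<std::pair<size_t, size_t>>
    split_at_lines(const char* buf, size_t len, unsigned cores) {
        std::vector<std::pair<size_t, size_t>> ranges;
        size_t start = 0;
        for (unsigned c = 1; c <= cores && start < len; ++c) {
            size_t target = (c == cores) ? len : (len / cores) * c;
            if (target < start) target = start;         // a long line overshot the target
            size_t end = (target >= len) ? len : next_newline(buf, len, target) + 1;
            if (end > len) end = len;
            ranges.emplace_back(start, end);
            start = end;
        }
        return ranges;
    }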

Some primitive way to control pipe flow would solve a problem I have not found any good solutions for. Example: I want to ingest CSV/JSON with a field containing a domain name.

That field will be my primary key, and my table is partitioned on the domain name's TLD.

In Postgres, COPY FROM is the only really fast way to ingest data, but it is not possible to COPY FROM into a partitioned table unless all entries end up in the same partition.

If I could do something like this:

    curl -q http://something.com/x.lz4 | lz4 -d - | tool -pipeFlowCmd=postgreIngest.sh -pipeFlowCmdArg=split($domainname,\.)[-1]

The tool should spawn a postgreIngest.sh for each TLD it sees, with the TLD as argument, and keep the process open, piping matching lines into it.
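
To make it concrete, the non-SIMD core of such a tool could look roughly like this (C++/POSIX sketch; the CSV column position and the postgreIngest.sh name are just placeholders for illustration):

    // Sketch: read lines from stdin, take the TLD of a domain-name field,
    // and pipe each line into a per-TLD "postgreIngest.sh <tld>" child
    // that is kept open via popen(). Assumes CSV with the domain in the
    // first column; illustrative only.
    #include <cstdio>
    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::map<std::string, FILE*> sinks;            // one child process per TLD
        std::string line;
        while (std::getline(std::cin, line)) {
            std::string domain = line.substr(0, line.find(','));
            std::string tld = domain.substr(domain.rfind('.') + 1);
            FILE*& sink = sinks[tld];
            if (!sink)
                sink = popen(("./postgreIngest.sh " + tld).c_str(), "w");
            std::fwrite(line.data(), 1, line.size(), sink);
            std::fputc('\n', sink);
        }
        for (auto& entry : sinks) pclose(entry.second);
    }

The SIMD part would then replace the scalar find/rfind calls with vectorised delimiter scans, which is where the rank-select ideas come in.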

postgreIngest.sh would be something like: awk -v tld="$1" '{ if (NR % 500 == 1) { if (NR > 1) print "\\."; printf("COPY domains.%s FROM stdin;\n", tld); } print; } END { print "\\." }' | psql

so psql starts by getting "COPY domains.com FROM stdin", then 500 lines are piped into it, all with a TLD of .com, then repeat.

Having a tool capable of using SIMD instructions for ultra-fast pipe-flow parallelisation and control would be awesome, because then the bottleneck would not be the distribution of the decompressed data to X processes.

If I can find the time I will make it myself, using his ideas for inspiration (I will make it in C++, POSIX subset).

Anyway, great work, thanks for sharing - you inspire me :)


Just having a generic tool using SIMD instructions with controllable masking arguments in the pipeline would be awesome.

Another example:

When ingesting data into Postgres with COPY FROM, exceptions are bad... all entries in the COPY FROM batch are discarded because of one faulty line. So either the data has to be flawless, or it also needs to be stored on disk until the commit is successful.

Datasets with invalid UTF-8 are a pain because of that. One solution is to pipe all input through iconv -f UTF-8 -t UTF-8 -c. That will drop invalid characters, but it eats a lot of CPU as every character is parsed one by one.

With a pipe-processing stage using custom SIMD masks as a way to control the pipe flow, I could select lines containing bytes that indicate a multi-byte UTF-8 sequence (which can possibly fail) and only pipe those to iconv; that would reduce the overhead by 99%.
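
The mask test itself could be something as simple as this (rough C++/SSE2 sketch of my own, not from the article): the high bit of every byte is collected with one movemask, so a pure-ASCII line is rejected after a handful of instructions.

    // Sketch: does this line contain any byte >= 0x80, i.e. part of a
    // multi-byte UTF-8 sequence? Only such lines would be routed through
    // iconv; pure-ASCII lines bypass it. Illustrative only.
    #include <emmintrin.h>
    #include <cstddef>

    bool line_has_non_ascii(const char* line, size_t len) {
        size_t i = 0;
        for (; i + 16 <= len; i += 16) {
            __m128i block = _mm_loadu_si128(reinterpret_cast<const __m128i*>(line + i));
            // movemask gathers the top bit of each byte: non-zero means
            // at least one byte in this block is outside 7-bit ASCII.
            if (_mm_movemask_epi8(block) != 0) return true;
        }
        for (; i < len; ++i)
            if (static_cast<unsigned char>(line[i]) >= 0x80) return true;
        return false;
    }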


The input constraints might have been realistic 10+ years ago, but there aren't a lot of performance-sensitive parsing problems that can safely do without either escaping or Unicode support.


True, but irrelevant: his experiments apply perfectly to finding escape characters and bit sequences that indicate the following x bits should be parsed by ICU. He is testing ideas and concepts; I look forward to an update. Would like a way to subscribe.



