Oh boy, this gives me a chance to talk about one of the gems of astronomy software which deserves to be better known: HEALPix tessellation!
HEALPix stands for 'Hierarchical Equal Area isoLatitude Pixelization'. It is a scheme that was developed to analyze signals that cover the entire sky, but with variable density.
Like HTM or Hilbert curves, this can be used to organize spatial data.
The tessellation looks kind of funny but has many good features - it has no discontinuities at the poles, and every pixel has equal area. And with the "nested" HEALPix formulation, pixels are identified by integers. Pixel IDs are hierarchical based on leading bits: each pixel splits into four children, so a descendant's ID is just the ancestor's ID with two extra bits appended per level. For example, pixel 106 (=0110 1010) contains pixel 1709 (=0110 1010 1101), two levels down. This lets you do some marvelous optimizations in queries if you structure your data appropriately. Nearest-neighbor searches can be extremely quick if things are HEALPix-indexed - and so can radius searches and arbitrary polygon searches.
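To make the leading-bits trick concrete, here is a minimal sketch in plain Python (my own illustration, not healpy's API), using the pixel IDs from the example above:

```python
# In the nested scheme, each pixel at one level splits into 4 children
# at the next, so a child's ID is parent * 4 + {0, 1, 2, 3} and
# containment is just a right shift.
def contains(parent, child, level_diff):
    """True if nested pixel `parent` contains `child`, which lives
    `level_diff` resolution levels further down."""
    return (child >> (2 * level_diff)) == parent

print(contains(106, 1709, 2))   # True: 1709 >> 4 == 106

# All descendants of a pixel occupy one contiguous ID range, so a
# plain B-tree over pixel IDs can answer "everything inside this
# pixel" with a single range scan:
d = 2
lo, hi = 106 << (2 * d), (106 + 1) << (2 * d)
print(lo <= 1709 < hi)          # True: the range is [1696, 1712)
```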
HEALPixels are used today for more than just their original intent. LSST will use them for storing all-sky data and point source catalogs, for example.
I experimented with geospatial Hilbert curves as a Postgres extension [0] for PostGIS using the S2 [1] spherical geometry library. S2 uses a scale-free cell coverage pattern that is numbered using six interlocking space-filling Hilbert curves, one per face of a cube projected onto the sphere [2]. This approach is similar to what is used in this article; S2 uses quadrilateral cells rather than hexagons, but the idea is generally the same.
Having both high-level (cell) and low-level (cell ID) geometries made it a very powerful library, allowing projection from the Hilbert space into a Postgres spatial index (SP-GiST), including the various trees noted in this article. It appears to still be quite active in development.
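For a feel of why the cell hierarchy maps so well onto an ordinary index, here is a rough sketch using the s2sphere package, a pure-Python port of S2 (the calls shown are from that port as I remember them; treat the details as approximate):

```python
import s2sphere

point = s2sphere.LatLng.from_degrees(52.3676, 4.9041)
leaf = s2sphere.CellId.from_lat_lng(point)   # level-30 leaf cell
coarse = leaf.parent(10)                     # level-10 ancestor

# The hierarchy is encoded in the 64-bit cell ID itself, so "is this
# point inside that cell?" becomes an integer range check - which is
# what lets a B-tree or SP-GiST serve spatial containment queries.
print(leaf.level(), coarse.level())          # 30 10
print(coarse.contains(leaf))                 # True
print(coarse.range_min().id() <= leaf.id() <= coarse.range_max().id())  # True
```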
Ah - I've used S2 in the past. Great work. It scales really well for larger datasets.
Interesting that you mention the use of multiple Hilbert curves as well. We also experimented with two Hilbert curves, rotated by 90 degrees. This helps to get around what we've dubbed the "Hilbert Equator" problem, where two objects that are close in space end up far apart on the curve because they sit on opposite sides of one of the major fault lines in the fractal (for lack of a better word).
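To illustrate, here's a sketch of the two-curve trick using the classic iterative Hilbert transform; the demo points are an arbitrary pick straddling the curve's central fault line:

```python
def xy2d(n, x, y):
    """Distance of cell (x, y) along a Hilbert curve over an n x n
    grid (n a power of two). Standard iterative formulation."""
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                    # rotate/flip quadrant into place
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def two_curve_gap(n, a, b):
    """Min curve-distance gap over the original grid and a copy
    rotated by 90 degrees."""
    rot = lambda p: (p[1], n - 1 - p[0])
    g1 = abs(xy2d(n, *a) - xy2d(n, *b))
    g2 = abs(xy2d(n, *rot(a)) - xy2d(n, *rot(b)))
    return min(g1, g2)

# Two adjacent cells on opposite sides of the central fault line:
n, a, b = 256, (127, 0), (128, 0)
print(abs(xy2d(n, *a) - xy2d(n, *b)))   # 54613: most of the curve apart
print(two_curve_gap(n, a, b))           # 1: adjacent on the rotated curve
```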
There are always caveats with these types of indexing methods, especially if you require dynamic high-performance indexing and support for rectangles/polygons. You can only move the edge cases around; there is no way to eliminate them. This allows you to tailor an indexing scheme for specific workload assumptions, but it obviously breaks down if you need an algorithm that generalizes to many workloads and data models. There isn't just one type of edge case with this kind of indexing; there are several, which may or may not be relevant depending on what you are trying to do.
Some research from the 1980s showed it is only possible to mitigate bounded categories of edge case, thereby improving generality, by indexing on complex higher-dimensionality embeddings. Mitigating more categories requires more dimensions and more complexity. However, no one could figure out how to construct these embeddings for even basic cases or deal with more practical curse of dimensionality issues, so that is largely forgotten (the researchers themselves made comments to the effect that they didn't think a tractable solution was possible). I've never even been able to find that literature in electronic form, unfortunately.
Like AI, it is an interesting open-ended problem space. You can prove that an elegant optimal solution is not tractable so it ends up being a search for asymptotically optimal algorithms that become exponentially more complex the closer you get to optimal.
Tackling these boundary problems is dealing with the literal "edge cases" of geospatial indexing, and they exist everywhere, so this is a good reason to use an existing library where the authors have already solved them.
Hexagons are cool, but they are not necessarily the bestagon for a spherical geometry, since you cannot break a hexagon into smaller hexagons, whereas an S2 cell comes from a "cube" with spherical sides, and HEALPix uses a rhombic dodecahedron [0], both of which can be split into smaller divisions of themselves.
Not to discourage you from your experimentation, it's all trade offs and you might find a good one. Good luck!
> you cannot break a hexagon into smaller hexagons
Actually you can.
1. Think of hexagons as six equilateral triangles sharing a center point.
2. Place one smaller, vertically flipped equilateral triangle in each original triangle.
3. Each original hexagon center point is now the center of a smaller (1/2 linear dimension, 1/4 area) hexagon.
4. New small hexagons replace each of the original hexagon's six edges. Since each edge is shared between two hexagons, this adds three edge-centered hexagons per original hexagon.
So each new hexagon has 1/4 the area of the original ones (and 1/2 the linear dimensions). This results in 2x the linear dimension resolution, 4x the area resolution.
Square grids could also be refined the same way: retain a half-sized (in linear terms) square at the center of each original square, and turn each original edge and corner into new squares, with the same 1/2 linear and 1/4 area scaling.
Does that end up with specific points that have the same difficulty, at the intersections of the two equators? I suppose even if that is the case, a point is easier to position over a dead area than a line would be.
It's an important point. We've ended up using two Hilbert curves, rotated by 90 degrees, to tackle this problem. If a pair is close on either curve, it is close in 2D space.
As mentioned in other replies, this doesn't fully solve a lot of the cases. I think Z-order curves might actually be better here: they are slightly less accurate individually, but in a way that would seemingly minimize the worst-case results when taking the minimum of the two. But at that point you are probably better off using a single Z-order curve and using the coordinates to check the Manhattan or Chebyshev distance with bitwise operations.
I would second the use of Z-order curves. Hilbert curves were famously more optimal on spinning disk in a specific context a few decades ago. Inertia has kept them top-of-mind even though modern systems no longer have the problems Hilbert curves were better at solving, and they are more complex to use than e.g. Z-order curves.
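To make the simplicity claim concrete, here is a sketch of Morton encode/decode using the standard magic-number bit spreading, plus the coordinate-based Chebyshev check mentioned in the parent comment (nothing here comes from a particular library):

```python
def part1by1(x):
    """Spread the 16 bits of x out to the even bit positions."""
    x &= 0xFFFF
    x = (x | (x << 8)) & 0x00FF00FF
    x = (x | (x << 4)) & 0x0F0F0F0F
    x = (x | (x << 2)) & 0x33333333
    x = (x | (x << 1)) & 0x55555555
    return x

def compact1by1(x):
    """Inverse of part1by1: gather the even bits back together."""
    x &= 0x55555555
    x = (x | (x >> 1)) & 0x33333333
    x = (x | (x >> 2)) & 0x0F0F0F0F
    x = (x | (x >> 4)) & 0x00FF00FF
    x = (x | (x >> 8)) & 0x0000FFFF
    return x

def morton_encode(x, y):
    return part1by1(x) | (part1by1(y) << 1)

def morton_decode(z):
    return compact1by1(z), compact1by1(z >> 1)

# Chebyshev distance between two keys: decoding is cheap enough
# (a few shifts and masks) to do inline, no coordinate store needed.
za, zb = morton_encode(100, 200), morton_encode(103, 197)
(xa, ya), (xb, yb) = morton_decode(za), morton_decode(zb)
print(max(abs(xa - xb), abs(ya - yb)))   # 3
```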
The problem of finding the shortest path on a map (a la Google Maps etc.) is often solved with approaches like "hub labeling" and "contraction hierarchies" rather than simple spatial networks. Things like Dijkstra's algorithm are only efficient if your input is the raw graph and a pair of points, and you only want one path, once.
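For reference, the baseline those techniques beat is a plain Dijkstra search, which explores the graph afresh for every query; contraction hierarchies spend preprocessing time adding shortcut edges so that each query touches only a tiny fragment. A toy sketch (graph and weights made up):

```python
import heapq

def dijkstra(graph, source, target):
    """graph: {node: [(neighbor, edge_weight), ...]}"""
    dist = {source: 0}
    pq = [(0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue                     # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return float("inf")

roads = {"A": [("B", 5), ("C", 2)], "C": [("B", 1)], "B": [("D", 4)]}
print(dijkstra(roads, "A", "D"))         # 7, via A -> C -> B -> D
```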
Here is one of the recent advancements in performing similarity (including nearest neighbor) search on sparse high dimensional data using learned indices:
https://arxiv.org/abs/2204.10028
The paper basically deals with the most adverse case but the approach should work in lower dimensions and in data with a non-uniform density distribution as well.
The purpose of using space-filling curves is to convert coordinates into 1-dimensional data, so that they can be indexed by an ordinary database index, whether distributed or not. The problem is that this is much slower than a "real" spatial index like an R-tree, and the supported queries are rather limited (a sketch of the pattern follows below).
On the other hand, spatial data can be easily partitioned, rendering the aforementioned method less useful. Also, I've never seen anyone need more than a single instance of Postgres to index all their spatial data, unless you are dealing with spatiotemporal data, which can grow infinitely large. Then again, spatiotemporal data can also be easily partitioned.
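Here is that sketch (plain Python standing in for a B-tree; the box and point counts are arbitrary). The post-filter is where the slowdown relative to an R-tree shows up:

```python
from bisect import bisect_left, bisect_right
import random

def interleave(x, y, bits=16):
    """Naive Morton encode: interleave the bits of x and y."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# The "index" is just points sorted by their 1-D key.
random.seed(42)
pts = [(random.randrange(1024), random.randrange(1024)) for _ in range(20000)]
index = sorted((interleave(x, y), x, y) for x, y in pts)
keys = [k for k, _, _ in index]

# Rectangle query: Morton codes are monotonic in each coordinate, so
# every point in the box has a key between z(x0, y0) and z(x1, y1).
# That range also covers many cells outside the box, hence the
# post-filter - the false positives are why this is slower than an
# R-tree unless the box is split into tighter key ranges (the
# BIGMIN/LITMAX refinement).
x0, y0, x1, y1 = 100, 100, 140, 160
lo = bisect_left(keys, interleave(x0, y0))
hi = bisect_right(keys, interleave(x1, y1))
hits = [(x, y) for _, x, y in index[lo:hi]
        if x0 <= x <= x1 and y0 <= y <= y1]
print(f"scanned {hi - lo} candidates for {len(hits)} hits")
```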
I knew about hexagons but not about R-trees and Hilbert curves, and then combining the two. Even though that was an inbound content article, it was at least informative.
More HEALPix resources:
- Original NASA/JPL site: https://healpix.jpl.nasa.gov/
- Popular Python implementation: https://healpy.readthedocs.io/en/latest/
- Good PDF primer: https://healpix.jpl.nasa.gov/pdf/intro.pdf
And an experimental database being built on HEALPix for extremely large data volumes (certainly many TB, maybe single-digit PB): https://github.com/astronomy-commons/hipscat