
The poor performance argument is not true even for the Python ecosystem that the author discusses. Try saving geospatial data to GeoPackage, GeoJSON, or FlatGeobuf: it's slower than saving to plain CSV (the only inconvenience with CSV is that you must convert geometries into WKT strings). GeoPackage was "the Format of the Future" 8 years ago, but it's utterly slow at saving, because it's an SQLite database and it indexes all the data.
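
For anyone wondering what that WKT round trip looks like, here's a minimal sketch, assuming the usual pandas/geopandas/shapely stack (file names and the CRS are placeholders):

  import geopandas as gpd
  import pandas as pd
  from shapely import wkt

  gdf = gpd.read_file('data.gpkg')  # hypothetical source dataset

  # Save: serialize geometries to WKT strings, then write a plain CSV.
  out = pd.DataFrame(gdf.drop(columns='geometry'))
  out['geometry'] = gdf.geometry.apply(lambda g: g.wkt)
  out.to_csv('data.csv', index=False)

  # Load: parse the WKT column back into shapely geometries.
  df = pd.read_csv('data.csv')
  gdf2 = gpd.GeoDataFrame(df.drop(columns='geometry'),
                          geometry=df['geometry'].apply(wkt.loads), crs=4326)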

Files in .csv.gz are more compact than anything else, unless you work in some very, very specific field with very compressible data. As far as I remember, Parquet files are larger than CSV files with the same data.

Working with the same kind of data in Rust, I see that everything saved and loaded as CSV is lightning fast. The only thing you may miss is indexing.

Whereas saving to binary is noticeably slower, and data in a generic binary format becomes LARGER than in CSV. (Maybe if you define your own format and write a driver for it, you'll be faster, but that means no interoperability at all.)




Sorry, this is not true _at all_ for geospatial data.

A quick benchmark [0] shows that saving to GeoPackage, FlatGeobuf, or GeoParquet is roughly 10x faster than saving to CSV. Additionally, the CSV is much larger than any of the other formats.

[0]: https://gist.github.com/kylebarron/f632bbf95dbb81c571e4e64cd...


And here's my quick benchmark, dataset from my full-time job:

  > import geopandas as gpd
  > import pandas as pd
  > from shapely.geometry import Point

  > d = pd.read_csv('data/tracks/2024_01_01.csv')
  > d.shape
  (3690166, 4)
  > list(d)
  ['user_id', 'timestamp', 'lat', 'lon']

  > %%timeit -n 1
  > d.to_csv('/tmp/test.csv')
  14.9 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

  > d2 = gpd.GeoDataFrame(d.drop(['lon', 'lat'], axis=1), geometry=gpd.GeoSeries([Point(*i) for i in d[['lon', 'lat']].values]), crs=4326)
  > d2.shape, list(d2)
  ((3690166, 3), ['user_id', 'timestamp', 'geometry'])

  > %%timeit -n 1
  > d2.to_file('/tmp/test.gpkg')
  4min 32s ± 7.5 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

  > %%timeit -n 1
  > d.to_csv('/tmp/test.csv.gz')
  37.4 s ± 291 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

  > ls -lah /tmp/test*
  -rw-rw-r-- 1 culebron culebron 228M мар 26 21:10 /tmp/test.csv
  -rw-rw-r-- 1 culebron culebron  63M мар 26 22:03 /tmp/test.csv.gz
  -rw-r--r-- 1 culebron culebron 423M мар 26 21:58 /tmp/test.gpkg

CSV saved in 15s, GPKG in 272s. 18x slowdown.

I guess your dataset is country borders, isn't it? Something that 1) has few records and therefore a small r-tree, and 2) contains linestrings/polygons whose coordinates can be encoded compactly, similar to the Google Polyline algorithm.

But a lot of geospatial data is just sets of points. For instance: housing for an entire country (a couple of million points), an address database (IIRC 20+M points), or GPS logs of multiple users pulled from a logging database, ordered by time and not assembled into tracks -- several million points per day.

For such datasets, use CSV and don't abuse indexed formats. (Unless you store the data for a long time and actually use the index for spatial search, multiple times.)


Your issue is that you're using the default (old) GDAL binding, which is based on Fiona [0].

You need to use pyogrio [1], its vectorized counterpart, instead. Make sure you pass `engine="pyogrio"` when calling `to_file` [2]. Fiona loops over features in Python, while pyogrio is entirely compiled, so pyogrio is usually about 10-15x faster than Fiona. Soon, in pyogrio version 0.8, it will get another ~2-4x faster than it is now [3].

[0]: https://github.com/Toblerity/Fiona

[1]: https://github.com/geopandas/pyogrio

[2]: https://geopandas.org/en/stable/docs/reference/api/geopandas...

[3]: https://github.com/geopandas/pyogrio/pull/346
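
For reference, a minimal sketch of the calls in question (paths are placeholders):

  import geopandas as gpd

  gdf = gpd.read_file('data.gpkg', engine='pyogrio')   # read through pyogrio
  gdf.to_file('out.gpkg', engine='pyogrio')            # write through pyogrio

  # Newer geopandas versions (0.14+, if I remember right) also let you set
  # the engine once for the whole session:
  # gpd.options.io_engine = 'pyogrio'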


CSV is still faster than the geo-formats even with pyogrio. From what I saw, it writes most of the file quickly, then spends a lot of time on what I think is building the spatial index.

        > %%timeit -n 1
        > d.to_csv('/tmp/test.csv')
        10.8 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

        > %%timeit -n 1
        > d2.to_file('/tmp/test.gpkg', engine='pyogrio')
        1min 15s ± 5.96 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

        > %%timeit -n 1
        > d.to_csv('/tmp/test.csv.gz')
        35.3 s ± 1.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

        > %%timeit -n 1
        > d2.to_file('/tmp/test.fgb', driver='FlatGeobuf', engine='pyogrio')
        19.9 s ± 512 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

        > ls -lah /tmp/test*
        -rw-rw-r-- 1 culebron culebron 228M мар 27 11:02 /tmp/test.csv
        -rw-rw-r-- 1 culebron culebron  63M мар 27 11:27 /tmp/test.csv.gz
        -rw-rw-r-- 1 culebron culebron 545M мар 27 11:52 /tmp/test.fgb
        -rw-r--r-- 1 culebron culebron 423M мар 27 11:14 /tmp/test.gpkg
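
If that slow tail really is the r-tree being built, it might be worth one more run with the index disabled, to check. A hedged sketch, not verified: GDAL's GPKG driver has a SPATIAL_INDEX=YES/NO layer creation option (default YES), and as far as I understand, extra keyword arguments to `to_file` are forwarded to GDAL as creation options.

        # Assumption: the extra kwarg reaches the GPKG driver as a layer creation option.
        d2.to_file('/tmp/test_noindex.gpkg', engine='pyogrio', SPATIAL_INDEX='NO')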


Still, CSV is 2x smaller than GPKG with this kind of data. And csv.gz is 7x smaller.


That's why I'm working on the GeoParquet spec [0]! It gives you both compression by default and super-fast reads and writes! So it's usually as small as gzipped CSV, if not smaller, while being faster to read and write than GeoPackage.

Try using `GeoDataFrame.to_parquet` and `geopandas.read_parquet`.

[0]: https://github.com/opengeospatial/geoparquet
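
A minimal round trip, assuming pyarrow is installed and with placeholder paths:

  import geopandas as gpd

  gdf = gpd.read_file('data.gpkg')            # any existing GeoDataFrame
  gdf.to_parquet('data.parquet')              # snappy compression by default
  # gdf.to_parquet('data.parquet', compression='gzip')  # smaller file, a bit slower
  roundtrip = gpd.read_parquet('data.parquet')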


...but this has spared me some irritation at work today. Thanks!


> for geospatial data... GeoPackage was "the Format of the Future" 8 years ago

What's the current consensus? Can you link to a summary article?

(Some still say GeoPackage is the format of the future: https://mapscaping.com/shapefiles-vs-geopackage/ )


I'd say that compared to Shapefile, it is indeed better in every aspect (to begin with, shp has a 10-character column name limit). For some kinds of data and operations GPKG is superior to other geo-formats, e.g. when you 1) store a lot of data but retrieve it only within an area (you can set an arbitrary polygon as a filter with the GDAL driver, IIRC), or 2) append/delete/modify records and keep the data indexed -- with CSV you'd have to reprocess and rewrite the entire file.
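
A sketch of the "retrieve within an area" case, with a made-up path and polygon:

  import geopandas as gpd
  from shapely.geometry import box

  area = box(37.3, 55.5, 37.9, 55.9)  # hypothetical area of interest
  # Only features intersecting `area` are read; the GPKG r-tree makes this cheap.
  subset = gpd.read_file('big_dataset.gpkg', mask=area)
  # A plain rectangle also works via bbox=(xmin, ymin, xmax, ymax).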

The problem is that in data science you want whole datasets to be atomic, to get reproducible results, so you don't care much about these sub-dataset operations.

Another unexpected issue with GPKG and atomicity is that SQLite updates the database file's modification time even when you only read from it. So if you use a Makefile, which detects updates by modification time, you either have to let it re-run some targets, or manually touch files downstream, or rely on separate marker files that you `touch` (the unix tool that updates a file's modification time).

I read the Russian OSM blogger Ilya Zverev evangelizing for GPKG back in 2016 on his blog: https://shtosm.ru. I guess he was comparing GPKG to Shapefile too, not to CSV, and I think he's totally correct in that. But look above at my other comment with the benchmark: CSV turns out far easier on resources if you have lots of points.

Back in 2017 I made a tool that could read and write CSV, the Fiona-supported formats (GeoJSON, GPKG, CSV, Postgres DB), and our proprietary MongoDB. (Here's the tool, without the Mongo feature: https://github.com/culebron/erde/ ) I tried all the easily available formats, and every single one has favorable cases and sucks at some others (well, Shapefile is outdated, so it's out of the competition). Among them, FGB is kind of a better GPKG if you don't need mutations.


What's the one-line summary on that: there's consensus that Shapefile is on its way out, but no consensus on its successor, be it GeoPackage or anything else? And where can we simply see, as of today, what % of geospatial files in use are Shapefile, GeoPackage, et al.?

(I tried to estimate from references to formats on https://gis.stackexchange.com/ but it just gave me a headache.)


If you need mutability and indexing, choose GeoPackage. If you can skip mutability but still need indexing, probably FlatGeobuf. If you can skip both indexing and mutability, then CSV or GeoJSON will suffice (especially if the data is small and you need it human-readable).



