Scalable Geospatial (Vector) Data Science

Nikolai Janakiev @njanakiev

(Geospatial) Big Data

OpenStreetMap

  • 6,140,639,049 nodes
  • 677,822,881 ways
  • 7,940,285 relations

www.openstreetmap.org/stats/data_stats.html

PostgreSQL

PostgreSQL and PostGIS

PostgreSQL Parallelization

  • Since PostgreSQL 9.6 (2016)
  • Enabled by default since PostgreSQL 10 (2017)

PostGIS Parallelization

  • With tweaks since PostgreSQL 11 and PostGIS 2.5
  • Out of the box since PostgreSQL 12 and PostGIS 3
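
The slide does not say which tweaks were needed, so the following is only one plausible session-level setup, sketched in Python. The snippet assembles SQL text and does not connect to a database. `max_parallel_workers_per_gather` is a real PostgreSQL setting; the table `osm_points`, the column `geom`, and the helper name are hypothetical.

```python
def parallel_explain_statements(query: str, workers: int = 4) -> list:
    """Return SQL statements that request parallel workers, then show the plan."""
    if workers < 0:
        raise ValueError("workers must be non-negative")
    return [
        # Upper bound on the workers a single Gather node may start in this session.
        f"SET max_parallel_workers_per_gather = {workers};",
        # A parallel plan contains a "Gather" node in the EXPLAIN output.
        f"EXPLAIN {query.rstrip(';')};",
    ]


# Hypothetical table and column names, for illustration only.
statements = parallel_explain_statements(
    "SELECT count(*) FROM osm_points "
    "WHERE ST_Intersects(geom, ST_MakeEnvelope(16.2, 48.1, 16.6, 48.3, 4326))"
)
for statement in statements:
    print(statement)
```

Whether the planner actually picks a parallel plan still depends on table size and cost settings, so inspecting the plan is the only reliable check.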

Scaling PostgreSQL

Columnar Storage (Foreign Data Wrapper)

GPU Processing

Distributed PostgreSQL

Hadoop

Apache Hadoop

  • 2006 Split out of Apache Nutch as Hadoop
  • 2007 Used by Facebook, LinkedIn, Twitter, among others
  • 2008 Top-level Apache Software Foundation project

Apache Hadoop

Hadoop Distributed File System (HDFS)

Hadoop YARN

Apache Spark

  • Created 2009, open sourced 2010
  • Distributed computation engine, written in Scala, with bindings for Java, Python, and R

GeoMesa (geomesa.org)

Spatial Indexing
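
GeoMesa builds its spatial indices on space-filling curves. A minimal sketch of the underlying idea, assuming a plain Z-order (Morton) encoding with 16 bits per axis; this is not GeoMesa's implementation, and the function names and resolution are illustrative:

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x (even positions) and y (odd positions)."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def z_order(lon: float, lat: float, bits: int = 16) -> int:
    """Map a WGS84 coordinate to a single sortable integer key."""
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError("coordinate out of range")
    cells = (1 << bits) - 1
    # Scale each axis to an integer grid cell before interleaving.
    x = int((lon + 180.0) / 360.0 * cells)
    y = int((lat + 90.0) / 180.0 * cells)
    return interleave_bits(x, y, bits)


vienna = z_order(16.37, 48.21)
bratislava = z_order(17.11, 48.15)
sydney = z_order(151.21, -33.87)
print(vienna, bratislava, sydney)
```

Because the key is one-dimensional, it can be stored in any sorted key-value store and scanned by range. Locality is only approximate: two points on either side of a large cell boundary can be close on the map and far apart on the curve.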

Integrations

Beyond the Elephants

Presto (prestodb.io) or Trino (trino.io)

  • Distributed SQL Query Engine for Big Data

Dask (dask.org)

  • Parallel and distributed computing library for analytics written in Python

GeoPandas (geopandas.org)

  • Extends the datatypes used by Pandas to allow spatial operations on geometric types
  • rtree - Spatial index for Python GIS

Elasticsearch (elastic.co)

Geo Point Indexing
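
A sketch of what indexing a point could look like, assuming Elasticsearch 7 or later (typeless mappings). Only the request bodies are built here, as plain Python dictionaries; nothing is sent to a cluster. `geo_point` is the Elasticsearch field type, while the index name `places` and the field names are hypothetical.

```python
import json

# Hypothetical index and field names.
INDEX_NAME = "places"

# Body for creating the index: "location" is declared as a geo_point field.
mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "keyword"},
            "location": {"type": "geo_point"},
        }
    }
}


def to_geo_point(lon: float, lat: float) -> dict:
    """Build a geo_point value in object form, with explicit lat and lon keys."""
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError("coordinate out of range")
    return {"lat": lat, "lon": lon}


document = {"name": "Stephansdom", "location": to_geo_point(16.3738, 48.2082)}

print(json.dumps(mapping, indent=2))
print(json.dumps(document))
```

The object form with named keys avoids the ordering trap of the array form, which expects longitude first.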

Geo queries

  • geo_bounding_box query
  • geo_distance query
  • geo_polygon query
  • geo_shape query
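
Two of the queries listed above, sketched as request bodies for the search API. The helper names and the field name `location` are hypothetical, and the field is assumed to be mapped as `geo_point`; the bodies are only printed, not sent.

```python
import json


def geo_distance_query(field: str, lon: float, lat: float, distance: str) -> dict:
    """Match documents whose point lies within `distance` (e.g. "5km") of a center."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        field: {"lat": lat, "lon": lon},
                    }
                }
            }
        }
    }


def geo_bounding_box_query(field: str, west: float, south: float,
                           east: float, north: float) -> dict:
    """Match documents whose point lies inside a lon/lat rectangle."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "geo_bounding_box": {
                        field: {
                            "top_left": {"lat": north, "lon": west},
                            "bottom_right": {"lat": south, "lon": east},
                        }
                    }
                }
            }
        }
    }


near_center = geo_distance_query("location", 16.3738, 48.2082, "5km")
in_box = geo_bounding_box_query("location", 16.2, 48.1, 16.6, 48.3)
print(json.dumps(near_center, indent=2))
print(json.dumps(in_box, indent=2))
```

Both queries sit in a `bool` filter clause because a spatial match is a yes/no condition and does not need relevance scoring.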

MongoDB (mongodb.com)

  • Document-oriented NoSQL database which uses JSON-like documents with optional schemas

Geospatial Index
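
A sketch of the data shape a MongoDB geospatial index expects, assuming the `2dsphere` index type and GeoJSON geometry. No database connection is made; the field and document names are hypothetical. With the PyMongo driver, a specification like `index_spec` would typically be passed to `create_index` on a collection.

```python
import json


def geojson_point(lon: float, lat: float) -> dict:
    """Build a GeoJSON Point; positions are ordered longitude first, then latitude."""
    if not (-180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0):
        raise ValueError("coordinate out of range")
    return {"type": "Point", "coordinates": [lon, lat]}


# Hypothetical document with its geometry stored under "location".
place = {"name": "Stephansdom", "location": geojson_point(16.3738, 48.2082)}

# Index specification: field name paired with the "2dsphere" index type.
index_spec = [("location", "2dsphere")]

print(json.dumps(place))
print(index_spec)
```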

Geospatial Queries

  • $geoIntersects
  • $geoWithin
  • $near
  • $nearSphere
  • $geoNear (aggregation pipeline stage)
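
Two of the operators above, sketched as query filter documents built in plain Python. The helper names and the field name `location` are hypothetical, and the field is assumed to hold GeoJSON geometry with a `2dsphere` index; the filters are only printed, not executed.

```python
import json


def near_sphere_filter(field: str, lon: float, lat: float, max_distance_m: float) -> dict:
    """Filter for documents near a point, nearest first, within a distance in meters."""
    return {
        field: {
            "$nearSphere": {
                "$geometry": {"type": "Point", "coordinates": [lon, lat]},
                "$maxDistance": max_distance_m,
            }
        }
    }


def geo_within_filter(field: str, ring: list) -> dict:
    """Filter for documents whose geometry lies entirely inside a polygon."""
    if len(ring) < 4 or ring[0] != ring[-1]:
        raise ValueError("a GeoJSON ring needs at least 4 positions and must be closed")
    return {field: {"$geoWithin": {"$geometry": {"type": "Polygon", "coordinates": [ring]}}}}


# Rectangle given as a closed ring of [lon, lat] positions.
ring = [[16.2, 48.1], [16.6, 48.1], [16.6, 48.3], [16.2, 48.3], [16.2, 48.1]]
nearby = near_sphere_filter("location", 16.3738, 48.2082, 1000)
inside = geo_within_filter("location", ring)
print(json.dumps(nearby))
print(json.dumps(inside))
```

`$nearSphere` returns results sorted by distance, while `$geoWithin` only tests containment and leaves the order unspecified.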

Redis (redis.io)

  • In-memory data structure store, used as a database, cache and message broker

Geospatial Indexing
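
Redis keeps geospatial members in a sorted set whose score is derived from a geohash-style interleaving of longitude and latitude bits, and members are added with `GEOADD key longitude latitude member`. The sketch below implements the standard base32 geohash to show the idea; it is not Redis's internal encoding, which differs in detail, and the function name is illustrative.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lon: float, lat: float, precision: int = 11) -> str:
    """Encode a coordinate by repeatedly halving the lon and lat ranges."""
    lon_range = [-180.0, 180.0]
    lat_range = [-90.0, 90.0]
    chars = []
    value = 0
    bits = 0
    use_lon = True  # bits alternate between axes, starting with longitude
    while len(chars) < precision:
        rng, coord = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if coord >= mid:
            value = (value << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            value <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[value])
            value = 0
            bits = 0
    return "".join(chars)


print(geohash(10.40744, 57.64911))
print(geohash(16.3738, 48.2082, precision=6))
```

Points that share a long prefix fall into the same cell, which is what lets a radius or box search be answered with a handful of range scans over the sorted set.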