There are several optimizations that would speed this operation up: doing all of the work in C++ without round-tripping through Python, using a spatial index to quickly identify candidate intersections, using prepared geometries to quickly check those candidates, and parallelizing the whole operation across your available cores.
All of this can be done in Python at the cost of some overhead, except that in my tests using the multiprocessing module on your multi-gigabyte dataset caused Python to run out of available memory. On Windows this would probably be unavoidable; on Linux copy-on-write should have prevented it. It could perhaps be made to work with careful programming, but the whole point of using Python is not having to worry about such details. So I decided to move the computation to C++.
To do this, I used pybind11 to build a new Python module that takes lists of geometries from GeoPandas and returns three lists: (1) the row indices of the left-hand geometries; (2) the row indices of the right-hand geometries; (3) the area of overlap between the two (only if it is > 0).
For example, for an input with geometries left=[A,B,C,D] and right=[E,F,G,H], let:
- E be entirely inside A
- F overlap with A and B
- G and H overlap nothing
Then the return looks like:
List1 List2 List3
A E Area(E)
A F AreaIntersection(A,F)
B F AreaIntersection(B,F)
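To make the return convention concrete, here is a minimal runnable sketch of what those three lists would contain for this example, built with shapely boxes (an illustration only: zero-based row positions, so A=0 and B=1 on the left, E=0 and F=1 on the right; the pair ordering is not guaranteed because the C++ loop runs in parallel):

from shapely.geometry import box

A, B = box(0, 0, 10, 10), box(10, 0, 20, 10)
C, D = box(30, 0, 31, 1), box(40, 0, 41, 1)
E = box(1, 1, 2, 2)                           # entirely inside A
F = box(9, 4, 11, 6)                          # overlaps A and B
G, H = box(50, 0, 51, 1), box(60, 0, 61, 1)   # overlap nothing

# Expected return values for these inputs:
# lefts  == [0, 0, 1]                         # rows of A, A, B
# rights == [0, 1, 1]                         # rows of E, F, F
# areas  == [E.area,                          # = 1.0
#            A.intersection(F).area,          # = 2.0
#            B.intersection(F).area]          # = 2.0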
On my machine the sjoin operation took 73 s and computing the intersection areas took 1066 s, for a total of 1139 s (19 minutes).
On my 12-core machine, the code below takes all of 50 seconds.
So for a spatial join where the areas of intersection are not needed, this saves only a little time; but for a spatial join where the intersection areas are needed, it saves a lot of time. Put another way, computing all of those intersections is a lot of work!
In further testing, computing the intersection areas without prepared geometries took 287 s on 12 cores. So parallelizing the intersections alone gives a 4x speedup, while parallelizing plus prepared geometries gives a 23x speedup.
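For reference, the same strategy can also be sketched in pure Python using Shapely's prepared geometries and the GeoPandas spatial index. This is only an illustration of the approach (the function name and structure here are mine, not part of the module below), and it still pays the Python-level overhead discussed above:

# Pure-Python sketch of the strategy: bounding-box candidates from a spatial
# index, prepared geometries for the cheap predicate checks, and an exact
# intersection only for partial overlaps. Assumes shapely and a GeoPandas
# .sindex backed by an STR-tree.
from shapely.prepared import prep

def sjoin_with_areas(left_df, right_df):
    lefts, rights, areas = [], [], []
    sindex = left_df.sindex                          # spatial index over the left geometries
    prepared = [prep(g) for g in left_df.geometry]   # fast contains/intersects checks
    for r, rgeom in enumerate(right_df.geometry):
        for l in sindex.intersection(rgeom.bounds):  # candidate left rows by bounding box
            if prepared[l].contains(rgeom):
                # Fully inside: the overlap is the right geometry's own area
                lefts.append(l)
                rights.append(r)
                areas.append(rgeom.area)
            elif prepared[l].intersects(rgeom):
                # Partial overlap: fall back to an exact intersection
                area = left_df.geometry.iloc[l].intersection(rgeom).area
                if area > 0:
                    lefts.append(l)
                    rights.append(r)
                    areas.append(area)
    return lefts, rights, areas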
Makefile
all:
	$(CXX) -O3 -g -shared -std=c++11 -I include `python3-config --cflags --ldflags --libs` quick_join.cpp -o geopandas_fast_sjoin.so -fPIC -Wall -lgeos -fopenmp
quick_join.cpp
#define GEOS_USE_ONLY_R_API 1
#include <geos/geom/Geometry.h>
#include <geos/geom/prep/PreparedGeometry.h>
#include <geos/geom/prep/PreparedGeometryFactory.h>
#include <geos/index/strtree/STRtree.h>
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>
#include <memory>
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_max_threads() 1
#define omp_get_thread_num() 0
#endif
///Fast Spatial Joins
///
///@param gp_left  List of GEOS geometry pointers from the left data frame.
/// The code works best if gp_left is comprised of relatively
/// fewer and relatively larger geometries.
///@param gp_right List of GEOS geometry pointers from the right data frame.
/// The code works best if gp_right is comprised of relatively
/// more numerous and relatively smaller geometries.
///
///The list of GEOS geometry pointers can be acquired with
/// geos_pointers = [x._geom for x in df['geometry']]
///
///A common task in GeoPandas is taking two dataframes and combining their
///contents based on how much the contents' geometries overlap. However, this
///operation is slow in GeoPandas because most of it is performed in Python.
///Here, we offload the entire computation to C++ and use a number of techniques
///to achieve good performance.
///
///Namely, we create a spatial index from the left-hand geometries. For each
///geometry from the right-hand side, this allows us to very quickly find which
///geometries on the left-hand side it might overlap with. For each geometry on
///the left-hand side, we create a "prepared geometry"; this accelerates simple
///spatial queries, such as checking for containment or disjointedness, by an
///order of magnitude. Finally, we parallelize the entire computation across all
///of the computer's threads.
///
///@return Three lists: (1) Row indices of left-hand geometries; (2) Row indices
/// of right-hand geometries; (3) Area of overlap between the two (only
/// if >0).
///
///For example, for an input with geometries left=[A,B,C,D] and right=[E,F,G,H],
///let:
/// * E be entirely in A
/// * F overlap with A and B
/// * G and H overlap nothing
///
///Then the return looks like:
///List1 List2 List3
///A E Area(E)
///A F AreaIntersection(A,F)
///B F AreaIntersection(B,F)
pybind11::tuple fast_sjoin(pybind11::list gp_left, pybind11::list gp_right) {
  typedef geos::geom::Geometry Geometry;
  typedef geos::geom::prep::PreparedGeometry PGeometry;

  //These are our return values
  std::vector<size_t> lefts;  //Indices of geometries from the left
  std::vector<size_t> rights; //Indices of the overlapping geometries from the right
  std::vector<double> areas;  //Area of the overlap (if it is >0)

  //If either list is empty, there is nothing to do
  if(gp_left.size()==0 || gp_right.size()==0)
    return pybind11::make_tuple(lefts, rights, areas);

  //Used to cache pointers to geometries so we don't have to constantly be doing
  //conversions
  std::vector<const Geometry* > lgeoms;
  std::vector<const Geometry* > rgeoms;

  //Prepared geometries can massively accelerate computation by providing quick
  //predicate checks on whether one geometry contains another. Here, we stash
  //the prepared geometries. Unfortunately, GEOS 3.6.2's prepared geometries are
  //not reentrant, so we make a private set of prepared geometries for each
  //thread.
  std::vector<std::vector<const PGeometry*>> lpgeoms(omp_get_max_threads());

  //Used for creating prepared geometries
  geos::geom::prep::PreparedGeometryFactory preparer;

  //For each geometry on the left, convert the input (a Python integer) into a
  //geometry pointer, then create a prepared geometry
  for(size_t i=0;i<gp_left.size();i++){
    const size_t ptr_add = gp_left[i].cast<size_t>(); //Item is an integer
    const Geometry* geom = reinterpret_cast<Geometry*>(ptr_add);
    for(int t=0;t<omp_get_max_threads();t++)
      lpgeoms.at(t).push_back(preparer.prepare(geom));
    lgeoms.push_back(geom);
  }
  //For each geometry on the right, convert the input (a Python integer) into a
  //geometry pointer
  for(size_t i=0;i<gp_right.size();i++){
    const size_t ptr_add = gp_right[i].cast<size_t>(); //Item is an integer
    const Geometry* geom = reinterpret_cast<Geometry*>(ptr_add);
    rgeoms.push_back(geom);
  }

  //The STRtree spatial index stores rectangles and allows us to quickly find
  //all the rectangles that overlap with a query rectangle. We leverage this
  //here by inserting the bounding boxes of all of the left geometries into a
  //spatial index. The spatial index also allows us to store a pointer to a data
  //structure; this pointer is returned if a query finds a hit or hits. We abuse
  //this capability by using the pointer to store the array index containing the
  //left geometry. This allows us to quickly find both the left geometry and its
  //associated prepared geometry.
  geos::index::strtree::STRtree index;
  for(size_t i=0;i<lgeoms.size();i++)
    index.insert(lgeoms[i]->getEnvelopeInternal(), reinterpret_cast<void*>(i));

  //Once all of the geometries are inserted into the spatial index, the index
  //must be built. This must be done in serial since GEOS does not have
  //protection against multiple threads trying it. (This is logical since it
  //eliminates a lock that would otherwise slow queries, but a better design
  //would probably have been to throw an exception.) The GEOS spatial tree also
  //lacks an explicit command for building the tree (wtf), so here we perform a
  //meaningless, single-threaded query to ensure the tree gets built.
  {
    std::vector<void *> results;
    index.query( lgeoms[0]->getEnvelopeInternal(), results );
  }
  //Each query to the spatial index populates a predefined vector with the
  //results of the query. We define this vector here, outside of the loop, to
  //avoid the memory cost of reallocating on each iteration of the loop.
  std::vector<void *> results;

  //These are custom OpenMP reduction operators for combining vectors together
  //following the parallel section. We leverage them to have each thread build
  //its own private result vectors which are afterwards combined into a single
  //result.
  #pragma omp declare reduction(merge : std::vector<size_t> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))
  #pragma omp declare reduction(merge : std::vector<double> : omp_out.insert(omp_out.end(), omp_in.begin(), omp_in.end()))

  //Now we loop through all of the geometries on the right-hand side. We do this
  //in parallel because we assume there are many of them.
  #pragma omp parallel for default(none) shared(rgeoms,index,lgeoms,lpgeoms) private(results) reduction(merge:lefts) reduction(merge:rights) reduction(merge:areas)
  for(unsigned int r=0;r<rgeoms.size();r++){
    const Geometry *const rgeom = rgeoms.at(r);
    index.query( rgeom->getEnvelopeInternal(), results );
    for(const auto q: results){
      //results is a list of "pointers". But we abused the pointers by using
      //them to stash array indices. Let's unpack the "pointers" into indices
      //now.
      const size_t lnum = reinterpret_cast<size_t>(q);
      const Geometry  *const lgeom  = lgeoms.at(lnum);
      const PGeometry *const lpgeom = lpgeoms.at(omp_get_thread_num()).at(lnum);
      if(lpgeom->contains(rgeom)){
        //The right-hand geometry is entirely inside the left-hand geometry
        lefts.push_back(lnum);
        rights.push_back(r);
        areas.push_back(rgeom->getArea());
      } else if(lpgeom->disjoint(rgeom)){
        //The right-hand geometry is entirely outside the left-hand geometry
      } else {
        //The right-hand geometry is partially inside and partially outside the
        //left-hand geometry, so we get the area of intersection of the two.
        std::unique_ptr<Geometry> igeom(rgeom->intersection(lgeom));
        const auto iarea = igeom->getArea();
        lefts.push_back(lnum);
        rights.push_back(r);
        areas.push_back(iarea);
      }
    }

    //Clear the results vector so we're ready for the next iteration. Note that
    //clearing it does not release its memory, so after the first few iterations
    //we should no longer be performing allocations.
    results.clear();
  }

  return pybind11::make_tuple(lefts, rights, areas);
}
PYBIND11_MODULE(geopandas_fast_sjoin,m){
  m.doc() = "Fast spatial joins";
  m.def("fast_sjoin", &fast_sjoin, "Performs a fast spatial join");
}
test.py
#!/usr/bin/env python3
import geopandas as gpd
import geopandas_fast_sjoin as gpfsj
import time
import os
import pickle
import sys
os.environ["GDAL_DATA"] = "/usr/share/gdal"
DATA_DIR = "./data/"
TRACT_FILE = "./data/tracts/tl_2010_{fips}_tract10.shp"
BLOCK_FILE = "./data/blocks/tabblock2010_{fips}_pophu.shp"
PRECINCT_FILE = "./data/precincts/precincts2008/USA_precincts.shp"
STATES_FILE = "./data/states/tl_2010_us_state10.shp"
PROJECTION = '+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs'
print("Reading data...")
start_time = time.time()
tracts = gpd.read_file(TRACT_FILE.format(fips=48))
blocks = gpd.read_file(BLOCK_FILE.format(fips=48))
print('Time: ', time.time()-start_time )
print("Converting coordinate reference systems...")
start_time = time.time()
tracts = tracts.to_crs(PROJECTION)
blocks = blocks.to_crs(PROJECTION)
print('Time: ', time.time()-start_time )
print("Performing spatial join...")
start_time = time.time()
joined = gpd.sjoin(tracts, blocks, how='left')
joined['area_of_intersect'] = [row['geometry'].intersection(blocks.loc[row['index_right']]['geometry']).area for i,row in joined.iterrows()]
print('Time: ', time.time()-start_time )
# pickle.dump( (blocks,tracts), open( "save.p", "wb" ) )
# sys.exit(-1)
# blocks, tracts = pickle.load( open( "save.p", "rb" ) )
print("Getting geometries...")
start_time = time.time()
tgeoms = [x._geom for x in tracts['geometry']]
bgeoms = [x._geom for x in blocks['geometry']]
print('Time: ', time.time()-start_time )
print("Running example")
start_time = time.time()
lefts, rights, areas = gpfsj.fast_sjoin(tgeoms,bgeoms)
print('Time: ', time.time()-start_time )
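The three returned lists can then be attached back onto the dataframes in whatever form the analysis needs. A hypothetical follow-up using pandas (the column names tract_idx and block_idx are illustrative, not part of the module's API) might look like:

# Hypothetical follow-up: turn the returned index lists into a joined table.
import pandas as pd

pairs = pd.DataFrame({
    'tract_idx': lefts,                  # positional rows in `tracts`
    'block_idx': rights,                 # positional rows in `blocks`
    'area_of_intersect': areas,
})
pairs = pairs.join(tracts.drop(columns='geometry').reset_index(drop=True), on='tract_idx')
pairs = pairs.join(blocks.drop(columns='geometry').reset_index(drop=True), on='block_idx',
                   rsuffix='_block')
print(pairs.head())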