Solution
You can do this using numpy broadcasting. The array R2 has shape (A_rows, B_rows), i.e. (10, 4). For very large arrays A and B you may have to process the data in batches; for that, use the custom function get_distmins() shown further below.
# use numpy broadcasting
# R2 => squared distance matrix
R2 = ((A[:,0].reshape(-1,1) - B[:,0].reshape(1,-1))**2 +
      (A[:,1].reshape(-1,1) - B[:,1].reshape(1,-1))**2)
# min distance from all points in B, for each row in A
np.sqrt(R2.min(axis=1))
Output:
array([0.18452659, 0.33305075, 0.25056875, 0.37935569, 0.164265  ,
       0.4730478 , 0.23230641, 0.21364414, 0.1050965 , 0.15871064])
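For reference, the same result can be obtained with a single broadcast over both coordinate columns. This is only a sketch of an equivalent formulation, not a different method; note that it builds a full (A_rows, B_rows, 2) intermediate, so it is heavier on memory than the two-term version above.
# pairwise coordinate differences via broadcasting: shape (A_rows, B_rows, 2)
diff = A[:, None, :] - B[None, :, :]
# Euclidean distance from each row of A to its nearest row of B
np.sqrt((diff**2).sum(axis=-1)).min(axis=1)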
Dummy Data
import numpy as np
np.random.seed(seed=42) # for reproducibility
A = np.random.rand(20).reshape(-1, 2)
B = np.random.rand(8).reshape(-1, 2)
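With these arrays, A.shape is (10, 2) and B.shape is (4, 2), so running the first snippet on them produces an R2 of shape (10, 4), as stated above. A quick check:
print(A.shape, B.shape)  # (10, 2) (4, 2)
print(R2.shape)          # (10, 4)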
How to handle large arrays A and B?
We use batch_size to process the arrays in batches. The custom function get_distmins() can be used as follows:
# import numpy as np
# np.random.seed(seed=42)
# A = np.random.rand(20000).reshape(-1, 2)
# B = np.random.rand(8000).reshape(-1, 2)
%%time
distmins, argmins = get_distmins(A, B, batch_size=2000, dtype=np.float32)
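As a rough guide for picking batch_size and dtype: each iteration materializes a (batch_size, B.shape[0]) block of squared distances, so that block takes roughly batch_size * B.shape[0] * itemsize bytes (the temporaries of the two-term sum add a small multiple of that). A back-of-the-envelope sketch for the numbers used above:
import numpy as np

batch_size = 2000
numrows_B = 4000                          # B.shape[0] for B = np.random.rand(8000).reshape(-1, 2)
itemsize = np.dtype(np.float32).itemsize  # 4 bytes

block_bytes = batch_size * numrows_B * itemsize
print(f"~{block_bytes / 1e6:.0f} MB per (batch_size, numrows_B) block")  # ~32 MB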
Custom function for large arrays A and B
import numpy as np
# tqdm is used to show a progress bar.
from tqdm.notebook import tqdm  # --> use only with jupyter notebook
# from tqdm import tqdm         # --> use for both notebook/non-notebook

def get_distmins(A, B, batch_size=None, dtype=np.float64):
    """
    For each row of A, find the distance to (and the index of) the
    nearest row of B.

    Parameters
    ----------
    A (array): a numpy array of shape (numrows_A, 2)
    B (array): a numpy array of shape (numrows_B, 2)
    batch_size (int or None): Default is None. If None, it auto-selects
        batch_size = A.shape[0] (numrows_A).
    dtype: Default is np.float64. Choosing a smaller dtype (e.g.
        np.float32) can help manage memory for large A and B.

    How to choose a good batch_size?
        If A and/or B are too large, the calculations may not fit in
        memory and you will have to use a suitable batch_size to
        process all the data:
        + start with a smaller A and B and set batch_size = A.shape[0]
        + check when it can no longer handle the data in memory
        + then choose batch_size such that
          (A.shape[0] * B.shape[0])_max_allowed
              == (B.shape[0])_true * batch_size

    Example
    -------
    %%time
    distmins, argmins = get_distmins(A, B, batch_size=2000, dtype=np.float32)
    """
    # astype returns a new array, so the result must be rebound
    A = A.astype(dtype, copy=False)
    B = B.astype(dtype, copy=False)
    if batch_size is None:
        batch_size = A.shape[0]
    argmins = []
    distmins = []
    # use numpy broadcasting on one batch of rows of A at a time
    for start in tqdm(range(0, A.shape[0], batch_size), desc='loop'):
        end = start + batch_size
        # R2 => squared distances between A[start:end] and every row of B
        R2 = ((A[start:end, 0].reshape(-1, 1) - B[:, 0].reshape(1, -1))**2 +
              (A[start:end, 1].reshape(-1, 1) - B[:, 1].reshape(1, -1))**2)
        # index of the nearest point of B for each row in this batch of A
        argmin = R2.argmin(axis=1)
        # min distance from all points in B, for each row in this batch of A
        distmin = np.sqrt(R2[np.arange(R2.shape[0]), argmin])
        argmins.append(argmin)
        distmins.append(distmin.astype(dtype))
        del R2, argmin, distmin
    print('\n>>> done!...')
    # concatenate the per-batch results into full-length arrays
    return (np.concatenate(distmins), np.concatenate(argmins))
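A quick sanity check of the batched routine against the direct computation, using the small A and B from the Dummy Data section (a sketch; it assumes the imports and the function above, which returns concatenated arrays):
np.random.seed(seed=42)
A = np.random.rand(20).reshape(-1, 2)
B = np.random.rand(8).reshape(-1, 2)

# direct broadcasting result
R2 = ((A[:, 0].reshape(-1, 1) - B[:, 0].reshape(1, -1))**2 +
      (A[:, 1].reshape(-1, 1) - B[:, 1].reshape(1, -1))**2)
expected = np.sqrt(R2.min(axis=1))

# batched result with a deliberately small batch_size
distmins, argmins = get_distmins(A, B, batch_size=3)
assert np.allclose(distmins, expected)
assert np.array_equal(argmins, R2.argmin(axis=1))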