Pyspark Iterative Query - PullRequest

0 votes
/ April 17, 2020

I need to select rows so that the date difference between consecutive selected rows is >= 7 days. I need to solve this in PySpark.

PTID|DATE
P1|1st Apr, 2020
P1|10th Apr, 2020
P1|12th Apr, 2020
P1|16th Apr, 2020
P1|17th Apr, 2020
P1|29th Apr, 2020
P2|1st Apr, 2020
P2|15th Apr, 2020



Required output:
PTID|DATE
P1|1st Apr, 2020
P1|10th Apr, 2020
P1|17th Apr, 2020
P1|29th Apr, 2020
P2|1st Apr, 2020
P2|15th Apr, 2020



Explanation of output:
1. Each row represents a patient's visit to the hospital.
2. I have to select rows so that no two selected rows are less than 7 days apart.
3. The 1st row is always selected.
4. The 2nd row is selected because it is >= 7 days after the 1st row.
5. The 3rd row is not selected because it is only 2 days after the 2nd row.
6. The 4th row is not selected because it is only 6 days after the 2nd row (the comparison is with the 2nd row because the 3rd was not selected).
7. The 5th row is selected because it is 7 days after the 2nd row (the comparison is with the 2nd row because the 3rd and 4th were not selected).
...
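
The selection rule described above depends on the last *selected* row rather than the previous row, so a plain `lag` window comparison is probably not enough. One possible sketch is a greedy scan per patient. The names `select_visits` and `filter_visits_spark` are hypothetical, and the sketch assumes `DATE` has already been parsed into a proper date column (the strings shown in the question, such as `1st Apr,2020`, would need parsing first) and that each patient's visit list is small enough to collect into a single array.

```python
from datetime import date


def select_visits(dates, min_gap_days=7):
    """Greedy scan: keep a date only if it is at least min_gap_days
    after the most recently kept date. The earliest date is always kept."""
    kept = []
    for d in sorted(dates):
        if not kept or (d - kept[-1]).days >= min_gap_days:
            kept.append(d)
    return kept


def filter_visits_spark(df, min_gap_days=7):
    """Hypothetical wrapper: apply select_visits to each PTID group.

    Assumes df has columns PTID (string) and DATE (DateType).
    pyspark is imported lazily so the pure-Python part runs without Spark.
    """
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, DateType

    keep_udf = F.udf(
        lambda ds: select_visits(ds, min_gap_days), ArrayType(DateType())
    )
    return (
        df.groupBy("PTID")
        .agg(F.collect_list("DATE").alias("dates"))  # all visits per patient
        .select("PTID", F.explode(keep_udf("dates")).alias("DATE"))
    )


# Plain-Python check against the P1 rows from the question.
p1 = [date(2020, 4, d) for d in (1, 10, 12, 16, 17, 29)]
print([d.isoformat() for d in select_visits(p1)])
```

The scan is sequential within a patient but independent across patients, so the grouping by `PTID` is what would let Spark parallelise it. If some patients have very long visit histories, a grouped pandas function may be a better fit than collecting into an array.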