For "clean" input where you only need a presence/absence check, use set.intersection. Building a set from the larger text (if you can hold it in memory) speeds this task up considerably.
The set reduces the words to check to the unique ones, and each membership check is O(1) on average, which is about as fast as you can get:
from urllib.request import urlopen

# use the copy on disk if present, else fetch it once from the URL and save it
try:
    with open("faust.txt", encoding="utf-8", errors="replace") as f:
        data = f.read()
except FileNotFoundError:
    # partial credit: https://stackoverflow.com/a/46124819/7505395
    # get some free text - Goethe's Faust should suffice
    url = "https://archive.org/stream/fausttragedy00goetuoft/fausttragedy00goetuoft_djvu.txt"
    raw = urlopen(url).read()
    with open("faust.txt", "wb") as f:
        f.write(raw)
    # decode so that both branches yield str rather than bytes
    data = raw.decode("utf-8", errors="replace")
Preprocessing the data for the measurements:
words = data.split()   # words: 202915
unique = set(words)    # distinct words: 34809

none_true = {"NoWayThatsInIt_1", "NoWayThatsInIt_2", "NoWayThatsInIt_3", "NoWayThatsInIt_4"}
one_true = none_true | {"foul"}

# should really use timeit for this, haven't got it here
def sloppy_time_measure(f, text):
    import time
    print(text, end="")
    t = time.time()
    # execute the function 1000 times
    for _ in range(1000):
        f()
    print((time.time() - t) * 1000, "milliseconds")
# .intersection calculates the _full_ intersection, not only an "in" check:
lw = len(words)
ls = len(unique)

sloppy_time_measure(lambda: none_true.intersection(words), f"Find none in list of {lw} words: ")
sloppy_time_measure(lambda: one_true.intersection(words), f"Find one in list of {lw} words: ")
sloppy_time_measure(lambda: any(w in words for w in none_true),
                    f"Find none using 'in' in list of {lw} words: ")

sloppy_time_measure(lambda: none_true.intersection(unique), f"Find none in set of {ls} uniques: ")
sloppy_time_measure(lambda: one_true.intersection(unique), f"Find one in set of {ls} uniques: ")
sloppy_time_measure(lambda: any(w in unique for w in one_true),
                    f"Find one using 'in' in set of {ls} uniques: ")
Output for 1000 runs of each search (spacing added for clarity):
# in list
Find none in list of 202921 words: 5038.942813873291 milliseconds
Find one in list of 202921 words: 4234.968662261963 milliseconds
Find none using 'in' in list of 202921 words: 9726.848363876343 milliseconds
# in set
Find none in set of 34809 uniques: 15.897989273071289 milliseconds
Find one in set of 34809 uniques: 11.409759521484375 milliseconds
Find one using 'in' in set of 34809 uniques: 39.183855056762695 milliseconds
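
The timing helper above is admittedly sloppy, so here is a rough sketch of the same comparison using timeit. It is only an illustration under a few assumptions: the word list is synthetic (not the Faust text), the helpers `contains_any` and `bench` are names made up for this example, and `set.isdisjoint` is an extra that the measurements above did not include. The large gap comes from how much work each call does: intersecting with a list has to hash every one of its elements, while for two sets CPython iterates over the smaller one and does O(1) lookups in the larger.

```python
import timeit

# Synthetic stand-in for the book text (assumption: not the real Faust data).
words = [f"word{i % 5000}" for i in range(100_000)]  # list with many repeats
unique = set(words)                                   # 5000 distinct words

none_true = {"NoWayThatsInIt_1", "NoWayThatsInIt_2"}
one_true = none_true | {"word42"}


def contains_any(needles, haystack):
    """Return True if at least one needle occurs in haystack (list or set)."""
    return any(w in haystack for w in needles)


def bench(func, number=20):
    """Best of 3 repeats; each repeat is the total seconds for `number` calls."""
    return min(timeit.repeat(func, number=number, repeat=3))


timings = {
    "intersection with list": bench(lambda: one_true.intersection(words)),
    "intersection with set": bench(lambda: one_true.intersection(unique)),
    "'in' on list": bench(lambda: contains_any(none_true, words)),
    "'in' on set": bench(lambda: contains_any(none_true, unique)),
    "isdisjoint with set": bench(lambda: none_true.isdisjoint(unique)),
}

for label, seconds in timings.items():
    print(f"{label:>24}: {seconds * 1000:.3f} ms")
```

If you only need a yes/no answer, `any(...)` over a set or `isdisjoint` can stop at the first hit, whereas `intersection` always builds the full result. Absolute numbers will vary by machine, so treat them as relative only.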