Python's int can parse non-ASCII Unicode digits (full-width digits, for example), so this works:
import requests
from bs4 import BeautifulSoup
import pandas as pd
base_url = "http://www.etnet.com.hk/www/tc/news/categorized_news_list.php?page=1&category=result"
result = requests.get(base_url)
src = result.content
soup = BeautifulSoup(src, 'html.parser')
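# Collect the text of every <p> element into a one-column DataFrame.
# DataFrame.append was removed in pandas 2.0, so build the frame directly.
news = []
for a_tag in soup.find_all('p'):
    news.append(a_tag.text)
df = pd.DataFrame(news, columns=['News'])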
def to_int(s):
    # int() accepts any Unicode decimal digits; NaN (no match) raises ValueError
    try:
        return int(s)
    except ValueError:
        return 0
# Raw string avoids invalid-escape warnings; expand=False returns a Series
df['num'] = df['News'].str.extract(r'\((\d+)\)', expand=False)
df["stock_num"] = df["num"].apply(to_int).astype("int64")
print(df)

Output:
News num stock_num
0 21/01/2020 09:31 NaN 0
1 【企業盈警】中彩網通(8071)料去年錄虧損,或中止虧損業務 8071 8071
2 《經濟通通訊社21日專訊》中彩網通(08071)預期去年第四季度的收入將顯著下降,而截至... 08071 8071
3 21/01/2020 09:28 NaN 0
4 《業績變臉》再多15家A股公司計提減值準備,減值近3百億人幣 NaN 0
.. ... ... ...
894 強積金資訊 NaN 0
895 MPF小字典 NaN 0
896 退休金計算機 NaN 0
897 NaN 0
898 我的MPF NaN 0
[899 rows x 3 columns]
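
To check the claim about int without hitting the network, here is a minimal sketch using only the standard library. The sample headlines and the names `samples`, `CODE_RE` and `extract_code` are made up for illustration and are not part of the code above; the second headline uses full-width digits.

```python
import re

# Hypothetical sample headlines; the second one uses full-width digits
# (U+FF10..U+FF19) inside ASCII parentheses.
samples = [
    "中彩網通(8071)料去年錄虧損",
    "中彩網通(０８０７１)預期收入下降",
    "21/01/2020 09:31",
]

# For str patterns, \d matches any Unicode decimal digit, not only 0-9.
CODE_RE = re.compile(r"\((\d+)\)")


def extract_code(text):
    """Return the parenthesised number as an int, or 0 if there is none."""
    match = CODE_RE.search(text)
    return int(match.group(1)) if match else 0


codes = [extract_code(s) for s in samples]
print(codes)  # [8071, 8071, 0]
```

Note that this only covers characters in the Unicode "decimal digit" category. Something like a superscript "²" is not accepted by int and raises ValueError, which is why the try/except fallback is still worth keeping.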