Collected Data and Data Preprocessing Document

0. Purpose

Preprocess the txt files containing the HTML obtained through Selenium web crawling: add embedding vectors and convert the result into CSV files that are easy to load into a SQLite DB.
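
As a rough illustration of the final loading step, the sketch below reads CSV text and inserts it into SQLite using only the standard library. The table name `restaurant`, the three column names, and the sample rows are made up for the example and are not taken from the project; every column is stored as TEXT for simplicity.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content standing in for the preprocessed output file.
CSV_TEXT = """name,region,tel
Sample Restaurant,신대방삼거리역,02-000-0000
Another Place,신대방삼거리역,
"""


def load_csv_into_sqlite(csv_file, conn, table="restaurant"):
    """Create `table` from the CSV header and insert every row as text."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{col}" TEXT' for col in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    conn.commit()


conn = sqlite3.connect(":memory:")  # pass a file path instead for a persistent DB
load_csv_into_sqlite(io.StringIO(CSV_TEXT), conn)
rows = conn.execute('SELECT name, tel FROM "restaurant"').fetchall()
print(rows)
```

With a real file, `open(path, newline="", encoding="utf-8")` would take the place of `io.StringIO`.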

1. Source data analysis

1 - 1. Parsing restaurant information

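import re

# page_soup is assumed to be the BeautifulSoup object parsed from one crawled page.
# Any field may be missing from the page, so each one needs a None check.
page_head = page_soup.select_one(".profile-top-wrapper")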

# Restaurant name
restaurant_name = textify(page_head.select_one("h1.tit"))
print(restaurant_name)

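# Food image (the viewer button or its <img> may be missing, so check before reading src)
img_tag = page_head.select_one(".btn-main-photo-viewer img")
imglink = None if img_tag is None else img_tag.get("src")
print(imglink)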

# Field that holds the region name and the category tags
description_field = page_head.select_one("div.btxt")

# Region name (varies in practice, but this example only contains 신대방삼거리역)
region = textify(description_field.select_one("a.area"))
print(region)

# Restaurant category (select() returns an empty list when nothing matches, never None)
category = [textify(t) for t in description_field.select("a.btxt")]
print(category)

# Address (select() returns an empty list when nothing matches, so no None check is needed)
address = page_head.select("li.locat > a, li.locat > span")
address = " ".join(textify(t) for t in address)
print(address)

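# Business hours (the element may be missing, so check before reading its text)
work_hour_tag = page_head.select_one("#today-main-hours")
work_hour = "" if work_hour_tag is None else work_hour_tag.text.strip()
print(work_hour)
# Extract HH:MM times; in a raw string the digit class is written \d
hours = re.findall(r"\d{2}:\d{2}", work_hour)
print(hours)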

# Phone number
tel = textify(page_head.select_one("li.tel"))
print(tel)

# Tags
tags = page_head.select_one("li.tag")
tags = [] if tags is None else tags.select("a")
tags = [textify(t) for t in tags]
print(tags)

# Characteristics
character = page_head.select_one("li.char")
character = [] if character is None else character.select("a")
character = [textify(t) for t in character]
print(character)
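
The listing above calls `textify` without showing its definition. A minimal sketch of what such a helper might look like follows, assuming it should return an empty string for a missing tag and collapse runs of whitespace. The real helper in the project may behave differently.

```python
import re


def textify(node):
    """Return the tag's text with whitespace collapsed, or "" if the tag is missing.

    Assumption: `node` is either None or an object with a get_text() method,
    such as a BeautifulSoup Tag returned by select_one().
    """
    if node is None:
        return ""
    # Collapse newlines, tabs and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", node.get_text()).strip()


print(repr(textify(None)))
```

Returning "" instead of None lets callers such as `" ".join(...)` work without another branch.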

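The fields below were selected mainly because they are useful to users:

Restaurant name
Restaurant image link
Region name
Restaurant category
Address
Business hours
Phone number
Tags
Characteristics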
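
One possible way to carry these nine fields toward a CSV row is sketched below. The class name `Restaurant`, the English field names, and the choice to store list-valued fields as JSON text are all assumptions made for illustration; the source does not specify the CSV layout.

```python
import csv
import io
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Restaurant:
    # Hypothetical field names mirroring the nine selected fields.
    name: str = ""
    image_link: str = ""
    region: str = ""
    category: list = field(default_factory=list)
    address: str = ""
    hours: list = field(default_factory=list)
    tel: str = ""
    tags: list = field(default_factory=list)
    character: list = field(default_factory=list)


def to_csv_row(restaurant):
    """Flatten a record into a dict; list fields become JSON text."""
    return {
        key: json.dumps(value, ensure_ascii=False) if isinstance(value, list) else value
        for key, value in asdict(restaurant).items()
    }


# Made-up sample values, only to show the round trip.
sample = Restaurant(name="Sample Restaurant", region="신대방삼거리역",
                    hours=["11:00", "21:00"], tags=["점심", "혼밥"])

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(to_csv_row(sample)))
writer.writeheader()
writer.writerow(to_csv_row(sample))
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping lists as JSON text means a single CSV cell can be decoded back with `json.loads` after it is read from the database.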

1 - 2. Parsing menu information

# Select the menu list
# As above, it may be missing, so None checks are needed
menu_list = page_soup.select_one("#div_detail")
menu_list = None if menu_list is None else menu_list.select_one(".list.Restaurant-MenuList")
li_list = [] if menu_list is None else menu_list.select("li")
print(len(li_list))

for li in li_list:
    # Menu name
    menu_name = textify(li.select_one("span.restaurant-menu"))

    # Price
    menu_price = textify(li.select_one("p.restaurant-price"))

    # Description (likely to be missing)
    menu_desc = textify(li.select_one("p.menu-description"))

    print(f"Menu: {menu_name}. Price: {menu_price}")
    print(f"{menu_desc}")
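
The price is kept as raw text in the loop above. If a numeric column is wanted in the CSV, a normalization step along the following lines could be applied. This is only a sketch: the helper name `parse_price` is hypothetical, and the assumed input format (digits with optional thousands separators followed by a currency suffix such as 원) has not been confirmed against the crawled pages.

```python
import re


def parse_price(price_text):
    """Return the first number in the text as an int, or None if there is none.

    Assumption: prices look like "8,000원"; for a range such as
    "10,000원 ~ 15,000원" only the first value is returned.
    """
    match = re.search(r"\d[\d,]*", price_text or "")
    if match is None:
        return None
    return int(match.group().replace(",", ""))


print(parse_price("8,000원"))
print(parse_price(""))
```

Returning None for text without digits keeps entries such as market-price items distinguishable from a price of 0.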