Failed
I attempted to create a web scraping bot to gather job information. Initially, I utilized the Selenium
library; however, I encountered a challenge since Selenium doesn't directly provide status codes like traditional web requests. To address this, I turned to the selenium-wire
library. Unfortunately, I encountered issues with anti-bot measures, which prevented me from accessing the target website. Subsequently, I tried using both selenium-stealth
and undetected chromebrowser
solutions, but unfortunately, these attempts were also unsuccessful.
Github
GitHub - haeeema/bypass_selenium: A bypass using selenium
Code
from bs4 import BeautifulSoup
from seleniumwire import webdriver
# *selenium-wire for status_code
from selenium.webdriver.chrome.options import Options
# from selenium_stealth import stealth
# **selenium_stealth
base_url = "<https://kr.indeed.com/jobs?q=>"
search_term = "python"
options = Options()
'''
options.add_argument('window-size=1920x1080')
options.add_argument("disable-gpu")
options.add_argument("lang=ko_KR")
options.add_argument(
'user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
options.add_argument('--headless')
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
'''
browser = webdriver.Chrome(options=options)
'''
stealth(browser,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
webgl_vendor="Intel Inc.",
renderer="Intel Iris OpenGL Engine",
fix_hairline=True,
)
'''
# **selenium_stealth
browser.get(f"{base_url}{search_term}")
response = browser.last_request.response
# selenium-wire
if response.status_code != 200:
print("Can't request page")
else:
soup = BeautifulSoup(browser.page_source, "html.parser")
jobs_list = soup.find("ul", class_="jobsearch-ResultsList")
jobs = jobs_list.find_all("li", recursive=False)
# Beautifulsoup find all of "li"
# We only want to search for "li"s that are directly under ul tag.
results = []
for job in jobs:
zone = job.find("div", class_="mosaic-zone")
if zone == None:
# h2 = job.find("h2", class_="jobTitle")
# a = h2.find("a")
anchor = job.select_one("h2 a")
title = anchor["aria-label"]
link = anchor["href"]
company = job.find("span", class_="companyName")
location = job.find("div", class_="companyLocation")
job_data = {
"link": f"<https://kr.indeed.com>{link}",
"company": company.string,
"location": location.string,
"position": title,
}
results.append(job_data)
for result in results:
print(result)
*Install selenium-wire for status code
Selenium Wire - Extending Python Selenium | ScrapeOps
I installed Selenium-Wire to obtain the status code, but I'm unable to access the page due to using a bot.
Add chrome options -fail
options.add_argument('window-size=1920x1080')
options.add_argument("disable-gpu")
options.add_argument("lang=ko_KR")
options.add_argument(
'user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
options.add_argument('--headless')
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
**Add selenium_stealth -fail
Stealth not working with selenium wire.
from selenium_stealth import stealth
stealth(browser,
languages=["en-US", "en"],
vendor="Google Inc.",
platform="Win32",
webgl_vendor="Intel Inc.",
renderer="Intel Iris OpenGL Engine",
fix_hairline=True,
)