Question

How to configure selenium webdriver to scrape data on server

Posted on February 8, 2024
Python Modules Selenium DigitalOcean Droplets Python
Asked by lenartgolob

I have a web scraper written in Python with the Selenium library that works on my local machine. When I push the code to my droplet, I cannot run the app. This is one of my web scraping methods:

def defense_dash_lt10(pbp_stats, season):
    # Less than 10 foot
    url = 'https://www.nba.com/stats/players/defense-dash-lt10?Season=' + season

    options = Options()
    options.add_argument('--no-sandbox')
    driver = webdriver.Chrome(service=ChromeService(
        ChromeDriverManager().install()), options=options)

    driver.get(url)

    selects = driver.find_elements(By.CLASS_NAME, "DropDown_select__4pIg9 ")
    for select in selects:
        options = Select(select).options

    for option in options:
        if option.text == 'All':
            option.click() # select() in earlier versions of webdriver
            break

    # Find the table element
    table = driver.find_element(By.CLASS_NAME, 'Crom_table__p1iZz')

    # Find all rows in the table
    rows = table.find_elements(By.TAG_NAME, 'tr')
    defense_dash_lt10 = []

    # Loop through each row and extract the data from each cell
    for row in rows:
        player_dd_lt10 = []
        # Find all cells in the row
        cells = row.find_elements(By.TAG_NAME, 'td')
        for cell in cells:
            player_dd_lt10.append(cell.text)
        # Add pbp stats to defense dash if not empty
        if player_dd_lt10:
            if player_dd_lt10[0] in pbp_stats:
                defense_dash_lt10.append(player_dd_lt10 + pbp_stats[player_dd_lt10[0]][-2:])
            else:
                defense_dash_lt10.append(player_dd_lt10 + ['NaN', 'NaN'])
    header = ['Player', 'Team', 'Age', 'Position', 'GP', 'Games', 'FREQ%', 'DFGM', 'DFGA', 'DFG%', 'FG%', 'DIFF%', "MP", "BLKR"]
    return header, defense_dash_lt10

KFSys

Site Moderator

• February 12, 2024

Heya,

To run a Selenium-based web scraper on a server, such as a DigitalOcean droplet, you need to configure it to work in a headless environment. Servers typically don’t have a GUI, so you can’t run browsers in the regular, graphical mode. Here’s how to modify your existing Selenium setup to work on a server:

Install Necessary Packages on the Server:

Ensure Python is installed on the server.
Install the browser and its driver. For Chrome, you’ll need the Chrome browser itself and a matching ChromeDriver. The google-chrome-stable package is not in Ubuntu’s default repositories, so install it from Google’s .deb package. For example, on Ubuntu:

sudo apt-get update
sudo apt-get install -y wget unzip
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt-get install -y ./google-chrome-stable_current_amd64.deb

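Download the ChromeDriver build that matches your installed Chrome version (check it with google-chrome --version). The 2.41 URL below is only an example: it is a 2018 release for much older Chrome versions, and drivers for Chrome 115 and later are published through Chrome for Testing rather than this storage bucket. With Selenium 4.6 or later you can usually skip this step, because the bundled Selenium Manager downloads a matching driver automatically: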

wget https://chromedriver.storage.googleapis.com/2.41/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
sudo mv chromedriver /usr/bin/chromedriver
sudo chown root:root /usr/bin/chromedriver
sudo chmod +x /usr/bin/chromedriver
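
A mismatch between Chrome and ChromeDriver is a common reason for startup failures on a fresh server, so it may be worth comparing the two before running the scraper. The sketch below assumes the binaries are named google-chrome and chromedriver and are on your PATH; the helper names are made up for illustration and are not part of Selenium. The comparison is only meaningful for ChromeDriver 70 and later, which share Chrome's version numbering.

```python
import re
import shutil
import subprocess


def major_version(version_output):
    """Return the first major version number in a --version string, or None."""
    match = re.search(r"(\d+)\.\d+", version_output)
    return int(match.group(1)) if match else None


def binary_major_version(binary):
    """Run `<binary> --version` and return its major version, or None if not found."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return major_version(result.stdout)


def versions_match(chrome="google-chrome", driver="chromedriver"):
    """True only when both binaries are found and report the same major version."""
    chrome_major = binary_major_version(chrome)
    driver_major = binary_major_version(driver)
    return chrome_major is not None and chrome_major == driver_major
```

Calling versions_match() on the Droplet before starting the scraper gives a quick yes/no answer; printing binary_major_version("google-chrome") and binary_major_version("chromedriver") shows which side is out of date.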

Install Selenium for Python if not already installed:

pip install selenium

Modify Your Selenium Script to Use Headless Mode: Update your script to include headless options for the browser. Here’s an example modification for Chrome:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-gpu')

driver = webdriver.Chrome(options=options)

This will initialize Chrome in headless mode, allowing it to run without a GUI.
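
To check that the setup works before running the full scraper, a small smoke test along these lines may help. It is only a sketch: chrome_arguments and fetch_title are illustrative names, and the flag list simply mirrors the options shown above. The Selenium imports sit inside the function so that the flag helper can be used on its own.

```python
DEFAULT_ARGS = (
    "--headless",
    "--no-sandbox",
    "--disable-dev-shm-usage",
    "--disable-gpu",
)


def chrome_arguments(extra=()):
    """Return the default headless flags plus any extras, without duplicates."""
    merged = []
    for arg in (*DEFAULT_ARGS, *extra):
        if arg not in merged:
            merged.append(arg)
    return merged


def fetch_title(url, extra_args=()):
    """Open `url` in headless Chrome and return the page title."""
    # Imported lazily so chrome_arguments() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(extra_args):
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.title
    finally:
        # Always quit, otherwise Chrome processes pile up on the server.
        driver.quit()
```

If fetch_title("https://www.nba.com/stats") returns a title on the Droplet, the browser, driver, and headless flags are working and any remaining problem is likely in the scraping logic itself.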

Run Your Script on the Server:

Transfer your script to the server.
Run your script as you normally would on your local machine.

Additional Considerations:

Memory Usage: Selenium can be resource-intensive. Ensure your server has enough resources (RAM, CPU) to handle the load.
Execution Time: Web scraping with Selenium can be slow, because every page has to be loaded and rendered by a full browser. Consider this in your server’s resource planning.
Error Handling: Add robust error handling to your script to manage network issues, unexpected page structures, and other runtime errors.
Legal and Ethical Considerations: Ensure that your scraping activities comply with the terms of service of the website and relevant laws (like GDPR, if applicable).
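
As one possible shape for the error handling mentioned above, here is a small retry helper with a doubling delay. It is a generic sketch, not something Selenium provides; the name retry and its parameters are assumptions for illustration.

```python
import time


def retry(action, attempts=3, delay=2.0, exceptions=(Exception,), sleep=time.sleep):
    """Call `action()` until it succeeds or `attempts` is exhausted.

    Waits `delay` seconds after the first failure and doubles the wait each
    time. The last exception is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2
```

In a scraper you would typically narrow the exceptions argument, for example to WebDriverException from selenium.common.exceptions, and wrap a call such as lambda: driver.get(url) so that unrelated bugs are not silently retried.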

Debugging: If your script doesn’t work as expected, add more verbose logging to understand where it fails. Sometimes issues can arise due to differences between the local and server environments (such as different versions of packages or browsers).
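
Since a headless browser gives you nothing to look at, saving the page source and a screenshot at the point of failure can make the logs much more useful. A minimal sketch, assuming a helper name (save_debug_artifacts) and output directory of my own choosing; it relies only on the driver's page_source attribute and save_screenshot method:

```python
import logging
from pathlib import Path

logger = logging.getLogger("scraper")


def save_debug_artifacts(driver, out_dir="debug", name="failure"):
    """Write the current page source and a screenshot for later inspection."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)

    # What the browser actually received, which may differ from your local run.
    html_path = target / f"{name}.html"
    html_path.write_text(driver.page_source, encoding="utf-8")

    # What the page looked like when the failure happened.
    png_path = target / f"{name}.png"
    driver.save_screenshot(str(png_path))

    logger.error("Saved debug artifacts: %s, %s", html_path, png_path)
    return html_path, png_path
```

Calling this from an except block around the scraping code, before driver.quit(), leaves you with files you can copy off the Droplet and open locally.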

Remember, running a web scraper on a server is essentially the same as running it locally, with the key difference being the headless setup and ensuring all dependencies are correctly installed on the server.

Bobby Iliev

Site Moderator

• February 9, 2024

Hi there!

Running a Selenium-based web scraper on a DigitalOcean Droplet involves several considerations that differ from running the script on your local machine. You will have to set up a headless browser environment, manage the web driver installation, and make sure your script can run in a non-GUI server environment.

Here’s how you could do that:

1. Install Required Packages

Ensure your Droplet is up to date and has Python installed. You’ll also need to install Selenium and a web driver manager, such as webdriver-manager, which simplifies the management of binary drivers for different web browsers.

You can install these using pip. If you haven’t installed pip, you can install it using your package manager (e.g., apt for Ubuntu/Debian).

sudo apt update
sudo apt install python3-pip
pip3 install selenium webdriver-manager

2. Install a Web Browser and WebDriver

For headless operation, you can use Chrome or Firefox. This example uses Chrome, but the process is similar for Firefox.

Install Google Chrome:

wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install ./google-chrome-stable_current_amd64.deb

Install ChromeDriver: The webdriver-manager package you installed earlier will handle the ChromeDriver installation in your Python script, so you don’t need to manually install ChromeDriver.

3. Modify Your Script for Headless Operation

To run your browser in headless mode (without a GUI), you need to modify your Selenium script to specify headless options.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')  # Runs Chrome in headless mode.
options.add_argument('--no-sandbox')  # Bypass OS security model
options.add_argument('--disable-dev-shm-usage')  # Overcome limited resource problems

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)

4. Running Your Script

Now, you should be able to run your script on the server just like you would on your local machine. Ensure you’re using the correct Python command (python3 or python) based on your server’s configuration.
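
One more thing that might matter on a server: the stats table is rendered by JavaScript, so it may not exist yet when find_element runs right after driver.get. The sketch below is an assumption about how the function from the question could be split up, not a confirmed fix. It waits for the table explicitly and keeps the merging step as a plain function. The class name comes from the question, while the 20 second timeout and the helper names are guesses for illustration.

```python
def merge_with_pbp(table_rows, pbp_stats):
    """Append the last two play-by-play values to each row, or NaN placeholders."""
    merged = []
    for cells in table_rows:
        if not cells:
            continue  # rows without <td> cells (e.g. the header row) are skipped
        if cells[0] in pbp_stats:
            merged.append(cells + list(pbp_stats[cells[0]][-2:]))
        else:
            merged.append(cells + ["NaN", "NaN"])
    return merged


def read_table_rows(driver, class_name="Crom_table__p1iZz", timeout=20):
    """Wait until the table is present, then return its rows as lists of cell text."""
    # Imported lazily so merge_with_pbp() can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    table = WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.CLASS_NAME, class_name))
    )
    rows = table.find_elements(By.TAG_NAME, "tr")
    return [
        [cell.text for cell in row.find_elements(By.TAG_NAME, "td")]
        for row in rows
    ]
```

With these two pieces, the body of defense_dash_lt10 reduces to creating the headless driver, calling read_table_rows(driver), and passing the result to merge_with_pbp together with pbp_stats. If the wait times out, Selenium raises TimeoutException, which is easier to diagnose than a missing element.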