# Web Scraping Script with PowerShell, Selenium, and Excel

## Overview

This PowerShell script uses Selenium WebDriver, HtmlAgilityPack, and ImportExcel to scrape a specified website, extract article content, and save the data to an Excel file. The script is currently written for one specific site, but it can easily be adapted to other URLs by modifying the relevant lines in the script.
## Prerequisites

Before running the script, ensure you have the following installed:
- **Selenium PowerShell Module**: automates the web browser. Install with `Install-Module -Name Selenium`.
- **ChromeDriver**: the WebDriver for Chrome. Download ChromeDriver and place it in the directory the script points to.
- **HtmlAgilityPack**: parses the HTML content. Make sure `HtmlAgilityPack.dll` is in your PowerShell Modules directory.
- **ImportExcel**: writes the scraped data to an Excel file. Install with `Install-Module -Name ImportExcel`.
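
If the modules are not installed yet, a minimal one-time setup might look like this (the `-Scope CurrentUser` choice is an assumption; adjust as needed):

```powershell
# One-time setup: install the required PowerShell modules for the current user
Install-Module -Name Selenium -Scope CurrentUser
Install-Module -Name ImportExcel -Scope CurrentUser
```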
## Configuration

1. Set the path to your ChromeDriver on line 6:

   ```powershell
   $webDriverPath = 'C:\project1\chromedriver-win64'
   ```

2. Set the path to your Excel file on line 7:

   ```powershell
   $excelFilePath = 'C:\project1\data.xlsx'
   ```

3. Set the path to the HtmlAgilityPack.dll file on line 45:

   ```powershell
   Add-Type -Path 'C:\Program Files\WindowsPowerShell\Modules\HtmlAgilityPack\lib\Net45\HtmlAgilityPack.dll'
   ```
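
Before running, it can help to confirm those paths actually exist. This is a hypothetical pre-flight check, not part of the script itself:

```powershell
# Hypothetical pre-flight check: warn about any misconfigured path
$hapDllPath = 'C:\Program Files\WindowsPowerShell\Modules\HtmlAgilityPack\lib\Net45\HtmlAgilityPack.dll'
foreach ($path in @($webDriverPath, $hapDllPath)) {
    if (-not (Test-Path -Path $path)) {
        Write-Warning "Path not found: $path"
    }
}
```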
## Adapting to Another Site

1. On line 123, replace `'https://www.google.com'` with the URL of the site you want to scrape.

2. On line 128, update the XPath query to match the structure of the site you're scraping:

   ```powershell
   $articleLinks = $driver.FindElementsByXPath("//a[contains(@href, '/somethinghere/')]")
   ```
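
For orientation, the launch and link-collection steps might look roughly like the sketch below. `Start-SeChrome` and `Enter-SeUrl` are cmdlets from the Selenium PowerShell module; the script's actual surrounding code may differ:

```powershell
# Make chromedriver discoverable, then launch Chrome via the Selenium module
$env:Path += ";$webDriverPath"
$driver = Start-SeChrome

# Navigate to the target site (line 123 in the script)
Enter-SeUrl -Driver $driver -Url 'https://www.google.com'

# Collect article links via the site-specific XPath, keeping unique hrefs
$articleLinks = $driver.FindElementsByXPath("//a[contains(@href, '/somethinghere/')]")
$articleUrls  = $articleLinks | ForEach-Object { $_.GetAttribute('href') } | Select-Object -Unique
```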
## How It Works

The script will:

1. Navigate to the specified website.
2. Accept the cookie consent prompt if required.
3. Scrape article URLs based on the XPath query.
4. Extract the text content from the articles.
5. Save the extracted text and URL into an Excel file.
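
The extraction step relies on HtmlAgilityPack. A minimal sketch, assuming the article body text lives in `<p>` elements (the real script's selector may differ):

```powershell
# Parse the rendered page source with HtmlAgilityPack
$doc = New-Object HtmlAgilityPack.HtmlDocument
$doc.LoadHtml($driver.PageSource)

# Gather paragraph text; SelectNodes returns $null when nothing matches
$nodes = $doc.DocumentNode.SelectNodes('//p')
$articleText = if ($nodes) { ($nodes | ForEach-Object { $_.InnerText.Trim() }) -join "`n" } else { '' }
```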
## Notes

- The script uses a retry mechanism when saving data to Excel in case the file is locked or in use.
- The script processes each article URL only once and skips previously scraped URLs, ensuring no duplication in your Excel file.
- You can modify the site-specific elements (URL, XPath) to adapt the script to other websites.
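
As an illustration of the first two notes, the dedup-and-retry logic might be sketched as follows. The `Url` and `Text` column names, retry count, and delay are assumptions, not necessarily the script's actual values:

```powershell
# Skip URLs already present in the workbook (assumes a 'Url' column)
$existingUrls = @()
if (Test-Path -Path $excelFilePath) {
    $existingUrls = (Import-Excel -Path $excelFilePath).Url
}

foreach ($url in $articleUrls | Where-Object { $existingUrls -notcontains $_ }) {
    # ...navigate to $url and build $articleText as sketched above...
    $row = [pscustomobject]@{ Url = $url; Text = $articleText }

    # Retry the export in case the workbook is open and locked in Excel
    for ($attempt = 1; $attempt -le 5; $attempt++) {
        try {
            $row | Export-Excel -Path $excelFilePath -Append -AutoSize
            break
        }
        catch {
            if ($attempt -eq 5) { throw }
            Start-Sleep -Seconds 2
        }
    }
}
```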