{"id":6514,"date":"2023-10-18T14:47:43","date_gmt":"2023-10-18T14:47:43","guid":{"rendered":"https:\/\/royadata.io\/blog\/?p=6514"},"modified":"2023-10-18T14:47:43","modified_gmt":"2023-10-18T14:47:43","slug":"how-to-build-a-web-crawler-using-selenium-proxies","status":"publish","type":"post","link":"http:\/\/royadata.io\/blog\/how-to-build-a-web-crawler-using-selenium-proxies\/","title":{"rendered":"Building a Web Crawler Using Selenium and Proxies"},"content":{"rendered":"<p><picture class=\"aligncenter size-full wp-image-378 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies.jpg.webp 1064w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies-300x150.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies-768x383.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies-1024x511.jpg.webp 1024w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201064%20531'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1064px) 100vw, 1064px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201064%20531'%3E%3C\/svg%3E\" alt=\"Web Crawler Using Selenium and Proxies\" width=\"1064\" height=\"531\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies.jpg 1064w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies-300x150.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies-768x383.jpg 768w, 
https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies-1024x511.jpg 1024w\" data-sizes=\"(max-width: 1064px) 100vw, 1064px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-378\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies.jpg.webp 1064w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies-300x150.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies-768x383.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies-1024x511.jpg.webp 1024w\" sizes=\"(max-width: 1064px) 100vw, 1064px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies.jpg\" alt=\"Web Crawler Using Selenium and Proxies\" width=\"1064\" height=\"531\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies.jpg 1064w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies-300x150.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies-768x383.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawler-Using-Selenium-and-Proxies-1024x511.jpg 1024w\" sizes=\"(max-width: 1064px) 100vw, 1064px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>Once upon a time, people looking for information had to physically walk into a brick-and-mortar library, find the right books, and read through them intently.<\/p>\n<p>Today, it seems a given that whatever data you\u2019re looking for exists on the Internet. 
There are over a billion websites on the World Wide Web at any given moment, containing enough information to take up about 305 billion printed sheets of paper.<\/p>\n<p>The good news is that no matter what kind of data you\u2019re looking for, you can be sure to find it online. The bad news is that there is so much data online that personally sifting through it borders on the physically impossible.<\/p>\n<p><picture class=\"aligncenter size-full wp-image-379 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/World-Wide-Web.jpg.webp 800w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/World-Wide-Web-300x185.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/World-Wide-Web-768x474.jpg.webp 768w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%20800%20494'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 800px) 100vw, 800px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%20800%20494'%3E%3C\/svg%3E\" alt=\"World Wide Web\" width=\"800\" height=\"494\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/World-Wide-Web.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/World-Wide-Web.jpg 800w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/World-Wide-Web-300x185.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/World-Wide-Web-768x474.jpg 768w\" data-sizes=\"(max-width: 800px) 100vw, 800px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-379\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/World-Wide-Web.jpg.webp 800w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/World-Wide-Web-300x185.jpg.webp 300w, 
https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/World-Wide-Web-768x474.jpg.webp 768w\" sizes=\"(max-width: 800px) 100vw, 800px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/World-Wide-Web.jpg\" alt=\"World Wide Web\" width=\"800\" height=\"494\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/World-Wide-Web.jpg 800w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/World-Wide-Web-300x185.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/World-Wide-Web-768x474.jpg 768w\" sizes=\"(max-width: 800px) 100vw, 800px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>Add in the fact that most websites have different scopes, formats, and frameworks. About <strong>30% of websites use <a href=\"https:\/\/venturebeat.com\/2018\/03\/05\/wordpress-now-powers-30-of-websites\/\"  rel=\"noopener noreferrer\">WordPress<\/a><\/strong>, for instance, and the rest use a variety of other platforms like Joomla, Drupal, Magento, etc.<\/p>\n<p>Enter web crawling. Web crawlers are automated data-gathering tools that interact with websites on their owners\u2019 behalf. This lets you access reams of data ready for output to a local database or spreadsheet for further analysis.<\/p>\n<p>Although it may sound complicated, the truth is that building a web crawler using <a href=\"https:\/\/selenium.dev\/\"  rel=\"noopener noreferrer\">Selenium<\/a> is a pretty straightforward process. 
Let\u2019s dive in and find out exactly what you need to get started.<\/p>\n<p><picture class=\"aligncenter size-full wp-image-376 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data.jpg.webp 1063w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data-300x150.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data-768x384.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data-1024x512.jpg.webp 1024w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201063%20531'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1063px) 100vw, 1063px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201063%20531'%3E%3C\/svg%3E\" alt=\"Two Ways to Crawl Web Data\" width=\"1063\" height=\"531\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data.jpg 1063w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data-300x150.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data-768x384.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data-1024x512.jpg 1024w\" data-sizes=\"(max-width: 1063px) 100vw, 1063px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-376\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data.jpg.webp 1063w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data-300x150.jpg.webp 300w, 
https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data-768x384.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data-1024x512.jpg.webp 1024w\" sizes=\"(max-width: 1063px) 100vw, 1063px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data.jpg\" alt=\"Two Ways to Crawl Web Data\" width=\"1063\" height=\"531\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data.jpg 1063w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data-300x150.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data-768x384.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Two-Ways-to-Crawl-Web-Data-1024x512.jpg 1024w\" sizes=\"(max-width: 1063px) 100vw, 1063px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<hr\/>\n<h2 id=\"there-are-two-ways-to-crawl-web-data\" class=\"ftwp-heading\" style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"There_are_Two_Ways_to_Crawl_Web_Data\"><\/span>There are Two Ways to Crawl Web Data<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>One of the first obstacles you\u2019ll encounter when learning how to build a web crawler using Selenium is the fact that websites don\u2019t seem to like it. Web crawlers generate a lot of traffic, and website administrators tend to feel like web crawlers abuse the server resources they make available to the public.<\/p>\n<p>But major Internet companies like Google crawl data all the time. The only difference is that they ask permission and offer something in return (in Google\u2019s case, placement on the world\u2019s number-one search engine). 
What do you do if you need access to data and don\u2019t have the convenient backing of a powerful economic incentive on your side?<\/p>\n<p>You can use Selenium to collect data from websites through a browser \u2013 just like a regular user would. But since web administrators don\u2019t like it, you\u2019ll need a <strong>proxy to hide your identity<\/strong> behind, so that they can\u2019t trace the activity back to you.<\/p>\n<p>Depending on your jurisdiction and the jurisdiction of the website you want to access, using a proxy could be a life-saver. In 2011, <a href=\"https:\/\/www.theglobeandmail.com\/report-on-business\/industry-news\/the-law-page\/why-reading-a-websites-fine-print-matters\/article595795\/\"><strong>a court in British Columbia<\/strong><\/a> ruled against a company for scraping content from a real estate website, although some more recent rulings have been more permissive about crawling publicly accessible content.<\/p>\n<p>Journalists, data analysts, and programmers generally don\u2019t have the resources Google brings to the table when it asks for web crawler access.<\/p>\n<hr\/>\n<h2 id=\"selenium-how-it-works-and-why-you-should-use-it\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Selenium_%E2%80%93_How_it_Works_and_Why_You_Should_Use_It\"><\/span>Selenium \u2013 How it Works and Why You Should Use It<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>There are lots of tools and platforms you can use to scrape web data, but most have limitations. For instance, the <a href=\"https:\/\/www.makeuseof.com\/tag\/build-basic-web-crawler-pull-information-website-2\/\"><strong>Python module Scrapy<\/strong><\/a> does not execute JavaScript on its own, so it struggles with websites that feature JavaScript-heavy user interfaces.<\/p>\n<p>Selenium is a simple tool for automating browsers. With Selenium, you can automate a web browser like Google Chrome or Safari, so that even pages that build their content with JavaScript can be crawled.<\/p>\n<p>The first step is downloading and setting up Selenium. 
You will need to download the browser driver that matches your browser and its version. For Google Chrome, for instance, this is called <strong><a href=\"https:\/\/sites.google.com\/a\/chromium.org\/chromedriver\/downloads\"  rel=\"noopener noreferrer\">ChromeDriver<\/a><\/strong>.<\/p>\n<p>When you extract the file (chromedriver.exe on Windows, for instance), make sure to remember where you put it, because you\u2019ll need it later.<\/p>\n<p>In order to use Selenium to build a web crawler, you\u2019ll also need the Selenium Java libraries. This requires a little bit of coding, but it\u2019s not that complicated. First, install <strong><a href=\"https:\/\/spring.io\/guides\/gs\/maven\/\"  rel=\"noopener noreferrer\">Maven<\/a><\/strong>, which is what you\u2019re going to use to build the Java program.<\/p>\n<p><picture class=\"aligncenter size-full wp-image-377 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program.jpg.webp 1063w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program-300x170.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program-768x435.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program-1024x580.jpg.webp 1024w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201063%20602'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1063px) 100vw, 1063px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201063%20602'%3E%3C\/svg%3E\" alt=\"Use to build the Java program\" width=\"1063\" height=\"602\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program.jpg 1063w, 
https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program-300x170.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program-768x435.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program-1024x580.jpg 1024w\" data-sizes=\"(max-width: 1063px) 100vw, 1063px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-377\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program.jpg.webp 1063w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program-300x170.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program-768x435.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program-1024x580.jpg.webp 1024w\" sizes=\"(max-width: 1063px) 100vw, 1063px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program.jpg\" alt=\"Use to build the Java program\" width=\"1063\" height=\"602\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program.jpg 1063w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program-300x170.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program-768x435.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Use-to-build-the-Java-program-1024x580.jpg 1024w\" sizes=\"(max-width: 1063px) 100vw, 1063px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>Once Maven is ready, add the Selenium Java dependency (group ID org.seleniumhq.selenium, artifact ID selenium-java) to your pom.xml.<\/p>\n<p>Now just run the build process and you\u2019re ready to take your first steps with Selenium.<\/p>\n<hr\/>\n<h2 id=\"basic-introduction-to-using-selenium\" class=\"ftwp-heading\" 
style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"Basic_Introduction_to_Using_Selenium\"><\/span><strong>Basic Introduction to Using Selenium<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><picture class=\"aligncenter size-full wp-image-373 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium.jpg.webp 1058w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium-768x441.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium-1024x587.jpg.webp 1024w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201058%20607'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1058px) 100vw, 1058px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201058%20607'%3E%3C\/svg%3E\" alt=\"Basic Introduction to Using Selenium\" width=\"1058\" height=\"607\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium.jpg 1058w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium-300x172.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium-768x441.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium-1024x587.jpg 1024w\" data-sizes=\"(max-width: 1058px) 100vw, 1058px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-373\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium.jpg.webp 1058w, 
https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium-768x441.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium-1024x587.jpg.webp 1024w\" sizes=\"(max-width: 1058px) 100vw, 1058px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium.jpg\" alt=\"Basic Introduction to Using Selenium\" width=\"1058\" height=\"607\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium.jpg 1058w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium-300x172.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium-768x441.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Basic-Introduction-to-Using-Selenium-1024x587.jpg 1024w\" sizes=\"(max-width: 1058px) 100vw, 1058px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>Let\u2019s start with something simple. First, create an instance of ChromeDriver (the class lives in the org.openqa.selenium.chrome package):<\/p>\n<blockquote>\n<p>WebDriver driver = new ChromeDriver();<\/p>\n<\/blockquote>\n<p>Now you\u2019ll have a Google Chrome window open. To navigate to a web page, use this command (using example.com as an example):<\/p>\n<blockquote>\n<p>driver.get(\"http:\/\/www.example.com\");<\/p>\n<\/blockquote>\n<p>To locate HTML elements on a page, use <strong>WebDriver.findElement()<\/strong>. To print the page title, call <strong>driver.getTitle()<\/strong> like this:<\/p>\n<blockquote>\n<p>System.out.println(\"Title: \" + driver.getTitle());<\/p>\n<\/blockquote>\n<p>This is how Selenium works. It exposes the browser through a programming interface so that you can automate the things you would normally do by hand. It\u2019s a simple and powerful way to complete a broad variety of time-intensive tasks. 
To close out the session, use this command:<\/p>\n<blockquote>\n<p>driver.quit();<\/p>\n<\/blockquote>\n<p>And that\u2019s it. You\u2019ve successfully controlled a browser session using Selenium from Java.<\/p>\n<p>To learn more about using Selenium as a web crawler, see this <strong><a href=\"https:\/\/github.com\/TheDancerCodes\/Selenium-Webscraping-Example\"  rel=\"noopener noreferrer\">GitHub tutorial<\/a><\/strong>. Once you know the commands and understand the methodology, the entire Internet is open to you.<\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/scrapy-vs-selenium-vs-beautifulsoup-for-web-scraping\/\">Scrapy Vs. Beautifulsoup Vs. Selenium for Web Scraping<\/a><\/li>\n<\/ul>\n<hr\/>\n<h2 id=\"proxies-what-to-look-for-when-building-a-web-crawler-using-selenium\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Proxies_%E2%80%93_What_to_Look_for_When_Building_a_Web_Crawler_Using_Selenium\"><\/span>Proxies \u2013 What to Look for When Building a Web Crawler Using Selenium<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>When using Selenium to scrape websites, the main thing you want to protect yourself against is blacklisting. 
Since web administrators generally treat Selenium-powered web crawlers as threats by default, you need to protect your web crawler.<\/p>\n<p>Nobody can guarantee that your web scraper will never get blacklisted, but choosing the right proxy can make a big difference and improve the life expectancy of your crawler.<\/p>\n<p><picture class=\"aligncenter size-full wp-image-374 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address.jpg.webp 1059w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address-300x154.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address-768x393.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address-1024x524.jpg.webp 1024w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201059%20542'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1059px) 100vw, 1059px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201059%20542'%3E%3C\/svg%3E\" alt=\"Block web crawlers based on the IP address\" width=\"1059\" height=\"542\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address.jpg 1059w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address-300x154.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address-768x393.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address-1024x524.jpg 1024w\" 
data-sizes=\"(max-width: 1059px) 100vw, 1059px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-374\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address.jpg.webp 1059w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address-300x154.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address-768x393.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address-1024x524.jpg.webp 1024w\" sizes=\"(max-width: 1059px) 100vw, 1059px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address.jpg\" alt=\"Block web crawlers based on the IP address\" width=\"1059\" height=\"542\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address.jpg 1059w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address-300x154.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address-768x393.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Block-web-crawlers-based-on-the-IP-address-1024x524.jpg 1024w\" sizes=\"(max-width: 1059px) 100vw, 1059px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>The majority of websites will <a href=\"https:\/\/royadata.io\/blog\/scrape-a-website-never-get-blacklisted\/\">block web crawlers<\/a> based on the IP address of the originating server or the user\u2019s hosting provider. 
Clever web administrators will use intelligent tools to determine the pattern of a certain <a href=\"https:\/\/royadata.io\/blog\/proxy-pool\/\">pool of IP addresses<\/a> and then block the whole bunch.<\/p>\n<p>What you need is a proxy that can shift between multiple IP addresses. Don\u2019t settle for a simple solution, either:<\/p>\n<ul>\n<li>Some experts recommend using between 50 and 100 distinct IP addresses to be sure you have a large enough pool.<\/li>\n<li>Make sure you don\u2019t get consecutive IP addresses (1.2.3.4 to 1.2.3.5 to 1.2.3.6, for example).<\/li>\n<\/ul>\n<p>You need randomized IP addresses with no logical correlation between them.<\/p>\n<p>The important thing is that Selenium, by its nature, is powerfully customizable. Your imagination and coding skills are the only limit to your ability to build a web crawler using Selenium.<\/p>\n<p>For instance, if you are using the Requests <strong>library (more information <a href=\"https:\/\/pypi.org\/project\/selenium-requests\/\"  rel=\"noopener noreferrer\">here<\/a>)<\/strong>, then you can write Python code to use proxy IPs with Selenium like so:<\/p>\n<blockquote>\n<p>import requests<br \/>\nfrom selenium import webdriver<br \/>\n# headers, url and proxy_url are assumed to be defined elsewhere in your script<br \/>\nr = requests.get('https:\/\/example.com', headers=headers, proxies={'https': proxy_url})<br \/>\n# pick a proxy (host:port) from the pool and strip the trailing newline<br \/>\nproxy = get_random_proxy().replace('\\n', '')<br \/>\nservice_args = [<br \/>\n'--proxy={0}'.format(proxy),<br \/>\n'--proxy-type=http',<br \/>\n'--proxy-auth=user:password'<br \/>\n]<br \/>\nprint('Processing..' + url)<br \/>\n# PhantomJS support was removed in Selenium 4, so this line needs Selenium 3.x<br \/>\ndriver = webdriver.PhantomJS(service_args=service_args)<\/p>\n<\/blockquote>\n<p>Where <strong>example.com<\/strong> is the website you would like to access, <strong>get_random_proxy<\/strong> is a function you supply that returns a random proxy from within your pool, and user:password stands in for your proxy credentials.<\/p>\n<p><picture class=\"aligncenter size-full wp-image-375 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" 
data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver.jpg.webp 1064w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver-300x136.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver-768x349.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver-1024x466.jpg.webp 1024w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201064%20484'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1064px) 100vw, 1064px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201064%20484'%3E%3C\/svg%3E\" alt=\"Import of Selenium Web Driver\" width=\"1064\" height=\"484\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver.jpg 1064w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver-300x136.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver-768x349.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver-1024x466.jpg 1024w\" data-sizes=\"(max-width: 1064px) 100vw, 1064px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-375\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver.jpg.webp 1064w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver-300x136.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver-768x349.jpg.webp 768w, 
https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver-1024x466.jpg.webp 1024w\" sizes=\"(max-width: 1064px) 100vw, 1064px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver.jpg\" alt=\"Import of Selenium Web Driver\" width=\"1064\" height=\"484\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver.jpg 1064w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver-300x136.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver-768x349.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Import-of-Selenium-Web-Driver-1024x466.jpg 1024w\" sizes=\"(max-width: 1064px) 100vw, 1064px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/selenium-proxy\/\">Selenium Proxy Setting \u2013 How to Setup Proxies on Selenium<\/a><\/li>\n<\/ul>\n<p>But this is just the beginning of integrating proxies with your Selenium web crawler. There\u2019s much more you can do:<\/p>\n<ul>\n<li>You can program Selenium to implement a system that sets the frequency of an IP address visiting a target website per day or per hour and then disables that IP address for 24 hours once it reaches its limit.<\/li>\n<li>You can set Selenium to record the IP addresses that get blacklisted. This lets you streamline the process of requesting new IP addresses because you only need to replace the ones that are blocked.<\/li>\n<li>You can increase Selenium\u2019s page-load waiting time to adjust for timeouts. If you are overtaxing the target server and using proxies, you may need to adjust page-load wait times to make Selenium more patient. 
Investing in a higher-quality proxy can also improve response times.<\/li>\n<\/ul>\n<p>There are now <a href=\"https:\/\/www.privateproxyreviews.com\/rotating-proxies\/\"><strong>lots of rotating proxy services on the market<\/strong><\/a>. These services work as \u201cbackconnect\u201d proxies: they offer a proxy API that rotates the IP addresses automatically, so using one of them can save a lot of time on proxy setup.<\/p>\n<p>Related: <a href=\"https:\/\/royadata.io\/blog\/how-backconnect-proxies-work\/\">A beginner guide to Backconnect Proxy: How Backconnect Proxies Work?<\/a><\/p>\n<p>With a powerful tool like Selenium supported by top-shelf proxies that you can rely on, you will be able to gather data from across the Internet with far less risk of being blocked.<\/p>\n<p>Enjoy and happy crawling!<\/p>\n<hr\/>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/rotating-proxies-api-with-curl\/\">Rotate your proxies with CURL for data mining<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/use-chrome-headless-and-dedicated-proxies-to-scrape-any-website\/\">Use Chrome Headless and Dedicated Proxies to Scrape Any Website<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/scrape-a-website-never-get-blacklisted\/\">How to Scrape a Website and Never Get Blocked<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/how-to-prevent-proxy-banned\/\">How to Avoid Proxies Get banned or blocked<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Once upon a time, people looking for information had to physically walk into a brick-and-mortar library, find the right books, and read through them intently. Today, it seems a given that whatever data you\u2019re looking for exists on the Internet. 
There are over a billion websites on the World Wide Web at any given moment, &#8230; <a title=\"Building a Web Crawler Using Selenium and Proxies\" class=\"read-more\" href=\"http:\/\/royadata.io\/blog\/how-to-build-a-web-crawler-using-selenium-proxies\/\" aria-label=\"More on Building a Web Crawler Using Selenium and Proxies\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":690,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/posts\/6514"}],"collection":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/comments?post=6514"}],"version-history":[{"count":0,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/posts\/6514\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/media\/690"}],"wp:attachment":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/media?parent=6514"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/categories?post=6514"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/tags?post=6514"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}