{"id":6223,"date":"2023-10-18T14:47:43","date_gmt":"2023-10-18T14:47:43","guid":{"rendered":"https:\/\/royadata.io\/blog\/?p=6223"},"modified":"2023-10-18T14:47:43","modified_gmt":"2023-10-18T14:47:43","slug":"how-to-build-a-web-crawler","status":"publish","type":"post","link":"http:\/\/royadata.io\/blog\/how-to-build-a-web-crawler\/","title":{"rendered":"How to Build a Web Crawler with Python? (2023 Edition)"},"content":{"rendered":"<blockquote>\n<p>Do you want to learn how to build a web crawler from scratch? Join me as I show you how to build a web crawler using Python as the language of choice for the tutorial.<\/p>\n<\/blockquote>\n<p><picture class=\"aligncenter size-full wp-image-7226 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Build-a-Web-Crawler.jpg.webp 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Build-a-Web-Crawler-300x167.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Build-a-Web-Crawler-768x426.jpg.webp 768w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201000%20555'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201000%20555'%3E%3C\/svg%3E\" alt=\"Build a Web Crawler\" width=\"1000\" height=\"555\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Build-a-Web-Crawler.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Build-a-Web-Crawler.jpg 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Build-a-Web-Crawler-300x167.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Build-a-Web-Crawler-768x426.jpg 768w\" data-sizes=\"(max-width: 1000px) 100vw, 1000px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full 
wp-image-7226\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Build-a-Web-Crawler.jpg.webp 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Build-a-Web-Crawler-300x167.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Build-a-Web-Crawler-768x426.jpg.webp 768w\" sizes=\"(max-width: 1000px) 100vw, 1000px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Build-a-Web-Crawler.jpg\" alt=\"Build a Web Crawler\" width=\"1000\" height=\"555\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Build-a-Web-Crawler.jpg 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Build-a-Web-Crawler-300x167.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Build-a-Web-Crawler-768x426.jpg 768w\" sizes=\"(max-width: 1000px) 100vw, 1000px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>Have you ever wondered how the Internet would be without <a href=\"https:\/\/moz.com\/beginners-guide-to-seo\/how-search-engines-operate\"  rel=\"noopener noreferrer\">search engines<\/a>? 
Well, what if I told you that web crawlers are part of the secret behind what search engines have become today?<\/p>\n<p>They have proven to be incredibly important, not only in general web search but also in academic research, lead generation, and even Search Engine Optimization (SEO).<\/p>\n<p>Any project that intends to <a href=\"https:\/\/royadata.io\/blog\/how-to-extract-data-from-a-website\/\">extract data<\/a> from many pages on a website or the full Internet without a prior list of links will most likely make use of web crawlers to achieve that.<\/p>\n<p><picture class=\"aligncenter size-full wp-image-7232 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/web-crawler-with-python.jpg.webp 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/web-crawler-with-python-300x135.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/web-crawler-with-python-768x346.jpg.webp 768w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201000%20451'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201000%20451'%3E%3C\/svg%3E\" alt=\"web crawler with python\" width=\"1000\" height=\"451\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/web-crawler-with-python.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/web-crawler-with-python.jpg 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/web-crawler-with-python-300x135.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/web-crawler-with-python-768x346.jpg 768w\" data-sizes=\"(max-width: 1000px) 100vw, 1000px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter 
size-full wp-image-7232\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/web-crawler-with-python.jpg.webp 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/web-crawler-with-python-300x135.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/web-crawler-with-python-768x346.jpg.webp 768w\" sizes=\"(max-width: 1000px) 100vw, 1000px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/web-crawler-with-python.jpg\" alt=\"web crawler with python\" width=\"1000\" height=\"451\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/web-crawler-with-python.jpg 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/web-crawler-with-python-300x135.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/web-crawler-with-python-768x346.jpg 768w\" sizes=\"(max-width: 1000px) 100vw, 1000px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>If you are interested in developing <a href=\"https:\/\/royadata.io\/blog\/web-crawler\/\">a web crawler<\/a> for a project, then you need to know that the basics of a web crawler are easy, and everyone can design and develop one. However, depending on the complexity and size of your project, the befitting crawler could be difficult to build and maintain. In this article, you will learn how to build web crawlers yourself. 
Before going into the tutorial proper, let's take a look at what a web crawler actually is.<\/p>\n<hr\/>\n<h2 id=\"what-are-web-crawlers\" class=\"ftwp-heading\" style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"What_are_Web_Crawlers\"><\/span><strong>What are Web Crawlers?<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p><picture class=\"aligncenter size-full wp-image-7221 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/overview-of-web-crawlers.png.webp 1000w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201000%20586'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201000%20586'%3E%3C\/svg%3E\" alt=\"overview of web crawlers\" width=\"1000\" height=\"586\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/overview-of-web-crawlers.png\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/overview-of-web-crawlers.png 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/overview-of-web-crawlers-300x176.png 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/overview-of-web-crawlers-768x450.png 768w\" data-sizes=\"(max-width: 1000px) 100vw, 1000px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-7221\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/overview-of-web-crawlers.png.webp 1000w\" sizes=\"(max-width: 1000px) 100vw, 1000px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/overview-of-web-crawlers.png\" alt=\"overview of web crawlers\" width=\"1000\" height=\"586\" 
srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/overview-of-web-crawlers.png 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/overview-of-web-crawlers-300x176.png 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/overview-of-web-crawlers-768x450.png 768w\" sizes=\"(max-width: 1000px) 100vw, 1000px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>The terms web crawlers and web scrapers are used interchangeably, and many think they mean the same thing. While they loosely mean the same thing, if you go deep, you will discover web scraping and web crawling are not the same things \u2013 and you can even see that from the way web crawlers and web scrapers are designed.<\/p>\n<p>Web crawlers, also known as web spiders, spiderbots, or simply crawlers, are web bots that have been developed to systematically visit webpages on the World Wide Web for the purpose of web indexing and collecting other data from the pages they visit.<\/p>\n<div class=\"su-youtube su-u-responsive-media-yes\">\n<div class=\"perfmatters-lazy-youtube\" data-src=\"https:\/\/www.youtube.com\/embed\/TLosoD249NA\" data-id=\"TLosoD249NA\" data-query onclick=\"if (!window.__cfRLUnblockHandlers) return false; perfmattersLazyLoadYouTube(this);\" data-cf-modified-1e2aa3e5c7af2d11d72df36e->\n<div><img loading=\"lazy\" decoding=\"async\" class=\"perfmatters-lazy\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%20480%20360%3E%3C\/svg%3E\" data-src=\"https:\/\/i.ytimg.com\/vi\/TLosoD249NA\/hqdefault.jpg\" alt=\"YouTube video\" width=\"480\" height=\"360\" data-pin-nopin=\"true\"><\/p>\n<div class=\"play\"><\/div>\n<\/div>\n<\/div>\n<p><noscript><iframe loading=\"lazy\" width=\"600\" height=\"400\" src=\"https:\/\/www.youtube.com\/embed\/TLosoD249NA?\" frameborder=\"0\" allowfullscreen allow=\"autoplay; encrypted-media; picture-in-picture\" title=\"\"><\/iframe><\/noscript><\/div>\n<h3 
id=\"how-do-they-differ-from-web-scrapers\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"How_Do_They_Differ_from_Web_Scrapers\"><\/span><strong>How Do They Differ from Web Scrapers?<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>From the above, you can tell that they are different from web scrapers, even though both are bots for web data extraction. You can see <a href=\"https:\/\/royadata.io\/blog\/web-scraping-tools\/\">web scrapers<\/a> as more streamlined and specialized workers designed for extracting specific data from a defined list of web pages, such as <a href=\"https:\/\/royadata.io\/blog\/yelp-scraper\/\">Yelp reviews<\/a>, <a href=\"https:\/\/royadata.io\/blog\/instagram-scraper\/\">Instagram posts<\/a>, <a href=\"https:\/\/royadata.io\/blog\/amazon-scraper\/\">Amazon price data<\/a>, <a href=\"https:\/\/royadata.io\/blog\/shopify-scrapers\/\">Shopify product data<\/a>, and so on. Web scrapers are fed a list of URLs, and they visit those URLs and scrape the required data.<\/p>\n<p>This is not the case for web crawlers: they are fed a list of seed URLs, and from this list, the crawler is meant to discover other URLs to crawl by itself, following some set of rules. The reason why marketers use the terms interchangeably is that in the process of <a href=\"https:\/\/royadata.io\/blog\/crawling-vs-scraping\/\">web crawling, web scraping<\/a> is involved \u2013 and some web scrapers incorporate aspects of web crawling.<\/p>\n<hr\/>\n<h2 id=\"how-does-web-crawlers-work\" class=\"ftwp-heading\" style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"How_Does_Web_Crawlers_Work\"><\/span><strong>How Do Web Crawlers Work?<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Depending on its complexity and use case, a web crawler could work in the basic way described below or with some modifications to its working mechanism. 
At its most basic level, you can see web crawlers as web browsers that browse pages on the Internet, collecting information.<\/p>\n<p>The working mechanism for web crawlers is simple. For a web crawler to work, you will have to provide it with a list of URLs \u2013 these URLs are known as seed URLs. These seed URLs are added to a list of URLs to be visited. The crawler then goes through the list of URLs to be visited and visits them one after the other.<\/p>\n<p><picture class=\"aligncenter wp-image-7229 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism.jpg.webp 1067w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism-300x159.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism-1024x544.jpg.webp 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism-768x408.jpg.webp 768w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201000%20531'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201000%20531'%3E%3C\/svg%3E\" alt=\"Web Crawlers working mechanism\" width=\"1000\" height=\"531\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism.jpg 1067w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism-300x159.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism-1024x544.jpg 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism-768x408.jpg 768w\" 
data-sizes=\"(max-width: 1000px) 100vw, 1000px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter wp-image-7229\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism.jpg.webp 1067w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism-300x159.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism-1024x544.jpg.webp 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism-768x408.jpg.webp 768w\" sizes=\"(max-width: 1000px) 100vw, 1000px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism.jpg\" alt=\"Web Crawlers working mechanism\" width=\"1000\" height=\"531\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism.jpg 1067w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism-300x159.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism-1024x544.jpg 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Crawlers-working-mechanism-768x408.jpg 768w\" sizes=\"(max-width: 1000px) 100vw, 1000px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>For each URL the crawler visits, it extracts all the hyperlinks on the page and adds them to the list of URLs to be visited. 
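The visit-extract-append loop just described can be sketched in a few lines of Python. This is only an illustration, not the tutorial's crawler: the link graph is a hard-coded dictionary (SITE) standing in for real HTTP fetches, and all names here are made up for the sketch.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links on its page.
# A real crawler would fetch and parse each page over HTTP instead.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}

def crawl(seed):
    to_visit = deque([seed])   # the frontier: URLs waiting to be visited
    visited = set()            # URLs that have already been crawled
    order = []
    while to_visit:
        url = to_visit.popleft()
        if url in visited:
            continue           # skip URLs crawled earlier
        visited.add(url)
        order.append(url)
        for link in SITE.get(url, []):
            if link not in visited:
                to_visit.append(link)
    return order

print(crawl("/"))  # ['/', '/a', '/b', '/c']
```

Starting from the seed, every newly discovered link is appended to the frontier, and the visited set prevents the crawler from looping forever on pages that link back to each other.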
Aside from collecting hyperlinks in order to cover the breadth of a site (or of the web at large, in the case of crawlers not designed for one specific website), web crawlers also collect other information.<\/p>\n<p>Take, for instance, <a href=\"https:\/\/developers.google.com\/search\/docs\/advanced\/crawling\/overview-google-crawlers\"  rel=\"noopener noreferrer\">Googlebot<\/a>, the most popular web crawler on the Internet: aside from link data, it also indexes the content of a page to make it easier to search. On the other hand, a web archive takes a snapshot of the pages it visits \u2013 other crawlers extract the data they are interested in. Aside from a list of URLs to be visited, the crawler also keeps a list of URLs that have already been visited, to avoid adding already-crawled URLs back into the list of pages to be crawled.<\/p>\n<div class=\"perfmatters-lazy-youtube\" data-src=\"https:\/\/www.youtube.com\/embed\/Ey0N1Ry0BPM\" data-id=\"Ey0N1Ry0BPM\" data-query=\"feature=oembed\" onclick=\"if (!window.__cfRLUnblockHandlers) return false; perfmattersLazyLoadYouTube(this);\" data-cf-modified-1e2aa3e5c7af2d11d72df36e->\n<div><img loading=\"lazy\" decoding=\"async\" class=\"perfmatters-lazy\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%20480%20360%3E%3C\/svg%3E\" data-src=\"https:\/\/i.ytimg.com\/vi\/Ey0N1Ry0BPM\/hqdefault.jpg\" alt=\"YouTube video\" width=\"480\" height=\"360\" data-pin-nopin=\"true\"><\/p>\n<div class=\"play\"><\/div>\n<\/div>\n<\/div>\n<p><noscript><iframe loading=\"lazy\" title=\"Google Search and JavaScript Sites (Google I\/O'19)\" width=\"1050\" height=\"591\" src=\"https:\/\/www.youtube.com\/embed\/Ey0N1Ry0BPM?feature=oembed\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen><\/iframe><\/noscript><\/p>\n<p>There are a good number of considerations you will have to look into, including a crawling policy that sets the 
rules for URLs to be visited, a re-visit policy that dictates when to look out for a change on a web page, a politeness policy that determines whether you should respect the <a href=\"https:\/\/developers.google.com\/search\/docs\/advanced\/robots\/create-robots-txt\"  rel=\"noopener noreferrer\">robots.txt rules<\/a> or not, and lastly, a parallelization policy for coordinating a distributed web crawling exercise, among others.<\/p>\n<hr\/>\n<h1 id=\"developing-a-web-crawler-with-python\" class=\"ftwp-heading\" style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"Developing_a_Web_Crawler_with_Python\"><\/span><strong>Developing a Web Crawler with Python<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h1>\n<p>From the above, we expect you to have an idea of what web crawlers are. It is now time to move on to learning how to develop one yourself. Web crawlers are computer programs written in any of the general-purpose programming languages out there.<\/p>\n<p>You can code a web crawler using Java, C#, PHP, Python, and even JavaScript. 
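As a concrete example of the politeness policy mentioned above, Python's standard library ships a robots.txt parser, urllib.robotparser. The rules and URLs below are made up for illustration; a real crawler would download the site's actual robots.txt first.

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt body; a real crawler would fetch the live
# file from the target site's /robots.txt instead of hard-coding it.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# A polite crawler consults the parser before fetching each URL.
print(parser.can_fetch("MyCrawler", "https://example.com/blog/"))      # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/x"))  # False
```

Calling can_fetch before each request is all it takes to skip the URLs a site has asked crawlers to stay away from.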
This means that the number one prerequisite of developing a web crawler is being able to code in any of the general-purpose programming languages.<\/p>\n<pre>Related: <a href=\"https:\/\/royadata.io\/blog\/web-scraping-javascript-tutorials\/\">How to scrape HTML from a website Using Javascript?<\/a><\/pre>\n<h2 style=\"text-align: center;\"><strong><picture class=\"aligncenter size-full wp-image-7227 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Developing-a-Web-Crawler.jpg.webp 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Developing-a-Web-Crawler-300x162.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Developing-a-Web-Crawler-768x414.jpg.webp 768w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201000%20539'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1000px) 100vw, 1000px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201000%20539'%3E%3C\/svg%3E\" alt=\"Developing a Web Crawler\" width=\"1000\" height=\"539\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Developing-a-Web-Crawler.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Developing-a-Web-Crawler.jpg 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Developing-a-Web-Crawler-300x162.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Developing-a-Web-Crawler-768x414.jpg 768w\" data-sizes=\"(max-width: 1000px) 100vw, 1000px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-7227\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Developing-a-Web-Crawler.jpg.webp 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Developing-a-Web-Crawler-300x162.jpg.webp 300w, 
https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Developing-a-Web-Crawler-768x414.jpg.webp 768w\" sizes=\"(max-width: 1000px) 100vw, 1000px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Developing-a-Web-Crawler.jpg\" alt=\"Developing a Web Crawler\" width=\"1000\" height=\"539\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Developing-a-Web-Crawler.jpg 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Developing-a-Web-Crawler-300x162.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Developing-a-Web-Crawler-768x414.jpg 768w\" sizes=\"(max-width: 1000px) 100vw, 1000px\"\/>\n<\/picture>\n<\/noscript><\/strong><\/h2>\n<p>In this article, we are going to be making use of Python because of its simplicity, ease of use, beginner-friendliness, and extensive library support. Even if you are not a Python programmer, you can take a crash course in Python programming in order to understand what will be discussed, as all of the code will be written in Python.<\/p>\n<hr\/>\n<h2 id=\"project-idea-page-title-extractor\" class=\"ftwp-heading\" style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"Project_Idea_Page_Title_Extractor\"><\/span><strong>Project Idea: Page Title Extractor <\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The project we will be building is a very easy one, what can be called a proof of concept. 
The crawler we will be developing will accept a seed URL and visit all pages on the website, outputting the links and titles to the screen.<\/p>\n<p>We won\u2019t be respecting robots.txt files, using proxies, adding <a href=\"https:\/\/www.toptal.com\/python\/beginners-guide-to-concurrency-and-parallelism-in-python\"  rel=\"noopener noreferrer\">multithreading<\/a>, or handling any other complexities \u2013 we are making it easy for you to follow and understand.<\/p>\n<hr\/>\n<h2 id=\"requirements-for-the-project\" class=\"ftwp-heading\" style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"Requirements_for_the_Project\"><\/span><strong>Requirements for the Project<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Earlier, I stated that Python has an extensive library of tools for web crawling. The most important of them all is Scrapy, a web crawling framework that makes it easy to develop web crawlers in fewer lines of code. However, we won\u2019t be using Scrapy, as it hides some details; let\u2019s make use of the Requests and BeautifulSoup combination for the development.<\/p>\n<div class=\"su-list\" style=\"margin-left:0px\">\n<ul>\n<li><i class=\"sui sui-check\" style=\"color:#3330b1\"><\/i> <strong><a href=\"https:\/\/www.python.org\/downloads\/\"  rel=\"noopener noreferrer\">Python<\/a>: <\/strong>While many Operating Systems come with Python preinstalled, the version installed is usually old, and as such, you will need to install a recent version of Python. You can visit the official download page to download an updated version of the Python programming language.<\/li>\n<li><i class=\"sui sui-check\" style=\"color:#3330b1\"><\/i> <strong><a href=\"https:\/\/pypi.org\/project\/requests\/\"  rel=\"noopener noreferrer\">Requests<\/a>:<\/strong> Dubbed HTTP for Humans, Requests is the best third-party library for sending HTTP requests to web servers. It is very simple and easy to use. 
Under the hood, this library makes use of the urllib package but abstracts it and provides better APIs for handling HTTP requests and responses. This is a third-party library, and as such, you will need to install it. You can use the command\n<pre>pip install requests<\/pre>\n<p>to install it.<\/li>\n<li><i class=\"sui sui-check\" style=\"color:#3330b1\"><\/i> <strong><a href=\"https:\/\/www.crummy.com\/software\/BeautifulSoup\/bs4\/doc\/\"  rel=\"noopener noreferrer\">BeautifulSoup<\/a>:<\/strong> While the Requests library is for sending HTTP requests, BeautifulSoup is for parsing HTML and <a href=\"https:\/\/royadata.io\/blog\/lxml-tutorials\/\">XML<\/a> documents. With BeautifulSoup, you do not have to deal with regular expressions or the standard HTML parser, which are not easy to use and are prone to errors if you are not skilled in their usage. BeautifulSoup makes it easy for you to traverse HTML documents and parse out the required data. This tool is also a third-party library and is not included in the standard Python distribution. You can download it using the pip command:\n<pre>pip install beautifulsoup4<\/pre><\/div>\n<\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/scrapy-vs-selenium-vs-beautifulsoup-for-web-scraping\/\">Scrapy Vs. Beautifulsoup Vs. Selenium for Web Scraping<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/data-parsing\/\">What is Data Parsing and Parsing Techniques involved?<\/a><\/li>\n<\/ul>\n<hr\/>\n<h2 id=\"steps-for-coding-the-page-title-extractor-project\" class=\"ftwp-heading\" style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"Steps_for_Coding_the_Page_Title_Extractor_Project\"><\/span><strong>Steps for Coding the Page Title Extractor Project<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>As stated earlier, the process of developing a web crawler can be complex, but the crawler we are developing in this tutorial is very easy. 
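Once both packages are installed, you can verify that BeautifulSoup works by parsing a small hard-coded HTML snippet before touching the network. The snippet and variable names below are only for this sanity check.

```python
from bs4 import BeautifulSoup

# A tiny hard-coded page, so the check needs no network access.
html = "<html><head><title>Test Page</title></head><body><a href='/about'>About</a></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.find("title").text)                      # Test Page
print([a.get("href") for a in soup.find_all("a")])  # ['/about']
```

These two calls, find for the title and find_all for the anchors, are exactly the operations the crawler below relies on.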
In fact, if you already know <a href=\"https:\/\/royadata.io\/blog\/python-web-scraper-tutorial\/\">how to scrape data from web pages<\/a>, there is a high chance that you already know how to develop a simple web crawler. The Page Title Extractor project will be contained in only one module. You can create a new Python file and name it<\/p>\n<pre>title_extractor.py<\/pre>\n<p>. The module will have a class named TitleCrawler with 2 methods. The two methods are<\/p>\n<pre>crawl<\/pre>\n<p>for defining the main crawling logic and<\/p>\n<pre>start<\/pre>\n<p>for giving the crawl method directives on the URLs to crawl.<\/p>\n<div class=\"su-list\" style=\"margin-left:0px\">\n<ul>\n<li><i class=\"sui sui-hand-o-right\" style=\"color:#3330b1\"><\/i><br \/>\n<h3 id=\"import-the-necessary-libraries\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Import_the_Necessary_Libraries\"><\/span><strong>Import the Necessary Libraries<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<p>Let\u2019s start by importing the required libraries for the project. We require Requests, BeautifulSoup, and <a href=\"https:\/\/docs.python.org\/3\/library\/urllib.parse.html\"  rel=\"noopener noreferrer\">urlparse<\/a>. Requests is for sending web requests, and BeautifulSoup is for parsing titles and URLs out of the web pages downloaded by Requests. The urlparse function is bundled inside the standard Python library and is used for parsing URLs.<\/p>\n<pre>from urllib.parse import urlparse\nimport requests\nfrom bs4 import BeautifulSoup<\/pre>\n<ul>\n<li><i class=\"sui sui-hand-o-right\" style=\"color:#3330b1\"><\/i><br \/>\n<h3 id=\"web-crawler-class-definition\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Web_Crawler_Class_Definition\"><\/span><strong>Web Crawler Class Definition <\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<p>After importing the required libraries, let\u2019s create a new class named TitleCrawler. 
This will be the crawler class.<\/p>\n<pre>class TitleCrawler:\n    \"\"\"\n    Crawler class accepts a URL as argument.\n    This seed url will be the url from which other urls will be discovered.\n    \"\"\"\n    def __init__(self, start_url):\n        self.urls_to_be_visited = []\n        self.urls_to_be_visited.append(start_url)\n        self.visited = []\n        self.domain = \"https:\/\/\" + urlparse(start_url).netloc<\/pre>\n<p>From the above, you can see the initialization function \u2013 it accepts a URL as an argument. There are 3 variables \u2013 the<\/p>\n<pre>urls_to_be_visited<\/pre>\n<p>variable is for keeping a list of URLs to visit, the<\/p>\n<pre>visited<\/pre>\n<p>variable is for keeping a list of visited URLs to avoid crawling a URL more than once, and the<\/p>\n<pre>domain<\/pre>\n<p>variable is for the domain of the site you are scraping from. You will need it so that only links from that domain are crawled.<\/p>\n<ul>\n<li><i class=\"sui sui-hand-o-right\" style=\"color:#3330b1\"><\/i><br \/>\n<h3 id=\"start-method-coding\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Start_Method_Coding\"><\/span><strong>Start Method Coding <\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<pre>def start(self):\n    for url in self.urls_to_be_visited:\n        self.crawl(url)\n\n\nx = TitleCrawler(\"https:\/\/cop.guru\/\")\nx.start()<\/pre>\n<p>The start method above belongs to the TitleCrawler class. You can see a for loop that iterates through the urls_to_be_visited list and passes each URL to the crawl method. The crawl method is also a method of the TitleCrawler class. The x variable is for creating an instance of the TitleCrawler class, and calling the start method gets the crawler to start crawling. From the code snippets above, nothing has actually been done yet. The main work is done in the crawl method. 
Below is the code for the crawl method.<\/p>\n<ul>\n<li><i class=\"sui sui-hand-o-right\" style=\"color:#3330b1\"><\/i><br \/>\n<h3 id=\"crawl-method-coding\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Crawl_Method_Coding\"><\/span><strong>Crawl Method Coding <\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<pre>def crawl(self, link):\n    page_content = requests.get(link).text\n    soup = BeautifulSoup(page_content, \"html.parser\")\n    title = soup.find(\"title\")\n    print(\"PAGE BEING CRAWLED: \" + title.text + \"|\" + link)\n    self.visited.append(link)\n    urls = soup.find_all(\"a\")\n    for url in urls:\n        url = url.get(\"href\")\n        if url is not None:\n            if url.startswith(self.domain):\n                if url not in self.visited:\n                    self.urls_to_be_visited.append(url)\n    print(\"Number of Crawled pages:\" + str(len(self.visited)))\n    print(\"Number of Links to be crawled:\" + str(len(self.urls_to_be_visited)))\n    print(\"::::::::::::::::::::::::::::::::::::::\")<\/pre>\n<p>The URL to be crawled is passed into the crawl method by the start method, which iterates through the urls_to_be_visited list variable. The first line in the code above sends a request to the URL and returns the content of the page.<\/p>\n<p>Using BeautifulSoup, the title of the page and the URLs present on the page are scraped. The web crawler is meant to crawl only URLs of the target website, and as such, URLs from external sources are not considered \u2013 you can see that from the second if statement. 
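One consequence of that second if statement is that relative links such as /sneaker-bots/ are silently dropped, because they do not start with the domain. A possible refinement (not part of the tutorial's code) is to normalize every href with urljoin before filtering; the URLs below are illustrative.

```python
from urllib.parse import urljoin, urlparse

base = "https://cop.guru/aio-bots/"
domain = urlparse(base).netloc

# Relative hrefs become absolute; off-site links are filtered out.
for href in ["/sneaker-bots/", "https://cop.guru/adidas-bots/", "https://other.example/"]:
    absolute = urljoin(base, href)
    if urlparse(absolute).netloc == domain:
        print(absolute)
```

With this normalization, the domain check is done on the netloc of the absolute URL, so both relative and absolute internal links are kept while external ones are still rejected.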
For a URL to be added to the list of URLs to be visited, it must be a valid URL that has not been visited before.<\/p>\n<ul>\n<li><i class=\"sui sui-hand-o-right\" style=\"color:#3330b1\"><\/i><br \/>\n<h3 id=\"full-code\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Full_Code\"><\/span><strong>Full Code<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<pre>from urllib.parse import urlparse\nimport requests\nfrom bs4 import BeautifulSoup\n\n\nclass TitleCrawler:\n    \"\"\"\n    Crawler class accepts a URL as argument.\n    This seed url will be the url from which other urls will be discovered.\n    \"\"\"\n    def __init__(self, start_url):\n        self.urls_to_be_visited = []\n        self.urls_to_be_visited.append(start_url)\n        self.visited = []\n        self.domain = \"https:\/\/\" + urlparse(start_url).netloc\n\n    def crawl(self, link):\n        page_content = requests.get(link).text\n        soup = BeautifulSoup(page_content, \"html.parser\")\n        title = soup.find(\"title\")\n        print(\"PAGE BEING CRAWLED: \" + title.text + \"|\" + link)\n        self.visited.append(link)\n        urls = soup.find_all(\"a\")\n        for url in urls:\n            url = url.get(\"href\")\n            if url is not None:\n                if url.startswith(self.domain):\n                    if url not in self.visited:\n                        self.urls_to_be_visited.append(url)\n        print(\"Number of Crawled pages:\" + str(len(self.visited)))\n        print(\"Number of Links to be crawled:\" + str(len(self.urls_to_be_visited)))\n        print(\"::::::::::::::::::::::::::::::::::::::\")\n\n    def start(self):\n        for url in self.urls_to_be_visited:\n            self.crawl(url)\n\n\nx = TitleCrawler(\"https:\/\/cop.guru\/\")\nx.start()<\/pre>\n<p>You can change the seed URL to any other URL. In the code above, we use <a href=\"https:\/\/cop.guru\/\"  rel=\"noopener noreferrer\">https:\/\/cop.guru\/<\/a>. 
If you run the code above, you will get something like the result below.<\/p>\n<pre>PAGE BEING CRAWLED: Sneaker Bots \u2022 Cop Guru|https:\/\/cop.guru\/sneaker-bots\/\nNumber of Crawled pages:4\nNumber of Links to be crawled:1535\n::::::::::::::::::::::::::::::::::::::\nPAGE BEING CRAWLED: All in One Bots \u2022 Cop Guru|https:\/\/cop.guru\/aio-bots\/\nNumber of Crawled pages:5\nNumber of Links to be crawled:1666\n::::::::::::::::::::::::::::::::::::::\nPAGE BEING CRAWLED: Adidas Bots \u2022 Cop Guru|https:\/\/cop.guru\/adidas-bots\/\nNumber of Crawled pages:6\nNumber of Links to be crawled:1763<\/pre>\n<\/div>\n<hr\/>\n<h2 id=\"a-catch-there-is-a-lot-of-improvement-in-the-project\" class=\"ftwp-heading\" style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"A_Catch_There_is_a_lot_of_improvement_in_the_Project\"><\/span><strong>A Catch: There Is a Lot of Room for Improvement in the Project<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Looking at the code above, you will most likely run it without any problem \u2013 but the moment an exception occurs, the crawler will stop running. No exception handling was included, for the sake of simplicity.<\/p>\n<p>Aside from exception handling, you will notice that <a href=\"https:\/\/royadata.io\/blog\/scrape-a-website-never-get-blacklisted\/#anti-scraping-techniques\">no anti-bot evasion technique<\/a> was incorporated, while in reality, many popular websites have anti-bot systems in place to discourage bot access. There is also the issue of speed, which you can address by making the bot multithreaded and the code more efficient. Beyond these, there is still plenty of other room for improvement.<\/p>\n<h2 id=\"conclusion\" class=\"ftwp-heading\" style=\"text-align: center;\"><strong>Conclusion<\/strong><\/h2>\n<p>Looking at the code of the web crawler we developed, you will agree with me that web crawlers are like web scrapers, only with a wider scope. 
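<\/p>\n<p>To make the exception-handling point above concrete, here is one minimal way the request could be guarded so a single failing page does not stop the whole run \u2013 the safe_get helper name and its behavior are illustrative, not taken from the tutorial code.<\/p>\n<pre>import requests\n\ndef safe_get(link, timeout=10):\n    # return the page text, or None if the request fails for any reason\n    try:\n        response = requests.get(link, timeout=timeout)\n        response.raise_for_status()\n        return response.text\n    except requests.RequestException as error:\n        print(\"SKIPPED: \" + link + \" (\" + str(error) + \")\")\n        return None<\/pre>\n<p>Inside crawl, the result would then be checked for None before parsing, and the loop would simply move on to the next URL.<\/p>\n<p>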
Another thing to know is that, depending on the number of URLs discovered, the crawler can run for a long time; multithreading can shorten this considerably. Also keep in mind that complex web crawlers for real-world projects require a more carefully planned approach.<\/p>\n<hr\/>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/crawling-vs-scraping\/\">Web Crawling Vs. Web Scraping<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/web-scraping-with-python\/\">Python Web Scraping Libraries and Framework<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/how-to-build-a-web-crawler-using-selenium-proxies\/\">Building a Web Crawler Using Selenium and Proxies<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/selenium-web-scraping-python\/\">How to Use Selenium to Web Scrape with Python<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Do you want to learn how to build a web crawler from scratch? Join me as I show you how to build a web crawler using Python as the language of choice for the tutorial. Have you ever wondered how the Internet would be without search engines? Well, what if I tell you that web &#8230; <a title=\"How to Build a Web Crawler with Python? (2023 Edition)\" class=\"read-more\" href=\"http:\/\/royadata.io\/blog\/how-to-build-a-web-crawler\/\" aria-label=\"More on How to Build a Web Crawler with Python? 
(2023 Edition)\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":402,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/posts\/6223"}],"collection":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/comments?post=6223"}],"version-history":[{"count":0,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/posts\/6223\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/media\/402"}],"wp:attachment":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/media?parent=6223"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/categories?post=6223"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/tags?post=6223"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}