{"id":5967,"date":"2023-10-18T14:47:43","date_gmt":"2023-10-18T14:47:43","guid":{"rendered":"https:\/\/royadata.io\/blog\/?p=5967"},"modified":"2023-10-18T14:47:43","modified_gmt":"2023-10-18T14:47:43","slug":"web-scraping-practices","status":"publish","type":"post","link":"http:\/\/royadata.io\/blog\/web-scraping-practices\/","title":{"rendered":"Best Web Scraping Practices &#038; Techniques Tips (2023 Updated)"},"content":{"rendered":"<blockquote>\n<p>Successful web scrapers follow some web scraping practices that make them successful in the field. If you want to go far in web scraping, you have to follow these best practices. Come in now to learn about them.<\/p>\n<\/blockquote>\n<p><picture class=\"aligncenter size-full wp-image-4116 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices.jpg.webp 1054w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices-300x150.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices-1024x511.jpg.webp 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices-768x383.jpg.webp 768w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201054%20526'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1054px) 100vw, 1054px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201054%20526'%3E%3C\/svg%3E\" alt=\"Best Web Scraping Practices\" width=\"1054\" height=\"526\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices.jpg 1054w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices-300x150.jpg 300w, 
https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices-1024x511.jpg 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices-768x383.jpg 768w\" data-sizes=\"(max-width: 1054px) 100vw, 1054px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-4116\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices.jpg.webp 1054w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices-300x150.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices-1024x511.jpg.webp 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices-768x383.jpg.webp 768w\" sizes=\"(max-width: 1054px) 100vw, 1054px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices.jpg\" alt=\"Best Web Scraping Practices\" width=\"1054\" height=\"526\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices.jpg 1054w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices-300x150.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices-1024x511.jpg 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Best-Web-Scraping-Practices-768x383.jpg 768w\" sizes=\"(max-width: 1054px) 100vw, 1054px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>As a newcomer to web scraping, you may think your small script can handle a task at any scale, but sooner or later you will discover that it is only a proof of concept, and how na\u00efve you were.<\/p>\n<p>You will discover there is more to <a href=\"https:\/\/royadata.io\/blog\/web-scraping\/\">web scraping<\/a> than you know. 
You will discover that you need to deal with a good number of anti-scraping techniques before you can scrape some websites, and that, just like every other field, web scraping has its own best practices that you must adhere to in order to succeed.<\/p>\n<p>In this article, you will learn the best practices to follow when scraping a site, as well as the common problems you will encounter and how to solve them.<\/p>\n<hr\/>\n<h2 id=\"common-pitfalls-in-web-scraping\" class=\"ftwp-heading\" style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"Common_Pitfalls_in_Web_Scraping\"><\/span><strong>Common Pitfalls in Web Scraping<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>As a web scraper, you need to know that there are some pitfalls you are bound to run into. Some of them happen often \u2013 some occur less frequently. Regardless of how often they occur, you need to know about them. 
The common ones are discussed below.<\/p>\n<div class=\"su-youtube su-u-responsive-media-yes\">\n<div class=\"perfmatters-lazy-youtube\" data-src=\"https:\/\/www.youtube.com\/embed\/SA18JCBtlXY\" data-id=\"SA18JCBtlXY\" data-query onclick=\"if (!window.__cfRLUnblockHandlers) return false; perfmattersLazyLoadYouTube(this);\" data-cf-modified-7ed69c1b0ffd5f4c1ecf29f5->\n<div><img loading=\"lazy\" decoding=\"async\" class=\"perfmatters-lazy\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%20480%20360%3E%3C\/svg%3E\" data-src=\"https:\/\/i.ytimg.com\/vi\/SA18JCBtlXY\/hqdefault.jpg\" alt=\"YouTube video\" width=\"480\" height=\"360\" data-pin-nopin=\"true\"><\/p>\n<div class=\"play\"><\/div>\n<\/div>\n<\/div>\n<p><noscript><iframe loading=\"lazy\" width=\"600\" height=\"400\" src=\"https:\/\/www.youtube.com\/embed\/SA18JCBtlXY?\" frameborder=\"0\" allowfullscreen allow=\"autoplay; encrypted-media; picture-in-picture\" title=\"pitfall web scraping\"><\/iframe><\/noscript><\/div>\n<ul>\n<li>\n<h3 id=\"change-in-the-html-of-a-page\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Change_in_the_HTML_of_a_Page\"><\/span><strong>Change in the HTML of a Page<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<p>I decided to start with this point because, in most cases, it has nothing to do with a website trying to prevent you from scraping. However, it is one of the most common reasons why web scraping scripts stop working. Most sites change their layout from time to time, and when this happens, the HTML changes with it.<\/p>\n<p>This means that your code will break and stop working. You need a system that alerts you as soon as a change is detected on a page so that you can get it fixed. 
Some sites that use pagination change the layout after a certain number of pages to break scrapers \u2013 you have to take this into consideration too.<\/p>\n<hr\/>\n<ul>\n<li>\n<h3 id=\"mistakenly-scrapping-the-wrong-data\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Mistakenly_Scrapping_the_Wrong_Data\"><\/span><strong>Mistakenly Scraping the Wrong Data<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<p>Another common pitfall you are bound to experience as a web scraper is scraping the wrong data. This is usually not a problem when you are scraping only a few pages, because you can quickly go through the scraped data and tell if there is a problem with any of it.<\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/how-to-scrape-linkedin-using-proxies\/\">How to Scrape Data from Linkedin Using Proxies<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/scraping-craigslist\/\">The Ultimate Guide to Scraping Craigslist Data with Software<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/using-proxies-to-scrape-whois-domain-data\/\">Using Proxies to Scrape Whois Domain Data<\/a><\/li>\n<\/ul>\n<p>However, when the volume of data to be scraped is large and you can\u2019t go through it manually, you need to think about the integrity and quality of the whole dataset. This is because some of the data might not meet your quality guidelines. 
For this, you need to subject data to a test case before adding it to the database.<\/p>\n<hr\/>\n<ul>\n<li>\n<h3 id=\"anti-scraping-techniques\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Anti-Scraping_Techniques\"><\/span><strong>Anti-Scraping Techniques<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<p><picture class=\"aligncenter wp-image-4129 size-full perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Anti-Scraping-Techniques.jpg.webp 1022w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Anti-Scraping-Techniques-300x142.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Anti-Scraping-Techniques-768x364.jpg.webp 768w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201022%20484'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1022px) 100vw, 1022px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201022%20484'%3E%3C\/svg%3E\" alt=\"Anti-Scraping Techniques\" width=\"1022\" height=\"484\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Anti-Scraping-Techniques.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Anti-Scraping-Techniques.jpg 1022w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Anti-Scraping-Techniques-300x142.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Anti-Scraping-Techniques-768x364.jpg 768w\" data-sizes=\"(max-width: 1022px) 100vw, 1022px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter wp-image-4129 size-full\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Anti-Scraping-Techniques.jpg.webp 1022w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Anti-Scraping-Techniques-300x142.jpg.webp 
300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Anti-Scraping-Techniques-768x364.jpg.webp 768w\" sizes=\"(max-width: 1022px) 100vw, 1022px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Anti-Scraping-Techniques.jpg\" alt=\"Anti-Scraping Techniques\" width=\"1022\" height=\"484\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Anti-Scraping-Techniques.jpg 1022w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Anti-Scraping-Techniques-300x142.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Anti-Scraping-Techniques-768x364.jpg 768w\" sizes=\"(max-width: 1022px) 100vw, 1022px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>Websites generally do not want their data scraped; when they do, they provide an API for that. Most complex websites have anti-bot systems in place to prevent web scrapers, crawlers, and other automation bots from accessing their content.<\/p>\n<p>These involve <a href=\"https:\/\/royadata.io\/blog\/scrape-a-website-never-get-blacklisted\/#anti-scraping-techniques\"><strong>some anti-scraping techniques<\/strong><\/a> such as IP tracking and banning, <a href=\"https:\/\/royadata.io\/blog\/honeypot-trap\/\">honeypot traps<\/a>, <a href=\"https:\/\/royadata.io\/blog\/how-to-avoid-captcha\/\">Captchas<\/a>, loading content via Ajax, <a href=\"https:\/\/royadata.io\/blog\/browser-fingerprint\/\">browser fingerprinting<\/a>, and many others. 
You will be learning how to solve all of these problems in the section after this.<\/p>\n<hr\/>\n<ul>\n<li>\n<h3 id=\"the-problem-of-scraping-at-a-large-scale\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"The_Problem_of_Scraping_at_a_Large_Scale\"><\/span><strong>The Problem of Scraping at a Large Scale<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<p><picture class=\"aligncenter wp-image-4130 size-full perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Large-Scale-Scraping.jpg.webp 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Large-Scale-Scraping-300x190.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Large-Scale-Scraping-768x487.jpg.webp 768w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201024%20649'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201024%20649'%3E%3C\/svg%3E\" alt=\"Large Scale Scraping\" width=\"1024\" height=\"649\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Large-Scale-Scraping.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Large-Scale-Scraping.jpg 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Large-Scale-Scraping-300x190.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Large-Scale-Scraping-768x487.jpg 768w\" data-sizes=\"(max-width: 1024px) 100vw, 1024px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter wp-image-4130 size-full\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Large-Scale-Scraping.jpg.webp 1024w, 
https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Large-Scale-Scraping-300x190.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Large-Scale-Scraping-768x487.jpg.webp 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Large-Scale-Scraping.jpg\" alt=\"Large Scale Scraping\" width=\"1024\" height=\"649\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Large-Scale-Scraping.jpg 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Large-Scale-Scraping-300x190.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Large-Scale-Scraping-768x487.jpg 768w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>If you are a newbie in the field of web scraping, you will think that scraping a site of 10,000 pages is the same as scraping a site of 2 million pages. However, the more data you have to scrape, the more care and planning you need. Generally, the more data you need to scrape, the more time it will take.<\/p>\n<p>Usually, developing your scraper to scrape concurrently and distributing the work among different computers\/servers will make the whole process faster. Also, your database system needs to be scalable, fast, secure, and reliable. Otherwise, you risk wasting a lot of time trying to query the database. Amazon Web Services (AWS) is one of the popular choices on the market.<\/p>\n<hr\/>\n<h2 id=\"web-scraping-best-practices\" class=\"ftwp-heading\" style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"Web_Scraping_Best_Practices\"><\/span><strong>Web Scraping Best Practices<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>As I stated earlier, every worthwhile activity has its own best practices, and web scraping is not an exception. 
This part of the article will be used to describe those best practices.<\/p>\n<p><picture class=\"aligncenter size-full wp-image-4117 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices.jpg.webp 1069w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices-300x170.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices-1024x579.jpg.webp 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices-768x434.jpg.webp 768w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201069%20604'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1069px) 100vw, 1069px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201069%20604'%3E%3C\/svg%3E\" alt=\"Web Scraping Best Practices\" width=\"1069\" height=\"604\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices.jpg 1069w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices-300x170.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices-1024x579.jpg 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices-768x434.jpg 768w\" data-sizes=\"(max-width: 1069px) 100vw, 1069px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-4117\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices.jpg.webp 1069w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices-300x170.jpg.webp 300w, 
https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices-1024x579.jpg.webp 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices-768x434.jpg.webp 768w\" sizes=\"(max-width: 1069px) 100vw, 1069px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices.jpg\" alt=\"Web Scraping Best Practices\" width=\"1069\" height=\"604\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices.jpg 1069w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices-300x170.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices-1024x579.jpg 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Web-Scraping-Best-Practices-768x434.jpg 768w\" sizes=\"(max-width: 1069px) 100vw, 1069px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<ul>\n<li>\n<h3 id=\"respecting-a-websites-robots-txt-file\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Respecting_a_Websites_Robotstxt_File\"><\/span><strong>Respecting a Website\u2019s Robots.txt File<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<p>Most websites have a <a href=\"https:\/\/www.bestproxyreviews.com\/robots.txt\"  rel=\"noopener noreferrer\">robots.txt<\/a> file, which they use to tell automation bots such as crawlers and scrapers which pages they may and may not access. It can also carry other directives, such as the crawl frequency and the delay between requests. One thing I have discovered about most web scrapers, excluding the ones owned by search engines, is that they do not respect websites\u2019 robots.txt files \u2013 they ignore them completely. 
In fact, some web scrapers see robots.txt as obsolete.<\/p>\n<p>However, it is among the best practices to consider <a href=\"https:\/\/royadata.io\/blog\/scrape-a-website-never-get-blacklisted\/#robots-txt-file-an-overview\">a website\u2019s robots.txt<\/a>. Even if you do not want to follow its disallow rules, which dictate the paths you shouldn\u2019t follow, you can at least respect the crawl-delay instruction in order to be gentle on web servers. You can find out how to parse the robots.txt file in your preferred programming language and scraping framework. Python programmers can do this with the <a href=\"https:\/\/docs.python.org\/3\/library\/urllib.robotparser.html\"  rel=\"noopener noreferrer\">urllib.robotparser<\/a> module.<\/p>\n<p><picture class=\"aligncenter size-full wp-image-4123 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/www.bestproxyreviews.com\/wp-content\/uploads\/2020\/04\/Website\u2019s-Robots-txt-File.jpg.webp 795w, https:\/\/www.bestproxyreviews.com\/wp-content\/uploads\/2020\/04\/Website\u2019s-Robots-txt-File-300x109.jpg.webp 300w, https:\/\/www.bestproxyreviews.com\/wp-content\/uploads\/2020\/04\/Website\u2019s-Robots-txt-File-768x279.jpg.webp 768w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%20795%20289'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 795px) 100vw, 795px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%20795%20289'%3E%3C\/svg%3E\" alt=\"Website\u2019s Robots txt File\" width=\"795\" height=\"289\" data-src=\"https:\/\/www.bestproxyreviews.com\/wp-content\/uploads\/2020\/04\/Website\u2019s-Robots-txt-File.jpg\" data-srcset=\"https:\/\/www.bestproxyreviews.com\/wp-content\/uploads\/2020\/04\/Website\u2019s-Robots-txt-File.jpg 795w, https:\/\/www.bestproxyreviews.com\/wp-content\/uploads\/2020\/04\/Website\u2019s-Robots-txt-File-300x109.jpg 300w, 
https:\/\/www.bestproxyreviews.com\/wp-content\/uploads\/2020\/04\/Website\u2019s-Robots-txt-File-768x279.jpg 768w\" data-sizes=\"(max-width: 795px) 100vw, 795px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-4123\"><source type=\"image\/webp\" srcset=\"https:\/\/www.bestproxyreviews.com\/wp-content\/uploads\/2020\/04\/Website\u2019s-Robots-txt-File.jpg.webp 795w, https:\/\/www.bestproxyreviews.com\/wp-content\/uploads\/2020\/04\/Website\u2019s-Robots-txt-File-300x109.jpg.webp 300w, https:\/\/www.bestproxyreviews.com\/wp-content\/uploads\/2020\/04\/Website\u2019s-Robots-txt-File-768x279.jpg.webp 768w\" sizes=\"(max-width: 795px) 100vw, 795px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.bestproxyreviews.com\/wp-content\/uploads\/2020\/04\/Website\u2019s-Robots-txt-File.jpg\" alt=\"Website\u2019s Robots txt File\" width=\"795\" height=\"289\" srcset=\"https:\/\/www.bestproxyreviews.com\/wp-content\/uploads\/2020\/04\/Website\u2019s-Robots-txt-File.jpg 795w, https:\/\/www.bestproxyreviews.com\/wp-content\/uploads\/2020\/04\/Website\u2019s-Robots-txt-File-300x109.jpg 300w, https:\/\/www.bestproxyreviews.com\/wp-content\/uploads\/2020\/04\/Website\u2019s-Robots-txt-File-768x279.jpg 768w\" sizes=\"(max-width: 795px) 100vw, 795px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/python-web-scraper-tutorial\/\">How to Build a Simple Web Scraper with Python<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/use-chrome-headless-and-dedicated-proxies-to-scrape-any-website\/\">Use Chrome Headless to Scrape Any Website<\/a><\/li>\n<\/ul>\n<hr\/>\n<ul>\n<li>\n<h3 id=\"spoofing-the-user-agent-and-other-http-headers\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Spoofing_the_User-Agent_and_Other_HTTP_Headers\"><\/span><strong>Spoofing the User-Agent and Other HTTP Headers<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<p>When 
browsers send a request to a web server, they send details such as the <a href=\"https:\/\/royadata.io\/blog\/user-agent\/\">User-Agent<\/a>, which is a string identifying the browser. Alongside the User-Agent, other information sent includes Accept, Accept-Language, Accept-Encoding, and Referer, among other data.<\/p>\n<p>Web scrapers also have to send this information; otherwise, some sites will deny them access. Some sites also automatically block certain crawlers and scrapers, using their User-Agent to identify them.<\/p>\n<p><picture class=\"aligncenter size-full wp-image-502 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-a-slightly-different-user-agent.jpg.webp 800w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-a-slightly-different-user-agent-300x155.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-a-slightly-different-user-agent-768x397.jpg.webp 768w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%20800%20414'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 800px) 100vw, 800px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%20800%20414'%3E%3C\/svg%3E\" alt=\"Get a slightly different user agent\" width=\"800\" height=\"414\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-a-slightly-different-user-agent.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-a-slightly-different-user-agent.jpg 800w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-a-slightly-different-user-agent-300x155.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-a-slightly-different-user-agent-768x397.jpg 768w\" data-sizes=\"(max-width: 800px) 100vw, 800px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full 
wp-image-502\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-a-slightly-different-user-agent.jpg.webp 800w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-a-slightly-different-user-agent-300x155.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-a-slightly-different-user-agent-768x397.jpg.webp 768w\" sizes=\"(max-width: 800px) 100vw, 800px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-a-slightly-different-user-agent.jpg\" alt=\"Get a slightly different user agent\" width=\"800\" height=\"414\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-a-slightly-different-user-agent.jpg 800w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-a-slightly-different-user-agent-300x155.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-a-slightly-different-user-agent-768x397.jpg 768w\" sizes=\"(max-width: 800px) 100vw, 800px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>If you do not want your bot identified as a web scraper, you need to spoof the User-Agent by replacing it with that of a popular web browser. It is even better if you can rotate the User-Agent, but you have to make sure that the site does not present different layouts to different User-Agents; otherwise, the code will break if there is a layout change you did not account for in your code. When using the User-Agent string of a popular browser, you have to make sure that the other <a href=\"https:\/\/royadata.io\/blog\/http-headers\/\">HTTP headers<\/a> correspond with it. 
Make sure you also provide a value for the Referer header so that it looks more natural.<\/p>\n<hr\/>\n<ul>\n<li>\n<h3 id=\"dealing-with-logins-and-session-cookies\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Dealing_with_Logins_and_Session_Cookies\"><\/span><strong>Dealing with Logins and Session Cookies<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<p>Some of the data you might be interested in is hidden behind a login page. When you are faced with this, you have to plan carefully, as monitoring your activities becomes easier for the website.<\/p>\n<blockquote>\n<p>But how do you even log in and maintain a session cookie in the first place?<\/p>\n<\/blockquote>\n<p>While this might seem difficult, it is actually straightforward if you know what you are doing. You just have to create a session, then send a request to the login URL with your authentication details as the payload. When your request is successful, you will get a response with the session cookies attached.<\/p>\n<p><picture class=\"aligncenter wp-image-4122 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Dealing-with-Logins.png.webp 805w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Dealing-with-Logins-300x90.png.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Dealing-with-Logins-768x231.png.webp 768w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201008%20303'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1008px) 100vw, 1008px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201008%20303'%3E%3C\/svg%3E\" alt=\"Dealing with Logins\" width=\"1008\" height=\"303\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Dealing-with-Logins.png\" 
data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Dealing-with-Logins.png 805w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Dealing-with-Logins-300x90.png 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Dealing-with-Logins-768x231.png 768w\" data-sizes=\"(max-width: 1008px) 100vw, 1008px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter wp-image-4122\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Dealing-with-Logins.png.webp 805w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Dealing-with-Logins-300x90.png.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Dealing-with-Logins-768x231.png.webp 768w\" sizes=\"(max-width: 1008px) 100vw, 1008px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Dealing-with-Logins.png\" alt=\"Dealing with Logins\" width=\"1008\" height=\"303\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Dealing-with-Logins.png 805w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Dealing-with-Logins-300x90.png 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Dealing-with-Logins-768x231.png 768w\" sizes=\"(max-width: 1008px) 100vw, 1008px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p><picture class=\"aligncenter wp-image-4121 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Session-Cookies.png.webp 777w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Session-Cookies-300x100.png.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Session-Cookies-768x255.png.webp 768w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201003%20333'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1003px) 100vw, 1003px\" \/><img 
decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201003%20333'%3E%3C\/svg%3E\" alt=\"Session Cookies\" width=\"1003\" height=\"333\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Session-Cookies.png\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Session-Cookies.png 777w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Session-Cookies-300x100.png 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Session-Cookies-768x255.png 768w\" data-sizes=\"(max-width: 1003px) 100vw, 1003px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter wp-image-4121\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Session-Cookies.png.webp 777w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Session-Cookies-300x100.png.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Session-Cookies-768x255.png.webp 768w\" sizes=\"(max-width: 1003px) 100vw, 1003px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Session-Cookies.png\" alt=\"Session Cookies\" width=\"1003\" height=\"333\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Session-Cookies.png 777w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Session-Cookies-300x100.png 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Session-Cookies-768x255.png 768w\" sizes=\"(max-width: 1003px) 100vw, 1003px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>With the session cookie returned, you can attach it to each of your requests, and you won\u2019t be asked to log in again, as sites <a href=\"https:\/\/royadata.io\/blog\/http-cookies\/\">use cookies<\/a> to identify their users. 
To discover the login URL and the names of the form inputs to use in the payload, inspect the form in a browser by right-clicking it and choosing the Inspect option.<\/p>\n<p>The value of the form\u2019s action attribute is the login URL. For the payload, inspect the form element and pull out the exact names of the username and password fields, along with any other fields the form requires.<\/p>\n<hr\/>\n<ul>\n<li>\n<h3 id=\"handling-hidden-but-required-security-fields-on-post-forms\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Handling_Hidden_But_Required_Security_Fields_on_POST_Forms\"><\/span><strong>Handling Hidden (But Required) Security Fields on POST Forms<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<p>The login method above may not work on sites that add hidden security fields to their forms to keep hackers and spammers out. If you send just the username and password without a value for the hidden field, the request won\u2019t be successful.<\/p>\n<p>Aside from login forms, many POST forms also carry security fields such as csrf_token, which are hidden from regular users and automatically populated with a value you can\u2019t reproduce yourself.<\/p>\n<p>To get the value for this field, create a session, visit the page that contains the form, and pull the value out of the hidden field. You can then send the request with that value included in the request payload.<\/p>\n<hr\/>\n<ul>\n<li>\n<h3 id=\"slowing-down-your-requests-to-avoid-overwhelming-a-website\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Slowing_Down_Your_Requests_to_Avoid_Overwhelming_a_Website\"><\/span><strong>Slowing Down Your Requests to Avoid Overwhelming a Website<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<p>Web scraping involves sending a good number of requests to websites that you do not own. 
This means that you are adding to the cost of running the site without adding any value to it. If you cannot add value to a site you are scraping, at least be polite and set a delay between requests so that you do not overwhelm the website\u2019s server. Some websites even state a preferred crawl delay for <a href=\"https:\/\/royadata.io\/blog\/web-crawler\/\">web crawlers<\/a> and <a href=\"https:\/\/royadata.io\/blog\/web-scraping-tools\/\">scrapers<\/a> in their robots.txt file.<\/p>\n<p>Even when a site does not state one, it is good practice and good ethics to avoid hammering it with too many requests in a short time, since that can slow the site down. It also helps to scrape at night or early in the morning, when fewer people are using the site, so that your requests do not slow it down for other users.<\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/how-to-prevent-proxy-banned\/\">How to Avoid Proxies Get banned or blocked<\/a><\/li>\n<\/ul>\n<hr\/>\n<ul>\n<li>\n<h3 id=\"distribute-your-requests-across-multiple-ips\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Distribute_Your_Requests_Across_Multiple_IPs\"><\/span><strong>Distribute Your Requests Across Multiple IPs<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<p>The truth is, this point hardly needs to be listed as a best practice, because <a href=\"https:\/\/royadata.io\/blog\/web-scraping-proxies\/\">using proxies when scraping<\/a> is a must. Most websites limit the number of requests they accept from a single IP within a given period of time. If an IP exceeds that limit, it will be blocked for some time.<\/p>\n<p>So, if you plan to scrape at a reasonable scale, you need to make use of proxies. 
With proxies, you can distribute your requests across multiple IPs so that they appear to reach the website from different devices.<\/p>\n<p><picture class=\"aligncenter wp-image-4127 size-full perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/proxy-pool.gif.webp\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201000%20236'%3E%3C\/svg%3E\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201000%20236'%3E%3C\/svg%3E\" alt=\"proxy pool\" width=\"1000\" height=\"236\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/proxy-pool.gif\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter wp-image-4127 size-full\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/proxy-pool.gif.webp\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/proxy-pool.gif\" alt=\"proxy pool\" width=\"1000\" height=\"236\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>Using <a href=\"https:\/\/royadata.io\/blog\/proxy-pool\/\"><strong>a proxy pool<\/strong><\/a> is the best option for this. A pool gives you many IPs, and you do not have to worry about rotating IPs or weeding out bad ones yourself. When it comes to the <a href=\"https:\/\/royadata.io\/blog\/different-types-of-proxies\/\">type of proxies<\/a>, <a href=\"https:\/\/royadata.io\/blog\/residential-proxies\/\"><strong>residential proxies<\/strong><\/a> are the best choice. 
However, for some sites, <a href=\"https:\/\/royadata.io\/blog\/datacenter-proxies\/\">datacenter proxies<\/a> work quite well.<\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/rotating-proxies\/\">Rotating proxies to Request with Multiple IPs for web scraping<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/ip-randomizer\/\">How To Generate A Random IP Address For Each Session<\/a><\/li>\n<\/ul>\n<hr\/>\n<ul>\n<li>\n<h3 id=\"handling-missing-html-tags\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Handling_Missing_HTML_Tags\"><\/span><strong>Handling Missing HTML Tags<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<p>When it comes to web scraping, you need to know that the HTML of a page cannot be trusted, for a simple reason: pages are changed every now and then.<\/p>\n<p>Because of this, it is important that you always check that an element exists before trying to manipulate it or pull data from it. 
When you try to pull data from a missing HTML tag, some parsing libraries will return None, while others will throw an exception.<\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/data-parsing\/\">What is Data Parsing and Parsing Techniques involved?<\/a><\/li>\n<\/ul>\n<p><picture class=\"aligncenter wp-image-4126 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/HTML-Tags.jpg.webp 800w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/HTML-Tags-300x200.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/HTML-Tags-768x513.jpg.webp 768w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%20999%20667'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 999px) 100vw, 999px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%20999%20667'%3E%3C\/svg%3E\" alt=\"HTML Tags\" width=\"999\" height=\"667\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/HTML-Tags.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/HTML-Tags.jpg 800w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/HTML-Tags-300x200.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/HTML-Tags-768x513.jpg 768w\" data-sizes=\"(max-width: 999px) 100vw, 999px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter wp-image-4126\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/HTML-Tags.jpg.webp 800w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/HTML-Tags-300x200.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/HTML-Tags-768x513.jpg.webp 768w\" sizes=\"(max-width: 999px) 100vw, 999px\"\/><img loading=\"lazy\" decoding=\"async\" 
src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/HTML-Tags.jpg\" alt=\"HTML Tags\" width=\"999\" height=\"667\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/HTML-Tags.jpg 800w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/HTML-Tags-300x200.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/HTML-Tags-768x513.jpg 768w\" sizes=\"(max-width: 999px) 100vw, 999px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>It is advisable to always use an if statement to check that a tag is available before trying to work on it. If an element is missing, the web scraper should log it and notify you, so that you know something has changed on the page and can update your code.<\/p>\n<hr\/>\n<ul>\n<li>\n<h3 id=\"handling-network-errors\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Handling_Network_Errors\"><\/span><strong>Handling Network Errors<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<blockquote>\n<p>You are writing the code for your web scraper, and you are not thinking of network errors, right?<\/p>\n<\/blockquote>\n<p>Well, it might interest you to know that network errors happen much more often than you think. They can result from a problem on your own end, on the web server you are sending requests to, or with <a href=\"https:\/\/royadata.io\/blog\/best-proxy-services\/\">your proxy provider<\/a>.<\/p>\n<p>The rule of thumb is never to trust that the network will behave as you expect it to. 
Problems will arise, so write your code in a way that factors in the possibility of network errors and handles them accordingly.<\/p>\n<p>Make sure every part of your code that sends web requests has exception handling attached:<\/p>\n<pre>import requests\n\ntry:\n    response = requests.get(\"https:\/\/www.google.com\", timeout=10)\nexcept requests.exceptions.RequestException as error:\n    # code for handling the exception here.\n    print(error)\n<\/pre>\n<p>In handling the error, you can retry a few times. If the request still fails, log the URL and the error so that you can deal with it manually, then move on to the next URL. Also, make sure you only start parsing data when the HTTP status code returned is 200.<\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/http-proxy-error-codes\/\">Most Common HTTP Proxy Error Codes<\/a><\/li>\n<\/ul>\n<hr\/>\n<ul>\n<li>\n<h3 id=\"scrape-google-cache-for-non-time-sensitive-data\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Scrape_Google_Cache_for_Non-Time_Sensitive_Data\"><\/span><strong>Scrape Google Cache for Non-Time Sensitive Data<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<\/li>\n<\/ul>\n<p>Is the data you are trying to scrape not time-sensitive? Then you might as well leave the website alone and scrape the copies of its pages on <a href=\"https:\/\/blog.hubspot.com\/marketing\/google-cache\">Google Cache<\/a>. It might interest you to know that most of the pages you are trying to scrape have already been scraped by Google, and you can scrape directly from Google Cache, especially when you are dealing with historical data. 
You can get the whole HTML of a page, including its <a href=\"https:\/\/royadata.io\/blog\/scrape-images-from-a-website-with-python\/\">pictures<\/a> and other files, from Google Cache.<\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/proxies-for-scraping-google\/\">Proxies for Preventing Bans and Captchas When Scraping Google<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/google-proxy\/\"><strong>The Best Google Proxies for SERP data &#038; Never Get Google Blocked<\/strong><\/a><\/li>\n<\/ul>\n<p><picture class=\"aligncenter wp-image-4125 size-large perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-1024x641.png.webp 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-300x188.png.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-768x480.png.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-320x200.png.webp 320w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache.png.webp 1186w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201024%20641'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201024%20641'%3E%3C\/svg%3E\" alt=\"Scrap Google Cache\" width=\"1024\" height=\"641\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-1024x641.png\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-1024x641.png 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-300x188.png 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-768x480.png 768w, 
https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-320x200.png 320w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache.png 1186w\" data-sizes=\"(max-width: 1024px) 100vw, 1024px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter wp-image-4125 size-large\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-1024x641.png.webp 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-300x188.png.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-768x480.png.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-320x200.png.webp 320w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache.png.webp 1186w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-1024x641.png\" alt=\"Scrap Google Cache\" width=\"1024\" height=\"641\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-1024x641.png 1024w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-300x188.png 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-768x480.png 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache-320x200.png 320w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Scrap-Google-Cache.png 1186w\" sizes=\"(max-width: 1024px) 100vw, 1024px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>Google is a very large website and can take the requests you send to its servers at any time of the day without your web scraper impacting it in any noticeable way.<\/p>\n<p>The same can\u2019t be said of other websites, whose servers can easily get overwhelmed with requests. 
Even Scrapy advises that web scrapers should scrape historical data from Google Cache instead of hitting a website directly.<\/p>\n<hr\/>\n<h2 id=\"conclusion\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Conclusion\"><\/span><strong>Conclusion<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Web scraping is a serious business that requires a good amount of planning and careful execution, especially if you are going into it on a reasonable scale. As you are planning, there are some key best practices in web scraping that you have to consider. Some of these have been discussed above.<\/p>\n<hr\/>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/scrapy-vs-selenium-vs-beautifulsoup-for-web-scraping\/\">Scrapy Vs. Beautifulsoup Vs. Selenium for Web Scraping<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/how-to-build-a-web-crawler-using-selenium-proxies\/\">Building a Web Crawler Using Selenium and Proxies<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/selenium-proxy\/\">Selenium Proxy Setting \u2013 How to Setup Proxies on Selenium<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/curl-proxy-settings\/\">Curl Proxy Settings \u2013 How to cURL with a proxy?<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Successful web scrapers follow some web scraping practices that make them successful in the field. If you want to go far in web scraping, you have to follow these best practices. Come in now to learn about them. 
As a newbie in the game of web scraping, you will think your small script can get &#8230; <a title=\"Best Web Scraping Practices &#038; Techniques Tips (2023 Updated)\" class=\"read-more\" href=\"http:\/\/royadata.io\/blog\/web-scraping-practices\/\" aria-label=\"More on Best Web Scraping Practices &#038; Techniques Tips (2023 Updated)\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":154,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/posts\/5967"}],"collection":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/comments?post=5967"}],"version-history":[{"count":0,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/posts\/5967\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/media\/154"}],"wp:attachment":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/media?parent=5967"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/categories?post=5967"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/tags?post=5967"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}