{"id":6152,"date":"2023-10-18T14:47:43","date_gmt":"2023-10-18T14:47:43","guid":{"rendered":"https:\/\/royadata.io\/blog\/?p=6152"},"modified":"2023-10-18T14:47:43","modified_gmt":"2023-10-18T14:47:43","slug":"proxies-for-scraping-google","status":"publish","type":"post","link":"http:\/\/royadata.io\/blog\/proxies-for-scraping-google\/","title":{"rendered":"Proxies for Preventing Bans and Captchas When Scraping Google"},"content":{"rendered":"<p><picture class=\"aligncenter size-full wp-image-565 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking.jpg.webp 1064w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking-300x149.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking-768x382.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking-1024x509.jpg.webp 1024w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201064%20529'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1064px) 100vw, 1064px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201064%20529'%3E%3C\/svg%3E\" alt=\"you need is a captcha checking\" width=\"1064\" height=\"529\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking.jpg 1064w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking-300x149.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking-768x382.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking-1024x509.jpg 1024w\" 
data-sizes=\"(max-width: 1064px) 100vw, 1064px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-565\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking.jpg.webp 1064w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking-300x149.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking-768x382.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking-1024x509.jpg.webp 1024w\" sizes=\"(max-width: 1064px) 100vw, 1064px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking.jpg\" alt=\"you need is a captcha checking\" width=\"1064\" height=\"529\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking.jpg 1064w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking-300x149.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking-768x382.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/you-need-is-a-captcha-checking-1024x509.jpg 1024w\" sizes=\"(max-width: 1064px) 100vw, 1064px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>The last thing you need when collecting large volumes of data from Google is an IP ban. The second-to-last thing you need is a captcha checking in on your human-ness. Each of these protection measures is employed by the search giant to weed out bots, which is exactly what you\u2019re running if you\u2019ve come to this article.<\/p>\n<p>Before we get into specific ways to prevent these terrible things, I\u2019d like to address the <strong>ethical aspect of scraping Google<\/strong>. As a rule of thumb, yes, scraping Google is ethical. 
Harvesting data in and of itself is a common practice today, so much so that Google does it all the time, pulling the same sort of data you\u2019re looking for from websites across the internet.<\/p>\n<p>Actually, it\u2019s so out in the open that Google has a whole page on its <strong><a href=\"https:\/\/support.google.com\/webmasters\/answer\/182072?hl=en\"  rel=\"noopener noreferrer\">Googlebot<\/a><\/strong>, the data scraping tool that crawls the web on Google\u2019s behalf.<\/p>\n<p>The ethics get murkier depending on what you plan to do with the scraped data. I\u2019m not here to judge, so I won\u2019t, but you should know that there are a host of very legitimate and widely practiced functions that come from Google scraping, like <strong>competitive analysis<\/strong>, <strong>keyword crunching<\/strong>, and <strong>personal research<\/strong>. Flip the coin and you\u2019ll find an equally large number of illegal functions, like DDoS attacks and unsolicited bulk email marketing.<\/p>\n<blockquote>\n<p>The ethics of scraping Google\u2019s treasure trove of data is in your hands.<\/p>\n<\/blockquote>\n<p>You might also be wondering why Google has IP bans and captchas if so many scraping methods are technically legitimate. Put simply, <strong><span style=\"text-decoration: underline;\">Google wants humans using its site<\/span><\/strong>. Sort of like in Terminator, robots are inherent harbingers of possible death and destruction, and Google wants to safeguard websites on the internet from all those nefarious activities scraping allows. Individual humans can\u2019t bring such evil things to the web, so Google is fine with them.<\/p>\n<p>You can see this play out in the two levels of security your IPs will have to go through. Google doesn\u2019t start off by banning your proxy IPs (unless you\u2019ve done something very bad). Instead, it starts by presenting a captcha. This is specifically designed to baffle bots. 
You\u2019ve seen the captcha box before, and if you\u2019re scraping you probably see it often. This is the first tier; the second is a ban, which can be permanent or temporary, depending on your infraction.<\/p>\n<p>The question remains,<\/p>\n<h2 id=\"how-do-you-get-around-those-bans-and-captchas\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"How_do_you_get_around_those_bans_and_captchas\"><\/span><strong>How do you get around those bans and captchas?<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>We\u2019ve got six methods, listed below.<\/p>\n<hr\/>\n<h3 id=\"1-limit-individual-proxy-ip-use\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"1_Limit_Individual_Proxy_IP_Use\"><\/span>1. Limit Individual Proxy IP Use<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>I\u2019ll start with one of the more basic principles of web scraping etiquette. In most scenarios, you or your company will be using <a href=\"https:\/\/royadata.io\/blog\/different-types-of-proxies\/\">large batches of proxies<\/a> to do the data collecting. This is almost always the case because the whole point is to scrape data in large amounts quickly, and you can\u2019t do that very effectively if you only have 5 proxies.<\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/how-many-proxies-do-i-need-for-my-application\/\">How Many Proxies Do I Need for My Application?<\/a><\/li>\n<\/ul>\n<p>As such, you\u2019ll have a vast number of IPs to use when scraping data. In the software program you use to scrape (of which there are many, <a href=\"https:\/\/royadata.io\/blog\/why-the-harvester-on-your-scrapebox-isnt-working\/\"><strong>ScrapeBox<\/strong><strong> is a good one<\/strong><\/a>), there will likely be a setting for how often a proxy can query or search. 
You\u2019ll most likely find this in the API (application program interface).<\/p>\n<p><picture class=\"aligncenter size-full wp-image-566 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface..jpg.webp 1066w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface.-300x49.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface.-768x126.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface.-1024x168.jpg.webp 1024w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201066%20175'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1066px) 100vw, 1066px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201066%20175'%3E%3C\/svg%3E\" alt=\"Application program interface).\" width=\"1066\" height=\"175\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface..jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface..jpg 1066w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface.-300x49.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface.-768x126.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface.-1024x168.jpg 1024w\" data-sizes=\"(max-width: 1066px) 100vw, 1066px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-566\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface..jpg.webp 1066w, 
https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface.-300x49.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface.-768x126.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface.-1024x168.jpg.webp 1024w\" sizes=\"(max-width: 1066px) 100vw, 1066px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface..jpg\" alt=\"Application program interface).\" width=\"1066\" height=\"175\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface..jpg 1066w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface.-300x49.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface.-768x126.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Application-program-interface.-1024x168.jpg 1024w\" sizes=\"(max-width: 1066px) 100vw, 1066px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>The query frequency is set in seconds, <strong>or minutes if you want to be very cautious<\/strong>. I recommend, at the very least, limiting an individual proxy IP to one query every 2-5 seconds, potentially longer if you\u2019re using search operators or keywords that tend to raise red flags (which I\u2019ll get into below).<\/p>\n<p>Let\u2019s say you set your search frequency to 5 seconds. This will make sure that a single IP address, whether a proxy you own or one you rent, won\u2019t send a query more often than once every 5 seconds. This ties directly into that whole human concept. A real human wouldn\u2019t likely query Google every second for 10 straight minutes. That would be 600 individual Google searches, conducted just for the fun of it.<\/p>\n<p>No, <strong>that\u2019s what robots do<\/strong>, and precisely what scraping looks like. 
Set your individual proxy IP query limit to one query every 2-5 seconds to be safe, or longer to be safer, and you\u2019ll be far less likely to trigger bans and captchas for that specific IP.<\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/best-captcha-breaking-service-with-proxies\/\">What is the Best Service for Captcha Breaking with Proxies?<\/a><\/li>\n<\/ul>\n<hr\/>\n<h3 id=\"2-set-a-proxy-rate-limit\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"2_Set_a_Proxy_Rate_Limit\"><\/span>2. Set a Proxy Rate Limit<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>This is pretty much the same concept as the previous example, with a slight twist. Whereas previously I advised keeping a single proxy IP from querying too often, you also want to limit how often all of your proxies start querying for a topic.<\/p>\n<p>The range of time can really vary for this, and you can stagger your proxies to have different rate limits to even further reduce Google\u2019s suspicions. For example, you want to beware of making 5,000 queries about \u201cSocial Media Likes\u201d all at once. Even though these requests were sent by different, seemingly unconnected IP addresses, the fact that they arrive <strong>together can trigger Google\u2019s ban and captcha procedures<\/strong>.<\/p>\n<p>At the very least set your proxy rate limit to 1-2 seconds. To be safer, stagger the limits: 2 seconds for a third of your proxies, 5 seconds for a third, and 8 seconds for the final third. Doing this in combination with individual IP query limits will significantly reduce the risk of Google banning your proxies.<\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/proxy-compatibility\/\">The Ultimate Guide to Proxy Compatibility<\/a><\/li>\n<\/ul>\n<hr\/>\n<h3 id=\"3-set-your-ips-location-in-google\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"3_Set_Your_IPs_Location_in_Google\"><\/span>3. 
Set Your IP\u2019s Location in Google<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Google has a bad habit of incorrectly deciding where your IP is located. It\u2019s frustrating because IPs are often chosen in specific countries, like the <a href=\"https:\/\/royadata.io\/blog\/us-proxy\/\">U.S<\/a>. or <a href=\"https:\/\/royadata.io\/blog\/uk-proxy\/\">U.K<\/a>., in order to access content and give the user a more mainstream appearance. When Google incorrectly determines the location of <a href=\"https:\/\/royadata.io\/blog\/ip-address\/\">your IP address<\/a>, it may feel like the whole purpose of your proxy has been defeated. Do not fear, there is a way!<\/p>\n<p>The best way to remedy a Google geo-location redirection is to simply visit <strong><a href=\"https:\/\/google.com\/ncr\"  rel=\"noopener noreferrer\">https:\/\/google.com\/ncr<\/a><\/strong> instead of your typical <strong><a href=\"https:\/\/google.com\"  rel=\"noopener noreferrer\">https:\/\/google.com<\/a><\/strong>. The suffix is short for \u201cno country redirect.\u201d <strong>The \u201cncr\u201d automatically sends you to the U.S. 
Google<\/strong> (which is the one most people are trying to access), regardless of where your IP is located.<\/p>\n<p><picture class=\"aligncenter size-full wp-image-568 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit.jpg.webp 1042w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit-300x157.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit-768x403.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit-1024x538.jpg.webp 1024w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201042%20547'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1042px) 100vw, 1042px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201042%20547'%3E%3C\/svg%3E\" alt=\"Google geo-location redirection is simply visit\" width=\"1042\" height=\"547\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit.jpg 1042w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit-300x157.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit-768x403.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit-1024x538.jpg 1024w\" data-sizes=\"(max-width: 1042px) 100vw, 1042px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-568\"><source type=\"image\/webp\" 
srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit.jpg.webp 1042w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit-300x157.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit-768x403.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit-1024x538.jpg.webp 1024w\" sizes=\"(max-width: 1042px) 100vw, 1042px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit.jpg\" alt=\"Google geo-location redirection is simply visit\" width=\"1042\" height=\"547\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit.jpg 1042w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit-300x157.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit-768x403.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Google-geo-location-redirection-is-simply-visit-1024x538.jpg 1024w\" sizes=\"(max-width: 1042px) 100vw, 1042px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>When it comes to <a href=\"https:\/\/royadata.io\/blog\/how-to-prevent-proxy-banned\/\">avoiding bans<\/a> and captchas, the purpose of this step is to make your requests appear to come from a single country. As I previously mentioned, your proxy software is going to send out many requests to Google in order to search for information. 
If those similar searches come from fifteen different countries, each of which Google has incorrectly placed you in, it will send up red flags.<\/p>\n<p>Keep in mind that this really depends on why you\u2019re scraping data. 
If you want to scrape data on Japanese green tea harvesting blogs written by Japanese people in Japan, you\u2019ll actually want Google to think you\u2019re in Japan, like a typical user there. Try to get a proxy provider with multiple worldwide locations for this, and if you\u2019re unsure of an IP\u2019s location, ask your provider or <a href=\"https:\/\/www.iplocation.net\/\"><strong>check for yourself<\/strong><\/a>.<\/p>\n<p>If you want to search the most common market (a.k.a. the U.S.), use the \u201cncr\u201d trick to make sure your crawlers start searching that version of Google.<\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/proxies-to-stream-netflix-youtube\/\">Using Proxies to Stream Netflix and YouTube From Any Country<\/a><\/li>\n<\/ul>\n<hr\/>\n<h3 id=\"4-set-your-referrer-url\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"4_Set_Your_Referrer_URL\"><\/span>4. Set Your Referrer URL<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>This goes hand in hand with the last step because it\u2019s about making sure you\u2019re starting off your query on the right foot. In order to scrape Google, you\u2019ll need to access a specific part of Google. The most common searches with the biggest nets are often done on <a href=\"https:\/\/www.google.com\/search\"><strong>https:\/\/www.google.com\/search<\/strong><\/a>, better known as the general search page of Google. This is where most of us type in whatever we want to know about. For those of us using Chrome, we simply type a phrase into the URL bar, and Google Search is employed to give us listings.<\/p>\n<p><strong>This is how a human searches<\/strong>. Remember that the best way to avoid bans and captchas with Google is to act like a human. 
Most humans go to <strong>google.com<\/strong> to start their search, while Chrome users are automatically using <strong>google.com<\/strong> to search.<\/p>\n<p><picture class=\"aligncenter size-full wp-image-570 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches..jpg.webp 1075w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches.-300x123.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches.-768x315.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches.-1024x420.jpg.webp 1024w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201075%20441'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1075px) 100vw, 1075px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201075%20441'%3E%3C\/svg%3E\" alt=\"This is all how a human searches.\" width=\"1075\" height=\"441\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches..jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches..jpg 1075w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches.-300x123.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches.-768x315.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches.-1024x420.jpg 1024w\" data-sizes=\"(max-width: 1075px) 100vw, 1075px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-570\"><source type=\"image\/webp\" 
srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches..jpg.webp 1075w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches.-300x123.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches.-768x315.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches.-1024x420.jpg.webp 1024w\" sizes=\"(max-width: 1075px) 100vw, 1075px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches..jpg\" alt=\"This is all how a human searches.\" width=\"1075\" height=\"441\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches..jpg 1075w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches.-300x123.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches.-768x315.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/This-is-all-how-a-human-searches.-1024x420.jpg 1024w\" sizes=\"(max-width: 1075px) 100vw, 1075px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>The issue with bot searches is that, if left alone, the bots will use your search operator or keyword to collect data without visiting google.com at all. They will simply reap from the wheat fields of Google Search as if nobody ever needed google.com for anything. Put differently, a robot skips visiting google.com because it isn\u2019t necessary. A human wouldn\u2019t do that.<\/p>\n<p>The solution is to set your referrer (the HTTP Referer header sent with each request) to google.com specifically. Most software programs built for scraping have a setting or API option that makes this possible. If yours doesn\u2019t, consider using one that does. 
If you\u2019re writing your own script, make sure it sends this referrer with every request.<\/p>\n<hr\/>\n<h3 id=\"5-create-unique-user-agents-for-your-proxies\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"5_Create_Unique_User_Agents_for_your_Proxies\"><\/span>5. Create Unique User Agents for your Proxies<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p><a href=\"https:\/\/royadata.io\/blog\/user-agent\/\">User agents<\/a> are the strings a browser sends with each request to identify itself: the browser name, its version, and the operating system. Not to be confused with personal data, like your passwords or credit card information, user agents are common and there\u2019s typically no real need to hide them.<\/p>\n<p>However, when scraping Google for data, it\u2019s paramount to <strong>diversify your user agent information<\/strong>. This principle comes back to the same old reason you need to change most of these settings: Google wants to believe a human is searching.<\/p>\n<p>Even if your IP addresses are different, your countries are lined up, you have unique keywords and operators, and the query times are set, if Google receives 10,000 requests in ten seconds, all of which report a 1024 x 768 screen resolution, the current version of Firefox, and Windows 7, it starts to get very suspicious.<\/p>\n<p><picture class=\"aligncenter size-full wp-image-567 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information.jpg.webp 1067w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information-300x27.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information-768x70.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information-1024x93.jpg.webp 1024w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201067%2097'%3E%3C\/svg%3E\" 
data-sizes=\"(max-width: 1067px) 100vw, 1067px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201067%2097'%3E%3C\/svg%3E\" alt=\"Get user agent Information\" width=\"1067\" height=\"97\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information.jpg 1067w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information-300x27.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information-768x70.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information-1024x93.jpg 1024w\" data-sizes=\"(max-width: 1067px) 100vw, 1067px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-567\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information.jpg.webp 1067w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information-300x27.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information-768x70.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information-1024x93.jpg.webp 1024w\" sizes=\"(max-width: 1067px) 100vw, 1067px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information.jpg\" alt=\"Get user agent Information\" width=\"1067\" height=\"97\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information.jpg 1067w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information-300x27.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information-768x70.jpg 768w, 
https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Get-user-agent-Information-1024x93.jpg 1024w\" sizes=\"(max-width: 1067px) 100vw, 1067px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>Changing user agent information in your browser is simple, especially if you\u2019re using <a href=\"https:\/\/chrome.google.com\/webstore\/detail\/user-agent-switcher-for-c\/djflhoibgkdhkhhcedjiklpkjnoahfmg?hl=en-US\"  rel=\"noopener noreferrer\"><strong>Google Chrome<\/strong><\/a> or <a href=\"https:\/\/addons.mozilla.org\/en-GB\/firefox\/addon\/uaswitcher\/\"  rel=\"noopener noreferrer\"><strong>Firefox<\/strong><\/a>. You can do this by installing extensions that allow you to swap bits of user agent information for individual proxies, which helps your requests look like they come from different browsers.<\/p>\n<p>This can get complex and time-consuming if you\u2019re running hundreds (or thousands) of proxies, all of which need to have slight tweaks. Sometimes your proxy provider will include tools to do this in their <a href=\"https:\/\/royadata.io\/blog\/myprivateproxy\/\"><strong>API (Myprivateproxy does)<\/strong><\/a>, so just ask. Make sure to contact the customer support department of your proxy provider if you\u2019re worried about this step.<\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/use-chrome-headless-and-dedicated-proxies-to-scrape-any-website\/\">Use Chrome Headless and Dedicated Proxies to Scrape Any Website<\/a><\/li>\n<\/ul>\n<hr\/>\n<h3 id=\"6-avoid-google-search-operators-that-raise-red-flags\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"6_Avoid_Google_Search_Operators_That_Raise_Red_Flags\"><\/span>6. Avoid Google Search Operators That Raise Red Flags<span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>This is a major one, and a mistake that most people make when scraping data on Google. Search operators are terms used to conduct hyper-specific queries on Google. 
When utilized effectively, they can result in a tremendous amount of highly relevant data for you to sort through.<\/p>\n<p>The most common search operators you\u2019ll see are:<\/p>\n<ul>\n<li><strong>inurl<\/strong><\/li>\n<li><strong>intitle<\/strong><\/li>\n<li><strong>intext<\/strong><\/li>\n<\/ul>\n<p>Maybe you\u2019ve used or seen variants of those terms with an \u201call\u201d prefix, like \u201callinurl\u201d and so forth. They are basically directions for Google to filter content by where a term appears, which produces a more specific list of results for you and your bots. Search operators have a lot of rules and are used in a myriad of ways, but when it comes to getting banned, they are very important.<\/p>\n<p>Due to their popularity in bot searches, <strong>Google simply does not like them<\/strong>. Normal humans, the creatures you\u2019re trying to emulate, do not go to <strong>google.com<\/strong> and type in \u201cinurl: grasshoppers\u201d to find websites about grasshoppers. They just type \u201cgrasshoppers.\u201d<\/p>\n<p>This is compounded (literally) when you and your bots run queries with multiple search operators. 
If we continue the above example, running this search \u2014 \u201cintext: grasshopper evolution inurl:grasshoppers\u201d \u2014 will get even more specific information, like websites with grasshoppers in the URL and text that refers to grasshopper evolution.<\/p>\n<p><picture class=\"aligncenter size-full wp-image-569 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL.jpg.webp 1063w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL-300x78.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL-768x200.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL-1024x267.jpg.webp 1024w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201063%20277'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1063px) 100vw, 1063px\" \/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201063%20277'%3E%3C\/svg%3E\" alt=\"Grasshoppers in the URL\" width=\"1063\" height=\"277\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL.jpg 1063w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL-300x78.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL-768x200.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL-1024x267.jpg 1024w\" data-sizes=\"(max-width: 1063px) 100vw, 1063px\" loading=\"lazy\" \/>\n<\/picture>\n<noscript><picture class=\"aligncenter size-full wp-image-569\"><source type=\"image\/webp\" 
srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL.jpg.webp 1063w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL-300x78.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL-768x200.jpg.webp 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL-1024x267.jpg.webp 1024w\" sizes=\"(max-width: 1063px) 100vw, 1063px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL.jpg\" alt=\"Grasshoppers in the URL\" width=\"1063\" height=\"277\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL.jpg 1063w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL-300x78.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL-768x200.jpg 768w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Grasshoppers-in-the-URL-1024x267.jpg 1024w\" sizes=\"(max-width: 1063px) 100vw, 1063px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>To Google, it becomes painfully obvious that you are not a human trying to write a biology paper on grasshoppers. You are, perhaps, a bot search run by a human trying to start their next niche website.<\/p>\n<p>The number and types of search operators are massive, so, first of all, try to stay away from the common ones. Instead, string together multiple keywords for a more unique search, and refer to <a href=\"https:\/\/bynd.com\/news-ideas\/google-advanced-search-comprehensive-list-google-search-operators\"  rel=\"noopener noreferrer\"><strong>this list<\/strong><\/a> when looking for new ways to query. 
Also, try to steer clear of really common keywords in search operators because those have even more red flags around them.<\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/how-to-find-proxies-to-use-with-seo-software\/\">Why use Proxies for SEO Tools<\/a><\/li>\n<\/ul>\n<hr\/>\n<h2 id=\"bans-and-captchas-begone\" class=\"ftwp-heading\"><span class=\"ez-toc-section\" id=\"Bans_and_Captchas_Begone\"><\/span>Bans and Captchas Begone<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Captchas will slow you down, which can be a detriment to the clients you have or the success of the project you\u2019re working on. <strong>IP bans<\/strong> are a whole other headache, and will require you to reach out to your proxy provider. The six tips above should help you see fewer bans and captchas, which will increase your effectiveness and reliability by leaps and bounds.<\/p>\n<p>With that said, the most important message I can impart when it comes to Google bans is to do your research and limit the robot-ness of your searches. <strong>Sometimes it\u2019s better to walk slow and look like a human than push the envelope and scrape like a robot.<\/strong><\/p>\n<blockquote>\n<p>What is more sustainable in the long run?<\/p>\n<\/blockquote>\n<hr\/>\n<p>Related:<\/p>\n<ul>\n<li><a href=\"https:\/\/royadata.io\/blog\/how-to-scrape-linkedin-using-proxies\/\">Guide to Scraping Data from LinkedIn Using Proxies<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/scraping-craigslist\/\">How to Scrape Craigslist Data with Scraping Software<\/a><\/li>\n<li><a href=\"https:\/\/royadata.io\/blog\/residential-proxies\/\">Picking the Best Residential Proxies for web scraping<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>The last thing you need when collecting large volumes of data from Google is an IP ban. The second-to-last thing you need is a captcha checking in on your human-ness. 
Each of these protection measures is employed by the search giant to weed out bots, which is exactly what you\u2019re running if you\u2019ve come to &#8230; <a title=\"Proxies for Preventing Bans and Captchas When Scraping Google\" class=\"read-more\" href=\"http:\/\/royadata.io\/blog\/proxies-for-scraping-google\/\" aria-label=\"More on Proxies for Preventing Bans and Captchas When Scraping Google\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":339,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/posts\/6152"}],"collection":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/comments?post=6152"}],"version-history":[{"count":0,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/posts\/6152\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/media\/339"}],"wp:attachment":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/media?parent=6152"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/categories?post=6152"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/tags?post=6152"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}