{"id":5898,"date":"2023-10-18T14:47:43","date_gmt":"2023-10-18T14:47:43","guid":{"rendered":"https:\/\/royadata.io\/blog\/?p=5898"},"modified":"2023-10-18T14:47:43","modified_gmt":"2023-10-18T14:47:43","slug":"python-html-parsing","status":"publish","type":"post","link":"http:\/\/royadata.io\/blog\/python-html-parsing\/","title":{"rendered":"Python HTML Parsing: The Best Python Libraries for HTML Parsing"},"content":{"rendered":"<blockquote>\n<p>Are you looking for the best HTML parsing method and tool to use in your Python web scraping projects? Then the article below has been written for you as I compared 3 of the popular HTML parsing libraries.<\/p>\n<\/blockquote>\n<p><picture class=\"aligncenter size-full wp-image-23285 perfmatters-lazy\" loading=\"lazy\"><source type=\"image\/webp\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Python-HTML-Parsing.jpg.webp 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Python-HTML-Parsing-300x167.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Python-HTML-Parsing-768x426.jpg.webp 768w\" srcset=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201000%20555'%3E%3C\/svg%3E\" data-sizes=\"(max-width: 1000px) 100vw, 1000px\"\/><img decoding=\"async\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%201000%20555'%3E%3C\/svg%3E\" alt=\"Python HTML Parsing\" width=\"1000\" height=\"555\" data-src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Python-HTML-Parsing.jpg\" data-srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Python-HTML-Parsing.jpg 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Python-HTML-Parsing-300x167.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Python-HTML-Parsing-768x426.jpg 768w\" data-sizes=\"(max-width: 1000px) 100vw, 1000px\" loading=\"lazy\"\/>\n<\/picture>\n<noscript><picture 
class=\"aligncenter size-full wp-image-23285\"><source type=\"image\/webp\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Python-HTML-Parsing.jpg.webp 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Python-HTML-Parsing-300x167.jpg.webp 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Python-HTML-Parsing-768x426.jpg.webp 768w\" sizes=\"(max-width: 1000px) 100vw, 1000px\"\/><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Python-HTML-Parsing.jpg\" alt=\"Python HTML Parsing\" width=\"1000\" height=\"555\" srcset=\"https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Python-HTML-Parsing.jpg 1000w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Python-HTML-Parsing-300x167.jpg 300w, https:\/\/royadata.io\/blog\/wp-content\/uploads\/2023\/10\/Python-HTML-Parsing-768x426.jpg 768w\" sizes=\"(max-width: 1000px) 100vw, 1000px\"\/>\n<\/picture>\n<\/noscript><\/p>\n<p>Being able to evade detection in order to access a web resource on a remote server and then download it is just one aspect of web scraping. And for obvious reasons, it is often touted as the most difficult part. The other piece of the puzzle, which can also be difficult depending on how complex or messy the page elements are, is parsing and extracting the required data from the page. Python is touted as the simplest and easiest-to-use programming language for web scraping.<\/p>\n<p>However, the HTML parser that comes in its standard library is one of the more cumbersome options to use, and frankly, few people use it directly. For this reason, there are a good number of alternative parsers provided as third-party libraries you can use. 
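The standard library parser mentioned above is event-driven rather than tree-based, which is a large part of why it feels cumbersome. As a minimal sketch (the markup string here is made up for illustration), extracting link hrefs with the built-in `html.parser` requires subclassing and callbacks:

```python
from html.parser import HTMLParser

# html.parser is event-driven: instead of querying a document tree,
# you subclass HTMLParser and override callbacks fired during parsing.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for each start tag
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)

parser = LinkCollector()
parser.feed('<p><a href="/home">Home</a> <a href="/about">About</a></p>')
print(parser.hrefs)  # ['/home', '/about']
```

Compare this to the one-line `find_all` call the third-party libraries below offer for the same task.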
In this article, I will recommend some of the best Python HTML parsing libraries for web scraping.<\/p>\n<hr\/>\n<h2 style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"BeautifulSoup_%E2%80%94_Beginner-Friendly_HTML_Parser\"><\/span><strong>BeautifulSoup \u2014 Beginner-Friendly HTML Parser<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<div class=\"su-youtube su-u-responsive-media-yes\">\n<div class=\"perfmatters-lazy-youtube\" data-src=\"https:\/\/www.youtube.com\/embed\/0uiPOxUcLgg\" data-id=\"0uiPOxUcLgg\" data-query=\"\" onclick=\"if (!window.__cfRLUnblockHandlers) return false; perfmattersLazyLoadYouTube(this);\" data-cf-modified-7c0158f8562363fb51369d22-=\"\">\n<div><img loading=\"lazy\" decoding=\"async\" class=\"perfmatters-lazy\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%20480%20360'%3E%3C\/svg%3E\" data-src=\"https:\/\/i.ytimg.com\/vi\/0uiPOxUcLgg\/hqdefault.jpg\" alt=\"YouTube video\" width=\"480\" height=\"360\" data-pin-nopin=\"true\"\/><\/p>\n<div class=\"play\"\/><\/div>\n<\/div>\n<p><noscript><iframe loading=\"lazy\" width=\"600\" height=\"400\" src=\"https:\/\/www.youtube.com\/embed\/0uiPOxUcLgg?\" frameborder=\"0\" allowfullscreen=\"\" allow=\"autoplay; encrypted-media; picture-in-picture\" title=\"\"\/><\/noscript><\/div>\n<p>BeautifulSoup has become the de facto HTML parsing tool for most beginners and even some advanced users. It is a widely used library for extracting data from both HTML and XML files. One thing you need to know about BeautifulSoup is that it is not really a parser in the strict sense. It is essentially an extraction tool: you specify your preferred parser, or it falls back to Python's built-in html.parser. So basically, it wraps a parser to provide you with data extraction support.<\/p>\n<p>However, it is loved for two main reasons. 
First, you can use it to parse and extract data from messy web pages with incorrect markup with little to no problems. Secondly, it is incredibly easy to learn and use, making it the first parser most people encounter and get familiar with while learning web scraping in Python. Below is an example of how it is used.<\/p>\n<pre>import requests\nfrom bs4 import BeautifulSoup\n\nx = requests.get(\"YOUR_WEB_TARGET\").content\nsoup = BeautifulSoup(x, \"html.parser\")\nlinks = soup.find_all(\"a\", {\"class\": \"internal_links\"})\nfor i in links:\n    print(i[\"href\"])<\/pre>\n<p>The above code will visit your URL of choice and collect all of the links with the class name \u201cinternal_links\u201d.<\/p>\n<hr\/>\n<h3><span class=\"ez-toc-section\" id=\"Pros_and_Cons_of_BeautifulSoup\"><\/span><strong>Pros and Cons of BeautifulSoup<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<div class=\"su-note\" style=\"border-color:#cce4e5;border-radius:3px;-moz-border-radius:3px;-webkit-border-radius:3px;\">\n<div class=\"su-note-inner su-u-clearfix su-u-trim\" style=\"background-color:#E6FEFF;border-color:#ffffff;color:#333333;border-radius:3px;-moz-border-radius:3px;-webkit-border-radius:3px;\">\n<div class=\"su-row\">\n<div class=\"su-column su-column-size-1-2\">\n<div class=\"su-column-inner su-u-clearfix su-u-trim\">\n<p><strong>Pros<\/strong><strong>:<\/strong><\/p>\n<div class=\"su-list\" style=\"margin-left:0px\">\n<ul>\n<li><i class=\"sui sui-check\" style=\"color:#41AC25\"\/> Beginner-friendly, with good documentation and large community support<\/li>\n<li><i class=\"sui sui-check\" style=\"color:#41AC25\"\/> It can be used to parse data even from badly written and malformed HTML or XML documents<\/li>\n<li><i class=\"sui sui-check\" style=\"color:#41AC25\"\/> Provides multiple methods of data extraction, such as select, find, and 
find_all<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<div class=\"su-column su-column-size-1-2\">\n<div class=\"su-column-inner su-u-clearfix su-u-trim\">\n<p><strong>Cons<\/strong><strong>:<\/strong><\/p>\n<div class=\"su-list\" style=\"margin-left:0px\">\n<ul>\n<li><i class=\"sui sui-minus\" style=\"color:#f0401d\"\/> Not one of the fastest options, and it becomes noticeably slower on large documents<\/li>\n<li><i class=\"sui sui-minus\" style=\"color:#f0401d\"\/> Does not support XPath selectors \u2014 only CSS selectors are supported.<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<hr\/>\n<h2 style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"Lxml_%E2%80%94_Fastest_Python_HTML_Parsing_Library\"><\/span><strong>Lxml \u2014 Fastest Python HTML Parsing Library<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<div class=\"su-youtube su-u-responsive-media-yes\">\n<div class=\"perfmatters-lazy-youtube\" data-src=\"https:\/\/www.youtube.com\/embed\/73isvEt3wCU\" data-id=\"73isvEt3wCU\" data-query=\"\" onclick=\"if (!window.__cfRLUnblockHandlers) return false; perfmattersLazyLoadYouTube(this);\" data-cf-modified-7c0158f8562363fb51369d22-=\"\">\n<div><img loading=\"lazy\" decoding=\"async\" class=\"perfmatters-lazy\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%20480%20360'%3E%3C\/svg%3E\" data-src=\"https:\/\/i.ytimg.com\/vi\/73isvEt3wCU\/hqdefault.jpg\" alt=\"YouTube video\" width=\"480\" height=\"360\" data-pin-nopin=\"true\"\/><\/p>\n<div class=\"play\"\/><\/div>\n<\/div>\n<p><noscript><iframe loading=\"lazy\" width=\"600\" height=\"400\" src=\"https:\/\/www.youtube.com\/embed\/73isvEt3wCU?\" frameborder=\"0\" allowfullscreen=\"\" allow=\"autoplay; encrypted-media; picture-in-picture\" title=\"\"\/><\/noscript><\/div>\n<p>The lxml parsing library is another popular option for parsing both HTML and XML documents. 
Unlike BeautifulSoup, which is built on top of a parser rather than being a full-fledged parser itself, lxml is a full-fledged parser, and you can even plug it into BeautifulSoup as its underlying parser if you need faster parsing and extraction. It is also usable standalone and is known for its fast and efficient parsing engine, which makes it ideal for large and complex projects. What makes it faster than the other parsers on this list is that it is built as a binding onto two C libraries \u2014 libxml2 and libxslt, which are highly optimized for speed and memory efficiency.<\/p>\n<p>While it is known to be highly fast and effective on large and complex documents, it is not the easiest option on this list to learn and use. In fact, it has the steepest learning curve of the three, and its usage is comparatively complex, which is why most people won\u2019t use it for simpler tasks. Below is an example showing how to use the lxml parsing library.<\/p>\n<pre>import requests\nfrom lxml import html\n\nurl = \"YOUR_SITE_TARGET\"\npath = '\/\/*[@id=\"pricing\"]'\nresponse = requests.get(url).content\nsource_code = html.fromstring(response)\nresults = source_code.xpath(path)\nprint(results)<\/pre>\n<hr\/>\n<h3><span class=\"ez-toc-section\" id=\"Pros_and_Cons_of_lxml\"><\/span><strong>Pros and Cons of lxml<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<h4><span class=\"ez-toc-section\" id=\"Pros\"><\/span><strong>Pros:<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h4>\n<ul>\n<li>Most efficient in terms of speed and memory usage<\/li>\n<li>Supports both CSS selectors and XPath<\/li>\n<li>Good documentation<\/li>\n<\/ul>\n<h4><span class=\"ez-toc-section\" id=\"Cons\"><\/span><strong>Cons:<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h4>\n<ul>\n<li>Difficult to 
learn and not beginner-friendly<\/li>\n<\/ul>\n<hr\/>\n<h2 style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"Requests-HTML_%E2%80%94_Best_for_Parsing_Dynamic_Web_Pages\"><\/span><strong>Requests-HTML \u2014 Best for Parsing Dynamic Web Pages<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<div class=\"su-youtube su-u-responsive-media-yes\">\n<div class=\"perfmatters-lazy-youtube\" data-src=\"https:\/\/www.youtube.com\/embed\/a6fIbtFB46g\" data-id=\"a6fIbtFB46g\" data-query=\"\" onclick=\"if (!window.__cfRLUnblockHandlers) return false; perfmattersLazyLoadYouTube(this);\" data-cf-modified-7c0158f8562363fb51369d22-=\"\">\n<div><img loading=\"lazy\" decoding=\"async\" class=\"perfmatters-lazy\" src=\"data:image\/svg+xml,%3Csvg%20xmlns='http:\/\/www.w3.org\/2000\/svg'%20viewBox='0%200%20480%20360'%3E%3C\/svg%3E\" data-src=\"https:\/\/i.ytimg.com\/vi\/a6fIbtFB46g\/hqdefault.jpg\" alt=\"YouTube video\" width=\"480\" height=\"360\" data-pin-nopin=\"true\"\/><\/p>\n<div class=\"play\"\/><\/div>\n<\/div>\n<p><noscript><iframe loading=\"lazy\" width=\"600\" height=\"400\" src=\"https:\/\/www.youtube.com\/embed\/a6fIbtFB46g?\" frameborder=\"0\" allowfullscreen=\"\" allow=\"autoplay; encrypted-media; picture-in-picture\" title=\"\"\/><\/noscript><\/div>\n<p>Both BeautifulSoup and lxml, though great for parsing, share one major limitation: they lack support for JavaScript execution. This means that if you want to scrape a page that loads its content dynamically using JavaScript, these tools won\u2019t be enough on their own. However, the Requests-HTML library has you covered in this regard. It is built on top of the Requests Python library but uses the Chromium web browser under the hood so that it is able to render JavaScript.<\/p>\n<p>It does not just use Chromium for rendering content; it also provides its own API for parsing and extraction. 
This tool is quite easy to use and supports both CSS selectors and XPath, making it quite versatile. Below is an example showing how to use Requests-HTML to parse data from web pages.<\/p>\n<pre>from requests_html import HTMLSession\n\nurl = 'YOUR_TARGET_WEBSITE'\nsession = HTMLSession()\nresponse = session.get(url)\na_links = response.html.find('.internal_link')\nfor a in a_links:\n    print(a.text)<\/pre>\n<hr\/>\n<h3><span class=\"ez-toc-section\" id=\"Pros_and_Cons_of_Requests_%E2%80%94_HTML\"><\/span><strong>Pros and Cons of Requests-HTML<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<h4><span class=\"ez-toc-section\" id=\"Pros-2\"><\/span><strong>Pros:<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h4>\n<ul>\n<li>Supports JavaScript rendering and parsing of dynamic content<\/li>\n<li>Simple to use, with an intuitive API<\/li>\n<li>Good documentation<\/li>\n<\/ul>\n<h4><span class=\"ez-toc-section\" id=\"Cons-2\"><\/span><strong>Cons:<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h4>\n<ul>\n<li>Noticeably slower than the other libraries mentioned<\/li>\n<li>Requires additional dependencies, including a Chromium download for rendering<\/li>\n<\/ul>\n<hr\/>\n<h2 style=\"text-align: center;\"><span class=\"ez-toc-section\" id=\"FAQs_About_Python_HTML_Parsing\"><\/span><strong>FAQs About Python HTML Parsing<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h2>\n<h3><span class=\"ez-toc-section\" id=\"Q_What_is_the_Best_HTML_Parsing_Library_for_Python\"><\/span><strong>Q. What is the Best HTML Parsing Library for Python?<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>There is no single best HTML parsing library for every use case, as requirements vary; the best choice depends on the specific task. 
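One common middle ground, sketched here under the assumption that both beautifulsoup4 and lxml are installed (the markup is made up for illustration), is to keep BeautifulSoup's friendly API while telling it to use lxml as its parser backend for extra speed:

```python
from bs4 import BeautifulSoup

markup = '<div><a class="internal_links" href="/a">A</a></div>'

# Passing "lxml" as the second argument makes BeautifulSoup delegate
# parsing to lxml's fast C parser instead of the stdlib html.parser.
soup = BeautifulSoup(markup, "lxml")
links = soup.find_all("a", {"class": "internal_links"})
print([a["href"] for a in links])  # ['/a']
```

The extraction code is unchanged; only the parser backend string differs, so switching backends is a one-word change.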
If what you need is an easy-to-use parser for regular pages and projects, then BeautifulSoup is the best. You can even use lxml as its parser to make it even faster. For complex projects that need to be efficient in terms of space (memory) and time (speed), lxml is the best choice. If you are dealing with a dynamic web page that renders content with JavaScript, then Requests-HTML is the best.<\/p>\n<h3><span class=\"ez-toc-section\" id=\"Q_Does_Parsing_of_HTML_Take_Time\"><\/span><strong>Q. Does Parsing of HTML Take Time?<\/strong><span class=\"ez-toc-section-end\"><\/span><\/h3>\n<p>Parsing and extracting data from HTML is one of the bottlenecks you should consider when estimating how long it will take to scrape data from the web. As stated earlier, lxml is the fastest, followed by BeautifulSoup and then Requests-HTML. Even with lxml, the time taken to parse individual pages can quickly add up if you have to scrape hundreds of thousands of pages. So yes, parsing does take time, especially for large projects, and that is why you will want to choose the fastest parser available to you.<\/p>\n<hr\/>\n<h2 style=\"text-align: center;\"><strong>Conclusion<\/strong><\/h2>\n<p>The above 3 libraries are just a few of the parsing libraries available to you as a Python developer for web scraping. There are a good number of other parsing libraries, each with its strengths and weaknesses. However, the ones discussed above are the best and most popular for their specific use cases.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Are you looking for the best HTML parsing method and tool to use in your Python web scraping projects? Then the article below has been written for you as I compared 3 of the popular HTML parsing libraries. 
Being able to evade detection in order to access a web resource on a remote server and &#8230; <a title=\"Python HTML Parsing: The Best Python Libraries for HTML Parsing\" class=\"read-more\" href=\"http:\/\/royadata.io\/blog\/python-html-parsing\/\" aria-label=\"More on Python HTML Parsing: The Best Python Libraries for HTML Parsing\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":85,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/posts\/5898"}],"collection":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/comments?post=5898"}],"version-history":[{"count":0,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/posts\/5898\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/media\/85"}],"wp:attachment":[{"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/media?parent=5898"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/categories?post=5898"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/royadata.io\/blog\/wp-json\/wp\/v2\/tags?post=5898"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}