The fast, optimal, and correct HTML & XML parsing library
DOCUMENTATION: https://awolverp.github.io/markupever
SOURCE CODE: https://github.com/awolverp/markupever
Warning
This documentation is incomplete. Documenting everything takes a while.
MarkupEver is a modern, high-performance HTML and XML parsing library written in Rust.
- Fast
High performance, thanks to html5ever and selectors.
- Easy To Use
Designed to be easy to use and learn. Completion everywhere.
- Low Memory Usage
Uses memory efficiently, thanks to Rust's memory allocator, and avoids memory leaks.
- Your CSS Knowledge
Use your CSS knowledge to select elements from HTML or XML documents.
Installation
You can install MarkupEver using pip:
pip install markupever
Use Virtual Environments
It's recommended to use virtual environments for installing and managing libraries in Python.
Examples
Parsing & Scraping
Parsing HTML content and selecting elements:
Imagine this index.html file:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Example Document</title>
</head>
<body>
<h1 id="title">Welcome to My Page</h1>
<p>This page has a link and an image.</p>
<a href="https://www.example.com">Visit Example.com</a>
<br>
<img src="https://www.example.com/image.jpg" alt="My Image">
<a href="https://www.google.com">Visit Google</a>
<a>No Link</a>
</body>
</html>
We want to extract the href attributes from it, and there are three ways to do this:
You can parse HTML/XML content with the parse() function.
import markupever

with open("index.html", "rb") as fd:  # (2)
    dom = markupever.parse(fd.read(), markupever.HtmlOptions())  # (1)

for element in dom.select("a[href]"):
    print(element.attrs["href"])
- (1) Use HtmlOptions() for HTML documents and XmlOptions() for XML documents. If you use the wrong one, don't worry: it won't disrupt the process. These options specify namespaces and other differences between XML and HTML, and provide distinct features for each type.
- (2) Opening files in "rb" mode is recommended but not required; "r" mode also works.
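When the markup is already in memory (a downloaded response body, for instance), the same parse() call works without opening a file. The sketch below wraps it in a small helper. The names extract_hrefs and SAMPLE are illustrative only and are not part of MarkupEver; the helper relies solely on the parse(), HtmlOptions(), select(), and attrs calls shown above.

```python
def extract_hrefs(html):
    """Return the href value of every <a> element that has one."""
    # Third-party dependency, imported inside the helper (pip install markupever).
    import markupever

    dom = markupever.parse(html, markupever.HtmlOptions())
    return [element.attrs["href"] for element in dom.select("a[href]")]


# Hypothetical in-memory document; bytes and str are both accepted by parse().
SAMPLE = b"""<!DOCTYPE html>
<html><body>
<a href="https://www.example.com">Visit Example.com</a>
<a>No Link</a>
</body></html>"""

# Usage, with markupever installed:
#     for href in extract_hrefs(SAMPLE):
#         print(href)
```

Returning a list instead of printing makes the result easy to reuse, for example to deduplicate or filter the links afterwards.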
You can parse HTML/XML content from files with the parse_file() function.
import markupever

dom = markupever.parse_file("index.html", markupever.HtmlOptions())  # (1)

for element in dom.select("a[href]"):
    print(element.attrs["href"])
- (1) Use HtmlOptions() for HTML documents and XmlOptions() for XML documents. If you use the wrong one, don't worry: it won't disrupt the process. These options specify namespaces and other differences between XML and HTML, and provide distinct features for each type.
The parse() and parse_file() functions are shorthand for using the Parser class, which you can also use directly. It lets you stream input through the process() method, so a large input does not have to be read into memory all at once.
import markupever

parser = markupever.Parser(markupever.HtmlOptions())  # (1)

with open("index.html", "rb") as fd:  # (2)
    for line in fd:  # Read line by line (3)
        parser.process(line)

parser.finish()
dom = parser.into_dom()

for element in dom.select("a[href]"):
    print(element.attrs["href"])
- (1) Use HtmlOptions() for HTML documents and XmlOptions() for XML documents. If you use the wrong one, don't worry: it won't disrupt the process. These options specify namespaces and other differences between XML and HTML, and provide distinct features for each type.
- (2) Opening files in "rb" mode is recommended but not required; "r" mode also works.
- (3) You can also read the file all at once and pass it to process(). The file is split into lines here to show what Parser can do.
Then run main.py to see the result:
https://www.example.com
https://www.google.com
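Reading line by line is only one way to feed the Parser. For documents with very long lines (minified HTML, for instance), fixed-size chunks keep each process() call bounded. The following is a minimal sketch under stated assumptions: iter_text_chunks and parse_in_chunks are hypothetical helper names, and the stream is read in text mode so that a chunk boundary never falls inside a multi-byte character. Only Parser, process(), finish(), and into_dom() are taken from the example above.

```python
import io


def iter_text_chunks(stream, chunk_size=64 * 1024):
    """Yield successive chunks of at most chunk_size characters from a text stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty string signals the end of the stream
            return
        yield chunk


def parse_in_chunks(stream, chunk_size=64 * 1024):
    """Feed a text stream to markupever.Parser chunk by chunk and return the DOM."""
    # Third-party dependency, imported here so iter_text_chunks stays dependency-free.
    import markupever

    parser = markupever.Parser(markupever.HtmlOptions())
    for chunk in iter_text_chunks(stream, chunk_size):
        parser.process(chunk)
    parser.finish()
    return parser.into_dom()


# The chunking itself is plain Python and can be checked without the parser:
print(list(iter_text_chunks(io.StringIO("<p>Hello</p>"), chunk_size=5)))
# prints ['<p>He', 'llo</', 'p>']
```

With a real file, this could be used as: open("index.html", "r", encoding="utf-8") as fd, then dom = parse_in_chunks(fd). The chunk size is a tuning knob, not a MarkupEver setting.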
Creating Documents
There is also a structure called TreeDom (1). You can work with it directly and generate documents (such as HTML and XML) very easily.
- (1) A tree structure specially designed for HTML and XML documents. It uses Rust's Vec type in the backend. The memory consumed by a TreeDom is dynamic and depends on the number of tokens stored in the tree. Allocated memory never shrinks and is released only when the TreeDom is dropped.
from markupever import dom

tree = dom.TreeDom()  # named "tree" so the imported dom module is not shadowed
root: dom.Document = tree.root()

root.create_doctype("html")

html = root.create_element("html", {"lang": "en"})
body = html.create_element("body")
body.create_text("Hello Everyone ...")

print(root.serialize())
# <!DOCTYPE html><html lang="en"><body>Hello Everyone ...</body></html>
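The same calls can be driven from ordinary Python data. As a hedged sketch, the helper below builds a list of links from (text, url) pairs. build_link_list is a hypothetical name; it uses only TreeDom(), root(), create_doctype(), create_element(), create_text(), and serialize() as they appear in the example above, and assumes create_element() behaves the same on nested elements as it does on html and body.

```python
def build_link_list(links):
    """Build a small HTML document with one <li><a> entry per (text, url) pair."""
    # Third-party dependency, imported inside the helper (pip install markupever).
    from markupever import dom

    tree = dom.TreeDom()
    root = tree.root()
    root.create_doctype("html")

    html = root.create_element("html", {"lang": "en"})
    body = html.create_element("body")
    ul = body.create_element("ul")

    for text, url in links:
        item = ul.create_element("li")
        anchor = item.create_element("a", {"href": url})
        anchor.create_text(text)

    return root.serialize()


# Usage, with markupever installed:
#     print(build_link_list([("Visit Example.com", "https://www.example.com")]))
```

Building nodes through the tree API rather than by string concatenation leaves escaping and well-formedness to the serializer.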
Performance
This library is designed with a strong focus on performance and speed. It's written in Rust and avoids the use of unsafe code blocks.
I have compared MarkupEver with BeautifulSoup and Parsel (which uses lxml directly):
Benchmarks
System
The benchmarks were run on: Manjaro Linux x86_64, 8 GB RAM, Intel i3-1115G4
| Parsing | Min | Max | Avg |
|---|---|---|---|
| markupever | 4907µs | 4966µs | 4927µs |
| markupever (exact_errors) | 8920µs | 9172µs | 8971µs |
| beautifulsoup4 (html.parser) | 35283µs | 36460µs | 35828µs |
| beautifulsoup4 (lxml) | 22576µs | 23092µs | 22809µs |
| parsel | 3937µs | 4147µs | 4072µs |

| Selecting (CSS) | Min | Max | Avg |
|---|---|---|---|
| markupever | 308µs | 314µs | 310µs |
| beautifulsoup4 | 2936µs | 3074µs | 2995µs |
| parsel | 159µs | 165µs | 161µs |

| Serializing | Min | Max | Avg |
|---|---|---|---|
| markupever | 1932µs | 1973µs | 1952µs |
| beautifulsoup4 | 14705µs | 15021µs | 14900µs |
| parsel | 1264µs | 1290µs | 1276µs |
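This page does not say how the timings were collected. Purely as an illustration of how Min, Max, and Avg columns like these can be produced, here is a small harness built on time.perf_counter. The measure function, the DOCUMENT string, and the use of the standard library's html.parser as a stand-in workload are all assumptions made for this sketch, not the author's benchmark code.

```python
import time
from html.parser import HTMLParser


def measure(func, repeat=5):
    """Call func() repeat times and return (min, max, avg) in microseconds."""
    samples = []
    for _ in range(repeat):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1_000_000)
    return min(samples), max(samples), sum(samples) / len(samples)


# Hypothetical test document: the same paragraph repeated many times.
DOCUMENT = "<html><body>" + "<p>Hello <a href='#'>link</a></p>" * 1000 + "</body></html>"


def parse_with_stdlib():
    # Stand-in workload so the sketch runs without third-party packages.
    parser = HTMLParser()
    parser.feed(DOCUMENT)
    parser.close()


low, high, avg = measure(parse_with_stdlib)
print(f"html.parser: min={low:.0f} max={high:.0f} avg={avg:.0f} (microseconds)")
```

To time MarkupEver the same way, pass measure a function that calls markupever.parse() on the document. Absolute numbers will differ from the tables above, since they depend on the machine and the input.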
Summary
Parsel is the fastest library (actually, lxml is) and is designed specifically for scraping, but it offers less control over the document.
BeautifulSoup is the slowest (and oldest) library, and it provides full control over the document.
MarkupEver sits between these two. In the benchmarks above it stays within about 2x of Parsel, is roughly 5 to 10 times faster than BeautifulSoup, and offers full control over the document.
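To put "close to Parsel" into numbers, the average timings from the benchmark tables can be expressed relative to MarkupEver. The sketch below is plain arithmetic on the published averages; relative_to and AVERAGES_US are names chosen for this illustration.

```python
# Average timings in microseconds, copied from the benchmark tables above.
AVERAGES_US = {
    "Parsing": {
        "markupever": 4927,
        "markupever (exact_errors)": 8971,
        "beautifulsoup4 (html.parser)": 35828,
        "beautifulsoup4 (lxml)": 22809,
        "parsel": 4072,
    },
    "Selecting (CSS)": {"markupever": 310, "beautifulsoup4": 2995, "parsel": 161},
    "Serializing": {"markupever": 1952, "beautifulsoup4": 14900, "parsel": 1276},
}


def relative_to(baseline, timings):
    """Return each timing divided by the baseline's timing, rounded to 2 places."""
    base = timings[baseline]
    return {
        name: round(value / base, 2)
        for name, value in timings.items()
        if name != baseline
    }


for task, timings in AVERAGES_US.items():
    print(task)
    for name, ratio in relative_to("markupever", timings).items():
        print(f"  {name}: {ratio}x the time of markupever")
```

Values above 1 mean slower than MarkupEver and values below 1 mean faster. For parsing, Parsel comes out at 0.83 and BeautifulSoup with lxml at 4.63; for CSS selection, Parsel is at 0.52 and BeautifulSoup at 9.66.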
Memory Usage
MarkupEver is written in Rust and uses the Rust allocator. Like other libraries written in C and other low-level languages, it uses very little memory, so you can work with huge documents with ease.
License
This project is licensed under the terms of the MPL-2.0 license.