MarkupEver

A fast, efficient, and correct HTML & XML parsing library


DOCUMENTATION: https://awolverp.github.io/markupever

SOURCE CODE: https://github.com/awolverp/markupever


Warning

This documentation is incomplete. Documenting everything takes a while.

MarkupEver is a modern, fast (high-performance) HTML & XML parsing library written in Rust.

  • Fast

    High performance, thanks to html5ever and selectors.

    Benchmarks

  • Easy To Use

    Designed to be easy to use and learn. Completion everywhere.

    Examples

  • Low Memory Usage

    Efficient memory usage, thanks to Rust's memory allocator, with no memory leaks.

    Memory Usage

  • Your CSS Knowledge

    Use your CSS knowledge to select elements from HTML or XML documents.

    Querying

Installation

You can install MarkupEver using pip:

$ pip3 install markupever

Use Virtual Environments

It's recommended to use virtual environments for installing and managing libraries in Python.

Linux (venv):
$ python3 -m venv venv
$ source venv/bin/activate

Linux (virtualenv):
$ virtualenv venv
$ source venv/bin/activate

Windows (venv):
$ python3 -m venv venv
$ venv\Scripts\activate

Windows (virtualenv):
$ virtualenv venv
$ venv\Scripts\activate
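
If you are unsure whether an environment is active before running pip, a quick check from Python can help. The sketch below uses only the standard library; the helper name in_virtualenv is made up for illustration and is not part of MarkupEver.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Environments created by venv (and by current virtualenv releases) leave
    # sys.base_prefix pointing at the original interpreter, so it differs from
    # sys.prefix. Legacy virtualenv releases set sys.real_prefix instead.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base


print("virtual environment active:", in_virtualenv())
```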

Examples

Parsing & Scraping

Parsing HTML content and selecting elements:

Imagine this index.html file:

index.html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Example Document</title>
</head>
<body>
    <h1 id="title">Welcome to My Page</h1>
    <p>This page has a link and an image.</p>
    <a href="https://www.example.com">Visit Example.com</a>
    <br>
    <img src="https://www.example.com/image.jpg" alt="My Image">
    <a href="https://www.google.com">Visit Google</a>
    <a>No Link</a>
</body>
</html>

We want to extract the href attributes from it, and we have three ways to achieve this:

You can parse HTML/XML content with the parse() function.

main.py
import markupever
with open("index.html", "rb") as fd: # (2)!
    dom = markupever.parse(fd.read(), markupever.HtmlOptions()) # (1)!

for element in dom.select("a[href]"):
    print(element.attrs["href"])
  1. Use HtmlOptions() for HTML documents and XmlOptions() for XML documents. If used incorrectly, don't worry—it won't disrupt the process. These options specify namespaces and other differences between XML and HTML, while also providing distinct features for each type.

  2. Opening files in "rb" mode is recommended but not required; "r" mode also works.

You can parse HTML/XML content from files with the .parse_file() function.

main.py
import markupever
dom = markupever.parse_file("index.html", markupever.HtmlOptions()) # (1)!

for element in dom.select("a[href]"):
    print(element.attrs["href"])
  1. Use HtmlOptions() for HTML documents and XmlOptions() for XML documents. If used incorrectly, don't worry—it won't disrupt the process. These options specify namespaces and other differences between XML and HTML, while also providing distinct features for each type.

The .parse() and .parse_file() functions are shorthand for using the .Parser class. However, you can also use the class directly. It's designed to allow you to stream input using the .process() method, so you don't have to worry about the memory usage of large inputs.

main.py
import markupever
parser = markupever.Parser(markupever.HtmlOptions()) # (1)!

with open("index.html", "rb") as fd: # (2)!
    for line in fd: # Read line by line (3)
        parser.process(line)

parser.finish()
dom = parser.into_dom()

for element in dom.select("a[href]"):
    print(element.attrs["href"])
  1. Use HtmlOptions() for HTML documents and XmlOptions() for XML documents. If used incorrectly, don't worry—it won't disrupt the process. These options specify namespaces and other differences between XML and HTML, while also providing distinct features for each type.

  2. Opening files in "rb" mode is recommended but not required; "r" mode also works.

  3. You can read the file all at once and pass it to the process function. We have broken the file into lines here to show you the Parser's abilities.
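
Reading line by line works well for formatted HTML, but a minified document may be a single very long line. A possible alternative is to feed fixed-size blocks instead. The sketch below is a generic helper, not part of MarkupEver: feed_in_chunks and its chunk_size parameter are hypothetical names, and a plain list stands in for parser.process so the example runs without the library installed. It assumes that .process() tolerates block boundaries that fall in the middle of a tag or a multi-byte character, which this page does not state.

```python
import io
from typing import BinaryIO, Callable


def feed_in_chunks(
    fd: BinaryIO,
    process: Callable[[bytes], object],
    chunk_size: int = 64 * 1024,
) -> int:
    """Read fd in fixed-size blocks, pass each block to process, return bytes fed."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    total = 0
    # iter() with a sentinel stops at the first empty read, i.e. end of file.
    for chunk in iter(lambda: fd.read(chunk_size), b""):
        process(chunk)
        total += len(chunk)
    return total


# Stand-in for parser.process (hypothetical wiring): collect the blocks in a list.
received = []
data = b"<p>" + b"x" * 10 + b"</p>"
fed = feed_in_chunks(io.BytesIO(data), received.append, chunk_size=4)
print(fed, len(received))
```

With the Parser from the example above, the loop over lines could in principle become feed_in_chunks(fd, parser.process), followed by parser.finish() as before.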

Then run main.py to see the result:

$ python3 main.py
https://www.example.com
https://www.google.com
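
Only two lines are printed because the selector a[href] matches anchors that carry an href attribute, so <a>No Link</a> is skipped. As a cross-check of that expectation, the sketch below collects the same values with Python's standard html.parser module. It is not MarkupEver code and says nothing about how MarkupEver works internally; the class name HrefCollector is made up for illustration.

```python
from html.parser import HTMLParser

HTML = """<!DOCTYPE html>
<html lang="en">
<body>
    <a href="https://www.example.com">Visit Example.com</a>
    <img src="https://www.example.com/image.jpg" alt="My Image">
    <a href="https://www.google.com">Visit Google</a>
    <a>No Link</a>
</body>
</html>"""


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> start tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href":
                self.hrefs.append(value)
                break  # one value per element, like one match per element


collector = HrefCollector()
collector.feed(HTML)
collector.close()
print(collector.hrefs)
# ['https://www.example.com', 'https://www.google.com']
```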

Creating Documents

There is also a structure called TreeDom (1). You can work with it directly to generate documents (such as HTML and XML) very easily.

  1. A tree structure specially designed for HTML and XML documents. It uses Rust's Vec type as its backend. The memory consumed by a TreeDom is dynamic and depends on the number of tokens stored in the tree. Allocated memory never shrinks and is released only when the TreeDom is dropped.
from markupever import dom

# Use a name other than "dom" so the imported module is not shadowed.
tree = dom.TreeDom()
root: dom.Document = tree.root()

root.create_doctype("html")

html = root.create_element("html", {"lang": "en"})
body = html.create_element("body")
body.create_text("Hello Everyone ...")

print(root.serialize())
# <!DOCTYPE html><html lang="en"><body>Hello Everyone ...</body></html>

Performance

This library is designed with a strong focus on performance and speed. It's written in Rust and avoids the use of unsafe code blocks.

I have compared MarkupEver with BeautifulSoup and Parsel (which directly uses lxml):

Benchmarks

System

The system on which the benchmarks were run: Manjaro Linux x86_64, 8 GB RAM, Intel i3-1115G4

Parsing                         Min        Max        Avg
markupever                      4907µs     4966µs     4927µs
markupever (exact_errors)       8920µs     9172µs     8971µs
beautifulsoup4 (html.parser)    35283µs    36460µs    35828µs
beautifulsoup4 (lxml)           22576µs    23092µs    22809µs
parsel                          3937µs     4147µs     4072µs

Selecting (CSS)                 Min        Max        Avg
markupever                      308µs      314µs      310µs
beautifulsoup4                  2936µs     3074µs     2995µs
parsel                          159µs      165µs      161µs

Serializing                     Min        Max        Avg
markupever                      1932µs     1973µs     1952µs
beautifulsoup4                  14705µs    15021µs    14900µs
parsel                          1264µs     1290µs     1276µs

Summary

Parsel is the fastest library (strictly speaking, lxml is, since Parsel builds on it) and is specially designed for scraping, but it offers less control over the document. BeautifulSoup is the slowest (and oldest) library, and it provides full control over the document.

MarkupEver sits between these two. It is fast, taking about 1.2x to 1.9x Parsel's average times in the tables above, and it offers full control over the document.
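
To make the comparison concrete, the averages from the benchmark tables can be expressed as multiples of MarkupEver's time. The short script below only does arithmetic on the published numbers; it does not rerun any benchmark, and the helper name relative_to is made up for illustration.

```python
# Average timings in microseconds, copied from the benchmark tables above.
AVERAGES_US = {
    "Parsing": {
        "markupever": 4927,
        "markupever (exact_errors)": 8971,
        "beautifulsoup4 (html.parser)": 35828,
        "beautifulsoup4 (lxml)": 22809,
        "parsel": 4072,
    },
    "Selecting (CSS)": {"markupever": 310, "beautifulsoup4": 2995, "parsel": 161},
    "Serializing": {"markupever": 1952, "beautifulsoup4": 14900, "parsel": 1276},
}


def relative_to(baseline, averages=AVERAGES_US):
    """Return each library's average time as a multiple of the baseline's."""
    return {
        task: {name: round(avg / rows[baseline], 2) for name, avg in rows.items()}
        for task, rows in averages.items()
    }


ratios = relative_to("markupever")
for task, rows in ratios.items():
    for name, ratio in rows.items():
        print(f"{task:<16} {name:<30} {ratio:.2f}x")
```

Values below 1.00x are faster than MarkupEver on that task, and values above 1.00x are slower.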

Memory Usage

This library is written in Rust and uses the Rust allocator. Like other libraries written in C and other low-level languages, it uses very little memory, so it can handle huge documents with ease.

License

This project is licensed under the terms of the MPL-2.0 license.