Getting started
The first thing expected from a library like this is the ability to read HTML, XML, and similar documents.
MarkupEver is designed specifically for reading, parsing, and repairing HTML and XML documents (it can also parse similar documents).
MarkupEver provides two functions, .parse() and .parse_file(), and one class, Parser, for doing that.
Additionally, they have special features that distinguish this library from others:
- You don't need to worry about high memory usage.
- You can read and parse documents part by part (such as files, streams, ...).
- You can specify parsing options (with the HtmlOptions() and XmlOptions() classes).
- You can repair invalid documents automatically.
Parsing HTML
Imagine this index.html file:
<!DOCTYPE html>
<html>
<head>
<title>Incomplete Html</title>
</head>
<body>
<ul>
<li><a href="https://www.example.com">Example Website</a></li>
<li><a href="https://www.wikipedia.org">Wikipedia</a></li>
<li><a href="https://www.bbc.com">BBC</a></li>
<li><a href="https://www.microsoft.com">Microsoft</a></li>
</ul>
We can use the .parse() and .parse_file() functions to parse documents.
The Difference
The .parse_file() function takes a BinaryIO, a TextIO, or a file path and parses it chunk by chunk, whereas the .parse() function takes the whole document content at once. For this reason, .parse_file() is usually a better choice than .parse(), especially for large documents.
Let's use them:
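Here is a minimal sketch of both calls for index.html. It assumes that the options object is passed as the second positional argument and that .parse() accepts a str; parse_html_file and parse_html_text are hypothetical helper names, not part of MarkupEver.

```python
def parse_html_file(path):
    """Parse an HTML file chunk by chunk with .parse_file()."""
    # Imported here so markupever is only required when the helper is called.
    import markupever

    # Assumption: the options object is the second positional argument.
    return markupever.parse_file(path, markupever.HtmlOptions())


def parse_html_text(path, encoding="utf-8"):
    """Read the whole file into memory, then parse it at once with .parse()."""
    import markupever

    with open(path, encoding=encoding) as fd:
        content = fd.read()  # the full document is held in memory here
    # Assumption: .parse() accepts a str as its first argument.
    return markupever.parse(content, markupever.HtmlOptions())
```

For example, dom = parse_html_file("index.html") would produce the dom object used in the navigation examples.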
That's it: we parsed the index.html file and now have a TreeDom instance. We can navigate it:
root = dom.root() # Get root node
root
# Document
title = root.select_one("title") # Accepts CSS selectors
title.name
# QualName(local="title", ns="http://www.w3.org/1999/xhtml", prefix=None)
title.serialize()
# '<title>Incomplete Html</title>'
title.text()
# 'Incomplete Html'
title.parent.name
# QualName(local="head", ns="http://www.w3.org/1999/xhtml", prefix=None)
ul = root.select_one("ul")
ul.serialize()
# <ul>
# <li><a href="https://www.example.com">Example Website</a></li>
# <li><a href="https://www.wikipedia.org">Wikipedia</a></li>
# <li><a href="https://www.bbc.com">BBC</a></li>
# <li><a href="https://www.microsoft.com">Microsoft</a></li>
# </ul>
Common task
One common task is extracting all links from a page:
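A minimal sketch of link extraction, using .select(), the multi-match counterpart of .select_one(). It assumes each matched element exposes its attributes through a mapping-like attrs object, so that attrs["href"] returns the attribute value; extract_links is a hypothetical helper name.

```python
def extract_links(root):
    """Return the href value of every <a> element that has one."""
    links = []
    # "a[href]" is a CSS selector: <a> elements carrying an href attribute.
    for anchor in root.select("a[href]"):
        # Assumption: attrs supports lookup by attribute name.
        links.append(anchor.attrs["href"])
    return links
```

Call it as extract_links(dom.root()) once the document has been parsed.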
Additionally, if you serialize the parsed DOM, you'll see that the incomplete HTML has been repaired:
root.serialize()
# <!DOCTYPE html><html><head>
# <title>Incomplete Html</title>
# </head>
# <body>
# <ul>
# <li><a href="https://www.example.com">Example Website</a></li>
# <li><a href="https://www.wikipedia.org">Wikipedia</a></li>
# <li><a href="https://www.bbc.com">BBC</a></li>
# <li><a href="https://www.microsoft.com">Microsoft</a></li>
# </ul>
# </body></html>
Parsing XML
Imagine this file.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<bookstore xmlns:bk="http://www.example.com/books" xmlns:mag="http://www.example.com/magazines">
<bk:book>
<bk:title>Programming for Beginners</bk:title>
<bk:author>Jane Doe</bk:author>
<bk:year>2021</bk:year>
</bk:book>
<mag:magazine>
<mag:title>Technology Monthly</mag:title>
<mag:publisher>Tech Publishers</mag:publisher>
<mag:month>March</mag:month>
</mag:magazine>
</bookstore>
Let's use the .parse() or .parse_file() function to parse it (we explained them earlier):
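A minimal sketch for the XML case. As described earlier, .parse_file() takes a file path, a BinaryIO, or a TextIO, so the same helper can serve both a path and an already-open file. It assumes the options object is the second positional argument; parse_xml_source is a hypothetical helper name.

```python
def parse_xml_source(source):
    """Parse XML from a file path or an open file object, chunk by chunk."""
    # Imported here so markupever is only required when the helper is called.
    import markupever

    # XmlOptions() selects XML parsing rules instead of HTML ones.
    # Assumption: the options object is the second positional argument.
    return markupever.parse_file(source, markupever.XmlOptions())
```

For example, dom = parse_xml_source("file.xml") parses from a path, and passing open("file.xml", "rb") parses from a binary stream.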
That's it: we parsed the file.xml file and now have a TreeDom instance. We can navigate it as we did for the HTML document above:
root = dom.root() # Get root node
root
# Document
root.select_one("bookstore")
# Element(name=QualName(local="bookstore"), attrs=[], template=false, mathml_annotation_xml_integration_point=false)
for i in root.select("mag|*"): # Get all elements in the 'mag' namespace
    print(i)
# Element(name=QualName(local="magazine", ns="http://www.example.com/magazines", prefix=Some("mag")), attrs=[], template=false, mathml_annotation_xml_integration_point=false)
# Element(name=QualName(local="title", ns="http://www.example.com/magazines", prefix=Some("mag")), attrs=[], template=false, mathml_annotation_xml_integration_point=false)
# Element(name=QualName(local="publisher", ns="http://www.example.com/magazines", prefix=Some("mag")), attrs=[], template=false, mathml_annotation_xml_integration_point=false)
# Element(name=QualName(local="month", ns="http://www.example.com/magazines", prefix=Some("mag")), attrs=[], template=false, mathml_annotation_xml_integration_point=false)
book = root.select_one("book")
book.serialize()
# <bk:book xmlns:bk="http://www.example.com/books">
# <bk:title>Programming for Beginners</bk:title>
# <bk:author>Jane Doe</bk:author>
# <bk:year>2021</bk:year>
# </bk:book>
Using Parser
The functions .parse() and .parse_file(), which you became familiar with earlier, internally use the Parser class, which actually does the parsing. In this part we will learn about the Parser class.
The Parser class is an HTML/XML parser, ready to receive Unicode input. It is very easy to use and allows you to stream input using the .process() method. This way, you don't have to worry about the memory usage of large inputs.
As mentioned for the options parameter of .parse() and .parse_file(), if your input is an HTML document, pass HtmlOptions(); if your input is an XML document, pass XmlOptions().
To start, create an instance of the Parser class. Then, use the Parser.process() method to send content for parsing. You can call this method as many times as you want (it's thread-safe). When your input is finished, call the Parser.finish() method to mark the parser as finished.
import markupever
# Create Parser
parser = markupever.Parser(options=markupever.HtmlOptions())
# Process contents
parser.process("... content 1 ...")
parser.process("... content 2 ...")
parser.process("... content 3 ...")
# Mark as finished
parser.finish()
That's it! Your HTML document parsing is now complete. The Parser class has several methods and attributes that tell you about the parsed content, such as the lineno property, the quirks_mode property, and the errors() method.
The quirks_mode property returns the quirks mode (it is always QUIRKS_MODE_OFF for XML). See the Wikipedia article on quirks mode for more information.
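A short sketch of reading these three members from a finished parser. It assumes errors() returns an iterable of error messages; summarize_parser is a hypothetical helper name.

```python
def summarize_parser(parser):
    """Collect diagnostic information from a parser after .finish()."""
    return {
        "lineno": parser.lineno,            # property
        "quirks_mode": parser.quirks_mode,  # property
        # Assumption: errors() returns an iterable of error messages.
        "errors": list(parser.errors()),
    }
```

Call it after parser.finish() and before the parser is converted into a TreeDom.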
You can use these properties and methods before calling the Parser.into_dom() method. The Parser.into_dom() method converts the parser into a TreeDom and releases its allocated memory.
import markupever
parser = markupever.Parser(options=markupever.HtmlOptions())
parser.process("... content 1 ...")
parser.process("... content 2 ...")
parser.process("... content 3 ...")
parser.finish()
# Use `.errors()`, `.lineno`, or `.quirks_mode` if you want
dom = parser.into_dom()
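The same steps can be wrapped in a loop that reads a file or stream in fixed-size chunks, so the whole input is never held in memory at once. This is only a sketch: parse_stream and the default chunk_size are hypothetical choices, and it relies solely on the .process(), .finish(), and .into_dom() methods described above.

```python
def parse_stream(parser, stream, chunk_size=64 * 1024):
    """Feed `stream` to `parser` in fixed-size chunks and return the DOM."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # an empty str/bytes means the stream is exhausted
            break
        parser.process(chunk)
    parser.finish()  # mark the parser as finished
    return parser.into_dom()


# Usage (requires markupever to be installed):
#   import markupever
#   with open("index.html", encoding="utf-8") as fd:
#       parser = markupever.Parser(options=markupever.HtmlOptions())
#       dom = parse_stream(parser, fd)
```

Taking the parser as an argument lets the caller choose HtmlOptions() or XmlOptions() and inspect the parser before it is converted.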
More about options
We have two structures for parsing options: HtmlOptions() and XmlOptions().
Use HtmlOptions() for HTML documents and XmlOptions() for XML documents. If you use the wrong one, don't worry: it won't disrupt the process. These options specify namespaces and other differences between XML and HTML, while also providing distinct features for each type.
HtmlOptions parameters
Let's see what parameters we have:
- full_document: Is this a complete document? Default: True.
- exact_errors: Report all parse errors described in the spec, at some performance penalty? Default: False.
- discard_bom: Discard a U+FEFF BYTE ORDER MARK if we see one at the beginning of the stream? Default: False.
- profile: Keep a record of how long we spent in each state? Printed when finish() is called. Default: False.
- iframe_srcdoc: Is this an iframe srcdoc document? Default: False.
- drop_doctype: Should we drop the DOCTYPE (if any) from the tree? Default: False.
- quirks_mode: Initial TreeBuilder quirks mode. Default: markupever.QUIRKS_MODE_OFF.
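As a sketch of how these parameters might be passed, the helper below builds an HtmlOptions object from keyword arguments and rejects misspelled names early. It assumes HtmlOptions accepts the parameters listed above as keyword arguments; make_html_options and HTML_OPTION_NAMES are hypothetical names.

```python
# Parameter names taken from the list above.
HTML_OPTION_NAMES = frozenset({
    "full_document", "exact_errors", "discard_bom", "profile",
    "iframe_srcdoc", "drop_doctype", "quirks_mode",
})


def make_html_options(**overrides):
    """Build HtmlOptions, rejecting names that are not documented above."""
    unknown = set(overrides) - HTML_OPTION_NAMES
    if unknown:
        raise TypeError(f"unknown HtmlOptions parameter(s): {sorted(unknown)}")
    # Imported here so markupever is only required when the helper is called.
    import markupever

    # Assumption: HtmlOptions accepts these parameters as keyword arguments.
    return markupever.HtmlOptions(**overrides)
```

For example, make_html_options(full_document=False, drop_doctype=True) would describe an incomplete document whose DOCTYPE is dropped from the tree.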
XmlOptions parameters
Let's see what parameters we have:
- exact_errors: Report all parse errors described in the spec, at some performance penalty? Default: False.
- discard_bom: Discard a U+FEFF BYTE ORDER MARK if we see one at the beginning of the stream? Default: False.
- profile: Keep a record of how long we spent in each state? Printed when finish() is called. Default: False.