APUI HTML Parser with Gumbo Integration
This module provides HTML parsing for the APUI framework using the Gumbo HTML5 parser. It converts parsed HTML into APUI DOM objects, so the result can be used with the existing DOM system like any other document.
Features
- HTML5-Compliant Parsing: Uses Google's Gumbo parser, which implements the HTML5 parsing algorithm
- DOM Integration: Converts parsed HTML directly into APUI DOM objects
- Fragment Parsing: Support for parsing HTML fragments with context
- Error Handling: Reports parse failures through getLastError() and checks markup with validateHTML()
- Utility Functions: Helper functions for common HTML operations
- Web Standards: Follows W3C/WHATWG DOM specifications
Components
HTMLParser
The main parser class that handles HTML parsing operations.
#include <APHTML/HTML/HTMLParser.h>
// Create parser instance
aperture::html::HTMLParser parser;
// Parse HTML content
auto document = parser.parseHTML("<html><body><h1>Hello World</h1></body></html>");
// Parse HTML file
auto fileDocument = parser.parseHTMLFile("path/to/file.html");
// Parse HTML fragment
auto fragment = parser.parseHTMLFragment("<li>Item 1</li><li>Item 2</li>");
HTMLUtils
Static utility class for common HTML operations.
#include <APHTML/HTML/HTMLUtils.h>
// Quick parsing
auto document = aperture::html::HTMLUtils::parseHTML(htmlContent);
// HTML validation
bool isValid = aperture::html::HTMLUtils::validateHTML(htmlContent);
// HTML escaping
std::string escaped = aperture::html::HTMLUtils::escapeHTML("<script>alert('xss')</script>");
// HTML unescaping
std::string unescaped = aperture::html::HTMLUtils::unescapeHTML("&lt;script&gt;");
HTMLParserOptions
Configuration options for parsing behavior.
aperture::html::HTMLParserOptions options;
options.preserveWhitespace = true;
options.includeComments = false;
options.maxErrors = 5;
auto document = parser.parseHTML(htmlContent, options);
Usage Examples
Basic HTML Parsing
#include <APHTML/HTML/HTMLUtils.h>
#include <APHTML/dom/DOMDocument.h>
#include <iostream>
#include <string>
// Parse HTML content
std::string html = R"(
<!DOCTYPE html>
<html>
<head><title>Example</title></head>
<body>
<h1>Hello World</h1>
<p>This is a paragraph.</p>
</body>
</html>
)";
auto document = aperture::html::HTMLUtils::parseHTML(html);
if (document) {
// Access parsed content
auto bodyElements = document->getElementsByTagName("body");
if (bodyElements && bodyElements->getLength() > 0) {
auto body = bodyElements->item(0);
std::cout << "Body has " << body->getChildNodes().size() << " children\n";
}
}
Working with Attributes
std::string html = R"(
<div class="container" id="main" data-custom="value">
<input type="text" name="username" required>
<button class="btn" onclick="submit()">Submit</button>
</div>
)";
auto document = aperture::html::HTMLUtils::parseHTML(html);
if (document) {
auto divs = document->getElementsByTagName("div");
if (divs && divs->getLength() > 0) {
auto div = divs->item(0);
std::cout << "Class: " << div->getAttribute("class") << "\n";
std::cout << "ID: " << div->getAttribute("id") << "\n";
std::cout << "Data: " << div->getAttribute("data-custom") << "\n";
}
}
Fragment Parsing
// Keep the source string and the parsed result under different names
std::string fragmentHtml = "<li>Item 1</li><li>Item 2</li><li>Item 3</li>";
auto fragment = aperture::html::HTMLUtils::parseHTMLFragment(fragmentHtml);
if (fragment) {
for (unsigned long i = 0; i < fragment->getChildNodes().size(); ++i) {
auto child = fragment->getChildNodes().item(i);
if (child->getNodeType() == aperture::dom::DOMNode::ELEMENT_NODE) {
auto element = std::static_pointer_cast<aperture::dom::DOMElement>(child);
std::cout << "Found: " << element->getTagName() << "\n";
}
}
}
HTML Validation
// Validate HTML content
std::string html = "<div><p>Valid content</p></div>";
if (aperture::html::HTMLUtils::validateHTML(html)) {
std::cout << "HTML is valid\n";
} else {
std::cout << "HTML is invalid\n";
}
HTML Escaping
// Escape HTML special characters
std::string text = "Hello <world> & \"universe\"!";
std::string escaped = aperture::html::HTMLUtils::escapeHTML(text);
// Result: "Hello &lt;world&gt; &amp; &quot;universe&quot;!"
// Unescape HTML entities
std::string unescaped = aperture::html::HTMLUtils::unescapeHTML(escaped);
// Result: "Hello <world> & \"universe\"!"
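To make the escaping rules concrete, here is a minimal, self-contained sketch of the transformation. The names escapeHtmlSketch and unescapeHtmlSketch are hypothetical stand-ins, not the APUI implementation, and the choice of `&quot;` and `&#39;` for quotes is an assumption; HTMLUtils may encode quotes differently or decode a wider set of entities.

```cpp
#include <string>

// Hypothetical stand-in for HTMLUtils::escapeHTML (assumption, not APUI code).
// Replaces the five characters that are significant in HTML text and attributes.
std::string escapeHtmlSketch(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&#39;";  break;
            default:   out += c;        break;
        }
    }
    return out;
}

// Hypothetical inverse: decodes only the five entities produced above.
// Each '&' is examined once, so "&amp;lt;" decodes to "&lt;", not "<".
std::string unescapeHtmlSketch(const std::string& text) {
    struct Entity { std::string name; char ch; };
    static const Entity kEntities[] = {
        {"&amp;", '&'}, {"&lt;", '<'}, {"&gt;", '>'}, {"&quot;", '"'}, {"&#39;", '\''},
    };
    std::string out;
    out.reserve(text.size());
    std::size_t i = 0;
    while (i < text.size()) {
        bool matched = false;
        if (text[i] == '&') {
            for (const Entity& e : kEntities) {
                if (text.compare(i, e.name.size(), e.name) == 0) {
                    out += e.ch;
                    i += e.name.size();
                    matched = true;
                    break;
                }
            }
        }
        if (!matched) {
            out += text[i++];  // Copy ordinary characters and unknown entities as-is
        }
    }
    return out;
}
```

Escaping `&` in the same single pass as the other characters avoids double-escaping, which is the usual pitfall when the replacements are applied one after another.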
Error Handling
If parsing fails, parseHTML returns a null document and getLastError() returns a description of the failure:
auto document = aperture::html::HTMLUtils::parseHTML(htmlContent);
if (!document) {
std::string error = aperture::html::HTMLUtils::getLastError();
std::cerr << "Parsing failed: " << error << "\n";
}
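Callers who prefer exceptions over null checks could wrap this convention in a small helper. The sketch below is an assumption for illustration only: parseOrThrow is a hypothetical name, not part of APUI, and it takes the parse and error functions as callables so that it compiles without the APUI headers.

```cpp
#include <memory>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of APUI): converts the "null document plus
// last-error string" convention into an exception.
//   ParseFn: callable taking const std::string&, returning a pointer-like document.
//   ErrorFn: callable taking no arguments, returning the last error as std::string.
template <typename ParseFn, typename ErrorFn>
auto parseOrThrow(const std::string& html, ParseFn parse, ErrorFn lastError)
    -> decltype(parse(html)) {
    auto document = parse(html);
    if (!document) {
        // Read the error immediately, before another parse can overwrite it
        throw std::runtime_error("HTML parsing failed: " + lastError());
    }
    return document;
}
```

In application code the two callables would presumably be lambdas forwarding to HTMLUtils::parseHTML and HTMLUtils::getLastError. Lambdas are suggested rather than function pointers because parseHTML appears to have more than one overload.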
Configuration Options
The HTMLParserOptions class allows you to customize parsing behavior:
aperture::html::HTMLParserOptions options;
options.preserveWhitespace = true; // Preserve whitespace nodes
options.includeComments = false; // Exclude comment nodes
options.maxErrors = 10; // Maximum errors to report
options.stopOnFirstError = false; // Continue parsing after errors
options.strictFragmentParsing = true; // Strict fragment parsing mode
auto document = parser.parseHTML(htmlContent, options);
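As an illustration of what preserveWhitespace and includeComments plausibly control, the following self-contained sketch shows a per-node filter that a parser-to-DOM conversion step might apply. ParserOptionsSketch, NodeKind, and shouldKeepNode are hypothetical names, and the default values are assumptions; the real HTMLParserOptions may behave differently.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Stand-in mirroring two fields of HTMLParserOptions from this README.
// The defaults chosen here are assumptions.
struct ParserOptionsSketch {
    bool preserveWhitespace = false;
    bool includeComments = true;
};

// Simplified node categories for the sketch.
enum class NodeKind { Element, Text, Comment };

// True when the text is empty or contains only whitespace characters.
bool isWhitespaceOnly(const std::string& text) {
    return std::all_of(text.begin(), text.end(),
                       [](unsigned char c) { return std::isspace(c) != 0; });
}

// Decides whether a parsed node should be copied into the DOM tree.
bool shouldKeepNode(NodeKind kind, const std::string& text,
                    const ParserOptionsSketch& options) {
    switch (kind) {
        case NodeKind::Comment:
            return options.includeComments;
        case NodeKind::Text:
            // Whitespace-only text survives only when explicitly preserved
            return options.preserveWhitespace || !isWhitespaceOnly(text);
        case NodeKind::Element:
            return true;
    }
    return true;  // Unreachable for valid enum values
}
```

Under these assumptions, setting preserveWhitespace to false removes the indentation-only text nodes between block elements, which shrinks child counts such as the one printed in the basic parsing example.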
Integration with Existing DOM
The HTML parser produces the same node types used elsewhere in the APUI DOM system:
- Parsed HTML is converted to native APUI DOM objects
- All DOM methods and properties are available on parsed elements
- Event handling and manipulation work as expected
- CSS styling and layout systems work with parsed content
Performance Considerations
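- Gumbo is a relatively lightweight C library with no external dependencies; its authors do not list raw execution speed as a design goal, so measure parsing time on representative documents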
- Large HTML documents are parsed efficiently
- Memory usage is optimized for typical web content
- Parsing errors are handled gracefully without performance impact
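Performance characteristics like these are easiest to confirm by timing a parse of your own documents. The helper below is a minimal sketch; measureMilliseconds is a hypothetical name, not part of APUI, and it times any callable with std::chrono::steady_clock.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not part of APUI): runs `work` once and returns the
// elapsed wall-clock time in milliseconds.
template <typename Work>
double measureMilliseconds(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

To time a parse, pass a lambda that calls HTMLUtils::parseHTML on a representative document. A single run is noisy, so repeat the measurement several times and compare the median.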
Dependencies
- Gumbo: HTML5 parsing library (included as submodule)
- APUI Foundation: Core framework functionality
- APUI DOM: Document Object Model implementation
Building
The HTML parser module is automatically included when building the APUI HTML engine. The Gumbo library is built as a static library and linked with the main APUI HTML engine.
License
This module is part of the APUI framework and follows the same licensing terms as the rest of the codebase.

