APUI HTML Parser with Gumbo Integration

This module provides HTML parsing for the APUI framework using the Gumbo HTML5 parser. It converts the parsed HTML into objects from the existing APUI DOM system, so parsed documents are available through the same DOM interfaces as the rest of the framework.

Features

HTML5 Compliant Parsing: Uses Google's Gumbo parser for robust HTML5 parsing
DOM Integration: Converts parsed HTML directly into APUI DOM objects
Fragment Parsing: Support for parsing HTML fragments with context
Error Handling: Parse failures are reported through getLastError(), and validateHTML() checks content before use
Utility Functions: Helper functions for common HTML operations
Web Standards: Follows W3C/WHATWG DOM specifications

Components

HTMLParser

The main parser class that handles HTML parsing operations.

#include <APHTML/HTML/HTMLParser.h>

// Create parser instance
aperture::html::HTMLParser parser;

// Parse HTML content
auto document = parser.parseHTML("<html><body><h1>Hello World</h1></body></html>");

// Parse HTML file
auto fileDocument = parser.parseHTMLFile("path/to/file.html");

// Parse HTML fragment
auto fragment = parser.parseHTMLFragment("<li>Item 1</li><li>Item 2</li>");

HTMLUtils

Static utility class for common HTML operations.

#include <APHTML/HTML/HTMLUtils.h>

// Quick parsing
auto document = aperture::html::HTMLUtils::parseHTML(htmlContent);

// HTML validation
bool isValid = aperture::html::HTMLUtils::validateHTML(htmlContent);

// HTML escaping
std::string escaped = aperture::html::HTMLUtils::escapeHTML("<script>alert('xss')</script>");

// HTML unescaping
std::string unescaped = aperture::html::HTMLUtils::unescapeHTML("&lt;script&gt;");

HTMLParserOptions

Configuration options for parsing behavior.

aperture::html::HTMLParserOptions options;
options.preserveWhitespace = true;
options.includeComments = false;
options.maxErrors = 5;

auto document = parser.parseHTML(htmlContent, options);
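
To make the effect of preserveWhitespace and includeComments concrete, the following is a minimal standalone sketch of how such flags could filter nodes while a tree is being built. FilterNode, FilterOptions, and filterNodes are hypothetical names used only for illustration. They are not part of APUI, and the real parser may apply these options differently.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these types exist in APUI.
enum class FilterNodeKind { Element, Text, Comment };

struct FilterNode {
    FilterNodeKind kind;
    std::string value;  // tag name, text content, or comment body
};

struct FilterOptions {
    bool preserveWhitespace = true;
    bool includeComments = false;
};

// True when the string is empty or contains only whitespace characters.
bool isWhitespaceOnly(const std::string& s) {
    return std::all_of(s.begin(), s.end(),
                       [](unsigned char c) { return std::isspace(c) != 0; });
}

// Returns the nodes that survive the given options (assumed semantics):
// comments are dropped unless includeComments is set, and whitespace-only
// text nodes are dropped unless preserveWhitespace is set.
std::vector<FilterNode> filterNodes(const std::vector<FilterNode>& nodes,
                                    const FilterOptions& options) {
    std::vector<FilterNode> kept;
    for (const FilterNode& node : nodes) {
        if (node.kind == FilterNodeKind::Comment && !options.includeComments) {
            continue;
        }
        if (node.kind == FilterNodeKind::Text && !options.preserveWhitespace &&
            isWhitespaceOnly(node.value)) {
            continue;
        }
        kept.push_back(node);
    }
    return kept;
}
```

Under these assumed semantics, text nodes that contain visible characters are always kept. Only whitespace-only text and comments are affected by the flags.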

Usage Examples

Basic HTML Parsing

#include <APHTML/HTML/HTMLUtils.h>
#include <APHTML/dom/DOMDocument.h>

#include <iostream>
#include <string>

// Parse HTML content
std::string html = R"(
    <!DOCTYPE html>
    <html>
    <head><title>Example</title></head>
    <body>
        <h1>Hello World</h1>
        <p>This is a paragraph.</p>
    </body>
    </html>
)";

auto document = aperture::html::HTMLUtils::parseHTML(html);
if (document) {
    // Access parsed content
    auto bodyElements = document->getElementsByTagName("body");
    if (bodyElements && bodyElements->getLength() > 0) {
        auto body = bodyElements->item(0);
        std::cout << "Body has " << body->getChildNodes().size() << " children\n";
    }
}

Working with Attributes

std::string html = R"(
    <div class="container" id="main" data-custom="value">
        <input type="text" name="username" required>
        <button class="btn" onclick="submit()">Submit</button>
    </div>
)";

auto document = aperture::html::HTMLUtils::parseHTML(html);
if (document) {
    auto divs = document->getElementsByTagName("div");
    if (divs && divs->getLength() > 0) {
        auto div = divs->item(0);
        std::cout << "Class: " << div->getAttribute("class") << "\n";
        std::cout << "ID: " << div->getAttribute("id") << "\n";
        std::cout << "Data: " << div->getAttribute("data-custom") << "\n";
    }
}

Fragment Parsing

std::string fragmentHtml = "<li>Item 1</li><li>Item 2</li><li>Item 3</li>";

auto fragment = aperture::html::HTMLUtils::parseHTMLFragment(fragmentHtml);
if (fragment) {
    for (unsigned long i = 0; i < fragment->getChildNodes().size(); ++i) {
        auto child = fragment->getChildNodes().item(i);
        if (child->getNodeType() == aperture::dom::DOMNode::ELEMENT_NODE) {
            auto element = std::static_pointer_cast<aperture::dom::DOMElement>(child);
            std::cout << "Found: " << element->getTagName() << "\n";
        }
    }
}
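
The loop above visits only the direct children of the fragment. When a fragment contains nested markup, a recursive walk is one way to reach every element. The sketch below shows the idea on a stand-in node type. TreeNode and collectTagNames are hypothetical and are not APUI classes. With the real DOM, the same pattern would presumably be written against DOMNode and its child list.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM node; not the APUI DOMNode class.
struct TreeNode {
    bool isElement = false;
    std::string name;  // tag name for elements, a label such as "#text" otherwise
    std::vector<std::shared_ptr<TreeNode>> children;
};

// Depth-first, pre-order walk that appends the tag name of every element
// found at or below `node` to `out`.
void collectTagNames(const TreeNode& node, std::vector<std::string>& out) {
    if (node.isElement) {
        out.push_back(node.name);
    }
    for (const std::shared_ptr<TreeNode>& child : node.children) {
        if (child) {
            collectTagNames(*child, out);
        }
    }
}
```

Names are collected in document order, because each element is recorded before its children are visited.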

HTML Validation

// Validate HTML content
std::string html = "<div><p>Valid content</p></div>";
if (aperture::html::HTMLUtils::validateHTML(html)) {
    std::cout << "HTML is valid\n";
} else {
    std::cout << "HTML is invalid\n";
}

HTML Escaping

// Escape HTML special characters
std::string text = "Hello <world> & \"universe\"!";
std::string escaped = aperture::html::HTMLUtils::escapeHTML(text);
// Result: "Hello &lt;world&gt; &amp; &quot;universe&quot;!"

// Unescape HTML entities
std::string unescaped = aperture::html::HTMLUtils::unescapeHTML(escaped);
// Result: "Hello <world> & \"universe\"!"
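
As a rough illustration of what escaping and unescaping involve, here is a minimal standalone sketch. escapeHtmlSketch and unescapeHtmlSketch are hypothetical names, not APUI functions. The actual HTMLUtils implementation may recognize more entities or differ in detail, for instance in how it treats single quotes.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for HTMLUtils::escapeHTML (not the APUI implementation).
// Replaces the five characters that are significant in HTML text and attributes.
std::string escapeHtmlSketch(const std::string& text) {
    std::string out;
    out.reserve(text.size());
    for (char c : text) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&#39;";  break;
            default:   out += c;        break;
        }
    }
    return out;
}

// Hypothetical stand-in for HTMLUtils::unescapeHTML. It recognizes only the
// five entities produced above and copies everything else through unchanged.
std::string unescapeHtmlSketch(const std::string& text) {
    struct Entity { const char* name; std::size_t length; char ch; };
    static const Entity kEntities[] = {
        {"&amp;", 5, '&'}, {"&lt;", 4, '<'}, {"&gt;", 4, '>'},
        {"&quot;", 6, '"'}, {"&#39;", 5, '\''},
    };
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        bool matched = false;
        if (text[i] == '&') {
            for (const Entity& e : kEntities) {
                if (text.compare(i, e.length, e.name) == 0) {
                    out += e.ch;
                    i += e.length;
                    matched = true;
                    break;
                }
            }
        }
        if (!matched) {
            out += text[i];
            ++i;
        }
    }
    return out;
}
```

The unescape step scans the input once, left to right, so a doubly escaped sequence such as &amp;lt; decodes to &lt; rather than to <. Repeated find-and-replace passes would get this wrong.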

Error Handling

When parsing fails, parseHTML returns a null document and the error message is available from getLastError():

auto document = aperture::html::HTMLUtils::parseHTML(htmlContent);
if (!document) {
    std::string error = aperture::html::HTMLUtils::getLastError();
    std::cerr << "Parsing failed: " << error << "\n";
}

Configuration Options

The HTMLParserOptions class allows you to customize parsing behavior:

aperture::html::HTMLParserOptions options;
options.preserveWhitespace = true;        // Preserve whitespace nodes
options.includeComments = false;          // Exclude comment nodes
options.maxErrors = 10;                   // Maximum errors to report
options.stopOnFirstError = false;         // Continue parsing after errors
options.strictFragmentParsing = true;     // Strict fragment parsing mode

auto document = parser.parseHTML(htmlContent, options);
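
The interaction between maxErrors and stopOnFirstError can be sketched with a small error collector. This is only an assumption about the semantics: maxErrors is taken to cap how many messages are stored, and stopOnFirstError to signal that parsing should halt after the first report. ErrorPolicySketch and ErrorCollectorSketch are hypothetical names and do not appear in APUI.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical mirror of the two error-related options; not an APUI type.
struct ErrorPolicySketch {
    std::size_t maxErrors = 10;
    bool stopOnFirstError = false;
};

// Stores at most maxErrors messages while still counting every error seen.
class ErrorCollectorSketch {
public:
    explicit ErrorCollectorSketch(const ErrorPolicySketch& policy)
        : policy_(policy) {}

    // Records an error and returns true if the caller should keep parsing.
    bool report(const std::string& message) {
        ++seen_;
        if (errors_.size() < policy_.maxErrors) {
            errors_.push_back(message);
        }
        return !policy_.stopOnFirstError;
    }

    const std::vector<std::string>& errors() const { return errors_; }
    std::size_t totalSeen() const { return seen_; }

private:
    ErrorPolicySketch policy_;
    std::vector<std::string> errors_;
    std::size_t seen_ = 0;
};
```

Keeping a separate count of all errors seen lets a caller tell whether the stored list was truncated by the cap.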

Integration with Existing DOM

The HTML parser integrates seamlessly with the existing APUI DOM system:

Parsed HTML is converted to native APUI DOM objects
All DOM methods and properties are available on parsed elements
Event handling and DOM manipulation work on parsed nodes the same way they do on nodes created programmatically
CSS styling and layout systems work with parsed content

Performance Considerations

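Gumbo is a C library, but the upstream Gumbo README lists execution speed as a non-goal, so measure parse time on representative input rather than assuming it
Gumbo builds the complete parse tree in memory before it is converted to APUI DOM objects, so peak memory grows with document size
Malformed markup is recovered according to the HTML5 parsing rules rather than aborting the parse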

Dependencies

Gumbo: HTML5 parsing library (included as submodule)
APUI Foundation: Core framework functionality
APUI DOM: Document Object Model implementation

Building

The HTML parser module is automatically included when building the APUI HTML engine. The Gumbo library is built as a static library and linked with the main APUI HTML engine.

License

This module is part of the APUI framework and follows the same licensing terms as the rest of the codebase.
