Advanced Topics

Tuning LLMDigestRules.yaml

LLMDigestRules.yaml controls what gets stripped, where the main content lives, and which cleanup steps run on the converted markdown. All of it is editable without recompiling. Call DigestRulesConfig.Save() to generate a starter file, then add or adjust rule sets below the built-in ones.

What Each Rule Set Field Means

Each rule set in the file defines:

ruleSets:
  - name: my-site
    domainPatterns:
      - mysite.com
    junkSelectors:
      - "//aside"
      - "//nav"
    mainContentCandidates:
      - "//article"
      - "//main"
    sitemapBaseUrl: "https://www.mysite.com"
    enabledPasses:
      - CollapseExcessiveBlanks
      - HeadingSpacing
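To make the role of domainPatterns concrete, here is a minimal Python sketch of how a rule set might be chosen for a URL. It is not GPAL's code. The matching rule (a pattern matches the host itself or any of its subdomains), the fallback to a rule set named "generic", and the names RULE_SETS and pick_rule_set are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical in-memory form of two rule sets; only the fields used here.
RULE_SETS = [
    {"name": "generic", "domainPatterns": []},
    {"name": "my-site", "domainPatterns": ["mysite.com"]},
]


def pick_rule_set(url, rule_sets):
    """Return the first rule set whose pattern matches the URL's host.

    Assumption: a pattern matches the host itself or any subdomain of it.
    """
    host = (urlparse(url).hostname or "").lower()
    for rule_set in rule_sets:
        for pattern in rule_set.get("domainPatterns", []):
            pattern = pattern.lower()
            if host == pattern or host.endswith("." + pattern):
                return rule_set
    # Assumption: with no match, the rule set named "generic" applies.
    return next(rs for rs in rule_sets if rs["name"] == "generic")


print(pick_rule_set("https://www.mysite.com/post/1", RULE_SETS)["name"])
print(pick_rule_set("https://example.org/", RULE_SETS)["name"])
```

Under these assumptions the first call selects my-site and the second falls back to generic.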

The Seven Cleanup Steps

Seven cleanup steps can run after the HTML-to-markdown conversion. They cover tasks such as collapsing runs of blank lines, stripping comment counters and empty image placeholders, fixing image spacing and layout, reformatting product page price blocks, and adding spacing around headings for a more readable document. Each rule set in the YAML file lists the steps it needs by name under enabledPasses. Any step not listed is skipped.
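The idea of running only the named steps can be sketched in Python as a lookup table of functions. This is an illustration, not GPAL's implementation: the two step names come from the sample rule set, but their exact behavior, the order in which steps run, and the helper names (PASSES, run_passes) are assumptions.

```python
import re

HEADING = re.compile(r"#{1,6} ")


def collapse_excessive_blanks(markdown):
    """Reduce any run of two or more blank lines to a single blank line."""
    return re.sub(r"\n{3,}", "\n\n", markdown)


def heading_spacing(markdown):
    """Put a blank line before and after each '#' heading.

    Simplification: lines inside fenced code blocks are not special-cased.
    """
    out = []
    for line in markdown.split("\n"):
        is_heading = HEADING.match(line) is not None
        if is_heading and out and out[-1] != "":
            out.append("")  # blank line before the heading
        elif not is_heading and line != "" and out and HEADING.match(out[-1]):
            out.append("")  # blank line after the previous heading
        out.append(line)
    return "\n".join(out)


# Hypothetical registry: step name from the YAML file -> function.
PASSES = {
    "CollapseExcessiveBlanks": collapse_excessive_blanks,
    "HeadingSpacing": heading_spacing,
}


def run_passes(markdown, enabled_passes):
    """Apply only the named steps. Assumption: they run in listed order."""
    for name in enabled_passes:
        markdown = PASSES[name](markdown)
    return markdown


raw = "Intro\n# Title\nBody\n\n\n\nMore"
print(run_passes(raw, ["CollapseExcessiveBlanks", "HeadingSpacing"]))
```

With an empty list, run_passes returns the text unchanged, which mirrors the rule that unlisted steps are skipped.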

Adding a Site-Specific Rule Set

A rule set only needs to define what makes a site different from the generic defaults. Any field left out inherits the generic rule set value. Add your rule set below the built-in ones in the YAML file. Call DigestRulesConfig.Save() first if the file does not exist yet.

- name: linkedin
  domainPatterns:
    - linkedin.com
  junkSelectors:
    - "//aside[contains(@class,'sidebar')]"
    - "//*[contains(@class,'social-action')]"
  mainContentCandidates:
    - "//main"
    - "//*[contains(@class,'update-v2')]"

TIP

If a field is missing from a rule set, GPAL fills it in from the generic rule set. So a site-specific entry can be as minimal as a name, domainPatterns, and the one or two fields that actually differ from generic.
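The fallback described in the tip can be pictured as a dictionary merge. The Python sketch below is only an illustration of that behavior: the contents of GENERIC are made-up placeholder values, resolve is a hypothetical helper, and it assumes that a field present in the site-specific rule set replaces the generic value outright rather than being merged with it.

```python
import copy

# Placeholder generic rule set; the real built-in values may differ.
GENERIC = {
    "name": "generic",
    "domainPatterns": [],
    "junkSelectors": ["//aside", "//nav"],
    "mainContentCandidates": ["//article", "//main"],
    "enabledPasses": ["CollapseExcessiveBlanks", "HeadingSpacing"],
}

# Site-specific rule set that leaves out enabledPasses.
LINKEDIN = {
    "name": "linkedin",
    "domainPatterns": ["linkedin.com"],
    "junkSelectors": [
        "//aside[contains(@class,'sidebar')]",
        "//*[contains(@class,'social-action')]",
    ],
    "mainContentCandidates": ["//main", "//*[contains(@class,'update-v2')]"],
}


def resolve(rule_set, generic):
    """Return rule_set with every missing field filled in from generic.

    Assumption: a field that is present replaces the generic value whole.
    """
    resolved = copy.deepcopy(generic)
    resolved.update(copy.deepcopy(rule_set))
    return resolved


resolved = resolve(LINKEDIN, GENERIC)
print(resolved["enabledPasses"])  # taken from GENERIC
print(resolved["junkSelectors"])  # the linkedin list, not the generic one
```

Here the linkedin entry never mentions enabledPasses, so the resolved rule set picks up the generic list, while its own junkSelectors and mainContentCandidates stay in effect.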