Tutorials

Walking Sitemaps and Capturing Page Digests

GetSiteMapUrls walks a site's sitemap.xml so you can discover pages without hand-coding URLs, and GetHydratedData / GetLLMDigest pull structured data and LLM-friendly summaries out of each page you visit. This tutorial crawls a sitemap two levels deep, captures the hydration data a modern framework embeds in a page, and saves an LLM digest to a numbered file using GPALFile.Next.

Complete Program

Here's the whole workflow, start to finish. Each piece is broken down and explained below.

using System.Collections.Generic;
using GenerallyPositive;
using GenerallyPositive.Browser;
using static GenerallyPositive.Enums;

// Global GPAL setup: console publishing, driver location, and OttoMagic location.
GPAL
    .WithPublishToConsole()
    .WithDriverLocation(@"C:\drivers")
    .WithUseOttoMagic(@"C:\OttoMagic");

// Chrome with stealth options, honoring robots meta tags and robots.txt.
IBrowser browser = GPAL.Browser
    .WithBrowserType(BrowserType.Chrome)
    .WithUseStealth(StealthType.GoogleReferrer | StealthType.PatchChromeDriver | StealthType.DarkMode)
    .WithUseAutomationEngine(AutomationEngine.OttoMagic)
    .WithRespectRobotMetaTags(true)
    .WithObeyRobotsTxt(true)
    .ToGPALObject();

string siteUrl = "Nike.com";

// Level 1: URLs listed by the top-level sitemap.
List<string> sitemapUrls;
browser.GoTo(siteUrl).WaitFor(2_000)
    .GetSiteMapUrls(out sitemapUrls);

foreach (string url in sitemapUrls)
{
    // Level 2: each entry may itself be a sitemap.
    List<string> nestedUrls;
    browser.GoTo(url).WaitFor(2_000)
        .GetSiteMapUrls(out nestedUrls);

    foreach (string nestedUrl in nestedUrls)
    {
        browser.GoTo(nestedUrl).WaitFor(5_000);
        GPAL.PublishSimpleEvent(GPALEventType.INFO, $"Visited [{nestedUrl}]");
    }
}

NextJsHydrationResult nextJsHydrationResult;
LLMDigestResult digestResult;
GPALFile hydratedDataFile = @"c:\sdi\hydratedData.txt";
GPALFile llmDigestFile = @"c:\sdi\llmDigest.txt";

// Read the embedded hydration JSON at siteUrl and save it.
browser.GoTo(siteUrl).WaitFor(5_000)
    .GetHydratedData(out nextJsHydrationResult);
browser.SaveHydratedData(hydratedDataFile);

// Digest the current page and save it to the next numbered file.
browser
    .GetLLMDigest(out digestResult)
    .SaveLLMDigest(llmDigestFile.Next);

browser.Close(true);

Walking a Sitemap with GetSiteMapUrls

GetSiteMapUrls reads the current page's sitemap (sitemap.xml or whatever robots.txt points to) and returns the URLs it lists. A top-level sitemap often lists other sitemaps, such as category or product feeds, rather than pages, so calling GetSiteMapUrls again on each result walks down one level. The nested loop below visits every URL found two levels deep.

List<string> sitemapUrls;
browser.GoTo(siteUrl).WaitFor(2_000)
    .GetSiteMapUrls(out sitemapUrls);

foreach (string url in sitemapUrls)
{
    List<string> nestedUrls;
    browser.GoTo(url).WaitFor(2_000)
        .GetSiteMapUrls(out nestedUrls);

    foreach (string nestedUrl in nestedUrls)
    {
        browser.GoTo(nestedUrl).WaitFor(5_000);
        GPAL.PublishSimpleEvent(GPALEventType.INFO, $"Visited [{nestedUrl}]");
    }
}

TIP

A site's top-level sitemap.xml frequently lists other sitemaps (products.xml, categories.xml, and so on) instead of pages. Call GetSiteMapUrls again on each entry to walk down to the actual page URLs.
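
To make the two-level structure concrete, here is a minimal standalone sketch of how a sitemap index differs from an ordinary sitemap under the sitemaps.org protocol. It does not use GPAL and is not a description of how GetSiteMapUrls works internally; the helper name ReadLocs and the example.com URLs are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// XML namespace defined by the sitemaps.org protocol (version 0.9).
XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

// Hypothetical helper: reports whether the document is a sitemap index
// and returns every <loc> value it lists.
(bool IsIndex, List<string> Locs) ReadLocs(string xml)
{
    XElement root = XDocument.Parse(xml).Root
        ?? throw new InvalidOperationException("Empty sitemap document.");
    bool isIndex = root.Name == ns + "sitemapindex";
    List<string> locs = root
        .Descendants(ns + "loc")
        .Select(loc => loc.Value.Trim())
        .ToList();
    return (isIndex, locs);
}

// A sitemap index: its entries point at other sitemaps.
string indexXml = @"<sitemapindex xmlns=""http://www.sitemaps.org/schemas/sitemap/0.9"">
  <sitemap><loc>https://example.com/products.xml</loc></sitemap>
  <sitemap><loc>https://example.com/categories.xml</loc></sitemap>
</sitemapindex>";

// An ordinary sitemap: its entries point at pages.
string pagesXml = @"<urlset xmlns=""http://www.sitemaps.org/schemas/sitemap/0.9"">
  <url><loc>https://example.com/shoes/1</loc></url>
</urlset>";

var index = ReadLocs(indexXml);
var pages = ReadLocs(pagesXml);

Console.WriteLine($"index: IsIndex={index.IsIndex}, entries={index.Locs.Count}");
Console.WriteLine($"pages: IsIndex={pages.IsIndex}, entries={pages.Locs.Count}");
```

The root element is the only difference that matters for a crawl: a sitemapindex needs one more hop per entry, while a urlset already lists the pages.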

Browsing Politely: Stealth and robots.txt

WithRespectRobotMetaTags and WithObeyRobotsTxt tell GPAL to honor a site's crawling rules - robots meta tags on individual pages and the site-wide robots.txt - the same way a well-behaved crawler would. WithUseStealth applies a combination of techniques so the automated browser looks more like a normal visit; StealthType is a flags enum, so OR several together with |.

IBrowser browser = GPAL.Browser
    .WithBrowserType(BrowserType.Chrome)
    .WithUseStealth(StealthType.GoogleReferrer | StealthType.PatchChromeDriver | StealthType.DarkMode)
    .WithUseAutomationEngine(AutomationEngine.OttoMagic)
    .WithRespectRobotMetaTags(true)
    .WithObeyRobotsTxt(true)
    .ToGPALObject();
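
If flags enums are unfamiliar, the following standalone sketch shows why | combines options without losing any of them. DemoStealth is a hypothetical stand-in that borrows the member names used above; its numeric values are assumptions, not GPAL's actual StealthType definition.

```csharp
using System;

// Combine three options with bitwise OR, as the WithUseStealth call does.
DemoStealth selected =
    DemoStealth.GoogleReferrer | DemoStealth.PatchChromeDriver | DemoStealth.DarkMode;

// HasFlag tests for one option inside the combined value.
Console.WriteLine(selected.HasFlag(DemoStealth.DarkMode));

// With [Flags], printing the value lists the names of the set members.
Console.WriteLine(selected);

// Hypothetical flags enum: each member occupies its own bit, so any
// combination can be stored in a single value. The values are assumptions.
[Flags]
enum DemoStealth
{
    None = 0,
    GoogleReferrer = 1,
    PatchChromeDriver = 2,
    DarkMode = 4,
}
```

Because every member is a distinct power of two, removing an option is just a matter of leaving it out of the OR expression.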

Capturing Hydrated Data from Modern Frameworks

Many sites built with frameworks like Next.js fetch data on the server and embed it in the page as JSON, which the client-side code uses to "hydrate" the page on load. GetHydratedData reads that embedded JSON into a NextJsHydrationResult, and SaveHydratedData writes it to a GPALFile. Hydration happens after the document is otherwise ready, so this step uses a longer WaitFor (5_000 here, versus 2_000 for the sitemap pages) than a typical navigation.

NextJsHydrationResult nextJsHydrationResult;
GPALFile hydratedDataFile = @"c:\sdi\hydratedData.txt";

browser.GoTo(siteUrl).WaitFor(5_000)
    .GetHydratedData(out nextJsHydrationResult);
browser.SaveHydratedData(hydratedDataFile);
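
For a sense of what "embedded JSON" looks like, pages built with the Next.js Pages Router carry it in a script element whose id is __NEXT_DATA__. The standalone sketch below extracts and parses that element from a hypothetical HTML string. It does not use GPAL, and it is only an illustration of the idea; how GetHydratedData locates the data is not described here.

```csharp
using System;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical page source; a real page would be far larger.
string html = @"<html><body><div id=""__next""></div>
<script id=""__NEXT_DATA__"" type=""application/json"">
{""props"":{""pageProps"":{""title"":""Running Shoes""}},""page"":""/shoes""}
</script></body></html>";

// Capture the body of the __NEXT_DATA__ script element.
Match match = Regex.Match(
    html,
    @"<script id=""__NEXT_DATA__""[^>]*>(.*?)</script>",
    RegexOptions.Singleline);

if (!match.Success)
{
    Console.WriteLine("No __NEXT_DATA__ block found.");
    return;
}

// Parse the captured text as JSON and read two fields from it.
using JsonDocument doc = JsonDocument.Parse(match.Groups[1].Value);
var page = doc.RootElement.GetProperty("page").GetString();
var title = doc.RootElement
    .GetProperty("props")
    .GetProperty("pageProps")
    .GetProperty("title")
    .GetString();

Console.WriteLine($"{page}: {title}");
```

A regular expression is enough for a fixed sample like this one; for arbitrary pages an HTML parser would be the more robust choice.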

Saving Per-Page Digests with GPALFile.Next

GetLLMDigest produces a condensed, LLM-friendly summary of the current page as an LLMDigestResult, and SaveLLMDigest writes it out. llmDigestFile.Next returns a new GPALFile pointing at the next available filename in sequence (llmDigest1.txt, llmDigest2.txt, ...) without changing llmDigestFile itself - call it fresh inside a loop to give every page its own digest file.

LLMDigestResult digestResult;
GPALFile llmDigestFile = @"c:\sdi\llmDigest.txt";

browser
    .GetLLMDigest(out digestResult)
    .SaveLLMDigest(llmDigestFile.Next);

TIP

llmDigestFile.Next is a fresh GPALFile each time you read it - llmDigestFile itself still points at llmDigest.txt. Reading .Next inside a loop gives you llmDigest1.txt, llmDigest2.txt, and so on, so earlier digests aren't overwritten.
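
The numbering behavior can be modeled without GPAL. The sketch below is a rough stand-in, assuming that "next available" means the first numbered name not already on disk; whether GPALFile.Next checks the disk or keeps a counter is not stated here. NextAvailable and the page names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for the behavior described above: given a base path,
// return the first numbered sibling (name1.ext, name2.ext, ...) not yet on disk.
string NextAvailable(string basePath)
{
    string dir = Path.GetDirectoryName(basePath) ?? "";
    string name = Path.GetFileNameWithoutExtension(basePath);
    string ext = Path.GetExtension(basePath);

    for (int n = 1; ; n++)
    {
        string candidate = Path.Combine(dir, $"{name}{n}{ext}");
        if (!File.Exists(candidate))
        {
            return candidate;
        }
    }
}

// Work in a fresh temporary folder so the sketch touches no other files.
string folder = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
Directory.CreateDirectory(folder);
string baseFile = Path.Combine(folder, "llmDigest.txt");

var written = new List<string>();
string[] visitedPages = { "/home", "/shoes", "/sale" };

foreach (string visitedPage in visitedPages)
{
    // Ask for the next name on every iteration; baseFile itself never changes.
    string target = NextAvailable(baseFile);
    File.WriteAllText(target, $"digest for {visitedPage}");
    written.Add(Path.GetFileName(target));
}

Console.WriteLine(string.Join(", ", written));

// Remove the temporary folder and the files written into it.
Directory.Delete(folder, true);
```

The point to carry back to GPAL is the shape of the loop: the base file is declared once outside it, and the numbered name is requested inside it, once per page.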