Tutorials

Walking Sitemaps and Capturing Page Digests

GetSiteMapUrls walks a site's sitemap.xml so you can discover pages without hand-coding URLs, and GetHydratedData and GetLLMDigest pull structured data and LLM-friendly summaries out of the pages you visit. This tutorial crawls a sitemap two levels deep, captures the hydration data a modern framework embeds in a page, and saves an LLM digest using GPALFile.Next, which lets every page get its own digest file.

Complete Program

Here's the whole workflow, start to finish. Each piece is broken down and explained below.

using System.Collections.Generic;
using GenerallyPositive;
using GenerallyPositive.Browser;
using static GenerallyPositive.Enums;

// Global GPAL settings: console logging, driver folder, OttoMagic folder.
GPAL
    .WithPublishToConsole()
    .WithDriverLocation(@"C:\drivers")
    .WithUseOttoMagic(@"C:\OttoMagic");

IBrowser browser = GPAL.Browser
    .WithBrowserType(BrowserType.Chrome)
    .WithUseStealth(StealthType.GoogleReferrer | StealthType.PatchChromeDriver | StealthType.DarkMode)
    .WithUseAutomationEngine(AutomationEngine.OttoMagic)
    .WithRespectRobotMetaTags(true)
    .WithObeyRobotsTxt(true)
    .ToGPALObject();

string siteUrl = "Nike.com";
List<string> sitemapUrls;

// Level 1: read the URLs listed by the site's top-level sitemap.
browser.GoTo(siteUrl).WaitFor(2_000)
    .GetSiteMapUrls(out sitemapUrls);

foreach (string url in sitemapUrls)
{
    // Level 2: each top-level entry may itself be a sitemap.
    List<string> nestedUrls;
    browser.GoTo(url).WaitFor(2_000)
        .GetSiteMapUrls(out nestedUrls);

    foreach (string nestedUrl in nestedUrls)
    {
        browser.GoTo(nestedUrl).WaitFor(5_000);
        GPAL.PublishSimpleEvent(GPALEventType.INFO, $"Visited [{nestedUrl}]");
    }
}

NextJsHydrationResult nextJsHydrationResult;
LLMDigestResult digestResult;
GPALFile hydratedDataFile = @"c:\sdi\hydratedData.txt";
GPALFile llmDigestFile = @"c:\sdi\llmDigest.txt";

// Capture and save the hydration data embedded in the home page.
browser.GoTo(siteUrl).WaitFor(5_000)
    .GetHydratedData(out nextJsHydrationResult);
browser.SaveHydratedData(hydratedDataFile);

// Save an LLM digest of the current page to the next free file name.
browser
    .GetLLMDigest(out digestResult)
    .SaveLLMDigest(llmDigestFile.Next);

browser.Close(true);

Walking a Sitemap with GetSiteMapUrls

GetSiteMapUrls reads the current page's sitemap (sitemap.xml or whatever robots.txt points to) and returns the URLs it lists. Top-level sitemaps often list other sitemaps - category or product feeds - rather than pages directly, so calling GetSiteMapUrls again on each result walks down a level. This nested loop visits every URL two levels deep.

List<string> sitemapUrls;

// Level 1: read the URLs listed by the site's top-level sitemap.
browser.GoTo(siteUrl).WaitFor(2_000)
    .GetSiteMapUrls(out sitemapUrls);

foreach (string url in sitemapUrls)
{
    // Level 2: each top-level entry may itself be a sitemap.
    List<string> nestedUrls;
    browser.GoTo(url).WaitFor(2_000)
        .GetSiteMapUrls(out nestedUrls);

    foreach (string nestedUrl in nestedUrls)
    {
        browser.GoTo(nestedUrl).WaitFor(5_000);
        GPAL.PublishSimpleEvent(GPALEventType.INFO, $"Visited [{nestedUrl}]");
    }
}

TIP

A site's top-level sitemap.xml frequently lists other sitemaps (products.xml, categories.xml, and so on) instead of pages. Call GetSiteMapUrls again on each entry to walk down to the actual page URLs.
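To make the two-level structure concrete, here is a minimal sketch that parses sitemap XML with System.Xml.Linq. It does not use GPAL and is not a claim about how GetSiteMapUrls works internally; the SitemapSketch helper and the example.com documents are assumptions made up for illustration. It relies only on the public sitemap protocol, where an index file has a sitemapindex root and a page list has a urlset root.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical sample documents; the example.com URLs are placeholders.
string indexXml =
    "<sitemapindex xmlns='http://www.sitemaps.org/schemas/sitemap/0.9'>" +
    "<sitemap><loc>https://example.com/products.xml</loc></sitemap>" +
    "<sitemap><loc>https://example.com/categories.xml</loc></sitemap>" +
    "</sitemapindex>";

string pagesXml =
    "<urlset xmlns='http://www.sitemaps.org/schemas/sitemap/0.9'>" +
    "<url><loc>https://example.com/shoes/1</loc></url>" +
    "</urlset>";

foreach (string xml in new[] { indexXml, pagesXml })
{
    string kind = SitemapSketch.IsIndex(xml) ? "index of sitemaps" : "list of pages";
    Console.WriteLine(kind);
    foreach (string loc in SitemapSketch.ExtractLocs(xml))
    {
        Console.WriteLine("  " + loc);
    }
}

// Illustrative helper, not part of GPAL.
static class SitemapSketch
{
    static readonly XNamespace Ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // True when the document lists other sitemaps rather than pages.
    public static bool IsIndex(string xml) =>
        XDocument.Parse(xml).Root?.Name == Ns + "sitemapindex";

    // Collects every <loc> value, whether it sits under <sitemap> or <url>.
    public static List<string> ExtractLocs(string xml) =>
        XDocument.Parse(xml)
            .Descendants(Ns + "loc")
            .Select(e => e.Value.Trim())
            .ToList();
}
```

An index document is the case where a second GetSiteMapUrls call is needed; a urlset already contains page URLs.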

Browsing Politely: Stealth and robots.txt

WithRespectRobotMetaTags and WithObeyRobotsTxt tell GPAL to honor a site's crawling rules - robots meta tags on individual pages and the site-wide robots.txt - the same way a well-behaved crawler would. WithUseStealth applies a combination of techniques so the automated browser looks more like a normal visit; StealthType is a flags enum, so OR several together with |.

IBrowser browser = GPAL.Browser
    .WithBrowserType(BrowserType.Chrome)
    .WithUseStealth(StealthType.GoogleReferrer | StealthType.PatchChromeDriver | StealthType.DarkMode)
    .WithUseAutomationEngine(AutomationEngine.OttoMagic)
    .WithRespectRobotMetaTags(true)
    .WithObeyRobotsTxt(true)
    .ToGPALObject();
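If flags enums are unfamiliar, the sketch below shows how values combine with | and how a single flag can be tested afterwards. DemoStealth is a hypothetical stand-in: its member names mirror the tutorial, but the numeric values are assumptions, and the real StealthType may define more members or different values.

```csharp
using System;

// Combine two flags into one value with bitwise OR.
DemoStealth combined = DemoStealth.GoogleReferrer | DemoStealth.DarkMode;

Console.WriteLine(combined);                                        // GoogleReferrer, DarkMode
Console.WriteLine(combined.HasFlag(DemoStealth.DarkMode));          // True
Console.WriteLine(combined.HasFlag(DemoStealth.PatchChromeDriver)); // False

// Hypothetical enum for illustration only; not the GPAL StealthType.
[Flags]
enum DemoStealth
{
    None = 0,
    GoogleReferrer = 1 << 0,
    PatchChromeDriver = 1 << 1,
    DarkMode = 1 << 2,
}
```

Each member occupies its own bit, which is why several can travel in a single argument to a method such as WithUseStealth.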

Capturing Hydrated Data from Modern Frameworks

Many sites built with frameworks like Next.js fetch data on the server and embed it in the page as JSON, which the client-side code reads to "hydrate" the page on load. GetHydratedData reads that embedded JSON into a NextJsHydrationResult, and SaveHydratedData writes it to a GPALFile. Hydration happens after the document is otherwise ready, so this step uses a longer WaitFor than a typical navigation.

NextJsHydrationResult nextJsHydrationResult;
GPALFile hydratedDataFile = @"c:\sdi\hydratedData.txt";

browser.GoTo(siteUrl).WaitFor(5_000)
    .GetHydratedData(out nextJsHydrationResult);
browser.SaveHydratedData(hydratedDataFile);
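For a sense of what "embedded JSON" means, here is a sketch that pulls the payload out of raw HTML by hand. It assumes a Next.js page that uses the Pages Router, which serializes its props into a script element with the id __NEXT_DATA__. The HydrationSketch helper and the sample HTML are hypothetical, the regular expression is a shortcut rather than a real HTML parser, and none of this describes how GetHydratedData is implemented.

```csharp
using System;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical HTML; a real page would come from a browser or HTTP client.
string html =
    "<html><body><div id='__next'></div>" +
    "<script id=\"__NEXT_DATA__\" type=\"application/json\">" +
    "{\"props\":{\"pageProps\":{\"title\":\"Air Max\"}},\"page\":\"/product\"}" +
    "</script></body></html>";

JsonElement? data = HydrationSketch.ExtractNextData(html);
if (data.HasValue)
{
    string title = data.Value
        .GetProperty("props")
        .GetProperty("pageProps")
        .GetProperty("title")
        .GetString();
    Console.WriteLine(title); // Air Max
}

// Illustrative helper, not part of GPAL.
static class HydrationSketch
{
    // Matches the script element that carries the serialized page props.
    static readonly Regex NextDataScript = new Regex(
        "<script[^>]*id=\"__NEXT_DATA__\"[^>]*>(?<json>.*?)</script>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the parsed JSON root, or null when the page has no payload.
    public static JsonElement? ExtractNextData(string html)
    {
        Match match = NextDataScript.Match(html);
        if (!match.Success) return null;

        using JsonDocument doc = JsonDocument.Parse(match.Groups["json"].Value);
        return doc.RootElement.Clone(); // Clone so the element outlives the document.
    }
}
```

A page without the script element yields null here, which is worth checking for before reading any properties.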

Saving Per-Page Digests with GPALFile.Next

GetLLMDigest produces a condensed, LLM-friendly summary of the current page as an LLMDigestResult, and SaveLLMDigest writes it out. llmDigestFile.Next returns a new GPALFile pointing at the next available filename in sequence (llmDigest1.txt, llmDigest2.txt, ...) without changing llmDigestFile itself - call it fresh inside a loop to give every page its own digest file.

LLMDigestResult digestResult;
GPALFile llmDigestFile = @"c:\sdi\llmDigest.txt";

browser
    .GetLLMDigest(out digestResult)
    .SaveLLMDigest(llmDigestFile.Next);

TIP

llmDigestFile.Next is a fresh GPALFile each time you read it - llmDigestFile itself still points at llmDigest.txt. Reading .Next inside a loop gives you llmDigest1.txt, llmDigest2.txt, and so on, so earlier digests aren't overwritten.
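The numbering behavior can be approximated with plain System.IO, which may help when reasoning about which file a given loop iteration will write. This is a sketch under assumptions: NextFileSketch is a hypothetical helper, it picks the first numbered name that does not exist on disk, and the exact rule GPALFile.Next follows may differ.

```csharp
using System;
using System.IO;

// Work in a throwaway temp folder so the sketch does not touch real files.
string folder = Path.Combine(Path.GetTempPath(), "next-file-sketch-" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(folder);
string basePath = Path.Combine(folder, "llmDigest.txt");

for (int page = 1; page <= 3; page++)
{
    // Ask for a fresh name on every iteration, as with llmDigestFile.Next.
    string target = NextFileSketch.NextAvailable(basePath);
    File.WriteAllText(target, $"digest for page {page}");
    Console.WriteLine(Path.GetFileName(target));
}

Directory.Delete(folder, recursive: true);

// Illustrative helper, not part of GPAL.
static class NextFileSketch
{
    // Returns stem1.ext, stem2.ext, ... choosing the first name not on disk.
    public static string NextAvailable(string basePath)
    {
        string dir = Path.GetDirectoryName(basePath) ?? "";
        string stem = Path.GetFileNameWithoutExtension(basePath);
        string ext = Path.GetExtension(basePath);

        for (int n = 1; ; n++)
        {
            string candidate = Path.Combine(dir, $"{stem}{n}{ext}");
            if (!File.Exists(candidate)) return candidate;
        }
    }
}
```

The loop writes llmDigest1.txt, llmDigest2.txt, and llmDigest3.txt in turn, and basePath itself is never modified, which matches the behavior the tip describes.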

