GetSiteMapUrls walks a site's sitemap.xml so you can discover pages without hand-coding URLs, while GetHydratedData and GetLLMDigest pull structured data and an LLM-friendly summary out of a page you visit. This tutorial crawls a sitemap, captures a modern framework's hydration data, and saves an LLM digest, using GPALFile.Next so that each page can get its own digest file.
Here's the whole workflow, start to finish. Each piece is broken down and explained below.
using System.Collections.Generic;
using GenerallyPositive;
using GenerallyPositive.Browser;
using static GenerallyPositive.Enums;

// Global GPAL settings: console logging, driver folder, OttoMagic folder.
GPAL
    .WithPublishToConsole()
    .WithDriverLocation(@"C:\drivers")
    .WithUseOttoMagic(@"C:\OttoMagic");

// Chrome browser that honors robots rules and combines several stealth flags.
IBrowser browser = GPAL.Browser
    .WithBrowserType(BrowserType.Chrome)
    .WithUseStealth(StealthType.GoogleReferrer | StealthType.PatchChromeDriver | StealthType.DarkMode)
    .WithUseAutomationEngine(AutomationEngine.OttoMagic)
    .WithRespectRobotMetaTags(true)
    .WithObeyRobotsTxt(true)
    .ToGPALObject();

string siteUrl = "Nike.com";

// Level 1: read the site's top-level sitemap.
List<string> sitemapUrls;
browser.GoTo(siteUrl).WaitFor(2_000)
    .GetSiteMapUrls(out sitemapUrls);

foreach (string url in sitemapUrls)
{
    // Level 2: each entry may itself be a sitemap, so read it as well.
    List<string> nestedUrls;
    browser.GoTo(url).WaitFor(2_000)
        .GetSiteMapUrls(out nestedUrls);

    foreach (string nestedUrl in nestedUrls)
    {
        browser.GoTo(nestedUrl).WaitFor(5_000);
        GPAL.PublishSimpleEvent(GPALEventType.INFO, $"Visited [{nestedUrl}]");
    }
}

// Capture hydration data and an LLM digest for the site's landing page.
NextJsHydrationResult nextJsHydrationResult;
LLMDigestResult digestResult;
GPALFile hydratedDataFile = @"c:\sdi\hydratedData.txt";
GPALFile llmDigestFile = @"c:\sdi\llmDigest.txt";

browser.GoTo(siteUrl).WaitFor(5_000)
    .GetHydratedData(out nextJsHydrationResult);
browser.SaveHydratedData(hydratedDataFile);

browser
    .GetLLMDigest(out digestResult)
    .SaveLLMDigest(llmDigestFile.Next);

browser.Close(true);
GetSiteMapUrls reads the current page's sitemap (sitemap.xml or whatever robots.txt points to) and returns the URLs it lists. Top-level sitemaps often list other sitemaps - category or product feeds - rather than pages directly, so calling GetSiteMapUrls again on each result walks down a level. This nested loop visits every URL two levels deep.
List<string> sitemapUrls;
browser.GoTo(siteUrl).WaitFor(2_000)
    .GetSiteMapUrls(out sitemapUrls);
foreach (string url in sitemapUrls)
{
    List<string> nestedUrls;
    browser.GoTo(url).WaitFor(2_000)
        .GetSiteMapUrls(out nestedUrls);
    foreach (string nestedUrl in nestedUrls)
    {
        browser.GoTo(nestedUrl).WaitFor(5_000);
        GPAL.PublishSimpleEvent(GPALEventType.INFO, $"Visited [{nestedUrl}]");
    }
}
A site's top-level sitemap.xml frequently lists other sitemaps (products.xml, categories.xml, and so on) instead of pages. Call GetSiteMapUrls again on each entry to walk down to the actual page URLs.
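To make the index-versus-pages distinction concrete, here is a minimal sketch of the underlying sitemaps.org XML format, written against the .NET base library only. It is not how GetSiteMapUrls is implemented (that isn't shown in this tutorial); the sample XML, the example.com URLs, and the helper names IsSitemapIndex and ReadLocs are all illustrative assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// XML namespace defined by the sitemaps.org protocol.
XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

// A sitemap index: every <loc> points at another sitemap, not at a page.
string indexXml = @"<sitemapindex xmlns=""http://www.sitemaps.org/schemas/sitemap/0.9"">
  <sitemap><loc>https://example.com/products.xml</loc></sitemap>
  <sitemap><loc>https://example.com/categories.xml</loc></sitemap>
</sitemapindex>";

// A regular sitemap: every <loc> is an actual page URL.
string pagesXml = @"<urlset xmlns=""http://www.sitemaps.org/schemas/sitemap/0.9"">
  <url><loc>https://example.com/shoes</loc></url>
</urlset>";

// Hypothetical helper: true when the root element is <sitemapindex>.
bool IsSitemapIndex(string xml) =>
    XDocument.Parse(xml).Root?.Name == ns + "sitemapindex";

// Hypothetical helper: collects every <loc> value, for either kind of document.
List<string> ReadLocs(string xml) =>
    XDocument.Parse(xml)
        .Descendants(ns + "loc")
        .Select(loc => loc.Value.Trim())
        .ToList();

Console.WriteLine(IsSitemapIndex(indexXml));   // True
Console.WriteLine(IsSitemapIndex(pagesXml));   // False
Console.WriteLine(string.Join(", ", ReadLocs(indexXml)));
// https://example.com/products.xml, https://example.com/categories.xml
```

The root element is what tells the two apart, which is why a second pass is needed whenever the first pass returns sitemap URLs rather than page URLs.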
WithRespectRobotMetaTags and WithObeyRobotsTxt tell GPAL to honor a site's crawling rules - robots meta tags on individual pages and the site-wide robots.txt - the same way a well-behaved crawler would. WithUseStealth applies a combination of techniques so the automated browser looks more like a normal visit; StealthType is a flags enum, so OR several together with |.
IBrowser browser = GPAL.Browser
    .WithBrowserType(BrowserType.Chrome)
    .WithUseStealth(StealthType.GoogleReferrer | StealthType.PatchChromeDriver | StealthType.DarkMode)
    .WithUseAutomationEngine(AutomationEngine.OttoMagic)
    .WithRespectRobotMetaTags(true)
    .WithObeyRobotsTxt(true)
    .ToGPALObject();
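If flags enums are unfamiliar, the following sketch shows how combining values with | behaves in general. StealthType's own definition isn't shown in this tutorial, so the example uses System.IO.FileAttributes, a flags enum from the .NET base library, as a stand-in; nothing here calls GPAL.

```csharp
using System;
using System.IO;

// FileAttributes is a [Flags] enum, so each member occupies its own bit
// and several members can be combined into one value with |.
FileAttributes combined = FileAttributes.ReadOnly | FileAttributes.Hidden;

// HasFlag checks whether a particular member is part of the combination.
bool hasHidden = combined.HasFlag(FileAttributes.Hidden);   // true
bool hasSystem = combined.HasFlag(FileAttributes.System);   // false

// Masking with & ~ removes one member and leaves the others in place.
FileAttributes withoutHidden = combined & ~FileAttributes.Hidden;

Console.WriteLine(combined);        // ReadOnly, Hidden
Console.WriteLine(withoutHidden);   // ReadOnly
```

The same reading applies to the WithUseStealth call above: the single argument carries all three StealthType members at once.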
Many sites built with frameworks like Next.js fetch data on the server and embed it in the page as JSON, which the client-side code then uses to "hydrate" the page on load. GetHydratedData reads that embedded JSON into a NextJsHydrationResult, and SaveHydratedData writes it to a GPALFile. Hydration happens after the document is otherwise ready, so this step uses a longer WaitFor (5 seconds here, versus 2 seconds for the sitemap requests) than a typical navigation.
NextJsHydrationResult nextJsHydrationResult;
GPALFile hydratedDataFile = @"c:\sdi\hydratedData.txt";
browser.GoTo(siteUrl).WaitFor(5_000)
    .GetHydratedData(out nextJsHydrationResult);
browser.SaveHydratedData(hydratedDataFile);
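For a sense of what "embedded JSON" looks like, here is a rough sketch based on one common convention: Next.js pages built with the Pages Router place their data in a script tag with the id __NEXT_DATA__. The sample HTML and the title field are made up for illustration, and this is not a description of how GetHydratedData works internally; other frameworks and newer Next.js setups may embed their data differently.

```csharp
using System;
using System.Text.Json;
using System.Text.RegularExpressions;

// Made-up page source containing a Next.js-style hydration payload.
string html = @"<html><body><div id=""__next""></div>
<script id=""__NEXT_DATA__"" type=""application/json"">{""props"":{""pageProps"":{""title"":""Shoes""}},""page"":""/""}</script>
</body></html>";

// Capture everything between the opening and closing script tags.
Match match = Regex.Match(
    html,
    @"<script[^>]*id=""__NEXT_DATA__""[^>]*>(?<json>.*?)</script>",
    RegexOptions.Singleline);

string title = "";
if (match.Success)
{
    // Parse the captured text as JSON and drill down to one value.
    using JsonDocument doc = JsonDocument.Parse(match.Groups["json"].Value);
    title = doc.RootElement
        .GetProperty("props")
        .GetProperty("pageProps")
        .GetProperty("title")
        .GetString() ?? "";
}

Console.WriteLine(title);   // Shoes
```

A regular expression is enough for a fixed sample like this one; against real pages, an HTML parser would be the more robust choice.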
GetLLMDigest produces a condensed, LLM-friendly summary of the current page as an LLMDigestResult, and SaveLLMDigest writes it out. llmDigestFile.Next returns a new GPALFile pointing at the next available filename in sequence (llmDigest1.txt, llmDigest2.txt, ...) without changing llmDigestFile itself - call it fresh inside a loop to give every page its own digest file.
LLMDigestResult digestResult;
GPALFile llmDigestFile = @"c:\sdi\llmDigest.txt";
browser
    .GetLLMDigest(out digestResult)
    .SaveLLMDigest(llmDigestFile.Next);
llmDigestFile.Next is a fresh GPALFile each time you read it - llmDigestFile itself still points at llmDigest.txt. Reading .Next inside a loop gives you llmDigest1.txt, llmDigest2.txt, and so on, so earlier digests aren't overwritten.
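The numbering behavior described above can be pictured with a small stand-in written in plain .NET. GPALFile's actual implementation isn't shown in this tutorial, so NextAvailable below is a hypothetical helper that only mimics the described effect (the first unused llmDigestN.txt name, with the base path left unchanged); the temp folder and page labels are likewise assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of GPAL: returns the first "<stem><n><ext>"
// path that does not exist yet, without modifying the base path.
string NextAvailable(string basePath)
{
    string dir = Path.GetDirectoryName(basePath) ?? "";
    string stem = Path.GetFileNameWithoutExtension(basePath);
    string ext = Path.GetExtension(basePath);
    for (int i = 1; ; i++)
    {
        string candidate = Path.Combine(dir, $"{stem}{i}{ext}");
        if (!File.Exists(candidate)) return candidate;
    }
}

// Work in a throwaway temp folder so the sketch leaves nothing behind.
string workDir = Path.Combine(Path.GetTempPath(), "digest-sketch-" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(workDir);
string baseFile = Path.Combine(workDir, "llmDigest.txt");

var written = new List<string>();
foreach (string page in new[] { "page A", "page B", "page C" })
{
    // Re-evaluated on every pass, so each page gets a new file name.
    string target = NextAvailable(baseFile);
    File.WriteAllText(target, $"digest for {page}");
    written.Add(Path.GetFileName(target));
}

Console.WriteLine(string.Join(", ", written));
// llmDigest1.txt, llmDigest2.txt, llmDigest3.txt
Directory.Delete(workDir, recursive: true);
```

Evaluating the name once before the loop and reusing it would write every digest to the same file, which is the overwrite the note above warns about.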