Published on January 13, 2025

How to Enrich Customer Data with LLMs and Web Crawling

How to Enrich Customer Data with LLMs and Web Crawling

One of the most common steps SaaS products lose customers is through the onboarding process. This usually happens when the onboarding process is too complex, requires too much information, or just takes too long before the customer can even try the product.

A common solution to this problem has always been to personalize the onboarding process. However, this has always been an expensive and complex process up until recently when Large Language Models (LLMs) have become more accessible at very low costs.

In this post, I will walk you through a simple setup that can help extract business information from a customer’s website that you can use to personalize the onboarding process.

This process is made up of 3 main steps:

  1. Crawling the website
  2. Converting the HTML to markdown
  3. Extracting business information with LLMs

Step 1: Installing Dependencies

To start, you need to install the necessary dependencies. These include turndown for converting HTML to Markdown, puppeteer for web crawling, and ai, @ai-sdk/openai, and zod for interfacing with LLMs. You can install these using npm or yarn:

npm install turndown puppeteer @types/turndown @types/puppeteer ai @ai-sdk/openai zod

Once you have the dependencies installed, you can start the process.

Step 2: Crawling the Website

The first step in our process is to crawl the customer’s website. We use Puppeteer for this task, which allows us to automate browser actions and retrieve the page content. Here’s how you can set up the function to crawl a website:

import puppeteer, { Browser, Page } from 'puppeteer';

/**
 * Crawls a website and returns the HTML content.
 * @param url - The URL of the website to crawl.
 * @returns The HTML content of the website.
 */
async function crawlWebsite(url: string): Promise<string> {
  let browser: Browser | null = null;
  try {
    browser = await puppeteer.launch();
    const page: Page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });
    const content = await page.content();
    return content;
  } catch (error) {
    console.error('Error crawling website:', error);
    return '';
  } finally {
    if (browser) await browser.close();
  }
}

This function launches a browser, navigates to the specified URL, and waits until the network is idle before retrieving the page content. If there’s an error, it logs the error and returns an empty string. Finally, it ensures the browser is closed properly.

Step 3: Converting HTML to Markdown

Once we have the HTML content, we need to convert it to Markdown. This step makes the content more manageable for the LLM to process and strips out any unnecessary HTML tags that would otherwise saturate the LLM’s context window. We use the turndown library for this conversion:

import TurndownService from 'turndown';

/**
 * Converts HTML to Markdown.
 * @param html - The HTML content to convert.
 * @returns The Markdown content.
 */
function htmlToMarkdown(html: string): string {
  const turndownService = new TurndownService();
  return turndownService.turndown(html);
}

This function creates a new instance of TurndownService and uses its turndown method to convert the HTML to Markdown.

Step 4: Extracting Business Information with LLM

Now, we use an LLM to extract relevant business information from the Markdown content. First, we need to define a schema for the data we want to extract. Of course, I’m using a generic schema here, but you can customize it to your needs.

import { z } from 'zod';

const BusinessInfoSchema = z.object({
  name: z.string().describe('The official name of the SaaS business'),
  field: z
    .string()
    .describe('The industry or sector the SaaS product operates in'),
  description: z
    .string()
    .describe(
      'A brief summary of what the SaaS product does and its main value proposition'
    ),
  pricingPlans: z
    .array(
      z.object({
        name: z.string().describe('The name of the pricing tier'),
        price: z
          .string()
          .describe(
            'The cost of the plan, including currency and billing frequency if available'
          ),
        features: z
          .array(z.string())
          .describe(
            'A list of key features or benefits included in this pricing tier'
          )
      })
    )
    .describe('An array of pricing plans offered by the SaaS product')
});

type BusinessInfo = z.infer<typeof BusinessInfoSchema>;

With the schema defined, we can create a function to extract the business information:

import { generateObject } from 'ai';
import { createOpenAI } from '@ai-sdk/openai';

const openai = createOpenAI({
  apiKey: process.env.OPENAI_API_KEY
});

/**
 * Extracts business information from the given Markdown content.
 * @param markdown - The Markdown content to extract information from.
 * @returns The extracted business information.
 */
async function extractBusinessInfo(markdown: string): Promise<BusinessInfo> {
  const prompt = `
    Extract business information from the following website content:
    ${markdown}

    Please provide the business name, field, a short description, and pricing plans.
  `;

  const { object } = await generateObject({
    model: openai('gpt-4o'),
    schema: BusinessInfoSchema,
    prompt: prompt
  });

  return object;
}

This function uses the AI SDK to generate a structured object based on the Markdown content and our defined schema. The LLM processes the prompt and returns the extracted information in the specified format.

Note: You can use any LLM provider you want, but I’m using OpenAI here. Just make sure to add the OPENAI_API_KEY to your environment variables.

Step 5: Tying It All Together

Finally, we combine all these steps into a main function that automates the entire process:

/**
 * Analyzes a SaaS website and extracts business information.
 * @param url - The URL of the SaaS website to analyze.
 * @returns The extracted business information.
 */
async function analyzeSaaSWebsite(url: string): Promise<BusinessInfo> {
  const html = await crawlWebsite(url);
  const markdown = htmlToMarkdown(html);
  const businessInfo = await extractBusinessInfo(markdown);
  return businessInfo;
}

This function takes a URL and an LLM model as inputs, crawls the website, converts the content to Markdown, and extracts the business information.

Example Output

Here’s an example of what the output might look like after processing a SaaS website:

{
  "name": "Example SaaS",
  "field": "Project Management",
  "description": "Example SaaS provides a comprehensive project management solution for teams of all sizes.",
  "pricingPlans": [
    {
      "name": "Basic",
      "price": "$9.99/month",
      "features": ["Task Management", "Team Collaboration", "Basic Reporting"]
    },
    {
      "name": "Pro",
      "price": "$29.99/month",
      "features": [
        "Advanced Reporting",
        "Time Tracking",
        "Integration with Third-Party Apps"
      ]
    },
    {
      "name": "Enterprise",
      "price": "Contact for pricing",
      "features": [
        "Customizable Workflows",
        "Dedicated Support",
        "On-Premise Deployment"
      ]
    }
  ]
}

This output provides a clear, structured view of the business information extracted from the website, which can be invaluable for SaaS developers looking to understand and serve their customers better.

Final Thoughts

Of course, this is a very basic example just to get you started. You can go a lot deeper, crawling more pages, extracting very specific information that you need, and even tapping into other sources of data to enrich the information you have.

In fact, I have written a separate post on creating AI agents in Node with AI SDK and a separate post on using postgres and pgvector for RAG applications that could be additional building blocks to build even more complex and powerful agentic flows for your SaaS product.

Finally, what you do with the information is up to you. You can use it to pre-populate your onboarding form, create custom onboarding flows, or even use it personalize email sequences for each customer.