Source Types

AI / LLM

LLMs (large language models, such as ChatGPT) can be used in a variety of ways in content analysis; one of the most useful is categorizing or summarizing content. Using LLMs at scale can be expensive, so right-size the sophistication, cost, and energy efficiency of the model to the task.

Fields: Audience, Author, Content Type, Tone, Topic, Voice
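
For example, here is a minimal sketch of LLM-based topic categorization using the OpenAI Python SDK; the model name and topic list are assumptions, and any LLM API would work similarly.

```python
# A minimal sketch of LLM-based categorization; the topic list is hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TOPICS = ["Products", "Support", "Careers", "News", "Other"]  # hypothetical values

def categorize_topic(text: str) -> str:
    """Ask a small (cheap) model to pick one Topic value for a page."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # right-sized: a small model is enough for this task
        messages=[
            {"role": "system",
             "content": f"Classify the page into exactly one of: {', '.join(TOPICS)}. "
                        "Reply with the category name only."},
            {"role": "user", "content": text[:4000]},  # truncate to control cost
        ],
    )
    return response.choices[0].message.content.strip()
```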

Advanced Algorithms

Some algorithms, such as locality-sensitive hashes, use probabilistic or other means to compute field values. There are also some non-LLM models and algorithms for categorization, although these tend to be useful only for categorizing into very common values, such as standard news topics.

Fields: Has [Problem], Near Text Duplicate, [Problem] Count, [Problem] Example, Topic
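
As an illustration, here is a minimal sketch of a locality-sensitive hash (SimHash) for the Near Text Duplicate field, using only the Python standard library; production systems typically use a tuned LSH library instead.

```python
# A minimal SimHash sketch: similar texts produce similar fingerprints.
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """Compute a locality-sensitive fingerprint of the text."""
    weights = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    """Count differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

a = "Contact our support team for help with your order."
b = "Contact our support team for any help with your order."
# A small distance suggests the two texts are near duplicates.
print(hamming_distance(simhash(a), simhash(b)))
```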

Analytics

Perhaps the most common fields to add are web analytics measures such as page views. These need to come from an analytics tool such as Adobe Analytics, Google Analytics, or Matomo.

Fields: Page Views, [Success Event] Count, URL
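
As a sketch, assuming the analytics tool can export page views as a CSV (the file name and column names here are hypothetical), the export can be joined to the content inventory by URL:

```python
# A minimal sketch: load an analytics export and look up Page Views by URL.
import csv

page_views: dict[str, int] = {}
with open("analytics_export.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        # Normalize URLs so they match the inventory's URL field.
        url = row["url"].rstrip("/").lower()
        page_views[url] = page_views.get(url, 0) + int(row["pageviews"])

def lookup_page_views(inventory_url: str) -> int:
    """Return page views for an inventory URL, 0 if the analytics tool saw none."""
    return page_views.get(inventory_url.rstrip("/").lower(), 0)
```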

CMS

Although it depends on how structured the information is within a CMS and how easy that information is to access, a CMS can often provide useful metadata about the content (it is, after all, the system that defines the content).

Fields: Author, Content Type, Date Published, Division, File Format, Meta Keywords, Site Section, Source System, Title, Topic, URL, Unique Content ID
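
For example, here is a minimal sketch that pulls post metadata from the WordPress REST API; the site URL is hypothetical, and other CMSes expose similar metadata through their own APIs or exports.

```python
# A minimal sketch: pull content metadata from a WordPress site's REST API.
import requests

site = "https://example.com"  # hypothetical site
resp = requests.get(f"{site}/wp-json/wp/v2/posts", params={"per_page": 100})
resp.raise_for_status()

for post in resp.json():
    print({
        "Unique Content ID": post["id"],
        "Title": post["title"]["rendered"],
        "Date Published": post["date"],
        "URL": post["link"],
    })
```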

Content Quality Tool

Content quality can be assessed across many dimensions, and there are dedicated content quality tools for measuring them.

Fields: Has [Problem], [Problem] Count, Reading Level, Tone
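
If no dedicated tool is available, a Reading Level field can be approximated in code. Here is a minimal sketch using the Flesch-Kincaid grade formula with a rough syllable heuristic; dedicated tools are more accurate.

```python
# A minimal sketch of a Reading Level field via the Flesch-Kincaid grade formula.
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    """Flesch-Kincaid grade: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n = max(1, len(words))
    return 0.39 * (n / sentences) + 11.8 * (syllables / n) - 15.59

print(round(fk_grade("The cat sat on the mat. It was happy."), 1))  # simple text scores low
```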

Crawler

A crawler follows all the links of a site to discover its URLs and pull basic data out of each page. A crawler can provide information that no other tool can, such as crawl depth (the number of clicks from the home page to a given page).

Fields: Crawl Depth, File Format, MIME Type, Meta Description, Meta Keywords, Title, URL
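
Here is a minimal sketch of a breadth-first crawler that records Crawl Depth for each URL, assuming the requests and BeautifulSoup libraries; a real crawler would also respect robots.txt and rate limits.

```python
# A minimal breadth-first crawler sketch that tracks Crawl Depth per URL.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_pages: int = 100) -> dict[str, int]:
    """Return {url: crawl_depth} for pages reachable from start_url."""
    depths = {start_url: 0}
    queue = deque([start_url])
    host = urlparse(start_url).netloc
    while queue and len(depths) < max_pages:
        url = queue.popleft()
        try:
            page = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        for a in BeautifulSoup(page, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == host and link not in depths:
                depths[link] = depths[url] + 1  # one click deeper than its parent
                queue.append(link)
    return depths
```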

Formula

Formulas can be used to derive values from other fields, for instance pulling URL "folders" out of the URL field.

Fields: Content Type, Date Published, Disposition, Division, Effort, File Format, File Group, Folder1, Has [Problem], Page Views, [Problem] Count, Site, Site Type, Source System, Unique Content ID
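
For example, here is a minimal sketch of a formula-style derived field that pulls Folder1 out of the URL field using the Python standard library:

```python
# A minimal sketch of a derived field: extract the first URL "folder" (Folder1).
from urllib.parse import urlparse

def folder1(url: str) -> str:
    """Return the first path segment of the URL, or "" for the home page."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[0] if segments else ""

print(folder1("https://example.com/blog/2021/post-title"))  # "blog"
```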

Manual Review

This is the type of review almost everyone presumes: someone manually reviews each piece of content and sets its field values. Manual review may be required at times, but, unless you have a small set of content, you should try to automate so you can work more efficiently and iterate and improve over time.

Fields: [Category] Revenue, Division, Has [Problem], [Problem] Count, [Problem] Example, Redundant, Site Type, [Target] Field, Tone, Topic, Voice

Maps & Rules

A map is a set of from → to pairs: for example, if a URL contains "/blog/" (the "from"), then the Content Type is "Blog Post" (the "to"). Rules can be more sophisticated, such as "if it's a Blog Post and over ten years old, then delete it". Maps and rules let you categorize content quickly and efficiently (and re-run the categorization often as you iterate).

Fields: Bucket, [Category] Revenue, Content Type, Disposition, Division, Effort, File Group, Has [Problem], [Problem] Count, Site, Site Section, Site Type, Source System, [Target] Field
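
Here is a minimal sketch with hypothetical values: a map turns URL patterns (the "from") into Content Type values (the "to"), and a rule combines fields to set a Disposition.

```python
# A minimal sketch of maps and rules; the patterns and values are hypothetical.
from datetime import date

CONTENT_TYPE_MAP = {          # from -> to
    "/blog/": "Blog Post",
    "/news/": "News Article",
    "/support/": "Support Article",
}

def map_content_type(url: str) -> str:
    """Map a URL pattern to a Content Type value."""
    for pattern, content_type in CONTENT_TYPE_MAP.items():
        if pattern in url:
            return content_type
    return "Unknown"

def disposition_rule(content_type: str, date_published: date) -> str:
    """Rule: blog posts over ten years old get deleted."""
    if content_type == "Blog Post" and (date.today() - date_published).days > 3650:
        return "Delete"
    return "Keep"
```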

Scraper

You can often scrape information out of pages, assuming some level of consistency across them. Scraping uses patterns such as XPath expressions and/or regular expressions. Note that you can use an LLM to help you define the patterns (without then running the LLM against every piece of content).

Note: a crawler follows links to get the URLs and basic information about all the pages of a site or site section, while a scraper pulls arbitrary information out of individual pages.

Fields: Author, Content Type, Date Published, Division, Has [Problem], [IA] Depth, [Problem] Count, [Problem] Example, Site, Site Section, Source System, Title, Topic
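
For example, here is a minimal sketch that scrapes an Author field with an XPath expression, assuming the lxml library and a conventional <meta name="author"> tag; pages without that tag would need a different pattern.

```python
# A minimal scraper sketch: pull an Author field out of a page via XPath.
import requests
from lxml import html

def scrape_author(url: str) -> str:
    """Return the page's author from its meta tag, or "" if absent."""
    tree = html.fromstring(requests.get(url, timeout=10).content)
    values = tree.xpath('//meta[@name="author"]/@content')
    return values[0].strip() if values else ""
```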