LLMs (Large Language Models, like ChatGPT) can be used in a variety of ways in content analysis; one particularly useful application is categorizing or summarizing content. Using LLMs at scale can be expensive, so the sophistication, cost, and energy efficiency of the model should be right-sized to the task.
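For illustration, here is a minimal sketch of LLM categorization using the OpenAI Python client. The model name, topic list, and prompt are placeholders to adapt to your own fields, not a recommendation:

```python
# Minimal sketch: categorizing one piece of content with an LLM.
# The model name, topic list, and prompt are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

TOPICS = ["Products", "Support", "Careers", "News", "Other"]  # hypothetical

def categorize(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # right-size the model to the task
        messages=[
            {"role": "system",
             "content": ("Classify the content into exactly one topic: "
                         f"{', '.join(TOPICS)}. Reply with the topic only.")},
            {"role": "user", "content": text[:4000]},  # truncate to control cost
        ],
    )
    return response.choices[0].message.content.strip()
```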
Fields: Audience, Author, Content Type, Tone, Topic, Voice

There are some algorithms, like locality-sensitive hashes, that use probabilistic or other means to compute field values. In addition, there are some non-LLM models and algorithms for categorization (although these tend to be less useful except for categorizing to very common values, like common news topics).
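As a rough illustration of the hashing approach, here is a small pure-Python MinHash sketch (one locality-sensitive hashing technique) for estimating near-duplicate similarity between two texts; a production setup would use a dedicated library:

```python
# Sketch of near-duplicate detection with MinHash: similar texts
# produce similar signatures, approximating Jaccard similarity.
import hashlib

def shingles(text: str, k: int = 5) -> set[str]:
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(text: str, num_hashes: int = 64) -> list[int]:
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(text)
        ))
    return signature

def similarity(a: str, b: str) -> float:
    sa, sb = minhash(a), minhash(b)
    return sum(x == y for x, y in zip(sa, sb)) / len(sa)
```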
Fields: Has [Problem], Near Text Duplicate, [Problem] Count, [Problem] Example, Topic

Perhaps the most common fields to add are web analytics measures like page views. These need to come from analytics tools such as Adobe Analytics, Google Analytics, or Matomo.
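Whatever the tool, the usual pattern is to export page views and join them onto the inventory by URL. A minimal sketch with pandas, assuming hypothetical file and column names (real exports from these tools will differ):

```python
# Sketch: joining a page-views export from an analytics tool onto a
# content inventory by URL. File and column names are assumptions.
import pandas as pd

inventory = pd.read_csv("inventory.csv")          # must contain a "URL" column
analytics = pd.read_csv("pageviews_export.csv")   # e.g. columns: URL, Page Views

inventory = inventory.merge(analytics, on="URL", how="left")
inventory["Page Views"] = inventory["Page Views"].fillna(0).astype(int)
inventory.to_csv("inventory_with_views.csv", index=False)
```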
Fields: Page Views, [Success Event] Count, URL

Although it depends on how structured the information is within a CMS and how easy it is to access that information, CMSes can often provide useful metadata about the content (it is, after all, the system defining the content).
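As one example, here is a sketch that pulls basic metadata from a WordPress site via its standard REST API; this assumes that API is enabled, and other CMSes expose different APIs (or none at all):

```python
# Sketch: pulling basic content metadata from a WordPress REST API.
# Assumes the standard /wp-json/wp/v2 endpoints are publicly enabled.
import requests

url = "https://example.com/wp-json/wp/v2/posts"
for post in requests.get(url, params={"per_page": 100}, timeout=30).json():
    print(post["link"], post["date"], post["title"]["rendered"])
```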
Fields: Author, Content Type, Date Published, Division, File Format, Meta Keywords, Site Section, Source System, Title, Topic, URL, Unique Content ID

Content quality can be assessed across many dimensions, and there are some dedicated content quality tools.
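Some quality fields can also be computed directly. For instance, a rough Reading Level can be derived with the Flesch-Kincaid grade formula; this is a naive sketch with a deliberately crude syllable counter, and a dedicated tool or library will be more accurate:

```python
# Sketch: computing a Reading Level field with the Flesch-Kincaid
# grade formula. The syllable counter is deliberately naive.
import re

def count_syllables(word: str) -> int:
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n = max(1, len(words))
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (n / sentences) + 11.8 * (syllables / n) - 15.59
```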
Fields: Has [Problem], [Problem] Count, Reading Level, Tone

A crawler follows all the links of a site to find its URLs and pull basic data from each page. A crawler has some information that no other tool can provide, such as crawl depth.
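A minimal sketch of a breadth-first crawler that records crawl depth (link distance from the start page), assuming requests and beautifulsoup4 are installed; a real crawler also needs robots.txt handling, throttling, and error handling:

```python
# Sketch: breadth-first crawl of one site, recording crawl depth.
from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def crawl(start_url: str, max_depth: int = 3) -> dict[str, int]:
    depths = {start_url: 0}
    queue = deque([start_url])
    host = urlparse(start_url).netloc
    while queue:
        url = queue.popleft()
        if depths[url] >= max_depth:
            continue
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"]).split("#")[0]
            if urlparse(link).netloc == host and link not in depths:
                depths[link] = depths[url] + 1  # one hop deeper than its parent
                queue.append(link)
    return depths  # URL -> crawl depth
```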
Fields: Crawl Depth, File Format, MIME Type, Meta Description, Meta Keywords, Title, URL

Formulas can be used to derive values from other fields, for instance pulling out URL "folders" from URLs.
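For example, a sketch of deriving a Folder1 value (taken here to mean the first path segment) from a URL field, the kind of formula a spreadsheet could also express:

```python
# Sketch: deriving a "Folder1" field from a URL.
from urllib.parse import urlparse

def folder1(url: str) -> str:
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[0] if segments else ""

print(folder1("https://example.com/blog/2024/post-title"))  # -> "blog"
```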
Fields: Content Type, Date Published, Disposition, Division, Effort, File Format, File Group, Folder1, Has [Problem], Page Views, [Problem] Count, Site, Site Type, Source System, Unique Content ID

Manual review is the type of review almost everyone presumes: someone reviews each piece of content by hand to define its field values. This may be required at times, but, unless you have a small set of content, you should try to automate, both to be more efficient and to iterate and improve over time.
Fields: [Category] Revenue, Division, Has [Problem], [Problem] Count, [Problem] Example, Redundant, Site Type, [Target] Field, Tone, Topic, Voice

A map is a set of from → to pairs: for example, if a URL contains "/blog/" (the "from"), then the Content Type is "Blog Post" (the "to"). Rules can be more sophisticated, like "if it's a Blog Post and over ten years old, then delete it". Maps and rules let you categorize content very quickly and efficiently (and reiterate often).
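A sketch of how a map and a rule might look in code; the URL fragments, content types, and cutoff are illustrative placeholders:

```python
# Sketch: a map (from -> to pairs) plus one rule, applied per content item.
from datetime import datetime, timedelta

CONTENT_TYPE_MAP = {          # "from" -> "to"
    "/blog/": "Blog Post",
    "/news/": "News Article",
    "/docs/": "Documentation",
}

def content_type(url: str) -> str:
    for fragment, value in CONTENT_TYPE_MAP.items():
        if fragment in url:
            return value
    return "Other"

def disposition(url: str, published: datetime) -> str:
    # Rule from the text: a Blog Post over ten years old gets deleted.
    ten_years = timedelta(days=3652)
    if content_type(url) == "Blog Post" and datetime.now() - published > ten_years:
        return "Delete"
    return "Keep"
```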
Fields: Bucket, [Category] Revenue, Content Type, Disposition, Division, Effort, File Group, Has [Problem], [Problem] Count, Site, Site Section, Site Type, Source System, [Target] Field

You can often scrape information out of pages, assuming some level of consistency across pages. The scraping uses patterns like XPath and/or regex. Note that you can use an LLM to help you define the patterns (rather than running the LLM against every piece of content), as shown in the sketch after the note below.
Note: a crawler follows links to get the URLs and basic information about all the pages of a site or site section. A scraper pulls arbitrary information out of pages.
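A sketch of scraping a few fields with XPath (via lxml) plus a regex; the expressions assume a hypothetical byline format and markup, which is exactly the sort of pattern an LLM can help you draft:

```python
# Sketch: scraping fields from an already-fetched page with XPath and regex.
# The XPath expressions and byline format are assumptions about the markup.
import re
from lxml import html

def scrape(page_source: str) -> dict:
    tree = html.fromstring(page_source)
    title = tree.xpath("string(//h1)").strip()
    byline = tree.xpath("string(//*[@class='byline'])")
    match = re.search(r"[Bb]y\s+(.+?),\s*(\d{4}-\d{2}-\d{2})", byline)
    return {
        "Title": title,
        "Author": match.group(1) if match else "",
        "Date Published": match.group(2) if match else "",
    }
```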
Fields: Author, Content Type, Date Published, Division, Has [Problem], [IA] Depth, [Problem] Count, [Problem] Example, Site, Site Section, Source System, Title, Topic