In Content Analysis, whether we are using the traditional spreadsheet approach or a data warehouse, we enrich our analysis by adding fields in order to answer content questions or test hypotheses. There are different source types (web analytics, scraping, calculations, fields, manual review, rules, maps, algorithms, etc.) to fill in those fields.
ChatGPT may be able to help in many ways, but I want to distinguish between two use cases that are easy to confuse if we don't clarify what this article is about:
This article → Enriching a content inventory by adding fields to our analysis, especially those that are resistant to other means of automation when we need to do analysis at scale (ChatGPT as a new source type, adding at least one value per "row").
Potential future article → Using ChatGPT as a way of querying and summarizing a content inventory (ChatGPT as an interface to querying existing data, querying all "rows" to get a single answer).
Regardless of how we get the data, when enriching a content inventory by adding fields, we need to wind up with:
The field must have a value for each URL in the inventory (for instance, a page views field should have a value for every URL that received any page views).
The range of potential values should be constrained: page views are integers between zero and some number in the millions, file formats come from a limited set, and so on.
It may also make sense to sample values to get an initial feel for the range, but conceptually we need to be able to get a value for any and all items in the inventory. If we didn't have this requirement, we might as well just sample manually.
We want to end up with something like this:
| URL | Content Type | Journey Step | Audience |
| --- | --- | --- | --- |
| /news/2024/03/23/product_update | Press Release | Aware | General Public |
| /staff/david | Bio | Considering | General Public |
| /article/coding_in_lisp | Article | Customer | Programmer |
The easiest way to test is to get a ChatGPT Plus account, which lets you either refer to URLs or upload a file, making testing far easier (both required a Plus account as of March 23, 2024). Also, you'll need to select the GPT-4 model to be able to upload a file or access a URL.
Let's start with a simple prompt to get a Content Type (I copied the main text from https://davidhobbsconsulting.com/bio in Chrome, pasted it into a text editor, saved the file, and uploaded it using the paperclip next to the prompt entry in the ChatGPT UI):
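The prompt itself was something along these lines (illustrative wording, not the exact text from the screenshot):

```
What is the content type of the attached content?
```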
This result looks very promising ("professional bio" sounds about perfect as an initial content type), but we have way too much information.
As we saw above, in general we can't just blindly ask questions of ChatGPT and expect responses that are useful for content analysis. We need to provide it with two things:
Instructions: This is how you want ChatGPT to handle the eventual prompt. When creating and using a ChatGPT assistant, there is literally an Instructions field to fill out; alternatively, this can just be included as part of the prompt itself.
Context: This provides more background and structure for our data, such as defining what we mean by content type or defining the range of values that we are interested in getting back.
We do not want a bunch of sentences back. In both ChatGPT and Claude, you can add context like:
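For example, an instruction along these lines (illustrative wording, not the exact text I used):

```
Respond with only the content type, in a few words at most. Do not explain your answer.
```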
We may want to restrict the values to a set. In the simple case, we can just enumerate the values we are interested in as text:
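For instance, context along these lines (the specific values are illustrative and would come from your own content model):

```
Only respond with one of the following content types:
Press Release, Bio, Article, Product Page, Other
```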
Now we are getting somewhere. If we enter the above instruction and context example along with the same prompt we started with, we get:
If we give the instruction to respond with just one word and don't specify the desired possible values, we get:
Note that we can embed the instructions/context right into the prompt:
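A combined prompt might look something like this (again illustrative; the value list is an assumption):

```
What is the content type of the attached content? Respond with only the content type,
using one of the following values: Press Release, Bio, Article, Product Page, Other.
```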
Once we start moving into more subtle values like steps in a journey, we probably need to provide more and more context. Here is an initial prompt with context/instructions for getting the potential journey step(s) for an example piece of content (the University of Maryland's Visit UMD page):
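The prompt was along these lines (illustrative wording; the journey steps shown are just the example values used earlier in this article, not UMD's actual journey model):

```
We classify content by the journey step it supports: Aware, Considering, or Customer.
Which journey step(s) does the attached content (the Visit UMD page) best support?
Respond with only the journey step name(s), separated by commas.
```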
ChatGPT and other LLMs have great potential. Here are some factors we need to address:
Price. Especially at scale (see the complexity calculator), which is where automation is most useful, we need to consider the overall cost of using LLMs to categorize a lot of content, particularly since one of the reasons to automate is so we can iterate and tweak over time (a rough back-of-envelope sketch appears after this list).
Performance. LLMs are not always very fast at responding to queries. When we are evaluating a lot of assets, this may get time-consuming (again, especially if we want to iterate or test things over time).
Reliability. At least in my experience, responses can vary from run to run, which may mean the results are not reliable enough to be useful.
Functionality. In the end, we need to make sure that we get values that are actually useful in real-world examples across a bucket of content.
Throttling. Limits on making requests can reduce the effectiveness of using ChatGPT. In initial testing of automation using ChatGPT in Content Chimera, we quickly ran out of tokens even on a very small data set.
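On the price point, here is a minimal back-of-envelope sketch. All of the numbers are hypothetical assumptions, not actual API pricing or real inventory sizes; substitute current rates and your own measurements.

```python
# Rough cost estimate for LLM-based categorization of a content inventory.
# ALL numbers are hypothetical placeholders; substitute current API pricing
# and your own measurements before drawing any conclusions.

urls_in_inventory = 50_000    # size of the inventory (hypothetical)
avg_tokens_per_page = 1_500   # prompt + page text + response (hypothetical)
cost_per_1k_tokens = 0.01     # blended $ per 1K tokens (hypothetical)
iterations = 3                # expected re-runs as we tweak prompts over time

total_tokens = urls_in_inventory * avg_tokens_per_page * iterations
total_cost = total_tokens / 1_000 * cost_per_1k_tokens

print(f"Estimated tokens: {total_tokens:,}")
print(f"Estimated cost:   ${total_cost:,.2f}")
```

Even modest per-page token counts multiply quickly across tens of thousands of URLs, and the iteration factor matters as much as the per-call price.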
We are exploring adding this type of capability to Content Chimera. If we do so, we envision leveraging strengths already in Chimera:
Text pre-processing. Especially for near-text duplicate analysis, Chimera already does text pre-processing that may be useful for LLMs.
Rationalized data from multiple sources. Chimera can already pull data from multiple sources, and these could be leveraged with LLMs. For instance, we could pass not just the full text but also other fields to the LLM when categorizing.
Maps. Maps could be used as context to LLMs, for example to convert some values that the LLM may respond with, or to specify equivalent terms to the LLM.
More straightforward configuration of complex processes. Chimera is a power tool, but it does not require coding. For instance, the near text duplicate algorithm requires only a small amount of configuration, and it has already been tuned to run effectively without requiring the user to do that tuning.
Calculated fields. Not only could other fields be useful to the LLM, but a calculated field could decide whether to run the LLM at all. For instance, if the content type is exposed by the CMS then just use that value, and otherwise run the LLM (see the sketch after this list).
Over-time values. Values can be tracked over time.
A lot of context. Information ranging from the Fields and Dispositions DB to the stored meanings of fields could be automatically fed into an LLM for higher-quality results.
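As an illustration of the calculated-fields point above, here is a minimal sketch of a field that only falls back to the LLM when the CMS does not expose a content type. The function and field names (`classify_with_llm`, `cms_content_type`, `page_text`) are invented for the example, not part of Chimera or any particular API.

```python
from typing import Optional

def classify_with_llm(page_text: str) -> str:
    """Hypothetical placeholder: send the page text to an LLM and return
    a single content-type value (e.g. 'Bio' or 'Press Release')."""
    raise NotImplementedError

def content_type_for(row: dict) -> str:
    """Calculated field: prefer the CMS-provided content type, and only
    fall back to the (slower, costlier) LLM call when the CMS has no value."""
    cms_value: Optional[str] = row.get("cms_content_type")
    if cms_value:
        return cms_value
    return classify_with_llm(row["page_text"])
```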
Have you had success at using ChatGPT for content analysis? If so, how? Please reach out via LinkedIn.