In Content Analysis, whether we are using the traditional spreadsheet approach or a data warehouse, we enrich our analysis by adding fields in order to answer content questions or test hypotheses. There are different source types (web analytics, scraping, calculations, fields, manual review, rules, maps, algorithms, etc.) to fill in those fields.
ChatGPT may be able to help in many ways, but I want to distinguish between two use cases that are easy to confuse if we don't clarify what this article is about:
This article → Enriching a content inventory by adding fields to our analysis, especially those that are resistant to other means of automation when we need to do analysis at scale (ChatGPT as a new source type, adding at least one value per "row").
Potential future article → Using ChatGPT as a way of querying and summarizing a content inventory (ChatGPT as an interface to querying existing data, querying all "rows" to get a single answer).
Regardless of how we get the data, when enriching a content inventory by adding fields, we need to wind up with:
The field must have a value for each URL in the inventory (for instance, a page views field should have a value for every URL that received any page views).
The range of potential values should be constrained: page views are integers between zero and some number in the millions, file formats come from a limited set, and so on.
It may also make sense to sample values to get an initial feel for the range, but conceptually we need to be able to get a value for any and all items in the inventory. If we didn't have this requirement, we might as well just sample manually.
We want to end up with something like this:
| URL | Content Type | Journey Step | Audience |
| --- | --- | --- | --- |
| /news/2024/03/23/product_update | Press Release | Aware | General Public |
| /staff/david | Bio | Considering | General Public |
| /article/coding_in_lisp | Article | Customer | Programmer |
The easiest way to test is to get a ChatGPT Plus account, which lets you either refer to URLs or upload a file, making testing far easier (both required a Plus account as of March 23, 2024). Also, you'll need to select the GPT-4 model to be able to upload a file or access a URL.
Let's start with a simple prompt to get a Content Type (I copied the main text from https://davidhobbsconsulting.com/bio in Chrome, pasted it into a text editor, saved the file, and uploaded it using the paperclip next to the prompt entry in the ChatGPT UI):
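The prompt itself was something along these lines (illustrative wording, not the exact text from the screenshot):

```
What is the content type of the attached content?
```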
This result looks very promising ("professional bio" sounds about perfect as an initial content type), but we have way too much information.
As we saw above, in general we can't just blindly ask questions of ChatGPT and expect responses that are useful for content analysis. We need to provide it with two things:
Instructions: This is how you want ChatGPT to handle the eventual prompt. When creating and using a ChatGPT assistant, there is literally an Instructions field to fill out; alternatively, this can just be included as part of the prompt itself.
Context: This provides more background and structure for our data, such as defining what we mean by content type or defining the range of values that we are interested in getting back.
We do not want a bunch of sentences back. In both ChatGPT and Claude, you can add context like:
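For example, an instruction along these lines (illustrative wording, not the exact text I used):

```
Respond with only the content type, in a few words at most. Do not explain your answer.
```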
We may want to restrict the values to a set. In the simple case, we can just enumerate the values we are interested in as text:
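For instance, context along these lines (the specific values are illustrative and would come from your own content model):

```
Only respond with one of the following content types:
Press Release, Bio, Article, Product Page, Other
```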
Now we are getting somewhere. If we enter the above instruction and context example along with the same prompt we started with, we get:
If we give the instruction to respond with just one word and don't specify the desired possible values, we get:
Note that we can embed the instructions/context right into the prompt:
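A combined prompt might look something like this (again illustrative; the value list is an assumption):

```
What is the content type of the attached content? Respond with only the content type,
using one of the following values: Press Release, Bio, Article, Product Page, Other.
```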
Once we start moving into more subtle values like steps in a journey, we probably need to provide more and more context. Here is an initial prompt with context/instructions for getting the potential journey step(s) for an example piece of content (the University of Maryland's Visit UMD page):
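The prompt was along these lines (illustrative wording; the journey steps shown are just the example values used earlier in this article, not UMD's actual journey model):

```
We classify content by the journey step it supports: Aware, Considering, or Customer.
Which journey step(s) does the attached content (the Visit UMD page) best support?
Respond with only the journey step name(s), separated by commas.
```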
ChatGPT and other LLMs have great potential. Here are some factors we need to address:
Price. Especially at scale (see the complexity calculator), which is where automation is most useful, we need to consider the overall cost of using LLMs to categorize a lot of content, particularly since one of the reasons to automate is so we can iterate and tweak over time (a rough back-of-envelope sketch appears after this list).
Performance. LLMs are not always very fast at responding to queries. When we are evaluating a lot of assets, this may get time-consuming (again, especially if we want to iterate or test things over time).
Reliability. At least in my experience, responses can vary from run to run, which may mean the results are not reliable enough to be useful.
Functionality. In the end, we need to make sure that we get values that are actually useful in real-world examples across a bucket of content.
Throttling. Limits on making requests can reduce the effectiveness of using ChatGPT. In initial testing of automation using ChatGPT in Content Chimera, we quickly ran out of tokens even on a very small data set.
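On the price point, here is a minimal back-of-envelope sketch. All of the numbers are hypothetical assumptions, not actual API pricing or real inventory sizes; substitute current rates and your own measurements.

```python
# Rough cost estimate for LLM-based categorization of a content inventory.
# ALL numbers are hypothetical placeholders; substitute current API pricing
# and your own measurements before drawing any conclusions.

urls_in_inventory = 50_000    # size of the inventory (hypothetical)
avg_tokens_per_page = 1_500   # prompt + page text + response (hypothetical)
cost_per_1k_tokens = 0.01     # blended $ per 1K tokens (hypothetical)
iterations = 3                # expected re-runs as we tweak prompts over time

total_tokens = urls_in_inventory * avg_tokens_per_page * iterations
total_cost = total_tokens / 1_000 * cost_per_1k_tokens

print(f"Estimated tokens: {total_tokens:,}")
print(f"Estimated cost:   ${total_cost:,.2f}")
```

Even modest per-page token counts multiply quickly across tens of thousands of URLs, and the iteration factor matters as much as the per-call price.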
We are exploring adding this type of capability to Content Chimera. If we do so, we envision leveraging strengths already in Chimera:
Text pre-processing. Especially for near-text duplicate analysis, Chimera already does text pre-processing that may be useful for LLMs.
Rationalized data from multiple sources. Chimera can already pull data from multiple sources, and these could be leveraged with LLMs. For instance, we could pass not just the full text but also other fields to the LLM when categorizing.
Maps. Maps could be used as context to LLMs, for example to convert some values that the LLM may respond with, or to specify equivalent terms to the LLM.
More straightforward configuration of complex processes. Chimera is a power tool, but it does not require coding. For instance, the near text duplicate algorithm requires only a small amount of configuration, and it has already been tuned to run effectively without requiring the user to do that tuning.
Calculated fields. Not only could other fields be useful to the LLM, but a calculated field could decide whether to run the LLM at all. For instance, if the content type is exposed by the CMS then just use that value, and otherwise run the LLM (see the sketch after this list).
Over-time values. Values can be tracked over time.
A lot of context. Information ranging from the Fields and Dispositions DB to the stored meanings of fields could be automatically fed into an LLM for higher-quality results.
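As an illustration of the calculated-fields point above, here is a minimal sketch of a field that only falls back to the LLM when the CMS does not expose a content type. The function and field names (`classify_with_llm`, `cms_content_type`, `page_text`) are invented for the example, not part of Chimera or any particular API.

```python
from typing import Optional

def classify_with_llm(page_text: str) -> str:
    """Hypothetical placeholder: send the page text to an LLM and return
    a single content-type value (e.g. 'Bio' or 'Press Release')."""
    raise NotImplementedError

def content_type_for(row: dict) -> str:
    """Calculated field: prefer the CMS-provided content type, and only
    fall back to the (slower, costlier) LLM call when the CMS has no value."""
    cms_value: Optional[str] = row.get("cms_content_type")
    if cms_value:
        return cms_value
    return classify_with_llm(row["page_text"])
```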
Have you had success at using ChatGPT for content analysis? If so, how? Please reach out via LinkedIn.