
Scraper Content Analysis Fields

Source Type: Scraper

You can often scrape information out of pages, provided the pages are reasonably consistent with one another. Scraping relies on patterns such as XPath and/or regex. Note that you can use an LLM to help you define the patterns, without then running the LLM against every piece of content.
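As a rough illustration of pairing an XPath with a regex, the sketch below selects nodes with an XPath and then narrows the text with a regex. It is not how any particular tool implements scraping. The page fragment, the `scrape` function, and the byline format are all hypothetical, and it uses the limited XPath subset in Python's standard library, which only works on well-formed markup (real pages usually need a forgiving HTML parser).

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <div class="byline">By Jane Doe | Published 2021-03-04</div>
    <div class="byline">By John Roe | Published 2022-11-15</div>
  </body>
</html>
"""


def scrape(page: str, xpath: str, pattern: str) -> list[str]:
    """Return the first regex capture from the text of each node the XPath selects."""
    root = ET.fromstring(page.strip())
    regex = re.compile(pattern)
    results = []
    for node in root.findall(xpath):
        # Join all nested text so inline tags do not hide part of the value.
        text = "".join(node.itertext())
        match = regex.search(text)
        if match:
            results.append(match.group(1))
    return results


# The XPath narrows to the byline nodes; the regex isolates the author name.
authors = scrape(PAGE, ".//div[@class='byline']", r"By ([^|]+?)\s*\|")
print(authors)
```

The XPath does the structural work (which part of the page) and the regex does the textual work (which part of the string), which is why consistent page templates matter so much.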

Note: a crawler follows links to get the URLs and basic information about all the pages of a site or site section. A scraper pulls arbitrary information out of pages.

In Content Chimera

Chimera has extensive scraping capabilities, including defining and testing patterns that include an XPath and a regex. To aid analysis, Chimera automatically pulls out six fields when scraping a pattern: whether or not there was a match, the count of matches, the first match, the second match, the third match, and a comma-separated list of matches. These multiple fields make it easier to do your content analysis.
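To make the six fields concrete, here is a minimal sketch of how a list of matches for one page could be turned into them. This is not Chimera's implementation. The function name, the field names, and the choice to use an empty string for a missing match are assumptions made for illustration.

```python
def summarize_matches(matches: list[str]) -> dict:
    """Derive six analysis fields from the matches found on a single page."""
    # Pad to three entries so pages with fewer matches still fill every column.
    first_three = matches[:3] + [""] * (3 - len(matches[:3]))
    return {
        "has_match": bool(matches),          # whether or not there was a match
        "match_count": len(matches),         # the count of matches
        "first_match": first_three[0],
        "second_match": first_three[1],
        "third_match": first_three[2],
        "all_matches": ", ".join(matches),   # comma-separated list of matches
    }


fields = summarize_matches(["Jane Doe", "John Roe"])
print(fields)
```

Having the boolean, the count, and the individual matches as separate columns means you can filter, chart, and spot-check without re-parsing a single combined value.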

See the Scraper fields below, or show fields for all field types.

Answer some questions to help select fields

If we take the traditional view of a content inventory or audit, we have rows representing each page (so each row has a unique URL) and then we have columns for things like the meta description or crawl depth. These columns are the different fields we have available to us in our content analysis.
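The row-and-column view described above can be sketched as a small table keyed by URL. The sample rows, the column names, and the `to_csv` helper are hypothetical. The only rule the sketch enforces is the one stated in the text: every row has a unique URL.

```python
import csv
import io

# Hypothetical inventory: one row per page, one column per field.
ROWS = [
    {"url": "https://example.com/", "title": "Home",
     "meta_description": "Welcome", "crawl_depth": 0},
    {"url": "https://example.com/news/", "title": "News",
     "meta_description": "Latest updates", "crawl_depth": 1},
]
COLUMNS = ["url", "title", "meta_description", "crawl_depth"]


def to_csv(rows: list[dict], columns: list[str]) -> str:
    """Write the inventory as CSV text, refusing duplicate URLs."""
    urls = [row["url"] for row in rows]
    if len(urls) != len(set(urls)):
        raise ValueError("each row must have a unique URL")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


inventory_csv = to_csv(ROWS, COLUMNS)
print(inventory_csv)
```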

①. Define what you are trying to accomplish.

Your content analysis needs to be grounded in your analysis goal.

Examples: Plan Digital Transformation, Test Content Hypothesis, Provide Better Bid.

②. Define your analysis approach.

Size and complexity of your digital presence

Size and complexity of your digital presence should drive your content analysis approach.

Your approach

Content analysis does not necessarily mean opening up a spreadsheet. Before diving in, you should define your basic approach to the analysis.

Examples: Brute Force; Sample, Rules, Repeat; Quick Take.

③. Select fields toward your goal, grounded in your prioritized list of questions you want answered.

Although you can use this database however you like, in general we recommend that you build up a list of fields that will be useful for your analysis. To do so, just click on the heart next to any field name. After you have hearted some fields, you can see an analysis of your list at My List ♥. At that point you can move on to ④: start iterating on your analysis, starting with the basics.

Author ♡

The person(s) who wrote the content. This may be different from whoever published or crafted the page.

General Usefulness:

Ease of Automation:

Compare with other Org fields.

Content Type ♡

Content Type (semantic type of content, such as Product Page or Event) is usually an extremely effective way to group and look for patterns across a digital presence.

General Usefulness:

Ease of Automation:

Compare with other Category fields.

Date Published ♡

Date the content was originally published. This is frequently a useful factor in deciding what content can be culled.

General Usefulness:

Ease of Automation:

Compare with other Quality fields.
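As one possible way to use Date Published when looking for content to cull, the sketch below pulls an ISO-style date out of scraped text and flags pages older than a cutoff. The date format, the function names, and the three-year threshold are all assumptions. Your pages may format dates differently, and age alone rarely decides what gets removed.

```python
import re
from datetime import date
from typing import Optional

# Assumes dates appear as YYYY-MM-DD somewhere in the scraped text.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def extract_date_published(text: str) -> Optional[date]:
    """Return the first valid YYYY-MM-DD date in the text, or None."""
    match = ISO_DATE.search(text)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:
        # Looks like a date but is not one (for example month 13).
        return None


def is_cull_candidate(published: Optional[date], today: date,
                      max_age_days: int = 3 * 365) -> bool:
    """Flag content older than the (hypothetical) cutoff for human review."""
    if published is None:
        return False
    return (today - published).days > max_age_days


published = extract_date_published("Published 2019-05-01 by staff")
print(published, is_cull_candidate(published, date(2024, 1, 1)))
```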

Division ♡

The organizational division (or department, vice presidency, company, etc.) that owns the page.

General Usefulness:

Ease of Automation:

Compare with other Org fields.

Has [Problem] ♡

Yes or no, does this piece of content have this specific problem? The actual field name would depend on your situation, such as "Has Wall of Text".

General Usefulness:

Ease of Automation:

Compare with other Quality fields.
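A Has [Problem] field such as "Has Wall of Text" needs a concrete, testable definition before it can be automated. The sketch below uses one possible definition, purely as an assumption: a page has a wall of text if any single paragraph exceeds a word limit. The function name and the 150-word default are hypothetical.

```python
import re


def has_wall_of_text(text: str, max_words: int = 150) -> bool:
    """Return True if any paragraph is longer than max_words words.

    Paragraphs are assumed to be separated by blank lines.
    """
    paragraphs = re.split(r"\n\s*\n", text)
    return any(len(paragraph.split()) > max_words for paragraph in paragraphs)


short_page = "One short paragraph.\n\nAnother short paragraph."
long_page = " ".join(["word"] * 200)
print(has_wall_of_text(short_page), has_wall_of_text(long_page))
```

Whatever definition you choose, spot-check flagged and unflagged pages to confirm the rule matches what people actually perceive as the problem.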

[IA] Depth ♡

The depth from the perspective of the main navigational structures, for instance the Breadcrumb Depth.

General Usefulness:

Ease of Automation:

Compare with other Category fields.
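Breadcrumb Depth is one example of an [IA] Depth field that can often be scraped. The sketch below counts the items inside a breadcrumb element. The markup, the `breadcrumb` class name, and the choice to count the home link as level one are assumptions, and the standard-library parser used here only handles well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical breadcrumb markup; real sites will use different structures.
PAGE = """
<html><body>
  <nav class="breadcrumb">
    <a href="/">Home</a>
    <a href="/programs/">Programs</a>
    <a href="/programs/nursing/">Nursing</a>
    <span>Admissions</span>
  </nav>
</body></html>
"""


def breadcrumb_depth(page: str,
                     xpath: str = ".//nav[@class='breadcrumb']/*") -> int:
    """Count the direct children of the breadcrumb element (0 if none found)."""
    root = ET.fromstring(page.strip())
    return len(root.findall(xpath))


depth = breadcrumb_depth(PAGE)
print(depth)
```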

[Problem] Count ♡

How often does the problem happen on the page? This would be a specific issue, so something like "Left Nav Count".

General Usefulness:

Ease of Automation:

Compare with other Quality fields.

[Problem] Example ♡

An example of a problem (on a specific page) you are investigating. This field could be repeated in an analysis, with actual fields like "Table Example" or "Bad Character Encoding Example".

General Usefulness:

Ease of Automation:

Compare with other Quality fields.

Site ♡

Within which site (as experienced by the site visitor) does this content appear?

General Usefulness:

Ease of Automation:

Consider instead: Site Type

Site Section ♡

The section of a site (for instance the news section, or a section for a particular program).

General Usefulness:

Ease of Automation:

Compare with other Category fields.
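Site and Site Section can sometimes be approximated from the URL alone. The sketch below treats the hostname as the site and the first path segment as the section. Both rules are assumptions: a site as experienced by visitors can span several hostnames, and sections do not always map to folders. The function name is hypothetical.

```python
from urllib.parse import urlparse


def site_and_section(url: str) -> tuple[str, str]:
    """Return (site, section) guessed from the URL's hostname and first folder."""
    parsed = urlparse(url)
    site = parsed.netloc.lower()
    # Drop empty segments produced by leading, trailing, or doubled slashes.
    segments = [segment for segment in parsed.path.split("/") if segment]
    section = segments[0] if segments else ""
    return site, section


result = site_and_section("https://www.example.com/news/2024/launch.html")
print(result)
```

Where the URL structure does not reflect the visitor's experience, a scraped value (for example from navigation or breadcrumbs) is usually a better source for these fields.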

Source System ♡

Where is the primary source of content for this URL? For instance, what CMS, document management system, or product information system does this content primarily come from?

General Usefulness:

Ease of Automation:

Compare with other Category fields.

Title ♡

The title of the content is the field most useful to people when looking at individual "rows" of an inventory. That said, unlike URLs, titles are not guaranteed to be unique.

General Usefulness:

Ease of Automation:

Compare with other Basic fields.

Topic ♡

The topic/subject of the content.

General Usefulness:

Ease of Automation:

Compare with other Category fields.
