
Scraper Content Analysis Fields

Source Type: Scraper

You can often scrape information out of pages, provided the pages are reasonably consistent. Scraping relies on patterns such as XPath and/or regex. Note that you can use an LLM to help you define the patterns, without then running the LLM against every piece of content.
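As a rough illustration of combining the two kinds of pattern, the sketch below first selects an element with an XPath and then narrows its text with a regex. It uses only Python's standard library (whose XPath support is limited and requires well-formed markup); the page fragment, the `byline` class, and the `scrape` helper are all hypothetical and are not part of any particular tool.

```python
import re
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <div class="byline">By Jane Doe | Published 2021-03-04</div>
    <p>Body text.</p>
  </body>
</html>
"""


def scrape(page: str, xpath: str, pattern: str) -> List[str]:
    """Select elements with an XPath, then apply a regex to each element's text."""
    root = ET.fromstring(page.strip())
    results: List[str] = []
    for element in root.findall(xpath):
        # Join all nested text so the regex sees the element's full content.
        text = "".join(element.itertext())
        # With one capture group, findall returns the captured text;
        # with no group, it returns the whole match.
        results.extend(re.findall(pattern, text))
    return results


# The XPath finds the byline; each regex pulls one value out of it.
authors = scrape(PAGE, ".//div[@class='byline']", r"By ([^|]+?)\s*\|")
dates = scrape(PAGE, ".//div[@class='byline']", r"\d{4}-\d{2}-\d{2}")
print(authors, dates)
```

Real pages are rarely well-formed XML, so in practice a tolerant HTML parser would stand in for `xml.etree.ElementTree`; the two-step shape (XPath to locate, regex to extract) stays the same.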

Note: a crawler follows links to collect the URLs and basic information about all the pages of a site or site section. A scraper pulls arbitrary information out of pages.

In Content Chimera

Chimera has extensive scraping capabilities, including defining and testing patterns that combine an XPath and a regex. To aid analysis, Chimera automatically pulls out six fields per pattern when crawling: whether or not there was a match, the count of matches, the first match, the second match, the third match, and a comma-separated list of all matches. Having these separate fields makes it easier to do your content analysis.
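To make the six fields concrete, here is a minimal sketch of how they could be derived from a single regex. This is not Chimera's implementation: the function name, the dictionary keys, and the choice of an empty string for a missing match are all assumptions made for illustration.

```python
import re
from typing import Dict, Union


def match_fields(text: str, pattern: str) -> Dict[str, Union[bool, int, str]]:
    """Derive six summary fields from every match of `pattern` in `text`."""
    # group(0) is the full match, regardless of any capture groups.
    matches = [m.group(0) for m in re.finditer(pattern, text)]
    # Pad to three entries so the first/second/third fields always exist.
    # Using "" for a missing match is an assumption of this sketch.
    first_three = (matches + ["", "", ""])[:3]
    return {
        "has_match": bool(matches),
        "match_count": len(matches),
        "first_match": first_three[0],
        "second_match": first_three[1],
        "third_match": first_three[2],
        "all_matches": ",".join(matches),
    }


# Hypothetical page text containing two product codes.
fields = match_fields("See SKU-101 and SKU-202 for details.", r"SKU-\d+")
print(fields)
```

Splitting the result this way lets the yes/no and count fields drive charts and filters, while the individual matches remain available for spot-checking specific pages.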

See Scraper fields below. Or show fields for all field types.
Author
The person(s) who wrote the content. This may be different from whoever published or built the page.
General Usefulness:
Ease of Automation:
Compare with other Org fields.
Content Type
Content Type (semantic type of content, such as Product Page or Event) is usually an extremely effective way to group and look for patterns across a digital presence.
General Usefulness:
Ease of Automation:
Compare with other Category fields.
Date Published
Date the content was originally published. This is frequently a useful factor in deciding what content can be culled.
General Usefulness:
Ease of Automation:
Compare with other Quality fields.
Division
The organizational division (or department, vice presidency, company, etc.) that owns the page.
General Usefulness:
Ease of Automation:
Compare with other Org fields.
Has [Problem]
Yes or no, does this piece of content have this specific problem? The actual field name would depend on your situation, such as "Has Wall of Text".
General Usefulness:
Ease of Automation:
Compare with other Quality fields.
[IA] Depth
The depth from the perspective of the main navigational structures, for instance the Breadcrumb Depth.
General Usefulness:
Ease of Automation:
Compare with other Category fields.
[Problem] Count
How often does the problem happen on the page? This would be a specific issue, so something like "Left Nav Count".
General Usefulness:
Ease of Automation:
Compare with other Quality fields.
[Problem] Example
An example of a problem (on a specific page) you are investigating. This field could be repeated in an analysis, with actual fields like "Table Example" or "Bad Character Encoding Example".
General Usefulness:
Ease of Automation:
Compare with other Quality fields.
Site
Within which site (as experienced by the site visitor) does this content appear?
General Usefulness:
Ease of Automation:
Consider instead: Site Type
Site Section
The section of a site (for instance the news section, or a section for a particular program).
General Usefulness:
Ease of Automation:
Compare with other Category fields.
Source System
Where is the primary source of content for this URL? For instance, what CMS, document management system, or product information system does this content primarily come from?
General Usefulness:
Ease of Automation:
Compare with other Category fields.
Title
The title of the content is the field most useful to people when looking at individual "rows" of an inventory. That said, unlike URLs, titles are not guaranteed to be unique.
General Usefulness:
Ease of Automation:
Compare with other Basic fields.
Topic
The topic/subject of the content.
General Usefulness:
Ease of Automation:
Compare with other Category fields.