What's New:
Content Chimera

Early access is now available. Sign up for early access.

27 Apr 2020: Better Scraping and Heatmap Tables

As always, we rolled out a variety of routine fixes, performance improvements, and backend monitoring improvements. In addition, many of the changes below lay the groundwork for stronger features in the future. But the big headliners for now are better scraping and heatmap tables.

Heatmap Tables

Heatmap tables compare two categories, with darkness showing the amount in each cell. For instance, this heatmap table shows how frequently different Calls-To-Action are used by site section. We can see that Change Request Flowchart is the most commonly used CTA across the site, and that articles has the most CTAs.
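
Conceptually, a heatmap table is a cross-tabulation: count how many items fall into each (row, column) pair, then shade each cell by its count. Here is a minimal Python sketch of that idea. The sample records, the field names `section` and `cta`, and the "Newsletter Signup" CTA are made up for illustration, and this is not how Content Chimera itself is implemented.

```python
from collections import Counter

# Hypothetical sample data: one record per CTA occurrence on a page.
records = [
    {"section": "articles", "cta": "Change Request Flowchart"},
    {"section": "articles", "cta": "Change Request Flowchart"},
    {"section": "articles", "cta": "Newsletter Signup"},
    {"section": "products", "cta": "Change Request Flowchart"},
]


def heatmap_counts(rows, row_key, col_key):
    """Count records for each (row value, column value) pair."""
    return Counter((r[row_key], r[col_key]) for r in rows)


def shade(count, max_count):
    """Map a count to a 0.0-1.0 darkness value (higher means darker)."""
    return count / max_count if max_count else 0.0


counts = heatmap_counts(records, "cta", "section")
max_count = max(counts.values())
for (cta, section), n in sorted(counts.items()):
    print(f"{cta:26} {section:10} count={n} darkness={shade(n, max_count):.2f}")
```

Cells with no matching records simply have a count of zero, which would render as the lightest shade.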

To use a heatmap table, you just need to specify what should be in the rows and what should be in the columns (after first clicking the gear icon to go to advanced charting options and then selecting Heatmap Table as the chart type):

We are also close to launching scatter charts, which are a good way to compare categories with more values.

Better Scraping of Patterns

Content Chimera has always had the ability to scrape patterns out of the content, and it always did so from a local cache of the site. That said, creating, managing, and running patterns was cumbersome. So we made a variety of improvements.

1. Just select the (full) pattern

Before, you needed to separately select scope, pattern, and test. Now, instead of selecting "Full HTML", "Table", "Has", you just select "Tables". So you now interact with Full Patterns, which are a combination of scope (where to even look for the pattern), pattern (what to pull from the scope), and test (what single test to apply, such as a simple "does it have a value?").
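
One way to picture a Full Pattern is as a named bundle of the three older choices. The sketch below is only an illustration of that idea: the `FullPattern` class and `describe` helper are hypothetical names, not part of Content Chimera.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FullPattern:
    """Hypothetical model of a full pattern: three choices under one name."""
    name: str     # what you pick in the UI, e.g. "Tables"
    scope: str    # where to look, e.g. "Full HTML"
    pattern: str  # what to pull from the scope, e.g. "Table"
    test: str     # what single test to apply, e.g. "Has"


# The old three-step selection, bundled under one name.
TABLES = FullPattern(name="Tables", scope="Full HTML", pattern="Table", test="Has")


def describe(fp: FullPattern) -> str:
    """Spell out what selecting a full pattern stands for."""
    return f"{fp.name}: look in {fp.scope}, pull {fp.pattern}, test {fp.test}"


print(describe(TABLES))
```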

2. Define your own full patterns.

You can now define your own full patterns. You can select a combination of scope, pattern, and test, like before (although now you can name them and re-use them):

You can even define the *components* of a pattern extraction:

3. New "meta-tag" type of scope.

Do you happen to be blessed with nice, clean metadata exposed like this?

If so, when defining a scope, select the new "Meta Tag" option and enter DC.subject or the specific meta tag you want to capture. For pattern, select either All (which would work in the example here) or Comma-Separated (which will pull values out of comma-separated lists).
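
To make the difference between All and Comma-Separated concrete, here is a rough sketch using only Python's standard library. The `MetaTagScope` class, the `extract` function, and the sample HTML are assumptions for illustration; Content Chimera's own extraction may work differently.

```python
from html.parser import HTMLParser


class MetaTagScope(HTMLParser):
    """Collect the content of every <meta> tag with a given name."""

    def __init__(self, meta_name):
        super().__init__()
        self.meta_name = meta_name.lower()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == self.meta_name:
            self.contents.append(attr.get("content") or "")


def extract(html, meta_name, pattern="All"):
    """Return meta tag values, optionally splitting comma-separated lists."""
    parser = MetaTagScope(meta_name)
    parser.feed(html)
    if pattern == "Comma-Separated":
        return [v.strip() for c in parser.contents for v in c.split(",") if v.strip()]
    return parser.contents  # "All": each tag's content as-is


# Hypothetical page head with one single-valued and one multi-valued tag.
sample = (
    '<head><meta name="DC.subject" content="audits">'
    '<meta name="DC.subject" content="migration, governance"></head>'
)
print(extract(sample, "DC.subject"))
print(extract(sample, "DC.subject", "Comma-Separated"))
```

With one value per tag, All is enough. Comma-Separated matters when a single tag packs several values into one content attribute.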

4. Test patterns

You can now test a pattern before unleashing it against the entire site. For example, here is a test against the content above, using a Meta Tag scope on DC.subject:

5. New "all" column.

Whenever you scrape a pattern, several columns are generated for your charting and decision-making. We have now added an "all" column, which lists all the values (within limits: it captures only the first 200 values or 16,000 characters, whichever comes first). We are also actively testing sophisticated multi-value analysis, which will really ramp up the value of the "all" columns.
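
The two limits can be read as "stop at whichever cap is hit first". The sketch below shows one plausible interpretation. The 200 and 16,000 figures come from the note above, but the `|` separator, the function name, and the choice to drop whole values rather than cut one in the middle are assumptions.

```python
MAX_VALUES = 200    # limit stated in the release note
MAX_CHARS = 16_000  # limit stated in the release note
SEPARATOR = "|"     # assumption: the real delimiter is not documented here


def build_all_column(values):
    """Join values in order, stopping before either limit would be exceeded."""
    kept = []
    length = 0
    for value in values[:MAX_VALUES]:
        extra = len(value) + (len(SEPARATOR) if kept else 0)
        if length + extra > MAX_CHARS:
            break  # adding this value would pass the character cap
        kept.append(value)
        length += extra
    return SEPARATOR.join(kept)


print(build_all_column(["audits", "migration", "governance"]))
```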

6. Scrape multiple patterns at once.

Every time you scrape a pattern, it gets added to your site's* suite of patterns. These can all be re-scraped at once. In the future we plan to allow a billing organization to create a suite of patterns that can then be run against any sites that organization is managing.

*Actually, you can add patterns to any extent. An extent is a site, group of sites, or client.

30 Mar 2020: New Forms

We have been working hard on an upcoming feature: multi-value analysis. This will allow analysis of tagging / topics, for example (each piece of content can have multiple tags or topics applied to it). This is a technically difficult task that will take some time to completely develop and deploy. For now, we have been making a variety of backend changes, such as weaving in an entirely different database type for this multi-value analysis and better scraping of multi-values.

Over the past weeks we have trickled out a variety of smaller changes, such as implementing a new approach to logging to allow better visibility and fixing a bug where a second pass of a pattern scrape wasn't updating correctly.

We have developed a new approach for more quickly deploying new forms to control more in Content Chimera (not glamorous but should help us in the future). For now, we have added the following forms:

  • A new form to add a custom data source. Please contact us if you plan on doing this so we can help with this advanced feature.
  • A new Quotas & Users form (access via the pulldown under your name) where you can add more users and see the status of your quota usage.

31 Jan 2020: Chart Improvements

Various charting improvements

Advanced charting options were reorganized for clarity and for a bit more space. The primary charting options were rationalized between normal charting and treemap charting.

In normal charting:

In treemap view (since that is a true hierarchy, and coloring works differently in a treemap):

In addition, there were some bugs in random sampling that have been fixed. Also, the charting is slightly faster now.

Website improvements

Changes to the website (not the app itself):

  • A new home page
  • A separate features page
  • A global navigation

28 Jan 2020: Filtering Improvements

Various filtering improvements

  • Saving a chart also now stores the Group setting
  • Chart now refreshes every time a filter is changed on the chart page
  • Changed the way empty values are treated: they are no longer considered as 0. For instance, when filtering on PageViews < 1, a page that did not match a Google Analytics row at all will no longer be counted in that filter
  • Can now remove a filter from a chart (without refreshing the page manually)
  • “is empty” operator now available in the front end
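
The empty-value change in the list above boils down to treating "no value" as different from zero. Here is a minimal sketch of that behavior, with hypothetical function names and made-up page data (`None` stands for a page with no matching Google Analytics row):

```python
from typing import Optional


def matches_less_than(value: Optional[float], threshold: float) -> bool:
    """New behavior: an empty (None) value never satisfies a numeric comparison."""
    if value is None:
        return False
    return value < threshold


def is_empty(value: Optional[float]) -> bool:
    """Rough equivalent of the "is empty" operator."""
    return value is None


# Hypothetical pages; None means no Google Analytics row matched the page.
pageviews = {"/a": 0, "/b": None, "/c": 12}

low_traffic = [url for url, pv in pageviews.items() if matches_less_than(pv, 1)]
unmatched = [url for url, pv in pageviews.items() if is_empty(pv)]
print(low_traffic)
print(unmatched)
```

Under the old treatment, `/b` would have been counted as having 0 page views and would have appeared in the PageViews < 1 results.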

New data source + new patterns + more

  • OnPoint Auditor is now a data source for import
  • New patterns and scopes for scraping: including Drupal Paragraphs, Drupal Content Types, and Meta Description
  • The extent history page now displays key parameters for scrape and merge jobs, and has been formatted slightly better.
  • When you ran out of URLs, Content Chimera silently stopped the crawl but it didn't tell you that on the web page. This has been fixed.
  • Various backend improvements

18 Jan 2020: Site History

  • Site-level activity history! Now that more people are using the tool, we are running into the case of more than one person working on analysis of a site at the same time. So now you can click on "Show History" from the Assets & Metadata to see all major activity on that site (or sitegroup or client). The history also indicates who took each action.
  • CAT / CWRX Audit has been added as a data source. If you missed the news, Content Science bought Content Analysis Toolkit (CAT) so it is in the process of changing names. Regardless of the name, you can now import from the tool.
  • Darker color for single-color charts. During a screenshare demo to a remote conference room, everyone yelled "We can't see the chart!", so we made the bars a darker shade of gray so it's higher contrast when a monitor isn't calibrated well (or in strong ambient light).
  • More URL deduping options. Content Chimera does two types of duplication analysis: URL-level (looking only at the URL and nothing else) and content-level (looking at the text itself). We already had options for whether URLs that differ only by a leading www, by capitalization, or by http versus https should be considered duplicates. We just added configuration options for whether URLs that differ only by an index.html (actually index.*) at the end, or by a slash at the end, should be considered duplicates.
  • As usual, we made a variety of backend improvements as well, for us to better monitor the health of the systems and to troubleshoot more quickly.
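
URL-level deduplication like this is commonly done by reducing each URL to a normalized key and treating URLs with the same key as duplicates. The sketch below illustrates that general technique. The function name and option names are hypothetical and do not reflect Content Chimera's actual configuration settings.

```python
import re
from urllib.parse import urlsplit


def url_dedupe_key(url, *, ignore_www=True, ignore_scheme=True,
                   ignore_case=True, ignore_index=True,
                   ignore_trailing_slash=True):
    """Reduce a URL to a key; URLs with equal keys count as duplicates."""
    parts = urlsplit(url)
    scheme = "" if ignore_scheme else parts.scheme.lower()
    host = parts.netloc.lower()  # hostnames are case-insensitive anyway
    path = parts.path
    if ignore_www and host.startswith("www."):
        host = host[4:]
    if ignore_case:
        path = path.lower()
    if ignore_index:
        # Drop a final "index.<anything>" segment, e.g. index.html or index.php.
        path = re.sub(r"/index\.[^/]*$", "/", path, flags=re.IGNORECASE)
    if ignore_trailing_slash:
        path = path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{scheme}://{host}{path}{query}"


a = url_dedupe_key("https://www.Example.com/About/index.html")
b = url_dedupe_key("http://example.com/about/")
print(a == b)
```

Each option can be switched off independently, which mirrors the idea of deciding per site which differences should count as "the same URL".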

2 December 2019: Reading Level Analysis!

Now, when you do RoT analysis, you can also do reading level analysis. Then you can use the reading level data just like all other data, for instance to graph the distribution of reading levels across the pages of your site.

Also, a bunch of scaling, performance, and monitoring improvements and bug fixes. We fixed how the rules processing UI worked, and made lots of mostly invisible changes to RoT testing (better scalability, improved monitoring, improved error handling, handling more edge cases of encoding issues).

7 November 2019: Major Changes

There were a ton of related changes that we wanted to roll out together, so today's deployment was big. One theme is more consistent asset filters (rules to filter assets by, such as folder1 = articles), which are now used in rules and charts. This is now generalized, so it will probably be added elsewhere as well.

Rules management

  • Now there is no left-right scrolling with large and complex rulesets
  • Now select rule operators (like "equals") from a pulldown, rather than the more error-prone method of having to type them
  • Ability to drag-and-drop to reorder rules

Charting

  • Can now click on the label at the bottom of the graph, useful if a particular element of the graph is too tiny to click
  • Filter can now be more complex (such as folder1 = article and folder2 = 2019)
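
A compound filter such as folder1 = article and folder2 = 2019 can be thought of as a list of conditions that must all hold. The sketch below assumes, purely for illustration, that folderN means the Nth path segment of the URL. The helper names and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def folders(url):
    """Assumption: folderN is the Nth path segment of the URL."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return {f"folder{i}": seg for i, seg in enumerate(segments, start=1)}


def matches(asset, conditions):
    """True when every (field, value) condition equals the asset's field."""
    return all(asset.get(field) == value for field, value in conditions)


# folder1 = article and folder2 = 2019
asset_filter = [("folder1", "article"), ("folder2", "2019")]
urls = [
    "https://example.com/article/2019/launch",
    "https://example.com/article/2018/retro",
    "https://example.com/news/2019/update",
]
selected = [u for u in urls if matches(folders(u), asset_filter)]
print(selected)
```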

UI improvements

  • Can now modify clients in the UI
  • Better deal with and report to the UI when there are errors
  • More consistent filtering
  • Make numbers better match up in progress circles
  • When selecting extent (for instance, which site you want to analyze), checkboxes act as radio buttons rather than multi-selects
  • In the free RoT onboarding, allow people to give their real names
  • When starting a job and there are already others running jobs, report how many jobs are in front of you

Crawling improvements

  • Don't get blocked by Sucuri application filtering
  • Always use Content Chimera user agent in requests
  • Can now add an asset filter to a crawl job (for now this is just in the backend)
  • New backend option for whether to treat www and non-www URLs as the same (whether they are treated as duplicates in the URL deduplication process)

Backend and monitoring

  • Even out resource utilization across types of jobs
  • Show bar at the top of environments to be clear what environment we are running (you will now see a thin red bar at the top of your screen, which indicates the main production environment)
  • In the event that large DB writes don't work, report this and stop the job.
  • In addition to the PHP performance, server-level monitoring, logging, and alerting already in place, added more sophisticated annotations in monitoring to isolate root causes of performance issues in long-running jobs.
  • Better deal with bad character encoding at the time of RoT analysis
  • Better deal with emojis in text
  • Take a PHP performance snapshot every half hour of a very long-running job
  • Speed up file access, especially relevant for large jobs
  • Better report progress on a RoT job when there is a lot of trivial content
  • Fix bug when scraping against a field (rather than HTML or PDF)
