Early access is now available. Sign up for early access.
27 Apr 2020: Better Scraping and Heatmap Tables
As always, we rolled out a variety of routine fixes, performance improvements, and backend
monitoring improvements. Many of the changes below also lay the groundwork for
stronger features in the future. But the big headliners for now
are better scraping and heatmap tables.
Heatmap tables compare two categories, with darkness showing the amount in
each cell. For instance, this heatmap table shows how frequently different
Calls-To-Action are used by site section. We can see that Change Request Flowchart
is the most commonly-used CTA across the site, and that the articles section uses the most CTAs.
To use a heatmap table, you just need to specify what should be in the rows
and what should be in the columns (after first clicking the gear icon to go to advanced
charting options and then selecting Heatmap Table as the chart type):
We are also close to launching scatter charts, which are a good way to compare
categories with more values.
Better Scraping of Patterns
Content Chimera has always had the ability to scrape patterns out of the content,
and it always did so from a local cache of the site. That said, creating, managing,
and running patterns was cumbersome. So we made a variety of improvements.
1. Just select the (full) pattern
Before, you needed to separately select scope, pattern, and test.
Now, instead of selecting "Full HTML", "Table", "Has", you just select "Tables".
So you now interact with Full Patterns, which are a combination of scope (where
to even look for the pattern), pattern (what to pull from the scope), and test (how
to evaluate what was pulled, such as a simple "does it have a value?" check).
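The full-pattern idea can be sketched as a simple data structure (the field values below come straight from the example above; the class itself is just an illustration, not Content Chimera's internals):

```python
from dataclasses import dataclass

@dataclass
class FullPattern:
    """A named bundle of the three pieces described above."""
    name: str     # what you select in the UI, e.g. "Tables"
    scope: str    # where to even look, e.g. "Full HTML"
    pattern: str  # what to pull from the scope, e.g. "Table"
    test: str     # how to evaluate what was pulled, e.g. "Has"

# The "Tables" full pattern from the example above:
tables = FullPattern("Tables", "Full HTML", "Table", "Has")
print(tables)
```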
2. Define your own full patterns.
You can now define your own full patterns. You can select a combination
of scope, pattern, and test, like before (although now you can name them and
reuse them). You can even define the *components* of a pattern extraction:
3. New "meta-tag" type of scope.
Do you happen to be blessed with nice, clean metadata exposed like this?
If so, when defining a scope simply select the new "Meta Tag" option and enter
DC.subject or whatever specific meta-tag you want to capture. For pattern, select either All (which would work in the
example here) or Comma-Separated (which will pull out each comma-separated value individually).
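As a rough sketch of what a Meta Tag scope with an All or Comma-Separated pattern does, here is a stdlib-only approximation (illustrative only; this is not Content Chimera's actual scraper):

```python
from html.parser import HTMLParser

class MetaTagScope(HTMLParser):
    """Collect the content attribute of a named <meta> tag (e.g. DC.subject)."""
    def __init__(self, name):
        super().__init__()
        self.target = name.lower()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            if a.get("name", "").lower() == self.target:
                self.values.append(a.get("content", ""))

html = '<html><head><meta name="DC.subject" content="budgets, forms, permits"></head></html>'
scope = MetaTagScope("DC.subject")
scope.feed(html)

# "All" keeps the whole attribute value; "Comma-Separated" splits it apart.
all_value = scope.values[0]
comma_separated = [v.strip() for v in all_value.split(",")]
print(comma_separated)  # ['budgets', 'forms', 'permits']
```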
4. Test patterns
You can now test a pattern before unleashing it against the entire site. For
example, here is a test against the content above, using a Meta Tag scope.
5. New "all" column.
Whenever you scrape a pattern, several columns are generated for your
charting and decision-making. We have now added an "all" column. This will list
all the values (with limits: it captures just the first 200
values or 16,000 characters, whichever comes first). We are also actively
testing sophisticated multi-value analysis, which will really ramp up
the value of the "all" columns.
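The truncation logic is roughly the following (only the 200-value / 16,000-character limits come from the text above; the "|" separator and function shape are assumptions for illustration):

```python
def build_all_column(values, max_values=200, max_chars=16_000):
    """Join scraped values into one "all" column, stopping at the first of
    two limits (assumed "|" separator; limits as described above)."""
    kept, total = [], 0
    for v in values[:max_values]:
        total += len(v) + (1 if kept else 0)  # +1 accounts for the separator
        if total > max_chars:
            break
        kept.append(v)
    return "|".join(kept)

# 300 five-character values: the 200-value limit kicks in first.
col = build_all_column([f"v{i:04d}" for i in range(300)])
print(col.count("|") + 1)  # 200
```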
6. Scrape multiple patterns at once.
Every time you scrape a pattern, it gets added to your site's* suite of patterns.
These can all be re-scraped at once. In the future we plan on allowing a billing
organization to create a suite of patterns that can then be run against any sites
that organization is managing.
*Actually, you can add patterns to any extent. An extent is a site, a group of sites, or a client.
30 Mar 2020: New Forms
We have been working hard on an upcoming feature: multi-value analysis. This will allow
analysis of tagging / topics, for example (each piece of content can have multiple tags or
topics applied to it). This is a technically-difficult task that will take some time to
completely develop and deploy. For now, we have been making a variety of backend changes,
such as weaving in an entirely different database type for this multi-value analysis and
better scraping of multi-values.
Over the past weeks we have trickled out a variety of smaller changes, such as
implementing a new approach to logging to allow better visibility and fixing a bug
where a second pass of a pattern scrape wasn't updating correctly.
We have developed a new approach for more quickly deploying new forms to
control more in Content Chimera (not glamorous but should help us in the future).
For now, we have added the following forms:
- A new form to add a custom data source. Please contact us if you plan on doing this so we can help with this advanced feature.
- A new Quotas & Users form (access via the pulldown under your name) where you can add more users and see the status of your quota usage.
31 Jan 2020: Chart Improvements
Various charting improvements
Advanced charting options were reorganized for clarity and for a bit more space.
The primary charting options were rationalized between normal charting and treemap charting.
In normal charting:
In treemap view (since that is a true hierarchy, and coloring works differently in a treemap):
In addition, there were some bugs in random sampling that have been fixed. Also, the
charting is slightly faster now.
Changes to the website (not the app itself):
- A new home page
- A separate features page
- A global navigation
28 Jan 2020: Filtering Improvements
Various filtering improvements
- Saving a chart now also stores the Group setting
- The chart now refreshes every time a filter is changed on the chart page
- Changed the way empty values are treated (they are no longer considered as 0 — for instance, when filtering on PageViews < 1, a page that did not match a Google Analytics row at all will no longer be counted in that filter)
- You can now remove a filter from a chart (without refreshing the page manually)
- The “is empty” operator is now available in the front end
New data source + new patterns + more
- OnPoint Auditor is now a data source for import
- New patterns and scopes for scraping, including Drupal Paragraphs, Drupal Content Types, and Meta Description
The extent history page now displays key parameters for scrape and merge jobs, and has been formatted slightly better.
When you ran out of URLs, Content Chimera silently stopped the crawl but didn't indicate that on the web page. This has been fixed.
Various backend improvements
18 Jan 2020: Site History
Site-level activity history! Now that more people are
using the tool, we are running into the case of more than one person
working on analysis of a site at the same time. So now you can click on
"Show History" from the Assets & Metadata page to see all major activity on
that site (or sitegroup or client). The history also indicates who took each action.
- CAT / CWRX Audit has been added as a data source.
If you missed the news, Content Science bought Content Analysis Toolkit (CAT)
so it is in the process of changing names. Regardless of the name, you can now import
from the tool.
Darker color for single-color charts. During a screenshare
demo to a remote conference room, everyone yelled "We can't see the chart!",
so we made the bars a darker shade of gray for higher contrast when a
monitor isn't calibrated well (or in strong ambient light).
More URL deduping options. Content Chimera does two types of
duplication analysis: at the URL level (looking only at the URL and nothing
else) and at the content level (looking at the text itself).
We already had options for whether URLs starting with www or not should be considered
duplicates, along with capitalization and http vs. https. We just added
configuration options for whether an index.html (actually index.*) at the end,
or a slash at the end, should matter.
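URL-level deduplication of this kind boils down to normalizing each URL before comparing. Here is a rough stdlib sketch covering the options mentioned above (the function and option names are hypothetical, not Content Chimera's actual configuration):

```python
from urllib.parse import urlsplit

def normalize(url, strip_www=True, strip_index=True, strip_slash=True):
    """Normalize a URL for duplicate detection: scheme and case are ignored,
    and optionally a leading www, a trailing index.* file, and a trailing slash."""
    parts = urlsplit(url.lower())       # lowercasing also collapses http/https later
    host = parts.netloc
    if strip_www and host.startswith("www."):
        host = host[4:]
    path = parts.path
    if strip_index:
        last = path.rsplit("/", 1)[-1]
        if last.startswith("index."):
            path = path[: -len(last)]   # drop the index.* filename
    if strip_slash:
        path = path.rstrip("/")
    return host + path                  # scheme deliberately omitted

print(normalize("https://WWW.example.com/docs/index.html"))  # example.com/docs
print(normalize("http://example.com/docs/"))                 # example.com/docs
```

With all options on, those two URLs normalize identically and would be flagged as duplicates.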
As usual, we made a variety of backend improvements as well, for us to better
monitor the health of the systems and to troubleshoot more quickly.
2 December 2019: Reading Level Analysis!
Now, when you do RoT analysis, you can also do reading level analysis. Then you can use the reading level data just like all other data, for instance to graph the distribution of reading levels across the pages of your site.
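Reading level scores are typically computed from sentence length and word complexity. As one illustration, here is a naive Flesch-Kincaid grade level sketch (the announcement doesn't say which formula Content Chimera uses, and the syllable counter here is a rough approximation):

```python
import re

def syllables(word):
    """Very rough syllable estimate: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    """Flesch-Kincaid grade level: 0.39*(words/sentences)
    + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syl = sum(syllables(w) for w in words)
    n = len(words)
    return 0.39 * (n / sentences) + 11.8 * (syl / n) - 15.59

# Ten one-syllable words in two sentences scores below grade 0 (very easy).
print(round(fk_grade("The cat sat on the mat. The dog ran fast."), 1))
```

Once each page has a score like this, it becomes an ordinary numeric column to chart and filter on.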
Also, a bunch of scaling, performance, monitoring, and bug fixes: fixed how the rules processing UI worked, plus lots of mostly-invisible changes to RoT testing (better scalability, improved monitoring, improved error handling, and handling more edge cases of encoding issues).
7 November 2019: Major Changes
There were a ton of related changes that we wanted to roll out together, so today's deployment was big.
One theme is more consistent asset filters (rules to filter assets by, such as folder1 = articles), which are now used in rules and charts --
this is now generalized, so it will probably be added elsewhere as well.
- Now there is no left-right scrolling with large and complex rulesets
- Now select rule operators (like "equals") from a pulldown, rather than the more error-prone method of having to type them
- Ability to drag-and-drop to reorder rules
- Can now click on the label at the bottom of the graph, useful if a particular element of the graph is too tiny to click
- Filter can now be more complex (such as folder1 = article and folder2 = 2019)
- Can now modify clients in the UI
- Better handling of errors, which are now reported to the UI
- More consistent filtering
- Make numbers better match up in progress circles
- When selecting extent (for instance, which site you want to analyze), checkboxes act as radio buttons rather than multi-selects
- In the free ROT onboarding, allow people to give their real names
- When starting a job and there are already other jobs running, report how many jobs are in front of you
- Don't get blocked by Sucuri application filtering
- Always use Content Chimera user agent in requests
- Can now add an asset filter to a crawl job (for now this is just in the backend)
- New backend option for whether to treat www and non-www URLs as the same (whether they are treated as duplicates in the URL deduplication process)
Backend and monitoring
- Even out resource utilization across types of jobs
- Show a bar at the top of environments to make clear which environment is running (you will now see a thin red bar at the top of your screen, which indicates the main production environment)
- In the event that large DB writes don't work, report this and stop the job.
- In addition to the PHP performance, server-level monitoring, logging, and alerting already in place, added more sophisticated annotations in monitoring to isolate root causes of performance issues in long-running jobs.
- Better deal with bad character encoding at the time of ROT analysis
- Better deal with emojis in text
- Take a PHP performance snapshot every half hour of a very long-running job
- Speed up file access, especially relevant for large jobs
- Better report progress on a RoT job when there is a lot of trivial content
- Fix bug when scraping against a field (rather than HTML or PDF)