Chimera can now more effectively crawl larger sites (hundreds of thousands of URLs or more). In the backend, this means more parallel processing, more batch activities, and many other improvements to handle a variety of conditions that arise in large crawls. In the UI, this means you'll see more steady progress and fewer slowdowns on large crawls (although crawl speed naturally can change for example when the crawler hits a pocket of very large pages).
Content Chimera development always faces the tension of 1) adding bleeding edge features for the complex digital presences that David Hobbs Consulting analyzes and 2) making Chimera easier to use for new subscribers.
We've gotten a bunch of great feedback on the onboarding experience. We've made a a variety of UI changes:
Content Chimera now allows estimating per-content-item content transformation effort, with the assignments page now showing the aggregate effort per disposition:
This is based on rules (long-time functionality) and dispositions (new functionality).
The end result is that there is now an estimate, per content item, of effort to transform that piece of content. This is stored in the "effort" field, where number is the estimated number of manual minutes to transform that content.
You can now create your own reports in Content Chimera. Reports help you either build a story around a website toward driving change, simply pull together information to illustrate current state, or an ongoing dashboard that updates as the underlying data changes. Content Chimera reports:
Chimera reports are a blend of:
Content Chimera can import data from many data sources, including dedicated crawlers. Now one advantage of crawling with Content Chimera is that it will automatically do an analysis on encountered domains. Note that this happens automatically, and it considers links from all pages in the crawl. Read more in the documentation.
Although powerful, the shared filters to date have been a bit confusing. There's now a new way to define filters that's faster and easier:
We at David Hobbs Consulting have been making a lot of experimental improvements in our clients-only environment but haven't yet pushed them to the production environment. Today we made a variety of boring under-the-hood changes (version upgrades, etc) that will allow us to roll out a variety of improvements in the coming weeks and months.
We just deployed a bunch of improvements to Content Chimera. Most of them are small changes, such as: better optimizing for charting with a large number of columns, more backend logging in general for better debugging and monitoring, in calculated fields you can now use non-boolean functions to evaluate as booleans, scatter charts are now available (this will be more fully launched when included in the help documentation), in heatmap tables zero values are now in a very very light pink (so visually you can more quickly see cells that have values at all vs. those that do not), better handle CSVs where a column or multiple columns have one or more newlines within them, and for multi-value analysis deal with field names that have blanks in them.
The bulk of our recent work on Content Chimera has been toward new capabilities for ongoing inventories (the changes so far are backend and are not yet available). This is a significant new set of functionality so will take some time to implement. Some of the specific features are: time series charts, custom data pipelines (a sequence of steps in acquiring and massaging data, such as crawling the site then pulling in the latest analytics data), programmatically connext with analytics APIs, saved job parameters so they can be better run in the background without user intervention, feature to step back through time in charts, and scheduling the pipelines. This functionality may only be available for the top tier subscription level, and targets enterprises.
You can now take notes on pages, even on other sites (for instance, competitor site pages that will never be in your primary content analysis) or during a crawl (before the page has been fully processed by Content Chimera).
The intention is to capture your thoughts when looking at examples in order to better inform further analysis using Content Chimera. For instance, you may find an example of an issue but you aren't sure if it is pervasive or not. You can note this observation at a time when you can't dive into that analysis (when meeting with a client, during a crawl, etc), and then when you are ready for deeper analysis you can look through your annotations to do the subsequent graphing or adding data to answer your questions.
Very large sites expose a variety of problems that just do not occur for smaller sites. For instance, we recently resolved an issue that arose when some servers are extremely fast at serving very large files. Also, Content Chimera can now better handle multiple very large processing jobs at the same time.
There have also been some large performance improvements that are especially useful for large sites but even smaller sites will benefit. For instance, Content Chimera now processes URLs after a crawl ten times faster. In addition, scraping content off pages is two times faster.
By default key pages will now show a tour.
Content Chimera now captures screenshots. Whenever you go to the detailed view for a particular page, you will see a screenshot if it already exists. If the screenshot does not already exist, it will attempt to get one then. Content Chimera also collects approximately the first thousand screenshots during a crawl (this is a crawl configuration option, so screenshots can be turned off).
Users on Pro or Enterprise Subscriptions now get higher-priority processing for long-running processes like crawling. Starter plans get normal priority processing.
As usual, development was far more active than may be obvious in these updates. Monitoring and scaling is always being worker on. In addition, here are some other changes:
In this partnership, current ContentWRX Audit customers will move to Content Chimera. This will mean those customers will gain the more comprehensive content analysis features of Content Chimera from visualizations to making decisions based on rules. In addition, Content Science will provide support for Content Chimera.
If you are charting site sections, you can now drill down into sections directly without manually creating filters or changing the chart config. You simply click on a chart bar, then click to Drill down, and accept the defaults: then you see a chart with the subsections under that initial section.
In addition, you can now directly create a rule from the chart. You simply click on a chart bar and can then create a rule for that bar (or of course you can still create more sophisticated rules on the rules page.
One side-benefit of drill-downs is we had to make improvements to filtering in general. For example, in the list of existing filters, you can see whether the filter is already being used in a chart or a rule.
By default, Content Chimera dynamically selects a server (in Northern Virginia, USA) for crawls. This will generally lead to the best performance. But there are a couple reasons you may want to crawl from a specific server: 1) if you need to be crawling from a specific location (like a site that will not serve pages to another region) or 2) if you need to change firewall rules or robots directives to allow only certain IP addresses in. Now you can select to use a specific server for a crawl (and set it as the default for a particular site so crawls always use that server). Currently the servers you can select are in Sydney, Astralia and Newark, USA, but we can add other locations if necessary. Note that crawling from one of these servers will mean at least a slightly slower crawl, but this option is available if you need it.
We are always making scaling and monitoring improvements. Most are pretty technical and toward things "just working" for our customers. That said, two things stand out as improvements you may notice:
Content Chimera is optimized for large, international digital presences. That said, our work remains interesting since there are always new challenges encountered in complex crawls. Here are some of the things we've recently improved:
As always, we are working on a variety of small changes and bug fixes, including:
Launch day improvements! Although we made a ton of improvements behind the scenes getting ready for lauch, below are the most noticable ones.
We have added the ability to take quizzes in Content Chimera. The first thing we implemented was a wizard to select the right chart for content visualizations:
One of the core aspects of Content Chimera is that you can start long running processes and return to them later, even across devices (no need to be tied down to the device you happen to start a crawl on for example). Now, Content Chimera will also alert you when a process is done. When a long process completes normally you will both receive an email and a sound will ring (if you keep the browser tab open in the background). If a process completes in error, you will hear an error ring (so you can respond to any issues quicker).
For more: Walking away from long processes.
We completely changed our help system, including a new section on the principles of Content Chimera:
Quickstart paths allow you to enter a URL and then automatically Content Chimera will do several steps of analysis (including starting by crawling) and generate a multi-chart report. Note that currently this is only available for new users as a trial (only for the first thousand URLs). Start a quickstart path for initial migration analysis or a brief content analysis.
Pervasiveness tables compress a great deal of information into a small table. These show how pervasive different elements are across the site (in this case, the rows are content types and the columns are percentages of pages with at least one metadata value of that type):
For more on pervasiveness tables: Pervasiveness and Heatmap Tables: Visualizing the Big Picture
If you reach a point of being "done" in an analysis for a client, you can archive the analysis of their digital presence in Content Chimera. You will no longer be able to add more information or use rules, but you can still use the charts. This is useful for you to manage your maximum URLs and digital presences in your Content Chimera license.
A ton of improvements, especially around charting, scraping and crawling. Also, the backend we have been working toward the capability for a report/dashboard that contains many dynamic charts.
The highlight of the charting improvements are sankey diagrams and dynamic scatter charts.
Sankey diagrams visualize the flow between states. For migration planning, it can show how content will be treated. For example, in this diagram we see how content in different folders will be handled in the migration (most of the articles will be kept as is while some are rewritten, etc).
Scatter charts can pack a lot of information but always comparing two numeric values. For instance, we can plot effectiveness by comparing pageviews against number of pages. The ideal content in the chart below is toward the upper left (not many pages contributing to a lot of pageviews). In this case, the content is classified to two levels of content type, with the first level being represented by color. As you can see, by hovering over an item in the legend we highlight that primary content type.Other charting improvements
Content Chimera will follow combinations of URL parameters, and you can set what parameters not to follow. That said, especially for large crawls, it can be challenging to keep restarting a crawl with different parameters. Content Chimera now automatically stops crawling unproductive paths. Related, Content Chimera will now stop following a redirect chain after going 20 deep.
As always, a variety of routine fixes, performance improvements, and backend monitoring improvements were rolled out. In addition, many of the changes below are toward stronger features in the future. But the big headliners for now are better scraping and heatmap tables.
Heatmap tables show compare two categories, with darkness showing the amount in each cell. For instance, this heatmap table shows how frequently different Calls-To-Action are used by site section. We can see that Change Request Flowchart is the most commonly-used CTA across the site, and that articles has the most CTAs.
To use a heatmap table you need just need to specify what should be in the rows and what should be in the columns (after first clicking the gear icon to go to advanced charting options and the selecting Heatmap Table as the chart type):
We are also close to launching scatter charts, which is a good way to compare categories with more values.
Content Chimera has always had the ability to scrape patterns out of the content, and it always did so from a local cache of the site. That said, creating, managing, and running patterns was cumbersome. So we made a variety of improvements.
Before you needed to separately select scope, pattern, and test. Now, instead of selecting "Full HTML", "Table", "Has", you just select "Tables". So you now interact with Full Patterns, which are a combination of scope (where to even look for the pattern), pattern (why to pull from the scope), and test (what single test, using a simple "does it have a value?").
You can now define your own full patterns. You can select a combination of scope, pattern, and test, like before (although now you can name them and re-use them):
You can even define the *components* of a pattern extraction:
Do you happen to be blessed with nice, clean metadata exposed like this?
If so, when defining a scope just select the new "Meta Tag" option and just enter DC.subject or the specific meta-tag you want to capture. For pattern, select either All (which would work in the example here) or Comma-Separated (which will pull out from comma-separated lists).
You can now test a pattern before unleashing it against the entire site. For example, here is a test against the content above, using a Meta Tag scope on DC.subject:
Whenever you scrape a pattern, several columns are generated for your charting and decision-making. Now we added an "all" column. This will list all the values (actually, there are limits: it just captures the first 200 values or 16,000 characters, whichever comes first). We actually are actively testing sophisticated multi-value analysis as well, which will really ramp up the value of the "all" columns.
Every time you scrape a pattern, it gets added to your site* suite of patterns. These could all be re-scraped at once. In the future we plan on allowing a billing organization to create a suite of patterns that can then be run against any sites that organization is managing.
*Actually, you can add patterns to any extent. An extent is a site, group of sites, or client.
We have been working hard on an upcoming feature: multi-value analysis. This will allow analysis of tagging / topics for example (each piece of content can have multiple tags or topics applied to it). This is a technically-difficult task that will take some time to completely develop and deploy. For now, we have been making a variety of backend changes such as weaving in an entirely different database type for this multi-value analysis and better scraping of multi-values
Over the past weeks we have trickled out a variety of smaller changes, such as implementing a new approach logging to allow better visibility and fixing a bug where a second pass of a pattern scrape wasn't updating correctly.
We have developed a new approach for more quickly deploying new forms to control more in Content Chimera (not glamorous but should help us in the future). For now, we have added the following forms:
Advanced charting options were reorganized for clarity and for a bit more space. The primary charting options were rationalized between normal charting and treemap charting.
In normal charting:
In treemap view (since that is a true hierarchy, and coloring works differently in a treemap):
In addition, there were some bugs in random sampling that have been fixed. Also, the charting is slightly faster now.
Changes to the website (not the app itself):
Now, when you do RoT analysis, you can also do reading analysis. Then you can use the reading level data just like all other data, for instance to graph the distribution of reading levels across the pages of your site.
Also, a bunch of scaling, performance, monitoring, and bug fixes: Fix how the rules processing UI worked. Lots of mostly-invisible changes to RoT testing (better scalability, improved monitoring, improved error handling, handling more edge cases of encoding issues).
There were a ton of related changes that we wanted to roll out together, so today's deployment was big. One theme is more consistent asset filters (rules to filter assets by, such as folder1 = articles), which are now used in rules and charts -- this is now generalized so will probably be added elsewhere as well.