What's New:
Content Chimera

Early access is now available. Sign up for early access.

30 Mar 2020: New Forms

We have been working hard on an upcoming feature: multi-value analysis. This will allow analysis of fields such as tags or topics, where each piece of content can have multiple values applied to it. This is a technically difficult task that will take some time to fully develop and deploy. For now, we have been making a variety of backend changes, such as weaving in an entirely different database type for this multi-value analysis and improving the scraping of multi-values.

Over the past weeks we have trickled out a variety of smaller changes, such as implementing a new approach to logging for better visibility and fixing a bug where a second pass of a pattern scrape wasn't updating correctly.

We have developed a new approach for deploying new forms more quickly, so that more can be controlled in Content Chimera (not glamorous, but it should help us in the future). For now, we have added the following forms:

  • A new form to add a custom data source. Please contact us if you plan on doing this so we can help with this advanced feature.
  • A new Quotas & Users form (access via the pulldown under your name) where you can add more users and see the status of your quota usage.

31 Jan 2020: Chart Improvements

Various charting improvements

Advanced charting options were reorganized for clarity and for a bit more space. The primary charting options were rationalized between normal charting and treemap charting.

In normal charting:

In treemap view (since that is a true hierarchy, and coloring works differently in a treemap):

In addition, there were some bugs in random sampling that have been fixed. Also, the charting is slightly faster now.

Website improvements

Changes to the website (not the app itself):

  • A new home page
  • A separate features page
  • A global navigation

28 Jan 2020: Filtering Improvements

Various filtering improvements

  • Saving a chart also now stores the Group setting
  • Chart now refreshes every time a filter is changed on the chart page
  • Changed the way empty values are treated: they are no longer considered as 0. For instance, when filtering on PageViews < 1, a page that did not match a Google Analytics row at all will no longer be counted in that filter
  • Can now remove a filter from a chart (without refreshing the page manually)
  • “is empty” operator now available in the front end
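
To illustrate the empty-value change, here is a minimal Python sketch. It is not Content Chimera's actual code: the `pageviews` field, the sample pages, and the `matches_less_than` helper are all made up for the example. It only shows the idea that an empty value fails a numeric comparison instead of being treated as 0, while a separate "is empty" check picks those pages up.

```python
from typing import Optional

def matches_less_than(value: Optional[float], threshold: float) -> bool:
    """Return True only when a real value exists and is below the threshold."""
    if value is None:  # empty: the page matched no analytics row at all
        return False
    return value < threshold

# Hypothetical sample data
pages = [
    {"url": "/a", "pageviews": 0},     # matched a row, zero views
    {"url": "/b", "pageviews": None},  # no matching row: empty, not 0
    {"url": "/c", "pageviews": 12},
]

# "PageViews < 1" no longer picks up the empty page
low_traffic = [p["url"] for p in pages if matches_less_than(p["pageviews"], 1)]

# An "is empty" style check finds the pages with no value
empty = [p["url"] for p in pages if p["pageviews"] is None]
```

In this sketch only `/a` lands in `low_traffic`, and `/b` is found by the empty check instead.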

New data source + new patterns + more

  • OnPoint Auditor is now a data source for import
  • New patterns and scopes for scraping: including Drupal Paragraphs, Drupal Content Types, and Meta Description
  • The extent history page now displays key parameters for scrape and merge jobs, and has been formatted slightly better.
  • Previously, when you ran out of URLs, Content Chimera silently stopped the crawl without telling you on the web page. This has been fixed.
  • Various backend improvements

18 Jan 2020: Site History

  • Site-level activity history! Now that more people are using the tool, we are running into the case of more than one person working on analysis of a site at the same time. So now you can click on "Show History" from the Assets & Metadata to see all major activity on that site (or sitegroup or client). The history also indicates who took each action.
  • CAT / CWRX Audit has been added as a data source. If you missed the news, Content Science bought Content Analysis Toolkit (CAT) so it is in the process of changing names. Regardless of the name, you can now import from the tool.
  • Darker color for single-color charts. During a screenshare demo to a remote conference room, everyone yelled "We can't see the chart!", so we made the bars a darker shade of gray so it's higher contrast when a monitor isn't calibrated well (or in strong ambient light).
  • More URL deduping options. Content Chimera does two types of duplicate analysis: URL-level (looking only at the URL and nothing else) and content-level (looking at the text itself). We already had options for whether URLs with and without www should be considered duplicates, along with options for capitalization and for http versus https. We just added configuration options for whether URLs with and without an index.html (actually index.*) at the end, or with and without a trailing slash, should be considered duplicates.
  • As usual, we made a variety of backend improvements as well, for us to better monitor the health of the systems and to troubleshoot more quickly.
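
The URL-level deduping options above can be pictured as switches on a URL normalization step. The sketch below is only an illustration under our own assumptions, not how Content Chimera implements it: the `dedupe_key` function and its option names are hypothetical. Two URLs count as duplicates when they produce the same key.

```python
import re
from urllib.parse import urlsplit

def dedupe_key(url, ignore_www=True, ignore_scheme=True, ignore_case=True,
               ignore_index=True, ignore_trailing_slash=True):
    """Build a comparison key; URLs with equal keys are treated as duplicates."""
    parts = urlsplit(url)
    host = parts.netloc.lower()  # hostnames are case-insensitive anyway
    path = parts.path
    if ignore_www and host.startswith("www."):
        host = host[len("www."):]
    if ignore_case:
        path = path.lower()
    if ignore_index:
        # Drop a final index.* segment (index.html, index.php, ...)
        path = re.sub(r"/index\.[^/]*$", "/", path)
    if ignore_trailing_slash and path != "/":
        path = path.rstrip("/")
    if path == "":
        path = "/"
    prefix = "" if ignore_scheme else parts.scheme.lower() + "://"
    key = prefix + host + path
    if parts.query:
        key += "?" + parts.query
    return key

same = dedupe_key("https://www.Example.com/About/index.html") == \
       dedupe_key("http://example.com/about/")
```

With every option switched on, the two URLs in the last lines reduce to the same key, so `same` is true. Turning off `ignore_trailing_slash` would make `/about/` and `/about` distinct again.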

2 December 2019: Reading Level Analysis!

Now, when you do RoT analysis, you can also do reading level analysis. Then you can use the reading level data just like all other data, for instance to graph the distribution of reading levels across the pages of your site.

Also, a bunch of scaling, performance, and monitoring improvements and bug fixes: we fixed how the rules processing UI worked, and made lots of mostly invisible changes to RoT testing (better scalability, improved monitoring, improved error handling, handling more edge cases of encoding issues).

7 November 2019: Major Changes

There were a ton of related changes that we wanted to roll out together, so today's deployment was big. One theme is more consistent asset filters (rules to filter assets by, such as folder1 = articles), which are now used in rules and charts. This is now generalized, so it will probably be added elsewhere as well.

Rules management

  • Now there is no left-right scrolling with large and complex rulesets
  • Now select rule operators (like "equals") from a pulldown, rather than the more error-prone method of having to type them
  • Ability to drag-and-drop to reorder rules

Charting

  • Can now click on the label at the bottom of the graph, useful if a particular element of the graph is too tiny to click
  • Filter can now be more complex (such as folder1 = article and folder2 = 2019)
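
As a rough picture of a compound filter like folder1 = article and folder2 = 2019, here is a small Python sketch. It rests on assumptions of ours rather than anything confirmed about Content Chimera: that folder1 and folder2 are the first and second path segments of the URL, and that conditions are combined with "and". The `folders` and `matches_all` helpers and the sample URLs are hypothetical.

```python
from urllib.parse import urlsplit

def folders(url):
    """Map path segments to folder1, folder2, ... (an assumed convention)."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return {f"folder{i}": seg for i, seg in enumerate(segments, start=1)}

def matches_all(asset, conditions):
    """True when every (field, expected value) condition holds for the asset."""
    return all(asset.get(field) == expected for field, expected in conditions)

# Hypothetical sample URLs
urls = [
    "https://example.com/article/2019/launch",
    "https://example.com/article/2018/recap",
    "https://example.com/blog/2019/notes",
]

conditions = [("folder1", "article"), ("folder2", "2019")]
selected = [u for u in urls if matches_all(folders(u), conditions)]
```

Only the first URL satisfies both conditions, so it is the only one in `selected`.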

UI improvements

  • Can now modify clients in the UI
  • Better deal with and report to the UI when there are errors
  • More consistent filtering
  • Make numbers better match up in progress circles
  • When selecting extent (for instance, which site you want to analyze), checkboxes act as radio buttons rather than multi-selects
  • In the free ROT onboarding, allow people to give their real names
  • When starting a job and there are already other jobs running, report how many jobs are in front of you

Crawling improvements

  • Don't get blocked by Sucuri application filtering
  • Always use Content Chimera user agent in requests
  • Can now add an asset filter to a crawl job (for now this is just in the backend)
  • New backend option for whether to treat www and non-www URLs as the same (whether they are treated as duplicates in the URL deduplication process)

Backend and monitoring

  • Even out resource utilization across types of jobs
  • Show a bar at the top of each environment to make clear which environment we are running (you will now see a thin red bar at the top of your screen, which indicates the main production environment)
  • In the event that large DB writes don't work, report this and stop the job.
  • In addition to the PHP performance, server-level monitoring, logging, and alerting already in place, added more sophisticated annotations in monitoring to isolate root causes of performance issues in long-running jobs.
  • Better deal with bad character encoding at the time of ROT analysis
  • Better deal with emojis in text
  • Take a PHP performance snapshot every half hour of a very long-running job
  • Speed up file access, especially relevant for large jobs
  • Better report progress on a RoT job when there is a lot of trivial content
  • Fix bug when scraping against a field (rather than HTML or PDF)
