In Matador Jobs Pro hot fix 3.7.7, we explained a bug we now call the “Candidate Description API Bug”. This blog post will explain the bug in more detail, as we understand it, and share what we learned during the process of researching and resolving the issue in partnership with the engineers at Bullhorn.

Hints of Trouble in Late September

In late September, we began receiving reports from users that their sites were not working well. The issues included:

  • sites operating slowly
  • sites disconnecting from Bullhorn regularly
  • sites creating abnormally large log files, including ones that caused sites to exceed disk space limits by web hosts
  • and, most critically, several applications were failing to process into Bullhorn.

We began, as we tend to do when things aren’t going well, by asking some of our users to grant us access to their websites so we could look at the issues ourselves and begin our research. What we found was troubling, but confusing.

HTMLTagBalancer.java:1000 x 1000

Our Matador Jobs error logs showed that on some, but not all, applications, the API call where we save Candidate data returned massive error output. We’d get an HTTP 500 “Internal Server Error” and 1100 lines of error output from Bullhorn. After the first few lines of error output, one line repeated over and over:

HTMLTagBalancer.java:1000

We tried to narrow things down, and in doing so we were able to identify the following triggers:

  • The application included either an invalid/non-processable resume (highly formatted resumes sometimes cannot be processed) or had no resume attached, and,
  • When processing the application, Matador found an existing candidate during its duplicate prevention routine, and,
  • The candidate had a non-empty, HTML-formatted description field.

The thing is, even at this point, these conditions would sometimes trigger the error, and sometimes not. We were very confused, but we had enough example data to start a ticket with Bullhorn Marketplace Partner support.

The Bug Took Advantage of Matador’s Design

Let’s pause to partially explain how Matador Jobs Pro handles errors.

First, whenever an API call to Bullhorn returns an error, we read the HTTP error code. Generally, “400” type errors tell us that Bullhorn thinks we gave it bad data, and Matador is trained to stop processing the call and either tell us, tell you, or ignore the call. These errors are rare; we have learned how to format data so it is always ready for submission via the API.

Sometimes we see “500” type errors. These are “Internal Server Errors” from the API. In our six years of writing code for the Bullhorn API, a “500” type error had always been due to planned or unplanned Bullhorn downtime. So, for these errors, Matador is trained to retry the sync later. Up to this point, we had never seen a “500” type error that wasn’t resolved by a later retry.
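That error-handling policy can be summarized in a short sketch. This is illustration only, written in Python: Matador Jobs Pro is a WordPress (PHP) plugin, and the function and label names below are hypothetical, not its real internals.

```python
def handle_api_response(status_code):
    """Decide what to do with a Bullhorn API response, per the policy above."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        # Bullhorn says we sent bad data: stop processing, log, and notify.
        return "stop_and_log"
    if 500 <= status_code < 600:
        # Historically a Bullhorn outage: assume transient and retry later.
        return "retry_later"
    return "unknown"
```

The bug described in this post broke the assumption behind the 5xx branch: the error was permanent, so "retry later" looped forever.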

Further, whenever error data is provided to us, we log it. Usually, error data is one sentence, not 1100 lines of output. Logging error data generally helps us make Matador better: we can read the logged error (and success) messages in your site’s logs and use them to understand what is happening.

This API bug took advantage of both of those design features and turned them into flaws. The consequences were sometimes extreme. For some users, over a hundred applications became “stuck” in a loop of failing and retrying, with each failure logging 1100 lines of error data. This resulted in sites constantly retrying these syncs, over and over, with side effects including:

  • Generating massive Matador log files. One user hit their allowed disk space limit due to a Matador log that was 100,000+ lines of data!
  • Causing disconnections and stale job data.
  • Causing Matador to fail to dynamically fetch new jobs accessed via a /{ID} link (Indeed integration users rely on this).
  • Causing generally slower sites.
  • Most importantly, applicant data that could lead to placements was not reaching the user’s recruitment teams, because the syncs were failing.

I want to pause here to make an important point: the indirect impacts of this bug are the ones that took advantage of Matador Jobs Pro’s error-handling weaknesses and created a domino effect of issues throughout the site, but those indirect impacts were also a key reason we were alerted to the issue in the first place. That said, the most damaging aspect of this bug for our users’ businesses was the direct impact: blocking application syncing, and thus preventing applicant data from funneling to the recruiting teams. No degree of improved error handling by Matador could have changed that outcome.

Temporary Work-Around

Even before Bullhorn engineers were able to confirm the bug, we at Matador Software developed temporary workarounds to help our most impacted users recover. Workarounds were installed as each user reached out to us while we continued to work with Bullhorn to identify the true cause of the problem and develop a fix.

Bug Confirmed

After several back-and-forth emails over 10 days, Bullhorn engineers were able to confirm the bug and recreate it in their systems. They also gave us an answer as to the cause (and explained why our three conditions were inconsistent in recreating the bug):

  • When the existing candidate’s HTML-formatted description contained a <style> tag with inline CSS, a built-in HTML validator that runs before an API save is confirmed would fail. Text and HTML descriptions without a <style> tag did not trigger the bug!

That explained a lot. We finally understood why only some of the candidates who met the conditions we had identified replicated the problem: we had been missing the final condition!

The Temporary Work-Around Becomes Part of Matador Jobs

Though the bug was now confirmed at Bullhorn, we were informed a fix would be weeks away. Due to the size of Bullhorn’s operation, it would take a while for the fix to reach everyone; the estimated timeline was weeks, or even a few months. The engineering team suggested we offer a work-around. While we protested, because we preferred to have the bug fixed at Bullhorn, we agreed that we could offer a widespread short-term resolution for our users.

The work-around prevents the candidate description from being updated when a resume is not processed (HTML from the resume processor appeared never to cause an issue). If a resume is processed with the application, the description is updated from the resume; otherwise, the description is unset from the found candidate data prior to the save.
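The work-around logic can be sketched as follows. Again, this is Python for illustration only; the function and field names are hypothetical, not Matador’s actual (PHP) implementation.

```python
def prepare_candidate_update(found_candidate, parsed_resume):
    """Build the candidate data to save, per the work-around described above."""
    update = dict(found_candidate)  # start from the found (existing) candidate
    if parsed_resume and parsed_resume.get("description"):
        # A resume was processed: safe to update the description from it.
        update["description"] = parsed_resume["description"]
    else:
        # No processed resume: unset the description so the existing (possibly
        # <style>-laden) description is never re-submitted to the API.
        update.pop("description", None)
    return update
```

Because the existing description is simply left out of the save, Bullhorn’s validator never sees it and the 500 error is avoided.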

This does mean that any routines where a Matador Jobs Pro install modifies the description with custom developer filters (for example, to add data from custom questions at the end) will temporarily not work. That said, the default install of Matador Jobs Pro does not use this feature anyway, so few users will be impacted by this temporary change. We are aware of fewer than five users who run custom code to modify candidate descriptions, and we have notified them of this impact to their sites. Until the work-around can be removed from the code, we will discourage new installs from using this custom code point.

What We Learned

We learned some things from this experience, namely that we need to assume the worst when a “500” error is encountered, even if we think we understand the cause. We will be implementing the following protections into a future version of Matador Jobs, slated for 3.8.0:

  • Limiting error output to the logs in the rare case we encounter another 1100+ line bug in the future. This will prevent Matador from generating giant log files that eat up disk space. Errors will be limited to a sufficient but reasonable output (say, instead of 1100 lines of error output, we limit it to 25 lines).
  • Limiting “retries” for syncs. If a sync fails more than a few times due to the same reason, even if Matador thinks it is a recoverable failure, Matador should stop retrying.
  • In the event of continued failures that hit the retry limit, adding adequate notifications to the administrator so they’re aware of Matador’s decision to stop retrying and are advised to explore the reason why.
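The first two protections above can be sketched like so. This is a minimal Python illustration; the names and limits are hypothetical placeholders, not what will ship in 3.8.0.

```python
MAX_LOGGED_ERROR_LINES = 25
MAX_RETRIES = 3

def truncate_error_for_log(error_text, limit=MAX_LOGGED_ERROR_LINES):
    """Cap a logged error at `limit` lines, noting how many were omitted."""
    lines = error_text.splitlines()
    if len(lines) <= limit:
        return error_text
    omitted = len(lines) - limit
    return "\n".join(lines[:limit] + [f"... ({omitted} more lines omitted)"])

def should_retry(attempts_so_far, max_retries=MAX_RETRIES):
    """Stop retrying after the limit; the caller should then notify the admin."""
    return attempts_so_far < max_retries
```

With a cap like this, even a 1100-line error from a stuck sync adds only 26 lines to the log, and the sync stops retrying after a handful of identical failures instead of looping forever.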

While those adjustments will help minimize the impact of future bugs like this, it is important to repeat that while Matador may have room for improvement in handling an error like this, the fact that it mishandled this error is part of how we were able to discover and debug it as quickly as we did.

Therefore, any adjustments we make to suppress future errors should not be so strong that the suppression could in fact hide a big problem from us and our users and elongate response time!

Update ASAP!

In conclusion, this has been a frustrating and long process for us, but situations like this are also why you keep your subscription active! Your initial subscription and annual renewals keep Paul and me “on the job” with four eyes watching over 500 Matador Jobs Pro users’ sites, emails, and phone calls. For no extra charge, you all received a necessary update to your sites within a few weeks of the discovery of a business-impacting emergent issue.

That said, please go update! The one thing we can’t do is force your sites to run the updates, so unless your web host provides automatic WordPress plugin updates, please go into your WordPress admin and update to Matador Jobs Pro 3.7.7.