Today’s public cloud infrastructure is built on elasticity as a core value proposition, and that dynamism brings real benefits. However, failure is inevitable: it occurs regularly and often in unpredictable ways. To use a clichéd saying, the computing forecast for tomorrow is “cloudy with a chance of failure.”
In the face of such inevitable and unpredictable failure, how can you write a reliable program that provides the high level of availability your users want?
The good news is that the cloud service providers have done a very good job at providing a framework that enables us to design around such failures and create highly available and resilient applications in the cloud.
Obviously, we still have to design and write our programs to make use of it all.
In this blog, we will look at two things: how we made our Halo Event Connector highly available, and the techniques we used to achieve the high throughput the connector needs to handle the volume of events generated by large customer deployments.
The Original Halo Event Connector
A few months ago we released the Halo Event Connector which enables users to extract Halo events and publish them to a variety of target systems – SIEM tools, syslog, local files, etc. To quote from the original connector documentation,
“The Connector is a Python script that is designed to execute repeatedly, keeping the external tool up-to-date with Halo events as time passes and new events occur:
The first time the Connector runs, by default, it retrieves all logged events from a single Halo account. Then the Connector creates a file, writes the timestamp of the last-retrieved event in it, and saves it in the current directory”
Because the timestamp file is saved locally, alongside the running script, a server failure means the state information about the events being streamed is lost. A new instance of the script wouldn’t know where to start and would end up streaming duplicate events to the consumer system.
Making the Halo Event Connector Highly Available
CloudPassage has now enhanced the connector to make it highly available: it can store program “state” in a location separate from where the connector instance(s) are executing.
The new version of the connector can store program state in an Amazon Web Services (AWS) S3 bucket or a Redis cluster, both highly available storage media. Program state consists of a lock file and a timestamp file. The connector periodically updates the timestamp file with the timestamp of the last event retrieved. The lock file arbitrates between multiple competing connectors to decide which one fetches events: only the connector that acquires the lock actively fetches events from the Halo Grid.
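To make the lock-and-timestamp idea concrete, here is a minimal sketch in Python. It is not the connector's actual code: the SharedState class, its method names, and the in-memory storage are all hypothetical stand-ins for an S3 bucket or Redis cluster. The lock expiry (ttl) is also our assumption, added so that a standby connector can take over after a failure; the text above does not say how a stale lock is released.

```python
class SharedState:
    """Hypothetical in-memory stand-in for the shared S3/Redis state."""

    def __init__(self):
        self.lock_owner = None            # "lock file": which connector holds it
        self.lock_expires = 0.0           # assumed expiry time for the lock
        self.last_event_timestamp = None  # "timestamp file"

    def try_acquire(self, connector_id, ttl, now):
        """Take (or renew) the lock if it is free, ours, or expired."""
        free_or_ours = self.lock_owner in (None, connector_id)
        if free_or_ours or now >= self.lock_expires:
            self.lock_owner = connector_id
            self.lock_expires = now + ttl
            return True
        return False

    def save_timestamp(self, connector_id, timestamp):
        """Only the current lock holder may record progress."""
        if self.lock_owner != connector_id:
            raise PermissionError("connector does not hold the lock")
        self.last_event_timestamp = timestamp


state = SharedState()
state.try_acquire("connector-a", ttl=30, now=0)    # connector-a becomes active
state.try_acquire("connector-b", ttl=30, now=10)   # refused: lock still held
state.save_timestamp("connector-a", "2014-01-01T00:00:00Z")
state.try_acquire("connector-b", ttl=30, now=31)   # lock expired: b takes over
```

Because the timestamp lives in the shared store rather than on the failed server, the connector that takes over can resume from the last recorded event instead of starting again.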
Scaling the Halo Event Connector
Often, when you request data using the Halo API, endpoints that return bulk data return only a subset of it. Some responses could contain thousands of objects, so most responses are paginated by default. In such cases, just as you have to turn the pages of a book to read the whole book, you may need to “page” through the data returned to get more.
Sequentially traversing pages limits the throughput of the Halo Event Connector and can lead to the connector “falling behind”. The Halo Event Connector now supports multi-threading to relieve the retrieval lag while still ensuring that the events are delivered in the correct order.
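As an illustration of paging, the sketch below walks through pages one at a time until there is nothing left. The fetch_page callable is hypothetical and stands in for a request to a paginated endpoint; treating an empty page as the end of the data is our assumption, not a documented detail of the Halo API.

```python
def fetch_all(fetch_page):
    """Collect events page by page until an empty page is returned."""
    events = []
    page = 1
    while True:
        batch = fetch_page(page)  # hypothetical: returns a list of events
        if not batch:             # assumed end-of-data signal
            break
        events.extend(batch)
        page += 1
    return events


# A fake three-page data source used only for demonstration.
FAKE_PAGES = {1: ["e1", "e2"], 2: ["e3", "e4"], 3: ["e5"]}
all_events = fetch_all(lambda page: FAKE_PAGES.get(page, []))
```

Each request here waits for the previous one to finish, which is the sequential traversal that caps throughput.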
Let’s say you start the Halo connector with 10 threads. When the process starts a thread, it assigns the thread a number from 1 to 10. The first thread is 1, the next thread is 2, and so on until the last thread, which is 10. Each thread starts with the page number equal to its thread number, then adds 10 to get the next page it should fetch. So, the first thread gets page 1, then page 11, then page 21, etc. The second thread gets page 2, then page 12, then page 22, etc. The last thread gets page 10, then 20, 30, etc.
Each thread is responsible for the same number of pages, except in the last cycle, when some of the threads may finish one cycle earlier, depending on the total number of pages currently available to fetch.
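The page assignment described above amounts to a strided range. The helper below is a small sketch of that arithmetic; the function name and the total_pages parameter are ours, chosen for illustration.

```python
def pages_for_thread(thread_number, thread_count, total_pages):
    """Pages a 1-based thread fetches: its own number, then every thread_count-th page."""
    return list(range(thread_number, total_pages + 1, thread_count))


# With 10 threads and 35 pages available:
first = pages_for_thread(1, 10, 35)   # [1, 11, 21, 31]
sixth = pages_for_thread(6, 10, 35)   # [6, 16, 26]
last = pages_for_thread(10, 10, 35)   # [10, 20, 30]
```

With 35 pages, threads 1 through 5 each fetch four pages while threads 6 through 10 fetch three, which is the "finish one cycle earlier" case.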
With multiple threads fetching events simultaneously, pages can arrive out of order, so we make sure the events are reordered before being sent to their destination. A separate thread works to put all the events back in the correct order. It starts with page 1: it waits for that page to be added to the output queue, then formats it and writes it to the destination (stdout, file, syslog, Splunk pipeline, etc.). Then it looks for page 2, and so on. If the next page is not yet available, it sleeps for 1/10 of a second and checks again. It goes through all the pages in sequence, repeating the same steps.
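Here is a rough sketch of how fetcher threads and an ordering thread could fit together, using Python's standard threading module. It is an illustration under assumptions, not the connector's code: stream_in_order, fetch_page, and write are hypothetical names, the "output queue" is modeled as a dictionary keyed by page number, and the total page count is assumed to be known up front.

```python
import threading
import time


def stream_in_order(total_pages, thread_count, fetch_page, write):
    """Fetch pages on several threads, but write events in page order."""
    ready = {}               # page number -> events, filled by fetchers
    lock = threading.Lock()  # guards access to `ready`

    def fetcher(thread_number):
        # Each fetcher takes its own number, then every thread_count-th page.
        for page in range(thread_number, total_pages + 1, thread_count):
            events = fetch_page(page)
            with lock:
                ready[page] = events

    def writer():
        for page in range(1, total_pages + 1):
            while True:
                with lock:
                    events = ready.pop(page, None)
                if events is not None:
                    break
                time.sleep(0.1)  # next page not ready yet: wait 1/10 second
            for event in events:
                write(event)

    threads = [threading.Thread(target=fetcher, args=(n,))
               for n in range(1, thread_count + 1)]
    threads.append(threading.Thread(target=writer))
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def slow_fake_fetch(page):
    time.sleep(0.01 * (6 - page))  # earlier pages take longer to arrive
    return [(page, i) for i in range(2)]


output = []
stream_in_order(5, 3, slow_fake_fetch, output.append)
```

Even though the fake fetch deliberately returns later pages sooner, the writer emits page 1 first and never skips ahead, so the destination sees events in their original order.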
Using this technique, we have been able to achieve an order of magnitude more throughput than the previous non-multithreaded version.
Support the Demand
As more and more enterprises adopt the cloud, it is becoming more important to deliver solutions that are architected to take advantage of the distributed and dynamic nature of cloud computing, and at the same time are resilient enough to withstand inevitable and frequent failures.
They also have to be able to support massive throughput in order to keep up with large-scale deployments of cloud servers.
To keep ahead of these issues, CloudPassage hopes that this upgrade to the Halo Event Connector helps ensure high availability and throughput for Halo API integrations and any customized security data feeds for your business.