Introduction to Fault Tolerance
Higher Fault Tolerance
Larger environments benefit most from Graylog’s fault-tolerant architecture. With Graylog Enterprise, you can expand the infrastructure to consume high log volumes while keeping the system up and available for collection and analysis.
How It Works
Using a load balancer, you can run multiple Graylog servers that ingest logs and provide additional interfaces for analysts. You can configure the MongoDB and Elasticsearch databases redundantly to protect against data loss. Additionally, all processing pipelines use a message journal that quickly collects incoming messages and stores them on disk, so no messages are lost in case of a power loss.
Fault tolerance is also built into many of the agents we support via Sidecar. These agents let hosts spool their logs locally during a network outage and send them once the network connection is re-established.
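The spool-and-forward behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code of any agent managed by Sidecar: the `Spooler` class, the spool file path, and the `send` callback are all hypothetical names, and the callback is assumed to raise `ConnectionError` while the network is down.

```python
import os
from typing import Callable


class Spooler:
    """Hypothetical log shipper that spools to disk while the network is down."""

    def __init__(self, spool_path: str, send: Callable[[str], None]) -> None:
        self.spool_path = spool_path
        self.send = send  # assumed to raise ConnectionError during an outage

    def ship(self, line: str) -> None:
        # Deliver older spooled lines first so the original order is preserved.
        if self.flush():
            try:
                self.send(line)
                return
            except ConnectionError:
                pass
        # Network is down: append the line to the local spool file.
        with open(self.spool_path, "a", encoding="utf-8") as spool:
            spool.write(line + "\n")

    def flush(self) -> bool:
        """Try to send everything in the spool; return True if nothing is left."""
        if not os.path.exists(self.spool_path):
            return True
        with open(self.spool_path, encoding="utf-8") as spool:
            pending = spool.read().splitlines()
        for index, line in enumerate(pending):
            try:
                self.send(line)
            except ConnectionError:
                # Keep only the lines that have not been delivered yet.
                with open(self.spool_path, "w", encoding="utf-8") as spool:
                    spool.writelines(p + "\n" for p in pending[index:])
                return False
        os.remove(self.spool_path)
        return True
```

In this sketch, the spool file is the only state that has to survive a restart, which is why undelivered lines are written back to disk as soon as a send fails.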
Frequently Asked Questions
Can I monitor the nodes' health/state?
Yes. Using our REST API, you can query the status of each node, which responds with ALIVE or DEAD. You can also change the state manually to support zero-downtime upgrades.
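As a rough sketch of such a check, the Python below polls a node and interprets the ALIVE or DEAD answer. The resource path `/system/lbstatus` under the API base URL is an assumption based on Graylog's load balancer status resource and should be verified against the API browser of your version. The function names and the example URL are hypothetical.

```python
import urllib.error
import urllib.request


def parse_lb_status(body: str) -> bool:
    """Return True when the node reports ALIVE, False for DEAD or anything else."""
    return body.strip().upper() == "ALIVE"


def node_is_alive(base_url: str, timeout: float = 5.0) -> bool:
    # Assumed endpoint: GET <base_url>/system/lbstatus, plain-text ALIVE or DEAD.
    url = base_url.rstrip("/") + "/system/lbstatus"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return parse_lb_status(response.read().decode("utf-8"))
    except urllib.error.HTTPError as error:
        # A DEAD node may answer with an error status; still inspect the body.
        return parse_lb_status(error.read().decode("utf-8", errors="replace"))
    except (urllib.error.URLError, OSError):
        # Unreachable nodes are treated as not alive.
        return False
```

A monitoring job could call `node_is_alive("http://graylog-1.example.org:9000/api")` for each node (a placeholder address) and alert when any call returns `False`.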
Can I have more than one web interface?
Yes. You can run more than one web interface for redundancy and place a load balancer or proxy in front of them. This architecture also allows for centralized SSL/TLS termination and certificate management.
Can I adjust my journal size depending on my log volume?
Yes. You can set the journal size as needed, within the limits of your available disk space. A larger journal lets you ride out longer network outages or upgrades without losing any messages.
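A back-of-the-envelope estimate can help pick a size: multiply the ingest rate by the average message size and by the longest outage you want to survive, then add headroom. The Python below is a sketch of that arithmetic only. The function name, the 1.5 headroom factor, and the example figures are assumptions, and the setting the result would feed into (believed to be `message_journal_max_size` in the server configuration file) should be checked against your version.

```python
import math


def journal_size_gib(messages_per_second: float,
                     avg_message_bytes: int,
                     outage_hours: float,
                     headroom: float = 1.5) -> int:
    """Estimate the journal size in GiB needed to buffer an outage."""
    buffered_bytes = messages_per_second * avg_message_bytes * outage_hours * 3600
    # Round up so the journal is never sized below the estimate.
    return math.ceil(buffered_bytes * headroom / 1024 ** 3)


# 2,000 messages/s at 500 bytes each, buffered for a 4-hour outage.
print(journal_size_gib(2000, 500, 4))  # prints 21
```

The estimate ignores on-disk overhead of the journal format, so treat it as a lower bound and confirm it against the disk actually available on each node.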
Will I need to configure roles/permissions independently at each Graylog server?
No. All configurations are replicated, including SAML/SSO role mappings.
Does all this fault tolerance incur a performance penalty?
No. In fact, when properly configured, it will both increase your ingest performance and reduce the time it takes to return search results.
Can Graylog handle multi-site replication?
Yes. There are multiple ways to provide multi-site replication using Graylog.
Can Graylog notify another system if it detects a fault?
Yes. Graylog supports multiple notification mechanisms. In version 3.0, an alert can also trigger an action that invokes a script to act outside of Graylog.
What other methods can be used to make my environment more fault-tolerant?
Because Graylog is open and supports other open technologies, you can place a message queue in front of it to gain more control over the flow of data. A queue also gives developers an easier way to deliver logs while increasing fault tolerance.
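To illustrate the producer side of such a setup, the sketch below serializes a log event as a GELF 1.1 JSON document and places it on a queue. It makes several assumptions: the in-memory `queue.Queue` merely stands in for a real broker such as Kafka or an AMQP server, a matching Graylog input is assumed to consume from that broker, and the function name, host name, and field values are made up for the example.

```python
import json
import queue
import time
from typing import Any, Dict, Optional


def build_gelf(host: str, short_message: str, level: int = 6,
               extra: Optional[Dict[str, Any]] = None) -> str:
    """Serialize one log event as a GELF 1.1 JSON document."""
    event: Dict[str, Any] = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,  # syslog severity; 6 = informational
    }
    for key, value in (extra or {}).items():
        event["_" + key] = value  # GELF additional fields start with an underscore
    return json.dumps(event)


# Stand-in for a real broker topic or queue (assumption for this sketch).
log_queue: "queue.Queue[str]" = queue.Queue()
log_queue.put(build_gelf("app-01", "user login failed", level=4,
                         extra={"user": "alice"}))
```

Because the application only has to publish to the queue, it keeps working while Graylog is being upgraded or is unreachable, and the broker holds the messages until they can be consumed.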