Graylog Success Story

Gameforge

With more than 20 titles and over 450 million registered players, Gameforge is the leading provider of free-to-play massively multiplayer online games (MMOs) in the western hemisphere. Headquartered in Karlsruhe, Germany, the group offers its online games in more than 75 countries and employs 450 members of staff. The portfolio includes client-based games such as the award-winning AION Free-to-Play, TERA: Fate of Arun, Metin2 (Europe’s most successful MMO), HEX: Shards of Fate, Orcs Must Die! Unchained, Runes of Magic, Elsword and Wizard101. Popular browser-based games such as OGame and Ikariam as well as a growing portfolio of mobile games complete their offering. For more information please visit http://corporate.gameforge.com/.


Challenge:

To keep its gamers happy and coming back as active users, Gameforge must provide a great gaming experience at all times. This means ensuring that the games are always up and performing optimally. Gameforge knows that monitoring and analyzing application log data is a critical part of staying one step ahead of application performance issues. Gameforge’s team of 80 application developers and 18 system administrators manages more than 20 games being delivered across more than 2,000 servers. That’s a challenging daily task.

Gameforge’s infrastructure includes both Linux and Windows servers. For Linux, both the developers and administrators found it very time-consuming to manually write suitable bash scripts to grep the log data they needed across all servers. On the Windows side, the system administrators built their own system to store, analyze, and search event logs. This was expensive in terms of developer resources and infrastructure costs, and it was also a challenge to maintain. Meanwhile, the developers could not access the log files due to security policies, so they had to go through the admin team every time they needed log data.

“We needed a better log management solution to quickly and easily troubleshoot issues before our gamers noticed any problems in the games or systems,” said Felix Oechsler, Lead Windows System Administrator at Gameforge. “We also wanted an all-in-one solution that would be easy to use and manage across our different departments.”

Why Graylog:

Gameforge chose Graylog for five reasons: user access management, intuitive search and filtering, visualizations, alerting, and flexible log data extraction.

User access management: Graylog helped eliminate the information bottleneck between Gameforge’s application development and system administration teams. System administrators are now able to give other teams access to the Graylog web interface so they can search through and analyze the log data themselves. “Developers no longer need server access to work with log data, so there’s no concern about security policy violations. Our system administrators are also relieved because there are fewer requests from the developers to get access to log data,” said Felix.

Intuitive search and filtering: Graylog eliminated the need to manually search through log data on multiple servers. With all logs centralized in one place, Gameforge simply uses Graylog’s powerful search functionality to quickly find and filter the exact logs they need. “Graylog’s search tool is so easy to use that our teams quickly became self-sufficient at analyzing the log data themselves,” said Felix.

Visualizations: Gameforge uses Graylog’s dashboards to visualize the information from their log data, making it easier to spot anomalies in the 10 million event messages being processed daily. In one dashboard view, Gameforge monitors the number of player login errors, In-Game Shop payment issues, and problems with Match Calculation (the mechanism that ensures a player is always able to find an opponent to play). They also keep track of the times of day at which a game generates more issues than usual. Having these different elements on a dashboard helps the Gameforge team spot and troubleshoot problems. “In one case, some of our developers were able to spot an anomaly in the In-Game Shop log data which was indicating significant problems in the game, and were able to proactively troubleshoot the issue before our users even noticed,” said Dennis Simon, Lead Developer at Gameforge.

Alerting: Graylog uses Streams to process incoming messages in real time based on customizable conditions. Streams can trigger an alert in real time when a message comes in that matches your alert rules. “My development team created our own streams and set alerts for the games we were working on. We’re able to get real time alerts on critical errors in their games and applications so we can react faster than before,” said Dennis. “For example, we set up a stream to monitor error messages for our payment system in the game server, and we set up alerts to notify us when the stream processed more error messages than normal. In one case, the mobile game team received an alert that our connection to the Apple payment system was down. We were able to resolve the problem quickly and avoided the bad user experience of a payment not going through.”
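The alert condition Dennis describes, "more error messages than normal" on a stream, can be pictured with a minimal Python sketch. This is not Graylog's implementation or API. The class name ErrorRateAlert, the message fields (source, level, timestamp), and the window and threshold values are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ErrorRateAlert:
    """Hypothetical helper: fires when matching errors in a sliding window exceed a threshold."""

    window_seconds: float
    threshold: int
    _timestamps: deque = field(default_factory=deque, init=False, repr=False)

    def matches(self, message: dict) -> bool:
        # Assumed stream rule: payment-system messages at error severity.
        return message.get("source") == "payment" and message.get("level") == "error"

    def process(self, message: dict) -> bool:
        """Feed one message; return True if the alert condition is currently met."""
        if not self.matches(message):
            return False
        now = message["timestamp"]
        self._timestamps.append(now)
        # Forget errors that have fallen out of the sliding window.
        while self._timestamps and self._timestamps[0] <= now - self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.threshold
```

With window_seconds=60 and threshold=2, the third payment error within a minute would raise the alert, while messages that do not match the stream rule are ignored.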

Flexible log data extraction: Thousands of devices and endpoints each have their own log message format. Instead of writing custom message inputs and parsers for each unique data source, Graylog users can use the REST API or web interface wizard to create extractors to pull data from any text in a message, regardless of format. Gameforge’s admin team extracts important information out of non-indexed log events from several sources, including firewalls, DNS data, and Java/Glassfish logs.
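To give a feel for what a pattern-based extractor does, here is a small Python sketch that pulls named fields out of free-text log lines with a regular expression. It illustrates the general idea only and is not Graylog's extractor configuration. The sample firewall line, the pattern, and the field names are invented for this example.

```python
import re

# Hypothetical firewall log line; the format is invented for illustration.
LINE = "2014-03-01T12:00:00Z fw01 DROP src=10.0.0.5 dst=192.168.1.20 dport=443"

# Named groups become the extracted field names.
PATTERN = re.compile(
    r"(?P<action>ACCEPT|DROP)\s+"
    r"src=(?P<src_ip>\S+)\s+"
    r"dst=(?P<dst_ip>\S+)\s+"
    r"dport=(?P<dst_port>\d+)"
)


def extract_fields(message: str) -> dict:
    """Return the structured fields found in a raw message, or {} if none match."""
    match = PATTERN.search(message)
    if match is None:
        return {}
    fields = match.groupdict()
    # Convert the port to a number so it can be filtered and aggregated.
    fields["dst_port"] = int(fields["dst_port"])
    return fields
```

Once fields such as src_ip or dst_port exist as separate values, they can be searched, filtered, and charted rather than matched as raw text.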




Results:

Gameforge has been using Graylog since early 2014. Sending log messages to Graylog has been implemented as a standard function in Gameforge’s code development library. Because most of the developers already use this library, they can log to Graylog with very little additional code. Graylog has also been integrated into the daily workflows of the application development, system administration, and security teams.
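The source does not describe how Gameforge’s library talks to Graylog. As one plausible shape for such a helper, the Python sketch below builds a message in GELF (the Graylog Extended Log Format, version 1.1) and sends it as an uncompressed UDP datagram to the default GELF port 12201. The function names build_gelf and send_gelf, and the choice of GELF over UDP, are assumptions for illustration, not Gameforge’s actual code. Messages too large for a single datagram would need GELF chunking, which this sketch omits.

```python
import json
import socket
import time


def build_gelf(short_message: str, host: str, level: int = 6, **extra) -> dict:
    """Build a GELF 1.1 payload; level uses syslog severities (3 = error, 6 = info)."""
    payload = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,
    }
    for key, value in extra.items():
        if key == "id":
            # GELF does not allow an additional field named _id.
            raise ValueError("additional field 'id' is not allowed in GELF")
        # Additional fields must be prefixed with an underscore.
        payload[f"_{key}"] = value
    return payload


def send_gelf(payload: dict, server: str, port: int = 12201) -> None:
    """Send one payload as a single uncompressed UDP datagram."""
    data = json.dumps(payload).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(data, (server, port))
```

A developer would then log with a single call such as send_gelf(build_gelf("Payment failed", "game-01", level=3, game="ogame"), "graylog.example.com"), where the hostname is a placeholder.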

Impact for DevOps: “We now monitor our entire server environment with Graylog and have total visibility, enabling us to search for internal application problems across multiple backend servers. Also, our Developers and DevOps people rely on Graylog’s alerting functionality to react to critical errors in our games or applications.”

Impact for IT Operations: Gameforge’s system administrators use Graylog to monitor every server and detect problems even when they occur at a very small scale. “Graylog is a powerful tool for every administrator to have a single pane of glass to view exactly what is happening on their servers.”

Impact for Security: “Graylog gave us the ability to build almost infinite methods to detect security problems on our servers and in our network. We could search for vulnerabilities using threat intelligence from external sources, and we gained valuable insight into our environment with event correlation and intrusion detection.”

“Using Graylog, we are able to provide a better user experience for our gamers. Our reaction time is much faster than before, and our players experience fewer outages and other error-related problems,” said Felix Oechsler, Lead Windows System Administrator at Gameforge.

As icing on the cake, Gameforge employees using Graylog have shaved hours off their log management workflows, freeing them up to focus on other priorities like releasing awesome games for the more than 450 million Gameforge gamers.