Want to know all about parsing log files? Here's a useful guide for you.
Log file parsing is the process of analyzing log file data and breaking it down into logical syntactic components. In simple words: you're extracting meaningful data from logs that can run to thousands of lines.
There are multiple ways to perform log file parsing: you can write a custom parser or use parsing tools and/or software. Parsers can be written in many programming languages; some are better for this task than others, but the choice often depends on what language you are most comfortable with. In this article, we will talk about log file parsing in Graylog and give examples of parsers in several different languages, as well as compare Graylog with Splunk in terms of parsing.
How to Parse a Log File in Graylog
Since not all devices follow the same logging format, it is impossible to develop a universal parser. For devices that don't comply with syslog format rules, Graylog works around the problem using extractors. Log file parsing is done by a combination of raw/plaintext message inputs and extractors. The built-in raw/plaintext inputs allow you to parse any text that you can send via TCP or UDP. No parsing is applied at all by default until you build your own parser using custom extractors. Let's discuss what extractors are and why they were created in the first place.
Syslog (RFC3164, RFC5424) has been a standard logging protocol since the 1980s, but it comes with some shortcomings. Syslog has a clear set of rules in its RFCs that define what a log should look like. Unfortunately, there are a lot of devices such as routers and firewalls that create logs similar to syslog but non-compliant with its RFC rules. For example, some use localized time zone names or omit the current year from the timestamp, which causes wrong or failed parsing.
One possible solution was to have a custom message input and parser for every format that differs from syslog, which would mean thousands of parsers. Graylog decided to address this problem by introducing the concept of Extractors in the v0.20.0 series.
Extractors allow users to instruct Graylog nodes how to extract data from any text in the received message into message fields, regardless of format, even from an already extracted field. This allows for more elaborate queries, like searching for all blocked packets from a given source IP or all internal server errors triggered by a specific user.
You can create extractors via Graylog REST API calls or via the web interface using a wizard.
The Graylog Extended Log Format (GELF) is a log format created to address some of standard syslog's shortcomings.
Plain syslog Shortcomings:
- Limited to 1024 bytes
- No data types in structured syslog
- Too many syslog dialects to successfully parse all of them
- No compression
Improvements on these issues make GELF a great choice for logging from within applications. The Graylog Marketplace offers libraries and appenders for easily implementing GELF in many programming languages and logging frameworks. Because GELF can be sent via UDP, a logging failure can't block or break your application from within your logging class.
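As a sketch of what a GELF message looks like on the wire, the snippet below builds a minimal GELF 1.1 payload and sends it over UDP. The host name, field values, and custom `_user_id` field are assumptions for illustration; 12201 is the default port of a Graylog GELF UDP input.

```python
import json
import socket

# Minimal sketch of a GELF 1.1 payload (field values are assumptions).
message = {
    "version": "1.1",                              # required by the GELF spec
    "host": "web-01",                              # required: source of the message
    "short_message": "Login failed for user alice",# required: the log line itself
    "level": 4,                                    # optional: syslog severity (4 = warning)
    "_user_id": "alice",                           # custom fields are prefixed with "_"
}

payload = json.dumps(message).encode("utf-8")

# UDP is fire-and-forget: sending succeeds even if no server is listening.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(payload, ("127.0.0.1", 12201))  # replace with your Graylog host
sock.close()
```

In a real application you would use one of the Marketplace libraries rather than hand-rolling the payload; the point here is only that GELF is structured JSON with typed fields, unlike free-form syslog text.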
Graylog Sidecar acts as a supervisor process for other programs, such as nxlog and Filebeat, built specifically to collect log messages from local files and ship them to remote systems like Graylog. You can also use any program supporting the GELF or syslog protocol (among others) to send your logs to Graylog.
Streams are a core feature of Graylog and may be thought of as a form of tagging for incoming messages. Streams are a mechanism used to route messages into categories in real-time. Stream rules instruct Graylog which messages to route into which streams, and are also used to control access to data, route messages for parsing, enrichment or other modification, and determine which messages will be archived.
Processing Pipelines are a Graylog feature that enables the user to run a rule, or a series of rules, against a specific type of event. Pipelines are tied to streams and allow for routing, blacklisting, modifying, and enriching messages as they flow through Graylog. This feature lets users parse, change, convert, add to, delete from, or drop a message.
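As a sketch of what such a rule looks like, here is a minimal rule in Graylog's pipeline rule language; the rule name, the field name `severity`, and the match string "ERROR" are assumptions for illustration:

```
rule "tag error messages"
when
  // Only act on messages whose text contains "ERROR"
  has_field("message") && contains(to_string($message.message), "ERROR")
then
  // Enrich the message with a new field for later searching/routing
  set_field("severity", "error");
end
```

A rule has a `when` condition and a `then` action list; Graylog evaluates it against every message in the streams the pipeline is connected to.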
Writing Custom Log File Parsers
If you don’t want to use Graylog or any other tool, you can write your own custom parser using a number of languages. Here are some commands and methods used in Java, Linux, Python, and PowerShell:
Java Parse Log File
This is the method to use if you do your own parsing using Java:
The String.split method splits a string around matches of a given regular expression. The array returned by this method contains each substring of the input that is terminated by another substring matching the expression, or by the end of the string. The substrings in the array are in the order in which they occur in the input. If the expression does not match any part of the input, the resulting array has just one element: the input string itself.
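A minimal sketch of split-based parsing: the sample log line and its field layout (date, time, level, service, free-text message) are assumptions for illustration.

```java
public class LogSplitExample {
    public static void main(String[] args) {
        String line = "2024-05-01 12:34:56 ERROR auth-service Login failed for user alice";

        // Split on runs of whitespace into at most 5 parts, so the
        // free-text message at the end stays in one piece.
        String[] parts = line.split("\\s+", 5);

        System.out.println("level=" + parts[2]);    // level=ERROR
        System.out.println("message=" + parts[4]);  // message=Login failed for user alice
    }
}
```

The limit argument (5 here) is what keeps the trailing message intact; without it, split would also break the message text on every space.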
If you don’t want to spend time writing your own parser, there are many parsing tools available for Java. You can use a library to implement GELF in Java for all major logging frameworks: log4j, log4j2, java.util.logging, logback, JBossAS7, and WildFly 8-12.
Linux Parse Log File
You can perform command line log analysis in Linux, and these are some of the most useful commands:
Head and Tail
If you want to display a certain number of lines from the top or bottom of a log file, you can use head or tail to specify that number. If you don’t add a value, the default value is 10 lines.
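For example (the log file path is an assumption for illustration):

```shell
# First 5 lines of the file
head -n 5 /var/log/syslog

# Last 5 lines of the file
tail -n 5 /var/log/syslog

# Keep printing new lines as they are appended (useful for live logs)
tail -f /var/log/syslog
```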
The grep tool is used to search a log file for a particular pattern of characters. This tool, in combination with regular expressions, is the basis for more complex searches.
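For example (the file path and search terms are assumptions for illustration):

```shell
# Lines containing the literal string "ERROR"
grep "ERROR" /var/log/syslog

# Case-insensitive match, with line numbers shown
grep -in "timeout" /var/log/syslog

# Extended regular expression: lines containing an IPv4-looking address
grep -E '[0-9]{1,3}(\.[0-9]{1,3}){3}' /var/log/syslog
```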
Surround Search with Grep
One example of advanced search using grep is surround search. The -B flag determines the number of lines before the matching line, and the -A flag determines the number of lines after the matching line, that you want to show.
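For example (the file path and search term are assumptions for illustration):

```shell
# 3 lines of context before each match
grep -B 3 "ERROR" /var/log/syslog

# 2 lines of context after each match
grep -A 2 "ERROR" /var/log/syslog

# Context in both directions at once
grep -B 3 -A 2 "ERROR" /var/log/syslog
```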
Python Parse Log File
You can create a custom log file parser in Python by using regex. This is an example of parsing by line.
The open() call opens the defined log file path with read-only access ("r") and assigns the resulting file object to the file variable. The for loop goes through the file line by line, and if the text in a line matches the regex, the result gets assigned to the match variable as a match object.
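A minimal sketch of that approach; the sample log content and the regex (which captures the first word after "ERROR") are assumptions for illustration, and the sample file is written first so the example is self-contained:

```python
import re

# Write a small sample log so the example runs on its own.
with open("sample.log", "w") as f:
    f.write("INFO  service started\nERROR disk full\nINFO  heartbeat\n")

# Illustrative regex: capture the first word after "ERROR".
pattern = re.compile(r"ERROR\s+(\w+)")

file = open("sample.log", "r")   # read-only access ("r")
for line in file:
    match = pattern.search(line)
    if match:
        print(match.group(1))    # prints: disk
file.close()
```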
To parse more than one line at a time, you can assign the whole file’s data to a variable using data = f.read().
In this example, you can choose whether to parse by line (if read_line is True) or by file (if read_line is False). The matching algorithm is the same, with the only difference in the data that is compared with regex (line or whole text).
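A sketch of that variant; again the sample log content and regex are assumptions for illustration:

```python
import re

# Write a small sample log so the example runs on its own.
with open("sample.log", "w") as f:
    f.write("INFO  ok\nERROR disk full\nERROR net down\n")

pattern = re.compile(r"ERROR\s+\w+")
read_line = True   # True: parse line by line; False: parse the whole file at once

with open("sample.log", "r") as f:
    if read_line:
        for line in f:
            match = pattern.search(line)
            if match:
                print(match.group())
    else:
        data = f.read()                       # whole file as one string
        for match in pattern.finditer(data):
            print(match.group())
```

Either branch prints the same matches; only the data handed to the regex (one line vs. the whole text) differs.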
For more complex parsing, there are a plethora of parsing tools you can use for free. There is a great list of Python parsing tools on GitHub you can check out.
PowerShell Parse Log File
If you perform log file parsing with PowerShell, this is arguably the most useful command to write a custom parser:
To display only lines containing specific keywords, you can use the -Pattern parameter. If you're looking for more than one keyword, you can list them after -Pattern separated by a comma.
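Assuming the command in question is Select-String, a minimal sketch (the file name and keywords are assumptions for illustration):

```powershell
# Lines containing "error" (matching is case-insensitive by default)
Select-String -Path .\app.log -Pattern "error"

# More than one keyword: list patterns separated by a comma
Select-String -Path .\app.log -Pattern "error", "warning"
```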
Splunk Parse Log File
In Splunk, data that enters the data pipeline gets indexed through event processing. The event processing consists of two stages: parsing and indexing. During parsing, data chunks are broken down into events and handed off to the indexing pipeline for final processing.
Splunk vs. Graylog Parsing Log File
Siloed vs. Single Interface
Splunk has IT categories organized in a siloed fashion, meaning that there is a separate screen for each category. If you’re looking to solve a problem that needs information across several topics, the siloed approach makes the whole process slow and complicated.
Graylog uses a single interface to display IT information, reflecting not IT silos, but service performance by business goals or metrics. For more advanced dashboards, Graylog is easy to integrate with 3rd party offerings.
Query Language vs. Natural Language
In Splunk, searches are conducted via a formal query language. If you are not familiar with the language or if you fail to precisely formulate your query, the search results won't necessarily provide the answer you are looking for.
Graylog allows users to search for terms through standard GUI queries. Not only is this approach easier and more natural than using a formal query language, but it also eliminates the need for user training and helps save time and money.
Limited Threads vs. Multithreading
Each Splunk process is limited to an arbitrary number of threads that can be sent to each processor. This design creates a bottleneck which prevents the system from providing the fastest possible response to search queries.
Graylog is better optimized for hardware compared to Splunk - it supports extensive multithreading within a system as well as query distribution across systems. This helps speed up the query process and facilitate problem-solving.
Flexibility and Scalability
Splunk's approach to the scalability challenge is to encourage users to deploy additional software.
Graylog already includes both a message journal and data replication/recovery, which ensure that data is safely stored/duplicated in accordance with its business priority across as many systems as necessary.
As an organization focused on open-source software, Graylog puts the emphasis on features and performance instead of profit margins. Our pricing is more competitive than Splunk’s and we offer as much flexibility as you need. We will keep your environment up and running even if unpredictable workloads exceed contractual levels, and will work with you to make proper adjustments to continue with the service.
Graylog Advantages Over Splunk:
- Single interface for faster problem solving and root cause analysis
- Allows natural language queries instead of a formal query language
- Optimized for hardware to support multithreading and cross-system query distribution
- Competitive pricing and flexible customer approach