通过 Python 脚本快速统计分析日志

从朋友那里拿到一个需求,根据日志分析统计并发情况,统计自定义时间段的用户流量,具体需求如下:

  1. 晒选某个时间点的数据:查找某天下午1:30到2:30这个时间段或者自己定义,看看到底有多少
  2. 同时筛选出两个项目标号的文档,统计和去重统计17/18
  3. babyhealth:[2016-05-03 19:21:23] INFO orderinfo:472 - v41/nbCode/getauth||1508031:17

上面第三行的内容就是日志的格式,简单分析了一下需求,shell 脚本、Excel 分列筛选透视、Python 都 可以解决,考虑到日志数量较大,通过 Excel 效率可能偏低,因此这里采用 Python 进行处理。

处理的主要思路是:

  • 对日志文件内容进行分解。每条日志的关键信息有日期(2016-05-03)、时间(19:21:23)、值(1508031)、块(17),应用正则对日志进行逐行的匹配,提取关键信息,放到数据库中
  • 在数据库中通过SQL语句进行统计

对于第一部分,代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import re
import sqlite3
#说明
print u''
print u'-----------------使用说明-----------------'
print u'将脚本放入日志所在目录,运行脚本,输入日志\n文件名(含扩展名),将在相同目录生成log.db\n数据库'
print u'------------------------------------------'
print u''
#读取日志文件、连接数据库
name = raw_input('Please input the log name:')
log_db = sqlite3.connect('./log.db')
cursor = log_db.cursor()
#建表
cursor.execute('DROP TABLE IF EXISTS log')
cursor.execute('CREATE TABLE log (front, log_date, log_time, log_path, value, level)')
#设定正则规则
data_re = re.compile(r'正则规则不放了,会和代码编辑器冲突',re.S|re.M)
#读取日志文件
log = open('./%s' %name)
index = 0
#进行转换
try:
for line in log:
found = data_re.findall(line)
if found != None and len(found) != 0:
index = index + 1
cursor.execute('INSERT INTO log (front, log_date, log_time, log_path, value, level) VALUES (?,?,?,?,?,?)', (found[0][0], found[0][1], found[0][2], found[0][3], found[0][4], found[0][5]))
if index == 100:
log_db.commit()
#print '100 records has been submitted.'
log_db.commit()
print 'job done!'
input()
except:
pass

以上程序在 Linux 和 Windows 下都可以执行,生成 log.db 数据库, Linux 下可以直接使用 Sqlite3 对数据库进行操作, Windows 下可以安装 SQLite Expert 进行操作和查询,查询语句如下:

1
2
3
4
5
6
SELECT COUNT(*) FROM log WHERE log_date = '2016-05-03' and log_time between '13:30:00' and '14:30:00';
SELECT COUNT(*) FROM log WHERE log_date = '2016-05-03' and log_time between '13:30:00' and '14:30:00' and level = '17';
SELECT COUNT(*) FROM log WHERE log_date = '2016-05-03' and log_time between '13:30:00' and '14:30:00' and level = '18';
SELECT COUNT(DISTINCT value)FROM log WHERE log_date = '2016-05-03' and log_time between '13:30:00' and '14:30:00';
SELECT COUNT(DISTINCT value) FROM log WHERE log_date = '2016-05-03' and log_time between '13:30:00' and '14:30:00' and level = '17';
SELECT COUNT(DISTINCT value) FROM log WHERE log_date = '2016-05-03' and log_time between '13:30:00' and '14:30:00' and level = '18';

为了提高方便程度,可以用 Python 操作 SQLite3 进行查询:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
#! /usr/bin/env python
# -*- coding: utf-8 -*-
import sqlite3

date = raw_input('Please input the date(eg.2016-05-03):')
start_time = raw_input('Please input the start time(eg.08:30:00):')
end_time = raw_input('Please input the end time(eg.09:30:00):')
stat_db = sqlite3.connect('./log.db')
cursor = stat_db.cursor()
cursor.execute('SELECT COUNT(*) FROM log WHERE log_date = \'%s\' and log_time between \'%s\' and \'%s\'' %(date,start_time,end_time))
amount_all = cursor.fetchall()[0][0]
cursor.execute('SELECT COUNT(DISTINCT value) FROM log WHERE log_date = \'%s\' and log_time between \'%s\' and \'%s\'' %(date,start_time,end_time))
amount_na = cursor.fetchall()[0][0]
cursor.execute('SELECT COUNT(*) FROM log WHERE log_date = \'%s\' and log_time between \'%s\' and \'%s\' and level = \'17\'' %(date,start_time,end_time))
month_all = cursor.fetchall()[0][0]
cursor.execute('SELECT COUNT(DISTINCT value) FROM log WHERE log_date = \'%s\' and log_time between \'%s\' and \'%s\' and level = \'17\'' %(date,start_time,end_time))
month_na = cursor.fetchall()[0][0]
cursor.execute('SELECT COUNT(*) FROM log WHERE log_date = \'%s\' and log_time between \'%s\' and \'%s\' and level = \'18\'' %(date,start_time,end_time))
year_all = cursor.fetchall()[0][0]
cursor.execute('SELECT COUNT(DISTINCT value) FROM log WHERE log_date = \'%s\' and log_time between \'%s\' and \'%s\' and level = \'18\'' %(date,start_time,end_time))
year_na = cursor.fetchall()[0][0]
print(u'')
print(u'------------统计------------')
print(u'%s %s~%s\n----------------------------\n总点击次数/人:%s|%s\n年用户:%s|%s\n月用户:%s|%s' % (date, start_time, end_time, amount_all, amount_na, year_all, year_na, month_all, month_na))
print(u'')
input('Press any key to exit...')

以上程序均基于 Python 2.7+ 环境。


通过 Python 脚本快速统计分析日志
https://blog.yuhaogao.com/2016/05/04/通过-Python-脚本快速统计分析日志/
作者
宇皓
发布于
2016年5月5日
许可协议