Analyzing nginx access logs with Python

desert3

Blog categories: Python, Server.Nginx

Tags: Access, Python, nginx, OS, regular expressions

After the project went live, we needed to analyze the contents of nginx's access log, so I wrote the following script:

#! /usr/bin/env python
# -*- coding: utf-8 -*-
#@author zcwang3@gmail.com
#@version 2011-04-12 16:34
# Nginx log analysis, initial version

import os
import fileinput
import re

# Location of the logs
dir_log  = r"D:\python cmd\nginxlog"

# Uses nginx's default log format: $remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" "$http_x_forwarded_for"
# Regular expressions for parsing the log

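# 203.208.60.230
ipP = r"?P<ip>[\d.]*"

# [21/Jan/2011:15:04:41 +0800]
timeP = r"""?P<time>\[           # opening [
            [^\[\]]* # any characters except [ and ], so the match cannot run on into a later [] field (a non-greedy *? would also work); outside a character class, . matches any character except a newline, and * is greedy: the engine repeats it as many times as it can
            \]           # closing ]
        """

# "GET /EntpShop.do?method=view&shop_id=391796 HTTP/1.1"
requestP = r"""?P<request>\"          # opening "
            [^\"]* # any characters except a double quote, so the match cannot run on into the next quoted field (a non-greedy *? would also work)
            \"          # closing "
            """

statusP = r"?P<status>\d+"

bodyBytesSentP = r"?P<bodyBytesSent>\d+"

# "http://test.myweb.com/myAction.do?method=view&mod_id=&id=1346"
referP = r"""?P<refer>\"          # opening "
            [^\"]* # any characters except a double quote, so the match cannot run on into the next quoted field (a non-greedy *? would also work)
            \"          # closing "
        """

# "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
userAgentP = r"""?P<userAgent>\"              # opening "
        [^\"]* # any characters except a double quote, so the match cannot run on into the next quoted field (a non-greedy *? would also work)
        \"              # closing "
            """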
# How it works: fields are told apart mainly by the spaces and "-" between them; each field has its own sub-pattern
nginxLogPattern = re.compile(r"(%s)\ -\ -\ (%s)\ (%s)\ (%s)\ (%s)\ (%s)\ (%s)" %(ipP, timeP, requestP, statusP, bodyBytesSentP, referP, userAgentP), re.VERBOSE)

def processDir(dir_proc):
    for file in os.listdir(dir_proc):
        if os.path.isdir(os.path.join(dir_proc, file)):
            print("WARN:%s is a directory" % (file))
            processDir(os.path.join(dir_proc, file))
            continue

        if not file.endswith(".log"):
            print("WARN:%s is not a log file" % (file))
            continue

        print("INFO:process file %s" % (file))
        for line in fileinput.input(os.path.join(dir_proc, file)):
            matchs = nginxLogPattern.match(line)
            if matchs is not None:
                allGroups = matchs.groups()
                ip = allGroups[0]
                time = allGroups[1]
                request = allGroups[2]
                status =  allGroups[3]
                bodyBytesSent = allGroups[4]
                refer = allGroups[5]
#                userAgent = allGroups[6]
                userAgent = matchs.group("userAgent")
                print(userAgent)

                # Count occurrences of each HTTP status code
                # (pass ip instead of status to count requests per IP)
                GetResponseStatusCount(status)
                # Add any other analysis you need here
            else:
                # Fail loudly, showing the line that did not fit the expected format
                raise Exception("line does not match the expected log format: %r" % line)

        fileinput.close()

allStatusDict = {}
# Count occurrences of each HTTP status code
def GetResponseStatusCount(status):
    if status in allStatusDict:
        allStatusDict[status] += 1
    else:
        allStatusDict[status] = 1


if __name__ == "__main__":
    processDir(dir_log)
    print(allStatusDict)
    # Sort by count, descending
    print(sorted(allStatusDict.items(), key=lambda d: d[1], reverse=True))
    print("done, python is great!")
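The pattern can be checked against a single line before running it over a whole log directory. The following is a minimal sketch for Python 3, not the script's own code: `LOG_PATTERN` is a compact single-line rewrite of the verbose pattern, `parse_line` and `count_statuses` are hypothetical helper names, and the sample line is assembled from the fragments quoted in the script's comments, with an invented status (200) and byte count (1024).

```python
import re
from collections import Counter

# Compact rewrite of the verbose pattern (illustrative, not the original).
LOG_PATTERN = re.compile(
    r'(?P<ip>[\d.]+) - - (?P<time>\[[^\[\]]*\]) (?P<request>"[^"]*") '
    r'(?P<status>\d+) (?P<bodyBytesSent>\d+) (?P<refer>"[^"]*") '
    r'(?P<userAgent>"[^"]*")'
)

# Sample line built from the fragments in the script's comments;
# the status (200) and byte count (1024) are invented for this example.
SAMPLE_LINE = (
    '203.208.60.230 - - [21/Jan/2011:15:04:41 +0800] '
    '"GET /EntpShop.do?method=view&shop_id=391796 HTTP/1.1" 200 1024 '
    '"http://test.myweb.com/myAction.do?method=view&mod_id=&id=1346" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def count_statuses(lines):
    """Count status codes, silently skipping lines that do not match."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields is not None:
            counts[fields["status"]] += 1
    return counts


fields = parse_line(SAMPLE_LINE)
print(fields["ip"], fields["status"])
print(count_statuses([SAMPLE_LINE, "not a log line"]))
```

Skipping unmatched lines instead of raising is a design choice made for this sketch; the script above stops at the first line that does not fit the format. Note that the literal ` - - ` assumes `$remote_user` is always logged as `-`.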

The resulting counts per HTTP status code:

{'200': 287559, '302': 6743, '304': 4074, '404': 152918, '499': 887, '400': 14, '504': 93, '502': 300, '503': 5, '500': 88353}
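These counts are easier to read when grouped by status class (2xx, 3xx, 4xx, 5xx) and shown as a share of all requests. A small sketch for Python 3, using the dictionary printed above; the helper name `summarize_by_class` is hypothetical.

```python
from collections import Counter

# Status counts copied from the output above.
status_counts = {'200': 287559, '302': 6743, '304': 4074, '404': 152918,
                 '499': 887, '400': 14, '504': 93, '502': 300, '503': 5,
                 '500': 88353}


def summarize_by_class(counts):
    """Aggregate per-status counts into classes such as '2xx' and '5xx'."""
    by_class = Counter()
    for status, n in counts.items():
        by_class[status[0] + "xx"] += n
    return by_class


by_class = summarize_by_class(status_counts)
total = sum(status_counts.values())
for status_class, n in sorted(by_class.items()):
    print("%s: %d (%.1f%%)" % (status_class, n, 100.0 * n / total))
```

With these counts, 4xx responses come to roughly 28% and 5xx responses to roughly 16% of all requests.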

Number of requests per IP (top 10 IPs):

[('220.178.14.98', 323230), ('220.181.94.225', 120870), ('203.208.60.230', 14342), ('61.135.249.220', 6479), ('203.208.60.88', 5426), ('61.135.249.216', 4867), ('123.125.71.94', 1290), ('123.125.71.104', 1282), ('123.125.71.108', 1280), ('123.125.71.110', 1278), (remaining entries omitted)]
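A ranking like this can be produced by counting the extracted `ip` field instead of `status`. One possible way to do that, sketched for Python 3 with `collections.Counter`; the helper name `top_ips` is hypothetical, and the toy input reuses addresses from the listing above with invented repeat counts.

```python
from collections import Counter


def top_ips(ips, n=10):
    """Return the n most frequent IPs as (ip, count) pairs, most frequent first."""
    return Counter(ips).most_common(n)


# Toy input: addresses from the listing above, repeat counts invented.
ips = (["220.178.14.98"] * 5 + ["220.181.94.225"] * 3 +
       ["203.208.60.230"] * 2 + ["61.135.249.220"])
print(top_ips(ips, n=3))
```

`Counter.most_common` does the same job as sorting `items()` by count in descending order and slicing off the first `n` entries.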

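Once the IP has been extracted from each raw log line, further analysis becomes possible, such as finding the 10 IPs with the most requests. When the data volume is large, hash each IP, take the hash modulo a bucket count to split the data, and then count each bucket separately.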
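The hash-and-modulo idea can be sketched as follows (Python 3). This is one interpretation of the remark above, not the author's code: `bucket_of` and `top_ips_partitioned` are hypothetical names, `zlib.crc32` is an assumed choice of hash (picked because it is stable across runs, unlike Python 3's built-in `hash()` for strings), and temporary files stand in for whatever storage holds the buckets.

```python
import heapq
import os
import tempfile
import zlib
from collections import Counter


def bucket_of(ip, num_buckets):
    """Map an IP to a bucket index using a hash that is stable across runs."""
    return zlib.crc32(ip.encode("ascii")) % num_buckets


def top_ips_partitioned(ips, n=10, num_buckets=4):
    """Top-n IPs by count, counting one bucket at a time to bound memory use."""
    tmp_dir = tempfile.mkdtemp()
    paths = [os.path.join(tmp_dir, "bucket_%d.txt" % i)
             for i in range(num_buckets)]

    # Pass 1: scatter IPs into bucket files; a given IP always lands in the same file.
    files = [open(p, "w") for p in paths]
    try:
        for ip in ips:
            files[bucket_of(ip, num_buckets)].write(ip + "\n")
    finally:
        for f in files:
            f.close()

    # Pass 2: count each bucket on its own and keep only its local top n.
    candidates = []
    for path in paths:
        with open(path) as f:
            counts = Counter(line.strip() for line in f)
        candidates.extend(counts.most_common(n))
        os.remove(path)
    os.rmdir(tmp_dir)

    # Every IP lives in exactly one bucket, so its local count is its full count
    # and the global top n is contained in the union of the local top-n lists.
    return heapq.nlargest(n, candidates, key=lambda pair: pair[1])


# Toy input: addresses from the listing above, repeat counts invented.
ips = (["220.178.14.98"] * 5 + ["220.181.94.225"] * 3 +
       ["203.208.60.230"] * 2 + ["61.135.249.220"])
print(top_ips_partitioned(ips, n=2))
```

Only one bucket's counter is held in memory at a time, so `num_buckets` can be raised until each bucket fits comfortably.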

2011-04-13 13:52

#1 dacoolbaby 2016-10-31

A great regular expression, very useful.
Many thanks.
