PIWIK Log Analysis

Hi,

We have started testing the Piwik log analysis feature launched in version 1.8.2 of Piwik. We tried some log analysis and, although we managed to get data, we face the following problems:

  1. Log analysis did not work with the W3SVC format of IIS logs; we had to convert them to the .log.ncsa format. Isn't log analysis supported with the W3SVC format?
  2. Can we do log processing with load-balanced servers?
  3. Is reprocessing possible from a certain date if logs are available?
  4. Can we create custom reports like "visitor vs browser" or "visitor vs country"?
  5. Can we grant or remove access for new users? I mean, is there user management available?
  6. Can the available reports be auto-scheduled?

Please let us know the answers to the above queries. Also, is there any help documentation on the log analysis feature of Piwik?

Interesting questions

  1. Please post the report and a sample log in: Log analytics list of improvements · Issue #3163 · matomo-org/matomo · GitHub
  2. Sure, see the FAQ: New to Piwik - Analytics Platform - Matomo
  3. It's possible to import older logs, but not yet to reprocess certain dates. This has been suggested in: Log analytics list of improvements · Issue #3163 · matomo-org/matomo · GitHub
  4. See the list of reports: List of Features in Piwik Analytics - Analytics Platform - Matomo
  5. Yes, see: Manage Users - Analytics Platform - Matomo
  6. Yes, see: Manage Email Reports - Analytics Platform - Matomo

Hi Matt,

Thanks for the response. One more query. We are using WebTrends for web analytics and are trying to move to Piwik now. There is a feature in WebTrends wherein we can integrate a lookup table and then create multidimensional custom reports based on the fields in the lookup table. For example, the lookup table has details of all employees (their id, mail, career level, country, city, etc.) and we can create reports like username vs country or username vs career level. Is this kind of lookup table integration possible in Piwik? Is there any way of creating custom reports in Piwik? If not, is there any feature that is somewhat close to what I mentioned above?

I would recommend using Custom Variables: Custom Variables Analytics - Analytics Platform - Matomo
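As a rough illustration of how a lookup table could feed Custom Variables, here is a minimal sketch that builds a tracking request URL carrying visit-scope custom variables in the `_cvar` parameter of the tracking HTTP API. The `EMPLOYEES` table, the `build_tracking_url` helper, the `http://piwik` base URL and the choice of slots 1 and 2 are all hypothetical and only for illustration; this is not how Piwik itself integrates lookup tables.

```python
import json
from urllib.parse import urlencode

# Hypothetical lookup table keyed by username (illustrative data only).
EMPLOYEES = {
    "jdoe": {"career_level": "senior", "country": "IN"},
}


def build_tracking_url(base_url, idsite, page_url, employee):
    """Build a tracking request URL with two visit-scope custom variables."""
    # _cvar is a JSON object: slot index -> [name, value].
    cvars = {
        "1": ["career_level", employee["career_level"]],
        "2": ["country", employee["country"]],
    }
    params = {
        "idsite": idsite,   # site ID in Piwik
        "rec": 1,           # required for the request to be recorded
        "url": page_url,    # URL of the tracked page
        "_cvar": json.dumps(cvars),
    }
    return base_url.rstrip("/") + "/piwik.php?" + urlencode(params)


url = build_tracking_url(
    "http://piwik", 2, "http://example.org/Contracts/", EMPLOYEES["jdoe"]
)
print(url)
```

Reports such as "career level vs country" would then come from the Custom Variables report and segments rather than from a joined lookup table.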

If you need new features or custom development, that is possible by hiring Pro Services consultants: http://piwik.org/consulting/

I’ve attempted to search out the --log-format-regex= options and have had little success. I’ve read the forums, readme, etc…

I’m looking to use the import_logs.py on some tomcat6 log files with the following option:

which results in logs that look like:

10.88.168.235 - [01/May/2013:18:38:09 +0000] 304 - 0 GET /Contracts/js-v4466/app.js Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:19.0) Gecko/20100101 Firefox/19.0
10.88.168.235 - [01/May/2013:18:38:11 +0000] 404 1084 0 GET /Contracts/ext-4.2.1.736/locale/ext-lang-.js Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:19.0) Gecko/20100101 Firefox/19.0
10.88.168.235 - [01/May/2013:18:38:11 +0000] 304 - 1 GET /Contracts/ext-4.2.1.736/resources/themes/images/azzurra/form/exclamation.gif Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:19.0) Gecko/20100101 Firefox/19.0

What should the value of --log-format-regex={options} be

so that I stop getting:

Logs import summary

0 requests imported successfully
0 requests were downloads
1450 requests ignored:
    1450 invalid log lines
    0 requests done by bots, search engines, ...
    0 HTTP errors
    0 HTTP redirects
    0 requests to static resources (css, js, ...)
    0 requests did not match any known site
    0 requests did not match any requested hostname

Using the following regex, I was able to get my logs imported from Tomcat.

Log format in my server.xml file:
pattern='%h %S %t %s %b %D %m %U "%{User-Agent}i"'

python import_logs.py --url=http://piwik --dry-run --add-sites-new-hosts --log-format-regex=’(?P\S+) \S+ [(?P.?) (?P.?)] (?P\S+) (?P\S+) \S+ (?P.?) "(?P.?) (?P<user_agent>.*?)"’ --idsite=2 /tmp/localhost_access_log.2013-05-07.log -d

Do you use default tomcat logs? If so, it would be great to add this log format to the list of default log formats…

Matt, this is not a "standard" format, but it is composed of the standard variables plus a few others. It would be extremely useful to document which valve pattern codes pair with which parts of a pattern-specific --log-format-regex.

Values for the pattern attribute are made up of literal text strings, combined with pattern identifiers prefixed by the “%” character to cause replacement by the corresponding variable value from the current request and response. The following pattern codes are supported:

%a - Remote IP address
%A - Local IP address
%b - Bytes sent, excluding HTTP headers, or ‘-’ if zero
%B - Bytes sent, excluding HTTP headers
%h - Remote host name (or IP address if resolveHosts is false)
%H - Request protocol
%l - Remote logical username from identd (always returns ‘-’)
%m - Request method (GET, POST, etc.)
%p - Local port on which this request was received
%q - Query string (prepended with a ‘?’ if it exists)
%r - First line of the request (method and request URI)
%s - HTTP status code of the response
%S - User session ID
%t - Date and time, in Common Log Format
%u - Remote user that was authenticated (if any), else ‘-’
%U - Requested URL path
%v - Local server name
%D - Time taken to process the request, in millis
%T - Time taken to process the request, in seconds
%I - Current request thread name (can compare later with stacktraces)

There is also support to write information from the cookie, incoming header, the Session or something else in the ServletRequest. It is modeled after the Apache HTTP Server log configuration syntax:

%{xxx}i for incoming headers
%{xxx}o for outgoing response headers
%{xxx}c for a specific cookie
%{xxx}r xxx is an attribute in the ServletRequest
%{xxx}s xxx is an attribute in the HttpSession

The shorthand pattern name common (which is also the default) corresponds to '%h %l %u %t "%r" %s %b'.

The shorthand pattern name combined appends the values of the Referer and User-Agent headers, each in double quotes, to the common pattern described in the previous paragraph.

If this could be identified/paired with the piwik regex example, I’m sure the log analysis function for tomcat would become extremely flexible and easy to use.
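To make the pairing idea concrete, here is a minimal sketch of how valve pattern codes could be translated into a regex for --log-format-regex. Everything in it is an assumption for illustration: `VALVE_TO_REGEX` and `valve_pattern_to_regex` are hypothetical names, not part of import_logs.py, and the group names (`ip`, `date`, `timezone`, `status`, `length`, `path`, `referrer`, `user_agent`) should be checked against the formats defined in your copy of import_logs.py before relying on them.

```python
import re

# Hypothetical mapping from Tomcat AccessLogValve pattern codes to regex
# fragments. Codes that Piwik has no obvious use for are matched but not named.
VALVE_TO_REGEX = {
    "%a": r"(?P<ip>\S+)",
    "%h": r"(?P<ip>\S+)",
    "%l": r"\S+",
    "%u": r"\S+",
    "%S": r"\S+",
    "%t": r"\[(?P<date>.*?) (?P<timezone>.*?)\]",
    "%r": r"\S+ (?P<path>.*?) \S+",
    "%m": r"\S+",
    "%U": r"(?P<path>\S+)",
    "%s": r"(?P<status>\S+)",
    "%b": r"(?P<length>\S+)",
    "%B": r"(?P<length>\S+)",
    "%D": r"\S+",
    "%T": r"\S+",
    "%{Referer}i": r"(?P<referrer>.*?)",
    "%{User-Agent}i": r"(?P<user_agent>.*?)",
}

# Matches either a %{xxx}y header-style code or a single-letter %x code.
_TOKEN = re.compile(r"%\{[^}]+\}[a-z]|%[a-zA-Z]")


def valve_pattern_to_regex(pattern):
    """Translate a valve pattern string into a regex string."""
    parts = []
    pos = 0
    for token_match in _TOKEN.finditer(pattern):
        # Literal text between codes (spaces, quotes) is escaped as-is.
        parts.append(re.escape(pattern[pos:token_match.start()]))
        token = token_match.group()
        if token not in VALVE_TO_REGEX:
            raise ValueError("no regex fragment known for %s" % token)
        parts.append(VALVE_TO_REGEX[token])
        pos = token_match.end()
    parts.append(re.escape(pattern[pos:]))
    return "".join(parts)


regex = valve_pattern_to_regex('%h %S %t %s %b %D %m %U "%{User-Agent}i"')
print(regex)
```

A pattern that uses two codes mapped to the same group name (for example %r and %U, which both capture `path`) would fail to compile, so a real mapping would need to handle that case.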

Thanks!

If you need professional help getting this to work please contact: http://piwik.org/consulting/#contact-consultant

The previous post worked, I was merely suggesting more documentation around how to implement the regex format.

Thanks, Bageera

No problem, and what jvinci is suggesting would be helpful.

I was just guessing to see if the regex would pick up those variables based on another post or forum I was reading. I didn’t see anything specifically from piwik stating that this is what you can use for parsing logs.

OK for more docs. If you can suggest such docs, I'd be very happy to add them to the website! :)

http://tomcat.apache.org/tomcat-6.0-doc/config/valve.html

This is the location of the official Tomcat valve reference. If you were able to pair that with specific regex examples, that would open many doors.

Is this similar to what was reported in #3163 (new feature) for the 2.0 version?

Still failed for me… any updates for that?

I hope this is helpful! (This is still a work in progress and I encourage feedback or help!)

This live regex tester was the most useful tool for working out the custom log format option:

http://ksamuel.pythonanywhere.com/

If you know the valve variables from server.xml (Tomcat), like:

common - %h %l %u %t "%r" %s %b
combined - %h %l %u %t "%r" %s %b "%{Referer}i" "%{User-Agent}i"

In my case I used:
pattern='%h %S %t %s %b %D %m %U "%{User-Agent}i"'

I identified what was currently in the code by pulling this from import_logs.py (so I had a clue about what I was attempting to do):

_HOST_PREFIX = '(?P<host>[\w\-\.]*)(?::\d+)? '
_COMMON_LOG_FORMAT = (
    '(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] '
    '"\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+)'
)
_NCSA_EXTENDED_LOG_FORMAT = (_COMMON_LOG_FORMAT +
    ' "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'
)
_S3_LOG_FORMAT = (
    '\S+ (?P<host>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] (?P<ip>\S+) '
    '\S+ \S+ \S+ \S+ "\S+ (?P<path>.*?) \S+" (?P<status>\S+) \S+ (?P<length>\S+) '
    '\S+ \S+ \S+ "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'
)
_ICECAST2_LOG_FORMAT = (_NCSA_EXTENDED_LOG_FORMAT +
    ' (?P<session_time>\S+)'
)

FORMATS = {
    'common': RegexFormat('common', _COMMON_LOG_FORMAT),
    'common_vhost': RegexFormat('common_vhost', _HOST_PREFIX + _COMMON_LOG_FORMAT),
    'ncsa_extended': RegexFormat('ncsa_extended', _NCSA_EXTENDED_LOG_FORMAT),
    'common_complete': RegexFormat('common_complete', _HOST_PREFIX + _NCSA_EXTENDED_LOG_FORMAT),
    'iis': IisFormat(),
    's3': RegexFormat('s3', _S3_LOG_FORMAT),
    'icecast2': RegexFormat('icecast2', _ICECAST2_LOG_FORMAT),
}

Then pieced this together:

(?P<host>[\w\-\.]*)(?::\d+)? \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] (?P<status>\S+)? \S+ (?P<length>\S+) (?P<request>\S+) (?P<path>.*?) "(?P<user_agent>.*?)"

and looking at one log line:

Raw:
10.88.168.198 - [15/May/2013:19:55:38 +0000] 302 - 64 GET / "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31"

match.group():
u'10.88.168.198 - [15/May/2013:19:55:38 +0000] 302 - 64 GET / "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31"'

match.groupdict():
{u'date': u'15/May/2013:19:55:38', u'host': u'10.88.168.198', u'length': u'64', u'path': u'/', u'request': u'GET', u'status': u'302', u'timezone': u'+0000', u'user_agent': u'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31'}

and adding it back as --log-format-regex='(?P<host>[\w\-\.]*)(?::\d+)? \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] (?P<status>\S+)? \S+ (?P<length>\S+) (?P<request>\S+) (?P<path>.*?) "(?P<user_agent>.*?)"'

BOOM … logs imported, although I'm having an issue with the actual browser type. I'll post the final version once I have it.
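For anyone who wants to repeat this check offline before running the importer, here is a minimal sketch using Python's `re` module. The regex is the one pieced together above, with the group names restored from the match.groupdict() output in this post; `check_lines` is a hypothetical helper, not something import_logs.py provides.

```python
import re

# Regex pieced together above; group names follow the groupdict() output
# shown in this post.
LOG_FORMAT_REGEX = (
    r'(?P<host>[\w\-\.]*)(?::\d+)? \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] '
    r'(?P<status>\S+)? \S+ (?P<length>\S+) (?P<request>\S+) (?P<path>.*?) '
    r'"(?P<user_agent>.*?)"'
)

SAMPLE_LINES = [
    # Quoted user agent: produced by the pattern with "%{User-Agent}i".
    '10.88.168.198 - [15/May/2013:19:55:38 +0000] 302 - 64 GET / '
    '"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.31 '
    '(KHTML, like Gecko) Chrome/26.0.1410.64 Safari/537.31"',
    # Unquoted user agent: the earlier log format from this thread.
    '10.88.168.235 - [01/May/2013:18:38:09 +0000] 304 - 0 GET '
    '/Contracts/js-v4466/app.js Mozilla/5.0 (X11; Ubuntu; Linux x86_64; '
    'rv:19.0) Gecko/20100101 Firefox/19.0',
]


def check_lines(regex, lines):
    """Split lines into parsed group dicts and lines the regex rejects."""
    compiled = re.compile(regex)
    matched, invalid = [], []
    for line in lines:
        match = compiled.match(line.rstrip("\n"))
        if match:
            matched.append(match.groupdict())
        else:
            invalid.append(line)
    return matched, invalid


matched, invalid = check_lines(LOG_FORMAT_REGEX, SAMPLE_LINES)
print("%d matched, %d invalid" % (len(matched), len(invalid)))
```

The second sample line is rejected because its user agent is not wrapped in double quotes, which is the kind of mismatch that shows up as "invalid log lines" in the import summary.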

Matt,

These aren't standard Tomcat logs out of the box; however, if you were using AWStats, this would be the standard format when switching over from AWStats to Piwik.

See the list of Log Analytics feature requests: http://dev.piwik.org/trac/query?status=!closed&component=Log+Analytics+(import_logs.py)

Tracking Bytes is now covered in this ticket: Log Analytics: Monitor Bandwidth for each page, download, and measure overall traffic in bytes · Issue #5248 · matomo-org/matomo · GitHub