Import_logs.py fails when trying to parse a big apache access log (bigger than 900 MB)

Hi all,

I am trying to parse big apache logs but import_logs.py fails with this error:

2017-01-23 18:30:39,245: [INFO] Max number of attempts reached, server is unreachable!_
Fatal error: HTTP Error 500 Internal Server Error, response: {“status”:“error”,“tracked”:0,“invalid”:0,“invalid_indices”:[]}
_You can restart the import of “/awdata/piwik/cosmote/geratgweb04/www.cosmote.gr_with_ssl-access.log-20170123” from the point it failed by specifying --skip=78454 on the command line.

I use a VM with 8 cpus and 8 gb mem.
My OS version is RHEL 7.3, my apache version is Apache/2.4.6 (Red Hat Enterprise Linux), php is PHP 7.0.14 and mariadb version is 5.5.52.
Piwik version 3.0.1

The command I am using is:

/var/www/html/zak/piwik/misc/log-analytics/import_logs.py --url=http://middlinf.ote.gr:81/piwik/ --idsite=6 --recorders=8 --enable-http-errors --enable-http-redirects --enable-static /awdata/piwik/cosmote/geratgweb04/www.cosmote.gr_with_ssl-access.log-20170123 --retry-max-attempts=20

Apache error log is showing:

[Mon Jan 23 17:55:42.416455 2017] [:error] [pid 19040:tid 140544500012800] [client 172.18.20.26:44952] Error in Piwik (tracker): Error query: SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction In query: UPDATE piwik_log_visit SET idvisitor = ?, user_id = ?, visit_last_action_time = ?, visit_exit_idaction_url = ?, visit_total_actions = visit_total_actions + 1 , visit_total_interactions = visit_total_interactions + 1 , visit_total_time = ? WHERE idsite = ? AND idvisit = ? Parameters: array ( 0 => ’ \xa6\x96\xbc\xef\xb9\xde\xf5’, 1 => ‘""’, 2 => ‘2017-01-22 04:05:05’, 3 => 278568, 4 => 0, 5 => 6, 6 => 10133, )

and mariadb log is showing:

# Time: 170123 18:00:46
# User@Host: root[root] @ localhost [127.0.0.1]
# Thread_id: 1691  Schema: piwik_db  QC_hit: No
# Query_time: 3.858223  Lock_time: 0.000060  Rows_sent: 1  Rows_examined: 1
SET timestamp=1485187246;
SELECT visit_last_action_time, visit_first_action_time, idvisitor, idvisit, user_id, visit_exit_idaction_url, visit_exit_idaction_name, visitor_returning, visitor_days_since_first, visitor_days_since_order, visitor_count_visits, visit_goal_buyer, location_country, location_region, location_city, location_latitude, location_longitude, referer_name, referer_keyword, referer_type, idsite, visit_entry_idaction_url, visit_total_actions, visit_total_interactions, visit_total_searches, config_device_brand, config_device_model, config_device_type, visit_total_events, visit_total_time, location_ip, location_browser_lang, custom_var_k1, custom_var_v1, custom_var_k2, custom_var_v2, custom_var_k3, custom_var_v3, custom_var_k4, custom_var_v4, custom_var_k5, custom_var_v5   FROM piwik_log_visit WHERE visit_last_action_time >= '2017-01-22 05:14:21' AND visit_last_action_time <= '2017-01-22 06:14:21' AND idsite = '6' AND idvisitor = ' 

                                                                           ORDER BY visit_last_action_time DESC

I did some research at the forums but did not find anything intersting.
Do you have any suggestions?

Thank you in advance.

-Thanassis

Actually the 2 main errors I am constanlty getting at apache error log are:

[Tue Jan 24 13:44:33.855134 2017] [:error] [pid 17456:tid 140544525190912] [client 172.18.20.26:16610] Error in Piwik (tracker): Error query: SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction In query: UPDATE piwik_log_visit SET idvisitor = ?, user_id = ?, visit_last_action_time = ?, visit_exit_idaction_url = ?, visit_total_actions = visit_total_actions + 1 , visit_total_interactions = visit_total_interactions + 1 , visit_total_time = ? WHERE idsite = ? AND idvisit = ? Parameters: array ( 0 => ’ \xa6\x96\xbc\xef\xb9\xde\xf5’, 1 => ‘""’, 2 => ‘2017-01-23 08:14:43’, 3 => 47298, 4 => 0, 5 => 4, 6 => 11224, )

And:

[Tue Jan 24 13:51:58.582401 2017] [:error] [pid 18419:tid 140544525190912] [client 172.18.20.26:18232] Error in Piwik (tracker): Error query: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction In query: UPDATE piwik_log_visit SET idvisitor = ?, user_id = ?, visit_last_action_time = ?, visit_exit_idaction_url = ?, visit_total_actions = visit_total_actions + 1 , visit_total_interactions = visit_total_interactions + 1 , visit_total_time = ? WHERE idsite = ? AND idvisit = ? Parameters: array ( 0 => ’ \xa6\x96\xbc\xef\xb9\xde\xf5’, 1 => ‘""’, 2 => ‘2017-01-23 08:42:33’, 3 => 49791, 4 => 242, 5 => 4, 6 => 11371, )

I added INDEX(idsite, idvisitor, visit_last_action_time) to piwik_log_visit table and now at least when I imported a 1.1gb log file it finished succesfully (after 2nd attemp with --skip flag).
Import command:

import_logs.py --url=http://middlinf.ote.gr:81/piwik --idsite=2 --recorders=8 --enable-http-errors --enable-http-redirects --enable-static ssl-access.log-20170130 --recorder-max-payload-size=300 --retry-max-attempts=20 --skip=7100

Still getting the errors :

Error in Piwik (tracker): Error query: SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction In query: UPDATE piwik_log_visit SET idvisitor = ?, user_id = ?, visit_last_action_time = ?, visit_total_actions = visit_total_actions + 1 , visit_total_time = ? WHERE idsite = ? AND idvisit = ? Parameters: array ( 0 => ’ \xa6\x96\xbc\xef\xb9\xde\xf5’, 1 => ‘""’, 2 => ‘2017-01-30 01:48:12’, 3 => 179, 4 => 2, 5 => 20654, )

and

Mon Jan 30 23:30:08.788723 2017] [:error] [pid 15812:tid 140021503817472] [client 10.101.0.138:35534] Error in Piwik (tracker): Error que
ry: SQLSTATE[40001]: Serialization failure: 1213 Deadlock found when trying to get lock; try restarting transaction In query: UPDATE piwik
_log_visit SET idvisitor = ?, user_id = ?, visit_last_action_time = ?, visit_total_actions = visit_total_actions + 1 , visit_total_time =
** ? WHERE idsite = ? AND idvisit = ? Parameters: array ( 0 => ’ \xa6\x96\xbc\xef\xb9\xde\xf5’, 1 => ‘""’, 2 => '2017-01-29 **
15:42:28’, 3 => 189, 4 => 2, 5 => 17650, )

Finally the time to import the log is too slow…

Logs import summary
-------------------

** 2256634 requests imported successfully**
** 9788 requests were downloads**
** 747865 requests ignored:**
** 0 HTTP errors**
** 0 HTTP redirects**
** 733420 invalid log lines**
** 0 requests did not match any known site**
** 0 requests did not match any --hostname**
** 14445 requests done by bots, search engines…**
** 0 requests to static resources (css, js, images, ico, ttf…)**
** 0 requests to file downloads did not match any --download-extensions**

Website import summary
----------------------

** 2256634 requests imported to 1 sites**
** 1 sites already existed**
** 0 sites were created:**

** 0 distinct hostnames did not match any existing site:**

Performance summary
-------------------

** Total time: 52278 seconds**
** Requests imported per second: 43.17 requests per second**

Any suggestions?

Are you using Piwik 3?

Hi Matthieu,

I use a VM with 8 cpus and 8 gb mem.
My OS version is RHEL 7.3, my apache version is Apache/2.4.6 (Red Hat Enterprise Linux), php is PHP 7.0.14 and mariadb version is 5.5.52.
Piwik version 3.0.1

Could you please post in our related ticket: https://github.com/piwik/piwik/issues/6398

note there is a workaround offered, maybe it works for you? Would be great if you can comment in the ticket there