1
0
mirror of https://github.com/restic/restic.git synced 2024-06-24 07:46:38 +02:00
restic/src
Zlatko Čalušić d8f0e7cbd1 Fix REST backend HTTP keepalive
This is subtle.  A combination od fast client disk (read: SSD) with lots
of files and fast network connection to restic-server would suddenly
start getting lots of "dial tcp: connect: cannot assign requested
address" errors during backup stage.  Further inspection revealed that
client machine was plagued with TCP sockets in TIME_WAIT state.  When
ephemeral port range was finally exhausted, no more sockets could be
opened, so restic would freak out.

To understand the magnitude of this problem, with ~18k ports and default
timeout of 60 seconds, it means more than 300 HTTP connections per
seconds were created and teared down.  Yeah, restic-server is that
fast. :)

As it turns out, this behavior was product of 2 subtle issues:

1) The body of HTTP response wasn't read completely with io.ReadFull()
   at the end of the Load() function.  This deactivated HTTP keepalive,
   so already open connections were not reused, but closed instead, and
   new ones opened for every new request.  io.Copy(ioutil.Discard,
   resp.Body) before resp.Body.Close() remedies this.

2) Even with the above fix, somehow having MaxIdleConnsPerHost at its
   default value of 2 wasn't enough to stop reconnecting.  It is hard to
   understand why this would be so detrimental, it could even be some
   subtle Go runtime bug.  Anyhow, setting this value to match the
   connection limit, as set by connLimit global variable, finally nails
   this ugly bug.

I fixed several other places where the response body wasn't read in
full (or at all).  For example, json.NewDecoder() is also known not to
read the whole body of response.

Unfortunately, this is not over yet. :( The check command is firing up
to 40 simultaneous connections to the restic-server.  Then, once again,
MaxIdleConnsPerHost is too low to support keepalive, and sockets in the
TIME_WAIT state pile up.  But, as this kind of concurrency absolutely
kill the poor disk on the server side, this is a completely different
bug then.
2016-11-10 09:32:07 +01:00
..
cmds selectFunc: handle nil 2016-11-05 12:38:33 +01:00
restic Fix REST backend HTTP keepalive 2016-11-10 09:32:07 +01:00