One of the services I’m maintaining is built on AWS Lambda and uses a PostgreSQL database hosted on AWS RDS. A while ago the database started to run out of available connections during load spikes. Since the service was a critical part of the platform, I dived into debugging.

Simple load testing on the staging environment with Apache Bench revealed that even after the load abates, the number of database connections only goes down after 4-5 more minutes, and the service is not able to process new requests during that time either. Soon the cause started to seem obvious: spikes of load result in concurrent executions, forcing Lambda to create new instances. To foster database connection reuse, each instance maintained its own pool of connections, storing a reference in the global function scope. That led to database connection exhaustion at a certain concurrency level.

I saw two ways to address the issue: either use reserved Lambda concurrency to keep it from going above the exhaustion limit, or somehow make an instance wait for a connection to become available rather than throw a connection error right away. The former was swept aside: such behavior would mean making clients aware that they should possibly retry. It also didn’t seem to address the issue entirely - as mentioned above, the service stayed unresponsive for several minutes after the load abates, when concurrency is essentially zero.

While looking for a way to implement the second approach I found out that there is an AWS service for exactly that - RDS Proxy. As stated in the description, it takes care of database connections entirely: it maintains a pool of connections to the database and essentially multiplexes incoming connections onto that pool. If no database connection is available to handle a request, it waits for up to a specified timeout before responding with an error.

I quickly set up the proxy, adjusted the connection string and ran the tests again, hoping for the best. Unfortunately, it did not change anything in the results. That did not make any sense until I found the following in the proxy logs:

> The client session was pinned to the database connection

The proxy can’t reuse a pinned connection until the session ends. In other words, each incoming connection was associated with exactly one database connection, and that one-to-one relation made the proxy useless. I tried to play with the timeout settings - both the proxy’s and the pooling library’s - to no avail.

Such results, combined with the fact that the service was unresponsive even after the load spike, gave me an idea. It looks like Lambda creates a set of instances according to concurrency and keeps them alive for several minutes, but apparently puts them to sleep much earlier - I don’t have another explanation for why it refuses to process new events with existing instances. Since the presence of these useless stale instances is out of our control, the only way to avoid connection pool exhaustion is to avoid keeping an active connection in them, so I assumed I had no option other than to disable connection reuse.

Once the change was deployed, cheery load testing results followed: processing latency increased by around 20ms - apparently the penalty for a TCP handshake on each request - but there were no failed requests.
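Adjusting the connection string for RDS Proxy mostly comes down to pointing the client at the proxy endpoint instead of the database endpoint; a sketch with placeholder hosts and credentials (not the service's real values):

```
# before: direct RDS endpoint
postgresql://user:secret@mydb.abc123xyz.eu-west-1.rds.amazonaws.com:5432/app

# after: RDS Proxy endpoint
postgresql://user:secret@my-proxy.proxy-abc123xyz.eu-west-1.rds.amazonaws.com:5432/app
```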
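For illustration, the per-instance pooling pattern described above looks roughly like this. This is a minimal Python sketch: the pool class is a stand-in for a real pooling library (e.g. psycopg2's `SimpleConnectionPool`), and all names here are assumptions, not the service's actual code.

```python
class FakePool:
    """Stand-in for a real database connection pool."""

    def __init__(self, size):
        self.size = size
        self.handed_out = 0

    def getconn(self):
        self.handed_out += 1
        return object()  # pretend connection


_pool = None  # module (global) scope: survives between invocations
              # of the same Lambda instance


def get_pool():
    """Create the pool lazily, only on cold start, then reuse it."""
    global _pool
    if _pool is None:
        _pool = FakePool(size=5)
    return _pool


def handler(event, context):
    conn = get_pool().getconn()
    # ... run queries with conn, then return it to the pool ...
    return "ok"


# Two invocations on the same instance share one pool:
handler({}, None)
handler({}, None)
print(get_pool().handed_out)  # 2
```

The catch is exactly what the post describes: every concurrent instance builds its own pool, so total database connections scale with Lambda concurrency times pool size.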
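The eventual fix - disabling connection reuse - can be sketched the same way: open a fresh connection inside the handler and close it before returning, so an idle instance holds nothing open. Again, `connect()` here is a stand-in for a real driver call such as `psycopg2.connect`; the names are assumptions.

```python
class Conn:
    """Stand-in for a real database connection."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


opened = []  # track stand-in connections for illustration


def connect():
    """Stand-in for the driver's connect(); pays a handshake each time."""
    conn = Conn()
    opened.append(conn)
    return conn


def handler(event, context):
    conn = connect()   # fresh connection on every request...
    try:
        # ... run queries with conn ...
        return "ok"
    finally:
        conn.close()   # ...and nothing left open while the instance idles


handler({}, None)
print(all(c.closed for c in opened))  # True
```

This trades a per-request handshake cost (the ~20ms latency increase observed above) for the guarantee that stale instances can't exhaust the connection limit.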