Somehow it seems that every edge case for TLS or TCP communications winds up in my queue. I don't do much Python development personally (almost none, in fact). Python falls into the "I can read and debug it" category for me, rather than the "I speak it fluently and know the lexicon" one. So when an end user had a problem they couldn't resolve in Python, it took me quite some time to figure it out.
In a nutshell, the problem was that uploads of a file plus accompanying data via POST with MIME multipart/form-data in Python were failing for files over a particular size, but working for smaller files. I couldn't reproduce the problem on my local system or in my development environment, which was very troubling. We'd tested with files of the size this user provides and were pretty certain it would work, but in actual practice, they couldn't make it work.
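For context, here's a minimal sketch of the kind of upload in question. The endpoint URL, file name, and form field are placeholders of my own, not the user's actual values:

import requests

# Hypothetical endpoint and file; uploads of this kind took on the
# order of 30 minutes to send and process end to end.
with open('large_file.bin', 'rb') as f:
    response = requests.post(
        'https://example.com/upload',
        files={'file': f},
        data={'description': 'example metadata'},
        timeout=(30, 3600),  # (connect, read) timeouts in seconds
    )
print(response.status_code)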
We'd had others upload from other platforms (Java, C#, bash scripts using curl, and browser-based forms), so we felt there was nothing specifically wrong in the implementation, and we'd seen nothing like this problem before.
Symptomatically, we'd see the entire request arrive (which took a while to upload), and it would be processed (which also took a while). We also had a log of the outbound response, so we knew a 200 OK success response had been sent back, but the user wasn't getting it; their read was timing out. This looked very much like a firewall problem to me, and we had the networking folks check it out, but the firewall in front of the service clearly wasn't closing the socket prematurely. The request takes about 30 minutes to upload and process, so it's not a great candidate for debugging via Wireshark, and as I said, the problem did not occur in my development or local environments.
I finally tracked this down yesterday. It really was a networking issue, but not in the firewall. Instead, it was the NAT gateway. In the AWS environment, if a connection through the NAT gateway sees no activity for 350 seconds, the connection is dropped. Processing time for a large file grows with file size, and the large files this user was sending took about 10 minutes to process. While the file was being sent there was activity on the connection, but once processing began there were several minutes of silence, and so the connection was dropped.
Why did this work from Java, browsers, and curl? Because of something called TCP keep-alive packets, which exist entirely to address this sort of issue: the kernel periodically sends an empty probe on an otherwise idle connection, and that probe counts as activity, keeping intermediaries like the NAT gateway from dropping it. Those application environments have keep-alive turned on aggressively enough that it's not a problem. Python's requests library does not enable it by default.
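For the curious, the adapter used in the fix below is doing roughly this at the socket level. Here's a sketch using the raw socket options; note that the TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT names are Linux-specific (macOS and Windows spell these differently):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Enable keep-alive probes on the connection.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Start probing after 120 seconds of idle time (Linux name shown).
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 120)
# Probe every 30 seconds thereafter...
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)
# ...and give up after 50 failed probes.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 50)

The important number is the idle time: probes starting at 120 seconds land well inside AWS's 350-second window.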
The fix is to add this to your imports:
from requests_toolbelt.adapters.socket_options import TCPKeepAliveAdapter
And mount an adapter like this on the Session you use for long uploads:
import grequests

s = grequests.Session()
# Start keep-alive probes after 120s idle, then every 30s, up to 50 probes.
keep_alive = TCPKeepAliveAdapter(idle=120, count=50, interval=30)
s.mount('https://', keep_alive)
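Put together, a complete upload with the adapter mounted looks roughly like this; the endpoint URL and file name are again placeholders:

import grequests
from requests_toolbelt.adapters.socket_options import TCPKeepAliveAdapter

s = grequests.Session()
s.mount('https://', TCPKeepAliveAdapter(idle=120, count=50, interval=30))

# Hypothetical endpoint; the long read timeout allows for the roughly
# 10 minutes of server-side processing after the upload completes.
with open('large_file.bin', 'rb') as f:
    response = s.post(
        'https://example.com/upload',
        files={'file': f},
        timeout=(30, 3600),
    )
print(response.status_code, response.reason)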
This finally resolved the problem. You'll note I'm using grequests instead of requests. The two are generally API compatible, and grequests solves another problem I admittedly don't understand or have much time to debug. In any case, what I've seen elsewhere indicates that if you have connection timeout exceptions you cannot otherwise track down, you are connecting to an AWS endpoint, and you know there's a long delay in processing, this might just be the solution you are looking for.