Graham Dumpleton: mod

Showing posts with label mod_python.

Thursday, June 17, 2010

The mod_python project is now officially dead.

At the Apache Software Foundation June board meeting which took place on the 16th June 2010, the following resolutions were passed unanimously.

B. Terminate the Apache Quetzalcoatl Project (to Attic)

This was the umbrella project for mod_python and as such mod_python is now no more.

Long live mod_wsgi. Or at least until I get bored of it. ;-)

Thursday, May 27, 2010

The mod_python project soon to be officially dead.

The mod_python project last had a release in February 2007. There has more or less been no developer activity since. Under the rules of the Apache Software Foundation, if a project becomes inactive then it can be moved into what is called the Apache Attic. Come the next meeting of the ASF board a proposed resolution will be put up to dissolve the Quetzalcoatl project management committee and the projects which it oversees, namely mod_python. Thus, in some respect one can say that mod_python will be officially dead.

What will this really mean? Well, no more development on mod_python, but then that isn't happening now. Linux distributions will likely still carry the package, but they will need to apply any changes themselves to ensure that it continues to work for newer versions of Apache. This actually is already being done, as a bug in mod_python was exposed by some changes to the Apache runtime library and, since no mod_python release has been made since then, the only option was for distributions to patch it themselves.

I suspect though that this patching by distributions will only extend to Apache 2.2.X as Apache 2.4 changes the internal APIs enough that getting mod_python to compile will need more significant changes than a minor patch. You also will never see a version of mod_python for Python 3.X as that is going to require a radical rewrite.

For platforms like MacOS X where the last release of mod_python doesn't build properly, the only option will be to check out the mod_python source code from the subversion repository as that incorporates a fix for those issues, although it has since been shown to not be a complete fix and so you may still have some problems. In other words, the subversion repository will still exist, but it will be made read only and the location may potentially move.

What options exist if you want to move away from mod_python?

If you are only using mod_python as a means of hosting within Apache a distinct Python web framework or application and it supports the WSGI interface, then the obvious candidate is to move to mod_wsgi instead.

If you are using mod_python to implement custom access, authentication or authorization handlers, then you may also be able to get away with using mod_wsgi. You may have to make compromises though as mod_wsgi doesn't currently allow you to write full blown Apache style handlers and instead only implements the Apache authentication and authorization provider interfaces. This actually makes it easier to do most things, but you lose the ability to do some more complicated stuff which depends on using different error status values or custom error pages. You can partly get away with using the ErrorDocument directive to return custom error pages, but not always.

Although proper support for Apache style handlers could be added to mod_wsgi, and a lot of work has been done in that direction already through the implementation of SWIG bindings for Apache, whether the remaining work to allow the actual hooking of handlers into mod_wsgi will be done will really depend on interest in it. Up till now, there hasn't been sufficient interest to justify doing the final bit of work and there has actually been some resistance put up to the idea of extending mod_wsgi too far beyond the core goal of providing WSGI hosting.

If you are not using the basic handler mechanisms of mod_python and are instead using the CGI handler, the publisher handler or PSP handler, then there aren't really any options at this point except for rewriting your application on top of a WSGI framework. If you have to do such a rewrite, but like the low level that one works at when using mod_python, then I would strongly recommend you look at Werkzeug and Flask.

If you are using mod_python input or output filters there simply isn't any replacement. Frankly though I always thought that writing input or output filters in mod_python was a really bad idea. Yeah it may work, but it wouldn't exactly be efficient. You would be much better writing a proper Apache module in C to do what is required, certainly if performance is an issue.

Finally, the ability to use Python code with the Apache server side include mechanism also has no real equivalent. But then, not sure anyone ever actually used that feature anyway. Something similar could possibly be implemented in mod_wsgi but not sure there would really be a point. If you are only using the ability to include Python code, you would be much better off using a proper templating system that works with a WSGI framework.

So, the writing is on the wall so to speak and if you are using mod_python, you really should be starting to plan how to move away from it as eventually it will likely not be an option you can use if you want to keep up to date with Apache and/or Python.

Could it yet be saved by a white knight? Well, yes it could as the Apache Attic does allow projects to be resurrected or forked with it then being maintained outside of the ASF. If someone does do the latter though, you would need to be mindful of the name as the ASF in some respects has rights over the mod_python name. As such, you would likely need to rename the project when you take it over.

If you fork mod_python, you would also want to think about capturing all the details of the significant number of still open bug reports for mod_python in its issue tracker, as am not sure what will happen to that and whether it will be made read only or whether it will be completely closed down.

Anyway, will be interesting times ahead. I should point out though that according to Google trends (currently borked), searches on WSGI finally overtook those for mod_python recently. Searches for mod_wsgi haven't yet, but that is not far off occurring. It is also the common belief that WSGI is the way to go for Python web applications now, so perhaps it is simply time to just move on from mod_python. If we could only just get newbies, who even now keep using mod_python even though better options exist, to understand this, then maybe it can finally be left to die in peace.

Sunday, November 15, 2009

Authorship of mod_wsgi and mod_python.

I wrote 100% of mod_wsgi for Apache, it does not borrow code from mod_python. I was not the author of mod_python for Apache although I did contribute to the 3.2.X and 3.3.X releases. I also was not the author of the confusingly named mod_wsgi for nginx although the author of that package did borrow some code from mod_wsgi for Apache.

Why am I saying this? Because I am getting a bit tired of the incorrect posts about this on various blogs, forums and IRC channels related to the major Python web frameworks and other Python topics. I really wish you guys would do your research properly before commenting on something you don't really know much about.

Especially those answering questions on the IRC channels, it would also be really good if you at least read the mod_wsgi documentation. It gets very frustrating to keep seeing posts in the IRC logs where you give out incorrect information about usage of mod_wsgi. As to people asking the questions, please go and read the documentation on the actual mod_wsgi web site, especially where someone does provide a link to some specific part of it by way of an answer, rather than just believing that someone else's one line answer is all there is to know or even correct. Why if you have a mod_wsgi question you can't just go to the mod_wsgi mailing list in the first place I have no idea.

And since I am having a rant, also stop using the term WSGI to generically refer to mod_wsgi running on Apache. WSGI is a specification of an API for Python web applications. The mod_wsgi module for Apache is just one implementation of the WSGI specification, and a lot more besides, for the Apache web server. The mod_wsgi module for Apache is not the only implementation of the WSGI specification in respect of providing a hosting solution, both generally and for Apache. It gets really confusing sometimes when you use the term WSGI in isolation when talking about a specific hosting mechanism as one doesn't know what hosting mechanism you are actually talking about even though most of the time it is mod_wsgi for Apache that you mean.

Friday, April 17, 2009

Improving Commercial Python/WSGI Hosting Options

I'd like to think that through my work with mod_python and mod_wsgi that Python web hosting options have improved, but truth is that neither mod_python nor mod_wsgi (at this stage) are really suitable for mass virtual hosting. As such, for low cost commodity Python web hosting the only real options are still CGI and FASTCGI.

In the case of FASTCGI this usually means mod_fastcgi or mod_fcgid under Apache, and although many web hosting companies do use these modules and so can provide support for Python, they often don't, or the support provided is less than ideal.

In taking the view that support for Python isn't very good, one does have to be careful however. This is because when you read support forums and IRC channels, you obviously are only going to see the complaints and the calls for help to get things working. It may well be the case that this is an outspoken minority and the bulk of people are having no problem at all. Either way, there is still a perception that the Python community isn't being well serviced by web hosting companies and that something better is required.

As I have previously described in the mod_wsgi roadmap, the intention is to support features that would allow mod_wsgi to be used in mass virtual hosting, but there is a lot more to it than just providing yet another option that they might be able to use. In fact, there is no real reason why good Python web hosting couldn't be offered using FASTCGI right now.

I tend to think that the real problem is in part one of education. That is, lack of good documentation on how to set up FASTCGI for running Python within a commercial web hosting operation, and a clear indication of what the Python community's expectations are as to what should be available.

Some of the problems which arise are web hosting companies that provide only woefully out of date Python versions, no easy ability to install Python modules/packages, and in the case of FASTCGI, not even providing flup or some other FASTCGI bridge. End result is that although one may be able to use Python, it isn't necessarily easy and a lot of the hard work is pushed onto the user, rather than the web hosting company providing an environment which is easy to use to begin with.

With that in mind I am currently contemplating whether to start up a distinct uber project which has the specific goal of improving commercial Python/WSGI hosting options. This would not be done with the intent of just pushing my separate mod_wsgi software, but would look at all available software and come up with guidelines and other documentation on how best to use whatever is available, including CGI and FASTCGI.

I can also see this going beyond just documentation, with it also producing code libraries and applications. For example, at the moment for someone to host a Python WSGI web application under CGI they need to know about what CGI/WSGI adapters are available. Similarly for FASTCGI you need to know about what FASTCGI/WSGI adapters are available. That or you need the Python web application being used to internally somehow support CGI or FASTCGI directly.

Frankly, with WSGI, these days it is pretty stupid for Python web applications themselves to be worried about CGI or FASTCGI. At the same time, the user should not need to know about them either. What would be much better is that no matter what underlying Python hosting mechanism is used, the web hosting company provides a means of hosting WSGI applications themselves.

As an example, when using mod_wsgi all you need to do is provide a WSGI script file which contains an 'application' object as entry point for the WSGI application. That WSGI script can also include any other code required to set up the environment for the WSGI application. There is no reason why this couldn't also be applied to CGI and FASTCGI.
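To make the idea of a WSGI script file concrete, here is a minimal sketch of one. The file name hello.wsgi and the response text are made up for illustration; the only fixed part is that the callable is named 'application'.

```python
# hello.wsgi: hypothetical file name. The entry point must be called
# 'application', everything else here is illustrative.


def application(environ, start_response):
    """Minimal WSGI entry point returning a plain text body."""
    body = b"Hello from a WSGI script file\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

Any code needed to set up the environment, such as adjusting sys.path, would simply go at the top of the same file before the 'application' object is defined.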

So, instead of a user having to provide a .cgi or .fcgi file, they would provide a .wsgi file. It would then be up to the web hosting company to automatically ensure that the right thing happens.

Obviously, web hosting companies are going to be clueless as to how to make that work and this is where one product of the project would be to provide a small set of Python wrapper applications which perform that mapping along with the instructions on how a web hosting company would integrate that into their systems. This would therefore need to include guidelines on how to set up Apache, including how to integrate it into suexec or cgiwrap as appropriate.
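As a rough sketch of what such a wrapper might look like for the CGI case, the following loads a user's .wsgi file and runs it through the CGI handler from the standard library wsgiref package. The function names and the idea of passing the script path as the first command line argument are assumptions for illustration only, not how any existing hosting setup works.

```python
import importlib.machinery
import importlib.util
from wsgiref.handlers import CGIHandler


def load_wsgi_application(script_path):
    """Load a WSGI script file (any extension) and return its 'application'."""
    # SourceFileLoader is used directly because '.wsgi' is not an
    # extension the normal import machinery recognises.
    loader = importlib.machinery.SourceFileLoader("_wsgi_script", script_path)
    spec = importlib.util.spec_from_loader(loader.name, loader)
    module = importlib.util.module_from_spec(spec)
    loader.exec_module(module)
    return module.application


def main(argv):
    # Assumption: the hosting setup passes the path of the user's .wsgi
    # file as the first argument. A real integration via suexec or
    # cgiwrap would likely determine it some other way.
    application = load_wsgi_application(argv[1])
    # CGIHandler takes the request from os.environ and stdin and writes
    # the response to stdout, which is what a CGI script must do.
    CGIHandler().run(application)
```

A CGI entry script installed by the hosting company would then only need to call main(sys.argv), and the user never has to see a .cgi file at all.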

One of the problems that this wrapper application can solve is fixing up WSGI variables like SCRIPT_NAME and PATH_INFO. At the moment Python web applications often have hacks in them, or the user themselves are forced to have hacks in the WSGI script file, to adjust these variables where they aren't passed through correctly from the web server.
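One way a wrapper could do this fix up is as a small piece of WSGI middleware. The sketch below assumes the wrapper is told the URL prefix the application is really mounted at; the class name and the mount_point parameter are hypothetical, and real servers get these variables wrong in more ways than this one case covers.

```python
class ScriptNameFixup:
    """Hypothetical middleware which re-splits SCRIPT_NAME and PATH_INFO."""

    def __init__(self, application, mount_point):
        self.application = application
        # Normalise '/blog/' to '/blog' and '/' to the empty string.
        self.mount_point = mount_point.rstrip("/")

    def __call__(self, environ, start_response):
        # Rebuild the full path, then split it again at the known mount point.
        full_path = environ.get("SCRIPT_NAME", "") + environ.get("PATH_INFO", "")
        prefix = self.mount_point
        if prefix and (full_path == prefix or full_path.startswith(prefix + "/")):
            environ["SCRIPT_NAME"] = prefix
            environ["PATH_INFO"] = full_path[len(prefix):]
        return self.application(environ, start_response)
```

With this in place, a request arriving with an empty SCRIPT_NAME and a PATH_INFO of /blog/post/1 would reach an application mounted at /blog with SCRIPT_NAME set to /blog and PATH_INFO set to /post/1, so neither the application nor the user's WSGI script file needs its own hack.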

Another problem that can be solved here is ensuring that logging from Python web applications ends up somewhere where the user can actually see and make use of it. One often sees instances where people are having trouble with something like FASTCGI, but due to how the system is set up, any error messages output when the FASTCGI script is being started disappear, making it really hard to debug problems. Because the wrapper application is in control of loading the WSGI script file, it can ensure that any log files are set up properly. It could even provide a feature to capture the errors and return them in an error page to the browser rather than them going to the log only.
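The error capturing feature could also be sketched as WSGI middleware. Everything here is an assumption made for illustration: the class name, the log_stream and show_errors parameters, and the choice to buffer the whole response so that errors raised while generating it are caught as well.

```python
import sys
import traceback


class ErrorCapture:
    """Hypothetical middleware: log errors and optionally show them."""

    def __init__(self, application, log_stream=None, show_errors=False):
        self.application = application
        self.log_stream = sys.stderr if log_stream is None else log_stream
        self.show_errors = show_errors

    def __call__(self, environ, start_response):
        try:
            result = self.application(environ, start_response)
            try:
                # Buffer the response so errors raised while iterating
                # are caught too. This gives up streaming responses.
                return list(result)
            finally:
                close = getattr(result, "close", None)
                if close is not None:
                    close()
        except Exception:
            details = traceback.format_exc()
            # Always write to a log the user can actually read.
            self.log_stream.write(details)
            self.log_stream.flush()
            text = details if self.show_errors else "Internal Server Error\n"
            body = text.encode("utf-8")
            headers = [
                ("Content-Type", "text/plain; charset=utf-8"),
                ("Content-Length", str(len(body))),
            ]
            start_response("500 Internal Server Error", headers, sys.exc_info())
            return [body]
```

In a real hosting setup log_stream would be a per user log file opened by the wrapper, and show_errors would only be turned on while debugging since tracebacks can leak details of the application.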

So, that is the dream. In part I need to indirectly do some of the ground work for this in order to work out what features I need to add to make mod_wsgi more useful in a mass virtual hosting setup. It would be nice though if others out there who have some measure of passion for seeing Python web hosting options improved would contribute as well. Most of all, I would dearly like to get the web hosting companies themselves directly involved.

In respect of dealing with web hosting companies, to date my experiences in dealing with them have not been very inspiring. Where I have actively tried to contact them to try and learn how they run things, so I can work out what features mod_wsgi should provide to make it easy for them to use, they have been quite unwilling to give up any information. Even when web hosting companies have contacted me about mod_wsgi, it seems the contact is coming from managers or sales people and not the technical people. Even at the requests of these same people, their own technical people aren't necessarily forthcoming with the information I really need. Overall it has been quite frustrating to say the least.

Hopefully then if this project were to get off the ground and were seen to have active backing from the Python community, we might be able to make some progress. We may even be able to make web hosting companies see that there is more than just PHP out there.

Right now any feedback you may want to give on the whole idea and whether there is a need for it would be most helpful. Maybe I am barking up the wrong tree and all is actually wonderful after all. As much as I may believe there is a problem here needing to be solved, am sure that existing mod_wsgi users would prefer I concentrate on just mod_wsgi and not worry about all this other stuff. :-)

Thursday, March 19, 2009

Future roadmap for mod_wsgi.

Because of family commitments, progress on mod_wsgi has been slower than I would have liked for the past year. I have a bit more spare time these days, so time to talk a little about where I see mod_wsgi going from here.

Up till now, the little time I have been able to spend on mod_wsgi which hasn't been chewed up in answering questions on the mailing list and other forums, has been going towards mod_wsgi version 3.0. This has mainly consisted of a lot of bug fixes and minor refinements, however there are also a few interesting new features, the main ones being described below.

Support for Python 3.0. Now support Python 3.0 based on proposed amendments to WSGI specification for this latest incarnation of Python. The mod_wsgi package was the first major WSGI server to support Python 3.0, albeit that you have had to use it from the subversion repository. Just a pity that it seems it will be a while before any of the larger frameworks support Python 3.0.
User chroot environment. Specific daemon process groups can be delegated to run in the context of a chroot environment. Direct support for chroot environments by mod_wsgi means that you do not have to run Apache as a whole in the chroot environment and you could support many WSGI applications running in different daemon process groups as different users each in their own chroot environment.
Ownership of WSGI script files. Enforce that the WSGI script files corresponding to a WSGI application delegated to a specific daemon process group, are owned by or are a member of a specific group. The permissions of the directory containing the WSGI script file are also checked to ensure they are consistent. This is an additional security measure that can be applied in the case where all WSGI scripts in a directory are being delegated to run in a daemon process as a specific user, to ensure that sloppy directory permissions don't allow arbitrary other users to place WSGI scripts in that directory and therefore have it run as a different user.
Chunked request content. This is not something that the WSGI specification actually allows, but if you are happy to step slightly outside of the WSGI specification you can support clients which use this feature of HTTP. This appears to be becoming more and more important as some mobile phone devices are automatically using chunked request content on HTTP POST requests where data is greater than a certain size. At the moment WSGI applications are hamstrung in not being able to support this unless the underlying WSGI adapter uses the rather hacky method of reading in the whole request up front, calculating the CONTENT_LENGTH and passing it through as if it wasn't a chunked request.
Internal web server redirection. For mod_wsgi daemon mode, now support the CGI method of being able to return a HTTP 200 response with the Location header defined, to trigger an internal server redirection. The target URL in this case could be within the same Python web application, another web application, another web server proxied via the web server or a static file. The latter would give something akin to X-Sendfile, although actually more like nginx X-Accel-Redirect as the target is a URL and not a physical file.
Limits on processor CPU time. For mod_wsgi daemon mode, this allows one to trigger an automatic restart of a daemon process when the accumulated CPU time used by the process exceeds a specified amount. The intent of providing this feature is as a fail safe to capture when a process may go berserk and starts chewing up processor time due to some programming error.
Override application error pages. For mod_wsgi daemon mode, now allow any error response page returned by the WSGI application to be ignored and the default Apache error page, or any defined by Apache ErrorDocument directive, to be used instead. This is for where multiple applications, implemented in different systems or programming languages, are hosted and they must present error pages in the same style.
Authentication and HTTP headers. The HTTP headers are now available to authentication providers allowing them to qualify their behaviour based on information in the headers, such as cookies.
Preloading of WSGI applications. The process group and application group to which a WSGI application is to be delegated can now be defined as parameters to the WSGIScriptAlias directive. As a side effect of this, the WSGI script so designated will be preloaded automatically and you do not have to separately use WSGIImportScript to preload it.
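For the internal web server redirection feature in the list above, the application side would be about as small as the following sketch. It only illustrates the response shape described, a 200 status with a Location header, and the target URL is a made up example. Whether any other headers are needed has not been checked against mod_wsgi 3.0 itself.

```python
def application(environ, start_response):
    """Ask the web server to serve a different URL in place of this one."""
    # Hypothetical target: any URL the web server itself can resolve,
    # such as a static file, another application or a proxied server.
    target = "/static/report.pdf"
    # A 200 status together with a Location header is the CGI style
    # signal for an internal redirect, rather than a redirect sent
    # back to the browser.
    start_response("200 OK", [("Location", target)])
    return []
```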

Although these have all been implemented and mod_wsgi 3.0 is pretty well all ready to go, have decided at this point to first back port as many as possible of the bug fixes from mod_wsgi 3.0 back to mod_wsgi 2.X stream and release mod_wsgi 2.4. This is to close off the mod_wsgi 2.X stream with a stable version, given that many people don't like jumping up to a new major version due to the perception that it will introduce its own new problems.

Once mod_wsgi 2.4 has been released, and last tidy up tasks related to mod_wsgi 3.0 are completed, would expect mod_wsgi 3.0 to follow not long after.

In respect of mod_wsgi 3.0, we did have a bit of a discussion on the mailing list about whether to disable embedded mode in mod_wsgi 3.0 by default and force that it be enabled to be used. The reasoning here was that on UNIX systems most people unknowingly use embedded mode without realising that in doing so they really should be tuning the Apache MPM settings to values more appropriate for fat Python web applications. This is necessary as the default MPM settings are more appropriate for PHP and static file serving and certainly not for Python web applications. Don't adjust things properly and you could see yourself suffering memory issues and load spikes. So, the thought was that maybe disabling it by default may act as a good flag to people to say, 'do this and you better know what you are doing'. In the end, decided to leave it as is, because without also configuring daemon mode, they would just get an error message and most likely be more confused about what is going on.

What is really needed in order to be able to disable embedded mode, or get rid of it altogether, is for daemon mode to support a means of dynamically creating daemon processes without the need for any up front configuration to specifically enable it. For example, the default might be that the first time a WSGI application associated with a specific virtual host is triggered, a daemon process is automatically started for that virtual host. If multiple WSGI applications are mounted under that same virtual host, they would execute within the context of different sub interpreters of that daemon process.

Such a feature along with other mechanisms for supporting transient daemon processes has been on the TODO list for a while and had originally wanted to include it in mod_wsgi 3.0, but the lack of time meant I had to defer trying to implement it.

As such, one of the main tasks down for mod_wsgi 4.0 is the ability to dynamically create daemon process groups based on some template or parameterised configuration definition. This may see a daemon process created for each virtual host as above, or maybe one for each authenticated user, with each of those daemon process groups automatically running under the account of that authenticated user. Obviously, lots of possibilities exist here, it just depends on how flexible one can make it and from where one draws the input values to fill out the parameterised parts of the configuration.

Originally the thought had been to just focus on that one task for mod_wsgi 4.0, but some recent feedback from Doug Napoleone about some hacks he had done with mod_wsgi to make it possible to host Python web applications where each used a different version of Python, rekindled some old plans I had for exactly that.

The problem to be solved here is that mod_wsgi is compiled against a specific version of Python and so you are stuck using that one version of Python. If you wanted to use a newer version of Python, you would need to recompile mod_wsgi and thus upgrade all the Python web applications you host to use the newer version of Python at the same time.

Obviously this is a big drawback in a shared environment, be it a university setting or a web hosting provider. This is in part why FASTCGI is still seen as a much better solution than mod_wsgi for Python web hosting, ignoring even the fact that FASTCGI can also be used for other languages.

Now, in order to be able to support using multiple versions of Python, one has to do away with embedded mode. This is necessary as to make embedded mode reasonably efficient on memory and startup costs, one has to load the Python interpreter into the Apache parent process and first initialise it there. That way the Python interpreter is ready for use immediately after the Apache child server processes are forked. Having the Python interpreter linked in and initialised at such an early stage though prevents forked daemon processes from using different versions of Python.

Next issue is that mod_wsgi as it is implemented now, uses the Apache parent process as the monitor process. That is, it is the Apache parent process from which daemon mode processes are directly forked. As such, it was also benefiting from Python being preinitialised in the Apache parent process.

What we would now need to do instead, is to separate out from the Apache parent process the function of creating and monitoring the daemon processes. To that end, a separate monitor process would be forked from the Apache parent process and it would be that process which would then in turn create the daemon processes.

Now that we have a separate monitor process, the whole linking in of and initialising of the Python interpreter can be delayed until the monitor process has been created. To support multiple versions of Python at the same time, we just create multiple monitor processes, one for each version of Python to be supported and with each loading up a specific instance of the mod_wsgi code for that version of Python.

It is just then a matter of specifying which version of Python one wants to use for a specific daemon process group, and the appropriate monitor process would take on the responsibility for ensuring the daemon process is created and subsequently managed.

Because it is problematic to support embedded mode at the same time as supporting the use of multiple versions of Python, the intent would thus be that mod_wsgi 4.0 deliver two distinct Apache module variants. These will be the existing mod_wsgi module and a new mod_wsgid module.

The mod_wsgi module would provide the same level of functionality as is currently provided, but with the addition of dynamically created daemon processes as explained above. It would be bound to a specific version of Python.

The mod_wsgid module would only support daemon mode. Thus, no embedded mode and no ability to implement Apache authentication providers or group authorisation in Python. Using this version one would be able to use different versions of Python at the same time for different WSGI applications. This will be possible because the mod_wsgid module wouldn't actually utilise any part of Python directly. Instead there would be companion plugin modules for each different version of Python. It would be these companion modules which would be loaded by the monitor process corresponding to the desired version of Python. Since nothing to do with Python would be done in the Apache parent process, the Apache child server processes would thus be a bit slimmer as a result. A default configuration would also exist which would automatically create a daemon process group for each virtual host using the version of Python designated as the primary version to be used.

Note that mod_wsgid and mod_wsgi would not be able to be loaded into Apache at the same time, nor would mod_python be able to be used at the same time as mod_wsgid.

Because use of embedded mode is the more specialised case and daemon mode the preferred deployment scenario, the mod_wsgid module would actually likely become the recommended module to use.

Now, if you think that is about as far as one could take mod_wsgi then you would be wrong. For mod_wsgi 5.0 would like to revisit embedded mode and look at adding in support for some of the features of mod_python that mod_wsgi doesn't provide. This would include looking at support for Apache input and output filters implemented using Python, plus exposing the internal Apache APIs for use by Apache style handlers.

A reasonable amount of work has already been done on creating SWIG bindings for Apache APIs but so far it looks like one would be better off using hand crafted bindings instead. This is what mod_python does, but mod_python doesn't really follow that closely the Apache APIs. Would prefer that the hand crafted bindings be SWIG like in the sense of being a much closer mapping of the actual C APIs. By doing this it would be much easier for people to apply any knowledge they have of the C APIs into their Python equivalents. At this point it is also unknown what is going to happen with SWIG and Python 3.0. It may just be easier to support hand crafted bindings for Python 2.X and Python 3.X at the same time.

So, that be the current vision of where mod_wsgi is heading. This is by no means final and always open to suggestions. If you really want to get into a discussion about it, do suggest though using the mod_wsgi mailing list hosted on Google Groups for that, rather than trying to use comments to this post to carry out a conversation.

Monday, March 9, 2009

Load spikes and excessive memory usage in mod_python.

A common complaint about mod_python is that it uses too much memory and can cause huge spikes in processor load. Fact is that this isn't really caused by mod_python itself, but indirectly by virtue of how, or more so how not, Apache has been configured for the type of web application that is being run.

Where the problem stems from is the choice of Multi-Processing Module (MPM) chosen for the Apache installation and the default settings for that MPM.

On UNIX systems there are two main MPMs that are used. These are the prefork MPM and the worker MPM. The prefork MPM implements a multi process configuration where each process is single threaded. The worker MPM implements a multi process configuration but where each process is multi threaded.

Which MPM is used is a compile time option and not something that can be changed dynamically at run time. Thus, your decision has already been made by the time you have installed Apache from source code or from the binary operating system package. Often the choice is already made for you by what the operating system supplied as the default.

Traditionally the MPM used for an Apache installation has been prefork. This is because that is all the older Apache 1.3 supported, but also partly because modules for web development, such as PHP, were not generally multi thread safe and so required that prefork MPM be used.

With the MPM having been selected, either explicitly or through ignorance of there being a choice, that is where the majority of people stop. What most do not realise is that the default settings for an MPM will generally need to be modified based on what you are using Apache for and how much memory your system has available. Customising these settings is even more important for Python web applications as I will explain.

Let's first look at the default settings for the prefork MPM. The values for these as shipped with the original Apache source code are:

# prefork MPM
# StartServers: number of server processes to start
# MinSpareServers: minimum number of server processes which are kept spare
# MaxSpareServers: maximum number of server processes which are kept spare
# MaxClients: maximum number of server processes allowed to start
# MaxRequestsPerChild: maximum number of requests a server process serves
<IfModule mpm_prefork_module>
StartServers          5
MinSpareServers       5
MaxSpareServers      10
MaxClients          150
MaxRequestsPerChild   0
</IfModule>

What this all means is that when Apache starts up it will create 5 child server processes for handling of requests. The number of child server processes used isn't a fixed number however. Instead, what will happen is that Apache will dynamically create additional child server processes when the load increases. Exactly when this occurs is dictated by the setting for the minimum number of idle spare servers. Such additional child server processes may be created up to a number determined by the maximum number of allowed clients. In this case, because each child server process is single threaded, that means a maximum of 150 child server processes may be created.

This is actually quite a lot of child server processes that can be created. If Apache is being used only to serve static files then this number is quite reasonable however. This is because each child server process should be quite small in size. Even when using PHP the server child processes shouldn't grow to be overly large. This is because PHP is CGI like in the sense that each application script is reconstructed on each request and then thrown away at the end of the request. Thus, nothing of the application persists between requests and thus any memory use is always transient.

The other key aspect of PHP which means that memory use of the individual child server processes is kept down, is that the extensions available to the PHP user are fixed when PHP is initialised. Further, all the PHP internal libraries and any optional extensions are preloaded from shared libraries/objects in the Apache parent process before any child server processes are even created from it. Thus, all the code which a PHP application uses is not only preloaded, but shared between all child server processes and doesn't count as private memory of the child server processes. This is significant when one considers that the PHP library alone, not counting optional extensions, can be about 7MB in size.

We now need to contrast what happens with PHP to what happens with mod_python and Python web applications.

When using mod_python the only thing that happens in the Apache parent process is that the Python interpreter is initialised. There is no preloading of any modules which a Python web application may want to use. This is the case as Python works the opposite way to PHP in that it does as little as possible up front, only importing specific modules when actually used by an application.

The next difference with Python web applications is that once application code is loaded it remains loaded for the life of the process. That is, unlike PHP which throws away the application between requests, everything persists between requests in Python web applications. If a Python web application spans a large set of URLs, the application code may not even all get loaded upon the initial request. Instead it may only get progressively loaded as different URLs are accessed.

The important thing to realise here is that all this loading of Python application code is occurring in the child server processes which handle the requests and not the Apache parent process. Except where Python modules are implemented as C extension modules, all the code that is loaded is going to use up private memory of the process. It is not unheard of for even small to moderate sized Python web applications to consume 30MB or more of private memory in each child server process.

It is this significant amount of memory per process which is where problems start to occur. If you remember, the default settings for the prefork MPM were such that up to 150 child server processes could be created. This means that for such a small to moderate sized Python web application, if Apache decided to create up to the maximum number of child server processes, you would need in excess of 4GB of memory.
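The arithmetic behind that figure can be sketched in a few lines. This is only an illustration using the numbers quoted above (150 processes, 30MB each); the function name is made up for the example.

```python
# Worst-case private memory if every allowed prefork child loads the application.
# Both constants come from the text above; the helper name is hypothetical.
MAX_CLIENTS = 150      # default MaxClients for the prefork MPM
APP_MEMORY_MB = 30     # private memory of one loaded application instance


def worst_case_memory_mb(max_processes, per_process_mb):
    """Total private memory if every process has loaded the application."""
    return max_processes * per_process_mb


total_mb = worst_case_memory_mb(MAX_CLIENTS, APP_MEMORY_MB)
print("%d MB (about %.1f GB)" % (total_mb, total_mb / 1024.0))
# prints "4500 MB (about 4.4 GB)"
```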

If you are running a small VPS system with an allocation of only 256MB, you can see that it just isn't going to work very well. You might just squeak by with having all the initial 5 child server processes having loaded up your application, but as soon as you get a sudden increase in requests and Apache decides that it needs to create more child server processes, your system will quickly run out of memory.

So, although the default settings for the prefork MPM may be reasonable for handling of static file requests or PHP, they are going to be completely inappropriate for any sizable Python web application, especially if running a system with only limited memory.

This then addresses one of the main complaints one often sees made against mod_python. That is that it consumes huge amounts of memory. In reality it isn't mod_python at all here which is the problem.

First off the memory is being consumed by the Python web application and not mod_python. If you ran the same Python web application in a standalone process on top of a Python web server, that single instance of the application would still use about the same amount of memory.

The real problem here is that so many instances of the application have been allowed to be created by not changing the default settings for the prefork MPM. Specifically, the number defined for the maximum clients should be dropped commensurate with the amount of memory available to run it and how big the application gets.

A very crude measure for determining the maximum number of clients, and therefore how many child server processes will be created, is to divide the maximum amount of memory you want to allow the web server as a whole to use, by the amount of memory a single instance of the Python web application consumes.
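As a rough sketch of that division, the following helper computes a starting value for MaxClients. The 180MB budget in the example is an assumed figure for a 256MB VPS, not something measured, and the function name is hypothetical.

```python
def crude_max_clients(memory_budget_mb, per_process_mb):
    """Crude MaxClients estimate: memory budget divided by per-process size.

    Always returns at least 1 so that Apache can still serve requests.
    """
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    return max(1, int(memory_budget_mb // per_process_mb))


# Assumed example: allow Apache 180MB of a 256MB VPS, with a 30MB application.
print(crude_max_clients(180, 30))   # prints 6
```

The result is only a first guess. You would still want to watch actual memory use under load and adjust from there.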

Do note though that this is a very crude measure, things are in practice a bit more complicated than that. One thing that complicates the issue is whether keep alive is enabled for connections and what the keep alive timeout is set to.

Whether keep alive is enabled or not isn't going to change what the maximum number of clients should be set to, but it does in practice limit how many concurrent requests you will be able to effectively handle. This is because the Apache request handler threads will be busy waiting to see if a subsequent request is going to arrive over the same connection. Eventually the request handler thread will timeout, but during that time it will not be able to handle completely new requests.
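To get a feel for the effect, here is a deliberately pessimistic back of the envelope model. It assumes every client sends a single request and then leaves the connection idle until the keep alive timeout expires; real traffic will sit somewhere between this and the no keep alive case. All of the numbers and the function name are assumptions for illustration only.

```python
def new_connections_per_second(max_clients, service_seconds, keepalive_timeout_seconds):
    """Worst case rate at which new connections can be accepted.

    Each handler is tied up for the request itself plus the full keep alive
    timeout, because in this model no second request ever arrives.
    """
    busy_seconds = service_seconds + keepalive_timeout_seconds
    return max_clients / float(busy_seconds)


# Assumed figures: 6 handlers, half a second per request.
without_keepalive = new_connections_per_second(6, 0.5, 0)
with_keepalive = new_connections_per_second(6, 0.5, 5)
print("keep alive off: %.1f/s, 5 second timeout: %.1f/s"
      % (without_keepalive, with_keepalive))
```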

If keep alive is a problem, one course often taken which can help out is to offload serving of static media files to a separate web server. Keep alive can then be turned off for the Apache instance running the Python web application where it generally isn't as beneficial as for static file requests. Web servers such as nginx and lighttpd are arguably better at serving static files anyway, and so you will actually get better performance when serving them that way. Offloading the static files also allows you to configure Apache properly for the specific Python web application being hosted, rather than having conflicting requirements.

As to the load spikes which can occur, what this comes down to is the startup costs of loading the Python web application being run. Here the problem is that Apache will create additional child server processes to meet demand. Because Python web applications these days generally have a lot of dependencies and need to load a lot of code they will not start up quickly. That startup is costly actually serves to multiply the severity of the problem, because although the additional processes are starting up, if they take too long, Apache will decide that it still doesn't have enough processes and will start creating more. In the worst case this can snowball until you have completely swamped your machine.

The solution here is not to create only a minimal number of servers when Apache starts, but create closer to what would be the maximum number of processes you would expect to require to handle the load. That way the processes always exist and are ready to handle requests and you will not end up in a situation where Apache needs to suddenly create a huge number of processes.

The catch here to watch out for is that the startup cost of the Python web application is simply transferred to when Apache is being started in the first place. If you find that even when a larger number of processes are created at startup, the initial burst of traffic and the subsequent loading of the actual Python web application strains the resources of your system, then you need to seriously look at whether you are creating many more processes than you need anyway.

First off, don't run PHP on the same web server. That way you can run worker MPM instead of prefork MPM. This immediately means you drop down drastically the number of processes you require because each process will then be multithreaded rather than single threaded and can handle many concurrent requests. To see how this works one can look at the default MPM settings for the worker MPM.

# worker MPM
# StartServers: initial number of server processes to start
# MaxClients: maximum number of simultaneous client connections
# MinSpareThreads: minimum number of worker threads which are kept spare
# MaxSpareThreads: maximum number of worker threads which are kept spare
# ThreadsPerChild: constant number of worker threads in each server process
# MaxRequestsPerChild: maximum number of requests a server process serves
<IfModule mpm_worker_module>
StartServers          2
MaxClients          150
MinSpareThreads      25
MaxSpareThreads      75
ThreadsPerChild      25
MaxRequestsPerChild   0
</IfModule>

The important thing to note here is that although the maximum number of clients is still 150, each process has 25 threads. Thus, the maximum number of processes that could be created is 6. For that 30MB process that means you only need 180MB in the worst case scenario rather than the 4GB required with the default MPM settings for prefork.
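The same calculation can be written out as a short sketch. It assumes MaxClients is a whole multiple of ThreadsPerChild, as it is in the default settings shown above; the function name is invented for the example.

```python
def worker_max_processes(max_clients, threads_per_child):
    """Maximum number of worker MPM child processes for the given settings."""
    return max_clients // threads_per_child


processes = worker_max_processes(150, 25)   # defaults from the listing above
memory_mb = processes * 30                  # 30MB per process, as in the text
print("%d processes, %d MB" % (processes, memory_mb))
# prints "6 processes, 180 MB"
```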

Keep that in mind and one has to question how wise the advice in the Django documentation is that states "you should use Apache’s prefork MPM, as opposed to the worker MPM" when using mod_python.

All well and good if you run your own computer with huge amounts of memory and little traffic, but a potential recipe for disaster if you don't know that you should be changing the default MPM settings and you are using a memory constrained VPS and your site becomes popular or becomes subject to the Slashdot effect.

With Django 1.0 now believed to be multithread safe, the lack of which was in part why prefork was recommended previously, that advice should perhaps be revisited, or it should at least be made obvious that you need to consider tuning your Apache MPM settings if you intend to use prefork MPM.

Now, it needs to be stated that all of the above about mod_python equally applies to embedded mode of mod_wsgi. Thus, using mod_wsgi isn't necessarily some magic pill which will solve all your problems overnight.

Most people who change to using mod_wsgi don't actually have a problem though, but this is usually more by accident than by design. This is because they see the additional benefits they get from using daemon mode of mod_wsgi and choose to use it over embedded mode. By this simple decision they have escaped the main issue with embedded mode, which is that Apache can lazily create processes and that for prefork MPM the maximum number of processes is excessive.

Some have realised that mod_wsgi daemon mode seems to offer a more predictable memory usage profile and performance curve and as a result fervently recommend it, but at the same time they still don't seem to understand what the problems with embedded mode, as outlined above, actually were. So, hopefully the explanation above will help in clearing up why, not just in the case of mod_wsgi daemon mode vs mod_wsgi embedded mode, but also for the much maligned mod_python.

So, what should you be using? The simple answer is that if you don't understand how to configure Apache and see it as some huge beast then you should certainly be tossing out mod_python. Instead, you would be much better off using mod_wsgi daemon mode.

Should one ever use embedded mode? Technically running embedded mode with prefork MPM should offer the best performance, especially for machines with many cpus/cores. If however you don't have huge amounts of memory, don't dedicate the system to just the dynamic Python web application and you don't change the default MPM settings for Apache, then you are potentially setting yourself up for disaster.

In practice one also has to realise that the underlying web server is never usually going to be the bottleneck. Instead the bottleneck will be your Python web application and the database it uses. So, just because mod_wsgi embedded mode and prefork MPM may be the fastest solution out there for Apache and saves you a few milliseconds per request, that gain is going to be completely swallowed up by the overheads elsewhere in the system and not end up giving you any significant advantage.

You see a lot of people though still obsessing about the underlying raw performance of the web server. Frankly, you are just wasting your time. You will get greater benefits from concentrating on the performance of your application and using techniques such as caching and database query optimisation and indexing to make things run faster.

The final answer? Stop using mod_python, use mod_wsgi and run it with daemon mode instead. You will save yourself a lot of headaches by doing so.

A Python interpreter is not created for each request in mod_python.

There are various myths about mod_python floating around the net and it gets a bit annoying when one keeps seeing them posted again and again, especially when used to support some conjecture that some other system is so much better as a result. It gets more annoying though when they are posted on sites whose commenting feature doesn't work, or you have to jump through hoops to register for the site. As such, often one can't even correct the misconceptions and so the misinformation just keeps propagating and you can never kill it.

The latest instance of this is at www.pypi.info, and it isn't the first time I have seen this person making incorrect statements about mod_python and how it works.

So, let me set the record straight on one such myth about mod_python. That myth is that mod_python creates a separate Python interpreter instance for each request. This is just not the case.

What actually happens is that the first time a request arrives for an application which hasn't yet been loaded, and which hasn't been marked to run in the main Python interpreter, a new sub interpreter is created. That sub interpreter is then persistent within that process though, not being destroyed until the process itself is destroyed. Any subsequent requests for that application are then handled within the context of that same sub interpreter which has already been created, so no new sub interpreter is created. Even when multithreading is used, all the request handler threads execute within the context of that one sub interpreter. So, neither is a separate sub interpreter created for each distinct request handler thread in the Apache thread pool, as is also sometimes claimed.
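A toy model may make the lifecycle clearer. This is not how mod_python is implemented (that is C code inside Apache); it is only a sketch in plain Python where a dictionary stands in for the sub interpreters of one process, and the class and names are hypothetical.

```python
class ProcessModel(object):
    """Toy stand-in for one Apache child process hosting sub interpreters."""

    def __init__(self):
        self._interpreters = {}   # interpreter name -> state kept for process life
        self.created = 0          # how many sub interpreters were ever created

    def handle_request(self, interpreter_name):
        interpreter = self._interpreters.get(interpreter_name)
        if interpreter is None:
            # Only the first request for this application pays the creation cost.
            self.created += 1
            interpreter = {"name": interpreter_name, "requests": 0}
            self._interpreters[interpreter_name] = interpreter
        interpreter["requests"] += 1
        return interpreter


process = ProcessModel()
for _ in range(1000):
    process.handle_request("example-app")
print(process.created)   # prints 1
```

However many requests arrive, and whichever thread handles them, the count of created interpreters only goes up when a not yet seen application is requested.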

As to why that poster had the problems he had with mod_python, they are more likely because of which Apache MPM was selected when Apache was compiled, as well as not changing the default MPM settings used. Especially with prefork MPM, the default settings are more appropriate for static file serving and PHP. If you do not change the default MPM settings as well as tweak HTTP keep alive settings to make them more appropriate for a large persistent Python web application, you will see load spikes and memory issues, especially when a system is put under load.

From what I have seen, the majority of people setting up mod_python don't understand the need to change the default Apache configuration. It may well have been the case that back in time when mod_python was somewhat newer that you didn't have to, but this is because back then Python web applications were typically much smaller and so didn't have significant start up times or use large amounts of memory. These days though they do, so you need to change how Apache is configured to have it perform adequately.

Now, can we please stop trying to do performance and memory usage comparisons if you don't know how to setup the systems being compared properly. :-(

BTW, everything above also applies to mod_wsgi when using embedded mode, as it is implemented in a similar manner to mod_python. Because I'd rather not see these misconceptions about mod_python transfer onto mod_wsgi, I will blog later more about the real source of these load spikes and memory issues. In the meantime, if using mod_wsgi use daemon mode and you will not have to think about it, as these issues in the main only relate to embedded mode.

Monday, July 2, 2007

Web hosting landscape and mod_wsgi.

At the end of last year I described on the mod_python mailing list various ideas I had for how one could improve the situation with Python web hosting. These ideas were detailed in:

A subsequent discussion at the first SyPy meetup in January gave me the drive to follow up on the ideas and since then I have been furiously hacking away, with the result being the mod_wsgi module I spoke of in those posts.

As I described in those posts I saw mod_wsgi as only being a first step. Before considering again what one might do beyond mod_wsgi though, it is worthwhile to look at what mod_wsgi has become and how the result fits into the web hosting landscape. In particular, does it actually have the potential to improve the lot of Python developers by providing a compelling solution which will be attractive to companies providing commodity web hosting.

To understand this, one needs to look at what features mod_wsgi provides and specifically the two different modes of operation that have been implemented.

The first mode of operation I tend to refer to as 'embedded' mode. This is where your Python web application runs in the context of the standard Apache child processes. At least in terms of how Python sub interpreters are used, this is the same as how things work with mod_python. Thus, if you have both mod_python and mod_wsgi loaded, applications running under each will share the same process, although they generally would at least run in distinct Python sub interpreters. As far as sharing goes, the process may also be host to PHP or mod_perl applications as well.

Running applications in the Apache child processes would generally always result in the best performance possible when compared to other alternatives available for using Python with Apache such as mod_fastcgi and mod_scgi or even a second web server behind mod_proxy. This is because the Python application is running in the same process that is accepting and performing the initial parsing of the request from a client. In other words, overhead is as low as it can be as everything is done together in the one process.

In addition to the low overhead, there are also other positive benefits deriving from how Apache works when using this mode. The first is that Apache uses multiple child processes to handle requests. As a result, any contention for the Python GIL within the context of a single process is not an issue, as each process will be independent. Thus there is no impediment when using multi processor systems.

That said, the GIL is not as big a deal as some people make out, even when using Apache with only one multi-threaded child process for accepting requests. This is because the code which handles accepting of requests, determines which Apache handler should process the request, along with the code for reading the request content and writing out the response content, is all written in C and is in no way linked to Python. As a consequence there are large sections of code where the GIL is not being held. On top of that, the same web server may also be serving up static files where again the GIL doesn't even come into the picture. So, more than enough opportunity for making good use of those multiple processors.

The second major benefit comes from Apache's ability to scale up to meet increases in load. The way this works is that Apache will only initially create a certain number of child processes to handle requests. If however the number of requests builds up to the point that the processes wouldn't be able to keep pace, Apache will create additional child processes to meet the demand. It will keep doing this as needs be, although eventually it will stop based on whatever the maximum number of child processes is set to, so as not to totally overload your system.

When the number of requests finally starts to drop down once more, to recover resources Apache will start to kill off any child processes now deemed as unnecessary, eventually getting back to the starting level. So it is that Apache is able to comfortably deal with the ebb and flow of demand without unduly choking.

So there is a lot of good to be had from how Apache works when using mod_wsgi in this mode. At the same time however a number of issues also arise.

The first is that the child processes generally run as a special non privileged user. This means that this user needs to be given access to the files which make up an application or which the application in turn needs to read. This user will also need to be given special access to files or directories the application needs to write any data to. Because Apache may be used to host a number of different applications, it means however that all applications can read files making up any other application and make changes to any writable directories or files used by those other applications which are writable to the user.

The second problem is that although in mod_wsgi distinct Python sub interpreters are used to keep different applications separate, this isn't fool proof. Problems can arise where different applications attempt to use different versions of a particular C extension module, as Python only loads C extensions once for the whole process and not separately for each sub interpreter. Thus, which application gets to load their version first wins out and when subsequent applications load it, they will get the correct version of any Python wrappers, but that code may not match the API provided by the C extension module itself.

A third more serious problem however, is that since Python supports C extension modules, it would be possible for someone with nefarious intent to load a module which gives them access to other sub interpreters' data and code, thereby bypassing the firewalls put in place by mod_wsgi. Such a module would thus allow them to spy into another application, change how it works or steal private information. A very wily hacker may take this even further and poke into the internals of Apache, possibly inserting special handler code into various phases of the request processing cycle, or modifying configuration data used by other modules.

All up, what this means is that although mod_wsgi goes to great lengths to try and ensure that applications can't interfere with each other, it can't be made completely bullet proof. As a result, 'embedded' mode of mod_wsgi would only be suitable in situations where the owners of the web servers are also the owners of the applications running under it. At no time would it ever be recommended that 'embedded' mode would be suitable as a basis for running applications owned by different users in a web hosting environment.

Do note that these problems aren't the fault of mod_wsgi specifically. Some derive from the way Apache works and others from how Python works. Using mod_python as an alternative will not offer anything better. In fact mod_python actually has more problems due to the open nature of how it hooks into Apache, thus making it easier to modify the behaviour of Apache and potentially access into other applications or steal private information.

Originally the intent in writing mod_wsgi was to only target users who also controlled the web server they were using. As a consequence, these issues weren't specifically seen as being a problem that needed to be countered. During the development of mod_wsgi however, its very existence seemed to be raising the hopes of many that a suitable simple solution for commodity Python web hosting might not be far away, which meant that it was necessary to look at how one could address the problems. The end result of this was the addition of 'daemon' mode to mod_wsgi.

The main difference between 'daemon' and 'embedded' mode is that in 'daemon' mode the actual application code is not run within the context of the Apache child processes, but within separate daemon processes able to be run as a distinct user. Although there is a performance penalty resulting from having to proxy the request through to the distinct daemon process which is to handle the request, because the application is now isolated into a separate process the problems described above for 'embedded' mode are eliminated.

In the first instance, because the daemon process runs as a distinct user, only that user and not the user that the Apache child processes run as will need access to the Python code files that make up the application. The same applies to writable directories or files with them only needing to be modifiable by the user that the daemon process runs as. Thus, any actual Python code or private data pertaining to the application is protected and safe from access by other users of the system.

The only files which would still need to be readable to the user that the Apache child process runs as are any static files such as HTML pages, graphics or media files. This is because the main Apache child process would still provide the service of serving up these files.

The problem with C extension modules being global to a process is also eliminated with 'daemon' mode by the fact that multiple daemon processes can be created and each application assigned to its own process. This ability to isolate an application from others by assigning them to different processes also prevents hackers from interfering with another user's running application.

As a consequence, although 'embedded' mode would not be suitable for a server environment where applications owned by different users need to be hosted together, 'daemon' mode has the necessary protections available to make it safe to use in such a hostile environment and thus it would be suitable for shared web hosting environments.

When one looks at mod_wsgi as a whole, the result is a package which is suitable both for building high performance web sites and for commodity web hosting. In both cases configuration is simple, with the one application script file being suitable for use in both modes. A complex Python web application may even make use of both modes at the same time. For example, application components requiring better performance could be run in 'embedded' mode, but with other application components requiring special access privileges, which are memory hungry or processor intensive, being delegated off to distinct daemon processes.
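For those who haven't seen one, an application script file of the kind referred to can be as small as the following sketch. It is a generic WSGI hello world rather than anything specific to this post; mod_wsgi looks for a callable named application by default, and nothing in the script itself says which mode it will run in.

```python
def application(environ, start_response):
    """Minimal WSGI application: the same script works in either mode."""
    body = b"Hello from a WSGI application\n"
    status = "200 OK"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]
```

Whether this runs in 'embedded' or 'daemon' mode is decided purely by the Apache configuration, so the code does not need to change when you switch.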

In the end, this combination of abilities makes mod_wsgi a somewhat more flexible platform than other available solutions for developing WSGI applications using Apache. At the same time, because everything is in a single package all managed through Apache, configuration is much simpler and there is no need to install or manage any distinct back end infrastructure.

So, although my original plans didn't envision incorporating a 'daemon' mode, the effort in adding it has been quite worthwhile, with the elusive goal of a way of providing commodity web hosting for Python applications now perhaps being achievable after all. :-)

Friday, March 30, 2007

Reloading of Python code into web applications.

One of the major complaints with Python web frameworks is the need to restart the application whenever changes are made to the code. To try and avoid restarts or to make it easier to manage, different Python web frameworks and hosting technologies employ a number of different techniques. These include reloading Python modules into the existing running application, using a supervisor process to monitor for code changes and restart the actual server process automatically, or simply providing a means for a normal user, as opposed to a super user, to completely restart the web server.
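The supervisor approach usually boils down to comparing file modification times between polls. The following is a minimal sketch of just that detection step, with hypothetical helper names; what to do once a change is seen (restart the server process, typically) is left out.

```python
import os


def snapshot(paths):
    """Record the current modification time of each code file."""
    return {path: os.stat(path).st_mtime for path in paths}


def changed_files(before, after):
    """Return the files whose modification time differs between snapshots."""
    return sorted(path for path in after if before.get(path) != after[path])
```

A supervisor would take a snapshot at startup, poll every second or so, and restart the server process whenever changed_files() returns a non-empty list.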

As far as reloading Python modules into the existing running application goes, the most developed example of this technique is the module importer present in mod_python 3.3. The module importer in mod_python is different to other more simplistic module importers as it tracks the parent/child relationships between imported modules. This information means the module importer can determine that it needs to reload a top level request handler module even though the module itself hasn't changed but where some other module it is dependent on has changed.
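The idea of using parent/child relationships to decide on a reload can be sketched as a simple graph walk. This is an illustration of the concept only and not the mod_python importer's actual code; the data structures and file names are assumptions.

```python
def needs_reload(module, children, changed, _seen=None):
    """True if the module, or anything it imports transitively, has changed.

    children maps a module to the modules it imports; changed is the set of
    modules whose code files were modified on disk.
    """
    if _seen is None:
        _seen = set()
    if module in _seen:
        return False          # already visited, also guards against cycles
    _seen.add(module)
    if module in changed:
        return True
    return any(needs_reload(child, children, changed, _seen)
               for child in children.get(module, ()))


# Hypothetical application: handler.py imports views.py, which imports models.py.
children = {"handler.py": ["views.py"], "views.py": ["models.py"]}
print(needs_reload("handler.py", children, {"models.py"}))   # prints True
```

Here handler.py itself is untouched, yet it is flagged for reload because a module further down the chain changed.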

Whereas module importers normally keep modules in sys.modules and they must all be uniquely named, the mod_python module importer also avoids a whole host of problems by not holding the web application's modules in sys.modules but in a separate caching system whereby they are identified by the absolute path name of the module's code file. This means it is possible for the same name to be used for a code file in multiple directories without the need to artificially hold modules in a Python package to avoid name space collisions. When a module is reloaded it is also not reloaded on top of the existing module as is done for modules in sys.modules, therefore eliminating problems with module pollution when attributes are deleted from the code but still exist in the loaded module, as well as various multi-threading problems which can arise due to reloading new code and data on top of existing code and data that may be getting used at the time.
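A much simplified sketch of a path keyed module cache follows. It uses the standard importlib machinery of current Python versions rather than anything from mod_python, and the function and cache names are hypothetical. It leaves out the dependency tracking and thread safety that a real importer would need.

```python
import importlib.util
import os

_module_cache = {}   # absolute path -> (modification time, module object)


def import_by_path(path):
    """Load a module by file path, keeping it out of sys.modules."""
    path = os.path.abspath(path)
    mtime = os.stat(path).st_mtime
    cached = _module_cache.get(path)
    if cached is not None and cached[0] == mtime:
        return cached[1]
    # The module name need not be unique because the cache key is the path.
    name = os.path.splitext(os.path.basename(path))[0]
    spec = importlib.util.spec_from_file_location(name, path)
    # A changed file yields a brand new module object instead of being
    # reloaded on top of the old one.
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    _module_cache[path] = (mtime, module)
    return module
```

With this, two files both called utils.py in different directories load as separate modules, and code still holding a reference to an older module object keeps working with the old code and data.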

All these features do come with some cost in performance. More of an issue though is that the importer is bound to mod_python and thus is only of use in web applications which themselves bind closely to the mod_python API. As such, although mod_python has this quite sophisticated module importer, it is absolutely useless if you are running some WSGI based application on top of mod_python as by the very nature of what WSGI is, such an application will only use features that can be hosted on top of multiple hosting technologies so can't make use of it.

One could separate the module importer from mod_python and make it a standalone package, but even then you are unlikely to see it adopted by any of the existing Python web frameworks. This is because these Python web frameworks already have their own way of doing things, and even if the module importer may be a better solution, to use the module importer would more than likely break compatibility of older applications and require users to perform some restructuring of their code. Use of the module importer may also not be able to be made totally transparent and thus one would be forcing new concepts on the user that they have to deal with. Finally, as good as the mod_python module importer may be, it still isn't going to be suitable in all situations, and thus you will still end up with some subset of modules that cannot be safely reloaded into an existing running application.

So, all in all it is very unlikely that one will see any form of sophisticated module importer system for reloading modules on the fly that works properly and is used as some sort of standard across the various Python web frameworks. Instead one will continue to see half-baked solutions which don't really work. This will not necessarily be for lack of effort on the part of the people implementing them, it is just that reloading code safely into Python is hard for the general case, if not impossible.

Given the above, the only real practical solution that will work with all Python web frameworks is to throw away the interpreter contents and start over whenever one needs to pick up any code changes. To date this has always meant killing off the whole process and restarting it. This brute force approach may be fine where you manage and control your own web hosting environment, but isn't really practical for all those who rely on shared web hosting implemented using Apache for their web services. This type of service is problematic because the company managing the web server is hardly likely to be amenable to a constant stream of user requests to restart the web server every time they make a change to their code.
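For a standalone Python process that you control yourself, the brute force approach can be sketched roughly as follows. This is only an illustration under my own assumptions: the function names are invented for the example, the check is based purely on file modification times, and replacing the process with os.execv is something you could do for a self-managed server process, not for code running inside a shared Apache.

```python
import os
import sys

def snapshot_mtimes(paths):
    """Record the current modification time of each code file."""
    return {path: os.stat(path).st_mtime_ns for path in paths}

def changed_files(snapshot):
    """Return the files whose modification time differs from the snapshot."""
    changed = []
    for path, old_mtime in snapshot.items():
        try:
            new_mtime = os.stat(path).st_mtime_ns
        except OSError:
            # A file that has vanished or become unreadable counts as changed.
            changed.append(path)
            continue
        if new_mtime != old_mtime:
            changed.append(path)
    return changed

def restart_process():
    """Throw away the whole interpreter by replacing the current process."""
    os.execv(sys.executable, [sys.executable] + sys.argv)
```

A server would take a snapshot of its code files at startup, call changed_files() periodically or between requests, and call restart_process() when the list is not empty. Nothing is reloaded in place, so none of the reloading problems described earlier can arise, but every restart pays the full startup cost.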

Packages for Apache such as mod_fastcgi, mod_scgi and mod_proxy_ajp have tried to address this problem by moving the actual web application into a specialised back end process and merely using Apache as a proxy for requests. Again using proxying, one could even use another web server as the back end process.

By using a back end process in this way a number of problems can be solved. The first is that because the back end process can be dedicated to a specific user and run as that user, control for restarting it can then be handed to the user. The second problem that is solved is that any code is no longer executing as the user that Apache would run as. This eliminates problems with user code accessing parts of the file system it potentially shouldn't, and with user code interfering with a different user's code, as each user can be given their own dedicated file system space to write to.

Although these provide the control that a user needs, a solution that doesn't just use another Apache server instance as the back end process is going to be foreign to most web hosting companies and can as a result be hard for them to set up. This is because not only does one need to build and install the required Apache module, there are multiple choices as to how to implement the back end parts of the system and it may not be obvious why one should be used over another. This isn't made any better by the lack of good solid documentation and less than adequate support for running and managing such systems. For a web hosting company that wants something that is quick and cheap to get working and requires little management, they currently appear not to be particularly attractive solutions.

In terms of how else one can solve this problem, the only other alternative for Python is that instead of killing off the whole process which is hosting a web application, one could just destroy the particular interpreter instance within the running process. If one is to pursue this approach, what is required is the ability to create and control additional Python sub interpreters and to run the whole web application, or parts of it, inside a sub interpreter rather than the main Python interpreter. Having that, it would then be possible to kill off a particular sub interpreter, thereby destroying that part of the web application, and recreate it in a new sub interpreter using the new code base.

At present your standard Python runtime doesn't support such manipulation of Python sub interpreters. Using Python sub interpreters is not new though, with mod_python using sub interpreters to provide separation between web applications or parts of them. What mod_python doesn't allow though is for new sub interpreters to be created and used in some way from within a web application itself.

That there is currently no way of manipulating Python sub interpreters from within a Python application doesn't mean it can't be done though. All that is required is a C extension module that provides a means of creating the sub interpreters and then mediating a way of making a call from one sub interpreter to another. Although it sounds simple, in practice there are various gotchas in getting it to work correctly. It also potentially opens up a big can of worms due to issues that can arise with sharing objects between sub interpreters as well as safely managing the destruction of interpreters.

That is it for now though. In the next blog installment I'll go more into this idea of using disposable sub interpreters within Python web applications, explaining the various problems and also showing examples of some working code. I will also discuss how one could constrain the idea so as to make it a moderately safe technique to make use of in mod_wsgi and possibly other WSGI application stacks.