apache2 wsgi socket rotation (by quqi99)

Author: Zhang Hua  Published: 2023-05-22

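Copyright notice: This article may be freely reposted, provided the repost includes a hyperlink to the original source, the author information, and this copyright notice.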

Problem

for lp bug https://bugs.launchpad.net/ubuntu/+source/mod-wsgi/+bug/1863232

reproducer

lxc launch ubuntu:focal focal
lxc exec focal bash
su - ubuntu
sudo apt update

sudo apt-get install apache2 libapache2-mod-wsgi -y
sudo sed -i 's/^KeepAlive Off/KeepAlive On/g' /etc/apache2/apache2.conf
sudo sed -i '/^KeepAliveTimeout/ s/ .*/ 15/' /etc/apache2/apache2.conf
cat <<EOF | sudo tee /var/www/html/hello-world.py
def application(environ, start_response):
    status = '200 OK'
    output = b'Hello World!\n'
    response_headers = [('Content-type', 'text/plain'),
                        ('Content-Length', str(len(output)))]
    start_response(status, response_headers)
    return [output]
EOF
cat <<EOF | sudo tee /etc/apache2/conf-available/wsgi.conf
WSGIScriptAlias /hello-world /var/www/html/hello-world.py
WSGIDaemonProcess 127.0.0.1 processes=2 threads=16 display-name=%{GROUP}
WSGIProcessGroup 127.0.0.1
EOF
sudo a2enconf wsgi
curl 127.0.0.1/hello-world

No problem with mpm_event + WSGISocketRotation=on

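As an aside, the Apache server has three stable MPM (Multi-Processing Module) modes:

  • prefork has no notion of threads; it is a multi-process model in which each process handles one connection. It is stable and responds quickly. Its drawback is heavy memory consumption when the number of connections is large
  • worker is a multi-process, multi-thread model: each process has several threads and each thread handles one connection. Compared with prefork, worker uses less system memory. Note, however, the compatibility of Apache in worker mode with program modules such as php
  • event is a variant of worker that separates the service processes from the connections, so with KeepAlive enabled it can sustain a higher concurrent load than worker. The event mode is said not to support https access well (issues related to HTTP authentication); this limitation applies mainly to older Apache releases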

With the default mpm_event (sudo a2dismod mpm_worker && sudo a2enmod mpm_event && sudo systemctl restart apache2), there is no problem even with WSGISocketRotation=On.

sudo a2dismod mpm_worker
sudo a2enmod mpm_event
sudo systemctl restart apache2

$ ls -1 /var/run/apache2/wsgi.*.sock
/var/run/apache2/wsgi.4570.0.1.sock
$ sudo systemctl reload apache2.service 
$ ls -1 /var/run/apache2/wsgi.*.sock
/var/run/apache2/wsgi.4570.1.1.sock

#send two HTTP requests in the same connection (keep-alive used)
cat <<EOF >http-request
GET /hello-world HTTP/1.1
Host: 127.0.0.1
Connection: keep-alive

EOF

$ (cat http-request; sleep 1; cat http-request; sleep 9) | telnet 127.0.0.1 80 2>&1 | while read line; do echo "$(date +'%T') == $line"; done
07:12:03 == Trying 127.0.0.1...
07:12:03 == Connected to 127.0.0.1.
07:12:03 == Escape character is '^]'.
07:12:03 == HTTP/1.1 200 OK
07:12:03 == Date: Mon, 22 May 2023 07:12:03 GMT
07:12:03 == Server: Apache/2.4.41 (Ubuntu)
07:12:03 == Content-Length: 13
07:12:03 == Vary: Accept-Encoding
07:12:03 == Keep-Alive: timeout=15, max=100
07:12:03 == Connection: Keep-Alive
07:12:03 == Content-Type: text/plain
07:12:03 == 
07:12:03 == Hello World!
07:12:04 == HTTP/1.1 200 OK
07:12:04 == Date: Mon, 22 May 2023 07:12:04 GMT
07:12:04 == Server: Apache/2.4.41 (Ubuntu)
07:12:04 == Content-Length: 13
07:12:04 == Vary: Accept-Encoding
07:12:04 == Keep-Alive: timeout=15, max=99
07:12:04 == Connection: Keep-Alive
07:12:04 == Content-Type: text/plain
07:12:04 == 
07:12:04 == Hello World!
07:12:13 == Connection closed by foreign host.
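
The telnet check above can also be scripted. The following is a minimal sketch, assuming Python 3 and only its standard http.client module; the function name request_twice and its gap parameter are illustrative and not part of the original reproducer. It sends two GETs over one connection and reports whether the same socket was reused.

```python
import http.client
import time


def request_twice(host, port, path="/hello-world", gap=1.0):
    """Send two GETs over one connection; return (statuses, reused)."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    statuses = []
    try:
        conn.request("GET", path, headers={"Connection": "keep-alive"})
        resp = conn.getresponse()
        resp.read()  # drain the body so the socket can be reused
        statuses.append(resp.status)
        first_sock = conn.sock  # None if the server asked to close

        time.sleep(gap)  # same pause as "sleep 1" in the telnet pipeline

        conn.request("GET", path, headers={"Connection": "keep-alive"})
        resp = conn.getresponse()
        resp.read()
        statuses.append(resp.status)
        reused = first_sock is not None and conn.sock is first_sock
    finally:
        conn.close()
    return statuses, reused
```

Calling request_twice("127.0.0.1", 80) against the reproducer should correspond to the two "200 OK" responses in the telnet output. If the server drops the idle connection between the two requests, the second request would typically surface as an exception such as http.client.RemoteDisconnected rather than a status code.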

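Problem on bionic with mpm_worker + WSGISocketRotation=on

With mpm_worker + WSGISocketRotation=on, the problem occurs on bionic but not on focal. bionic ships mod-wsgi 4.5.17, while the commit checked below (13169f2a) is only contained in tags from 4.6.0 onwards, which focal's 4.6.8 includes.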

sudo a2dismod mpm_event
sudo a2enmod mpm_worker
apache2ctl -M |grep mpm
sudo systemctl restart apache2
(cat http-request; sleep 1; cat http-request; sleep 9) | telnet 127.0.0.1 80 2>&1 | while read line; do echo "$(date +'%T') == $line"; done

$ rmadison mod-wsgi |grep -E 'bionic|focal'
 mod-wsgi | 4.5.17-1               | bionic          | source
 mod-wsgi | 4.5.17-1ubuntu1.1      | bionic-security | source
 mod-wsgi | 4.5.17-1ubuntu1.1      | bionic-updates  | source
 mod-wsgi | 4.6.8-1ubuntu3         | focal           | source
 mod-wsgi | 4.6.8-1ubuntu3.1       | focal-security  | source
 mod-wsgi | 4.6.8-1ubuntu3.1       | focal-updates   | source
 
 hua@t440p:/bak/work/apache2/mod_wsgi$ git tag --contains 13169f2a0610d7451fae92a414e8e20b91e348c9 |head -n2
4.6.0
4.6.1
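
The comparison between the packaged versions and the first tag containing the commit can be sketched as follows. This is only an illustration: the helper names are hypothetical, the check looks at the upstream version number alone, and it cannot see patches that a distribution may have backported.

```python
import re


def upstream_version(pkg_version):
    """Return the upstream part of a Debian version as a tuple of ints."""
    # Drop an optional epoch ("1:") and the Debian/Ubuntu revision ("-1ubuntu1.1").
    upstream = pkg_version.split(":")[-1].split("-")[0]
    return tuple(int(part) for part in re.findall(r"\d+", upstream))


def has_commit(pkg_version, first_tag="4.6.0"):
    """True if the packaged upstream version is at or after first_tag."""
    return upstream_version(pkg_version) >= upstream_version(first_tag)


for name, version in [("bionic-updates", "4.5.17-1ubuntu1.1"),
                      ("focal-updates", "4.6.8-1ubuntu3.1")]:
    print(name, version, has_commit(version))
```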

No problem with WSGISocketRotation=off either

echo 'WSGISocketRotation off' |sudo tee -a /etc/apache2/conf-available/wsgi.conf
$ sudo systemctl reload apache2.service 
$ ls -1 /var/run/apache2/wsgi.*.sock
/var/run/apache2/wsgi.5187.u33.1.sock
$ sudo systemctl reload apache2.service 
$ ls -1 /var/run/apache2/wsgi.*.sock
/var/run/apache2/wsgi.5187.u33.1.sock

(cat http-request; sleep 1; cat http-request; sleep 9) | telnet 127.0.0.1 80 2>&1 | while read line; do echo "$(date +'%T') == $line"; done
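
The difference between the two sets of socket names listed in this post can be checked mechanically. The sketch below is based only on those listings: reading the middle field as a counter that changes on reload (0, then 1) when rotation is on, and as "u" plus a number (u33) when rotation is off, is an inference rather than something taken from the mod_wsgi documentation. The function names are hypothetical.

```python
import re

SOCKET_RE = re.compile(
    r"wsgi\.(?P<pid>\d+)\.(?P<middle>u?\d+)\.(?P<id>\d+)\.sock$")


def parse_socket(path):
    """Split a mod_wsgi daemon socket path into its visible fields."""
    m = SOCKET_RE.search(path)
    if m is None:
        raise ValueError("not a wsgi socket path: %r" % path)
    middle = m.group("middle")
    return {
        "pid": int(m.group("pid")),
        "middle": middle,
        # Assumption from the listings: a "u" prefix means a stable name.
        "rotating": not middle.startswith("u"),
    }


def path_changed(before, after):
    """True if a reload produced a different socket path."""
    return before != after


print(path_changed("/var/run/apache2/wsgi.4570.0.1.sock",
                   "/var/run/apache2/wsgi.4570.1.1.sock"))
print(path_changed("/var/run/apache2/wsgi.5187.u33.1.sock",
                   "/var/run/apache2/wsgi.5187.u33.1.sock"))
```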

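Testing port 8754 directly

Why does the customer still hit this problem with mpm_worker? Given that KeepAliveTimeout=5, test port 8754 (nova-api) directly inside nova-cloud-controller/0 in the customer environment: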

cat <<EOF >http-request
GET / HTTP/1.1
Host: 127.0.0.1
Connection: keep-alive

EOF
(cat http-request; sleep 1) | telnet 127.0.0.1 8754
(cat http-request; sleep 6; cat http-request; sleep 9) | telnet 127.0.0.1 8754 2>&1 | while read line; do echo "$(date +'%T') == $line"; done
for i in {1..100}; do echo $i; date; (cat http-request; sleep 1; cat http-request; sleep 9) | telnet 127.0.0.1 8754 2>&1 | while read line; do echo "$(date +'%T') == $line"; done >>output.txt; done
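
The 100 iterations append a lot of lines to output.txt, so a small summary helps. The following is a rough sketch that assumes the "HH:MM:SS == HTTP/1.1 200 OK" line format shown in the earlier telnet output; count_statuses is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the status line as echoed by the "while read line" loop.
STATUS_RE = re.compile(r"== HTTP/1\.[01] (\d{3})\b")


def count_statuses(lines):
    """Count HTTP status codes found in timestamped telnet output."""
    counts = Counter()
    for line in lines:
        m = STATUS_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

A possible use is `with open("output.txt") as f: print(count_statuses(f))`. With two requests per iteration, 100 clean iterations would be expected to show 200 responses with status 200; fewer responses, or other status codes, point at iterations where the second request failed.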

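20231219 - Update - keystone performance

The customer has a keystone + k8s environment that accesses openstack through cluster-api-provider-openstack and also calls the openstack API through the gophercloud SDK, but ran into a keystone performance bottleneck. The customer reported a bug after seeing the following errors (53.xxx.193.59 below is a keystone unit):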

(keystone.server.flask.application): 2023-04-06 20:57:24,736 WARNING Could not recognize Fernet token
(keystone.server.flask.application): 2023-04-06 21:00:18,425 WARNING Authorization failed. The request you have made requires authentication. from 53.xxx.193.59

2023-04-21 01:10:37.082 133954 WARNING keystonemiddleware.auth_token [req-bb74db36-92e6-4ef7-8b21-54ef67c1786a 9d8880924c9e437eb2f5cf82dbc5ee53 3a6b67aac35a4756ad7a9a1d7de1d3f3 - 06a56cbc3f1944dfbdded97d4880d5a3 06a56cbc3f1944dfbdded97d4880d5a3] Identity response: <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>503 Service Unavailable</title>
</head><body>
<h1>Service Unavailable</h1>
<p>The server is temporarily unable to service your
request due to maintenance downtime or capacity
problems. Please try again later.</p>
<hr>
<address>Apache/2.4.41 (Ubuntu) Server at keystone.xxx.net Port 35357</address>
</body></html>
: keystoneauth1.exceptions.http.ServiceUnavailable: Service Unavailable (HTTP 503)

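1, How to enable logging. By default, apache only records request header information such as the method, route, GET parameters and UA, and stores it in /var/log/apache2/access.log. We need to enable apache's logging of POST data: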

1 - Enable the dump_io module:
sudo a2enmod dump_io

2 - Add the following lines to each of the virtual hosts in /etc/apache2/sites-enabled/openstack_https_frontend.conf and /etc/apache2/sites-enabled/wsgi-openstack-api.conf:
CustomLog /var/log/apache2/keystone_access.log "internal-endpoint %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %i X-Auth-Token: %{X-Auth-Token}i X-Subject-Token: %{X-Subject-Token}i"
ErrorLog /var/log/apache2/keystone_error.log
DumpIOInput On
DumpIOOutput On
LogLevel dumpio:trace7

3 - Restart apache

2, How to test with curl. The problem also occurs with curl directly, so it is probably not a bug in the provider (https://github.com/kubernetes-sigs/cluster-api-provider-openstack/issues/1647)

curl -g -i --cacert "/home/ubuntu/deploy/certs/combined_CA.crt" -X GET "https://nova.xxx.net:8774/v2.1/os-simple-tenant-usage/1d507296db5e4f749679377d7f4d256b?start=2023-10-01T00:00:00&end=2023-11-01T00:00:00" -H "Accept: application/json" -H "User-Agent: python-novaclient" -H "X-Auth-Token: {SHA256}00204d308bcac25eaaf98fa9d03b6e3385218b09485964a7045f8afa8f7ccdb6" -H "X-OpenStack-Nova-API-Version: 2.1"

for i in {1..100}; do date; echo $i; curl -H "User-Agent: curl-igortest5-$i" https://gnocchi.ixxx:8041; sleep 2; echo; done

3, How to simulate it with the gophercloud SDK:

#The errors should go away once keepalive is disabled for proxy pooling, i.e.
#SetEnv proxy-nokeepalive 1

#On each keystone server:
echo "SetEnv proxy-nokeepalive 1" | sudo tee /etc/apache2/conf-available/proxy-nokeepalive.conf
sudo ln -s ../conf-available/proxy-nokeepalive.conf /etc/apache2/conf-enabled/
sudo systemctl restart apache2
#sudo rm /etc/apache2/conf-enabled/proxy-nokeepalive.conf
#sudo systemctl restart apache2

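package main

import (
	"fmt"
	"sync"

	"github.com/gophercloud/gophercloud/openstack"
)

// getToken authenticates against keystone using the OS_* environment
// variables and prints whether a token was obtained.
func getToken(wg *sync.WaitGroup) {
	defer wg.Done()
	opts, err := openstack.AuthOptionsFromEnv()
	if err != nil {
		// Previously this error was silently overwritten by the next call.
		fmt.Println(err)
		return
	}
	if _, err = openstack.AuthenticatedClient(opts); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("got token OK")
}

func main() {
	var wg sync.WaitGroup
	// Start 199 concurrent token requests (i = 1..199), as in the original loop.
	for i := 1; i < 200; i++ {
		wg.Add(1)
		go getToken(&wg)
	}
	wg.Wait()
}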

4, Could it be a haproxy-*-timeout problem?
client(gophercloud) -> haproxy -> apache2 -> keystone
https://bugs.launchpad.net/charm-nova-cloud-controller/+bug/1827397

5, Besides 503 (503 Service Unavailable), the following 502 errors (502 Proxy Error) were also seen, so proxy-nokeepalive is worth trying

echo "SetEnv proxy-nokeepalive 1" | sudo tee /etc/apache2/conf-available/proxy-nokeepalive.conf
sudo ln -s ../conf-available/proxy-nokeepalive.conf /etc/apache2/conf-enabled/
sudo systemctl restart apache2

[Fri Aug 18 05:55:39.262631 2023] [proxy:error] [pid 145071:tid 140225363826432] [client 53.129.13.61:50990] AH00898: Error reading from remote server returned by /v3/auth/tokens
[Fri Aug 18 05:55:39.994612 2023] [proxy_http:error] [pid 145071:tid 140225229608704] (20014)Internal error (specific information not available): [client 53.129.13.61:50762] AH01102: error reading status line from remote server localhost:4980
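
To see how often each kind of proxy error occurs, the apache error log can be tallied by module and AH code. This is a minimal sketch assuming the log line layout shown above; count_error_codes is a hypothetical helper, and the regular expression only covers lines that carry a "[module:level]" field followed by an AH code.

```python
import re
from collections import Counter

# "[proxy:error] ... AH00898" -> module "proxy", code "AH00898"
ERROR_RE = re.compile(
    r"\[(?P<module>\w+):(?P<level>\w+)\].*?\b(?P<code>AH\d{5})\b")


def count_error_codes(lines):
    """Count (module, AH code) pairs in apache error log lines."""
    counts = Counter()
    for line in lines:
        m = ERROR_RE.search(line)
        if m:
            counts[(m.group("module"), m.group("code"))] += 1
    return counts
```

For example, `with open("/var/log/apache2/keystone_error.log") as f: print(count_error_codes(f).most_common())` would list the most frequent codes first, assuming the ErrorLog path configured earlier in this post.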

6, Other possible tuning:

1. proxy-initial-not-pooled
juju run -a gnocchi -- '
echo SetEnv proxy-initial-not-pooled 1 | tee /etc/apache2/conf-available/gnocchi.conf
a2enconf gnocchi
systemctl reload apache2
'

2. proxy-nokeepalive
juju run -a gnocchi -- '
echo SetEnv proxy-nokeepalive 1 | tee /etc/apache2/conf-available/gnocchi.conf
a2enconf gnocchi
systemctl reload apache2
'

3. mpm_worker and MaxConnectionsPerChild
juju run -a gnocchi -- '
a2dismod mpm_event
a2enmod mpm_worker
echo MaxConnectionsPerChild 25 | tee /etc/apache2/conf-available/gnocchi.conf
a2enconf gnocchi
systemctl restart apache2
'

4. Another thing we tested with the customer was adjusting
the worker values in <IfModule mpm_event_module> in apache2's config:
---
ServerLimit 300
ThreadsPerChild 1000
MaxRequestWorkers 300000
---
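
The three values above are related by simple arithmetic: for a threaded MPM, MaxRequestWorkers cannot usefully exceed ServerLimit multiplied by ThreadsPerChild. A small sketch of that check, with hypothetical function names:

```python
def max_request_workers(server_limit, threads_per_child):
    """Upper bound on simultaneous workers for a threaded MPM."""
    return server_limit * threads_per_child


def fits(server_limit, threads_per_child, configured_max):
    """True if MaxRequestWorkers fits within ServerLimit * ThreadsPerChild."""
    return configured_max <= max_request_workers(server_limit,
                                                 threads_per_child)


# Values from the tested configuration: 300 * 1000 = 300000
print(max_request_workers(300, 1000), fits(300, 1000, 300000))
```

Apache also caps ThreadsPerChild at ThreadLimit, which is not shown in the snippet above, so whether 1000 threads per child actually takes effect depends on that setting.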

7, The request path is as follows: client(gophercloud) -> haproxy -> apache2 -> keystone

vim etc/haproxy/haproxy.cfg
frontend tcp-in_public-port
bind *:5000
...
use_backend public-port_192.168.45.45 if net_192.168.45.45
backend public-port_192.168.45.45
balance leastconn
server keystone-1 192.168.45.45:4990 check
server keystone-0 192.168.45.13:4990 check
server keystone-2 192.168.45.53:4990 check

$ grep -E '192.168.45.45' sos_commands/networking/ip_-o_addr
51: eth0 inet 192.168.45.45/23 brd 192.168.45.255 scope global eth0\ valid_lft forever preferred_lft forever

vim etc/apache2/sites-available/openstack_https_frontend.conf
<VirtualHost 192.168.45.45:4990>
ServerName keystone-internal.xxx.net
ProxyPass / http://localhost:4980/
KeepAliveTimeout 75
MaxKeepAliveRequests 1000
...
</VirtualHost>

$ grep -r '*:4980' etc/apache2/sites-enabled/wsgi-openstack-api.conf -A9
<VirtualHost *:4980>
WSGIDaemonProcess keystone-public processes=12 threads=1 user=keystone group=keystone \
display-name=%{GROUP} lang=C.UTF-8 locale=C.UTF-8
WSGIProcessGroup keystone-public
WSGIScriptAlias /krb /usr/bin/keystone-wsgi-public
WSGIScriptAlias / /usr/bin/keystone-wsgi-public
WSGIApplicationGroup %{GLOBAL}
WSGIPassAuthorization On
KeepAliveTimeout 75
MaxKeepAliveRequests 1000

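The "processes=12 threads=1" in the 4980 vhost above means that, with this process/thread model, it can handle only 1x12=12 requests concurrently (https://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html). Other vhosts in /etc/apache2/sites-enabled/openstack_https_frontend.conf that do not use this thread model default to the mpm_event model from /etc/apache2/mods-enabled/mpm_event.conf. The script below uses 'grep -c' to count the connections open on each port and 'ps -o thcount' to count the threads open in these processes.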

#!/usr/bin/env bash

for i in {1..1000}; do
  echo ""
  date
  echo "apache2 active connections:"
  echo "tcp"
  echo -n "port 80:    "
  grep -c :0050 /proc/net/tcp
  echo -n "port 4980:"
  grep -c :1374 /proc/net/tcp
  echo -n "port 4990:  "
  c4990=$(grep -c :137E /proc/net/tcp)
  echo "${c4990}"
  if [ "${c4990}" -gt 1000  ]; then
    date >> /tmp/tcp.out
    cat /proc/net/tcp >> /tmp/tcp.out
  fi
  echo -n "port 5000:"
  grep -c :1388 /proc/net/tcp
  echo -n "port 35337:"
  grep -c :8A09 /proc/net/tcp
  echo -n "port 35347: "
  grep -c :8A13 /proc/net/tcp
  echo -n "port 35357:"
  grep -c :8A1D /proc/net/tcp
  echo "tcp6"
  echo -n "port 80:    "
  grep -c :0050 /proc/net/tcp6
  echo -n "port 4980:"
  grep -c :1374 /proc/net/tcp6
  echo -n "port 4990:  "
  c4990=$(grep -c :137E /proc/net/tcp6)
  echo "${c4990}"
  if [ "${c4990}" -gt 1000  ]; then
    date >> /tmp/tcp6.out
    cat /proc/net/tcp6 >> /tmp/tcp6.out
  fi
  echo -n "port 5000:"
  grep -c :1388 /proc/net/tcp6
  echo -n "port 35337:"
  grep -c :8A09 /proc/net/tcp6
  echo -n "port 35347: "
  grep -c :8A13 /proc/net/tcp6
  echo -n "port 35357:"
  grep -c :8A1D /proc/net/tcp6
  echo ""
  echo "apache2 processes and threads:"
  ps -o pid,comm,user,thcount -u www-data -u keystone
  echo "======================"
  sleep 5
done
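
The hex strings in the script are simply the port numbers as /proc/net/tcp prints them (for example 4980 is 1374 and 35357 is 8A1D). The sketch below shows that conversion and a per-port count; the function names are hypothetical. Unlike "grep -c", which counts a line when the pattern appears in either the local or the remote address, this version counts only sockets whose local port matches.

```python
from collections import Counter


def port_hex(port):
    """Format a port the way /proc/net/tcp prints it (4 upper-case hex digits)."""
    return format(port, "04X")


def count_local_ports(lines, ports):
    """Count sockets in /proc/net/tcp-style lines whose local port is in ports."""
    wanted = {port_hex(p): p for p in ports}
    counts = Counter({p: 0 for p in ports})
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or ":" not in fields[1]:
            continue  # header, date stamp or blank line
        local_port = fields[1].rsplit(":", 1)[1]
        if local_port in wanted:
            counts[wanted[local_port]] += 1
    return counts
```

A possible use is `with open("/proc/net/tcp") as f: print(count_local_ports(f, [80, 4980, 4990, 5000, 35337, 35347, 35357]))`.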

At the same time, grep /var/log/nova/nova-api-wsgi.log on nova-cloud-controller for "keystoneauth1.exceptions.http.ServiceUnavailable: Service Unavailable (HTTP 503)".

juju run -a nova-cloud-controller -- 'sudo grep "keystoneauth1.exceptions.http.ServiceUnavailable: Service Unavailable (HTTP 503)" /var/log/nova/nova-api-wsgi.log | wc -l'

The following command finds the IPs with the most connections to port 5000; the data can also be plotted.

# top 10 remote IP addresses connected to port 5000 (public keystone endpoint) sorted by connection count (all intervals)
grep -v -e Nov -e local_addres tcp.unit0.out | grep ":1388" | awk '{ print $3 }' | awk -F : '{ print $1 }' | sort | uniq -c | sort -nr | head
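
The pipeline above prints the remote addresses in the raw hex form used by /proc/net/tcp. The following sketch decodes them into dotted IPv4 addresses while counting. It assumes the capture came from a little-endian host (such as x86_64), where the address bytes appear reversed; the helper names are hypothetical, and only sockets whose local port is 5000 are counted, which is slightly narrower than the grep.

```python
import socket
import struct
from collections import Counter


def decode_ipv4(hex_addr):
    """Decode 'ADDR:PORT' from /proc/net/tcp into (ip, port)."""
    ip_hex, port_hex = hex_addr.split(":")
    # The kernel prints the address as a native 32-bit integer, so on a
    # little-endian host the bytes have to be reversed.
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def top_remote_ips(lines, local_port=5000, limit=10):
    """Return the most common remote IPs connected to local_port."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 3 or ":" not in fields[1]:
            continue  # header or date stamp line
        try:
            _, lport = decode_ipv4(fields[1])
            remote_ip, _ = decode_ipv4(fields[2])
        except (ValueError, struct.error):
            continue  # not an address pair
        if lport == local_port:
            counts[remote_ip] += 1
    return counts.most_common(limit)
```

The (ip, count) pairs returned by top_remote_ips can be fed into any plotting tool.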

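It also turned out that keystone/0 had the most connections. The experiments above show that these errors drop substantially in the following two cases:

  • The customer optimizes their applications to reduce the number of keystone requests
  • Using "SetEnv proxy-nokeepalive 1" greatly increases the number of connections, and with it the error rate, so the following tuning is recommended as well: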
1. update /etc/apache2/sites-enabled/wsgi-openstack-api.conf line 6 (for <VirtualHost *:35337> section) so 'processes=4' is changed to 'processes=8' (increase by 100%)
2. update /etc/apache2/sites-enabled/wsgi-openstack-api.conf line 34 ( for <VirtualHost *:4980> section) so the 'processes=12' is changed to 'processes=20' (increase by 66%)
3. sudo systemctl restart apache2
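
The effect of raising "processes" can be estimated from the WSGIDaemonProcess line itself, since a daemon group handles at most processes x threads requests at once. Below is a minimal sketch with a hypothetical helper name; the fallback values (1 process, 15 threads) are the mod_wsgi defaults used when an option is omitted.

```python
import re


def daemon_capacity(directive):
    """Concurrent requests a WSGIDaemonProcess group can handle."""
    opts = dict(re.findall(r"(\w+)=(\S+)", directive))
    processes = int(opts.get("processes", 1))
    threads = int(opts.get("threads", 15))  # mod_wsgi default when unset
    return processes * threads


before = ("WSGIDaemonProcess keystone-public processes=12 threads=1 "
          "user=keystone group=keystone display-name=%{GROUP}")
after = before.replace("processes=12", "processes=20")
print(daemon_capacity(before), daemon_capacity(after))
```

For the 4980 vhost this gives 12 concurrent requests before the change and 20 after it.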

8, Could the experiments also be run against the different keystone endpoints, e.g. 5000 (haproxy), 4990 (apache proxy) and 4980 (apache wsgi)? Note: 4980 does not support https, so https must be replaced with http (export OS_AUTH_URL=http://10.5.0.143:4980/v3)

9, nova, cinder, keystone, etc. also support memcache, which reduces the load on keystone
