PostgreSql优化思路
背景
yanxuan-ianus网关使用kong进行架构,而kong的DB支持两种:postgresql及cassandra。
实际落地时,选取了postgresql。(选型原因:1,两个DB均无严选应用场景,DBA无相关经验,需开发自行运维;2,对严选网关应用场景而言,两者性能优劣差别不大)
演进阶段
基础功能
- 单点部署db,支持正常的db操作;(8C,16G)
集群分表
- 根据大集群进行表隔离;
主从复制
- 新增从节点,实时备份主节点数据,保证存储数据不丢失;
读写分离
- 通过增加多个从节点,并且将读操作分摊到从节点上,缓减主节点的读写压力;
应用优化
- 将原来的每小时过期,全量拉取,优化为永不失效;
配置优化
- 随着集群增加,调整max_connection配置;
- 当前最大连接数1500,MEM占70%,CPU达60%,已经达到系统瓶颈,此时部分节点会因为取不到连接而更新不够及时;
应用优化
- 将全量数据拉取,限定到子集群维度的事件更新和拉取,减少数据更新量以及并发节点数;
- 数据加载优化,基于分页拉取;
- 发布方式优化,不进行并行发布;
连接优化
- 需引入pgbouncer,使用连接池管理;
高可用支持
- 当前从节点仅仅做了数据备份,实际故障时,还无法进行自动切换。引入pgpool,对故障迁移等进行支持;
关键配置
- shared_buffers:使用默认值128MB
- work_mem:使用默认值4MB
- wal_buffers:使用默认值4MB
- max_connections:1500
- superuser_reserved_connections:使用默认值3
慢请求优化
EXPLAIN命令
执行explain (query text)或explain (buffers true, analyze true, verbose true) (query text)命令,查看SQL的执行计划(注意,前者不会实际执行SQL,后者会实际执行而且能得到详细的执行信息),对其中的Table Scan涉及的表,建立索引。
explain (analyze,verbose,costs,buffers,timing,summary) SELECT (extract(epoch from created_at)*1000)::bigint as created_at, "config", "id", "name", "cluster", "enabled", "consumer_id", "api_id", "group_id" FROM plugins;
explain (analyze,verbose,costs,buffers,timing,summary) SELECT "enabled", (extract(epoch from created_at)*1000)::bigint as created_at, "config", "id", "group_id", "name", "api_id", "consumer_id" FROM plugins;
调整pgsql的log statement输出
2019-11-21 16:24:16.715 CST [16911] LOG: statement: SELECT (extract(epoch from created_at)*1000)::bigint as created_at, "config", "id", "name", "cluster_name", "enabled", "consumer_id", "api_i
d", "group_id" FROM plugins
2019-11-21 16:24:43.546 CST [16911] LOG: duration: 26831.257 ms
2019-11-21 16:30:59.129 CST [17490] LOG: statement: SELECT "enabled", (extract(epoch from created_at)*1000)::bigint as created_at, "config", "id", "group_id", "name", "api_id", "consumer_
id" FROM plugins
2019-11-21 16:31:00.340 CST [17490] LOG: duration: 1211.576 ms
调整pgsql log输出更具体信息
分析时间损耗点,在elapsed
2019-11-21 20:41:57.408 CST [6409] LOG: 00000: EXECUTOR STATISTICS
2019-11-21 20:41:57.408 CST [6409] DETAIL: ! system usage stats:
! 0.359945 s user, 0.140978 s system, 1.075173 s elapsed
! [0.361944 s user, 0.140978 s system total]
! 0/0 [0/0] filesystem blocks in/out
! 0/115 [0/1492] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 539/573 [541/576] voluntary/involuntary context switches
2019-11-21 20:41:57.408 CST [6409] LOCATION: ShowUsage, postgres.c:4479
2019-11-21 20:41:57.408 CST [6409] STATEMENT: SELECT "enabled", (extract(epoch from created_at)*1000)::bigint as created_at, "config", "id", "group_id", "name", "api_id", "consumer_id" FR
OM plugins
2019-11-21 20:41:57.408 CST [6409] LOG: 00000: duration: 1077.131 ms
2019-11-21 20:44:00.733 CST [6549] LOG: 00000: EXECUTOR STATISTICS
2019-11-21 20:44:00.733 CST [6549] DETAIL: ! system usage stats:
! 0.410938 s user, 0.088987 s system, 22.086449 s elapsed
! [0.412937 s user, 0.089986 s system total]
! 0/0 [0/0] filesystem blocks in/out
! 0/23779 [0/25139] page faults/reclaims, 0 [0] swaps
! 0 [0] signals rcvd, 0/0 [0/0] messages rcvd/sent
! 1561/181 [1563/181] voluntary/involuntary context switches
2019-11-21 20:44:00.733 CST [6549] LOCATION: ShowUsage, postgres.c:4479
2019-11-21 20:44:00.733 CST [6549] STATEMENT: SELECT (extract(epoch from created_at)*1000)::bigint as created_at, "config", "id", "name", "cluster", "enabled", "consumer_id", "api_id", "g
roup_id" FROM plugins
2019-11-21 20:44:00.733 CST [6549] LOG: 00000: duration: 22088.423 ms
排查具体elapsed耗时
请求的处理,实际卡在ClientWrite事件,也就是数据回写操作上
SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL AND backend_type = 'client backend';
datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start
| state_change | wait_event_type | wait_event | state | backend_xid | backend_xmin |
query | backend_type
-------+-------------+------+----------+----------+------------------+----------------+-----------------+-------------+-------------------------------+-------------------------------+---------------------
----------+-------------------------------+-----------------+------------+--------+-------------+--------------+--------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------+----------------
16385 | ianus | 3252 | 16384 | ianus | | 10.200.178.45 | | 47928 |
2019-11-22 09:24:57.558639+08 | 2019-11-22 09:24:57.562216+08 | 2019-11-22 09:24:57.562216+08 | 2019-11-22 09:24:57.562223+08
| Client | ClientWrite | active | | 7968336 |
SELECT "cluster_name", "config", "id", "name", "enabled", "consumer_id", (extract(epoch from created_at)*1000)::bigint as created_at, "api_id", "group_id" FROM plugins | client backend
pg_stat_statements插件
定位慢SQL
pg_stat_activity
查看当前长时间执行,一直不结束的SQL
问题汇总
- 连接数耗尽
FATAL: remaining connection slots are reserved for non-replication superuser connections
注意事项
- 连接数约多,占用的内存越大,单个连接占用内存:work_mem(该大小会影响查询效率),因此总内存占用为:connections * work_mem
- 索引名称在单个数据库中是唯一的!
- 索引名称不能与统一模式中的其他索引、表、视图、序列、复合类型等重名;
- 统一模式中的两个表不能具有相同名称的索引。
DO $$
BEGIN
IF to_regclass('myschema.my_name') IS NULL THEN
CREATE INDEX my_name ON myschema.mytable (mycolumn);
END IF;
END$$;
参考
https://www.cybertec-postgresql.com/en/tuning-max_connections-in-postgresql/ https://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling https://girders.org/2012/09/scaling-postgresql-with-pgpool-and-pgbouncer.html https://blog.pythian.com/comparing-pgpool-ii-and-pgbouncer/