I can't find clear information on this anywhere. How do I get a Ceph cluster healthy again after removing an OSD? I just removed one of the four OSDs, following the same steps as in the documentation:
kubectl -n rook-ceph scale deployment rook-ceph-osd-2 --replicas=0
kubectl rook-ceph rook purge-osd 2 --force
2023-02-23 14:31:50.335428 W | cephcmd: loaded admin secret from env var ROOK_CEPH_SECRET instead of from file
2023-02-23 14:31:50.335546 I | rookcmd: starting Rook v1.10.11 with arguments 'rook ceph osd remove --osd-ids=2 --force-osd-removal=true'
2023-02-23 14:31:50.335558 I | rookcmd: flag values: --force-osd-removal=true, --help=false, --log-level=INFO, --operator-image=, --osd-ids=2, --preserve-pvc=false, --service-account=
2023-02-23 14:31:50.335563 I | op-mon: parsing mon endpoints: b=10.104.202.63:6789
2023-02-23 14:31:50.351772 I | cephclient: writing config file /var/lib/rook/rook-ceph/rook-ceph.config
2023-02-23 14:31:50.351969 I | cephclient: generated admin config in /var/lib/rook/rook-ceph
2023-02-23 14:31:51.371062 I | cephosd: validating status of osd.2
2023-02-23 14:31:51.371103 I | cephosd: osd.2 is marked 'DOWN'
2023-02-23 14:31:52.449943 I | cephosd: marking osd.2 out
2023-02-23 14:31:55.263635 I | cephosd: osd.2 is NOT ok to destroy but force removal is enabled so proceeding with removal
2023-02-23 14:31:55.280318 I | cephosd: removing the OSD deployment "rook-ceph-osd-2"
2023-02-23 14:31:55.280344 I | op-k8sutil: removing deployment rook-ceph-osd-2 if it exists
2023-02-23 14:31:55.293007 I | op-k8sutil: Removed deployment rook-ceph-osd-2
2023-02-23 14:31:55.303553 I | op-k8sutil: "rook-ceph-osd-2" still found. waiting...
2023-02-23 14:31:57.315200 I | op-k8sutil: confirmed rook-ceph-osd-2 does not exist
2023-02-23 14:31:57.315231 I | cephosd: did not find a pvc name to remove for osd "rook-ceph-osd-2"
2023-02-23 14:31:57.315237 I | cephosd: purging osd.2
2023-02-23 14:31:58.845262 I | cephosd: attempting to remove host '\x02' from crush map if not in use
2023-02-23 14:32:03.047937 I | cephosd: no ceph crash to silence
2023-02-23 14:32:03.047963 I | cephosd: completed removal of OSD 2
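Note the line "osd.2 is NOT ok to destroy but force removal is enabled so proceeding with removal" in the log above: Ceph was warning that the OSD's data had not yet been re-replicated elsewhere. A sketch of the usual pre-removal checks (run from the rook-ceph-tools pod; the OSD id is this cluster's, adapt as needed):

```shell
# Mark the OSD out first so Ceph starts migrating its data away:
ceph osd out osd.2

# Watch recovery progress until backfill completes:
ceph -s

# Verify the OSD can be removed without losing redundancy. This only
# succeeds once no PGs depend on it anymore:
ceph osd safe-to-destroy osd.2

# Only then purge it via the Rook plugin (without --force):
# kubectl rook-ceph rook purge-osd 2
```

Skipping the `safe-to-destroy` check and forcing the removal is what left the cluster degraded here.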
Here is the cluster status before and after the removal.
[root@rook-ceph-tools-6cd9f76d46-bl4tl /]# ceph status
cluster:
id: 75b45cd3-74ee-4de1-8e46-0f51bfd8a152
health: HEALTH_OK
services:
mon: 3 daemons, quorum a,b,c (age 43h)
mgr: a(active, since 42h), standbys: b
mds: 1/1 daemons up, 1 hot standby
osd: 4 osds: 4 up (since 43h), 4 in (since 43h)
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 13 pools, 201 pgs
objects: 1.13k objects, 1.5 GiB
usage: 2.0 GiB used, 38 GiB / 40 GiB avail
pgs: 201 active+clean
io:
client: 1.3 KiB/s rd, 7.5 KiB/s wr, 2 op/s rd, 0 op/s wr
[root@rook-ceph-tools-6cd9f76d46-bl4tl /]# ceph status
cluster:
id: 75b45cd3-74ee-4de1-8e46-0f51bfd8a152
health: HEALTH_WARN
Degraded data redundancy: 355/2667 objects degraded (13.311%), 42 pgs degraded, 144 pgs undersized
services:
mon: 3 daemons, quorum a,b,c (age 43h)
mgr: a(active, since 42h), standbys: b
mds: 1/1 daemons up, 1 hot standby
osd: 3 osds: 3 up (since 28m), 3 in (since 17m); 25 remapped pgs
rgw: 1 daemon active (1 hosts, 1 zones)
data:
volumes: 1/1 healthy
pools: 13 pools, 201 pgs
objects: 1.13k objects, 1.5 GiB
usage: 1.7 GiB used, 28 GiB / 30 GiB avail
pgs: 355/2667 objects degraded (13.311%)
56/2667 objects misplaced (2.100%)
102 active+undersized
42 active+undersized+degraded
33 active+clean
24 active+clean+remapped
io:
client: 1.2 KiB/s rd, 2 op/s rd, 0 op/s wr
If I did this wrong, how should I do it correctly in the future?
Thanks.
Update:
[root@rook-ceph-tools-6cd9f76d46-bl4tl /]# ceph health detail
HEALTH_WARN 1 MDSs report slow metadata IOs; Reduced data availability: 9 pgs inactive, 9 pgs down; Degraded data redundancy: 406/4078 objects degraded (9.956%), 50 pgs degraded, 150 pgs undersized; 1 daemons have recently crashed; 256 slow ops, oldest one blocked for 6555 sec, osd.1 has slow ops
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
mds.ceph-filesystem-a(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 6490 secs
[WRN] PG_AVAILABILITY: Reduced data availability: 9 pgs inactive, 9 pgs down
pg 13.5 is down, acting [0,1,NONE]
pg 13.7 is down, acting [1,0,NONE]
pg 13.b is down, acting [1,0,NONE]
pg 13.e is down, acting [0,NONE,1]
pg 13.15 is down, acting [0,NONE,1]
pg 13.16 is down, acting [0,1,NONE]
pg 13.18 is down, acting [0,NONE,1]
pg 13.19 is down, acting [1,0,NONE]
pg 13.1e is down, acting [1,0,NONE]
[WRN] PG_DEGRADED: Degraded data redundancy: 406/4078 objects degraded (9.956%), 50 pgs degraded, 150 pgs undersized
pg 2.8 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 2.9 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 2.a is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 2.b is stuck undersized for 108m, current state active+undersized, last acting [1,0]
pg 2.c is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 2.d is stuck undersized for 108m, current state active+undersized, last acting [1,0]
pg 2.e is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 5.9 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 5.a is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
pg 5.b is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
pg 5.c is stuck undersized for 108m, current state active+undersized+degraded, last acting [1,0]
pg 5.d is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
pg 5.e is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
pg 5.f is stuck undersized for 108m, current state active+undersized+degraded, last acting [1,0]
pg 6.8 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 6.9 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 6.a is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 6.c is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 6.d is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 6.e is stuck undersized for 108m, current state active+undersized, last acting [1,0]
pg 6.f is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 8.0 is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
pg 8.1 is stuck undersized for 108m, current state active+undersized+degraded, last acting [1,0]
pg 8.2 is stuck undersized for 108m, current state active+undersized, last acting [1,0]
pg 8.3 is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
pg 8.4 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 8.6 is stuck undersized for 108m, current state active+undersized+degraded, last acting [1,0]
pg 8.7 is stuck undersized for 108m, current state active+undersized+degraded, last acting [1,0]
pg 9.0 is stuck undersized for 108m, current state active+undersized, last acting [1,0]
pg 9.1 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 9.2 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 9.5 is stuck undersized for 108m, current state active+undersized, last acting [1,0]
pg 9.6 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 9.7 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 11.0 is stuck undersized for 108m, current state active+undersized, last acting [1,0]
pg 11.2 is stuck undersized for 108m, current state active+undersized, last acting [1,0]
pg 11.3 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 11.4 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 11.5 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 11.7 is stuck undersized for 108m, current state active+undersized, last acting [0,1]
pg 12.0 is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
pg 12.2 is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
pg 12.3 is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
pg 12.4 is stuck undersized for 108m, current state active+undersized+remapped, last acting [1,0]
pg 12.5 is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
pg 12.6 is stuck undersized for 108m, current state active+undersized+remapped, last acting [1,0]
pg 12.7 is stuck undersized for 108m, current state active+undersized+degraded, last acting [0,1]
pg 13.1 is stuck undersized for 108m, current state active+undersized, last acting [1,NONE,0]
pg 13.2 is stuck undersized for 108m, current state active+undersized, last acting [0,NONE,1]
pg 13.3 is stuck undersized for 108m, current state active+undersized, last acting [1,0,NONE]
pg 13.4 is stuck undersized for 108m, current state active+undersized+remapped, last acting [0,1,NONE]
[WRN] RECENT_CRASH: 1 daemons have recently crashed
osd.3 crashed on host rook-ceph-osd-3-6f65b8c5b6-hvql8 at 2023-02-23T16:54:29.395306Z
[WRN] SLOW_OPS: 256 slow ops, oldest one blocked for 6555 sec, osd.1 has slow ops
[root@rook-ceph-tools-6cd9f76d46-bl4tl /]# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 18 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'ceph-blockpool' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 35 lfor 0/0/31 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'ceph-objectstore.rgw.control' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 181 lfor 0/181/179 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 4 'ceph-objectstore.rgw.meta' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 54 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 5 'ceph-filesystem-metadata' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 137 lfor 0/0/83 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 6 'ceph-filesystem-data0' replicated size 3 min_size 2 crush_rule 5 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 92 lfor 0/0/83 flags hashpspool stripe_width 0 application cephfs
pool 7 'ceph-objectstore.rgw.log' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 273 lfor 0/273/271 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 8 'ceph-objectstore.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 98 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 9 'ceph-objectstore.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 8 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 113 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 10 'qa' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 310 lfor 0/0/137 flags hashpspool,selfmanaged_snaps max_bytes 42949672960 stripe_width 0 application rbd
pool 11 'ceph-objectstore.rgw.otp' replicated size 3 min_size 2 crush_rule 9 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 123 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 12 '.rgw.root' replicated size 3 min_size 2 crush_rule 10 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 308 lfor 0/308/306 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 13 'ceph-objectstore.rgw.buckets.data' erasure profile ceph-objectstore.rgw.buckets.data_ecprofile size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 200 lfor 0/0/194 flags hashpspool,ec_overwrites stripe_width 8192 application rook-ceph-rgw
[root@rook-ceph-tools-6cd9f76d46-f4vsj /]# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 17 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
pool 2 'ceph-blockpool' replicated size 3 min_size 2 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 39 lfor 0/0/35 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 3 'ceph-objectstore.rgw.control' replicated size 3 min_size 2 crush_rule 2 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 194 lfor 0/194/192 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 4 'ceph-objectstore.rgw.meta' replicated size 3 min_size 2 crush_rule 3 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 250 lfor 0/250/248 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 5 'ceph-filesystem-metadata' replicated size 3 min_size 2 crush_rule 4 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 70 lfor 0/0/55 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 6 'ceph-filesystem-data0' replicated size 3 min_size 2 crush_rule 5 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 115 lfor 0/0/103 flags hashpspool stripe_width 0 application cephfs
pool 7 'ceph-objectstore.rgw.log' replicated size 3 min_size 2 crush_rule 6 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 84 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 8 'ceph-objectstore.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 7 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 100 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 9 'ceph-objectstore.rgw.buckets.non-ec' replicated size 3 min_size 2 crush_rule 8 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 122 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 10 'ceph-objectstore.rgw.otp' replicated size 3 min_size 2 crush_rule 9 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 135 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 11 '.rgw.root' replicated size 3 min_size 2 crush_rule 10 object_hash rjenkins pg_num 8 pgp_num 8 autoscale_mode on last_change 144 flags hashpspool stripe_width 0 pg_num_min 8 application rook-ceph-rgw
pool 12 'ceph-objectstore.rgw.buckets.data' erasure profile ceph-objectstore.rgw.buckets.data_ecprofile size 3 min_size 2 crush_rule 11 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 167 lfor 0/0/157 flags hashpspool,ec_overwrites stripe_width 8192 application rook-ceph-rgw
pool 13 'qa' replicated size 2 min_size 1 crush_rule 1 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 267 lfor 0/0/262 flags hashpspool,selfmanaged_snaps max_bytes 32212254720 stripe_width 0 application qa,rbd
[root@rook-ceph-tools-6cd9f76d46-f4vsj /]# ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 0.02939 root default
-5 0.02939 region nbg1
-4 0.02939 zone nbg1-dc3
-11 0.01959 host k8s-qa-pool1-7b6956fb46-cvdqr
1 ssd 0.00980 osd.1 up 1.00000 1.00000
3 ssd 0.00980 osd.3 up 1.00000 1.00000
-3 0.00980 host k8s-qa-pool1-7b6956fb46-mbnld
0 ssd 0.00980 osd.0 up 1.00000 1.00000
[root@rook-ceph-tools-6cd9f76d46-f4vsj /]# ceph osd crush rule dump
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"type": 1,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 1,
"rule_name": "ceph-blockpool",
"type": 1,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 2,
"rule_name": "ceph-objectstore.rgw.control",
"type": 1,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 3,
"rule_name": "ceph-objectstore.rgw.meta",
"type": 1,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 4,
"rule_name": "ceph-filesystem-metadata",
"type": 1,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 5,
"rule_name": "ceph-filesystem-data0",
"type": 1,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 6,
"rule_name": "ceph-objectstore.rgw.log",
"type": 1,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 7,
"rule_name": "ceph-objectstore.rgw.buckets.index",
"type": 1,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 8,
"rule_name": "ceph-objectstore.rgw.buckets.non-ec",
"type": 1,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 9,
"rule_name": "ceph-objectstore.rgw.otp",
"type": 1,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 10,
"rule_name": ".rgw.root",
"type": 1,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
},
{
"rule_id": 11,
"rule_name": "ceph-objectstore.rgw.buckets.data",
"type": 3,
"steps": [
{
"op": "set_chooseleaf_tries",
"num": 5
},
{
"op": "set_choose_tries",
"num": 100
},
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_indep",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]
1 Answer
I'm not familiar with Rook, but apparently those rulesets were created for you? In any case, they all use "host" as the failure domain with size 3, and with only two hosts that requirement cannot be satisfied. I assume your fourth OSD was on a third host, which is why your cluster is now degraded. You need to add at least one more host so that your PGs can recover successfully. As for the erasure-coded pool, it also has "host" as its failure domain with size = 3 (I assume the EC profile is k=2, m=1?), so it needs three hosts as well. To recover the replicated pools you could change their size to 2, but I wouldn't recommend that permanently, only for recovery purposes. Since you can't change the EC profile, that pool will stay degraded until you add a third OSD node. To answer your other questions:
1. Failure domain: this really depends on your setup; it can be rack, chassis, datacenter, etc. But for a setup this small, "host" as the failure domain makes sense.
2. The more OSDs per host, the more recovery options you have. The warnings are useful: Ceph alerts you when it notices a disk failure, but with enough OSDs and hosts it can recover automatically. If you look at your `ceph osd tree` output, the 3 OSDs are spread across only 2 hosts, and that is what is hurting you right now.
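The temporary recovery step mentioned above (dropping the replicated pools to size 2) can be sketched as follows. This is a stop-gap, not a permanent fix; the pool names are taken from the `ceph osd pool ls detail` output above, and which pools you shrink is your call:

```shell
# Temporary recovery only: with two hosts, a size-3 replicated pool can
# never satisfy a 'host' failure domain. Dropping size to 2 lets those
# PGs go active+clean until a third host is added back.
for pool in ceph-blockpool ceph-filesystem-metadata ceph-filesystem-data0; do
    ceph osd pool set "$pool" size 2
done

# The EC pool (ceph-objectstore.rgw.buckets.data) cannot be shrunk this
# way; it stays degraded until a third host with an OSD is available.

# Once the third host/OSD is back, restore the original replication:
# ceph osd pool set <pool> size 3
```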