Ceph cluster shows the "XXX daemons have recently crashed" warning

Problem:

The Ceph cluster keeps reporting "XXX daemons have recently crashed", and the number keeps growing.

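Solution:

One or more Ceph daemons have crashed recently, and the crash has not yet been archived (acknowledged) by an administrator. This may indicate a software bug, a hardware problem (for example, a failing disk), or some other problem.

All crashes in the system can be listed with: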

# ceph crash ls
ID                                                               ENTITY     NEW 
2020-05-02_00:53:25.028694Z_b29d405c-2512-4b80-916f-46c45c2cd6a9 osd.94         
2020-05-02_00:56:33.807897Z_feea566f-f237-42fd-aadf-45a5e8047896 osd.94                
2020-05-02_05:41:03.542296Z_21a06b0b-f2bc-42d1-8d50-5c104e150c9e mon.node01     
2020-05-02_09:52:51.146773Z_4e637ead-80df-42df-93f0-42c84ab8feb3 osd.19
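When the count keeps growing, it helps to see which daemon is responsible for most of the crashes. The following is a minimal sketch, not part of Ceph itself: the helper name count_crashes_by_entity is hypothetical, and it assumes the column layout shown in the listing above (an "ID ENTITY NEW" header, then one crash per line with the entity in the second column).

```shell
# Count crashes per daemon from `ceph crash ls`-style output read on stdin.
# Assumption: the header line starts with "ID" and the entity is column 2.
count_crashes_by_entity() {
    awk '$1 != "ID" && NF >= 2 { n[$2]++ }
         END { for (e in n) printf "%d %s\n", n[e], e }' |
        sort -k1,1nr -k2,2   # highest count first, then by entity name
}

# Usage on a cluster node (not executed here):
#   ceph crash ls | count_crashes_by_entity
```

For the listing above this would put osd.94 first with two crashes, which suggests checking that OSD and its disk before the other daemons.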

New crashes can be listed with:

# ceph crash ls-new

Information about a specific crash can be examined with:

# ceph crash info <crash-id>

For example:

# ceph crash info 2020-05-02_05:41:03.542296Z_21a06b0b-f2bc-42d1-8d50-5c104e150c9e
{
    "os_version_id": "18.04", 
    "utsname_release": "4.15.0-55-generic", 
    "os_name": "Ubuntu", 
    "entity_name": "mon.node01", 
    "timestamp": "2020-05-02 05:41:03.542296Z", 
    "process_name": "ceph-mon", 
    "utsname_machine": "x86_64", 
    "utsname_sysname": "Linux", 
    "os_version": "18.04.3 LTS (Bionic Beaver)", 
    "os_id": "ubuntu", 
    "utsname_version": "#60-Ubuntu SMP Tue Jul 2 18:22:20 UTC 2019", 
    "backtrace": [
        "(()+0x12890) [0x7f6c9f2f3890]", 
        "(gsignal()+0xc7) [0x7f6c9e3ebe97]", 
        "(abort()+0x141) [0x7f6c9e3ed801]", 
        "(()+0x8c957) [0x7f6c9ede0957]", 
        "(()+0x92ab6) [0x7f6c9ede6ab6]", 
        "(()+0x92af1) [0x7f6c9ede6af1]", 
        "(()+0x92d24) [0x7f6c9ede6d24]", 
        "(()+0x1424b) [0x7f6c9f51424b]", 
        "(tc_new()+0x283) [0x7f6c9f535943]", 
        "(rocksdb::Arena::AllocateNewBlock(unsigned long)+0x6c) [0x55c1aabe88ac]", 
        "(rocksdb::Arena::AllocateFallback(unsigned long, bool)+0x4b) [0x55c1aabe89db]", 
        "(rocksdb::Arena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*)+0x110) [0x55c1aabe8b80]", 
        "(rocksdb::ConcurrentArena::AllocateAligned(unsigned long, unsigned long, rocksdb::Logger*)+0xd4) [0x55c1aaaff004]", 
        "(()+0x5a3273) [0x55c1aab6b273]", 
        "(()+0x5a32f0) [0x55c1aab6b2f0]", 
        "(rocksdb::MemTable::Add(unsigned long, rocksdb::ValueType, rocksdb::Slice const&, rocksdb::Slice const&, bool, rocksdb::MemTablePostProcessInfo*)+0xfc) [0x55c1aaafa5bc]", 
        "(rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, rocksdb::Slice const&, rocksdb::ValueType)+0x1bd) [0x55c1aab609ed]", 
        "(rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, rocksdb::Slice const&)+0x26) [0x55c1aab615d6]", 
        "(rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0xa19) [0x55c1aab58de9]", 
        "(rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, bool, unsigned long, rocksdb::DB*, bool, bool, bool)+0x14b) [0x55c1aab5cecb]", 
        "(rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x13f6) [0x55c1aaa80f06]", 
        "(rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x30) [0x55c1aaa82660]", 
        "(RocksDBStore::submit_common(rocksdb::WriteOptions&, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x88) [0x55c1aaa342f8]", 
        "(RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x8c) [0x55c1aaa34c3c]", 
        "(MonitorDBStore::apply_transaction(std::shared_ptr<MonitorDBStore::Transaction>)+0x76b) [0x55c1aa80a02b]", 
        "(Paxos::begin(ceph::buffer::v14_2_0::list&)+0x562) [0x55c1aa90bca2]", 
        "(Paxos::propose_pending()+0x127) [0x55c1aa90d5f7]", 
        "(Paxos::finish_round()+0x50a) [0x55c1aa90de1a]", 
        "(Paxos::commit_finish()+0x5fc) [0x55c1aa90fd6c]", 
        "(C_Committed::finish(int)+0x34) [0x55c1aa913d54]", 
        "(Context::complete(int)+0x9) [0x55c1aa84a359]", 
        "(MonitorDBStore::C_DoTransaction::finish(int)+0x94) [0x55c1aa913ac4]", 
        "(Context::complete(int)+0x9) [0x55c1aa84a359]", 
        "(Finisher::finisher_thread_entry()+0x17f) [0x7f6ca05227bf]", 
        "(()+0x76db) [0x7f6c9f2e86db]", 
        "(clone()+0x3f) [0x7f6c9e4ce88f]"
    ], 
    "utsname_hostname": "node01", 
    "crash_id": "2020-05-02_05:41:03.542296Z_21a06b0b-f2bc-42d1-8d50-5c104e150c9e", 
    "archived": "2020-05-06 14:13:12.975173", 
    "ceph_version": "14.2.6"
}
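Most of the backtrace consists of anonymous frames such as "(()+0x12890)", which say little on their own. As a rough aid for reading a report like the one above, the sketch below keeps only the frames that carry a symbol name. The helper name named_frames is hypothetical, and the filter assumes the pretty-printed layout shown above, with one frame per line.

```shell
# Print only the symbolized frames of a crash report's "backtrace" array,
# reading the JSON text from stdin. Anonymous frames like "(()+0x12890) [...]"
# are skipped, and the trailing "+0xOFFSET) [0xADDRESS]" part is dropped.
# Assumption: one frame per line, as in the `ceph crash info` output above.
named_frames() {
    grep -E '^[[:space:]]*"\([A-Za-z_]' |
        sed -E 's/^[[:space:]]*"\(//; s/\+0x[0-9a-f]+\) \[0x[0-9a-f]+\]",?[[:space:]]*$//'
}

# Usage on a cluster node (not executed here):
#   ceph crash info <crash-id> | named_frames
```

In the example report, the named frames run from tc_new() and rocksdb::Arena::AllocateNewBlock up through abort(), so the monitor aborted while RocksDB was allocating memory. If jq is installed, `jq -r '.backtrace[]'` is a more robust way to pull out the raw frames.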

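This warning can be silenced by "archiving" the crash (perhaps after an administrator has examined it), so that it no longer triggers the warning: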

# ceph crash archive <crash-id>

Similarly, all new crashes can be archived with:

# ceph crash archive-all
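Archiving everything at once can hide crashes that nobody has looked at yet. A middle ground is to archive only the crashes of one daemon whose problem has already been diagnosed. The sketch below assumes that `ceph crash ls-new` prints the same columns as `ceph crash ls`; the function name archive_crashes_for is hypothetical.

```shell
# Archive every new crash that belongs to a single daemon (for example osd.94)
# and leave the crashes of all other daemons flagged as new.
# Assumption: `ceph crash ls-new` prints an "ID ENTITY NEW" header followed by
# one crash per line, with the ID in column 1 and the entity in column 2.
archive_crashes_for() {
    entity=$1
    ceph crash ls-new |
        awk -v e="$entity" '$1 != "ID" && $2 == e { print $1 }' |
        while read -r id; do
            ceph crash archive "$id"
        done
}

# Usage on a cluster node (not executed here):
#   archive_crashes_for osd.94
```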

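Archived crashes are still visible via ceph crash ls, but not via ceph crash ls-new.

The time period that "recent" refers to is controlled by the option mgr/crash/warn_recent_interval (default: two weeks).

These warnings can be disabled entirely with: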

# ceph config set mgr mgr/crash/warn_recent_interval 0
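Rather than disabling the warning, the window can also be shortened. The sketch below assumes that the option takes a value in seconds, which would make the two-week default 1209600. Please confirm the unit against your Ceph version before relying on it. The helper name days_to_seconds is hypothetical.

```shell
# Convert a number of days to seconds (1 day = 86400 seconds).
days_to_seconds() {
    echo $(( $1 * 86400 ))
}

# Usage on a cluster node (not executed here), assuming the option is in seconds:
#   ceph config set mgr mgr/crash/warn_recent_interval "$(days_to_seconds 7)"
#   ceph config get mgr mgr/crash/warn_recent_interval
```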

References:

https://docs.ceph.com/docs/master/rados/operations/health-checks/?highlight=backfillfull%20ratio
https://docs.ceph.com/docs/master/mgr/crash/?highlight=crash
