/var/log/DMIT-NOC.log
/var/log/DMIT-NOC.log pinned «DMIT Hong Kong Strategy Adjustment Hong Kong will have 3 network profiles a) Tier 1 (New) Competitive Internet routing; DMIT will try reasonable effort to lower transcontinental latency. Estimated HK-EU latency (coming soon, April 20~May 1.) 115ms+(RU)…»
LAX experienced an IO suspension issue for 37 minutes.
It has been resolved.
A limited number of VMs were impacted.

We are still investigating the cause and details.
Stay tuned.
https://github.com/ceph/ceph/pull/47702

This is a rare issue, and it has happened only once in this cluster.
It will not always happen.

The issue has been identified: LAX is running Ceph Octopus, which is EOL, so no further updates will be released for it.

DMIT will upgrade Ceph to solve this issue.
According to https://docs.ceph.com/en/latest/releases/index.html, this patch has not been released yet.

1/24 of the OSDs crashed due to this bug. Because it was a crash, the OSD did not actively report itself OUT to the MON. Other OSDs kept pinging it and got no response to their heartbeats, yet the MON and the cluster as a whole still considered it UP and IN. The MON has a 600s wait period before it marks a dead OSD as OUT.
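The gap between what actually happened and what the MON believed can be sketched as a toy model. This is not Ceph code and it only mirrors the description in this notice, not Ceph's real failure-detection state machine. The function name `osd_views` is hypothetical; the 600s default is the wait period quoted above.

```python
def osd_views(t: float, crash_at: float, out_wait: float = 600.0) -> tuple[str, str]:
    """Return (actual OSD state, MON's view of the OSD) at time t, in seconds.

    Hypothetical model of the notice: after a crash the MON keeps
    treating the OSD as up+in until the wait period has elapsed.
    """
    if t < crash_at:
        return ("running", "up+in")
    if t < crash_at + out_wait:
        # Stale view: the OSD is dead, but nothing has marked it OUT yet.
        return ("crashed", "up+in")
    return ("crashed", "out")


# Length of the window in which the MON's view is stale.
stale_window_seconds = 600.0
for second in (0, 10, 300, 610):
    actual, mon_view = osd_views(second, crash_at=10)
    print(f"t={second:>3}s actual={actual:<8} mon_view={mon_view}")
```

During the middle state, requests that map to the crashed OSD have nowhere to go, which is consistent with the slow PGs reported here.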

This suddenly put up to 20 PGs into active+clean+laggy; the laggy state also prevented some PGs from being read.
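A PG state such as active+clean+laggy is a set of flags joined by `+`. As a minimal sketch, the snippet below splits such strings and counts the PGs carrying the laggy flag. The PG IDs and the sample mapping are made up for illustration; they are not data from the LAX cluster.

```python
from __future__ import annotations


def parse_pg_state(state: str) -> set[str]:
    """Split a PG state string like 'active+clean+laggy' into its flags."""
    return set(state.split("+"))


def laggy_pgs(pg_states: dict[str, str]) -> list[str]:
    """Return the IDs of PGs whose state includes the 'laggy' flag."""
    return sorted(
        pgid for pgid, state in pg_states.items()
        if "laggy" in parse_pg_state(state)
    )


# Hypothetical sample: PG ID -> state string.
sample = {
    "2.1a": "active+clean",
    "2.1b": "active+clean+laggy",
    "2.1c": "active+clean+laggy",
}
print(laggy_pgs(sample))
```

A PG can be active and clean and still be laggy, so checking only for active+clean would have missed this incident.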

We are still digging into this problem and tuning the cluster for it.
After reviewing the logs, we found that the IO was not suspended; it was stuck in the laggy state.
Because some PGs were so slow, we thought the IO had been suspended.

The next step is to find the best practice and cluster behavior for handling OSD crashes.

No data loss has been discovered or reported.
Ceph in LAX has been upgraded to Quincy; the next release should contain the bug fix.
Maintenance Notice:
Region: Hong Kong
Time: April 17 ~ April 21, 2023 (Local Time)
Length: The total length of interruption will not exceed 7 hours. Service degradation will end no later than the end of the maintenance window.

Content:
DMIT will perform a complete upgrade of Hong Kong by:
1. Switching to a new ISP and upgrading all links to 100G
2. 100G link with Equinix IX
3. Doubling the number of hosts and switching to NVMe storage
4. 10G connection with CTG AS4809 GIA, CUG AS10099, CMI AS58453 CMONET
5. Increasing Anti-DDoS capacity

Reason for interruption: Relocating the network core rack out of the reseller space and signing a contract directly with Equinix.
Message from CTG:

Please be informed that urgent maintenance has been scheduled; details are below.

Time window (Date/Time):
2023-04-21 16:00:00 - 2023-04-21 22:00:00 UTC

2023-04-22 00:00:00 - 2023-04-22 06:00:00 UTC+8

Maintenance Description:
Hidden fault troubleshooting

Maintenance Location:
International/Overseas
[DMIT Location: TYO Pro]

Service Impact:
The circuit will experience an outage of up to 120 minutes during the maintenance window.

Affected Circuit(s):
TYO-GIA-*****CTG
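The two time windows in the CTG message are the same six-hour window written in two timezones. A minimal sketch to check this with Python's standard library follows; the helper name `parse` and the constant `UTC_PLUS_8` are illustrative, and UTC+8 is treated as a fixed offset.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
UTC_PLUS_8 = timezone(timedelta(hours=8))  # fixed offset, no DST handling


def parse(ts: str, tz: timezone) -> datetime:
    """Parse 'YYYY-MM-DD HH:MM:SS' and attach the given timezone."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=tz)


utc_window = (parse("2023-04-21 16:00:00", UTC),
              parse("2023-04-21 22:00:00", UTC))
local_window = (parse("2023-04-22 00:00:00", UTC_PLUS_8),
                parse("2023-04-22 06:00:00", UTC_PLUS_8))

# Timezone-aware datetimes compare by absolute instant.
same_window = utc_window == local_window
window_hours = (utc_window[1] - utc_window[0]) / timedelta(hours=1)
print(same_window, window_hours)
```

The stated outage of up to 120 minutes therefore falls somewhere inside a single six-hour window, not two separate ones.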
We successfully completed the DMIT core network rack migration yesterday (local time).

At present,
- We are in the process of transferring your data from the Hyper-converged Ceph system to the Standalone NVMe Ceph infrastructure.
- The new set of EPYC servers has been installed.

Pending tasks:
- Ensure the new server set is fully operational.
- NTT (AS2914) has used up their 100G interfaces at Equinix HK; DMIT is awaiting the completion of their DWDM deployment.
- Cogent (AS174) has not met the agreed-upon service delivery timeline. DMIT has pre-patched the cross-connect, and we are now waiting for their LOA to finish the connection.
- CUG (AS10099) is still waiting for the prefix filter to be updated.
Completed emergency maintenance:

Emergency maintenance has been successfully completed.

Our on-site engineer identified a critical hardware failure on one of the nodes and resolved the issue before posting any notice.

We will soon migrate all HKG VMs to a new cluster, and the old cluster will be rebuilt.

This is necessary due to architectural issues that have prevented any possible updates for the past two years.
TYO partial routing failure report

Hours earlier, Telstra prematurely terminated its IP Transit service, which was scheduled to be terminated on May 1.

As a result, we could not bring the new IP Transit service up in time, which led to some Internet null routes. This has now been resolved.

DMIT's refreshed IP Transit vendors:
Tokyo: +AS2914, +AS17676, [-AS4637, -AS3491]
Hong Kong: +AS2914, +AS9002, [-AS4637, -AS3491]
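In the lists above, `+` marks a newly added transit ASN and `-` a removed one. A small sketch that parses this notation and compares the two locations is shown below; the function name `parse_transit_changes` is hypothetical and the input strings are copied from this post.

```python
import re


def parse_transit_changes(line: str) -> tuple[set[int], set[int]]:
    """Return (added ASNs, removed ASNs) from a '+ASn, [-ASm]' style line."""
    added = {int(n) for n in re.findall(r"\+AS(\d+)", line)}
    removed = {int(n) for n in re.findall(r"-AS(\d+)", line)}
    return added, removed


tokyo = parse_transit_changes("+AS2914, +AS17676, [-AS4637, -AS3491]")
hong_kong = parse_transit_changes("+AS2914, +AS9002, [-AS4637, -AS3491]")

# ASNs added in both locations, and whether the removals match.
added_in_both = tokyo[0] & hong_kong[0]
same_removals = tokyo[1] == hong_kong[1]
print(added_in_both, same_removals)
```

Both locations drop the same two ASNs and share one of the new ones, with one location-specific addition each.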
HKG.Pro

DMIT has completed the new vendor connection for HKG Pro and would like to make the following adjustments for our customers.

TINY:
- 200GB > 400GB (Transfer)
- 0.75GB > 1.0GB (RAM)
- 10GB > 20GB (SSD)
- 40Mbps > 100Mbps

STARTER:
- 500GB > 800GB (Transfer)
- 20GB > 40GB (SSD)
- 1.5GB > 2.0GB (RAM)
- $69.9 > $79.9 (existing orders keep the old price)

MINI:
- 800GB > 1200GB (Transfer)
- 40GB > 60GB (SSD)
- 100Mbps > 200Mbps
- $109.9 > $119.9 (existing orders keep the old price)

MICRO:
- 1000GB > 1600GB (Transfer)
- 40GB > 80GB (SSD)
- 100Mbps > 200Mbps
- $139.9 > $159.9 (existing orders keep the old price)

MEDIUM:
- 1500GB > 1800GB (Transfer)
- 80GB > 160GB (SSD)

LARGE:
- 2000GB > 2400GB (Transfer)
- 160GB > 240GB (SSD)

GIANT:
- 4000GB > 4800GB (Transfer)
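To compare the adjustments across plans, the relative changes can be computed directly from the figures listed above. This is only an arithmetic sketch: the dictionary layout and the `pct_change` helper are illustrative, and the price figures apply to new orders only.

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new."""
    return (new - old) * 100 / old


# (old, new) pairs copied from the plan list in this post.
transfer_gb = {
    "TINY": (200, 400), "STARTER": (500, 800), "MINI": (800, 1200),
    "MICRO": (1000, 1600), "MEDIUM": (1500, 1800),
    "LARGE": (2000, 2400), "GIANT": (4000, 4800),
}
price_usd = {
    "STARTER": (69.9, 79.9), "MINI": (109.9, 119.9), "MICRO": (139.9, 159.9),
}

transfer_pct = {plan: pct_change(*pair) for plan, pair in transfer_gb.items()}
price_pct = {plan: round(pct_change(*pair), 1) for plan, pair in price_usd.items()}

for plan, pct in transfer_pct.items():
    print(f"{plan:<8} transfer +{pct:.0f}%")
for plan, pct in price_pct.items():
    print(f"{plan:<8} price    +{pct}%")
```

For the three plans with a price change, the transfer allowance grows by a larger percentage than the price.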
HKG.Pro

We received notification from the vendor that the AS-Path needed to be corrected. After the configuration was completed, the BGP session restarted twice, which triggered the CN2 route convergence limit.

CN2 has route convergence restrictions: CTGnet accepted the route but CN2 did not, resulting in a null route.

It takes 30 minutes to return to normal.