/var/log/DMIT-NOC.log
We are still working on it. If your system is still running, we suggest that you do not reboot it, since I/O is currently suspended.
Remote hands are on the way to the SJC location to install the hardware required for the repair.

The SLA solution will be posted after the repair is done.
OSD recovery, backfill in progress.
Step 1 still needs ~4 hrs; 70% of VMs will return to normal.

Step 2 will take another ~4 hrs; 99% of VMs will return to normal.

Step 3 needs a whole day; it only affects I/O performance, not uptime.

The SLA delivered is lower than what the TOS offers. Reimbursement will be issued case by case; please submit a ticket after the event ends.

We are deeply sorry for the recent SLA drop that may have caused inconvenience to your business operations. We understand the importance of our services to your business and we take full responsibility for this interruption.

The fault report will be posted after the event.
Ceph does not allow the cluster to run after a partial recovery; step 2 is in progress.
Step 2 complete.

Because one OSD could not be recovered, and because data diverged in the meantime, 13/512 (2.5390625%) of the data cannot be recovered.
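
The percentage quoted here is simply the fraction 13/512 expressed as a percent. A quick check in Python, as a minimal sketch: the variable names are illustrative, and the message does not say what the 512 units are.

```python
from fractions import Fraction

# Figures quoted in the status update. What the units are is not stated
# in the message, so the names below are purely illustrative.
lost_units = 13
total_units = 512

# Fraction keeps the arithmetic exact before converting to a percentage.
lost_fraction = Fraction(lost_units, total_units)
lost_percent = float(lost_fraction * 100)

print(f"{lost_units}/{total_units} = {lost_percent}%")
```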

Once again, we apologize for any inconvenience or concern that this may have caused. We value your trust and we will continue to work hard to earn and maintain it.
We regret to inform you that the loss of certain Ceph PG objects may prevent your VM's filesystem from mounting a hard drive during system boot. While this issue can be resolved manually, addressing over 2,000 VMs at this location falls outside the scope of our unmanaged services. However, we would like to offer a compensation package as a token of apology.

====Compensation Package====
DMIT will provide a more detailed fault report. In the meantime, we will extend your service period by 30 days and permanently double the transfer capacity of your VM. (For UNMETERED plans, we will double your bandwidth as well.)
====Compensation Package====

To address the issue, we will need to take the following steps:

1. DMIT will stop all instances;
2. Clients try to start them one by one;
3. If you cannot get into the system:
a) Instance cannot boot: reinstall the system. A major (possibly header) object of your drive has been permanently lost; a rebuild is required.
b) Instance boots, but the system hangs: the filesystem has failed; fix it manually or rebuild.

Please accept our apologies for any inconvenience caused. We appreciate your patience and understanding as we work to address this issue.

After all of this, DMIT will first issue the 30-day service extension, then double the resources. No ticket is required for this.

The fault report will be ready after all of this.

[This has been emailed formally]
The initial Summary:

~March 1
On or about March 1, DMIT San Jose received a large number of VM orders (almost double the number of VMs running at that time).

~March 3
DMIT noticed the tight resources and immediately stopped accepting new orders.
Memory resources were offloaded to the two new nodes purchased last month.
Available storage was already below 30% at that time.

~March 6
On March 6, we increased the set-full-ratio of the OSD from 90% to 95% in order to prevent IO outages.

This was still not enough to solve the problem. We had already ordered a sufficient number of P5510 and P5520 7.68TB SSDs on March 3.
FedEx expected to deliver on March 7, and we were scheduled to install these SSDs on March 8.

Due to the California weather, the delivery was delayed to March 9 and we planned to install the SSDs immediately on March 10 to relieve the pressure.

~March 8
On the night of March 8, we completed network maintenance, which caused the OSD to reboot.
In addition, because the OSD was overfull, BlueStore did not have enough space to allocate its 4% log during startup, so the OSD refused to boot. At this point the only effect was reduced I/O performance.

~March 9
Due to continued writes, on the morning of March 9 another OSD failed and triggered a backfill. This set off a chain reaction in which a third OSD was written full and then failed to start, which eventually led to the current condition.

We immediately arranged the on-site installation for March 9, but some PGs were still lost.
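
To illustrate why a single failure on a nearly full cluster can cascade, here is a minimal sketch. It assumes the failed OSD's data is spread evenly over the survivors, which ignores CRUSH placement, replication, and Ceph's own backfill safeguards. The cluster size and the 88% utilisation are hypothetical; only the 95% full ratio and the 7.68TB drive size come from the summary.

```python
FULL_RATIO = 0.95  # the raised full ratio mentioned in the summary


def utilisation_after_failure(used, capacity, failed):
    """Return each surviving OSD's utilisation after an even redistribution.

    Simplified model (an assumption, not how Ceph actually places data):
    the failed OSD's data is split equally among all survivors.
    """
    survivors = [i for i in range(len(used)) if i != failed]
    extra = used[failed] / len(survivors)
    return {i: (used[i] + extra) / capacity[i] for i in survivors}


# Hypothetical cluster: six equal 7.68 TB OSDs, each 88% used.
capacity = [7.68] * 6
used = [0.88 * c for c in capacity]

after = utilisation_after_failure(used, capacity, failed=0)
over_full = [i for i, u in after.items() if u >= FULL_RATIO]
print(over_full)
```

In this toy model every survivor ends up above the full ratio, which is the kind of chain reaction described above: the less free space there is, the less room backfill has to absorb a failed OSD.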

=== Tech Notes
- San Jose uses DMIT's latest tech stack. We did not know that BlueStore uses 4% of the total OSD as a log; we thought it was included in the data.
Once the data uses all the space, the log cannot be allocated during initialization, which leads to failure.
- San Jose had not seen this rate of VM growth before; the doubled orders left us limited time to upgrade.
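
The first note can be sketched as a simple capacity check. This models only the description in the note (a log that needs 4% of the OSD at startup), not BlueStore's actual allocator. The function names are hypothetical and the utilisation figures are made up for illustration.

```python
LOG_RESERVE_FRACTION = 0.04  # the 4% figure from the note above


def can_allocate_log(osd_size_tib, data_used_tib,
                     reserve_fraction=LOG_RESERVE_FRACTION):
    """True if the free space left on the OSD can still hold the log."""
    reserve = osd_size_tib * reserve_fraction
    free = osd_size_tib - data_used_tib
    return free >= reserve


def effective_data_limit(osd_size_tib,
                         reserve_fraction=LOG_RESERVE_FRACTION):
    """Largest amount of data that still leaves room for the log."""
    return osd_size_tib * (1 - reserve_fraction)


osd_size = 7.68  # illustrative OSD size
print(can_allocate_log(osd_size, 0.90 * osd_size))  # 10% free: log fits
print(can_allocate_log(osd_size, 0.97 * osd_size))  # 3% free: log does not fit
print(effective_data_limit(osd_size))
```

Under this model the usable ceiling for data is 96% of the OSD, so an OSD that is allowed to fill beyond that point may be unable to restart.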
=== Management Notes
- DMIT will prepare to upgrade a location once its resource usage exceeds 60%.
- DMIT will reject orders if we cannot immediately keep resource usage below 80%.
Current usage of other locations:

LAX: 37.59 TiB of 83.84 TiB (45%). NVMe
Offering snapshots.

TYO: 8.27 TiB of 15.72 TiB (53%). SATA
Switching to NVMe in Q3 2023.

HKG: 38.19 TiB of 52.39 TiB (73%). SATA
New NVMe system is on the way.
SSDs will be ordered next week.
3.6673 TiB left before we stop/reject orders.
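
The 60% and 80% thresholds from the management notes can be applied to these figures with a small sketch. The function name and action labels are hypothetical. Because the published usage numbers are rounded, the headroom computed here for HKG will not exactly match the 3.6673 TiB quoted above.

```python
UPGRADE_THRESHOLD = 0.60  # prepare an upgrade above this usage
REJECT_THRESHOLD = 0.80   # stop/reject orders at this usage

# (used TiB, total TiB) as published in the update
locations = {
    "LAX": (37.59, 83.84),
    "TYO": (8.27, 15.72),
    "HKG": (38.19, 52.39),
}


def capacity_status(used_tib, total_tib):
    """Return (usage ratio, TiB left before the reject threshold, action)."""
    usage = used_tib / total_tib
    headroom = REJECT_THRESHOLD * total_tib - used_tib
    if usage >= REJECT_THRESHOLD:
        action = "reject new orders"
    elif usage >= UPGRADE_THRESHOLD:
        action = "prepare upgrade"
    else:
        action = "ok"
    return usage, headroom, action


for name, (used, total) in locations.items():
    usage, headroom, action = capacity_status(used, total)
    print(f"{name}: {usage:.0%} used, {headroom:.2f} TiB to 80%, {action}")
```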

Extra protection:
DMIT.com adds a new feature: image export support.
The double transfer (double bw for UNMETERED) in Compensation Package has been delivered.
Please be advised: the double transfer and double bandwidth will be removed if you choose to upgrade or stop renewing.

We suggest keeping it and ordering new services if you need more.
Snapshot for SJC is enabled.
To simplify accounting and bookkeeping, we will apply the service period extension to all services at once on April 1, Eastern Time.
Please ensure your service is active on March 31 at 8:00 pm Eastern Time (April 1, 00:00 UTC).
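
To double-check the cutoff in your own time zone, the conversion can be sketched as follows. Two assumptions: the year is 2023 (the message does not state it), and US Eastern Time is on daylight time (UTC-4) on March 31.

```python
from datetime import datetime, timedelta, timezone

# Assumptions: year 2023, and Eastern Time is on daylight time (UTC-4).
EDT = timezone(timedelta(hours=-4), "EDT")

cutoff_eastern = datetime(2023, 3, 31, 20, 0, tzinfo=EDT)
cutoff_utc = cutoff_eastern.astimezone(timezone.utc)

print(cutoff_utc.isoformat())  # 2023-04-01T00:00:00+00:00
```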
DMIT adds soft watchdog support at SJC.
Enable it by installing and enabling the watchdog package, then power cycling the instance.
Ubuntu needs extra work; we'll post the steps on the knowledge base later.
Dear Valued Client,

We would like to inform you that DMIT has decided to decouple the prices of the APAC, EMEA, and AMER regions due to a range of factors, including inflation, energy price increases, supplier price adjustments, and renewal price increases. As a result, we will be implementing a fully dynamic pricing strategy for LITE, considering its low margins. This pricing strategy will be implemented first in Hong Kong. Existing services will see no changes.

We thank you for your understanding and continued trust in our services.

Best regards,
DMIT INC

https://www.dmit.io/pages/pricing
DMIT Hong Kong Strategy Adjustment

Hong Kong will have 3 network profiles

a) Tier 1 (New)
Competitive Internet routing;
DMIT will make a reasonable effort to lower transcontinental latency.

Estimated HK-EU latency (coming soon, April 20~May 1.)
115ms+(RU), 130ms+ (VIE), 140ms+(FRA), 150ms+ (AMS), 155ms+ (LON)

b) Eyeball (From HKG.LITE and LITEv2)
Affordable direct connection (In-Asia) to mainland China.
Reasonable Effort to offer China routing. (No IP Transit Guarantee)

c) Premium (HKG.Pro, no change)
Direct China - Hong Kong routing via high-quality IP Transit.
Best Effort to offer China Premium routing, bidirectional. (IP Transit Guarantee)

=========================================

DMIT Upstream Change
1. Eyeball Transit (Done)
New: Local Vendor
Remove: China Mobile International
Routing:
- Direct CM bidirectional routing;
- Direct CT, CU return routing;
- NTT for CT, CU inbound routing; (Estimated peering time with AS2914: April 21 ~ May 1)
** Costs increased 5x-20x (depends on destination). No change for current users.

2. Premium Transit (Very soon)
New: Local Vendor
Remove: ZenLayer
Primary Routing: CN2 by CTGnet, CU by CUG, CM by CMI.
Redundant Routing: CN - DMIT TYO - DMIT HKG via Premium ISP