Due to the impact of the epidemic, remote maintenance, transportation, and ISP maintenance at our Los Angeles data center are all affected.
The following NOC matters have been postponed:
- We planned to interconnect with Zayo (AS6461) in March 2020 to improve the quality of local connectivity in the United States. [ No further ETA from the Zayo IPT delivery team ]
- New EPYC node RAM: We ordered a dozen sets of DDR4 memory from our supplier, but they have run out of stock. [ No ETA for the memory shipment ]
- New EPYC node: Two of the servers have already shipped; the original arrival ETA was March 22, 2020 [ New ETA: March 30, 2020 ]. A few more sets have not shipped yet. [ No ETA ]
- Enhanced backbone quality and SLA: We currently have only one unprotected backbone circuit between our LAX Extended PoP and the LAX data center. We requested an upgrade to a protected loop on March 12, 2020. [ No ETA for this yet ]
- We found that several Layer 1 errors are reported by the transceiver every day. Our NOC attributes this to unpredictable fiber quality degradation and transceiver laser aging (the received power is expected to fall below the transceiver's minimum RX threshold within 1-3 months; a rough margin estimate is sketched after this list). We will add amplifiers at both ends of the link, replace the transceiver with a new one, and adopt DWDM to improve reliability. These tasks will be carried out together once the ETA in the previous item is known, to reduce downtime.
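For illustration only, here is a minimal sketch of the kind of margin estimate behind the 1-3 month figure above: compare the measured receive power against the transceiver's minimum RX sensitivity and divide by an assumed degradation rate. The threshold, aging rate, and sample reading below are hypothetical placeholders, not our actual optical readings.

```python
# Minimal sketch: estimate remaining optical margin on a link, given an
# assumed minimum RX sensitivity and a measured received power.
# All figures are illustrative placeholders, not real readings.

RX_SENSITIVITY_DBM = -14.4   # hypothetical minimum RX threshold of the optic
AGING_DB_PER_MONTH = 0.5     # hypothetical fiber/laser degradation rate

def months_of_margin(rx_power_dbm: float) -> float:
    """Rough months until RX power falls below the sensitivity threshold."""
    margin_db = rx_power_dbm - RX_SENSITIVITY_DBM
    return margin_db / AGING_DB_PER_MONTH

if __name__ == "__main__":
    measured_rx_dbm = -13.1  # example reading from optical monitoring (DDM/DOM)
    print(f"Margin: {measured_rx_dbm - RX_SENSITIVITY_DBM:.1f} dB, "
          f"~{months_of_margin(measured_rx_dbm):.1f} months at assumed aging rate")
```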
The issues with the VM Control Panel have been fixed. We had to reboot nodes 27J3B42 and 2VC8Q22 as part of this. There is no impact on other nodes or VMs.
We detected a huge TCP DDoS attack targeting a client on LH36806/7 and the gateway; we had to reboot LH36806 to recover from the hung kernel. We have already enhanced our anti-DDoS rules.
Due to another wave of new orders, our two Xeon Gold nodes were severely overloaded (about 20% above our design capacity). At the same time, the DDoS attacks and high load caused the nodes to repeatedly disconnect from the cluster, so queued tasks could not execute; the continual disconnections eventually led to a backlog of tasks and a cluster failure.
The two HPE EPYC nodes we ordered from the supplier have now arrived at the data center, but the memory is still awaiting shipment, so they cannot be put into production for the time being due to insufficient RAM.
We have ordered one more DL360 node directly from HPE. The DL360 Xeon Gold node will be ready in 5-7 days; it is the same model as our current Xeon Gold nodes.
We are very sorry that we failed to anticipate how far beyond our expectations our business would grow.
We may stop accepting new LAX orders in the near future in order to ensure that current services can continue to run safely.
The cause we described earlier was not entirely correct. After an in-depth investigation: due to defects in PVE-Firewall, DMIT had recompiled it and removed the defective design, and this also led to a failure to filter broadcast and multicast packets properly. Whenever the network rate drops and packet loss rises, we capture a large volume (>100 kpps) of broadcast or multicast packets sent by a guest VM (see the measurement sketch below). The guest VM kernels do not have enough buffer space or processing capacity to handle these multicast packets, which causes congestion inside the guest VMs. Since multicast packets are delivered to every guest VM, a heavily loaded guest VM also causes a sudden high load on the host (these are shared VMs, so each VM does not have a full CPU core to itself).
P.S.: Some special intranet broadcast packets are routinely blocked in large-scale network environments, because certain packets can trigger network architecture changes (e.g. a host advertising its IP as the IGMP snooping router).
Our engineering team has already deployed a new beta, optimized PVE-Firewall to the affected nodes for testing. In our tests it successfully blocked intranet attacks and abuse. Our engineers will upload this PVE-Firewall build to our code platform and deploy it to all nodes.
However, the load on our nodes still exceeds what we planned for, and we will not accept new orders for the time being (upgrades are still allowed). Although node resources have become sufficient thanks to recently released VMs, new orders will only be accepted once the new node is ready.
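For reference, here is a minimal sketch (not our production tooling) of how the broadcast/multicast packet rate mentioned above can be sampled on a node. The bridge name vmbr0 and the 10-second window are assumptions, and scapy itself cannot keep up with 100 kpps, so treat this only as a coarse indicator:

```python
# Minimal sketch: sample the broadcast/multicast packet rate on a bridge
# interface and report the top source MACs. Requires scapy and root.
from collections import Counter
from scapy.all import Ether, sniff

IFACE = "vmbr0"     # hypothetical Proxmox bridge name
WINDOW_S = 10       # sampling window in seconds

counts: Counter = Counter()

def tally(pkt) -> None:
    # Count packets per source MAC address.
    if Ether in pkt:
        counts[pkt[Ether].src] += 1

sniff(iface=IFACE, filter="ether broadcast or ether multicast",
      prn=tally, store=False, timeout=WINDOW_S)

total = sum(counts.values())
print(f"~{total / WINDOW_S:.0f} broadcast/multicast pps over {WINDOW_S}s")
for mac, n in counts.most_common(5):
    print(f"  {mac}: {n} packets")
```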
Scheduled maintenance:
3PM PST
Apr 7, 2020.
Window: 30min~1h
Maintenance:
1. Check the cross-connect (X-C) quality.
2. Check the light attenuation of our backbone.
3. Replace the transceivers on both sides of the backbone.
Temporary Impact:
1. The network on LH36806/7 will be inaccessible.
2. IX peering will go down.
3. Network capacity will be reduced.
Thanks to the release and rebalancing of resources, we now have enough capacity to accept new orders.
The earlier packet loss was not caused by overselling, but we have nonetheless placed strict limits on resource allocation for now.
The new Xeon Gold node and EPYC node will be ready next week. Between April 25 and May 10, we will launch two more EPYC nodes.
If you feel that LAX has been slow recently, please run an MTR test.
If you find that your LAX-China route passes through Shanghai, please try using UDP temporarily to bypass the unknown restrictions at the China Telecom Shanghai PoP.
Our NOC is still in contact with the China Telecom GNOC about this issue.
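As a convenience, the check above can be scripted. The sketch below assumes the mtr CLI is installed (with a build that supports UDP probes via -u); the hostname patterns used to spot a Shanghai PoP hop are rough illustrative guesses, not an authoritative way to identify China Telecom Shanghai:

```python
# Minimal sketch: run an MTR report toward a destination and flag hops
# whose reverse DNS looks like it might be a Shanghai PoP.
import subprocess
import sys

def mtr_report(dest: str, udp: bool = False) -> str:
    cmd = ["mtr", "-r", "-c", "10"]      # report mode, 10 cycles
    if udp:
        cmd.append("-u")                 # probe with UDP instead of ICMP
    cmd.append(dest)
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: mtr_check.py <destination-host-or-ip>")
    report = mtr_report(sys.argv[1])
    print(report)
    # Hypothetical heuristics for spotting a Shanghai hop in reverse DNS names.
    hits = [ln for ln in report.splitlines()
            if "shanghai" in ln.lower() or ".sh." in ln.lower()]
    if hits:
        print("Possible Shanghai PoP hops; consider retesting with UDP probes (udp=True):")
        print("\n".join(hits))
```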
Due to frequent customer-to-customer fraud via the transfer function, we will temporarily disable it in the near future. Our team will improve the transfer process as much as possible to protect the interests of both DMIT and our customers.
We are sorry about the situation below, but we have to inform all customers of the following.
A) Kernel, Speed, Network, Bandwidth:
If we find a third-party kernel or an unknown TCP acceleration program, add-on, or plug-in installed in a VM, our team will NOT provide any help with network issues on that VM. It costs our team far too much time to explain how such software damages and disrupts the TCP queuing of the network stack in your VM.
This involves professional networking knowledge (e.g. buffers, bufferbloat, queuing, transmission control, retransmission, etc.). If you do not understand these concepts, please leave the system as it is.
If we notice such software, our team will send a notice without any further help. A refund is possible if the refund requirements are met.
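For customers who want to confirm their guest is still on stock settings before opening a ticket, a minimal sketch for a Linux guest is below; it only reads standard procfs entries and changes nothing. A stock kernel will typically report cubic for congestion control.

```python
# Minimal sketch: print the guest's TCP congestion control and default qdisc
# from standard procfs paths, so you can confirm nothing has been replaced.
from pathlib import Path

SETTINGS = {
    "congestion control": "/proc/sys/net/ipv4/tcp_congestion_control",
    "available algorithms": "/proc/sys/net/ipv4/tcp_available_congestion_control",
    "default qdisc": "/proc/sys/net/core/default_qdisc",
}

for label, path in SETTINGS.items():
    p = Path(path)
    value = p.read_text().strip() if p.exists() else "unavailable"
    print(f"{label}: {value}")
```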
B) Guarantee:
The following is common knowledge in the industry.
DMIT never guarantees network performance from a VM to China or any other location. The speed shown on the cart page is the maximum the VM can reach. No ISP gives any end-to-end bandwidth guarantee; the bandwidth commitment an ISP gives us applies only between DMIT and their port on the switch/edge router. This includes China Telecom.
Regards,
NOC
> Due to a TPE (Trans-Pacific Express) S1S cable fault, there is severe network degradation on the CN2 backbone. Please wait for the CTA/CTG NOC response and route rescheduling. Please do not open a ticket for this issue; DMIT has no ability to repair the cable.
DMIT has temporarily rejected the CN2 LAX-PVG routes to avoid congestion (>300 ms with 40%+ packet loss). For now, this traffic should go through our Internet transit instead.
Back on CN2 GIA.
The CN2 backbone has resumed and is now carried over multiple submarine cables. Latency is flapping between 160 ms and 185 ms (in general it should be 120-125 ms). Bandwidth is unaffected, but latency is.
Based on our experience, the cable repair will take about 2-4 months to complete.
Regards,
NOC
Scheduled maintenance
Date: Apr 22, 2020
Time: 12pm - 3pm PST
Datacenter: LAX
Estimated duration: 1h
Max duration: 3h
Affected services:
- Reinstall;
- Backup;
- Network in some nodes;
- Control Panel;
Best regards,
NOC