Moudar
Advisor

VLAN interface keeps going down

Hi 

I have a lab inside EVE-NG.

I have configured one physical interface as bond0.

bond0 carries 3 VLAN interfaces; one, and sometimes two, of these VLAN interfaces keep going down.

[attachment: interfaces.png]

The other side is a Cisco switch with a trunk port connected to interface eth5 on the gateway.

 

A-GW-1> show cluster state

Cluster Mode:   High Availability (Active Up) with IGMP Membership

ID         Unique Address  Assigned Load   State          Name

1 (local)  172.22.1.2      0%              DOWN           A-GW-1
2          172.22.1.1      100%            ACTIVE(!)      A-GW-2


Active PNOTEs: LPRB, IAC

Last member state change event:
   Event Code:                 CLUS-110300
   State change:               STANDBY -> DOWN
   Reason for state change:    Interface bond0.10 is down (Cluster Control Protocol packets are not received)
   Event time:                 Thu May  2 17:42:56 2024

Last cluster failover event:
   Transition to new ACTIVE:   Member 1 -> Member 2
   Reason:                     Interface bond0.10 is down (Cluster Control Protocol packets are not received)
   Event time:                 Thu May  2 17:39:06 2024

Cluster failover count:
   Failover counter:           3
   Time of counter reset:      Thu May  2 16:19:41 2024 (reboot)


A-GW-1> show interface bond0.10
state on
mac-addr 50:00:00:08:00:05
type vlan
link-state not available
mtu 1500
auto-negotiation off (bond0)
speed 1000M (bond0)
ipv6-autoconfig Not configured
monitor-mode Not configured
duplex full (bond0)
link-speed Not configured
comments VLAN_10
ipv4-address 10.10.10.10/24
ipv6-address Not Configured
ipv6-local-link-address Not Configured

Statistics:
TX bytes:14208090 packets:338285 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:0 packets:0 errors:0 dropped:0 overruns:0 frame:0

 

So the interface state is on, but the cluster still reports it as down.
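The cluster's own view of the monitored interfaces can be cross-checked from expert mode. A minimal sketch using the standard ClusterXL diagnostics (exact output formats vary by version):

```shell
# List the member states and the active PNOTEs (LPRB, IAC, ...)
cphaprob state
cphaprob -l list

# Show which interfaces the cluster monitors and their CCP status;
# bond0.10 should appear here as a monitored (required) interface.
cphaprob -a if
```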

Here is the log from /var/log/messages:

ext3 jbd dm_multipath lp pcspkr sr_mod cdrom psmouse serio_raw button parport_pc parport e1000 i2c_piix4 dm_snapshot dm_bufio dm_zero dm_mirror dm_region_hash dm_log dm_mod xfs mptspi mptscsih mptbase virtio_scsi virtio_blk virtio_pci virtio_ring virtio nvme nvme_core ata_piix ahci libahci libata sg sym53c8xx scsi_transport_spi cciss sd_mod crc_t10dif crct10dif_common scsi_transport_fc scsi_tgt
May  2 17:38:45 2024 A-GW-2 kernel:CPU: 0 PID: 7370 Comm: snd_c Kdump: loaded Tainted: P           OEL ------------   3.10.0-1160.15.2cpx86_64 #1
May  2 17:38:45 2024 A-GW-2 kernel:Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.1-0-g0551a4be2c-prebuilt.qemu-project.org 04/01/2014
May  2 17:38:45 2024 A-GW-2 kernel:task: ffff8800376ad540 ti: ffff880087dd4000 task.ti: ffff880087dd4000
May  2 17:38:45 2024 A-GW-2 kernel:RIP: 0010:[<ffffffff902d9a11>]  [<ffffffff902d9a11>] e1000_alloc_rx_buffers+0xd1/0x6d0 [e1000]
May  2 17:38:45 2024 A-GW-2 kernel:RSP: 0018:ffff8801bfc03d48  EFLAGS: 00000286
May  2 17:38:45 2024 A-GW-2 kernel:RAX: ffffc90001082818 RBX: 00ff8801bfc039fc RCX: 00000000000005f2
May  2 17:38:45 2024 A-GW-2 kernel:RDX: 00000000000000e8 RSI: 000000009d24a640 RDI: ffff8801acf1a8c0
May  2 17:38:45 2024 A-GW-2 kernel:RBP: ffff8801bfc03da0 R08: 0000000000000000 R09: ffff8801bfc03b40
May  2 17:38:45 2024 A-GW-2 kernel:R10: ffff8801bfc039fc R11: 0000000000000000 R12: ffff8801bfc03cb8
May  2 17:38:45 2024 A-GW-2 kernel:R13: ffffffff817c544a R14: ffff8801bfc03da0 R15: ffffc90025015e90
May  2 17:38:45 2024 A-GW-2 kernel:FS:  0000000000000000(0000) GS:ffff8801bfc00000(0000) knlGS:0000000000000000
May  2 17:38:45 2024 A-GW-2 kernel:CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May  2 17:38:45 2024 A-GW-2 kernel:CR2: 00000000f7721000 CR3: 000000018d510000 CR4: 00000000000006f0
May  2 17:38:45 2024 A-GW-2 kernel:Call Trace:
May  2 17:38:45 2024 A-GW-2 kernel: <IRQ> 
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff902d96e1>] e1000_clean_rx_irq+0x2d1/0x530 [e1000]
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff902da233>] e1000_clean+0x223/0x8c0 [e1000]
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff810a4b8c>] ? mod_timer+0x10c/0x240
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff816a23fb>] net_rx_action+0x26b/0x3a0
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff8109aef8>] __do_softirq+0x128/0x290
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff817c797c>] call_softirq+0x1c/0x30
May  2 17:38:45 2024 A-GW-2 kernel: <EOI> 
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff8102f9a5>] do_softirq+0x55/0x90
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff8109a2f0>] __local_bh_enable_ip+0x60/0x70
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff8109a317>] local_bh_enable+0x17/0x20
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff90b01256>] cphwd_api_message+0x6d6/0xae0 [simmod_0]
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff916e4160>] ? cphwd_q_pending_queue_try_flush+0x460/0x460 [fw_0]
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff916e4186>] ? cphwd_q_async_dequeue_cb.lto_priv.2422+0x26/0x70 [fw_0]
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff927eca4e>] ? kernel_thread_run+0x39e/0xfb0 [fw_0]
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff810bff50>] ? wake_up_atomic_t+0x30/0x30
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff927ac4d0>] ? cpaq_kut_register_client+0x40/0x40 [fw_0]
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff927b219e>] ? kiss_kthread_run+0x1e/0x50 [fw_0]
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff927ac4eb>] ? plat_run_thread+0x1b/0x30 [fw_0]
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff810befc2>] ? kthread+0xe2/0xf0
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff810beee0>] ? insert_kthread_work+0x40/0x40
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff817c429d>] ? ret_from_fork_nospec_begin+0x7/0x21
May  2 17:38:45 2024 A-GW-2 kernel: [<ffffffff810beee0>] ? insert_kthread_work+0x40/0x40
May  2 17:38:45 2024 A-GW-2 kernel:Code: 00 00 45 3b 65 18 74 23 45 85 e4 45 89 65 18 41 8d 54 24 ff 0f 84 c5 04 00 00 0f ae f8 41 0f b7 45 36 49 03 87 d0 03 00 00 89 10 <48> 83 c4 30 5b 41 5c 41 5d 41 5e 41 5f 5d c3 4c 89 ff e8 18 ca 
May  2 17:38:45 2024 A-GW-2 kernel:sending NMI to other CPUs:
May  2 17:38:45 2024 A-GW-2 kernel:NMI backtrace for cpu 1 skipped: idling at pc 0xffffffff817b83fb
May  2 17:38:49 2024 A-GW-2 spike_detective: spike info: type: cpu, cpu core: 0, top consumer: system interrupts, start time: 02/05/24 17:38:25, spike duration (sec): 23, initial cpu usage: 100, average cpu usage: 100, perf taken: 0

May  2 17:39:00 2024 A-GW-2 kernel:[fw4_1];CLUS-220201-2: Starting CUL mode because CPU usage (81%) on the remote member 1 increased above the configured threshold (80%). 
May  2 17:39:01 2024 A-GW-2 kernel:[fw4_1];CLUS-210300-2: Remote member 1 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
May  2 17:39:06 2024 A-GW-2 kernel:[fw4_1];CLUS-114405-2: State change: ACTIVE! -> STANDBY | Reason: Member state has been changed after returning from ACTIVE/ACTIVE scenario (remote cluster member 1 has higher priority)
May  2 17:39:06 2024 A-GW-2 kernel:[fw4_1];CLUS-210305-2: Remote member 1 (state DOWN -> ACTIVE(!)) | Reason: Interface is down (Cluster Control Protocol packets are not received)
May  2 17:39:06 2024 A-GW-2 kernel:[fw4_1];CLUS-100201-2: Failover member 2 -> member 1 | Reason: Member state has been changed after returning from ACTIVE/ACTIVE scenario (remote cluster member 1 has higher priority)
May  2 17:39:07 2024 A-GW-2 kernel:[fw4_1];CLUS-210300-2: Remote member 1 (state ACTIVE(!) -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
May  2 17:39:07 2024 A-GW-2 kernel:[fw4_1];CLUS-114704-2: State change: STANDBY -> ACTIVE | Reason: No other ACTIVE members have been found in the cluster
May  2 17:39:07 2024 A-GW-2 kernel:[fw4_1];CLUS-100102-2: Failover member 1 -> member 2 | Reason: Available on member 1
May  2 17:39:07 2024 A-GW-2 kernel:[fw4_1];CLUS-214802-2: Remote member 1 (state DOWN -> STANDBY) | Reason: There is already an ACTIVE member in the cluster
May  2 17:39:12 2024 A-GW-2 spike_detective: spike info: type: cpu, cpu core: 0, top consumer: system interrupts, start time: 02/05/24 17:38:54, spike duration (sec): 17, initial cpu usage: 100, average cpu usage: 100, perf taken: 0

May  2 17:40:00 2024 A-GW-2 xpand[6195]: admin localhost t +volatile:clish:admin:28133 t 
May  2 17:40:00 2024 A-GW-2 clish[28133]: User admin logged in with ReadWrite permission
May  2 17:42:56 2024 A-GW-2 kernel:[fw4_1];CLUS-120202-2: Stopping CUL mode after 199 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec. 
May  2 17:42:56 2024 A-GW-2 kernel:[fw4_1];CLUS-110305-2: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond0.10 is down (Cluster Control Protocol packets are not received)
May  2 17:42:56 2024 A-GW-2 kernel:[fw4_1];CLUS-210300-2: Remote member 1 (state STANDBY -> DOWN) | Reason: Interface is down (Cluster Control Protocol packets are not received)
May  2 17:43:02 2024 A-GW-2 kernel:[fw4_1];CLUS-114904-2: State change: ACTIVE(!) ->  ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
May  2 17:43:02 2024 A-GW-2 kernel:[fw4_1];CLUS-120200-2: Starting CUL mode because CPU-00 usage (81%) on the local member increased above the configured threshold (80%).
May  2 17:43:22 2024 A-GW-2 kernel:[fw4_1];CLUS-120202-2: Stopping CUL mode after 17 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec. 
May  2 17:43:22 2024 A-GW-2 kernel:[fw4_1];CLUS-110305-2: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond0.10 is down (Cluster Control Protocol packets are not received)
May  2 17:43:28 2024 A-GW-2 kernel:[fw4_1];CLUS-114904-2: State change: ACTIVE(!) ->  ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
May  2 17:43:33 2024 A-GW-2 kernel:[fw4_1];CLUS-110305-2: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond0.10 is down (Cluster Control Protocol packets are not received)
May  2 17:43:35 2024 A-GW-2 kernel:[fw4_1];CLUS-120200-2: Starting CUL mode because CPU-00 usage (88%) on the local member increased above the configured threshold (80%).
May  2 17:43:35 2024 A-GW-2 kernel:[fw4_1];CLUS-114904-2: State change: ACTIVE(!) ->  ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
May  2 17:43:46 2024 A-GW-2 kernel:[fw4_1];CLUS-120202-2: Stopping CUL mode after 10 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec. 
May  2 17:43:46 2024 A-GW-2 kernel:[fw4_1];CLUS-110305-2: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond0.10 is down (Cluster Control Protocol packets are not received)
May  2 17:44:03 2024 A-GW-2 kernel:[fw4_1];CLUS-220201-2: Starting CUL mode because CPU usage (87%) on the remote member 1 increased above the configured threshold (80%). 
May  2 17:44:03 2024 A-GW-2 kernel:[fw4_1];CLUS-114904-2: State change: ACTIVE(!) ->  ACTIVE | Reason: Reason for ACTIVE! alert has been resolved
May  2 17:44:44 2024 A-GW-2 kernel:[fw4_1];CLUS-120202-2: Stopping CUL mode after 37 sec (short CUL timeout), because no member reported CPU usage above the configured threshold (80%) during the last 10 sec. 
May  2 17:44:44 2024 A-GW-2 kernel:[fw4_1];CLUS-110305-2: State change: ACTIVE -> ACTIVE(!) | Reason: Interface bond0.10 is down (Cluster Control Protocol packets are not received)
May  2 17:45:03 2024 A-GW-2 kernel:[fw4_1];CLUS-114904-2: State change: ACTIVE(!) ->  ACTIVE | Reason: Reason for ACTIVE! alert has been resolved

 

9 Replies
the_rock
Legend

This only happens on one member?

Andy

Moudar
Advisor

Looks like it happens on both!

the_rock
Legend

I only see GW-2 in the logs. How is this interface configured in EVE-NG? What type is it?

Moudar
Advisor

EVE interface template e1000

A-GW-1

 show cluster state

Cluster Mode:   High Availability (Active Up) with IGMP Membership

ID         Unique Address  Assigned Load   State          Name

1 (local)  172.22.1.2      0%              STANDBY        A-GW-1
2          172.22.1.1      100%            ACTIVE(!)      A-GW-2


Active PNOTEs: LPRB

Last member state change event:
   Event Code:                 CLUS-114802
   State change:               INIT -> STANDBY
   Reason for state change:    There is already an ACTIVE member in the cluster (member 2)
   Event time:                 Fri May  3 21:38:19 2024

Cluster failover count:
   Failover counter:           0
   Time of counter reset:      Fri May  3 21:32:51 2024 (reboot)

 

A-GW-2

show cluster state

Cluster Mode:   High Availability (Active Up) with IGMP Membership

ID         Unique Address  Assigned Load   State          Name

1          172.22.1.2      0%              STANDBY        A-GW-1
2 (local)  172.22.1.1      100%            ACTIVE(!)      A-GW-2


Active PNOTEs: LPRB, IAC

Last member state change event:
   Event Code:                 CLUS-110305
   State change:               ACTIVE -> ACTIVE(!)
   Reason for state change:    Interface bond0.10 is down (Cluster Control Protocol packets are not received)
   Event time:                 Fri May  3 21:42:07 2024

Cluster failover count:
   Failover counter:           0
   Time of counter reset:      Fri May  3 21:32:51 2024 (reboot)

 

the_rock
Legend

Maybe compare the settings, using the show interface command in clish, against an interface that does work and see if anything is different.

Andy

Moudar
Advisor

 show interface bond0.10
state on
mac-addr 50:00:00:01:00:05
type vlan
link-state not available
mtu 1500
auto-negotiation off (bond0)
speed 1000M (bond0)
ipv6-autoconfig Not configured
monitor-mode Not configured
duplex full (bond0)
link-speed Not configured
comments VLAN_10
ipv4-address 10.10.10.11/24
ipv6-address Not Configured
ipv6-local-link-address Not Configured

Statistics:
TX bytes:922824 packets:21972 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:3594934 packets:78129 errors:0 dropped:0 overruns:0 frame:0
A-GW-2> show interface bond0.20
state on
mac-addr 50:00:00:01:00:05
type vlan
link-state not available
mtu 1500
auto-negotiation off (bond0)
speed 1000M (bond0)
ipv6-autoconfig Not configured
monitor-mode Not configured
duplex full (bond0)
link-speed Not configured
comments VLAN_20
ipv4-address 10.10.20.21/24
ipv6-address Not Configured
ipv6-local-link-address Not Configured

Statistics:
TX bytes:1218 packets:29 errors:0 dropped:0 overruns:0 carrier:0
RX bytes:12550 packets:251 errors:0 dropped:0 overruns:0 frame:0
the_rock
Legend

Just an idea... can you try creating a VLAN off, say, eth2 and see if you have the same issue?
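A sketch of the Gaia clish commands to stand up a throwaway VLAN on eth2 for that test (the VLAN ID and address here are placeholders; pick values that match the switch-side trunk):

```shell
# In clish, on each member: create the test VLAN interface on eth2
add interface eth2 vlan 30
set interface eth2.30 ipv4-address 10.10.30.10 mask-length 24
set interface eth2.30 state on
save config
```

The Cisco trunk toward eth2 would also need to allow the same VLAN ID.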

Andy

AmirArama
Employee

I believe you see those messages only on VLAN 10 because we monitor only the lowest VLAN ID on the physical interface.

The messages say that CCP packets are not received. You should check, on both members, whether the actual link goes down (it would be noted in /var/log/messages or dmesg) or not. I also see those messages at the same time as the CUL messages about high CPU, and I wonder if that is related (high CPU disrupting the processing of CCP packets). I would start by optimizing the performance if possible.

I would also try running tcpdump on this VLAN interface for CCP packets, and running a regular ping between the cluster members, to figure out whether you have a connectivity issue.
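A minimal sketch of that check, run from expert mode, assuming this build carries CCP over UDP port 8116 (the case on R80.20 and later; older releases used a raw layer-2 encapsulation, where the port filter would not match):

```shell
# On the problem member, watch for CCP traffic on the monitored VLAN
tcpdump -nni bond0.10 udp port 8116

# In parallel, test plain IP reachability between the members on the
# same VLAN (addresses taken from the outputs above: .10 on A-GW-1,
# .11 on A-GW-2)
ping -c 5 10.10.10.11
```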

the_rock
Legend

I remember having that kernel value set to 1 when I had an R81.10 lab cluster, and I never had this problem...

Andy
