ECS Docker basic cluster

Enonic version: 7.4.1
OS: Docker Hub official image.

I have started a cluster with 5 nodes (2 data & 3 master) and also configured a common storage location. The cluster can identify each node properly, but when I start it, every node in the cluster logs the following error:

2020-11-11 13:50:50,193 ERROR c.e.x.c.impl.ClusterManagerImpl - Provider elasticsearch not healthy: {
  "cluster_name" : "wapp_cluster",
  "status" : "red",
  "timed_out" : true,
  "number_of_nodes" : 5,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 12,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 0.0
}

Because of this, the XP login page does not appear either; basically, the web server won't start.

Any help on this is very much appreciated.

Thanks & Regards
Suranga



Hi,

The fact that the cluster is red can mean a couple of things: for example, too few master nodes to satisfy the minimum, not enough data nodes, or connectivity problems between the nodes. So I suggest we start by making sure that the nodes can communicate.
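A quick way to inspect what the cluster itself sees (assuming the embedded Elasticsearch HTTP port 9200 is reachable on the node, as in the example logs below) is:

# cluster health, the same JSON as in the error above
curl -s 'http://localhost:9200/_cluster/health?pretty'
# list the nodes the cluster currently knows about, with their roles
curl -s 'http://localhost:9200/_cat/nodes?v'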

I assume each node is running on its own dedicated VM. Since you are running in Docker, you need to take care that the address you bind Elasticsearch to makes sense. Setting network.host=_eth0_ in com.enonic.xp.cluster.cfg should work; that will bind to the container IP. But you do not want to publish that IP to the cluster, so set network.publish.host to <VM-IP>. Also make sure port 9300 is exposed in Docker and that all the VMs can communicate with each other on that port.
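As a sketch, the relevant lines in com.enonic.xp.cluster.cfg on one node could then look like this (192.168.60.11 is a placeholder for that VM's own routable IP, matching the example logs below):

# com.enonic.xp.cluster.cfg -- sketch, adjust addresses to your environment
cluster.enabled=true
node.name=master-node-01
# bind to the container's eth0 interface (the container-internal IP)
network.host=_eth0_
# publish the VM's routable IP so the other nodes can reach this one
network.publish.host=192.168.60.11

And when starting the container, remember to publish the transport port, e.g. docker run -p 9300:9300 …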

You should see logs on the nodes something like this (pay extra attention to publish_address and bound_addresses and make sure they make sense):

09:10:20.849 INFO  org.elasticsearch.node - [master-node-01] version[2.4.6], pid[1], build[NA/NA]
09:10:20.852 INFO  org.elasticsearch.node - [master-node-01] initializing ...
09:10:20.882 INFO  org.elasticsearch.plugins - [master-node-01] modules [], plugins [], sites []
09:10:20.923 INFO  org.elasticsearch.env - [master-node-01] using [1] data paths, mounts [[/enonic-xp/home/repo/index (/dev/sda1)]], net usable_space [5.7gb], net total_space [9.9gb], spins? [possibly], types [xfs]
09:10:20.923 INFO  org.elasticsearch.env - [master-node-01] heap size [494.9mb], compressed ordinary object pointers [true]
09:10:24.712 INFO  org.elasticsearch.node - [master-node-01] initialized
09:10:24.716 INFO  org.elasticsearch.node - [master-node-01] starting ...
09:10:25.191 INFO  org.elasticsearch.transport - [master-node-01] publish_address {192.168.60.11:9300}, bound_addresses {172.17.0.2:9300}
09:10:25.204 INFO  org.elasticsearch.discovery - [master-node-01] vagrant/dIxGrsUQR9WEkaM3rUP1qw
09:10:28.291 INFO  org.elasticsearch.cluster.service - [master-node-01] new_master {master-node-01}{dIxGrsUQR9WEkaM3rUP1qw}{192.168.60.11}{192.168.60.11:9300}{data=false, local=false, master=true}, added {{data-node-1}{WFXOhHMhSq-bBP8rm9pTEQ}{192.168.60.12}{192.168.60.12:9300}{local=false, master=false},}, reason: zen-disco-join(elected_as_master, [0] joins received)
09:10:28.370 INFO  org.elasticsearch.http - [master-node-01] publish_address {192.168.60.11:9200}, bound_addresses {172.17.0.2:9200}
09:10:28.370 INFO  org.elasticsearch.node - [master-node-01] started
09:10:29.007 INFO  org.elasticsearch.gateway - [master-node-01] recovered [6] indices into cluster_state
09:10:30.637 INFO  org.elasticsearch.cluster.service - [master-node-01] added {{frontend-node-1}{dXlecUM3SKSvPt27lxWLpQ}{192.168.60.13}{192.168.60.13:9300}{data=false, local=false, master=false},}, reason: zen-disco-join(join from node[{frontend-node-1}{dXlecUM3SKSvPt27lxWLpQ}{192.168.60.13}{192.168.60.13:9300}{data=false, local=false, master=false}])

Hi Gbi,
Many thanks for your reply. I have set network.host=_eth0_ and exposed port 9300 as well; now I am getting the following error:

2020-11-12 11:40:09,588 ERROR c.h.i.c.i.o.JoinMastershipClaimOp - [172.31.20.158]:5701 [3.12.7#OSS] [3.12.7] Target is this node! -> [172.31.20.158]:5701
java.lang.IllegalArgumentException: Target is this node! -> [172.31.20.158]:5701

Also, let me know: if I set network.host=_eth0_, should I also update discovery.unicast.hosts?

Basically, my idea is to autoscale the data nodes.

Thanks & Regards,
Suranga

Suranga, are you setting up nodes manually this time, or are you still using AWS ECS?

Hi Thomas,

I am testing this with AWS ECS: 3 fixed master nodes, and trying to autoscale (load balance) 2 data nodes (front + back) in a basic cluster.

Thanks & Regards,
Suranga

Hi All,

Can anybody help me with the com.enonic.xp.cluster.cfg config for an autoscaling setup? I have the server setup as follows:

2 servers behind the ALB (data=true, master=false)
3 servers (data=false, master=true)

I need to know what common config I should use for com.enonic.xp.cluster.cfg. Also, can anybody explain to me how the indexes are shared while serving through the load balancer?

Thanks
Suranga

First thing is that you should probably use XP 7.3.x (the one we demoed). We introduced a new Grid component in 7.4 which we have not tested with ECS yet.

Also, for the cluster setup, you need to get the list of IPs for the cluster dynamically from Amazon ECS. Maybe you could share your current configuration files?

As for indexes, they should be replicated across all data nodes automatically. In your case, you should have three different node groups: masters, data and front-end. You should only autoscale and load balance the front-end nodes, and these nodes should have data = false.

Hi Thomas,

Thanks for your reply. We actually tested this with 7.4, following the basic cluster documentation published in the docs. According to that doc, we thought data + front-end could run on the same instance, since the basic cluster is described as 3 master nodes & 2 data nodes. So kindly confirm whether it is OK if I go with the following setup:

XP 7.3
3 x master (fixed)
2 x data (fixed)
2 x front-end (autoscaling)

Also, thanks for asking about the cluster config, because that is one place where I was badly confused.

Here is my current cluster config file.

com.enonic.xp.cluster.cfg

cluster.enabled=true
node.name=enonic-data1
discovery.unicast.hosts=172.31.26.137,172.31.16.91,172.31.35.128,172.31.20.158,172.31.13.195
network.host=eth0:ipv4
network.publish.host=eth0:ipv4

My doubt here is: do I need to list all the cluster IP addresses in discovery.unicast.hosts?

With the above setup, I am finding it difficult to define a common task for the autoscaling instances.

Thanks & Regards,
Suranga

Yes, the setup you describe should be the best way for an autoscaling cluster. As you point out, to make this work you cannot use a fixed list of IPs.

Here is an example from how we setup the configuration in our demo:
com.enonic.xp.cluster.cfg

cluster.enabled=true
#node.name=<generated-UUID>
network.host=_eth0_
network.publish.host=_eth0_

com.enonic.xp.elasticsearch.cfg

#node.client = false
node.data = false
node.master = false
discovery.unicast.sockets=xp.enonic-ecs-tst
cluster.name = enonic-ecs-tst

Notice that we are referencing sockets discovery instead of hosts. This must point to the ECS private DNS service that returns the list of all XP node IPs.

The reason for separating data and front-end nodes is simply that scaling data nodes up and down is more complex, and needs to be handled more carefully than for front-end (stateless) nodes.
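To illustrate, resolving that service name should return one A record per XP node; with the node IPs from the logs later in this thread it could look something like:

$ dig +short xp.enonic-ecs-tst
10.0.29.203
10.0.25.254
10.0.17.210

If the name resolves to nothing, or only to a single stale IP, discovery will fail.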


Many thanks, Thomas,

So that means that for the 2 data nodes & 3 master nodes I will be able to use com.enonic.xp.elasticsearch.cfg as follows for each:

DATA

#node.client = false
node.data = true
node.master = false
discovery.unicast.sockets=xp.enonic-ecs-tst
cluster.name = enonic-ecs-tst

MASTER

#node.client = false
node.data = false
node.master = true
discovery.unicast.sockets=xp.enonic-ecs-tst
cluster.name = enonic-ecs-tst

Thanks & Regards,
Suranga

Yeah. This seems to be the right way to go.

Hi Thomas,

Thanks for your prompt reply. I will test this quickly and get back to you with the output.

Thanks & Regards,
Suranga

Hi Thomas,
I have set everything as above and created a service discovery namespace, xp.enonic-ecs-tst.

The necessary DNS records were added in Route 53 for all the Docker instances; please check the image.

My com.enonic.xp.elasticsearch.cfg is as follows.

But in the Enonic log I am getting the following error:

1) Error injecting constructor, java.lang.IllegalArgumentException: Failed to resolve address for [xp.enonic-ecs-tst]

I am getting the same error on all nodes (master / data / front).

Any idea about this? Is anything missing in my configs?

Thanks & Regards,
Suranga

Hi,
Have you verified that doing nslookup inside one of the containers works?
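For example, from a shell inside one of the running containers:

# both should resolve the service name to the set of XP node IPs
nslookup xp.enonic-ecs-tst
dig +short xp.enonic-ecs-tst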

Hi Gbi,

Do you have any idea whether I should add an A record or an SRV record in the AWS ECS service discovery namespace?

From inside the container, I get a response from the dig command:

dig xp.enonic-ecs-tst

Please see the logs, which keep repeating on each node.


Thanks & Regards,
Suranga

Hi,
It is a good sign that the other nodes are waiting for the system-repo initialization. That means that there is an elected master node, and the elected master should start the initialization. I have seen this symptom when the nodes do not have enough CPU/memory. How much are you giving each node?

I would try to:

  • Clear all data from all the nodes
  • Give them 2 CPUs each (at least)
  • Give them 4 GB of memory each (at least; see the task-definition sketch below)
  • Try again
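For reference, a minimal sketch of how those resources could map onto an ECS task definition (the family name and image tag here are placeholders, not taken from this thread):

{
  "family": "xp-data-node",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["EC2"],
  "cpu": "2048",
  "memory": "4096",
  "containerDefinitions": [
    {
      "name": "enonic-xp",
      "image": "enonic/xp:7.3.2-ubuntu",
      "essential": true,
      "portMappings": [
        { "containerPort": 8080, "protocol": "tcp" },
        { "containerPort": 9300, "protocol": "tcp" }
      ]
    }
  ]
}

Here "cpu": "2048" means 2 vCPUs (1024 CPU units per vCPU), and "memory": "4096" is 4 GB.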

Hi Gbi,

Thanks for the reply,

I have used 3 GB with 2 vCPUs; let me try with 4 GB and see again.

Best regards,
Suranga

If that does not work, I need to see the complete logs from all the nodes to debug the issue.

Hi Gbi,

I have increased the vCPU count and raised the memory to 7 GB, but I am getting the same error from all the nodes.

The master logs are as follows:

15:02:04.939 INFO c.e.x.l.i.framework.FrameworkService - Starting Enonic XP…
15:02:05.264 INFO c.e.x.l.i.p.ProvisionActivator - Installing 92 bundles…
15:02:06.260 INFO ROOT - bundle org.apache.felix.scr:2.1.16 (11)Starting with globalExtender setting: false
15:02:06.286 INFO ROOT - bundle org.apache.felix.scr:2.1.16 (11) Version = 2.1.16
15:02:07.465 WARN org.apache.tika.parsers.PDFParser - TIFFImageWriter not loaded. tiff files will not be processed
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

15:02:07.504 WARN org.apache.tika.parser.SQLite3Parser - org.xerial’s sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
15:02:08.030 INFO c.e.x.s.i.config.ConfigInstallerImpl - Loaded config for [com.enonic.xp.repo]
15:02:08.045 INFO c.e.x.s.i.config.ConfigInstallerImpl - Loaded config for [com.enonic.xp.server.trace]
15:02:08.071 INFO c.e.x.s.i.config.ConfigInstallerImpl - Loaded config for [com.enonic.xp.cluster]
15:02:08.095 INFO c.e.x.s.i.config.ConfigInstallerImpl - Loaded config for [com.enonic.xp.elasticsearch]
15:02:08.146 INFO c.e.x.s.i.config.ConfigInstallerImpl - Loaded config for [com.enonic.xp.web.dos]
15:02:08.166 INFO c.e.x.s.i.config.ConfigInstallerImpl - Loaded config for [com.enonic.xp.server.shell]
15:02:08.260 INFO c.e.x.s.shell.impl.ShellActivator - Remote shell access is disabled
15:02:08.668 INFO c.e.x.s.internal.trace.TraceService - Call tracing is disabled in config
15:02:08.916 INFO c.e.x.c.i.a.c.AuditLogConfigImpl - Audit log is enabled and mappings updated.
15:02:08.969 INFO c.e.x.i.blobstore.BlobStoreActivator - Waiting for blobstore-provider [file]
15:02:08.971 INFO c.e.x.i.blobstore.BlobStoreActivator - Found blobstore-provider [file]
15:02:08.994 INFO c.e.x.i.blobstore.BlobStoreActivator - Registered blobstore [file] successfully
15:02:09.836 INFO org.elasticsearch.node - [d1b00043-fc5e-4829-8044-78da600e1d68] version[2.4.6], pid[1], build[NA/NA]
15:02:09.836 INFO org.elasticsearch.node - [d1b00043-fc5e-4829-8044-78da600e1d68] initializing …
15:02:09.841 INFO org.elasticsearch.plugins - [d1b00043-fc5e-4829-8044-78da600e1d68] modules [], plugins [], sites []
15:02:09.859 INFO org.elasticsearch.env - [d1b00043-fc5e-4829-8044-78da600e1d68] using [1] data paths, mounts [[/ (overlay)]], net usable_space [24.8gb], net total_space [29.4gb], spins? [possibly], types [overlay]
15:02:09.859 INFO org.elasticsearch.env - [d1b00043-fc5e-4829-8044-78da600e1d68] heap size [1.7gb], compressed ordinary object pointers [true]
15:02:09.861 WARN org.elasticsearch.env - [d1b00043-fc5e-4829-8044-78da600e1d68] max file descriptors [4096] for elasticsearch process likely too low, consider increasing to at least [65536]
15:02:11.547 INFO org.elasticsearch.node - [d1b00043-fc5e-4829-8044-78da600e1d68] initialized
15:02:11.547 INFO org.elasticsearch.node - [d1b00043-fc5e-4829-8044-78da600e1d68] starting …
15:02:11.613 INFO org.elasticsearch.transport - [d1b00043-fc5e-4829-8044-78da600e1d68] publish_address {10.0.29.203:9300}, bound_addresses {10.0.29.203:9300}
15:02:11.620 INFO org.elasticsearch.discovery - [d1b00043-fc5e-4829-8044-78da600e1d68] enonic-ecs-tst/srniXcIPQBuUdl3PNoivMw
15:02:14.894 INFO org.elasticsearch.cluster.service - [d1b00043-fc5e-4829-8044-78da600e1d68] detected_master {267465a9-3baf-4da5-8b25-180fd2ba182c}{QfV307dSQOumgp94KAGuIA}{10.0.25.254}{10.0.25.254:9300}{data=false, local=false, master=true}, added {{aae88ded-b6d3-418a-90e2-7f7c64ab0bde}{m_aIkUGCTVGbkkVJrtAOmg}{10.0.17.210}{10.0.17.210:9300}{data=false, local=false, master=true},{267465a9-3baf-4da5-8b25-180fd2ba182c}{QfV307dSQOumgp94KAGuIA}{10.0.25.254}{10.0.25.254:9300}{data=false, local=false, master=true},{0e0e5621-486d-47cb-b9b8-28183928be88}{NPisFLcLSveaeBWc5aWCUg}{10.0.16.29}{10.0.16.29:9300}{data=false, local=false, master=false},}, reason: zen-disco-receive(from master [{267465a9-3baf-4da5-8b25-180fd2ba182c}{QfV307dSQOumgp94KAGuIA}{10.0.25.254}{10.0.25.254:9300}{data=false, local=false, master=true}])
15:02:14.949 INFO org.elasticsearch.node - [d1b00043-fc5e-4829-8044-78da600e1d68] started
15:02:15.067 INFO c.e.x.c.impl.ClusterManagerImpl - Adding cluster-provider: elasticsearch
15:02:15.388 INFO org.elasticsearch.cluster.service - [d1b00043-fc5e-4829-8044-78da600e1d68] added {{7c32f499-b281-4098-9e17-cc582b6b2be3}{2za6n9LtRaqsmONXsBBHeg}{10.0.0.99}{10.0.0.99:9300}{data=false, local=false, master=false},{c5a40605-aa77-4953-b17b-c97b00b296de}{XnOr-sEBQOOYvrog2QInMQ}{10.0.11.109}{10.0.11.109:9300}{data=false, local=false, master=false},{bd16dfd8-56df-4e0d-85e3-8d13cb34cd21}{4Qj44JbKSFqGeTonzWLNTg}{10.0.24.198}{10.0.24.198:9300}{data=false, local=false, master=false},}, reason: zen-disco-receive(from master [{267465a9-3baf-4da5-8b25-180fd2ba182c}{QfV307dSQOumgp94KAGuIA}{10.0.25.254}{10.0.25.254:9300}{data=false, local=false, master=true}])
15:02:15.408 INFO org.eclipse.jetty.util.log - Logging initialized @11156ms to org.eclipse.jetty.util.log.Slf4jLog
15:02:15.592 INFO org.eclipse.jetty.server.Server - jetty-9.4.28.v20200408; built: 2020-04-08T17:49:39.557Z; git: ab228fde9e55e9164c738d7fa121f8ac5acd51c9; jvm 11.0.8+10
15:02:15.624 INFO org.eclipse.jetty.server.session - DefaultSessionIdManager workerName=d1b00043-fc5e-4829-8044-78da600e1d68
15:02:15.626 INFO org.eclipse.jetty.server.session - No SessionScavenger set, using defaults
15:02:15.628 INFO org.eclipse.jetty.server.session - d1b00043-fc5e-4829-8044-78da600e1d68 Scavenging every 600000ms
15:02:15.642 INFO o.e.j.server.handler.ContextHandler - Started o.e.j.s.ServletContextHandler@f17462c{/,null,AVAILABLE,@xp}
15:02:15.643 INFO o.e.j.server.handler.ContextHandler - Started o.e.j.s.ServletContextHandler@2169a7fb{/,null,AVAILABLE,@api}
15:02:15.644 INFO o.e.j.server.handler.ContextHandler - Started o.e.j.s.ServletContextHandler@5b6082f8{/,null,AVAILABLE,@status}
15:02:15.658 INFO o.e.jetty.server.AbstractConnector - Started xp@668df19d{HTTP/1.1, (http/1.1)}{0.0.0.0:8080}
15:02:15.660 INFO o.e.jetty.server.AbstractConnector - Started api@106a9ccf{HTTP/1.1, (http/1.1)}{0.0.0.0:4848}
15:02:15.661 INFO o.e.jetty.server.AbstractConnector - Started status@5ffc627c{HTTP/1.1, (http/1.1)}{0.0.0.0:2609}
15:02:15.662 INFO org.eclipse.jetty.server.Server - Started @11410ms
15:02:15.671 INFO c.e.xp.web.jetty.impl.JettyActivator - Started Jetty
15:02:15.672 INFO c.e.xp.web.jetty.impl.JettyActivator - Listening on ports 8080, 4848 and 2609
15:02:16.781 INFO c.e.x.c.i.a.ApplicationRegistryImpl - Registering configured application com.enonic.xp.app.system bundle 87
15:02:16.820 INFO c.e.x.c.i.a.ApplicationRegistryImpl - Registering configured application com.enonic.xp.app.applications bundle 89
15:02:16.837 INFO c.e.x.c.i.a.ApplicationRegistryImpl - Registering configured application com.enonic.xp.app.main bundle 90
15:02:16.843 INFO c.e.x.c.i.a.ApplicationRegistryImpl - Registering configured application com.enonic.xp.app.standardidprovider bundle 91
15:02:16.845 INFO c.e.x.c.i.a.ApplicationRegistryImpl - Registering configured application com.enonic.xp.app.users bundle 92
15:02:16.846 INFO E.F.org.apache.felix.framework - FrameworkEvent STARTLEVEL CHANGED
15:02:16.847 INFO c.e.x.l.i.framework.FrameworkService - Started Enonic XP in 11909 ms
15:02:20.460 ERROR c.e.x.c.impl.ClusterManagerImpl - Provider elasticsearch not healthy: {
  "cluster_name" : "enonic-ecs-tst",
  "status" : "red",
  "timed_out" : true,
  "number_of_nodes" : 7,
  "number_of_data_nodes" : 0,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 0.0
}

(the same ERROR block repeats every ~6 seconds: 15:02:26.466, 15:02:32.470, 15:02:38.473, …)

Thanks & Regards,
Suranga