Working with large amounts of data/content/content-types

Continuing the discussion from Publishing Large amount of content is slow:

So I have 2.3 million addresses I’m fetching from a database, which I want to add to Enonic XP.
I would also like to set up a nightly sync…
It takes about 15 minutes to fetch the data from the external db, but adding it to Enonic XP seems to take forever.

Any thoughts on how I should go about doing this?

Should I create/modify and publish one address at a time, or
create/modify all 2.3 million addresses one at a time and then publish all of them?
I’m skeptical that the latter would even work…

Perhaps something in between.
Is there a batch interface?

Is there such a thing as an unversioned repository, so I don’t have to check whether something is modified and can just dump it in?

Note: even if I increase the Java memory, 250,000 rows is about the maximum I can fetch from the db at a time anyway.

The most I have added thus far is 50K rows.

I tried deleting them in admin/content-studio by selecting the top folder, but it failed after marking about 25K as pending delete. I’m having trouble deleting and publishing the delete of the remaining 25K…

This leads me to be very skeptical about adding 2.3 million…

How would Enonic XP handle 2.3 million addresses in one folder?
I made the structure hierarchical, fylke/kommune/sted/gate/address (county/municipality/place/street/address), because it makes sense.

Hi.

2.3 million entries will be a bit too much, I think, for the current content API, and I don’t think Content Studio will behave very well when trying to do things like selecting or publishing hundreds of thousands of content items.

We are planning to create a low-level collection type, “Bucket”, to handle large datasets, where things like versioning and security will be optional and thus much quicker to process, but that is a bit further down the track.

Is there another way to handle this data for now, e.g. fetching it on demand or something? I’ll have to do some research and come back to you on this.

I’d very much like to use Enonic/Elasticsearch’s smart queries with aggregations etc., like I have done with my other content types.

I don’t think I can use the db as it seems slow and I don’t manage it.

I have the same data in a Solr instance (which only requires 17 minutes for a full sync, btw),
but the Rails backend on top of Solr does not provide aggregation.
Only the Rails frontend has aggregation.

So I guess I could extend the backend with aggregations.

But as I said, I would rather be able to use Enonic XP’s Elasticsearch…
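
For reference, this is the kind of query and aggregation meant here: a minimal sketch using lib-content’s query function with a terms aggregation. The content type and field names (':address', 'data.kommune', 'data.gate') are assumptions for illustration.

var contentLib = require('/lib/xp/content');

// Structured query plus a terms aggregation on municipality.
var result = contentLib.query({
    start: 0,
    count: 10,
    contentTypes: [app.name + ':address'],
    query: "data.gate LIKE 'storgata*'",
    aggregations: {
        byKommune: {
            terms: {
                field: 'data.kommune',
                order: '_count desc',
                size: 20
            }
        }
    }
});

log.info('Total hits: ' + result.total);
result.aggregations.byKommune.buckets.forEach(function (bucket) {
    log.info(bucket.key + ': ' + bucket.docCount);
});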

If you mail me the dataset ([email protected]), I will do some testing and see if something can be done.

I could, if
http://repo.enonic.com/public/com/enonic/xp/docs/6.4.2/docs-6.4.2-libdoc.zip!/module-lib_xp_io.html
had a write-to-file function… :slight_smile:

How do I use Node’s fs in Enonic XP?

Do you have any better solutions for writing to a file from JS?

Perhaps something here: https://wiki.openjdk.java.net/display/Nashorn/Nashorn+extensions

Ah, so you are fetching from the db in JS and then creating content? That will be too slow, I guess. The content layer is made to be as fail-safe as possible, and thus does a lot of checks and validations. Also, the index is refreshed after each operation to ensure that content is available for search immediately.
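
(For context, the pattern being described looks roughly like the sketch below; the JDBC driver, connection details and field names are placeholders. Every contentLib.create call is validated and followed by an index refresh, which is what makes it slow at millions of rows.)

var contentLib = require('/lib/xp/content');

// Plain JDBC through Nashorn's Java interop; URL and credentials are placeholders.
var DriverManager = Java.type('java.sql.DriverManager');
var conn = DriverManager.getConnection('jdbc:postgresql://dbhost/addr', 'user', 'secret');
var rs = conn.createStatement().executeQuery('SELECT gate, husnr, sted FROM address');

while (rs.next()) {
    // Each create is validated and indexed individually, so per-row cost dominates.
    contentLib.create({
        parentPath: '/addresses',
        displayName: rs.getString('gate') + ' ' + rs.getString('husnr'),
        contentType: app.name + ':address',
        data: {
            gate: rs.getString('gate'),
            husnr: rs.getString('husnr'),
            sted: rs.getString('sted')
        }
    });
}
rs.close();
conn.close();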

Some options:

  1. Export the data from the db to an import data structure Enonic can parse. This can be imported into both draft and master, making the publish step unnecessary.
  2. Export the data from the db to CSV, and create a Java helper that reads the CSV files and uses the node layer to create the content (this is faster and can be done without refreshing indexes etc.). I once started a small project for doing this kind of thing; check out https://github.com/runarmyklebust/csv-loader for ideas. A rough sketch of the general shape follows below.
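
The sketch referenced in option 2 might look like this in JS, assuming the node API (/lib/xp/node, available in newer XP releases), the default repo, and a semicolon-separated CSV with columns gate;husnr;sted; a real bulk loader per option 2 would do the same from Java.

var nodeLib = require('/lib/xp/node');
var Files = Java.type('java.nio.file.Files');
var Paths = Java.type('java.nio.file.Paths');

// Connect directly to the repo's draft branch, bypassing the content layer.
var repo = nodeLib.connect({
    repoId: 'com.enonic.cms.default',
    branch: 'draft'
});

// Read the exported CSV; path and column layout are assumptions.
var lines = Files.readAllLines(Paths.get('/tmp/addresses.csv'));

for (var i = 1; i < lines.size(); i++) { // skip the header row
    var cols = lines.get(i).split(';');
    repo.create({
        _parentPath: '/addresses',
        gate: cols[0],
        husnr: cols[1],
        sted: cols[2]
    });
}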

If you are able to produce #2, I could do some research for you and see if it’s a feasible solution.

Yeah, the problem is I do not program Java.
I believe we chose Enonic XP because it was JS… so if we can’t program it in JS, then we have a problem.

Can I access the node layer from JS?

Just posting this since I found it:

// Nashorn's Java interop: use java.io.FileWriter directly from JS.
var FileWriter = Java.type("java.io.FileWriter");

// caldir, year and links come from the surrounding script.
var olinkfile = caldir + "/" + year + "_links.html";
var fw = new FileWriter(olinkfile);
fw.write(links.join("\n"));
fw.write("\n");
fw.close(); // forgetting to close it results in a truncated file

Here: http://stackoverflow.com/questions/34279298/javascript-nashorn-scripting-mode-how-to-write-to-file

By “produce #2” I meant the CSV file, btw :smile:; the idea was to create a tool for doing large CSV imports.

I was thinking of exporting to a JSON file, so that it matches the content type…
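
Building on the FileWriter snippet above, that export could be as simple as this sketch (the rows array and output path are placeholders):

var FileWriter = Java.type('java.io.FileWriter');

// rows: the address objects fetched from the db, shaped like the content type's data.
var rows = [{gate: 'Storgata', husnr: '1', sted: 'Oslo'}];

var fw = new FileWriter('/tmp/addresses.json');
fw.write(JSON.stringify(rows, null, 2));
fw.close(); // close, or the file may end up truncated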

@ComLock, just a question: have you worked with around 100k content objects in Enonic? Does it support that number? I’m making an organization site that has around this number of entries, but I’m not sure Enonic will handle it xD

100k entries should not be a problem, I think; publishing all of them while calculating dependencies may be an issue, but it depends on the structure of the data. The main issue is the (one-time?) operation to get it into XP as content.

I was able to import 2.3 million addresses as nodes, and I’m doing further testing with large datasets now.
I’m also working on the xpLoader app to support bulk loading of large datasets, but it’s still not ready for public usage.

In my case these 100k don’t have relationships. I’ll try it then :smiley: thanks

Yes, try it out and give me an update. How do you plan to create the data? If you’d like, I could do some testing on the data as well.

Would be cool to get a status update on this. Any progress?

I’ve imported my 100k and it worked. Some of the objects have a bug that does not let me edit them in the admin interface (but in the controllers they’re OK). I reported this at Bad HTTP parsed.

Good. I’m doing testing with large datasets at the moment, and things seem to be working pretty OK, but there is of course room for improvements:

In 6.6, there will be a switch to avoid refreshing the index between each content creation in the API, which will speed up the process a lot. On my Mac, creating content currently takes about 5 ms per content item, while nodes are at 2 ms per node.
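
The bulk-load pattern this enables would look roughly like the sketch below. The node API calls (nodeLib.connect, repo.refresh) are modeled on later XP releases, and the exact 6.6 switch is not shown, since its API was not yet published; all names here are assumptions.

var nodeLib = require('/lib/xp/node');

var repo = nodeLib.connect({
    repoId: 'com.enonic.cms.default',
    branch: 'draft'
});

// addresses: the rows fetched from the db (placeholder).
addresses.forEach(function (address) {
    // With the 6.6 switch, each create would skip the per-item index refresh here.
    repo.create({
        _parentPath: '/addresses',
        gate: address.gate,
        husnr: address.husnr,
        sted: address.sted
    });
});

// Then make everything searchable with one explicit refresh at the end.
repo.refresh('SEARCH');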
