Solr Get Updated Data

In the Apache Solr MySql Hello World, we import data from MySQL to Solr which is full import, that is import all data from DB.

However, we just need to import updated data to Solr in following cases:

  1. if there is new record inserted into the MySQL
  2. one record is updated in DB
  3. one record is deleted

We want Solr insert/update/delete index if there is changes in DB.

This is called Delta Import in Solr.

Delta Import

In order to achieve delta import, we need to modify data-config.xml which is located in <solr home>/server/solr/<core name>/conf/.

Change it to following:

<dataConfig>

  <dataSource type="JdbcDataSource" driver="com.mysql.cj.jdbc.Driver"
    url="jdbc:mysql://localhost:3306/test" user="root" password="123456"/>
  <document>
  
    <entity name="CUSTOMER" pk="CUSTOMER_ID"
            query="select CUSTOMER_ID id, CUSTOMER_ID, CUSTOMER_CODE, NAME from CUSTOMER"
            deltaImportQuery="select CUSTOMER_ID id, CUSTOMER_ID, CUSTOMER_CODE, NAME from CUSTOMER where CUSTOMER_ID='${dih.delta.CUSTOMER_ID}'"
            deltaQuery="select CUSTOMER_ID from CUSTOMER where update_datetime &gt; '${dih.last_index_time}'"
            >      
    </entity>
  </document>
</dataConfig>

Then execute delta import:

_config.yml

If success, the GUI will show:

Indexing completed. Added/Updated: 1 documents. Deleted 0 documents.
Requests: 2 , Fetched: 2 , Skipped: 0 , Processed: 1 

Commit

The commit of above screenshot has to be checked, otherwise we cannot query the updated data in Solr.

For example, we update name from abc to 123 in DB, then do delta import without commit checked.

Then query with name:123, solr will return no result.

Do delta import from code

Delta Import also can be done from code like:

DefaultHttpClient httpClient = new DefaultHttpClient();
HttpPost httpPost = new HttpPost("http://localhost:8983/solr/helloworld/dataimport?command=delta-import&commit=true");

httpPost.addHeader(HTTP.CONTENT_TYPE, "application/json");	
HttpResponse response = httpClient.execute(httpPost);

HttpEntity entity = response.getEntity();

String body = EntityUtils.toString(entity);

httpClient.getConnectionManager().shutdown();

return body;

dataimport.properties

There is dataimport.properties under the conf folder.

#Wed Jan 16 07:12:33 UTC 2019
last_index_time=2019-01-16 07\:12\:32
CUSTOMER.last_index_time=2019-01-16 07\:12\:32

If full import or delta import success, Solr will update the last_index_time in it.

last_index_time not updated

If deltaImportQuery is deltaImportQuery=select CUSTOMER_ID, CUSTOMER_CODE, NAME from CUSTOMER where CUSTOMER_ID='${dih.delta.customer_id}', but the field column and name in data-config.xml are CUSTOMER_ID, then delta import will not throw any error/warning, just the last_index_time not be updated.

data-config.xml

dataSource

It describes the data source whose data will be imported to Solr.

type

type is optional. The default value is JdbcDataSource.

name

The attribute name can be used if there are multiple datasources used by multiple entities.

password

The password attribute is optional if there is no password set for the DB.

Alternately, the password can be encrypted,

https://lucene.apache.org/solr/guide/7_6/uploading-structured-data-store-data-with-the-data-import-handler.html#encrypting-a-database-password describes how to do this.

entity

For a RDBMS data source, an entity is a view or table, which would be processed by one or more SQL statements to generate a set of rows (documents) with one or more columns (fields).

Note that entity elements can be nested, and this allows the entity relationships in the sample database to be mirrored here,

so that we can generate a denormalized Solr record which may include multiple features for one item, for instance.

pk

It is optional and only needed when using delta-imports.

deltaQuery

It is used in delta import which is used to get id from table when update_datetime great than last_index_time.

deltaImportQuery

It is used in delta import and select all the updated records according to the id which from deltaQuery

field

<field column="CUSTOMER_ID" name="CUSTOMER_ID"/>.

field elements, which map the data source field names to Solr fields, and optionally specify per-field transformations.

If the column name is different from the solr field name(case does not matter), then another attribute name should be given.

Remove old data from Solr

One approach to remove old data from Solr is assign your primary key to uniqueKey in query and deltaImportQuery like:

query="select CUSTOMER_ID id, CUSTOMER_ID, CUSTOMER_CODE, NAME from T_CRM_CUSTOMER"

and

deltaImportQuery="select CUSTOMER_ID id, CUSTOMER_ID, CUSTOMER_CODE, NAME from T_CRM_CUSTOMER where CUSTOMER_ID='${dih.delta.CUSTOMER_ID}'"

wherein the customer_id is primary key of your table, id is the uniqueKey which defined in managed-schema.xml.

Without customer_id id in query and deltaImportQuery, for example, there is one document in solr with name=abc and customer_id=123, if this record is changed to name=qwe and customer_id=123 in DB, after do delta import, both name=abc and name=qwe exist is Solr.

If we don’t have value for unique key id, Solr will generate a unique value for it and update the document according to this value.

Schedule Delta Import

From Solr wiki:

The dataimport scheduler is NOT included in any released Solr version.

This is a proposal with a very old issue in Jira.

The feature may never become real, because all modern operating systems already have scheduling capability built in, and Solr would be reinventing a very old wheel.

References

Solr Ref Guide

DateImportHandler

Written on January 16, 2019