Monday, 12 September 2016

What is Java Content Repository

JSR-170 defines itself as "a standard, implementation independent way to access content bi-directionally on a granular level within a content repository," and goes on to define a content repository as "a high-level information management system that is a superset of traditional data repositories, [which] implements 'content services' such as: author based versioning, full textual searching, fine grained access control, content categorization and content event monitoring."
The Java Content Repository API (JSR-170) is an attempt to standardize an API that can be used for accessing a content repository. If you're not familiar with content management systems (CMS) such as Documentum, Vignette, or FileNet, then you must be wondering what a content repository is. Think of a content repository as a generic application "data store" that can be used for storing both text and binary data (images, word processor documents, PDFs, etc.). One key feature of a content repository is that you don't have to worry about how the data is actually stored: data could be stored in an RDBMS, in a filesystem, or as an XML document. In addition to providing services for storing and retrieving your data, most content repositories provide advanced services such as uniform access control, searching, versioning, observation, locking, and more.

Various CMSs from different vendors have been on the market for quite some time, and all of these CMSs ship their own version of a content repository. The problem is that each CMS vendor provides its own API for interacting with the content repository shipped with that vendor's CMS. This is a problem for the application developer, who has to learn a particular vendor's API and potentially tie the code to one particular CMS implementation.

JSR-170 tries to solve this problem by standardizing the API that should be used for connecting to any content repository. With JSR-170, you develop code using only the javax.jcr.* classes and interfaces. Such code should be able to work with any JSR-170 compliant content repository.

This article is a step-by-step tutorial for newcomers to JSR-170. I've decided to use Apache Jackrabbit, the reference implementation of JSR-170, as the content repository. I'll start the discussion by talking a little more about what a content repository is and why the content repository API needs to be standardized. After that I'll introduce you to JSR-170 by discussing the repository model defined by JSR-170. Next I will talk about what Apache Jackrabbit is, how to build it, and how to configure it for use. Once Apache Jackrabbit is set up, I will develop a sample application for demonstrating the basic features of the JSR-170 API.

Need for Java Content Repository API


As the number of vendors offering proprietary content repositories has increased, the need for a common programmatic interface to these repositories has become apparent, and that's where JSR-170 comes into play. JSR-170 defines a programmatic interface that should be used for connecting to a content repository. You can think of JSR-170 as a JDBC-like API for content repositories, allowing you to develop your program independently of any particular content repository implementation. At runtime, you can configure this program to work with a natively JSR-170 compliant content repository (e.g., Communique or Apache Jackrabbit). If your repository is not natively JSR-170 compliant (e.g., Documentum or Vignette), you can use a repository-specific JSR-170 driver that converts your JSR-170 method calls into repository-specific method calls.

CMSs are quite an old concept. Some of the common applications of CMSs include a web content management system used to manage content (static HTML files and images) on a company's web site, or a document management system where a company stores scanned copies of all sales orders. There are different CMS vendors in the market that provide this type of application. CMS vendors need a content repository as a backend, one that handles both structured and unstructured content efficiently. By "structured content," we mean content like a news item or press release that is posted in the system and retrieved by queries (e.g., your application's front page should display, say, the 3 latest press releases or 10 latest news items). An example of unstructured content is a scanned copy of a sales order or an image that should be displayed on your corporate website.

To support these CMSs, vendors have developed their own content repositories that ship with their products. They also provide proprietary APIs that can be used for accessing these repositories. As the number of CMS vendors increases, the need to standardize this API becomes apparent, and that's where JSR-170 comes into play.

Figure 1 describes the structure of an application developed using the JSR-170 API. At run time, this application can work with content repository 1, 2, or 3. Of these, only content repository 2 is natively JSR-170 compliant; the other two repositories need JSR-170 drivers for interacting with a JSR-170 application. Note one more thing: your application does not have to worry about how the actual content is stored. Content repository 1 may use an RDBMS as its underlying data store, whereas content repository 2 may use the filesystem, and some other repository could use a mix of these.

Figure 1. Structure of JSR-170 compliant application
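
The layering in Figure 1 can be sketched in a few lines of plain Java. This is only a toy illustration of the idea, not the javax.jcr API: the ContentStore interface, the NativeStore and LegacyVendorRepository classes, and the LegacyDriver adapter are all hypothetical names invented for this sketch. The point is that the application method publish() depends only on the common interface, while a driver translates those calls for a repository that has its own API.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy illustration of the layering in Figure 1; none of these types belong to javax.jcr. */
final class DriverSketch {

    /** Hypothetical stand-in for the standard API that the application codes against. */
    interface ContentStore {
        void write(String path, String value);
        String read(String path);
    }

    /** A repository that implements the standard interface natively. */
    static final class NativeStore implements ContentStore {
        private final Map<String, String> data = new HashMap<>();
        @Override public void write(String path, String value) { data.put(path, value); }
        @Override public String read(String path) { return data.get(path); }
    }

    /** Hypothetical proprietary repository with its own, incompatible API. */
    static final class LegacyVendorRepository {
        private final Map<String, String> documents = new HashMap<>();
        void storeDocument(String id, String body) { documents.put(id, body); }
        String fetchDocument(String id) { return documents.get(id); }
    }

    /** The "driver": translates standard calls into vendor-specific calls. */
    static final class LegacyDriver implements ContentStore {
        private final LegacyVendorRepository vendor;
        LegacyDriver(LegacyVendorRepository vendor) { this.vendor = vendor; }
        @Override public void write(String path, String value) { vendor.storeDocument(path, value); }
        @Override public String read(String path) { return vendor.fetchDocument(path); }
    }

    /** Application code: it only ever sees the standard interface. */
    static String publish(ContentStore store, String title) {
        store.write("/pressRelease/title", title);
        return store.read("/pressRelease/title");
    }

    public static void main(String[] args) {
        // The same application code runs against both kinds of repository.
        System.out.println(publish(new NativeStore(), "Hello"));
        System.out.println(publish(new LegacyDriver(new LegacyVendorRepository()), "Hello"));
    }
}
```

Swapping the repository means changing only the object passed to publish(), which is the same property the JDBC comparison is getting at.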

The JSR-170 API has different advantages for different stakeholders in the content repository space.
  • Developers do not have to spend time learning each vendor's repository-specific API. Instead, once a developer is comfortable with JSR-170, she should be able to work with any JSR-170 compliant content repository. In the past, developers had to choose between a CMS with great features and poor development tools, or one with great development tools but poor features. Now that the interface between content repository and CMS applications is standardized, you can pick the best of both worlds.
  • Corporations won't have to face the problem of vendor lock-in. Many corporations have more than one CMS, either because different departments chose different CMSs in the past or because an acquired company used a different CMS. In the past, corporations spent a lot of money getting these different systems to interact with each other. With JSR-170, they can expect the same application to work with every JSR-170 compliant CMS.
  • CMS vendors were forced to develop and maintain their own content repository implementations, which meant lots of infrastructure code. Now they can leave development of the content repository to some other vendor and concentrate more on their core competency: developing CMS applications.

Content Repository Model


JSR-170 says that a content repository is composed of one or more workspaces, which should normally contain similar content. Each workspace contains a single rooted tree of items. An item is either a node or a property. Each node may have zero or more child nodes and zero or more child properties. Every workspace has exactly one root node; the root node has no parent, and every other node has exactly one parent. Properties have one node as a parent and cannot have children; they are the leaves of the tree. All of the actual content in the repository is stored within the values of the properties.

Figure 2 describes a content repository model for a sample blogging application. Every child node of the root node represents one blog entry. Any actual data related to a blog entry is stored as properties of blogEntry. The properties blogTitle, blogAuthor, and creationTime should all be self-evident, while the blogContent property contains the actual entry text, and the blogAttachment property holds an attached image as binary data:

Figure 2. Content repository model
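
To make the shape of this model concrete, here is a minimal in-memory sketch of the tree for the blogging example. It is a toy model written for this article's example only; the Node class below is hypothetical and is not javax.jcr.Node. It also simplifies the real model by requiring child names to be unique under a parent, and the property values are placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a workspace tree; it mimics the shape of the model, not the javax.jcr API. */
final class ItemTreeSketch {

    static final class Node {
        final String name;
        final Node parent; // null only for the root node
        final Map<String, Node> children = new LinkedHashMap<>();
        // Properties are leaves: they carry values and have no children.
        final Map<String, Object> properties = new LinkedHashMap<>();

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
        }

        /** Adds a child node; every non-root node gets exactly one parent. */
        Node addNode(String childName) {
            Node child = new Node(childName, this);
            children.put(childName, child);
            return child;
        }

        void setProperty(String propertyName, Object value) {
            properties.put(propertyName, value);
        }

        /** Absolute path of this node, with "/" denoting the root. */
        String path() {
            if (parent == null) return "/";
            String parentPath = parent.path();
            return parentPath.equals("/") ? "/" + name : parentPath + "/" + name;
        }
    }

    /** Builds a workspace with a single blog entry, following Figure 2. */
    static Node sampleWorkspace() {
        Node root = new Node("", null);
        Node entry = root.addNode("blogEntry");
        entry.setProperty("blogTitle", "First post");          // placeholder value
        entry.setProperty("blogAuthor", "alice");              // placeholder value
        entry.setProperty("creationTime", 0L);                 // placeholder timestamp
        entry.setProperty("blogContent", "Hello, repository"); // placeholder value
        entry.setProperty("blogAttachment", new byte[] {1, 2, 3}); // stands in for image bytes
        return root;
    }

    public static void main(String[] args) {
        Node root = sampleWorkspace();
        for (Node entry : root.children.values()) {
            System.out.println(entry.path() + " has properties " + entry.properties.keySet());
        }
    }
}
```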

In addition to this repository model, JSR-170 also defines different features or operations that should be supported by a compliant repository. To make it easy for existing CMS vendors to adopt the new standard, JSR-170 introduces the concept of compliance levels, which define the set of features that must be supported for a given level of compliance. JSR-170 defines two compliance levels plus a set of optional features:
  • Level 1 defines a read-only repository: This includes functionality for reading repository content, exporting content to XML, and searching. This functionality should meet the needs of presentation templates and basic portal applications, which make up a large portion of the existing codebase of content-related applications. Level 1 is also designed to be easy to implement on top of an existing content repository.
  • Level 2 defines a writable repository: A Level 2 repository is a superset of Level 1. In addition to Level 1's functionality, it defines methods for writing content and importing content from XML. Applications written against Level 2 features include any application that generates data, information, or content, both structured and unstructured.
  • Advanced options: In addition to Level 1 or Level 2 features, the specification defines five additional functional blocks: Versioning, (JTA) Transactions, Query using SQL, Explicit Locking, and Content Observation. In addition to being either Level 1 or Level 2 compliant, any repository can decide to implement one or more of these functional blocks. A repository that implements all of these features in addition to being Level 2 compliant can be used as a general purpose off-the-shelf infrastructure for content management, document management, code management, or just about any other application that persists content.
So, if you are a CMS vendor, the first step is to make your repository Level 1 compliant. As time progresses, you can decide to move to Level 2 compliance and implement advanced features based on your needs or client base.
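
An application can check at run time which level and which optional blocks a repository claims to support. In the JSR-170 API this is done through string descriptors returned by Repository.getDescriptor(). The sketch below does not use javax.jcr; it uses a plain Map as a stand-in for those descriptors, and the key strings are written from my recollection of the JSR-170 descriptor names, so treat them as assumptions and prefer the constants defined on javax.jcr.Repository in real code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of feature detection; a Map stands in for the repository's descriptors. */
final class ComplianceSketch {

    // Assumed descriptor keys for the five optional functional blocks.
    static final String[] OPTIONAL_BLOCKS = {
        "option.versioning.supported",
        "option.transactions.supported",
        "option.query.sql.supported",
        "option.locking.supported",
        "option.observation.supported"
    };

    /** A feature counts as supported only if its descriptor is the string "true". */
    static boolean supports(Map<String, String> descriptors, String key) {
        return Boolean.parseBoolean(descriptors.get(key));
    }

    /** Returns 0 (not compliant), 1 (read-only), or 2 (writable). */
    static int level(Map<String, String> descriptors) {
        if (!supports(descriptors, "level.1.supported")) return 0;
        return supports(descriptors, "level.2.supported") ? 2 : 1;
    }

    /** Lists the optional blocks that the repository reports as supported. */
    static List<String> optionalBlocks(Map<String, String> descriptors) {
        List<String> supported = new ArrayList<>();
        for (String key : OPTIONAL_BLOCKS) {
            if (supports(descriptors, key)) supported.add(key);
        }
        return supported;
    }

    public static void main(String[] args) {
        // Hypothetical repository: Level 2 with versioning only.
        Map<String, String> descriptors = new HashMap<>();
        descriptors.put("level.1.supported", "true");
        descriptors.put("level.2.supported", "true");
        descriptors.put("option.versioning.supported", "true");
        System.out.println("Level " + level(descriptors) + ", options " + optionalBlocks(descriptors));
    }
}
```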

What Is Apache Jackrabbit?


Apache Jackrabbit is a fully JSR-170 compliant repository: it is Level 2 compliant and implements all optional feature blocks. Beyond the JSR-170 API, Jackrabbit features numerous extensions and administrative features that are needed to run a repository but are not specified by JSR-170.

We have decided to use Apache Jackrabbit as the content repository in our sample application. One problem with Apache Jackrabbit is that it doesn't offer a binary release, so developers need to build it from source code before installing it. See Building Jackrabbit for information on how to build Apache Jackrabbit from source code.

How to Configure Apache Jackrabbit


After downloading and building the Jackrabbit source code successfully, let's configure it. Jackrabbit needs two parameters at runtime to configure a content repository instance.

1. Repository home directory: The filesystem path of the directory that usually contains all the repository content, search indexes, internal configuration, and other persistent information managed within the content repository. The directory structure of the content repository will look something like this:

 c:/temp
        |
        |--Blogging
                |
                |-repository
                |       |
                |       |-index
                |       |-meta
                |       |-namespaces
                |       |-nodetypes             
                |
                |-version
                |
                |-workspaces
                        |
                        |--default
In this case, the value of the repository home directory parameter should be c:/temp/Blogging.

2. Repository configuration file: The filesystem path of the repository configuration XML file. This file contains configuration information for the repository, including class names for Jackrabbit components (deciding which implementation we want to use) and configuration information required for that component. Take a look at the following listing, which represents what a typical configuration file would look like:

<Repository>
 <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
  <param name="path" value="${rep.home}/repository"/>
 </FileSystem>
 <Security appName="Jackrabbit">
  <AccessManager class="org.apache.jackrabbit.core.security.SimpleAccessManager"/>
  <LoginModule class="org.apache.jackrabbit.core.security.SimpleLoginModule">
    <param name="anonymousId" value="anonymous"/>
  </LoginModule>
 </Security>
 <Workspaces rootPath="${rep.home}/workspaces" defaultWorkspace="default"/>
 <Workspace name="${wsp.name}">
  <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
   <param name="path" value="${wsp.home}"/>
  </FileSystem>
  <PersistenceManager 
        class="org.apache.jackrabbit.core.state.db.DerbyPersistenceManager">
   <param name="url" value="jdbc:derby:${wsp.home}/db;create=true"/>
   <param name="schemaObjectPrefix" value="${wsp.name}_"/>
  </PersistenceManager>
  <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
   <param name="path" value="${wsp.home}/index"/>
  </SearchIndex>
 </Workspace>
 <Versioning rootPath="${rep.home}/version">
  <FileSystem class="org.apache.jackrabbit.core.fs.local.LocalFileSystem">
   <param name="path" value="${rep.home}/version" />
  </FileSystem>
  <PersistenceManager 
        class="org.apache.jackrabbit.core.state.db.DerbyPersistenceManager">
   <param name="url" value="jdbc:derby:${rep.home}/version/db;create=true"/>
   <param name="schemaObjectPrefix" value="version_"/>
  </PersistenceManager>
 </Versioning>
 <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
  <param name="path" value="${rep.home}/repository/index"/>
 </SearchIndex>
</Repository>
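
The ${rep.home}, ${wsp.name}, and ${wsp.home} placeholders in this listing are replaced with concrete values when the configuration is loaded. The following sketch shows the idea of that substitution using only the standard library. It is not Jackrabbit's parser; the PlaceholderSketch class is hypothetical, and the rule that a workspace's home is the <Workspaces> rootPath followed by the workspace name is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of ${...} expansion for configuration values; not Jackrabbit's own code. */
final class PlaceholderSketch {

    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces each ${name} with its value; unknown names are left untouched. */
    static String expand(String template, Map<String, String> variables) {
        Matcher matcher = VARIABLE.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = variables.get(matcher.group(1));
            String replacement = (value != null) ? value : matcher.group();
            // quoteReplacement keeps '$' and '\' in values from being treated specially.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    /** Builds the variables for one workspace of a repository. */
    static Map<String, String> variablesFor(String repositoryHome, String workspaceName) {
        Map<String, String> variables = new LinkedHashMap<>();
        variables.put("rep.home", repositoryHome);
        variables.put("wsp.name", workspaceName);
        // Assumption: workspace home = rootPath ("${rep.home}/workspaces") + "/" + name.
        variables.put("wsp.home", repositoryHome + "/workspaces/" + workspaceName);
        return variables;
    }

    public static void main(String[] args) {
        Map<String, String> variables = variablesFor("c:/temp/Blogging", "default");
        System.out.println(expand("jdbc:derby:${wsp.home}/db;create=true", variables));
        System.out.println(expand("${wsp.name}_", variables));
        System.out.println(expand("${rep.home}/version", variables));
    }
}
```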

In the repository configuration file, <Repository> is the root element. One <Repository> element holds the configuration of one repository and contains the following elements:
  • <FileSystem>: This element configures the virtual filesystem implementation used for storing global data, that is, data that applies at the repository level, such as registered namespaces and custom node types. Apache Jackrabbit provides a few options for storing this data. One option is to store it on the underlying filesystem, which we are doing in our sample application by using LocalFileSystem. If you want this data to be stored in a database, then use DbFileSystem.
  • <Security>: This element contains the security configuration for this repository. It has two child elements: <AccessManager> and <LoginModule>. The class attribute of <AccessManager> names the class that is queried to determine whether a user has the right to perform a particular action on a particular item. The <LoginModule> element lets you configure a class of LoginModule type, which is used for implementing authentication.
  • <Workspaces>: This element holds configuration that is common across all workspaces in the repository. Its rootPath attribute points to the root directory containing all workspace folders. In our sample directory configuration it would be c:/temp/Blogging/workspaces. The defaultWorkspace attribute contains the name of the default workspace.
  • <Workspace>: This element represents the default template for all workspaces in this repository. So, when you create a new workspace in this repository, its workspace.xml file will look like this element. The <Workspace> element has three child elements. The first is <FileSystem>, which configures the virtual filesystem that should be used for storing data related to this workspace. The second, <PersistenceManager>, indicates how you want to persist the content of this workspace. Apache Jackrabbit gives you a choice of storing it on the filesystem, in a database, in memory as a hashtable, or as an XML file. In our sample we are planning to persist that content in a Derby database. The last element is <SearchIndex>, which is optional. Its class attribute names the class that is used for indexing as well as for actual query execution.
  • <Versioning>: This element configures versioning storage. You may have noticed that it contains the same child elements, <FileSystem> and <PersistenceManager>, as <Workspace>. That's because JSR-170 treats versions as nodes, and so the same structure can be reused.
  • <SearchIndex>: This element configures the index that is used for searching repository-wide content.
The repository home directory and repository configuration file parameters are passed either directly to Jackrabbit when a repository instance is created or indirectly through settings for the JNDI object factory. You can set the value of the org.apache.jackrabbit.repository.home system property to point to the repository home directory. In our example, we will set it to c:/temp/Blogging. Similarly, if you have a repository.xml file and you want to use it for setting up the repository, you can set the value of the org.apache.jackrabbit.repository.conf system property to point to your repository.xml. In our case, we don't want to use an existing repository.xml; instead, we want Jackrabbit to generate a default repository.xml file for us. If you don't set either of these properties, then Jackrabbit will treat the current folder as the home directory and create the repository directory structure as well as a repository.xml file in it. Refer to the Apache Jackrabbit online documentation to configure Apache Tomcat to create a repository configuration object and bind it in the JNDI tree.
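
The lookup described above can be sketched with plain system properties. The two property names come from the text; everything else, including the RepositoryLocationSketch class and the choice to look for repository.xml inside the home directory when no configuration file is given, is an assumption made for illustration and is not Jackrabbit's actual lookup code.

```java
import java.io.File;

/** Sketch of resolving the two Jackrabbit parameters from system properties. */
final class RepositoryLocationSketch {

    static final String HOME_PROPERTY = "org.apache.jackrabbit.repository.home";
    static final String CONF_PROPERTY = "org.apache.jackrabbit.repository.conf";

    /** Uses the configured home directory, or the current folder if none is set. */
    static File resolveHome() {
        String home = System.getProperty(HOME_PROPERTY);
        return new File(home != null ? home : System.getProperty("user.dir"));
    }

    /** Uses the configured file, or repository.xml in the home directory (assumed default). */
    static File resolveConf() {
        String conf = System.getProperty(CONF_PROPERTY);
        return conf != null ? new File(conf) : new File(resolveHome(), "repository.xml");
    }

    public static void main(String[] args) {
        // Same effect as passing -Dorg.apache.jackrabbit.repository.home=c:/temp/Blogging
        System.setProperty(HOME_PROPERTY, "c:/temp/Blogging");
        System.out.println("home: " + resolveHome());
        System.out.println("conf: " + resolveConf());
    }
}
```

In practice you would more likely pass the property on the command line with a -D option than set it in code, so that the same build can point at different repositories.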