Snowball

Saturday, June 14, 2014

Amazing Google Trends and Correlate

Google Trends (www.google.com/trends) is a great tool for exploring trends in user interest in virtually anything under the sun. It mines Google's collection of all search queries from 2004 onwards, which is a mind-boggling number: assuming 5 billion queries a day (the current rate), the total over the last ten years would be around 5 billion × 3,660 days = 18.3 trillion, i.e. roughly 18,000,000,000,000 search queries! From this, Google Trends identifies user interest across time (2004 onwards) and space (region, city, etc.).
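The back-of-the-envelope estimate above is easy to check in a few lines of Python. This is only a sketch: it assumes a flat 5 billion queries a day for the whole period, which almost certainly overstates the early years.

```python
# Rough estimate only: assumes a constant 5 billion queries per day
# (the current rate quoted in the text) over roughly ten years.
QUERIES_PER_DAY = 5_000_000_000
DAYS = 3660  # about ten years, the figure used in the text

total_queries = QUERIES_PER_DAY * DAYS
print(f"{total_queries:,} queries (~{total_queries / 1e12:.1f} trillion)")
```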

Until recently, in spite of having this mountain of data at its disposal, Google Trends was still quite dumb at understanding the context of search terms. In simple terms, when I type in "Amazon", I could be interested in any number of things: the river Amazon, the company amazon.com, or the Amazon rainforest. To Google Trends it made no difference. To it, "Amazon" was just a term with no meaning, so it aggregated every query that contained the term irrespective of relevance. It was left to the user to somehow communicate the desired context to the tool by excluding terms unrelated to the context of interest. For example, to look at interest in the company amazon.com, I would have typed in something like "amazon.com -rainforest -river", i.e. all search queries that contain the term "amazon.com" but not "rainforest" or "river". This was a workaround, not a foolproof method.
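To make the exclusion workaround concrete, here is a toy sketch of what a "-term" filter does to a handful of queries. The function `matches` and the sample queries are made up for illustration; this is not how Google Trends is implemented.

```python
def matches(query: str, expression: str) -> bool:
    """True if `query` contains every plain term and none of the '-' terms."""
    words = query.lower().split()
    for term in expression.lower().split():
        if term.startswith("-"):
            if term[1:] in words:   # an excluded term is present
                return False
        elif term not in words:     # a required term is missing
            return False
    return True


# Hypothetical search queries, invented for this example.
sample_queries = [
    "amazon.com books",
    "amazon.com rainforest documentary",
    "amazon river length",
]

kept = [q for q in sample_queries if matches(q, "amazon.com -rainforest -river")]
print(kept)
```

The filter only looks at words, not meaning, which is why it stays a workaround: a shopping query that happens to mention "river" would be dropped too.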

That was until Google launched the new version of Trends. In the new version of Trends, Google has been able to "teach" Trends to understand what search terms mean. Using its treasure trove of search queries, Google has been able to categorise all of its search query data content into people, places, and things. This is by no means a small feat and is a shining example of what can be achieved using "big data" analysis.

So now when I type "Amazon" into Trends, it understands that it could be the name of a retail company, a rainforest, a river, or a fictional character, or just a plain term. It offers all of those options for the user to choose from.

Each identified context for a search term is organised as a topic by Trends. All search queries that relate to a particular context are associated with a topic of the same name. Now when I want to find the trends for Amazon the company, I can simply select the associated topic and be sure that the results reflect pure interest in amazon.com, unadulterated by any other context. Google Trends claims to have more than 700,000 topics in its catalogue, with more being added every day.

This stuff is really cool. It is amazing how Google can categorise information about the world by analysing its search query database. What is stopping it from going further and developing a full-fledged classification of all the people, places, and things that exist or have ever existed? If Google has raised the bar by 10x by giving Trends some intelligence, it has raised it by 100x with Google Correlate, which in tandem with Trends puts awesome power in your hands. Using Google Correlate, you can find correlations (and causations, if you are lucky) related to the trends that you see in time and space.

I really enjoy playing with Trends by just typing in terms that interest me and seeing the trends across time and space. For example, I was interested in finding the trend in stress levels over time (2004 onwards) and comparing it across countries. So I typed in two terms, stress and anxiety, which I thought were good proxies for general stress as we understand it, and then added two locations, United States and UK, to bring up the following graph.

It is interesting to note the downward trend until around 2008 and the upward trend from then on. Hmm, maybe the impact of the 2008 financial crisis has something to do with it. The UK seems to be on a higher trajectory than the United States.

Breakup by regions where relative popularity of search queries is the highest

The possibilities are endless. You can estimate the relative market share of, say, cell phone makers by using search queries as a proxy, and drill down by region, city, and time (last 30, 60, 90 days and so on). Or you can look at patterns of diseases in space and time. So take it out for a spin and you might find a gold nugget in there that sends you off in a new direction in your work, or life.

Wednesday, October 19, 2011

HL7 V3 Reference Information Model (RIM) - Data Model Design Principles

The foundation of HL7 V3 is the Reference Information Model (RIM). RIM is the fundamental static model of information from which all other V3 models are derived. Though RIM consists of only six core concepts (or classes), it is flexible and powerful enough to model all kinds of information, including that related to healthcare. What makes RIM flexible and powerful is its use of design principles that produce high quality data models.

Matthew West of Shell Oil Company in a set of three white papers ‘Developing High Quality Data Models’ published in 1994 [1] enumerated six design principles that can be used to produce high quality data models. These design principles are –

Entity types should represent, and be named after, the underlying nature of an object, not the role it plays in a particular context.

Entity types should be part of a sub-type/super-type hierarchy ("class hierarchy" if you're familiar with object oriented terms) in order to define a universal context for the model.

Activities and associations should be represented by entity types (not relationships).

Relationships (in the entity/relationship sense) should only be used to express the involvement of entity types with activities or associations.

Entity types should have a single attribute as their primary unique identifier. This should be artificial, and not changeable by the user. Relationships should not be used as a part of the primary unique identifier. (They may be part of alternate identifiers.)

Candidate attributes should be suspected of representing relationships to other entity types.

The objective of this document is to highlight how RIM incorporates these design principles to create a flexible and powerful conceptual information model. Understanding these principles and how they relate to RIM can help to get a deeper understanding of the model. In order to use the model effectively in V3 development process, it is important to understand basic underpinnings of the model. People who have basic knowledge of RIM but need to understand it better in order to develop and implement models based on it can benefit from the discussion below.

Reference Model

HL7 V3 Reference Information Model (RIM) is an information model that defines the information from which all information-related content of V3 messages is derived. RIM is the ultimate source of all information models developed as part of the V3 development process. Essentially, it is a common shared view of information semantics and structure that defines and binds concepts into a meaningful, generic abstraction of the real world. HL7 RIM consists of six core classes and several specialized and enumerated sub-types of these core classes.

Figure 1: RIM Backbone

The HL7 V3 development process is a model-driven development methodology based on iteratively constraining the scope and content of information through a linear sequence of constraint models. All models derived from RIM are statements of constraint against the RIM for its use in a specific context. Constraint models narrow down the properties of a class, limit the set of values that an attribute can take, restrict the domain of coded concepts, or restrict the cardinalities of associations between model classes. At each level of constraint, scope and flexibility are reduced while the information model becomes increasingly specific to a particular usage or requirement.

This process of applying multiple sequential constraints to RIM ultimately leads to the constraint model that defines the structure and semantics of the message to be exchanged. To start with, a domain information message model (DIM) is developed using new concepts created by constraining class attributes, data types and relationships from the RIM. DIM is a common shared model of a set of Message Information Models (MIM) in a particular domain. MIM is a specific model of constraint against a DIM and is a common shared information model for a set of messages.

Figure 2: V3 Constraint Models

The first design principle states that –

“Entity types should represent, and be named after, the underlying nature of an object, not the role it plays in a particular context.”

This is a very powerful design principle that disambiguates between “what an entity is” and “what it does in a particular context”. These two things often get mixed up as a single abstraction in the modeling process and introduce inflexibility in the model.

For instance, let us take an example of a person who is a customer as well as a vendor of the same product. In figure 3 below, person is abstracted as a vendor or a customer of a product that he trades in. Class Customer and Vendor are shown as specializations of class Person resulting in two different types of person – Customer or Vendor. In other words a person could either be a customer or a vendor of a product but not both.

Figure 3: Entity and Context

In the redesigned model below, a distinction is made between what the entity is and what it does in the current context. ‘What an entity does’ is moved out of the identity of the entity and into the relationship between the Person and Product classes. The role played by a person in each of the two relationships is based upon what the person does in the context of that relationship. If a person buys a product, he becomes a customer, and if he sells a product, he becomes a vendor of that product. These roles are represented as association roles in the model below. The model recognizes that what a person does in a context is a role played by the person in that context. As we will see in the next section, this relationship is modeled using an associative Role class that defines the competency or role played by an entity in the context of another entity.

Figure 4: Activity as Relationship
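The redesigned model can be sketched in a few lines of Python. This is an illustration only: the class `Trade`, the helper `roles_of`, and the sample names are hypothetical and do not come from the RIM. The point is simply that the role is stored on the relationship, so one person can be both customer and vendor of the same product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str


@dataclass(frozen=True)
class Product:
    name: str


@dataclass(frozen=True)
class Trade:
    """Relationship between a person and a product; the role lives here."""
    person: Person
    product: Product
    role: str  # "customer" or "vendor"


def roles_of(person, product, trades):
    """All roles a person plays with respect to one product."""
    return {t.role for t in trades if t.person == person and t.product == product}


# Hypothetical sample data.
alice = Person("Alice")
widget = Product("Widget")
trades = [Trade(alice, widget, "customer"), Trade(alice, widget, "vendor")]

print(sorted(roles_of(alice, widget, trades)))
```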

In a similar manner, RIM defines ‘Entity’ as a set of information classes that describe ‘things’ such as persons, organizations, places, devices, substances and containers. Entity classes and enumerations thus defined in the RIM are based purely on what an Entity is and are devoid of any superimposed identity based upon what the entity does in the context of another entity. For example, RIM does not define the concept ‘Employee’ as a type of entity but makes it part of the relationship between two entity classes: ‘Employee’ is described as a relationship between the entity classes Person and Organisation.

The second design principle states that –

“Activities and associations should be represented by entity types (not relationships)”

Relationship between classes is usually modeled using association roles that enumerate what a class does in relationship with another class. In figure 5 below, Class Person plays the role of a Patient and class Organisation plays the role of a Provider according to the activities performed by these classes in the relationship. The association between instances of the two classes is a many-to-many association with zero or more instances of Person Entity class associated with zero or more instances of organization entity class.

Figure 5: Association Roles

In this model, even though a Person instance can be associated with multiple instances of the Organisation entity, there can be only one association between any specific pair of Person and Organisation instances. For example, as depicted in figure 6 below, person ‘John Doe’ could be a patient at organisation ‘Apollo Hospital’ at different points in time, so John Doe has more than one association with Apollo Hospital. This information cannot be captured through the simple association of this model, which constrains John Doe to have only one association with Apollo Hospital.

Figure 6: Multiple associations between specific entity instances

To solve this problem, the association between the Person and Organisation classes is modeled as a class that sits between the two and represents the association itself. This ‘Role’ class is an associative class that also represents what one entity class does in the context of another. Any information that pertains to the association between the Person and Organisation classes is captured in this associative class. For example, information related to multiple encounters of John Doe in the role of a patient with Apollo Hospital is captured in the Role class (for example, in a date attribute). Figure 7 shows these associations of John Doe and Apollo Hospital as multiple instances of the Role class.

Figure 7: Example of multiple associations between specific entity instances
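A minimal Python sketch of the associative Role class may help here. The attribute names (`player`, `scoper`, `effective_date`), the helper `associations`, the class codes, and the dates are all assumptions made for illustration, not the normative RIM definitions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entity:
    class_code: str  # illustrative codes: "PSN" for person, "ORG" for organisation
    name: str


@dataclass(frozen=True)
class Role:
    """Associative class: one instance per association between two entities."""
    class_code: str        # "PAT" for patient (illustrative)
    player: Entity         # the entity that plays the role
    scoper: Entity         # the entity in whose context the role is played
    effective_date: date   # hypothetical attribute holding the encounter date


john = Entity("PSN", "John Doe")
apollo = Entity("ORG", "Apollo Hospital")

# Two separate associations between the same pair of entity instances.
# The dates are invented for the example.
roles = [
    Role("PAT", john, apollo, date(2010, 3, 1)),
    Role("PAT", john, apollo, date(2011, 7, 15)),
]


def associations(player, scoper, roles):
    """Every Role instance that links this specific player and scoper."""
    return [r for r in roles if r.player == player and r.scoper == scoper]


for r in associations(john, apollo, roles):
    print(r.class_code, r.effective_date.isoformat())
```

Because each association is its own object, nothing stops the same person from being linked to the same hospital any number of times.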

In addition to providing for specification of attributes that are specific to the role, externalizing Role from an entity also enables association of multiple roles to an entity. Seen in another way, this allows for two entities to be associated in multiple ways via roles played by player entity in context of the scoper entity. This design principle also ties into actor-role pattern that is very useful and commonly used. Figure 8 displays multiple roles played by Person entity in the context of an Organisation.

Figure 8: Multiple relationships between entity types

For example, John Doe, while being a patient at Apollo Hospital, can also be an employee there.

Figure 9: Example of multiple relationships between entity types

Similarly, a many-to-many relationship exists between the Role and Act classes. In the figure below, an entity in the role of a Physician can perform many acts of type observation, and one act of observation can be performed by many physicians.

Figure 10: Many-to-many association between Role and Act class

To resolve the many-to-many relationship between the Role and Act classes, we introduce a Role Participation class that is an associative class between them. As stated in this design principle, all activities between the Role and Act classes should be represented as a class type. A Role Participation class allows all activities performed by a role in an act to be captured as instances of the Participation class. The Role Participation class includes a coded attribute that enumerates the different participation roles that a Role can assume in the context of an Act. Essentially, Role Participation is a contextual role played by an entity in a competency role.

Figure 11: Participation Role Class
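The same idea can be sketched for Role and Act. In this hedged Python example the Participation class resolves the many-to-many relationship; the field names, the helpers `acts_of` and `roles_in`, the codes ("PROV", "OBS", "PRF"), and the sample physicians are illustrative assumptions rather than a faithful rendering of the RIM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    class_code: str   # illustrative: "PROV" for a healthcare provider
    player_name: str


@dataclass(frozen=True)
class Act:
    class_code: str   # illustrative: "OBS" for an observation
    description: str


@dataclass(frozen=True)
class Participation:
    """Associative class between Role and Act."""
    type_code: str    # coded attribute, e.g. "PRF" for performer (illustrative)
    role: Role
    act: Act


# Hypothetical sample data.
physician_a = Role("PROV", "Dr. A")
physician_b = Role("PROV", "Dr. B")
blood_pressure = Act("OBS", "blood pressure reading")
temperature = Act("OBS", "temperature reading")

participations = [
    Participation("PRF", physician_a, blood_pressure),
    Participation("PRF", physician_a, temperature),
    Participation("PRF", physician_b, blood_pressure),
]


def acts_of(role, participations):
    """Acts in which the given role participates."""
    return [p.act for p in participations if p.role == role]


def roles_in(act, participations):
    """Roles that participate in the given act."""
    return [p.role for p in participations if p.act == act]


print(len(acts_of(physician_a, participations)), "acts performed by", physician_a.player_name)
print(len(roles_in(blood_pressure, participations)), "physicians involved in the", blood_pressure.description)
```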

The third design principle states that –

“Entity types should be part of sub-type/super-type class hierarchy”

According to this design principle, any conceptual abstraction of a real world entity should either specialize from another entity or should itself be the object of specialization. In a sub-type / super-type class hierarchy, sub-type inherits all attributes and relationships of the super-type but adds at least one unique attribute and/or relationship that is not present in the super-type. For example, in RIM entity model, generic entity class Living_Subject is specialized into “Person” and “Non Person Living Subject” entity classes. Class ‘Person’ adds new attributes such as ‘address’, ‘marital status code’, etc. that complement and complete the definition of a person as a concept.

Figure 12: Entity Type Specialisation

Extending sub-types from super-types, as explained above, provides what is known as formal justification for setting up a sub-type / super-type class hierarchy.

In situations where sub-types have the same set of attributes and relationships as the super-type, the sub-types serve to illustrate the kinds of things represented by the super-type. This method of organizing real world information provides what is known as informal justification for setting up a sub-type / super-type class hierarchy.

The figure below displays a fragment of the RIM entity class hierarchy, with entity sub-types represented as entities inside other entities. Common attributes shared between an entity super-type and its sub-type(s) are shown in the outer entity. All attributes of the entity super-type are inherited by the sub-type.

Figure 13: Entity Class Hierarchy

Each sub-type (or concept) in the class hierarchy ‘Entity->Living_Subject->Person, NonPersonLivingSubject’ defines attributes in addition to those inherited from the super-type (or concept). This is an example of type specialization that is based upon the formal rationale of specialization.

Sub-types of NonPersonLivingSubject displayed in the right corner box in the diagram above are examples of specialization that is not based upon formal reasoning. Concepts such as Animal, Microorganism, and Plant are sub-types of NonPersonLivingSubject type. No additional attributes have been defined for these specialized concepts. Concepts Animal, Microorganism, and Plant can also be thought of as distinct concepts when we talk of ‘NonPersonLivingSubject’ domain but for which at present we have decided not to model the distinction itself. These concepts serve to illustrate examples of the concept NonPersonLivingSubject.

RIM does not represent (in the model diagram) conceptual specializations of a class that require no properties beyond those of the class they specialize. For all such enumerated specializations, a classifier attribute is defined in the super class to distinguish specializations that exist conceptually but are not represented in the model. For example, the concepts Animal, Microorganism, and Plant are distinguished by the Entity Class Code values ‘ANM’, ‘MIC’, and ‘PLNT’ respectively in the controlling vocabulary of Entity Class Code.
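A small Python sketch shows how a classifier attribute can stand in for sub-classes that add no attributes of their own. The code values come from the text above; the `kind` method, the lookup table, and the sample instances are hypothetical.

```python
from dataclasses import dataclass

# Code values quoted in the text; the lookup table itself is an illustration.
ENTITY_CLASS_LABELS = {"ANM": "Animal", "MIC": "Microorganism", "PLNT": "Plant"}


@dataclass(frozen=True)
class NonPersonLivingSubject:
    """One class for all specializations; class_code tells them apart."""
    class_code: str
    name: str

    def kind(self) -> str:
        # Unknown codes fall back to the generic concept.
        return ENTITY_CLASS_LABELS.get(self.class_code, "NonPersonLivingSubject")


# Hypothetical sample instances.
subjects = [
    NonPersonLivingSubject("ANM", "guide dog"),
    NonPersonLivingSubject("PLNT", "aloe vera"),
]

for s in subjects:
    print(s.name, "is a", s.kind())
```

No Animal or Plant class is declared, yet the distinction is still available wherever it is needed.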

Looking at the model defined in Figure 4, we realize that in addition to a person, it could also be an organisation that is a vendor and/or customer of the product. We can capture common attributes of Person and Organisation in a generalization hierarchy with Trading Partner as the super class of enumerated specializations – Person and organisation. Doing so simplifies our model by recognizing the fact that relationships can now apply to either a person or an organization.

Figure 14: Entity Specialisation

Similarly, common entity concepts in RIM are organized in a generalization-specialization hierarchy with Entity class at the root of hierarchy. This allows for a simple RIM model since all entity class specializations can participate in the same relationships as applicable for the class generalization in the entity class hierarchy model.

RIM achieves its enormous flexibility as a model through the use of structural attributes such as Class Code that help distinguish between different conceptual specializations. The controlling vocabulary provides the semantic description of the concepts. It also provides a universal context of generic entity types that are linked to each other in a hierarchical generalization-specialization relationship, one that is based on the pure semantics of the concepts involved.

The fourth design principle states that –

“Relationships as involvement”

This design principle states that where the relationship between two entity types is represented as an associative entity type, the relationship between the entity types themselves is simply the involvement of each entity type with the associative entity between them.

As shown in figure 15 below, the association between the Person and Organisation entity types is modeled using another entity type (the Role class). With the introduction of a Role class, the relationship between the Person and Organisation entity types becomes the involvement of the Person and Organisation classes with the Role class. In the figure below, the involvement is depicted as ‘plays’ or ‘scoped by’, which are the roles played by the Person and Organisation entity types in their relationship with the Role class.

Figure 15: Relationship as involvement

The fifth design principle states that –

“Use only surrogate identifiers”

This design principle asserts that only surrogate identifiers be used to identify entity types. Surrogate identifiers are system-generated identifiers that are unique to the system. The identifier possesses no meaning and is used only to guarantee data integrity. Surrogate identifiers are best used for reference entities when it is difficult to find attributes that never change value.

In HL7 RIM, the Instance Identifier (II) data type is used to uniquely identify an object. Instance Identifiers are created as Object Identifiers (OID). OIDs are guaranteed to be unique if created by an ISO registration authority following procedures laid down by ISO standards. The basic structure of the instance identifier includes a namespace that is the root of the OID and an extension that serves as the identifier within the namespace. Instance identifiers uniquely identify Act, Role, and Entity class objects. Though HL7 does not mandate the creation of meaningless identifiers to identify objects, it does provide the option of doing so using the extension attribute of the instance identifier.
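As a rough sketch, an instance identifier can be pictured as a root plus an optional extension. The Python class below is an illustration under that assumption, not the HL7 data type definition; the OID "2.999.1.1" and the extension values are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InstanceIdentifier:
    """Simplified picture of the II data type: namespace root plus extension."""
    root: str                        # OID that acts as the namespace
    extension: Optional[str] = None  # identifier within that namespace

    def __str__(self) -> str:
        return self.root if self.extension is None else f"{self.root}^{self.extension}"


# Hypothetical namespace OID and surrogate (meaningless) extensions.
HOSPITAL_PATIENT_IDS = "2.999.1.1"

id_a = InstanceIdentifier(HOSPITAL_PATIENT_IDS, "000123")
id_b = InstanceIdentifier(HOSPITAL_PATIENT_IDS, "000124")
id_a_again = InstanceIdentifier(HOSPITAL_PATIENT_IDS, "000123")

print(id_a, id_a == id_a_again, id_a == id_b)
```

Two identifiers refer to the same object only when both the root and the extension match, which is what lets the extension be an arbitrary system-generated value.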

In cases where the entity described by class attributes is a concept, the data type Concept Descriptor (CD) is used to express object identity. In a CD, the namespace is the code system and the identifier is the code attribute.

The sixth design principle states that –

“Candidate attributes should be suspected of representing relationships to other entity types.”

This design principle implores examination of all class attributes to determine whether these attributes are really relationships to another concept.

Conclusion

In the discussion above we saw how HL7 RIM data model conforms to the design principles that produce data models which are generic and flexible to provide a universal context to model real world entities.

________________________________________________________________________
[1]. Matthew West "Developing High Quality Data Models, Volume 1, Principles and Techniques", The Data Management Guide. (London: Shell International Petroleum Company Limited, 1994). These ideas are further expanded at http://www.matthew-west.org.uk/documents/princ03.pdf.

Tuesday, October 18, 2011

Term "CLOUD"

A recent Microsoft advertisement on cloud computing (http://www.youtube.com/watch?v=tdqoQ0zL7GQ&feature=related) seems to be an attempt to popularise the term "Cloud" with consumers. It is interesting to note that where terms such as "Software-as-a-Service", "Service Oriented Architecture", etc. failed to catch on in the past, "cloud" stands a good chance because it is a simple word that is not abstract, is not tech heavy, and can be related to a real world cloud in a way that makes it intuitive to understand what it is all about.

Google Trends (http://www.google.com/trends) is a great tool for tracking the popularity of terms over time and across geographies, regions, etc. It is a fun tool to use to track trends or catch them early. Typing the term "cloud" into Google Trends brings up the following chart.

The average worldwide search volume from 2004 to the present is scaled down and mapped to 1.0 on the y-axis of this chart. What the chart then shows is how search traffic deviates from that average over time. Until about the end of 2007, search volume for the term "Cloud" did not vary much from the long-term average, and it is safe to assume that until this point people's interest was in real clouds. There is a marked increase in searches for the term in 2008 that accelerates during 2009 and really takes off in 2010. Unless people all over the world suddenly became interested in real clouds, this steep increase can only be attributed to new interest in the tech meaning of the term "cloud". Spike B in the chart, though, is due to interest in the volcanic ash cloud over Iceland.
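The scaling described above can be sketched in Python as a simple division by the long-term mean. The function name and the sample volumes are made up for illustration, and this is only my reading of how the chart is scaled, not Google's actual method.

```python
def scale_to_average(volumes):
    """Divide each value by the mean so that the long-term average maps to 1.0."""
    mean = sum(volumes) / len(volumes)
    return [v / mean for v in volumes]


# Invented search volumes for four periods; not real Google Trends data.
raw_volumes = [80, 100, 120, 100]

scaled = scale_to_average(raw_volumes)
print(scaled)
```

A value above 1.0 means search traffic in that period was higher than the long-term average, and a value below 1.0 means it was lower.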

Searching for term "Cloud Computing" brings up a similar picture.

Interesting to note is the comparison between the terms "Saas" and "Cloud Computing". Though average search volume for "Saas" is higher, which is expected since the term came into being much earlier, the trend clearly shows growing popularity of Cloud as compared to stagnant or waning popularity of "Saas" term.

Here's another chart showing relative popularity of "SaaS" and "SOA" terms. Declining trend of popularity for these terms is clearly visible in the chart.

I find Google Trends not only fun to use but also very useful for getting a feel for what is on people's minds. Are people worried about the economy? Is inflation or deflation a bigger concern? Using Google Trends one can find answers to such questions. One can also get insights into businesses by looking at search volume trends. Searching for "Amazon" brings up the chart below, which offers several insights into Amazon's business. There is a long-term upward trend that probably matches Amazon's stock price. Spikes come in regularly every year during the Christmas season. The 2007 spike was an exceptional one; maybe that has to do with the economic conditions prevailing at that time.

How I explained REST to myself

My recent visit to the passport office to renew my passport was a complicated and messy affair. Back at the office, mulling over the whole experience, I realized that my interaction with the passport office is a perfect example of a REST based service and how Roy Fielding would describe a true REST architecture.

Let me talk about my experience briefly before I get to how I relate it to core REST principles.

Looking around the main hall as I entered the passport office at the appointed time, I found a window with a sign that read “For Online Appointments”. I presented my application form and all documents to the officer at the window. After going through my papers, the officer handed them back and told me to go to Window 6 in Hall 2. He was probably too busy to answer my queries and gave me a look that clearly meant: just do as told.

So I went to Window 6 in Hall 2, stood in a short queue, and when my turn came, gave my papers to the officer at the window. After checking my documents again and making some entries, the officer handed the documents back to me and told me to go back to the window I came from. By this time, I had realized that it was in my best interest to do as told and not worry too much about figuring out the process.

Back at the window where I started, I again handed my documents, checked and stamped many times by now, to the window officer. This time I was asked to make the fee payment. After making the payment, I was handed a receipt and mercifully told that the process was over.

At the core of the REST architectural style is something called HATEOAS: “Hypermedia as the engine of application state”. A user can transition to the next state using only the hypertext and links provided in the server response. Based on the current application state, the server determines what actions the user can perform to transition to future application state(s). An important advantage of this architectural style is that the client can easily adapt to changes in server responses, since the path taken by the user through the application is strictly based on server responses.

In my interaction with the passport officers, what I needed to do next was determined by the response I got from the officer at the window I visited first. Based on my information, I was told which window I needed to go to next. Following the instructions provided by the officers, I moved from one stage to the next in the application process. This is no different from a REST based server that guides its clients to possible next states through the hypertext and links provided in the representation of the resource(s) accessed.

A true REST architecture makes it mandatory for the client to discover all future actions dynamically using the hypermedia links provided in the server response. The client then needs to know only the entry point URL into the application and from there on discovers future actions from the server responses. In my case, that entry point was a window with a sign board stating “For Online Appointments”. Once I accessed that, I had to discover all future actions through instructions received from the passport officers.
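Here is a toy Python sketch of that discovery process, with the passport office standing in for the server. Everything in it is hypothetical: the URLs, the "next" link name, and the in-memory `RESOURCES` table that replaces real HTTP calls. The client knows only the entry point and follows whatever link each response hands back.

```python
# Toy "server": each representation carries a link to the next step.
# The paths and messages are invented to mirror the passport office story.
RESOURCES = {
    "/online-appointments": {
        "message": "Papers checked",
        "links": {"next": "/hall2/window6"},
    },
    "/hall2/window6": {
        "message": "Documents verified",
        "links": {"next": "/online-appointments/payment"},
    },
    "/online-appointments/payment": {
        "message": "Fee paid, process complete",
        "links": {},  # no further link: the process is over
    },
}


def get(url):
    """Stand-in for an HTTP GET; returns the representation of a resource."""
    return RESOURCES[url]


def follow(entry_point):
    """Walk the application using only links found in the responses."""
    visited = []
    url = entry_point
    while url is not None:
        visited.append(url)
        url = get(url)["links"].get("next")
    return visited


path = follow("/online-appointments")
print(" -> ".join(path))
```

If the office reshuffles its windows, only the `RESOURCES` table changes; the `follow` function, like the applicant, keeps working because it never hard-codes the route.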

“Media type” is another important concept in the REST architectural style that makes it possible for a REST API to be the true driver of the application state. A media type standardizes out-of-band information that further guides the interaction between the client and the server. For example, a media type such as “text/html” defines the rendering model and the browser behavior for HTML markup elements. In my example, I would characterize the media type as “Go-to-the-next-window-as-told”. This clearly defined my behavior around the response(s) that I received from the officers.

Having the system in control of all activities does provide a lot of flexibility and strength to the whole process, even though it may seem chaotic from a user's perspective. It is a discovery process, just what a true REST application server would want its clients to go through.