Meaningful data visualisation and insight generation require clean data. By clean data we mean normalised or sanitised data - data that follows a consistent rule so that it is uniform and therefore comparable. Only uniform things are comparable, and normalisation (or sanitisation) of data is an important part of patent research because some data points are inherently inconsistent.
One of the major patent data points that needs this 'cleaning' (in other words, normalisation) is the company (or institute, university, lab, etc.) name. These are called assignee/applicant names in patentese (patent language).
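To make the problem concrete, here is a minimal Python sketch of how spelling variants of one assignee can be grouped under a single comparison key. The sample names and the helper functions `name_key` and `group_variants` are hypothetical and invented for illustration. The keying rule (lowercase, drop punctuation, sort the unique words) is a deliberately crude assumption, not the method of any particular tool.

```python
import string
from collections import defaultdict


def name_key(name: str) -> str:
    """Build a crude comparison key for an assignee name.

    Assumed rule (illustrative only): lowercase the name, strip ASCII
    punctuation, then sort the unique words so word order does not matter.
    """
    cleaned = name.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def group_variants(names):
    """Group raw names that share the same comparison key."""
    groups = defaultdict(list)
    for raw_name in names:
        groups[name_key(raw_name)].append(raw_name)
    return dict(groups)


# Hypothetical raw values, as they might appear in an assignee column.
raw = [
    "Samsung Electronics Co., Ltd.",
    "SAMSUNG ELECTRONICS CO LTD",
    "Samsung Electronics Co. Ltd",
    "Stanford University",
]

groups = group_variants(raw)
for key, variants in groups.items():
    print(key, "->", variants)
```

A rule this simple only catches differences in case, punctuation and word order. It would not merge an abbreviation with its full form, which is part of why a dedicated tool is useful.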
There are many methods for assignee/applicant cleaning when it comes to choosing the right assignee name; that is not something we will discuss in this post. We will focus on learning how to use one very powerful open source tool for assignee cleaning - OpenRefine (formerly known as Google Refine).
So, in essence, this article will help you get started with using OpenRefine to clean the company/university names in the assignee/applicant column. It will not really delve into which name to choose or other aspects (I may cover them in a separate article later).
Steps: