Mar 11, 2017

Assignee (Player / Company Name) Cleaning & Normalization - III

1. Create Project

Once you have your Excel data sheet ready, go to the start page of OpenRefine and upload it. You should see something like this after you upload your sheet:

I used an Excel export from Espacenet; however, any format that OpenRefine can import (CSV, TSV, Excel and so on) will work for this.

For now, leave all the options at their defaults and just click on Create Project.

2. Selection of Text Facets

Once you create the project, you should see a screen like the one shown below, where you can select the options I have highlighted (here, Facet > Text facet in the drop-down menu of the Applicant(s) column).

3. Cluster - Text Facets

Once you select the highlighted options, you should see something like this:

4. Merging the cluster suggestions

OpenRefine offers several methods for finding similar-looking text entries that can be normalized. Each of them is essentially an algorithm (but you don't need to look them up in detail, unless you want to).

The first option to try is the Method 'key collision' with the Keying Function 'fingerprint'.
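If you are curious what a 'fingerprint' key collision does, here is a minimal Python sketch of the idea. It is an approximation written for illustration, not OpenRefine's actual code, and the company names in it are made up: each name is reduced to a normalized key (lowercased, punctuation removed, words sorted and de-duplicated), and names that end up with the same key are suggested as one cluster.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Approximate a fingerprint key: normalize, tokenize, sort, de-duplicate."""
    # Fold accented characters to their closest ASCII form.
    folded = unicodedata.normalize("NFKD", value)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    # Trim and lowercase, then drop punctuation.
    folded = re.sub(r"[^\w\s]", "", folded.strip().lower())
    # Split on whitespace, remove duplicate words, sort, and join again.
    return " ".join(sorted(set(folded.split())))


def key_collision_clusters(names):
    """Group names that share a key; only groups with 2+ spellings are clusters."""
    buckets = defaultdict(set)
    for name in names:
        buckets[fingerprint(name)].add(name)
    return [sorted(group) for group in buckets.values() if len(group) > 1]


# Hypothetical applicant names, for illustration only.
names = ["Acme Widgets Co., Ltd.", "ACME WIDGETS CO LTD", "Globex GmbH"]
print(key_collision_clusters(names))
```

Both 'Acme' spellings collapse to the same key, so they show up as one cluster, while 'Globex GmbH' has no partner and is left alone.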

See the first cluster values: these are two variations of writing the same company name. We have successfully identified this anomaly, so now tick the Merge? checkbox and click on the 'Merge Selected & Re-Cluster' button at the bottom.

Once re-clustering returns no more suggestions, simply select another Keying Function at the top:
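Conceptually, a merge just rewrites every spelling in the cluster to one chosen value. The sketch below shows that idea in Python; picking the most frequent spelling as the new value is my assumption for the example (in the dialog you can type any value you like), and the names are made up.

```python
from collections import Counter


def merge_cluster(values, cluster):
    """Replace every member of `cluster` in `values` with its most frequent spelling."""
    members = set(cluster)
    counts = Counter(v for v in values if v in members)
    if not counts:
        # None of the cluster spellings occur in the column; nothing to do.
        return list(values)
    canonical = counts.most_common(1)[0][0]
    return [canonical if v in members else v for v in values]


# Hypothetical column values, for illustration only.
column = ["Acme Ltd", "ACME LTD", "Acme Ltd", "Globex"]
print(merge_cluster(column, ["Acme Ltd", "ACME LTD"]))
```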

See the first and the fourth cluster suggestions using the 'metaphone3' keying function.

Once you have tried the other keying functions in the 'key collision' method, try the other method as well, which is 'nearest neighbor'.

Again, we find similar-looking names that are probably typos. (We get a lot of these in patent data, don't we?)
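To give a feel for what 'nearest neighbor' does differently, here is a small Python sketch based on Levenshtein (edit) distance, one of the distance functions this method offers. It is only an illustration under simplifying assumptions: it compares every pair of names, which is fine for a short list but slow for a large one, and the names and the `radius` default are made up for the example.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn `a` into `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def near_pairs(names, radius=1):
    """Return pairs of distinct names whose edit distance is within `radius`."""
    unique = sorted(set(names))
    return [(a, b) for a, b in combinations(unique, 2)
            if levenshtein(a.lower(), b.lower()) <= radius]


# Hypothetical applicant names, for illustration only.
print(near_pairs(["Globex Corp", "Globax Corp", "Initech"]))
```

Unlike key collision, this catches spellings that differ by a character or two, which is why it tends to surface typos.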

After checking all the methods and merging the suggestions you think are correct, click on Close and then export the sheet for further use.

The exported Excel file (or CSV, if you like) will have the Applicant(s) column modified according to the changes you made using the Merge Selected option.
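If you exported a CSV, a quick way to check the result is to count how often each applicant name now appears. The sketch below assumes the column is called 'Applicant(s)' as in my sheet, and uses a small in-memory sample with made-up rows instead of a real export; for a real file you would pass an opened file object instead.

```python
import csv
import io
from collections import Counter


def applicant_counts(csv_file, column="Applicant(s)"):
    """Count how many rows carry each value in the given column."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column].strip() for row in reader if row.get(column))


# Hypothetical export contents, for illustration only.
sample = io.StringIO("Title,Applicant(s)\nA,Acme Ltd\nB,Acme Ltd\nC,Globex\n")
for name, count in applicant_counts(sample).most_common():
    print(name, count)
```

After a good round of merging, the list of distinct names should be shorter than before, with the variants folded into one entry each.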

Give this a try and let me know in the comments if you face any issues.

Overall, keep exploring OpenRefine and discover its other functions. There are lots of them!