February 7, 2019 at 5:43 am #28645
#Announcement(Startup) #Invitation #Challenge [ via IoTForIndiaGroup ]
#Organizer : IBM / IIMA #City : Ahmedabad #DueDateTime : 28 Feb 2019 #StartDateTime : 22 March 2019 #EndDateTime : 7 April 2019
Need for a Personal Data Entities dataset
Because of recent advances in deep learning, there have been a few attempts to detect personal data entities in unstructured text using neural models. However, neural models require large amounts of training data to make good predictions, and for privacy reasons we cannot train models on the personal data of real people. Hence any such training dataset has to be artificially generated.
It may not be possible to manually annotate datasets large enough to train neural models, so we have to come up with ways to programmatically annotate personal data entities in unstructured text.
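To make "programmatic annotation" concrete, here is a minimal sketch of one common approach: filling templates with fictional values while recording the character span of each inserted entity, so labels come for free. This is an illustrative technique of our own choosing, not necessarily the method used in the paper cited below, and the templates and value pools are invented for the example.

```python
import random
import re

# Tiny illustrative pools; a real generator would use far richer templates
# and value lists (names, cities, account numbers, ...).
TEMPLATES = [
    "My name is {PERSON} and I live in {CITY}.",
    "{PERSON} called the bank from {CITY}.",
]
VALUES = {
    "PERSON": ["Asha Rao", "Vikram Mehta"],
    "CITY": ["Ahmedabad", "Pune"],
}

def generate_example(rng):
    """Fill a random template with fictional values, returning the text
    and the (start, end, label) span of each inserted entity."""
    template = rng.choice(TEMPLATES)
    text, spans = "", []
    # Split on slot markers like {PERSON}; the capture group keeps them.
    for part in re.split(r"(\{[A-Z]+\})", template):
        if part.startswith("{") and part.endswith("}"):
            label = part[1:-1]
            value = rng.choice(VALUES[label])
            spans.append((len(text), len(text) + len(value), label))
            text += value
        else:
            text += part
    return text, spans

rng = random.Random(0)
text, spans = generate_example(rng)
print(text)
print(spans)
```

Because every entity is inserted by the generator itself, the resulting spans are exact by construction, which is what makes this cheap to scale compared with manual annotation.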
A team at IBM Research has described its dataset generation method in the following research paper.
Riddhiman Dasgupta, Balaji Ganesan, Aswin Kannan, Berthold Reinwald, and Arun Kumar. “Fine Grained Classification of Personal Data Entities.” arXiv preprint arXiv:1811.09368 (2018). https://arxiv.org/abs/1811.09368
However, there continues to be a need for richer and more diverse datasets that can advance research on identifying personal data entities, which in turn will improve privacy and the protection of personal details that people share with governments and private companies.
IIM Ahmedabad and IBM are excited to bring this dataset generation and coding challenge to students, budding data scientists, industry experts and fellow academicians. The challenge is to generate datasets with fictional but realistic personal data to advance AI research for identifying personal data entities in documents.
What is the Dataset Generation hackathon?
Generating training data for machine learning models requires intuition: an "Aha!" idea that saves a lot of manual effort and leads to much better neural model performance. Hence we're posing this problem to you! We're seeking fresh, out-of-the-box ideas to accomplish this task. We will, however, provide you with resources and some methods that we're familiar with, to get you started.
What is expected of you
We’ll be providing a corpus of English texts drawn from customer complaints to financial companies. The personal data entities in these texts have already been redacted and replaced with placeholders like xxxx. As part of this hackathon, you have to impute (create new) values for the personal data entities that have been redacted from the texts.
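As a minimal illustration of the imputation task (the function name, the placeholder pattern, and the tiny value pool below are our own assumptions, not part of the challenge specification), one could replace each run of `x` characters with a fictional value:

```python
import random
import re

# Tiny illustrative pool of fictional values; a real submission would draw
# from much richer generators (names, addresses, account numbers, ...).
FAKE_NAMES = ["Asha Rao", "Vikram Mehta", "Priya Nair", "Rahul Shah"]

def impute_placeholders(text, rng=None):
    """Replace each redaction placeholder (a run of 'x', e.g. 'xxxx')
    with a randomly chosen fictional personal-data value."""
    rng = rng or random.Random(0)
    return re.sub(r"\bx{2,}\b",
                  lambda m: rng.choice(FAKE_NAMES),
                  text,
                  flags=re.IGNORECASE)

complaint = "On review of my account, xxxx contacted me about the charge."
print(impute_placeholders(complaint))
```

A serious entry would also need to pick values that fit the surrounding context (a name where a name was redacted, a date where a date was), which is where the interesting modeling work lies.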