Machine learning R&D and large-scale service development have grown rapidly in recent years, and both need to handle large amounts of data. Handling such data efficiently requires faster database systems. We are working on accelerating database systems by obtaining efficient query plans through the following two approaches:
- Using machine learning to capture data and workload characteristics
- Formulating data storage and query as an optimization problem
Data integration is the process of providing the user with a unified view of data residing in different databases. As the amount of generated data grows, so does the demand for using distributed data as a whole. We are researching efficient transaction management methods for a P2P-based data integration architecture called Dejima.
We are developing conversational agents that can understand what people say and how they feel, and that respond as naturally as humans do. Our particular target is agents capable of chit-chat. We explore both search-based and generation-oriented approaches:
- Search-based: Given a user’s utterance, retrieve an appropriate response from large-scale data of previous conversations.
- Generation-oriented: Generate utterances from scratch using various machine-learning techniques.
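The search-based approach can be illustrated with a minimal sketch. It assumes a tiny, made-up list of (utterance, response) pairs and uses bag-of-words cosine similarity as the matching score; the names `PAIRS` and `retrieve_response` are hypothetical, and this is not the method used in the project, only one simple way such a retrieval step could look.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_response(utterance, conversation_pairs):
    """Return the stored response whose past utterance best matches the input."""
    query = bag_of_words(utterance)
    best_pair = max(
        conversation_pairs,
        key=lambda pair: cosine(query, bag_of_words(pair[0])),
    )
    return best_pair[1]


# Hypothetical conversation data: (past utterance, response that followed it).
PAIRS = [
    ("how is the weather today", "It looks sunny outside."),
    ("what did you eat for lunch", "I had a bowl of ramen."),
    ("do you like music", "Yes, I listen to jazz a lot."),
]

if __name__ == "__main__":
    print(retrieve_response("What is the weather like today?", PAIRS))
```

A real system would replace the word-count vectors with learned sentence representations and search an index rather than scanning every pair, but the overall shape (encode the utterance, score candidates, return the best response) stays the same.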
Language barriers have long been a challenge for people: they cause miscommunication and misunderstanding. We believe that machine translation is the key to tackling this challenge. Our topics include statistical machine translation, neural machine translation, and techniques for evaluating machine translation output.
Paraphrases convey equivalent information in different expressions. Take a pair of paraphrases as an example: “The discussion heated up.” and “Their debate entered high gear.” Why can we understand that they deliver the same information, even though they use totally different words? How do our brains represent these expressions to estimate their semantic relevance? Paraphrases are the key to answering these questions. They are also useful resources for applications that need to understand what users say, such as question answering and conversational agents. We are working on analyzing linguistic phenomena in paraphrases, as well as on technologies to detect paraphrases. For more details, please visit our project web site.
Language Learning Support
Everyone knows that mastering a foreign language is important but hard. We are developing systems that support both language learners and language teachers: automatic estimation of the proficiency level of English documents, and lexical simplification techniques that help teachers adjust the level of their course materials to suit their students.
Thanks to SNS, people are more connected with each other than ever before, and many devices are likewise connected with each other via IoT technology. Graph data can represent these connections, so graph analysis attracts considerable attention. The goal of graph mining is to discover valuable insights from graph data. For example, clustering identifies groups in which people (or devices) behave similarly. Another example is link prediction, which finds pairs of people (or devices) that are likely to connect with each other. These analytics are used in a wide variety of applications, including recommendation systems and marketing. We aim to develop efficient and effective graph mining methods.
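As an illustration of link prediction, the sketch below scores each pair of unconnected nodes by the number of neighbors they share, a classic baseline known as the common-neighbors score. The edge list `EDGES` and the function name `common_neighbor_scores` are made up for this example, and the baseline is not meant to represent the methods developed in this research.

```python
from itertools import combinations


def common_neighbor_scores(edges):
    """Score every unconnected node pair by its number of shared neighbors."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # already connected, nothing to predict
        shared = len(neighbors[u] & neighbors[v])
        if shared:
            scores[(u, v)] = shared
    return scores


# Hypothetical friendship graph given as undirected edges.
EDGES = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
    ("bob", "dan"), ("cat", "dan"), ("dan", "eve"),
]

if __name__ == "__main__":
    scores = common_neighbor_scores(EDGES)
    for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(pair, score)
```

Pairs with more shared neighbors are ranked as more likely to connect. Comparing all pairs is quadratic in the number of nodes, which is exactly why efficient methods matter for large graphs.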
Patent Evaluating AI System
More than 10 million patents are registered worldwide, and more than 3 million patent applications are filed each year. As the number of patents grows, so does the burden of patent search on companies. In addition, the number of patent applications per year in China is increasing rapidly and is now the highest in the world. As intellectual property rights become more global, comprehensive patent search becomes more difficult. In this research, we aim to develop a system for semi-automatic, highly accurate patent retrieval that builds on the recent success of machine learning with neural networks and incorporates big data analysis methods such as graph matching and clustering.
Exploratory Data Analysis
Exploratory data analysis is a technique for discovering characteristic data that differ greatly from ordinary data, for example in purchase data or astronomical observation data. For instance, we make use of important observations by discovering characteristics specific to a particular region or season. In particular, in collaboration with the National Astronomical Observatory of Japan (NAOJ), we are working on technology for outlier detection and imputation based on learning normal patterns, with the aim of discovering intrinsic variables whose brightness and position change over a short period of time.
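The idea of flagging data that deviate from a normal pattern can be sketched with a simple robust statistic. The example below assumes a made-up series of brightness measurements and uses the modified z-score based on the median and the median absolute deviation (MAD); the name `mad_outliers`, the data, and the threshold of 3.5 are illustrative assumptions, and this is a textbook baseline rather than the learning-based method developed with NAOJ.

```python
import statistics


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    The "normal pattern" here is just the median, and the spread is the
    median absolute deviation (MAD), so a few extreme values do not
    distort the estimate of what is normal.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # no spread: this simple score cannot rank deviations
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - median) / mad) > threshold
    ]


# Hypothetical brightness measurements of one object over time.
BRIGHTNESS = [10.1, 10.0, 9.9, 10.2, 10.0, 14.5, 10.1, 9.8]

if __name__ == "__main__":
    print(mad_outliers(BRIGHTNESS))
```

Flagged measurements could then be inspected as candidate events or, if they turn out to be faulty readings, replaced by imputed values.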
Attribute Graph Clustering
We search for routes every day using car navigation systems and smartphones. We propose new and useful types of route search that existing services do not support. In particular, we focus on the actual attributes of data (e.g., category and text) and on surrounding information; for example, we investigate routes that match other attributes or that take parking status into account. In addition, we propose efficient algorithms for existing types of search and implement services in cooperation with local governments.
Spatio-temporal Data Mining
Local governments have recently deployed many sensors in cities to measure quantities such as temperature, sound, traffic volume, and air pollution, and these sensors generate a large amount of data. We analyze this big data to obtain useful knowledge such as “temperature and traffic volume increase at the same time” and “as noise increases, air pollution also increases.” Our goal is to propose new techniques that find non-trivial knowledge.
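A statement such as “temperature and traffic volume increase at the same time” can be checked, in its simplest form, by measuring the correlation between two sensor series. The sketch below computes the Pearson correlation coefficient for made-up readings; the variable names and values are hypothetical, and the research itself targets less obvious patterns than a single pairwise correlation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Hypothetical hourly readings from sensors at one location.
TEMPERATURE = [18, 20, 23, 26, 24, 21]        # degrees Celsius
TRAFFIC = [120, 150, 200, 260, 230, 160]      # vehicles per hour

if __name__ == "__main__":
    print(round(pearson(TEMPERATURE, TRAFFIC), 3))
```

A coefficient close to 1 indicates that the two quantities rise and fall together. It does not by itself show that one causes the other.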