As machine learning R&D and large-scale service development have grown in recent years, so has the need to handle large amounts of data. Handling data at this scale efficiently requires faster database systems. We are working on accelerating database systems by obtaining efficient query plans through the following two approaches:
- Using machine learning to capture data and workload characteristics
- Formulating data storage and query as an optimization problem
Data integration is the process of providing the user with a unified view of data residing in different databases. As the amount of generated data grows, so does the demand for using distributed data as a whole. We are researching efficient transaction management methods for Dejima, a P2P-based data integration architecture.
We are studying conversation systems, in particular systems for chit-chat. Our goal is to build a conversation system with human-level communication skills, one that understands users’ language and emotions and reacts appropriately. We employ machine learning methods such as neural networks and pre-trained language models to generate natural responses. We also create corpora for training conversation systems using two approaches: web crawling and automatic generation with paraphrasing technologies.
Alignment-based semantic similarity assessment
Paraphrases convey equivalent information in different expressions. Take the following pair of paraphrases as an example: “The discussion heated up.” and “Their debate entered high gear.” Why can we understand that they deliver the same information, even though they use totally different words? How do our brains represent these expressions to estimate their semantic relevance? Paraphrases are the key to answering these questions. Paraphrases are also useful resources for applications that need to understand what users say, such as question answering and conversational agents. We are working on the analysis of linguistic phenomena in paraphrases, as well as on technologies to detect paraphrases. For more details, please visit our project web site.
Language Learning Support
We are developing assistive technologies for English education. Data-Driven Learning (DDL) is a trend in language education in which students autonomously learn how a term or phrase is typically used by observing its use in various contexts. We are developing technologies to support DDL in collaboration with language education experts, for example, automatic assessment of sentence difficulty levels and automatic paraphrasing to control difficulty levels.
Thanks to social networking services (SNS), people are more connected with each other than ever before, and many devices are likewise connected with each other via IoT technology. Graph data can represent these connections, so graph analysis is attracting considerable attention. The goal of graph mining is to discover valuable insights from graph data. For example, clustering identifies groups in which people (or devices) behave similarly. Another example is link prediction, which finds pairs of people (or devices) that are likely to connect with each other. These analytics are used in a wide variety of applications, including recommendation systems and marketing. We aim to develop efficient and effective graph mining methods.
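To make the idea of link prediction concrete, the following is a minimal sketch that scores unconnected pairs by the Jaccard overlap of their neighbor sets, a common textbook baseline. It is not the method developed in this project; the function name `jaccard_link_scores` and the toy edge list are assumptions made for illustration.

```python
from itertools import combinations


def jaccard_link_scores(edges):
    """Score every unconnected node pair by the Jaccard overlap of neighbor sets.

    Illustrative baseline only; not the project's own algorithm.
    """
    # Build an undirected adjacency map: node -> set of neighbors.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    scores = {}
    for u, v in combinations(sorted(neighbors), 2):
        if v in neighbors[u]:
            continue  # already connected, nothing to predict
        common = neighbors[u] & neighbors[v]
        union = neighbors[u] | neighbors[v]
        scores[(u, v)] = len(common) / len(union) if union else 0.0
    return scores


# Hypothetical toy friendship graph.
edges = [
    ("ann", "bob"), ("ann", "cat"), ("bob", "cat"),
    ("cat", "dan"), ("bob", "dan"), ("dan", "eve"),
]
scores = jaccard_link_scores(edges)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

In this toy graph, "ann" and "dan" share two of their three distinct neighbors, so they receive the highest score. Enumerating all pairs is quadratic in the number of nodes, which is one reason efficient methods are needed for large graphs.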
Patent Evaluating AI System
The number of registered patents worldwide exceeds 10 million, and the number of patent applications per year exceeds 3 million. As the number of patents grows, so does the burden of patent search on companies. In addition, the number of annual patent applications in China is rising rapidly and is now the highest in the world. As intellectual property rights become increasingly global, comprehensive patent search becomes more difficult. In this research, we aim to develop a system for semi-automatic, highly accurate patent retrieval that builds on the recent success of machine learning with neural networks and incorporates big data analysis methods such as graph matching and clustering.
Exploratory Data Analysis
Exploratory data analysis is a technique for discovering characteristic data that differ substantially from ordinary data, for example in purchase data or astronomical observation data. For instance, we surface important observations by finding data that are characteristic of a specific region or season. In particular, in collaboration with the National Astronomical Observatory of Japan (NAOJ), we are working on outlier detection and imputation techniques that learn normal patterns, with the aim of discovering intrinsic variables whose brightness and position change over a short period of time.
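As a rough illustration of detecting outliers against a learned "normal" pattern, the sketch below flags values whose modified z-score, computed from the median and the median absolute deviation (MAD), exceeds a threshold. This is a generic robust-statistics baseline, not the technique developed with NAOJ; the function name `mad_outliers`, the threshold of 3.5, and the brightness values are assumptions for illustration.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    The "normal pattern" here is simply the median and the median absolute
    deviation (MAD) of the series; a simplifying assumption for illustration.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to compare against
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    ]


# Hypothetical brightness measurements with one sudden jump.
brightness = [10.1, 10.0, 9.9, 10.2, 10.0, 14.5, 10.1]
outlier_indices = mad_outliers(brightness)
print(outlier_indices)
```

Median-based statistics are used here because a single extreme value barely shifts them, whereas a mean and standard deviation would be pulled toward the very outlier being tested.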
Attribute Graph Clustering
We search for routes every day using car navigation systems and smartphones. We propose new and useful types of route search that existing services do not support. In particular, we focus on the actual attributes of data (e.g., category and text) and on surrounding information; for example, we investigate routes that match other attributes or that take parking status into account. We also propose efficient algorithms for existing search types and implement services in cooperation with local governments.
Spatio-temporal Data Mining
Local governments have recently deployed many sensors in cities to measure temperature, sound, traffic volume, and air pollution, and these sensors generate a large amount of data. We analyze this big data to obtain useful knowledge such as “temperature and traffic volume increase at the same time” and “as noise increases, air pollution also increases.” Our goal is to propose new techniques that find non-trivial knowledge.
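A statement such as “as noise increases, air pollution also increases” can be checked, in its simplest form, by measuring the correlation between two sensor series. The sketch below computes the Pearson correlation coefficient by hand; it is only a starting point and not the mining technique proposed here. The function name `pearson` and the readings are assumptions for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Hypothetical hourly readings from two co-located sensors.
noise_db = [40, 45, 50, 55, 60]
pollution = [10, 12, 15, 16, 20]
r = pearson(noise_db, pollution)
print(round(r, 3))
```

A coefficient close to 1 indicates that the two series rise together. Pairwise correlation of this kind only captures simple linear relationships, whereas non-trivial knowledge may involve time lags or spatial proximity between sensors.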