Supervised and Semi-Supervised Learning for IR (full day)
Yi Zhang and Rong Jin
The Probabilistic Relevance Model: BM25 and Beyond
Hugo Zaragoza and Stephen Robertson
Web Mining for Search
Ricardo Baeza-Yates and Rosie Jones
Content-based and Semantic-based Image/Video Retrieval
Winston Hsu and Rong Yan
Supervised and Semi-Supervised Learning for IR (continued)
Yi Zhang and Rong Jin
Indian Language Information Retrieval: Dealing with Indian Language Text Retrieval Issues
Mandar Mitra, Prasenjit Majumder, and Sobha L
Learning to Rank for Information Retrieval
Tie-Yan Liu
Applied Text Mining
Ronen Feldman
Online Advertising: Business Models, Technologies and Issues
James Shanahan and Ayman Farahat
This tutorial will present broad coverage of supervised and semi-supervised learning techniques and their application to information retrieval, with a focus on semi-supervised learning. It will be organized into four parts: 1) a brief introduction to supervised learning and its application to text categorization and ranking; 2) an overview of semi-supervised classification and the related learning algorithms, illustrated by applications to information retrieval; 3) an introduction to active learning and its related learning algorithms, with emphasis on its application to interactive retrieval and adaptive filtering; and 4) an overview of semi-supervised clustering and the related learning algorithms, illustrated by applications to document clustering.
Yi Zhang is an Assistant Professor at the Baskin School of Engineering, University of California Santa Cruz, where she teaches Information Retrieval. She has taught a text and data mining class at Nanjing University and has given invited lectures in a Text Mining course at Carnegie Mellon University. Her research is related to information retrieval, text mining, statistical machine learning, and natural language processing, and she has published in and served as a reviewer for conferences and journals in the areas of information retrieval and machine learning. She has collaborated with start-ups, large corporations and government agencies on related topics. She received the Best Paper Award at ACM SIGIR 2002. Dr. Zhang received her Ph.D. and M.S. from Carnegie Mellon University and a B.S. from Tsinghua University. She has received an NSF grant and an Air Force Research Young Investigator Award.
Rong Jin has been an Assistant Professor in the Computer Science and Engineering Department of Michigan State University since 2003. He works in the areas of statistical machine learning and its application to information retrieval. Dr. Jin has worked on a variety of machine learning algorithms and their application to information retrieval, including retrieval models, collaborative filtering, cross-lingual information retrieval, document clustering, and video/image retrieval. He has published over sixty conference and journal articles on related topics. Dr. Jin holds a B.A. in Engineering from Tianjin University, an M.S. in Physics from Beijing University, and an M.S. and Ph.D. in Computer Science from Carnegie Mellon University. He received the NSF Career Award in 2006.
The Probabilistic Relevance Model (PRM) is the formal framework behind BM25 and some of the most widely used algorithms for retrieval. In this tutorial we will discuss the theoretical modeling and the practical tuning work that is required to understand the PRM, derive new algorithms and go beyond BM25.
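For readers unfamiliar with BM25, the following is a minimal sketch of one common form of the scoring function, not the presenters' own material. The function name `bm25_scores`, the toy documents, and the parameter values k1=1.2 and b=0.75 are assumptions chosen for illustration; the IDF used here is the smoothed variant that adds 1 inside the logarithm so that weights stay non-negative.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query.

    k1 controls term-frequency saturation and b controls document-length
    normalization; both defaults are illustrative assumptions.
    """
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization factor for this document.
        norm = k1 * (1 - b + b * len(d) / avgdl)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF (adds 1 inside the log to avoid negative weights).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


docs = [
    ["bm25", "ranking", "model"],
    ["web", "mining", "search"],
    ["bm25", "bm25", "tuning", "ranking"],
]
scores = bm25_scores(["bm25"], docs)
```

In this toy collection the second document does not contain the query term and scores zero, while the third document outranks the first because the extra occurrence of the term outweighs its slightly greater length.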
Hugo Zaragoza is a researcher in Information Retrieval at Yahoo! Research Barcelona. He is interested in the applications of machine learning and natural language processing to information retrieval. Previously he worked at Microsoft Research Cambridge (UK) and collaborated with Microsoft product groups such as MSN-Search and SharePoint Portal Server.
Stephen Robertson runs the Information Retrieval and Analysis group at Microsoft Research Cambridge (UK). He is one of the inventors of the Probabilistic Relevance Model and of Okapi BM25. Prior to joining Microsoft, he was at City University London, where he retains a part-time position. He was awarded the Tony Kent STRIX award by the Institute of Information Scientists in 1998 and the Salton Award by ACM SIGIR in 2000.
The tutorial demonstrates in a step-by-step mode how to exploit and extend experimental laboratory IR research into an integrated research design framework for interactive IR studies. Potential test variables and experimental designs illustrate the cases of: a) ultra-light IR interaction and simulations of searcher activities; b) interactive-light IR experiments involving test persons in laboratory settings; and c) semi-controlled field studies of IR in naturalistic organizational contexts.
Peter Ingwersen is a Professor at the Royal School of Library and Information Science, Denmark. His areas of research are Interactive IR, IR evaluation methods, IR theory, and Informetrics-Scientometrics. He has given tutorials on Introduction to IR and User-oriented IR at SIGIR and the ESSIR summer schools and organized, with Kalervo Järvelin, the SIGIR Workshop IR in Context (2004 and 2005). He has published a number of highly cited research monographs and journal articles on IR and received several international research awards.
Web search is a public-facing industry application of IR research. Web mining is a tool both to solve web search problems, and to generate knowledge from the artifacts of web search. We introduce web search, and the web mining research which is motivated by it. Where possible we will give examples using publicly available resources and tools.
Ricardo Baeza-Yates is Yahoo! VP of Research for Europe, the Middle East and Latin America, leading the Yahoo! Research labs in Barcelona, Spain; Santiago, Chile; and Haifa, Israel. He is co-author of Modern Information Retrieval. His research interests include algorithms and data structures, information retrieval, text and multimedia databases, software and database visualization, user interfaces and web mining.
Rosie Jones is a Senior Research Scientist in Information Retrieval at Yahoo! Research. She is an active participant in the IR community, serving as Senior PC member for SIGIR in 2007 and 2008. Her research interests include information retrieval, web mining and natural language processing.
The explosive growth of multimedia data stimulates the need for effective methods of multimedia content analysis and retrieval. This tutorial aims to provide comprehensive coverage of recent developments in content-based and semantic-concept-based image and video retrieval, including theoretical and practical results, illustrative demos, and complementary information for IR researchers interested in this research area.
Winston Hsu is an Assistant Professor in the Graduate Institute of Networking and Multimedia, National Taiwan University. He received his Ph.D. degree (2006) from Columbia University, New York. His current research aims to enable "Next-Generation Multimedia Retrieval" and more generally covers content analysis, mining, retrieval, and machine learning over large-scale multimedia databases.
Rong Yan is a Research Staff Member in the Intelligent Information Management Department of IBM T. J. Watson Research Center. Dr. Yan received his M.Sc. (2004) and Ph.D. (2006) degrees from Carnegie Mellon University's School of Computer Science. His research interests include multimedia information retrieval, video content analysis and data mining.
The need for effective methods for IR from Indian language (IL) texts has been increasingly felt in recent times. This tutorial will address some of the essential issues one needs to handle while searching texts written in Indian languages. The following topics will be covered:
The tutorial should be useful for those planning to participate in FIRE, a TREC-style evaluation event for ILIR.
Mandar Mitra is an Assistant Professor at the Indian Statistical Institute, Kolkata. He received his PhD in Computer Science from Cornell University, USA. He participated in TREC as part of the SMART group from 1994 to 1998 and has also participated in CLEF in the recent past. He is the coordinator of FIRE, the new evaluation workshop for Indian languages (http://www.isical.ac.in/~fire).
Prasenjit Majumder is a postdoctoral fellow in the Information Retrieval Lab, CVPR unit, Indian Statistical Institute, Kolkata. He completed his doctoral dissertation, "Information Retrieval for Resource-Constrained Languages", at Jadavpur University, Kolkata, India. He is a regular participant at CLEF and co-coordinator of the FIRE workshop.
Sobha L is a faculty member at the AU-KBC Research Centre, Anna University, Chennai. She received her Ph.D. in Computational Linguistics from Mahatma Gandhi University, Kerala. She has worked extensively in the area of NLP (especially in Anaphora Resolution and Information Extraction) for both English and Indian languages.
An introduction will be given to the new research area of learning to rank for information retrieval. For learning, a training set of queries and their associated documents (with relevance judgments) is provided. The ranking model is trained on this data in a supervised fashion by minimizing a loss function. For ranking, the model is applied to new queries and sorts their associated documents. Three major approaches to learning to rank have been proposed: pointwise, pairwise and listwise. The pointwise approach solves the problem of ranking by means of regression or classification on single documents. The pairwise approach transforms ranking into classification on document pairs. The listwise approach tackles the ranking problem directly, by adopting listwise loss functions or optimizing IR evaluation measures. After introducing these three approaches, we will compare them, discuss their underlying theories, and mention some future research directions.
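The difference between the three approaches is easiest to see in the loss each one minimizes. The sketch below is an illustration under stated assumptions rather than the tutorial's own formulation: the function names are hypothetical, the pointwise loss is squared error, the pairwise loss is a logistic loss on score differences, and the listwise loss is a cross entropy between softmax distributions over labels and scores.

```python
import math


def pointwise_loss(scores, labels):
    """Regression view: squared error on each document independently."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def pairwise_loss(scores, labels):
    """Classification on pairs: penalize pairs whose score order
    disagrees with their relevance order (logistic loss)."""
    losses = [
        math.log1p(math.exp(-(scores[i] - scores[j])))
        for i in range(len(scores))
        for j in range(len(scores))
        if labels[i] > labels[j]
    ]
    return sum(losses) / len(losses) if losses else 0.0


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(scores, labels):
    """Whole-list view: cross entropy between the distribution induced
    by the labels and the one induced by the scores."""
    p = softmax(labels)
    q = softmax(scores)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))


def rank(scores):
    """At ranking time, sort document indices by descending score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# One query with three documents; labels are graded relevance judgments.
labels = [2, 1, 0]
good = [2.0, 1.0, 0.0]  # scores that agree with the labels
bad = [0.0, 1.0, 2.0]   # scores that reverse the order
```

All three losses are lower for `good` than for `bad`, but they respond to different things: the pointwise loss looks only at individual score values, the pairwise loss only at score differences within a query, and the listwise loss at the whole score list at once.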
Tie-Yan Liu is a lead researcher at Microsoft Research Asia. His current research interests include learning to rank for information retrieval, and infrastructure/algorithms for large-scale machine learning. Dr. Liu has about 70 quality papers published in refereed international conferences and journals, including SIGIR(9), WWW(3), ICML(3), KDD(2), etc. He has over 30 filed US / international patents or pending applications. He is the winner of the Most Cited Paper Award for the Journal of Visual Communication and Image Representation. He has been a program committee member for about 30 international conferences, a Senior Program Committee member (formerly Area Coordinator) of SIGIR 2008, and a co-chair of the SIGIR 2007 workshop on learning to rank for information retrieval (LR4IR 2007). He has been a tutorial speaker at WWW 2008, AIRS 2008, etc.
Text Mining is an exciting research area that tries to solve the information overload problem by using techniques from data mining, machine learning, NLP, IR and knowledge management. In this tutorial we will present the general theory of Text Mining and will demonstrate several systems that use these principles to enable interactive exploration of large textual collections. The tutorial will cover the state of the art in this rapidly growing area of research. Several real-world applications of text mining will be presented.
Ronen Feldman is an Associate Professor and the head of the Information Systems department at the Business School of the Hebrew University in Jerusalem. He received his B.Sc. from the Hebrew University and his Ph.D. in Computer Science from Cornell University in NY. He is the founder of ClearForest Corporation. He is the author of the book The Text Mining Handbook published by Cambridge University Press in 2007.
Internet advertising revenues in the United States totaled $16.9 billion for 2006, up 35 percent versus 2005 revenues of $12.5 billion (according to the Interactive Advertising Bureau). Fueled by these growth rates and the desire to provide added incentives and opportunities for both advertisers and publishers, alternative business models for online advertising are being developed. This tutorial will review the main business models of online advertising, including the pay-per-impression model (CPM), the pay-per-click model (CPC), and a relative newcomer, the pay-per-action model (CPA), where an action could be a product purchase, a site visit, a customer lead, or an email signup. The tutorial will also discuss in detail the technology being leveraged to automatically target ads within these business models; this largely derives from the fields of information retrieval, machine learning, statistics, and economics. We will also discuss the nascent but fast-growing field of mobile advertising, which has its own challenges and opportunities. Challenges in the field of online advertising such as noisy statistics, click fraud (often considered the spam of online advertising), deception, privacy and other open issues will also be discussed, as well as Web 2.0 applications such as social networks and video/photo-sharing.
James G. Shanahan is an Independent Consultant based in San Francisco, California, USA. He has spent the last 20 years developing and researching cutting-edge information management systems to harness information retrieval, linguistics and machine learning. Prior to being an independent consultant, Jimi was Chief Scientist (and member of the executive team) at Turn Inc., where he focused on the development and deployment of an online ad targeting system (CPA/CPC/CPM-based) in a principled and measured way that leveraged advanced statistical and machine learning techniques. These responsibilities included leveraging the entire reservoir of data assets in order to develop methods for identifying key optimizations, deploying relevant analytical tools and improving the user experience. Prior to joining Turn, Jimi was Principal Research Scientist at Clairvoyance Corporation, where he led the Knowledge Discovery from Text Group. Before that he was a Research Scientist at Xerox Research Center Europe (XRCE), where, as a member of the Co-ordination Technologies Group, he developed Document Souls, a patented document-centric approach to information access. In the early 90s, he worked on the AI Team within the Mitsubishi Group in Tokyo.
He has published six books and over 50 research publications in the areas of machine learning and information processing. Jimi is General Chair for CIKM 2008. Jimi received his Ph.D. in engineering mathematics from the University of Bristol, U.K., and holds a bachelor of science degree in computer science from the University of Limerick, Ireland. He is a Marie Curie fellow and a member of IEEE and ACM.
Ayman Farahat is chief scientist at AdMob, the world’s largest mobile advertising network. Prior to that, he was a member of the research staff at Palo Alto Research Center, where he worked on various aspects of statistical natural language processing, social networks, link analysis, information retrieval and computational linguistics. Ayman's research has focused on developing methods for extracting, organizing and monetizing information from large heterogeneous collections of documents such as the Web. A central theme of his research is combining linguistic and textual content with non-textual features such as links and access patterns to develop statistical models. These statistical models are based on advanced machine learning methods such as generalized linear models and Bayesian statistics.