Information Retrieval and Text Mining

Logistics

Instructor: Dr. Doina Caragea (dcaragea@ksu.edu)

Meeting Time: TU 11:30am-12:45pm in Nichols 236

Prerequisites: Basic knowledge of probability and statistics, data structures, and algorithms. Prior knowledge of either Java or C++.

Target audience: Graduate students in Computer Science and related areas.

Course Description:

Information Retrieval (IR) refers to the processing, indexing and querying of unstructured or loosely structured information. This course will focus on the theory and practice of search engines for retrieving textual information (including web documents). Basic and advanced topics in IR will be covered, with emphasis on newer technologies that go beyond simple keyword search. Programming assignments will provide hands-on experience with retrieval systems. A class project will encourage more advanced research in IR.

Course Objectives:

  • Learn about traditional and advanced topics in IR. Understand the technologies underlying search engines, how they work and when they fail.
  • Focus on intelligent techniques that go beyond simple keyword search. Learn how machine learning and text mining can help the information retrieval process.
  • Gain practical experience in IR by implementing simple “proof-of-concept” retrieval systems.
  • Identify active research topics in IR and study one in detail, as part of the class project.

Topics:

The topics targeted in this course include efficient text indexing, latent semantic indexing, Boolean and vector space retrieval models, probabilistic retrieval models (binary independence models and language models), evaluation, web spiders, link analysis (PageRank and HITS), relevance feedback and query expansion, text mining and categorization, topic detection and clustering, and query answering.

Recommended Textbook

Course work and evaluation

There will be several individual assignments, several reading assignments and paper presentations, and a project (individual or group). There are no exams. Students will be required to write summaries of the assigned readings and to present some of the application papers, in particular papers related to their project topic. Individual assignments and the project will be evaluated on design and analysis, writing, and critical thinking. Projects will be evaluated by peers and by the instructor, based on a final project presentation and a project report. Students are encouraged to attend every lecture and participate in class discussion.

Grading:

Assignments: 30% (individual)

Presentations: 20% (individual or team)

Project: 40% (individual or team)

Participation: 10%

Resources

Conferences

WWW

WWW06 WWW07 WWW08

IJCAI

IJCAI07 IJCAI05

AAAI

AAAI05 Semantic Web, Information Retrieval, and Extraction

AAAI06 Special Track on Artificial Intelligence and the Web

AAAI07 Special Track on Artificial Intelligence and the Web

AAAI08 Special Track on Artificial Intelligence and the Web

ISWC

ISWC07 ISWC08

Some Workshops

SIGIR Workshop NIPS Workshop

Additional Reading

V. Bush, As we may think, Atlantic Monthly, 176 (1945), pp.101-108.

PhD/MSc Work on IR

“Recommended reading for IR research students” by Moffat, Zobel, and Hawking. Appeared in the SIGIR Forum, 39(2), 2005. A list of readings collected by attendees at the SWIRL 2004 workshop held in Lorne, Australia. Available on-line in PDF.

Rose, D. E. and Levinson, D. 2004. Understanding user goals in web search. In Proceedings of the 13th International Conference on World Wide Web (New York, NY, USA, May 17 - 20, 2004). WWW '04. ACM, New York, NY, 13-19.

Monika R. Henzinger, Hyperlink Analysis for the Web, IEEE Internet Computing, vol. 5, no. 1, pp. 45-50, Jan/Feb. 2001.

BrowseRank: Letting Web Users Vote for Page Importance by Yuting Liu, Bin Gao, Tie-Yan Liu, et al., SIGIR 2008. (SIGIR Best Student Paper Award).

Using latent semantic analysis to find different names for the same entity in free text by Tim Oates, Vinay Bhat, Vishal Shanbhag, Charles Nicholas.

Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Gabrilovich, Evgeniy; Markovitch, Shaul. Proceedings of the 20th International Joint Conference on Artificial Intelligence, 2007.

Ziv Bar-Yossef, Maxim Gurevich. Random Sampling from a Search Engine's Index, Best Paper Award, World Wide Web Conference (WWW) 2007.

S. Pandit, C. Olston. Navigation-Aided Retrieval, World Wide Web Conference (WWW) 2007.

Z. Bar-Yossef, M. Gurevich. Efficient Search Engine Measurements, World Wide Web Conference (WWW) 2007.

Adomavicius G. and Tuzhilin A., Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, IEEE Trans. on Knowledge and Data Engineering, Vol. 17, No. 6, 2005.

Das A., Datar M., Garg A., Rajaram S., Google News Personalization: Scalable Online Collaborative Filtering, World Wide Web Conference (WWW) 2007.

Krallinger M, Valencia A: Text-mining and information-retrieval services for molecular biology. Genome Biol 2006, 6:224.

Yamamoto Y, Takagi T: Biomedical knowledge navigation by literature clustering. J Biomed Inform 2007, 40:114-130.

Mary Shultz (2007) Comparing test searches in PubMed and Google Scholar. J Med Libr Assoc. 2007 October; 95(4): 442–445.

Hunter, L., Cohen, K.B. Biomedical Language Processing: What’s Beyond PubMed? Molecular Cell 21(5):589-594, March 3, 2006.

Martin Krallinger, Alfonso Valencia and Lynette Hirschman (2008) Linking genes to literature: text mining, information extraction, and retrieval applications for biology Genome Biology 2008, 9(Suppl 2):S8.

Building Domain-Specific Search Engines with Machine Learning Techniques. Andrew McCallum, Kamal Nigam, Jason Rennie and Kristie Seymore. AAAI-99 Spring Symposium.

Shilpy Sharma (2008) Information Retrieval in Domain Specific Search Engine with Machine Learning Approaches

Mark D. Smucker and James Allan. "Find-Similar: Similarity Browsing as a Search Tool," SIGIR, 2006.

Jimmy Lin and Mark D. Smucker. "How Do Users Find Things with PubMed? Towards Automatic Utility Evaluation with User Simulations," Technical Report LAMP-TR-148/HCIL-2008-07, University of Maryland, College Park, February 2008.

Learning to Rank

Christopher J. C. Burges, Learning to Rank for Web Search: Some New Directions. Keynote talk.

Xing Wei, Bruce Croft and Andrew McCallum (2006) Table extraction for answer retrieval. Information Retrieval Journal (IRJ), volume 9, issue 5, pages 589-611, November 2006.

K Punera, S Rajan and J Ghosh, "Automatic Construction of N-ary Tree Based Taxonomies", MSD Workshop at ICDM '06.

Suju Rajan, Kunal Punera and Joydeep Ghosh, "A Maximum Likelihood Framework for Integrating Taxonomies", AAAI-05, pp. 856-861.

Aarti Gupta, Tim Oates: Using Ontologies and the Web to Learn Lexical Semantics. IJCAI 2007: 1618-1623.

Silva, R., Heller, K.A., and Ghahramani, Z. (2007) Analogical Reasoning with Relational Bayesian Sets. In the Eleventh International Conference on Artificial Intelligence and Statistics (AISTATS-2007). San Juan, Puerto Rico.

Ghahramani, Z. and Heller, K.A. (2006) Bayesian Sets. In Advances in Neural Information Processing Systems 18 (NIPS-2005).

Heller, K.A. and Ghahramani, Z. (2006) A Simple Bayesian Framework for Content-Based Image Retrieval. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR-2006).

Pranam Kolari et al., Detecting Spam Blogs: A Machine Learning Approach, Proceedings of the 21st National Conference on Artificial Intelligence (AAAI 2006), July 2006.

Zareen Syed et al., Wikipedia as an Ontology for Describing Documents, Proceedings of the Second International Conference on Weblogs and Social Media, March 2008.

Haizheng Zhang, W. Bruce Croft, Brian Levine, and Victor Lesser; A Multi-Agent Approach for Peer-to-Peer based Information Retrieval System, Proceedings of the Third International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2004)

A. Sieg, B. Mobasher, R. Burke. Learning Ontology-Based User Profiles: A Semantic Approach to Personalized Web Search. In IEEE Intelligent Informatics Bulletin, Vol.8, No. 1, November 2007.

A. Sieg, B. Mobasher, R. Burke. Web Search Personalization with Ontological User Profiles. Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM 2007) Lisboa, Portugal, November 2007.

Xuerui Wang, Andrew McCallum and Xing Wei. Topical N-grams: Phrase and Topic Discovery, with an Application to Information Retrieval. Proceedings of the 7th IEEE International Conference on Data Mining (ICDM), 2007.

Lightly-Supervised Attribute Extraction for Web Search. Kedar Bellare, Partha Pratim Talukdar, Giridhar Kumaran, Fernando Pereira, Mark Liberman, Andrew McCallum and Mark Dredze. NIPS Workshop on Machine Learning for Web Search, (NIPS WS), 2007.

Yi, X. and Allan, J. (2008) "Evaluating Topic Models for Information Retrieval," to appear in the Proceedings of CIKM 2008, Napa Valley, CA, October 26-30, 2008.

Kumaran, G. and Allan, J. (2008) "Effective and Efficient User Interaction for Long Queries," Proceedings of the 31st Annual International ACM SIGIR Conference, pp. 11-18.

R. Nallapati, A. Ahmed, E. P. Xing, and W. Cohen, Sparse Feature Joint Latent Topic Models for text and citations, Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008).

Community-based Link Prediction with Text. David Mimno, Hanna M. Wallach and Andrew McCallum. In Proceedings of the NIPS 2007 Workshop on Statistical Network Modeling (NIPS WS), 2007.

Zhenyu Liu and Wesley W. Chu "Knowledge-Based Query Expansion to Support Scenario-Specific Retrieval of Medical Free Text", Journal of Information Retrieval, special issue on Contextual Information Retrieval, 2007.

Kleinberg J. (1998). Authoritative Sources in a Hyperlinked Environment. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms. Also appears as IBM Research Report RJ 10076, May 1997.

Brin S. and Page L. (1998). The Anatomy of a Large-Scale Hypertextual Web Search Engine. Proceedings of the Seventh World Wide Web Conference (WWW7), Brisbane. Also in a special issue of the journal Computer Networks and ISDN Systems, Volume 30, issues 1-7.

Bharat K., & Henzinger M.R. (1998). Improved Algorithms for Topic Distillation in a Hyperlinked Environment. Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia.

Chakrabarti S., van den Berg M., Dom B. (1999). Focused crawling: a new approach to topic-specific Web resource discovery. In Proceedings of the 8th International World Wide Web Conference (WWW8). Also in Computer Networks & ISDN Systems 31(11-16):1623-1640.

Popular science

cis890fall08.txt · Last modified: 2009/08/18 21:23 by dcaragea