大规模数据处理/云计算 Introduction - PKU

大规模数据处理/云计算 Introduction - PKU

/ Lecture 1 Introduction to MapReduce 7/9/2013 http://net.pku.edu.cn/~course/cs402/ Jimmy Lin University of Maryland SEWMGroup This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details What is this course about?

Data-intensive information processing Large-data (web-scale) problems Focus on MapReduce programming An entry-level course~ 2 What is MapReduce? Programming model for expressing distributed computations at a massive scale Execution framework for organizing and

performing such computations Open-source implementation called Hadoop 3 Why Large Data? How much data? Google processes 20 PB a day (2008) Wayback Machine has 3 PB + 100 TB/month (3/2009)

Facebook has 2.5 PB of user data + 15 TB/day (4/2009) eBay has 6.5 PB of user data + 50 TB/day (5/2009) CERNs LHC will generate 15 PB a year 640K ought to be enough for anybody. 5

6 Happening everywhere! microarray chips microprocessors Molecular biology (cancer) fiber optics Network traffic (spam) 300M/day

Simulations (Millennium) particle colliders Particle events (LHC) 1B 7 1M/sec 8 Maximilien Brice, CERN

9 Maximilien Brice, CERN 10 Maximilien Brice, CERN 11 Maximilien Brice, CERN No data like more data! s/knowledge/data/g; How do we get here if were not Google?

(Banko and Brill, ACL 2001) (Brants et al., EMNLP 2007) 12 Example: information extraction Answering factoid questions Pattern matching on the Web Works amazingly well Who shot Abraham Lincoln? X shot Abraham Lincoln

Learning relations Start with seed instances Search for patterns on the Web Using patterns to find more instances Wolfgang Amadeus Mozart (1756 - 1791) Einstein was born in 1879 Birthday-of(Mozart, 1756) Birthday-of(Einstein, 1879)

PERSON (DATE PERSON was born in DATE (Brill et al., TREC 2001; Lin, ACM TOIS 2007) (Agichtein and Gravano, DL 2000; Ravichandran and Hovy, ACL 2002; ) 13 Example: Scene Completion Hays, Efros (CMU), Scene Completion Using Millions of Photographs SIGGRAPH, 2007 Image Database Grouped by Semantic Content

30 different Flickr.com groups 2.3 M images total (396 GB). Select Candidate Images Most Suitable for Filling Hole

Classify images with gist scene detector [Torralba] Color similarity Local context matching Computation

Index images offline 50 min. scene matching, 20 min. local matching, 4 min. compositing Reduces to 5 minutes total by using 5 machines Extension Flickr.com has over 500 million images 14 14

More Data More Gains? CNNIC 2010 6 4.2 31.8% 4334 2.77 18.6% 1.4 30% 15 2009 2009 238868 145475 93393 37. 88

312.46 73.4 1.41 0.33 567.27 4.73 8.86% 11.24% 5.36% 4.53% 4.61% 8.94% 16 Did you know? 17 Did you know? We are currently preparing our students for

jobs that dont yet exist It is estimated that a weeks worth of the New York Times contains more information than a person was likely to come across in a lifetime in the 18th century The amount of new technical information is doubling every 2 years So what does IT ALL MEAN? 18 We are living in exponential times 19 Two Different Views

a thrower-awayer Jennifer Widom trying to live an efficient life so that one has time to work and be with ones family. MyLifeBits Gordon Bell 20

Information Overloading 21

What is Cloud Computing? The best thing since sliced bread? Before clouds Grids Vector supercomputers

Cloud computing means many different things: Large-data processing Rebranding of web 2.0 Utility computing Everything as a service 23 Rebranding of web 2.0

Rich, interactive web applications Clouds refer to the servers that run them AJAX as the de facto standard (for better or worse) Examples: Facebook, YouTube, Gmail, The network is the computer: take two

User data is stored in the clouds Rise of the netbook, smartphones, etc. Browser is the OS 24 Source: Wikipedia (Electricity meter) Utility Computing What?

Why? Computing resources as a metered service (pay as you go) Ability to dynamically provision virtual machines Cost: capital vs. operating expenses Scalability: infinite capacity Elasticity: scale up or down on demand Does it make sense?

Benefits to cloud users Business case for cloud providers I think there is a world market for about five computers. 26 Everything as a Service Utility computing = Infrastructure as a Service (IaaS)

Platform as a Service (PaaS) Why buy machines when you can rent cycles? Examples: Amazons EC2, Rackspace Give me nice API and take care of the maintenance, upgrades, Example: Google App Engine

Software as a Service (SaaS) Just run it for me! Example: Gmail, Salesforce 27 Utility Computing pay-as-you-go Microsoft pay less

utility computing 500 use less pay less cloud computing 28 Platform as a Service (PaaS) Web Application Services PaaS Internet Multi-tenant architecture platform 29

Software as a Service (SaaS) a model of software deployment whereby a provider licenses an application to customers for use as a service on demand. 30 Who cares? Ready-made large-data problems

Lots of user-generated content Even more user behavior data Examples: Facebook friend suggestions, Google ad placement Business intelligence: gather everything in a data warehouse and run analytics to generate insight Utility computing

Provision Hadoop clusters on-demand in the cloud Lower barrier to entry for tackling large-data problem Commoditization and democratization of large-data capabilities 31 Story around Hadoop Google-IBM Cloud Computing Initiative 2007 10 Google IBM 6

MapReduce Hadoop 200 0 2500 33 Cloud Computing Initiative Google and IBM team on cloud computing initiative for universities(2007) provide several hundred computers access through the Internet to test parallel programming projects The idea for the program from

Google senior software engineer Christophe Bisciglia Google Code University 34 The Information Factories Googleplex(pre-2008) servers number 450,000, according to the lowest estimate 200 petabytes of hard disk storage four petabytes of RAM To handle the current load of 100 million queries a day, input-output bandwidth must be

in the neighborhood of 3 petabits per second 35 Google Infrastructure 2003 "The Google file system," in sosp. Bolton Landing, NY, USA: ACM Press, 2003. 2004 "MapReduce: Simplified Data Processing on Large Clusters," in osdi, 2004, 2006 "Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!)," in osdi, 2006 . 36

Hadoop Project Doug Cutting 37 History of Hadoop

2004 - Initial versions of what is now Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes. January 2006 - Doug Cutting joins Yahoo! February 2006 - Apache Hadoop project official started to support the standalone development of Map-Reduce and HDFS. March 2006 - Formation of the Yahoo! Hadoop team May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes

April 2006 - Sort benchmark run on 188 nodes in 47.9 hours May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark) October 2006 - Research cluster reaches 600 Nodes December 2006 - Sort times 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 January 2007 - Research cluster reaches 900 node April 2007 - Research clusters - 2 clusters of 1000 nodes April 2008: Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes. October 2008: Loading 10 terabytes of data per day onto research clusters. March 2009: 17 clusters with a total of 24,000 nodes. April 2009: Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100terabyte sort in 173 minutes (on 3,400 nodes). 38 Google Code University 2008, Seminar: Mass Data

Processing Technology on Large Scale Clusters, Tsinghua University Aaron Kimball 39 Startup: Cloudera Cloudera is pushing a commercial distribution for Hadoop Mike Olson Christophe Bisciglia Doug Cutting Aaron Kimball Tom White

40 Course Administrivia TextBooks [Tom] Tom White, Hadoop: The Definitive Guide, O'Reilly, 3rd, 2012.5. [Lin] Jimmy Lin and Chris Dyer, Data-Intensive Text Processing with MapReduce, 2013.1. 42 This schedule is tentative and subject to change without notice ID Topics

Contents Reading 7/9 Introduction to MapReduce ( ppt) Why large data? Cloud Computing Story about Hadoop [Lin]Ch1:Introduction

[Tom]Ch1:Meet Hadoop 7/11 MapReduce Basics (ppt) How do we scale up? MapReduce HDFS [Lin]Ch2:Mapreduce Basics [Tom]Ch6:How mapreduce works [GFS&MapReduce Paper] MapReduce Program Develop

Basic MapReduce algorithm design and design patterns [Tom]Ch5:Developing a MapReduce Application [Lin]Ch3:MapReduce algorithm design Introduction to Information Retrieval Inverted Index on MapReduce Retrieval Problems [Lin]Ch4:Inverted Indexing for Text Retrieval Graph Algorithm and Mapreduce Parallel Breadth-First-Search

PageRank [Lin]Ch5:Graph Algorithms What is clustering Applications of clustering in information retrieval K-means algorithm Evaluation of clustering How many clusters Canopy clustering [IIR] Ch16 7/16

7/18 7/23 7/25 MapReduce Algorithm Design(ppt) Text retrieval ( ppt) Graph Algorithm (ppt ) Flat Clustering

(ppt) Canopy Clustering (ppt) 43 Recap Why large data? Cloud computing Story about Hadoop 44

Recently Viewed Presentations

  • Jeopardy

    Jeopardy

    Jeopardy Causes Responses Leaders Miscellaneous Conditions Miscellaneous Q $100 Q $100 Q $100 Q $100 Q $100 Q $200 Q $200 Q $200 Q $200 Q $200 Q $300 Q $300 Q $300
  • DNA - MYP Biology 10

    DNA - MYP Biology 10

    The double helix 'ladder' of a DNA molecule is held together by 'rungs' made from pairs of chemicals called bases. DNA Bases. There are four types of bases, and they are usually identified by their initials. A. T. C. G....
  • GREENING Commercial Leases:

    GREENING Commercial Leases:

    GREEN LEASE? Any strategy that uses a lease to formalize the responsibilities between Tenants and Landlords with respect to a building's green practices. Such leases 'align' the financial and energy incentives of GREEN buildings to that both Tenants and Landlords...
  • Graphing 101 - Laney College

    Graphing 101 - Laney College

    The x-axis lies in the horizontal direction. the y-axis lies in the vertical direction. the intersection of these two axis is the origin. the origin usually has the coordinates (0,0) but can also have any two values.
  • Who Benefits Whom in Product Markets?

    Who Benefits Whom in Product Markets?

    Who Benefits Whom in Product Markets? Joel Waldfogel The Wharton School University of Pennsylvania
  • Economic Evaluation in D & I Research

    Economic Evaluation in D & I Research

    Weigh and Win, (implemented in Denver, Colorado), is a 12-month community weight loss program with daily email and/or text support, online access to health coaches, objective weight assessment and modest financial incentives to increase program reach.
  • Why did Mao launch the Great Leap Forward?

    Why did Mao launch the Great Leap Forward?

    Why did Mao launch the Great Leap Forward? LO: To examine Mao's aims of the 2nd 5 Year Plan . Key - GLF is an industrial 5YP but it has a huge impact on the countryside and agricultural workers. Recap:...
  • THE DURHAM REPORT - University of British Columbia

    THE DURHAM REPORT - University of British Columbia

    Lord Durham's report led to the Union of Upper and Lower Canada into the province of Canada in 1840. Change in Britain's Colonial Policy. Britain was starting to rethink its colonial policies in the 1840s. The purpose of having colonies...