Fundamentals of Software Engineering: The Hadoop Ecosystem (An Ecosystem for Cloud Computing), lecturer 刘驰, lecture slides (.ppt)
Slide 1 (title): Fundamentals of Software Engineering: The Hadoop Ecosystem, An Ecosystem for Cloud Computing (刘驰)

Slide 2: Problem
- Batch (offline) processing of huge data sets on commodity hardware is not enough for real-time applications
- Strong desire for linear scalability
- Need infrastructure that handles all the mechanics, allowing developers to focus on the processing logic/algorithms

Slide 3: Explosive Data! Storage
- New York Stock Exchange: 1 TB of data per day
- Facebook: 100 billion photos, 1 PB (1,000 TB)
- Internet Archive: 2 PB of data, growing by 20 TB per month
- Can't put the data on a SINGLE node
- Strong need for distributed file systems

Slide 5: Java/Python/C interfaces

Slide 6: Commercial Hardware
- Typical two-tier architecture
- Nodes are ordinary commodity PCs
- 30-40 nodes per rack
- 3-4 Gbps bandwidth from the top-level switch to each rack
- 1 Gbps bandwidth from the rack switch to each node

Slide 7: Who Is (Was) Using Hadoop?

Slide 8: Example: Facebook's Hadoop Clusters
- Production cluster
  - 4,800 cores, 600 machines, 16 GB of RAM per machine (April 2009)
  - 8,000 cores, 1,000 machines, 32 GB of RAM per machine (July 2009)
  - 4 x 1 TB SATA disks per machine
  - Two-level network topology, 40 machines per rack
  - Total cluster size of 2 PB, still growing
- Test cluster
  - 800 cores, 16 GB of RAM per machine
Slide 9: A Distributed File System

Slide 10: Single-Node Architecture
- CPU, memory, and disk on a single machine
- Machine learning, statistics, "classical" data mining

Slide 11: Commodity Clusters
- Web data sets can be very large: tens to hundreds of TB
- Cannot mine them on a single server
- Standard architecture emerging:
  - Cluster of commodity Linux nodes
  - Gigabit Ethernet interconnect
- How to organize computations on this architecture?
  - Mask issues such as hardware failure

Slide 12: Cluster Architecture
- Each rack contains 16-64 nodes (each with its own CPU, memory, and disk), connected by a switch
- 1 Gbps between any pair of nodes in a rack
- 2-10 Gbps backbone between racks
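As a rough illustration of why the topology on the Cluster Architecture slide matters, the sketch below does a back-of-envelope comparison of moving data within a rack versus across the backbone. Everything in it is an assumption layered on the slide's figures: a 1 TB data set, 40 nodes sharing a 2 Gbps rack uplink, and no protocol overhead.

    // Back-of-envelope transfer times for the topology above (illustrative only).
    // Assumptions not on the slide: a 1 TB data set, 40 nodes sharing the rack
    // uplink, no protocol overhead.
    public class TransferTimeSketch {
        static double seconds(double bytes, double gbps) {
            return (bytes * 8) / (gbps * 1e9);
        }

        public static void main(String[] args) {
            double oneTB = 1e12;                                  // bytes
            double inRack = seconds(oneTB, 1.0);                  // 1 Gbps node-to-node
            double perNodeBackbone = 2.0 / 40;                    // 2 Gbps uplink shared by 40 nodes
            double crossRack = seconds(oneTB, perNodeBackbone);
            System.out.printf("Within a rack : ~%.1f hours%n", inRack / 3600);
            System.out.printf("Across racks  : ~%.1f hours%n", crossRack / 3600);
            // Takeaway: keep computation close to the data; cross-rack bandwidth is scarce.
        }
    }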
Slide 13: Stable Storage
- First-order problem: if nodes can fail, how can we store data persistently?
- Answer: a distributed file system
  - Provides a global file namespace
  - Google GFS; Hadoop HDFS; Kosmix KFS
- Typical usage pattern
  - Huge files (100s of GB to TB)
  - Data is rarely updated in place
  - Reads and appends are common

Slides 14-15: (HDFS architecture diagrams)
Slide 16: Namenode and Datanodes
- Master/slave architecture
- One Namenode: a master server that manages the file system namespace and regulates access to files by clients
- Many DataNodes, usually one per node in the cluster
  - Manage storage
  - Serve read and write requests; perform block creation, deletion, and replication upon instruction from the Namenode
- HDFS exposes a file system namespace and allows user data to be stored in files
- A file is split into one or more blocks, and the set of blocks is stored in DataNodes
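To make the last bullet concrete, here is a minimal sketch of how a file of a given size maps onto fixed-size blocks. It is illustrative arithmetic, not Hadoop code, and the 64 MB block size is an assumed value typical of early Hadoop releases.

    // How a file is split into blocks (illustrative arithmetic, not HDFS internals).
    public class BlockCountSketch {
        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024;          // assumed 64 MB block size
            long fileSize  = 1L * 1024 * 1024 * 1024;    // a 1 GB example file
            long fullBlocks = fileSize / blockSize;
            long lastBlock  = fileSize % blockSize;       // the final block may be shorter
            long total = fullBlocks + (lastBlock > 0 ? 1 : 0);
            System.out.println("Blocks needed: " + total);  // 16 blocks of 64 MB for 1 GB
        }
    }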
Slide 17: Namespace
- Hierarchical file system with directories and files
- Create, remove, move, rename, etc.
- The Namenode maintains the file system namespace
- Any metadata change to the file system is recorded by the Namenode
- An application can specify the number of replicas a file needs (the replication factor of the file); this information is stored in the Namenode
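The namespace operations listed on this slide map directly onto the HDFS Java client. The following is a minimal sketch, assuming the Hadoop client libraries on the classpath and a configured, reachable cluster; the paths and the replication factor are example values only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Basic namespace operations against HDFS; all of these requests go to the Namenode.
    public class NamespaceOps {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration()); // uses the configured default FS
            fs.mkdirs(new Path("/user/demo/reports"));           // create a directory
            fs.rename(new Path("/user/demo/reports"),            // move / rename
                      new Path("/user/demo/reports-2009"));
            // Request a per-file replication factor; the Namenode records it as metadata.
            fs.setReplication(new Path("/user/demo/data.log"), (short) 3);
            fs.delete(new Path("/user/demo/tmp"), true);          // remove, recursively
            fs.close();
        }
    }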
Slide 18: Data Replication
- Store very large files across machines in a large cluster
- Each file is a sequence of blocks of the same size
- Blocks are replicated 2-3 times
- Block size and replica count are configurable per file
- The Namenode receives a Heartbeat and a BlockReport from each DataNode in the cluster
- A BlockReport lists all the blocks on a DataNode
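Because block size and replica count are per-file settings, they can be supplied when the file is created. A hedged sketch follows: the path, buffer size, replication factor, block size, and the dfs.replication property value are example values, but the overloaded create() call is part of the org.apache.hadoop.fs.FileSystem API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Creating a file with an explicit replication factor and block size.
    public class CreateWithReplication {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("dfs.replication", "2");            // cluster-wide default (example value)
            FileSystem fs = FileSystem.get(conf);

            // Per-file override: 4 KB I/O buffer, 3 replicas, 128 MB blocks.
            FSDataOutputStream out = fs.create(
                    new Path("/user/demo/big.dat"),
                    true,                 // overwrite if it exists
                    4096,                 // io buffer size
                    (short) 3,            // replication factor for this file
                    128L * 1024 * 1024);  // block size in bytes
            out.writeBytes("hello hdfs\n");
            out.close();
            fs.close();
        }
    }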
Slide 19: Replica Placement
- Rack-aware placement
  - Goal: improve reliability, availability, and network bandwidth utilization
  - Research topic
- The Namenode determines the rack id of each DataNode
- Replicas are placed: one on a node in the local rack, one on a different node in the local rack, and one on a node in a different rack
- One third of the replicas are on one node, two thirds of the replicas are in one rack, and the remaining third is distributed evenly across the other racks

Slide 20: HDFS: Data Node Distance (diagram)
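A much-simplified sketch of the placement rule described on the Replica Placement slide: one replica on the writer's node, one on another node in the same rack, one on a node in a different rack. The Node/rack model below is hypothetical; this is not Hadoop's actual placement policy code, which runs inside the Namenode.

    import java.util.ArrayList;
    import java.util.List;

    // Simplified rack-aware placement as described on the slide (illustrative only).
    public class PlacementSketch {
        record Node(String name, String rack) {}

        static List<Node> place(Node writer, List<Node> cluster) {
            List<Node> targets = new ArrayList<>();
            targets.add(writer);                                        // replica 1: writer's node
            cluster.stream()
                   .filter(n -> n.rack().equals(writer.rack()) && !n.equals(writer))
                   .findFirst().ifPresent(targets::add);                // replica 2: same rack
            cluster.stream()
                   .filter(n -> !n.rack().equals(writer.rack()))
                   .findFirst().ifPresent(targets::add);                // replica 3: remote rack
            return targets;
        }

        public static void main(String[] args) {
            List<Node> cluster = List.of(
                new Node("n1", "rack1"), new Node("n2", "rack1"),
                new Node("n3", "rack2"), new Node("n4", "rack2"));
            System.out.println(place(cluster.get(0), cluster));
        }
    }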
Slide 21: Replication Pipelining
- When the client receives the response from the Namenode, it flushes its block in small pieces (4 KB) to the first replica, which in turn copies each piece to the next replica, and so on
- Thus data is pipelined from one Datanode to the next
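A toy model of the pipelining idea on this slide: the block travels in small pieces, and each piece is forwarded down the chain of replicas before the next piece is sent. Everything here is in-memory and simplified; real HDFS streams these pieces over TCP from one Datanode to the next.

    import java.io.ByteArrayOutputStream;
    import java.util.List;

    // Toy model of replication pipelining (in-memory stand-in for Datanode streaming).
    public class PipelineSketch {
        static final int PIECE = 4 * 1024;   // 4 KB pieces, as on the slide

        static void pipeline(byte[] block, List<ByteArrayOutputStream> replicas) {
            for (int off = 0; off < block.length; off += PIECE) {
                int len = Math.min(PIECE, block.length - off);
                // Each small piece travels down the whole chain before the next one
                // is sent, so downstream replicas start receiving data immediately.
                for (ByteArrayOutputStream replica : replicas) {
                    replica.write(block, off, len);
                }
            }
        }

        public static void main(String[] args) {
            byte[] block = new byte[64 * 1024];  // pretend this is one block
            List<ByteArrayOutputStream> replicas =
                List.of(new ByteArrayOutputStream(), new ByteArrayOutputStream(),
                        new ByteArrayOutputStream());
            pipeline(block, replicas);
            System.out.println("Each replica holds " + replicas.get(0).size() + " bytes");
        }
    }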
Slide 22: Replica Selection
- Replica selection for a READ operation: HDFS tries to minimize bandwidth consumption and latency
- If there is a replica on the reader's node, that replica is preferred
- An HDFS cluster may span multiple data centers: a replica in the local data center is preferred over a remote one

Slide 23: Datanode
- A Datanode stores data in files in its local file system
- The Datanode has no knowledge of the HDFS file system
- It stores each block of HDFS data in a separate file
- The Datanode does not create all files in the same directory
- It uses heuristics to determine the optimal number of files per directory and creates directories accordingly (research issue?)
- When the file system starts up, the Datanode generates a list of all its HDFS blocks and sends this report to the Namenode: the Blockreport

Slide 24: HDFS: File Read (diagram)

Slide 25: HDFS: File Write (diagram)
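The two figure slides above show the read and write paths; from client code, both boil down to the same FileSystem API. A minimal read sketch follows, assuming a configured Hadoop client and that the example file /user/demo/big.dat was written earlier (both are assumptions, not part of the slides).

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Reading an HDFS file: the client asks the Namenode for block locations,
    // then streams the blocks directly from the Datanodes.
    public class ReadExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataInputStream in = fs.open(new Path("/user/demo/big.dat"));
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }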
Slide 26: Communication Protocol
- All protocols are layered on top of TCP/IP
- A client establishes a connection to a configurable TCP port on the Namenode machine and speaks the ClientProtocol with the Namenode
- Datanodes talk to the Namenode using the Datanode protocol
- An RPC abstraction wraps both the ClientProtocol and the Datanode protocol
- The Namenode is simply a server and never initiates a request; it only responds to RPC requests issued by DataNodes or clients
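From the client's point of view this RPC plumbing is hidden behind the filesystem URI: pointing the default filesystem at the Namenode's RPC address is what makes FileSystem.get() speak ClientProtocol to it. A sketch follows; the host name and port are example values, and the property is fs.defaultFS in newer Hadoop releases (older releases used fs.default.name).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    // Connecting a client to the Namenode's RPC port; the ClientProtocol calls
    // happen underneath the FileSystem facade.
    public class ConnectExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Example address only; use whatever port the Namenode is configured to listen on.
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Connected to " + fs.getUri());
            fs.close();
        }
    }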
Slide 27: DataNode Failure and Heartbeat
- Datanodes may lose connectivity with the Namenode
- The Namenode detects this condition by the absence of a Heartbeat message
- The Namenode marks Datanodes without a recent Heartbeat as failed and does not send any I/O requests to them
- Any data registered to the failed Datanode is no longer available to HDFS
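A toy version of the failure-detection rule on this slide: a Datanode whose last Heartbeat is older than a timeout is treated as dead. The class, its bookkeeping, and the 10-minute timeout are all assumptions for illustration; the real logic lives inside the Namenode.

    import java.util.HashMap;
    import java.util.Map;

    // Namenode-style failure detection sketch: no recent Heartbeat means the
    // Datanode is considered dead and excluded from I/O. Timeout is an assumption.
    public class HeartbeatMonitorSketch {
        static final long TIMEOUT_MS = 10 * 60 * 1000;        // e.g. 10 minutes

        private final Map<String, Long> lastHeartbeat = new HashMap<>();

        void onHeartbeat(String datanodeId) {
            lastHeartbeat.put(datanodeId, System.currentTimeMillis());
        }

        boolean isDead(String datanodeId) {
            Long last = lastHeartbeat.get(datanodeId);
            return last == null || System.currentTimeMillis() - last > TIMEOUT_MS;
        }

        public static void main(String[] args) {
            HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch();
            monitor.onHeartbeat("dn-17");
            System.out.println("dn-17 dead? " + monitor.isDead("dn-17"));   // false
            System.out.println("dn-42 dead? " + monitor.isDead("dn-42"));   // true (never reported)
        }
    }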
Slide 28: Cluster Rebalancing
- The HDFS architecture is compatible with data rebalancing schemes
- A scheme might move data from one Datanode to another if the free space on a Datanode falls below a certain threshold
- In the event of a sudden high demand for a particular file, a scheme might dynamically create additional replicas and rebalance other data in the cluster
- These types of data rebalancing are not yet implemented: research issue
Slide 29: APIs
- HDFS provides a Java API for applications to use
- Python access is also used in many applications
- A C-language wrapper for the Java API is also available
- An HTTP browser can be used to browse the files of an HDFS instance

Slide 30: FS Shell, Admin and Browser Interface