
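Putting Experience from Google DevOps into Practice: SRE Slides (来自GoogleDevOps经验的落地实践-SRE课件.pptx)

  • Uploader (seller): 晟晟文业
  • Document ID: 5149638
  • Upload date: 2023-02-15
  • Format: PPTX
  • Pages: 31
  • Size: 1.83 MB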

Keywords:
Google DevOps, experience, implementation, practice, SRE, slides
Resource description:

What on earth is SRE?

Google SRE 07-14: YouTube
  • Streaming video: transcoding, streaming, storage (1 PB/month)
  • Global CDN network (10K nodes, peaking at 10 Tbps egress)

Google SRE 07-14: Google Cloud Platform
  • Machine lifecycle management (X clusters globally, Y machines)
  • Borg, Omega (X million jobs scheduled every week)

SITE RELIABILITY ENGINEERING
Put plainly, it is the same thing as DevOps.

Site
  • Production line administrator
      • Ensure user-visible uptime and service quality
      • Authority over the production environment
  • Grows together with the site
      • Steep learning curve, mostly due to complexity
      • Continuous retraining, sites always being improved
  • Infrastructure
      • Specializations for shared infrastructure
      • Ensure those components have good reliability

Reliability: "it just works"
  • Service Level Objective (SLO), Monitoring/Deployment, Capacity Planning
  • One against a hundred
      • Team manages monitoring and develops automation
      • Implies use of scripting and data analysis tools
      • Most failures need automated recoveries in place
  • Firefighter and arsonist rolled into one
      • Elevated risk during convenient working hours
      • Learn of age mortality risk during preceding workday
      • Infant mortality ideally also avoids meals

Engineering
  • Coders: not administration
  • Heavy (addicted) users of the alerting system
      • Holes may cause an outage before notification occurs
      • Routinely use multiple layers, levels and viewpoints
      • Design the manual and automatic escalation paths
  • Responsible for the future
      • Responsible for enabling growth and scaling
      • Plan for requirements, identify inefficiencies
      • File bugs and, where appropriate, fix them too

Who are SRE
  • Programmers who strayed off course: a 50-50 mix of software background and systems engineering background
  • Severely obsessive-compulsive perfectionists: "a team of people who fundamentally will not accept doing things over and over by hand." (Ben Treynor)
  • Thick-skinned

DEV/OPS: eternal conflict
  • DEV: The incentive of the development team is to get features launched and to get users to adopt the product.
  • OPS: The incentive of a team with operational duties is to ensure that the thing doesn't blow up on their watch.

The organizational structure in one diagram
[Org chart labels: BOSS; product lines; junior BOSS; arts category; development team; production line; APP SRE; Infrastructure SRE; data center operations; supply chain]

Organizational structure
  • Centered on the product lines; a loosely coupled learning organization
  • Get incentives right.
  • SRE is a privilege, not a right.
  • Free to move, free to leave a bad service.

What SRE should do
  • SRE has the final say: Production Readiness Review (PRR).
  • ROI matters most for SRE
      • SRE resource is limited
      • High marginal benefits work

Early phase
  • SRE gives guidance in automating routine tasks
      • Reduces workload by eliminating administrivia
  • SRE points out errors and omissions in documents
      • Developer might then beg others for assistance
  • SRE suggests additional long term monitors
      • These fill in coverage gaps and track performance
      • Administrators need sufficient, trustworthy monitoring

Mature phase
  • The decisions become progressively longer term
      • Daily task workload for a site is getting reduced
      • Software improvements are tuning and analysis
  • The developer still has a short term viewpoint
      • Working on the next release, fixing known bugs
      • The old live releases start to be a distraction
      • An obvious incentive to request site transfer to SRE

ONCALL PHASE
  • On call is more than quick fixes
      • SRE team members take turns.
      • Fix any problem whose solution is not yet automated
      • Accumulate occurrence counts to identify priorities
      • Document the effective diagnostics and solutions
  • The permanent solution takes a lot more time
      • File bug, develop patch, test, code review, submit
      • Schedule for integration, release and deployment
      • Why spend many hours or days doing all that?
  • Deployment model: following the sun
      • Only one engineer responds to any given alert
      • Use a priority or escalation rule to avoid wasted effort
      • The other SREs on call are unlikely to be disturbed
  • Redundancy everywhere!
      • What is the failure rate of your paging services?
      • Hopefully better than 10%, unlikely to achieve 1%
      • e.g. 5% with four-way redundant paths is 99.999%

SRE Best practice
How to build a good SRE team.

CMM maturity model
  • Level 1: Initial (Chaotic, Heroic)
  • Level 2: Repeatable
  • Level 3: Defined
  • Level 4: Managed
  • Level 5: Optimizing

OPS OVERLOAD: how to fill the holes effectively
  • Reduce complexity: fewer dependencies, configuration types, interfaces
  • Knowledge sharing
  • Assume there are no humans operating. Refocus human involvement
  • Quarterly Service Review
  • Hard cap: 50% ops load
  • Provide career path

SLO Budgeting: how to fight with the development team the right way
  • Establish an SLO goal: nothing is 100% reliable.
  • Spend the error budget on bad releases, technical debt, etc.
  • FREEZE on blown budget.
  • No more arguing: it's physics!
  • Self-policing. Incentives. Moral authority.

FAILURES: disaster severity classification
  • Failover with minimal delay, near full quality service
  • Failover with significant delay, near full quality service
  • Partial or limited service, with good to medium quality
  • Prevent crash with no or very limited service, low quality
  • Crash without data loss or corruption
  • Crash with data loss
  • Crash with data loss, corruption, destruction

MTBF/MTTR: the two key metrics for safe production
  • Availability = f(MTBF, MTTR)
  • You can make it fail very rarely, or you are able to fix it really quickly when it does fail.
  • Typically, no human will respond in less than two minutes to something that goes wrong.
  • It is that the human correctly assesses the situation and takes the appropriate corrective actions, versus diagnosing incorrectly or taking ineffective steps.

Design: Failures (how to design systems that do not break)
  • Defense in depth
      • All the different layers of the system can/will fail.
      • User experience must not be affected.
      • No human involvement.
  • Graceful degradation: caching/time shifting
  • Failover: redundant instances, N+2
  • Localization of issue

Capacity planning: how to spend money on machines the right way
  • "What N+M do you run your services at?" "I don't know, because we've never assessed what the capacity of our service is."
  • Expect frequent outages and lots of emergencies.
  • Tips
      • How to benchmark the service
      • How to measure its response to 100% or 130% of peak
      • How much spare capacity you have at peak demand time

DESIGN: monitoring (implementing the monitoring system the right way)
  • Alerts, which say a human must take action right now. Something that is happening or about to happen, that a human needs to take action immediately to improve the situation.
  • Tickets. A human needs to take action, but not immediately. You have maybe hours, typically, days, but some human action is required.
  • Logging. No one ever needs to look at this information, but it is available for diagnostic or forensic purposes. The expectation is that no one reads it.

Wheel of misfortune: live drills
  • Operational readiness drills
  • Tips: picking a disaster, role playing, observe
  • Drill people on the correct response to emergency situations until they don't have to think about it.
  • Culturally compatible

D.I.R.T: planet-scale disaster drills
  • Simulated disaster recovery
  • Total site loss
  • Incident management
  • Business continuation

POSTMORTEM
  • Blameless postmortem
      • About process and technology, not people
      • Readable and shared with a wide variety of readers
      • No over-engineering
  • Capture the facts
      • Impacted services and magnitude
      • Incident timeline
      • Key contact info
      • Data
  • Root cause and trigger analysis
      • 5 Whys
      • Why was this possible in the first place?
  • Lessons learned
      • What went well
      • What went wrong
  • Action plan
      • Investigate, management of issue
      • Mitigation
      • Prevention

A problem is resolved to the degree that no human being will ever have to pay attention to it again.
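
The paging-redundancy figure in the on-call material (a 5% failure rate per path, four redundant paths, about 99.999% delivery) can be checked with a few lines of Python. This is a minimal sketch that assumes the paths fail independently, which the slides do not state; the function name `combined_delivery_rate` is illustrative only.

```python
def combined_delivery_rate(path_failure_rate: float, paths: int) -> float:
    """Probability that at least one paging path delivers the alert.

    Assumption (not from the slides): paths fail independently, so all
    of them fail together with probability path_failure_rate ** paths.
    """
    if not 0.0 <= path_failure_rate <= 1.0:
        raise ValueError("path_failure_rate must be between 0 and 1")
    if paths < 1:
        raise ValueError("paths must be at least 1")
    return 1.0 - path_failure_rate ** paths


if __name__ == "__main__":
    # Show how each extra redundant path improves the delivery rate.
    for n in range(1, 5):
        print(f"{n} path(s): {combined_delivery_rate(0.05, n):.6%}")
```

With four paths the all-paths-fail probability is 0.05 to the fourth power, or 6.25e-06, which matches the slide's rounded 99.999%. Correlated failures (a shared carrier or a shared pager vendor, for example) would make the real figure worse than this estimate.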

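
The slides give availability only as a function f(MTBF, MTTR) and describe SLO budgeting without spelling out the arithmetic. The sketch below assumes the common steady-state form MTBF / (MTBF + MTTR) and a downtime-based error budget over a 30-day window. The formula choice, the window, and all function names are assumptions for illustration, not taken from the deck.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability, assumed to be MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be > 0 and mttr_hours >= 0")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over the window (hypothetical policy)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def should_freeze(slo: float, downtime_minutes: float, window_days: int = 30) -> bool:
    """True when the budget is blown, mirroring 'FREEZE on blown budget'."""
    return downtime_minutes > error_budget_minutes(slo, window_days)


if __name__ == "__main__":
    # Failing rarely (higher MTBF) or recovering quickly (lower MTTR)
    # both raise availability.
    print(f"availability: {availability(1000, 1):.4%}")
    print(f"budget at 99.9% SLO: {error_budget_minutes(0.999):.1f} minutes per 30 days")
    print(f"freeze after 50 minutes of downtime: {should_freeze(0.999, 50)}")
```

Under these assumptions a 99.9% SLO leaves roughly 43 minutes of downtime per 30 days. A request-based budget (failed requests over total requests) is an equally plausible reading of the slides.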
About this document
Title: 来自GoogleDevOps经验的落地实践-SRE课件.pptx
Link: https://www.163wenku.com/p-5149638.html
