Hadoop Application Architectures (Facsimile Edition)
- Price: ¥89 (RMB)
- Authors: Mark Grover, Ted Malaska, Jonathan Seidman, Gwen Shapira
- Publication date: 2017/2/1
- ISBN: 9787564170011
- Publisher: Southeast University Press
- CLC classification: TP274
- Pages:
- Paper: offset paper
- Edition: 1
- Format: 16mo (16開)

Get expert guidance on architecting end-to-end data management solutions with Apache Hadoop. While many other sources stop at explaining how to use the various components of the Hadoop ecosystem, this practice-focused book leads you to think from an overall architectural perspective, tying all the components together into the complete, targeted application your particular use case demands.

To reinforce those lessons, the second part of the book provides detailed architecture case studies covering some of the most common Hadoop application scenarios.

Whether you're designing a new Hadoop application or planning to integrate Hadoop into your existing data infrastructure, Hadoop Application Architectures (facsimile edition, in English) by Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira will skillfully guide you through the entire process. Topics covered include:

- Factors to consider when using Hadoop to store and model data
- Best practices for moving data into and out of the system
- Data processing frameworks, including MapReduce, Spark, and Hive
- Common Hadoop processing patterns, such as removing duplicate records and using windowing analytics (a minimal sketch follows this list)
- Giraph, GraphX, and other tools for large-scale graph processing on Hadoop
- Using workflow orchestration and scheduling tools such as Apache Oozie
- Near-real-time stream processing with Apache Storm, Apache Spark Streaming, and Apache Flume
- Architecture examples for clickstream analysis, fraud detection, and data warehousing
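
As a taste of the processing patterns covered in Chapter 4, here is a minimal sketch of primary-key deduplication with Spark in Scala. This is not the book's own example code: the input path, the "id,timestamp,payload" record layout, and the field positions are assumptions made purely for illustration.

    // Minimal sketch: keep only the newest record per primary key.
    // Assumes hypothetical input lines of the form "id,timestamp,payload".
    import org.apache.spark.{SparkConf, SparkContext}

    object DedupSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("dedup-sketch"))

        // Parse each line into (primaryKey, (timestamp, payload)).
        val records = sc.textFile("hdfs:///data/records")
          .map(_.split(",", 3))
          .map(f => (f(0), (f(1).toLong, f(2))))

        // For each key, keep the record with the latest timestamp.
        val deduped = records.reduceByKey((a, b) => if (a._1 >= b._1) a else b)

        deduped.map { case (id, (ts, payload)) => s"$id,$ts,$payload" }
          .saveAsTextFile("hdfs:///data/records_deduped")

        sc.stop()
      }
    }

The book presents this pattern in both Spark and SQL form; either way, the idea is to group records by primary key and retain only the most recent version of each.
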
Foreword
Preface
Part I. Architectural Considerations for Hadoop Applications
1. Data Modeling in Hadoop
  Data Storage Options
  Standard File Formats
  Hadoop File Types
  Serialization Formats
  Columnar Formats
  Compression
  HDFS Schema Design
  Location of HDFS Files
  Advanced HDFS Schema Design
  HDFS Schema Design Summary
  HBase Schema Design
  Row Key
  Timestamp
  Hops
  Tables and Regions
  Using Columns
  Using Column Families
  Time-to-Live
  Managing Metadata
  What Is Metadata?
  Why Care About Metadata?
  Where to Store Metadata?
  Examples of Managing Metadata
  Limitations of the Hive Metastore and HCatalog
  Other Ways of Storing Metadata
  Conclusion
2. Data Movement
  Data Ingestion Considerations
  Timeliness of Data Ingestion
  Incremental Updates
  Access Patterns
  Original Source System and Data Structure
  Transformations
  Network Bottlenecks
  Network Security
  Push or Pull
  Failure Handling
  Level of Complexity
  Data Ingestion Options
  File Transfers
  Considerations for File Transfers versus Other Ingest Methods
  Sqoop: Batch Transfer Between Hadoop and Relational Databases
  Flume: Event-Based Data Collection and Processing
  Kafka
  Data Extraction
  Conclusion
3. Processing Data in Hadoop
  MapReduce
  MapReduce Overview
  Example for MapReduce
  When to Use MapReduce
  Spark
  Spark Overview
  Overview of Spark Components
  Basic Spark Concepts
  Benefits of Using Spark
  Spark Example
  When to Use Spark
  Abstractions
  Pig
  Pig Example
  When to Use Pig
  Crunch
  Crunch Example
  When to Use Crunch
  Cascading
  Cascading Example
  When to Use Cascading
  Hive
  Hive Overview
  Example of Hive Code
  When to Use Hive
  Impala
  Impala Overview
  Speed-Oriented Design
  Impala Example
  When to Use Impala
  Conclusion
4. Common Hadoop Processing Patterns
  Pattern: Removing Duplicate Records by Primary Key
  Data Generation for Deduplication Example
  Code Example: Spark Deduplication in Scala
  Code Example: Deduplication in SQL
  Pattern: Windowing Analysis
  Data Generation for Windowing Analysis Example
  Code Example: Peaks and Valleys in Spark
  Code Example: Peaks and Valleys in SQL
  Pattern: Time Series Modifications
  Use HBase and Versioning
  Use HBase with a RowKey of RecordKey and StartTime
  Use HDFS and Rewrite the Whole Table
  Use Partitions on HDFS for Current and Historical Records
  Data Generation for Time Series Example
  Code Example: Time Series in Spark
  Code Example: Time Series in SQL
  Conclusion
5. Graph Processing on Hadoop
  What Is a Graph?
  What Is Graph Processing?
  How Do You Process a Graph in a Distributed System?
  The Bulk Synchronous Parallel Model
  BSP by Example
  Giraph
  Read and Partition the Data
  Batch Process the Graph with BSP
  Write the Graph Back to Disk
  Putting It All Together
  When Should You Use Giraph?
  GraphX
  Just Another RDD
  GraphX Pregel Interface
  vprog()
  sendMessage()
  mergeMessage()
  Which Tool to Use?
  Conclusion
6. Orchestration
  Why We Need Workflow Orchestration
  The Limits of Scripting
  The Enterprise Job Scheduler and Hadoop
  Orchestration Frameworks in the Hadoop Ecosystem
  Oozie Terminology
  Oozie Overview
  Oozie Workflow
  Workflow Patterns
  Point-to-Point Workflow
  Fan-Out Workflow
  Capture-and-Decide Workflow
  Parameterizing Workflows
  Classpath Definition
  Scheduling Patterns
  Frequency Scheduling
  Time and Data Triggers
  Executing Workflows
  Conclusion
7. Near-Real-Time Processing with Hadoop
  Stream Processing
  Apache Storm
  Storm High-Level Architecture
  Storm Topologies
  Tuples and Streams
  Spouts and Bolts
  Stream Groupings
  Reliability of Storm Applications
  Exactly-Once Processing
  Fault Tolerance
  Integrating Storm with HDFS
  Integrating Storm with HBase
  Storm Example: Simple Moving Average
  Evaluating Storm
  Trident
  Trident Example: Simple Moving Average
  Evaluating Trident
  Spark Streaming
  Overview of Spark Streaming
  Spark Streaming Example: Simple Count
  Spark Streaming Example: Multiple Inputs
  Spark Streaming Example: Maintaining State
  Spark Streaming Example: Windowing
  Spark Streaming Example: Streaming versus ETL Code
  Evaluating Spark Streaming
  Flume Interceptors
  Which Tool to Use?
  Low-Latency Enrichment, Validation, Alerting, and Ingestion
  NRT Counting, Rolling Averages, and Iterative Processing
  Complex Data Pipelines
  Conclusion
Part II. Case Studies
8. Clickstream Analysis
  Defining the Use Case
  Using Hadoop for Clickstream Analysis
  Design Overview
  Storage
  Ingestion
  The Client Tier
  The Collector Tier
  Processing
  Data Deduplication
  Sessionization
  Analyzing
  Orchestration
  Conclusion
9. Fraud Detection
  Continuous Improvement
  Taking Action
  Architectural Requirements of Fraud Detection Systems
  Introducing Our Use Case
  High-Level Design
  Client Architecture
  Profile Storage and Retrieval
  Caching
  HBase Data Definition
  Delivering Transaction Status: Approved or Denied?
  Ingest
  Path Between the Client and Flume
  Near-Real-Time and Exploratory Analytics
  Near-Real-Time Processing
  Exploratory Analytics
  What About Other Architectures?
  Flume Interceptors
  Kafka to Storm or Spark Streaming
  External Business Rules Engine
  Conclusion
10. Data Warehouse
  Using Hadoop for Data Warehousing
  Defining the Use Case
  OLTP Schema
  Data Warehouse: Introduction and Terminology
  Data Warehousing with Hadoop
  High-Level Design
  Data Modeling and Storage
  Ingestion
  Data Processing and Access
  Aggregations
  Data Export
  Orchestration
  Conclusion
A. Joins in Impala
Index