Data Analysis with Open Source Tools (Reprint Edition)

Publication date: June 2011

Pages: 509

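Collecting data is relatively easy, but turning raw information into something useful requires knowing how to extract precisely what you need. Through this book's in-depth coverage, intermediate to experienced programmers interested in data analysis will learn techniques for working with data in a business environment. You will learn how to look at data to discover what it contains, how to capture those ideas in conceptual models, and how to feed your understanding back into your organization through business plans, metrics reporting, and other channels.
Along the way, you will try out the concepts in the hands-on workshops at the end of each chapter. Above all, you will learn how to think about the data you want to obtain, rather than relying on tools to do the thinking for you.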

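·Use graphics to describe data with one, two, or dozens of variables
·Develop conceptual models using back-of-the-envelope calculations, as well as scaling and probability arguments
·Mine data with computationally intensive methods such as simulation and clustering
·Make your conclusions understandable through reports, dashboards, and other metrics programs
·Understand financial calculations, including the time value of money
·Use dimensionality reduction techniques or predictive analytics to overcome challenges in data analysis
·Become familiar with different open source programming environments for data analysis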

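“Finally, a concise reference for understanding how to conquer data.”
-- Austin King, Senior Web Developer, Mozilla

“An indispensable resource for aspiring data scientists.”
-- Michael E. Driscoll, CEO/Founder, Dataspora

The author, Philipp K. Janert, currently provides consulting services for data analysis and mathematical modeling, after earlier careers as a physicist and software engineer. He is the author of “Gnuplot in Action: Understanding Data with Graphs” (Manning Publications) and has written for the O’Reilly Network, IBM developerWorks, and IEEE Software. He holds a Ph.D. in theoretical physics from the University of Washington.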

For readers with programming experience

Table of Contents

PREFACE
1 INTRODUCTION
Data Analysis
What’s in This Book
What’s with the Workshops?
What’s with the Math?
What You’ll Need
What’s Missing
PART I Graphics: Looking at Data
2 A SINGLE VARIABLE: SHAPE AND DISTRIBUTION
Dot and Jitter Plots
Histograms and Kernel Density Estimates
The Cumulative Distribution Function
Rank-Order Plots and Lift Charts
Only When Appropriate: Summary Statistics and Box Plots
Workshop: NumPy
Further Reading
3 TWO VARIABLES: ESTABLISHING RELATIONSHIPS
Scatter Plots
Conquering Noise: Smoothing
Logarithmic Plots
Banking
Linear Regression and All That
Showing What’s Important
Graphical Analysis and Presentation Graphics
Workshop: matplotlib
Further Reading
4 TIME AS A VARIABLE: TIME-SERIES ANALYSIS
Examples
The Task
Smoothing
Don’t Overlook the Obvious!
The Correlation Function
Optional: Filters and Convolutions
Workshop: scipy.signal
Further Reading
5 MORE THAN TWO VARIABLES: GRAPHICAL MULTIVARIATE ANALYSIS
False-Color Plots
A Lot at a Glance: Multiplots
Composition Problems
Novel Plot Types
Interactive Explorations
Workshop: Tools for Multivariate Graphics
Further Reading
6 INTERMEZZO: A DATA ANALYSIS SESSION
A Data Analysis Session
Workshop: gnuplot
Further Reading
PART II Analytics: Modeling Data
7 GUESSTIMATION AND THE BACK OF THE ENVELOPE
Principles of Guesstimation
How Good Are Those Numbers?
Optional: A Closer Look at Perturbation Theory and Error Propagation
Workshop: The Gnu Scientific Library (GSL)
Further Reading
8 MODELS FROM SCALING ARGUMENTS
Models
Arguments from Scale
Mean-Field Approximations
Common Time-Evolution Scenarios
Case Study: How Many Servers Are Best?
Why Modeling?
Workshop: Sage
Further Reading
9 ARGUMENTS FROM PROBABILITY MODELS
The Binomial Distribution and Bernoulli Trials
The Gaussian Distribution and the Central Limit Theorem
Power-Law Distributions and Non-Normal Statistics
Other Distributions
Optional: Case Study—Unique Visitors over Time
Workshop: Power-Law Distributions
Further Reading
10 WHAT YOU REALLY NEED TO KNOW ABOUT CLASSICAL STATISTICS
Genesis
Statistics Defined
Statistics Explained
Controlled Experiments Versus Observational Studies
Optional: Bayesian Statistics—The Other Point of View
Workshop: R
Further Reading
11 INTERMEZZO: MYTHBUSTING—BIGFOOT, LEAST SQUARES, AND ALL THAT
How to Average Averages
The Standard Deviation
Least Squares
Further Reading
PART III Computation: Mining Data
12 SIMULATIONS
A Warm-Up Question
Monte Carlo Simulations
Resampling Methods
Workshop: Discrete Event Simulations with SimPy
Further Reading
13 FINDING CLUSTERS
What Constitutes a Cluster?
Distance and Similarity Measures
Clustering Methods
Pre- and Postprocessing
Other Thoughts
A Special Case: Market Basket Analysis
A Word of Warning
Workshop: Pycluster and the C Clustering Library
Further Reading
14 SEEING THE FOREST FOR THE TREES: FINDING IMPORTANT ATTRIBUTES
Principal Component Analysis
Visual Techniques
Kohonen Maps
Workshop: PCA with R
Further Reading
15 INTERMEZZO: WHEN MORE IS DIFFERENT
A Horror Story
Some Suggestions
What About Map/Reduce?
Workshop: Generating Permutations
Further Reading
PART IV Applications: Using Data
16 REPORTING, BUSINESS INTELLIGENCE, AND DASHBOARDS
Business Intelligence
Corporate Metrics and Dashboards
Data Quality Issues
Workshop: Berkeley DB and SQLite
Further Reading
17 FINANCIAL CALCULATIONS AND MODELING
The Time Value of Money
Uncertainty in Planning and Opportunity Costs
Cost Concepts and Depreciation
Should You Care?
Is This All That Matters?
Workshop: The Newsvendor Problem
Further Reading
18 PREDICTIVE ANALYTICS
Introduction
Some Classification Terminology
Algorithms for Classification
The Process
The Secret Sauce
The Nature of Statistical Learning
Workshop: Two Do-It-Yourself Classifiers
Further Reading
19 EPILOGUE: FACTS ARE NOT REALITY
A PROGRAMMING ENVIRONMENTS FOR SCIENTIFIC COMPUTATION AND DATA ANALYSIS
Software Tools
A Catalog of Scientific Software
Writing Your Own
Further Reading
B RESULTS FROM CALCULUS
Common Functions
Calculus
Useful Tricks
Notation and Basic Math
Where to Go from Here
Further Reading
C WORKING WITH DATA
Sources for Data
Cleaning and Conditioning
Sampling
Data File Formats
The Care and Feeding of Your Data Zoo
Skills
Terminology
Further Reading
INDEX

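Title: 基于开源工具的数据分析（影印版） (Data Analysis with Open Source Tools, Reprint Edition)

Author: Philipp K. Janert

Publisher in China: Southeast University Press (东南大学出版社)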

ISBN: 978-7-5641-2674-2

Original title: Data Analysis with Open Source Tools

Original publisher: O'Reilly Media

Philipp K. Janert

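Philipp K. Janert received his Ph.D. in theoretical physics from the University of Washington in 1997 and has worked in technology ever since as a programmer, scientist, and applied mathematician. He is the author of “Data Analysis with Open Source Tools”, “Feedback Control for Computer Systems”, and “Gnuplot in Action” (2nd edition).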

After previous careers in physics and software development, Philipp K. Janert currently
provides consulting services for data analysis, algorithm development, and mathematical
modeling. He has worked for small start-ups and in large corporate environments, both in
the U.S. and overseas. He prefers simple solutions that work to complicated ones that
don’t, and thinks that purpose is more important than process. Philipp is the author of
“Gnuplot in Action: Understanding Data with Graphs” (Manning Publications), and has
written for the O’Reilly Network, IBM developerWorks, and IEEE Software. He is named
inventor on a handful of patents, and is an occasional contributor to CPAN. He holds a
Ph.D. in theoretical physics from the University of Washington. Visit his company website
at www.principal-value.com.

The animal on the cover of Data Analysis with Open Source Tools is a common kite, most
likely a member of the genus Milvus. Kites are medium-size raptors with long wings and
forked tails. They are noted for their elegant, soaring flight. They are also called “gledes”
(for their gliding motion) and, like the flying toys, they appear to ride effortlessly on air
currents.
The genus Milvus is a group of Old World kites, including three or four species and
numerous subspecies. These kites are opportunistic feeders that hunt small animals, such
as birds, fish, rodents, and earthworms, and also eat carrion, including sheep and cow
carcasses. They have been observed to steal prey from other birds. They may live 25 to
30 years in the wild.
The genus dates to prehistoric times; an Israeli Milvus pygmaeus specimen is thought to be
between 1.8 million and 780,000 years old. Biblical references to kites probably refer to
birds of this genus. In Coriolanus, Shakespeare calls Rome “the city of kites and crows,”
commenting on the birds’ prevalence in urban areas.
The most widespread member of the genus is the black kite (Milvus migrans), found in
Europe, Asia, Africa, and Australia. These kites are very common in many parts of their
habitat and are well adapted to city life. Attracted by smoke, they sometimes hunt by
capturing small animals fleeing from fires.
The other notable member of Milvus is the red kite (Milvus milvus), which is slightly larger
than the black kite and is distinguished by a rufous body and tail. Red kites are found only
in Europe. They were very common in Britain until 1800, but the population was devastated by poisoning and habitat loss, and by 1930, fewer than 20 birds remained.
Since then, kites have made a comeback in Wales and have been reintroduced elsewhere
in Britain.

Purchase Options

List price: RMB 82.00

Contact the publisher to order by mail