By Pete Warden
To help you navigate the vast number of new data tools available, this guide describes 60 of the most recent innovations, from NoSQL databases and MapReduce approaches to machine learning and visualization tools. Descriptions are based on first-hand experience with these tools in a production environment.
This handy glossary also includes a chapter of key terms that help define many of these tool categories:
- NoSQL Databases—Document-oriented databases using a key/value interface rather than SQL
- MapReduce—Tools that support distributed computing on large datasets
- Storage—Technologies for storing data in a distributed way
- Servers—Ways to rent computing power on remote machines
- Processing—Tools for extracting useful information from large datasets
- Natural Language Processing—Methods for extracting information from human-created text
- Machine Learning—Tools that automatically perform data analyses, based on the results of a one-off analysis
- Visualization—Applications that present meaningful data graphically
- Acquisition—Techniques for cleaning up messy public data sources
- Serialization—Methods to convert data structure or object state into a storable format
Best data modeling & design books
This guide illustrates what constitutes an advanced distributed information system, and how to design and implement one. The author presents the key components of an advanced distributed information system: a data management system supporting many classes of data; a distributed (networked) environment supporting LANs or WANs with multiple database servers; an advanced user interface.
This book offers a comprehensive overview of the various concepts and research issues surrounding blogs, or weblogs. It introduces techniques and approaches, tools and applications, and evaluation methodologies with examples and case studies. Blogs allow people to express their thoughts, voice their opinions, and share their experiences and ideas.
This book describes the mathematical background behind discrete approaches to morphological analysis of scalar fields, with a focus on Morse theory and on the discrete theories due to Banchoff and Forman. The algorithms and data structures presented are used for terrain modeling and analysis, molecular shape analysis, and for analysis or visualization of sensor and simulation 3D data sets.
Object-Role Modeling (ORM) is a fact-based approach to data modeling that expresses the information requirements of any business domain simply in terms of objects that play roles in relationships. All facts of interest are treated as instances of attribute-free structures known as fact types, where the relationship may be unary (e.
- Large-Scale Data Analytics, 1st Edition
- Mathematics and Computation in Music: First International Conference, MCM 2007, Berlin, Germany, May 18-20, 2007. Revised Selected Papers (Communications in Computer and Information Science)
- Database Modeling and Design, Third Edition, 3rd Edition
- Microsoft Dynamics® NAV 2009 - Business Intelligence for IT Professionals
Extra resources for Big Data Glossary
The command-line interface allows you to apply exactly the same code in an automated way for production.

Mahout

Mahout is an open source framework that can run common machine learning algorithms on massive datasets. To achieve that scalability, most of the code is written as parallelizable jobs on top of Hadoop. It comes with algorithms to perform a lot of common tasks, like clustering and classifying objects into groups, recommending items based on other users’ behaviors, and spotting attributes that occur together a lot.
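To make the clustering task concrete, here is a minimal k-means sketch in plain Python. This is not Mahout's API; it only illustrates the kind of iterative algorithm that Mahout rewrites as parallelizable Hadoop jobs so it can scale to massive datasets.

```python
import random

def kmeans(points, k, iterations=10, seed=42):
    """Cluster 2D (or n-dimensional) points into k groups."""
    random.seed(seed)
    # Pick k distinct starting centroids from the data.
    centroids = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(dim) / len(cluster)
                                     for dim in zip(*cluster))
    return centroids, clusters

# Two well-separated groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
centroids, clusters = kmeans(points, k=2)
```

In Mahout, the assignment step is what gets distributed: each mapper assigns a shard of the points to the current centroids, and reducers compute the new means.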
For example, it’s easy to spot and correct common problems like typos or inconsistencies in text values and to change cells from one format to another. There’s also rich support for linking data by calling APIs with the data contained in existing rows to augment the spreadsheet with information from external sources. Refine doesn’t let you do anything you can’t do with other tools, but its power comes from how well it supports a typical extract and transform workflow. It feels like a good step up in abstraction, packaging processes that would typically take multiple steps in a scripting language or spreadsheet package into single operations with sensible defaults.
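As a rough sketch of the kind of cleanup Refine packages into a single operation, the plain Python below normalizes inconsistent text values in one column. The data and the `normalize` helper are illustrative, not part of Refine's interface.

```python
# Rows with the same city spelled inconsistently (case, stray whitespace).
rows = [
    {"city": "New York"},
    {"city": "new york "},
    {"city": "NEW YORK"},
    {"city": "Boston"},
]

def normalize(value):
    # Collapse whitespace differences, then apply consistent title case.
    return " ".join(value.split()).title()

for row in rows:
    row["city"] = normalize(row["city"])

# All three "New York" variants now share one canonical value.
```

Doing this by hand in a spreadsheet means hunting down each variant; in a script it means writing and testing the normalization yourself. Refine's value is surfacing these clusters of near-duplicate values and fixing them in one step.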
If you’re representing a list of objects mapping keys to values, the most intuitive way would be to use an indexed array of associative arrays. This means that the string for each key is stored inside each object, which involves a large number of duplicated strings when the number of unique keys is small compared to the number of values. There are manual ways around this, of course, especially as the textual representations usually compress well, but many of the other serialization approaches I’ll talk about try to combine the flexibility of JSON with a storage mechanism that’s more space efficient.
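The duplication is easy to demonstrate. The sketch below serializes the same table twice with Python's standard `json` module: once as a list of objects (every key string repeated per row) and once in a columnar layout where each key appears only once. The data is made up for illustration.

```python
import json

keys = ["id", "name", "score"]
# 1,000 rows sharing the same three keys.
rows = [{"id": i, "name": "user%d" % i, "score": i * 2} for i in range(1000)]

# Row-oriented: the key strings are duplicated in every object.
row_json = json.dumps(rows)

# Column-oriented: each key stored once, with its values grouped together.
columns = {k: [row[k] for row in rows] for k in keys}
col_json = json.dumps(columns)

print(len(row_json), len(col_json))  # the columnar text is noticeably smaller
```

The columnar form is one of the manual workarounds the passage mentions; it trades the row-by-row readability of JSON for less repetition, which is roughly the trade-off the more space-efficient serialization formats automate.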