Big Data Glossary: The Ultimate List Of All Big Data & Analytics Terms

Big Data lives up to its name. With billions of bytes of data being collected, understanding the intricacies of Big Data has become a necessity. However, Big Data is often hard to follow because of its highly technical jargon. Not everyone is a big data practitioner or corporate executive, and the common user is often lost among these complex terms. Let’s take a dive into the Big Data Glossary.

We have compiled a list of Big Data terms and definitions to serve as a guide for beginners. The list covers an extensive range of terminology, from the basics to the advanced, and should help you get a clear understanding of Big Data terms.

Big Data and Analytics Glossary from A-D

A

Analytics: Analytics refers to the process of drawing conclusions from raw data. It helps us sort meaningful data out of the mass of data.

Automatic Identification and Data Capture (AIDC): Automatic Identification and Data Capture (AIDC) refers to a broad set of technologies used to glean data from an object, image, or sound without manual intervention.

Algorithm: An algorithm is a mathematical formula or set of step-by-step instructions implemented in software that performs an analysis on a set of data.

Artificial Intelligence: Artificial Intelligence refers to the process of developing intelligent machines and software that can perceive their environment, take the corresponding action as and when required, and even learn from those actions.

ACID Test: ACID Test stands for atomicity, consistency, isolation, and durability (ACID) test of data. These four attributes are the benchmarks for ensuring the validity of a data transaction.
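Atomicity, the first of the four attributes, can be illustrated with a short sketch. It assumes Python's built-in sqlite3 module and a hypothetical accounts table with a rule that balances never go negative; the transfer either applies both updates or neither.

```python
import sqlite3

# Hypothetical two-account ledger held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move amount from src to dst; roll back both updates if either fails."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # both updates are committed
failed = transfer(conn, "alice", "bob", 500)  # CHECK fails, so the credit to bob is undone
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

After the failed transfer, the balances are the same as they were after the first, successful one.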

Apache Avro: Apache Avro is a row-oriented object container storage format for Hadoop as well as a remote procedure call and data serialization framework. Avro is optimized for write operations and includes a wire format for communication between nodes. Because the data definition is stored together with the serialized data, translation between different nodes is simpler. Avro uses JSON (JavaScript Object Notation) to define protocols and data types, and serializes data into a compact binary format.

B

Batch Processing: Batch processing is a standard computing strategy that involves processing data in large sets. This practice becomes imperative for non-time sensitive work that operates on very large datasets. The process is scheduled and at a later time, the results are retrieved by the system.

Big Data: Big Data is an umbrella term used for huge volumes of heterogeneous datasets that cannot be processed by traditional computers or tools due to their varying volume, velocity, and variety.

Biometrics: Biometrics implies using analytics and technology in identifying people by one or many of their physical characteristics, such as fingerprint recognition, facial recognition, iris recognition, etc. It is most commonly used in modern smartphones.

Business Intelligence: Business Intelligence is the general term used for the identification, extraction, and analysis of data.

C

Call Detail Record (CDR) analysis: CDRs include data that a telecommunications company collects about phone calls. This may include call duration and time when the call was made or received. This data is used in any number of analytical applications.

Cascading: Cascading refers to a higher level of abstraction for Hadoop that allows developers to create complex jobs quickly and easily, in several different languages that run on the JVM, including Ruby, Scala, and more.

Cassandra: Cassandra is a popular open-source database management system managed by The Apache Software Foundation. Cassandra was designed to handle large volumes of data across distributed servers.

Clickstream Analytics: It refers to the analysis of users’ online activity based on the items that users click on a web page.

Clojure: Clojure is a functional programming language, a dialect of Lisp, that runs on the JVM (Java Virtual Machine). Clojure is particularly suitable for parallel data processing.

Cluster computing: Clustered computing is the practice of pooling the resources of multiple machines and managing their collective capabilities to complete tasks. Computer clusters require a cluster management layer which handles communication between the individual nodes and coordinates work assignment.

D

Database-as-a-Service: Database-as-a-Service refers to a database hosted in the cloud on a pay per use basis. For example, Amazon Web Services.
Data cleansing: It refers to the process of reviewing and revising data in order to delete duplicates, correct errors and provide consistency.
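As a minimal sketch of data cleansing, assuming hypothetical customer records keyed by email address, the function below corrects formatting, makes values consistent, and deletes duplicates:

```python
def cleanse(records):
    """Trim whitespace, normalize case, and drop duplicate email addresses."""
    seen = set()
    cleaned = []
    for rec in records:
        email = rec["email"].strip().lower()
        if email in seen:
            continue  # duplicate of a record already kept
        seen.add(email)
        cleaned.append({"name": rec["name"].strip().title(), "email": email})
    return cleaned

# Hypothetical raw input with inconsistent case, stray spaces, and a duplicate.
raw = [
    {"name": " alice smith", "email": "Alice@Example.com "},
    {"name": "Alice Smith", "email": "alice@example.com"},
    {"name": "bob JONES", "email": "bob@example.com"},
]
cleaned = cleanse(raw)
```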
Dark Data: Dark Data refers to all the data that is gathered and processed by enterprises but not used for any meaningful purpose. It is called dark because it is unused and unexplored. This includes social network feeds, call center logs, meeting notes, etc.

Data Lake: The term Data Lake refers to a storage repository that can hold a huge amount of raw data in its original format. A Data Lake uses a flat architecture to store data, unlike a hierarchical data warehouse, which stores data in files or folders. Each data element in a Data Lake is assigned a unique identifier and tagged with a set of extended metadata tags. The Data Lake can be queried for relevant data, and a smaller set of data can then be analyzed to answer any relevant business question.

Data Mining: Data mining is a broad term used for finding meaningful patterns in large data sets. The purpose is to refine a huge mass of data into a more comprehensible and cohesive set of information.

Data Modelling: Data Modelling is defined as the analysis of data objects using data modeling techniques to create insights from the data.

Data Science: Data Science is a broad subject that incorporates statistics, data visualization, computer programming, data mining, machine learning, and database engineering to solve complex problems.

Data Virtualization: It refers to a data integration approach that gives a unified view of data from many sources without physically moving it, in order to gain more insights. Usually, the sources include databases, applications, file systems, websites, big data stores, etc.

Big Data and Analytics Glossary from E-H

E

Econometrics: Econometrics is the application of statistical and mathematical theories in economics for testing hypotheses and forecasting future trends. Econometrics takes economic models, tests them through statistical trials, and then compares the results against real-life examples. It can be subdivided into two major categories: theoretical and applied.

ETL: ETL is the acronym for extract, transform, and load. It refers to the process of ‘extracting’ raw data, ‘transforming’ by cleaning/enriching the data for ‘fit for use’ and ‘loading’ into the appropriate repository for future use.
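The three stages can be sketched in a few lines. This is only an illustration using Python's standard csv and sqlite3 modules; the sample data, the payments table, and the function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = "name,amount\n alice ,10.5\nBOB,n/a\ncarol,7\n"

def extract(text):
    """Extract: read raw rows from CSV text."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: normalize names and drop rows whose amount is not numeric."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        clean.append((row["name"].strip().lower(), amount))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT name, amount FROM payments ORDER BY name").fetchall()
```

The row with the non-numeric amount is dropped during the transform stage, so only two rows reach the repository.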

Exabytes: An exabyte (abbreviated as EB) is a large unit of computer data storage equal to ten to the eighteenth power (10^18) bytes. The prefix exa- means one billion billion, or one quintillion, which is a decimal term. One exabyte is equal to 1,000 petabytes and precedes the zettabyte unit of measurement. Exabytes are slightly smaller than exbibytes, which contain 1,152,921,504,606,846,976 (2^60) bytes.
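The arithmetic is easy to check. The sketch below takes the decimal definition of the exabyte (10^18 bytes) and the binary definition of the exbibyte (2^60 bytes); the constant names are illustrative only.

```python
EXABYTE = 10 ** 18    # decimal unit (EB)
EXBIBYTE = 2 ** 60    # binary unit (EiB)
PETABYTE = 10 ** 15   # decimal unit (PB)

petabytes_per_exabyte = EXABYTE // PETABYTE
ratio = EXBIBYTE / EXABYTE  # how much larger an exbibyte is than an exabyte

print(petabytes_per_exabyte)  # 1000
print(f"{EXBIBYTE:,}")        # 1,152,921,504,606,846,976
print(round(ratio, 3))        # 1.153
```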
Exploratory Analysis: Exploratory Data Analysis (EDA) is an approach to analyzing data. EDA is often the first step in data analysis, implemented before any formal statistical techniques are applied. Exploratory data analysis is a complement to inferential statistics, which tends to be rigid with rules and formulas. EDA involves the analyst trying to get a “feel” for the data set, often using their own judgment to determine what the most important elements in the data set are.

F

Failover: Failover is defined as the constant capability to automatically and seamlessly switch to a highly reliable backup. This can be operated in a redundant manner or in a standby operational mode upon the failure of a primary server, application, system or another primary system component. The main purpose of Failover is to eliminate, or at least reduce, the impact on users when a system failure occurs.
Fault-tolerant design: It is a system designed to continue working even if certain parts fail.
Feature: Feature is the machine learning expression for a piece of measurable information about something. For example, height, length, and breadth of a solid object. Other terms like property, attribute or characteristic are also used instead of a feature.

Feature Engineering: Feature Engineering is the process of creating new input features for machine learning. It is one of the most effective ways to improve predictive models. Feature Engineering allows you to isolate key information, highlight patterns, and bring in domain expertise.

Feature Reduction: Feature Reduction is the process of reducing the number of features used in a computation-intensive task without losing much information. Principal component analysis (PCA) is one of the most popular Feature Reduction techniques.
Feature Selection: Feature Selection is the process of selecting relevant features for explaining the predictive power of a statistical model.
Flume: Apache Flume or Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming data into the Hadoop Distributed File System (HDFS). It has a simple and flexible architecture based on streaming data flows, and is robust and fault-tolerant with tunable reliability mechanisms for failover and recovery.

Frequentist Statistics: Frequentist Statistics is a procedure to test the probability or improbability of an event (hypothesis). It calculates the probability of an event in the long run of an experiment (i.e. the experiment is repeated under the same conditions to obtain the outcome). Here, sampling distributions of a fixed size are taken. The experiment is theoretically repeated an infinite number of times, but in practice it is done with a stopping intention. For example, a person may decide to stop testing an application after 125 test cases have been performed.

F-Score: The F-score is an evaluation metric that combines precision and recall into a single measure of classification effectiveness. The parameter β sets the relative weight of the two, with recall treated as β times as important as precision. With β = 1, the F-score is the harmonic mean of precision and recall.

F measure = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall)
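A small sketch of the calculation, using the standard definition F = (1 + β²) × Precision × Recall / (β² × Precision + Recall). The function name and the sample values are illustrative only.

```python
def f_beta(precision, recall, beta=1.0):
    """Return the F-beta score; beta > 1 weights recall more heavily."""
    if precision == 0 and recall == 0:
        return 0.0  # avoid dividing by zero when both inputs are zero
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(0.5, 1.0)            # harmonic mean of precision and recall
f2 = f_beta(0.5, 1.0, beta=2.0)  # recall counts more, so the score rises
```

With a precision of 0.5 and a recall of 1.0, the F1 score is about 0.667 and the F2 score is about 0.833.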

G

Gamification: Gamification is the application of game elements and digital game design techniques to non-game problems, such as business and social impact challenges. Gamification takes the data-driven techniques that game designers use to engage players and applies them to non-game experiences to motivate actions that add value to your business.
Graph Analytics: It is a way to organize and visualize relationships between different data points in a set.

Grid Computing: Grid Computing refers to performing computing functions with resources from several distributed systems. It usually involves large files and is often used for various applications. Systems containing a grid computing network are not required to be alike in design or be from the same location.

H

HANA: HANA refers to a hardware/software in-memory computing tool from SAP which is intended for high-volume transactions and analytics in real-time.

Hadoop: Hadoop is an open-source, Java-based programming framework that supports the processing and storage of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation. With Hadoop, you can run applications on systems with several commodity hardware nodes and handle thousands of terabytes of data. Its distributed file system ensures rapid data transfer rates among nodes and allows the system to continue operating in case of a node failure. This approach lowers the risk of catastrophic system failure and unexpected data loss, even if a significant number of nodes become inoperative. Hadoop was created by Doug Cutting and Mike Cafarella in 2006 to support distribution for the Nutch search engine.
Hive: Hive is a Hadoop-based data warehousing framework originally developed by Facebook. It allows users to write queries in a SQL-like language called HiveQL, which are then converted to MapReduce jobs. This allows SQL programmers with no MapReduce experience to use the warehouse and makes it easier to integrate with business intelligence and visualization tools such as MicroStrategy, Tableau, Revolution Analytics, etc.

Big Data and Analytics Glossary from I-L

I

Imputation: Imputation is the technique used for handling missing values in the data. It is performed with statistical measures, as in mean/mode imputation, or with machine learning techniques such as kNN imputation.
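Mean imputation, the simplest of these approaches, can be sketched as follows. The example assumes missing values are represented by None; the function name and data are illustrative.

```python
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

ages = [20, None, 30, 40, None]  # hypothetical column with two missing values
imputed = impute_mean(ages)      # both gaps are filled with 30
```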
In-memory: In-memory refers to a database management system that stores data in main memory instead of on disk, for faster processing, storage, and loading of the data.
In-memory Computing: In-memory computing is the storage of information in the main random-access memory (RAM) of dedicated servers rather than in complicated relational databases operating on comparatively slow disk drives. In-memory computing helps business customers, including retailers, banks, and utilities, to quickly detect patterns, analyze massive data volumes on the fly, and perform their operations quickly. The drop in memory prices in the present market has led to increased popularity of in-memory computing technology. This has made in-memory computing economical for a wide variety of applications.
In-memory data grid (IMDG): In-memory data grid (IMDG) is a data structure that resides entirely in RAM (random access memory) and is distributed among multiple servers. Recent advances in 64-bit and multi-core systems have made it possible to store terabytes of data completely in RAM, obviating the need for electromechanical mass storage media such as hard disks.
Independent Variable: An Independent Variable is a variable that is manipulated to determine the value of a dependent variable. The dependent variable is what is being measured in an experiment or evaluated in a mathematical equation, and the independent variables are the inputs to that measurement.

Inferential Statistics: Inferential Statistics refers to mathematical methods that employ probability theory for deducing (inferring) the properties of a population from the analysis of the properties of a data sample drawn from it. Inferential Statistics also deals with the precision and reliability of the inferences it helps to draw.

IoT: The Internet of Things (IoT) is a system of interrelated computing devices, mechanical and digital machines, objects, animals, or people that are provided with unique identifiers and the ability to transfer data over a network without requiring human-to-human or human-to-computer interaction.
IQR: The interquartile range (IQR) is a measure of variability, based on dividing a data set into quartiles. Quartiles divide a rank-ordered data set into four equal parts. The values that divide each part are called the first, second, and third quartiles, and they are denoted by Q1, Q2, and Q3, respectively. The IQR is the difference between the third and first quartiles, Q3 − Q1.
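The IQR is the distance between the third and first quartiles (Q3 − Q1). A minimal sketch with Python's statistics module follows; note that several conventions exist for computing quartiles, and the "inclusive" method chosen here is only one of them, so other tools may report slightly different values.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8]  # illustrative rank-ordered data set

# n=4 asks for the three cut points that split the data into quartiles.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
```

For this data set the method gives Q1 = 2.75, Q2 = 4.5, Q3 = 6.25, and an IQR of 3.5.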

J

Juridical Data Compliance: In legal terminology, the word “Compliance” refers to the act of adherence to the law of the land. In business terms, in case of any organization, compliance implies strict adherence to the laws, regulations, guidelines, and specifications that are relevant to the life cycle of a business entity. Juridical data compliance is commonly used in the context of cloud-based solutions, where the data is stored in a different country or continent. Data storage in a server or data center located in a foreign country must abide by the data security laws of the nation.

K

Kafka: Kafka, originally developed at LinkedIn, is a distributed publish-subscribe messaging system. It provides a solution capable of handling all the data flow activity of a consumer website and processing that data. It is an essential element of the current social web.

Key-Value Databases: A key-value database, also known as a key-value store, is the most flexible type of NoSQL database. Key-value databases have emerged as an alternative that avoids many of the limitations of traditional relational databases, where data is structured in tables and the schema must be predefined. In a key-value store, there is no schema and the value of the data is opaque. Values are identified and accessed via a key, and stored values can be numbers, strings, counters, JSON, XML, HTML, binaries, images, short videos, and more. It is the most flexible NoSQL model because the application has complete control over what is stored in the value.

K-Means: K-means clustering is a type of unsupervised learning, used for segregating unlabeled data (i.e., data without defined categories or groups). The purpose is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided.
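The iterative assignment described above can be sketched in plain Python. To keep the example short it works on one-dimensional points and takes the starting centroids as an argument, both of which are simplifying assumptions; the function name is illustrative.

```python
def k_means_1d(points, centroids, iterations=10):
    """Cluster 1-D points by alternating assignment and centroid update steps."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = k_means_1d(points, centroids=[0.0, 5.0])  # K = 2
```

Here the two centroids settle at 2.0 and 11.0, the means of the two obvious groups.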
K-nearest neighbors: K-nearest neighbors (also known as kNN) is a simple algorithm that stores all available cases and classifies new cases based on a similarity measure (e.g., distance functions). kNN has been used in statistical estimation and pattern recognition as a non-parametric technique since the beginning of the 1970s.
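A minimal sketch of kNN classification, assuming Euclidean distance as the similarity measure and a simple majority vote. The training points and labels are made up for illustration.

```python
from collections import Counter
from math import dist

def knn_classify(training, query, k=3):
    """Label query by majority vote among its k nearest training points."""
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Stored cases: (point, label) pairs.
training = [
    ((1, 1), "small"), ((1, 2), "small"), ((2, 1), "small"),
    ((8, 8), "large"), ((9, 8), "large"), ((8, 9), "large"),
]
label = knn_classify(training, (2, 2))
```

The query point (2, 2) is closest to the three "small" cases, so it is labeled "small".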

L

Latency: Latency refers to delays in transmitting or processing data. Latency is of two types: network latency and disk latency.
Legacy system: Legacy system refers to outdated computer systems, programming languages or application software that are used instead of available upgraded versions. Legacy systems are also associated with terminology or processes that are no longer applicable to current contexts or content, thus creating confusion.
Linear Regression: Linear Regression refers to a kind of statistical analysis that attempts to show a relationship between two variables. Linear regression looks at various data points and plots a trend line. Linear regression can create a predictive model of apparently random data, showing trends in data, such as in cancer diagnoses or in stock prices.
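Fitting the trend line can be sketched with ordinary least squares, the most common fitting method (assumed here, since the entry does not name one). The function name and data are illustrative.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares trend line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # points that lie exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)
predicted = slope * 5 + intercept  # use the fitted line to predict y at x = 5
```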
Load Balancing: Load balancing refers to the efficient distribution of incoming network traffic across a group of backend servers (also known as a server farm or server pool) to get more work done in the same amount of time. Load balancing can be implemented with hardware, software, or a combination of both. Typically, load balancing is the main reason for computer server clustering.
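One simple distribution strategy is round robin, where requests are handed to each server in turn. The sketch below assumes that strategy; real load balancers often also weigh server health and current load, and the class and server names are hypothetical.

```python
from itertools import cycle

class RoundRobinBalancer:
    """Hand out backend servers in rotation so requests spread evenly."""

    def __init__(self, servers):
        self._servers = cycle(servers)

    def pick(self):
        """Return the server that should handle the next request."""
        return next(self._servers)

balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])  # hypothetical server pool
assignments = [balancer.pick() for _ in range(6)]
```

Six incoming requests are spread as two per server.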
Location Data: Location data refers to the information collected by a network or service about where a user’s phone or device is located.
Logfile: A log file is defined as a file that maintains a registry of events, processes, messages, and communication between various communicating software applications and the operating system. Log files are present in the executable software, operating systems, and programs whereby all the messages and process details are recorded.
Logarithm: A logarithm is the exponent to which a fixed base must be raised to produce a given number. Logarithms are used in mathematical calculations to depict the perceived levels of variable quantities such as visible light energy, electromagnetic field strength, and sound intensity.
Logistic Regression: Logistic regression refers to a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).

Big Data and Analytics Glossary from M-P

M

Machine learning: Machine Learning is the study and practice of designing systems that can learn, adjust, and improve based on the data fed. This typically involves the implementation of predictive and statistical algorithms that can continually zero in on “correct” behavior and insights as more data flows through the system.

MongoDB: MongoDB refers to a cross-platform, open-source database that uses a document-oriented data model, rather than a traditional table-based relational database structure. It is designed to make the integration of structured and unstructured data in certain types of applications easier and faster.

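MapReduce: MapReduce refers to a programming model, and the tools that implement it, for distributed computing on large datasets. A map step processes pieces of the input in parallel, and a reduce step combines the intermediate results.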
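MapReduce is commonly described as a map step that emits key-value pairs, followed by a reduce step that combines the values for each key. The word count below sketches that model in a single Python process; it is not the Hadoop API, and the function names are illustrative.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: combine each key's values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

documents = ["big data big insights", "data lake"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```

In a real cluster, each document's map call could run on a different node, which is what makes the model suitable for large datasets.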

Mashup: Mashup refers to the method of merging different datasets into a single application to improve output. For instance, combining job listings with demographic data.

Mahout: Mahout refers to a library for data mining. It implements the most popular data mining algorithms for regression, clustering, and modeling on top of the MapReduce model.

Metadata: Metadata may be defined as data that provides context or additional information about other data. For example, information about the title, subject, author, typeface, enhancements, and size of the data file of a document constitutes metadata about that document. It may also describe the conditions under which the data stored in a database was acquired, its accuracy, date, time, method of compilation and processing, etc.
Munging: Munging refers to the process of manually converting or mapping data from one raw form into another format for more convenient consumption.

N

Natural Language Processing: Natural Language Processing refers to the standard methods for gleaning facts from human-created text.

Normalization: Normalization is the process of reorganizing data in a database so that it meets two basic requirements: (1) There is no redundancy of data (all data is stored in only one place), and (2) data dependencies are logical (all related data items are stored together). Normalization is important for many reasons, but chiefly because it allows databases to take up as little disk space as possible, resulting in increased performance. Normalization is also known as data normalization.
NoSQL: NoSQL is a broad term that refers to databases designed outside of the traditional relational model. NoSQL databases make different trade-offs from relational databases, and they are well-suited for big data systems because of their flexibility and frequent distributed-first architecture.

O

Object Databases: Object Databases store data in the form of objects, as used by object-oriented programming. They are different from relational or graph databases and most of them offer a query language that allows the object to be found with a declarative programming approach.
Object-based Image Analysis: Traditional digital image analysis uses data from individual pixels, whereas object-based image analysis uses data from a selection of related pixels, called objects or image objects.
Operational Databases: Operational Databases carry out the regular operations of an organization and are generally very important to a business. They generally use online transaction processing, which allows them to enter, collect, and retrieve specific information about the company.
Optimization Analysis: It refers to the process of optimization during the design cycle of products done by algorithms. It allows companies to virtually design many different variations of a product and to test that product against pre-set variables.
Oozie: Oozie refers to a workflow processing system which allows its users to define a series of jobs which can be written in several languages like Pig, MapReduce, and Hive. It then intelligently links them to each other. It permits users to state, for instance, that a query is to be started only after the previous jobs on which it depends for data have completed.

P

Parse: Parse refers to the division of data, such as a string, into smaller parts for analysis.

Pattern Recognition: Pattern recognition takes place when an algorithm locates recurrences or regularities within large data sets or across disparate data sets. Pattern recognition is closely related to machine learning and data mining.

Persistent storage: Persistent storage refers to a non-volatile location, such as a disk, where data is saved after the process that created it has ended.
Python: Python is a general-purpose programming language that emphasizes code readability to allow programmers to use fewer lines of code to express their concepts.

Big Data and Analytics Glossary from Q-T

Q

Quad-Core Processor: A quad-core processor is defined as a multiprocessor architecture that is designed to provide faster processing power. It is a successor to the dual-core processor, which has two processor cores. Quad-core processors integrate two dual-core processors into a single processor. The two separate dual cores communicate with each other using processor cache. A quad-core processor can execute multiple instructions simultaneously, meaning that each core can be dedicated to separate instruction.

Query: A query is a request for data or information from a database table or combination of tables. This data may be generated as results returned by Structured Query Language (SQL) or as pictorials, graphs, or complex results, e.g., trend analyses from data-mining tools. SQL is the most well-known and widely-used query language.

Quick Response Code: A quick response code (QR code) is defined as a type of two-dimensional barcode that consists of square black modules on a white background. QR codes are designed to be read by smartphones. Because they can carry information both vertically and horizontally, they can provide a vast amount of information, including links, text, or other data.

Query Analysis: Query Analysis is a process used in databases which make use of SQL to determine how to further optimize queries for performance.

Quantum Bit (Qubit): A quantum bit (qubit) is defined as the smallest unit of quantum information, which is the quantum analog of the regular computer bit, used in the field of quantum computing. A quantum bit can exist in superposition, which means that it can exist in multiple states at once. Compared to a regular bit, which can exist in one of two states, 1 or 0, the quantum bit can exist as a 1, 0 or 1 and 0 at the same time. This allows for very fast computing and the ability to do multitudes of calculations at once, theoretically.
Quantum Computing: Quantum computing is defined as a theoretical computing model that uses a very different form of data handling to perform calculations. The emergence of quantum computing is based on a new kind of data unit that could be called non-binary, as it has more than two possible values.

R

R: R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language. R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques, and is highly extensible.

Readme: Readme refers to a file that is attached to a software program that contains critical or important information about that program. The readme file is, in a sense, a manual or instructional resource for that program.

R-Commerce: R-Commerce or Relationship e-commerce is a form of electronic commerce that focuses upon business-to-consumer (B2C) and peer-to-peer (P2P) interaction. In r-commerce, the focus is shifted from product sales toward customer relationship building. The concept of r-commerce stems from the relationships a consumer has with other consumers and with a business by purchasing its online merchandise or services. The customer’s positive experience becomes evident when they relay product information to other customers, pointing out its good qualities or benefits and urging further purchases.

Refactoring: Refactoring refers to the process of altering an application’s source code without changing its external behavior. Code refactoring improves the nonfunctional properties of the code, such as readability, complexity, maintainability, and extensibility.

Relational Data Model: A Relational Data Model involves the use of data tables that collect groups of elements into relations. These models are based on the idea that each table setup will include a primary key or identifier. Other tables use that identifier to provide “relational” data links and results. Database administrators use Structured Query Language (SQL) to retrieve data elements from a relational database.
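The way one table uses another table's identifier can be sketched with Python's built-in sqlite3 module. The customers and orders tables here are hypothetical; orders.customer_id refers to the primary key of customers, and a SQL join follows that link.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: each order points at a customer through customer_id.
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    total REAL
);
INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);
""")

# Join the two relations on the shared identifier and total each customer's orders.
totals = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customers AS c JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
```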

Regression Testing: It refers to a type of software testing used to determine whether new problems are the result of software changes. A program is tested prior to a change. The same program is re-tested after the change is implemented to determine whether any new bugs or issues are created. Regression Testing is also required to check if the actual change achieved its intended purpose.

Reseller Hosting: Reseller hosting is a business development model provided by a Web hosting service/provider. One or more organizations make use of Reseller hosting to lease out Webspace that is packaged, rebranded, and sold under their brand name. Depending on the primary hosting service provider, each reseller host may be provided with a completely white/private labeled control panel to manage their leased space and their customers.

S

Spark: Apache Spark is a fast, in-memory data processing engine to efficiently execute streaming, machine learning or SQL workloads that require fast iterative access to datasets. Spark is generally a lot faster than MapReduce.

Spatial Analysis: It refers to analyzing spatial data such as geographic data or topological data to identify and understand patterns and regularities within data distributed in geographic space.

Software as a service (SaaS): SaaS refers to an application software that is used over the web by a thin client or web browser. For example

Serialization: Serialization refers to the standard procedures for converting data structure or object state into standard storage formats.
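As a small sketch, the example below uses JSON as the storage format (one common choice among many) and Python's standard json module. The record is hypothetical.

```python
import json

record = {"user": "alice", "visits": 3, "tags": ["new", "mobile"]}  # hypothetical object

payload = json.dumps(record, sort_keys=True)  # object state -> text that can be stored or sent
restored = json.loads(payload)                # text -> an equivalent object
```

The restored object compares equal to the original, which is the property a serialization format needs to provide.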

Sqoop: Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside of Hadoop. Accordingly, instructions are sent to Sqoop to move data from Oracle, Teradata or other relational databases to the target.

Storm: Storm is a free and open-source real-time distributed computing system. Storm makes processing of unstructured data easier and faster with instantaneous processing. It does for real-time processing what Hadoop does for batch processing.

Stream Processing: Stream Processing is the standard practice of computing over individual data items as they move through a system. This allows for real-time analysis of the data being fed into the system. It is useful for time-sensitive operations using high-velocity
Structured Query Language (SQL): SQL refers to a programming language designed specifically to manage and retrieve data from a relational database system.

T

Tag: A tag is a piece of information that describes the data or content that it is assigned to. Tags are nonhierarchical keywords used for Internet bookmarks, digital images, videos, files and so on.

Talking Trojan: Talking Trojan is a kind of Trojan virus introduced in 2007 that replays an audio message while it deletes the contents of a hard drive or otherwise attacks a system. This is a type of Trojan program, a virus that looks legitimate but attacks the user system when it is run.

Taxonomy: Taxonomy refers to the classification of data according to a pre-determined system with the resulting catalog. It provides a conceptual framework for easy access and retrieval.

TensorFlow: TensorFlow is a free software library focused on machine learning, created by Google and released under the Apache 2.0 open-source license. TensorFlow was originally developed by engineers and researchers of the Google Brain Team, mainly for internal use.

Transactional Data: This refers to data that changes unpredictably. Examples include accounts payable and receivable data or data about product shipments.

Thrift: Thrift is a software framework for scalable cross-language services development. Thrift combines a software stack with a code generation engine to build services that work efficiently and seamlessly between Java, C++, Python, Ruby, PHP, Erlang, Haskell, Perl, and C#.

Big Data and Analytics Glossary from U-Z

U

Unique Constraint: A unique constraint refers to a type of column restriction within a table, which dictates that all values in that column must be unique, though they may be null. To ensure that a column is UNIQUE and cannot contain null values, the column must also be specified as NOT NULL. These are the two main attributes of a primary key, so a column defined with both is a strong candidate for the primary key designation.
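
As a rough illustration, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `users` table, its columns, and the function name `demo_unique_constraint` are hypothetical. Note that the handling of nulls in a UNIQUE column varies between database systems; SQLite, used here, accepts several null values.

```python
import sqlite3
from typing import Tuple


def demo_unique_constraint() -> Tuple[bool, int]:
    """Return (was the duplicate rejected?, number of rows stored)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table: email must be unique but may be null.
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
    conn.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))
    # SQLite allows more than one NULL in a UNIQUE column.
    conn.execute("INSERT INTO users (email) VALUES (NULL)")
    conn.execute("INSERT INTO users (email) VALUES (NULL)")

    try:
        # A second row with the same non-null value violates the constraint.
        conn.execute("INSERT INTO users (email) VALUES (?)", ("a@example.com",))
        duplicate_rejected = False
    except sqlite3.IntegrityError:
        duplicate_rejected = True

    row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    conn.close()
    return duplicate_rejected, row_count


if __name__ == "__main__":
    print(demo_unique_constraint())
```

Declaring the column as `TEXT UNIQUE NOT NULL` instead would also reject the null inserts.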

Unique Visitor: Unique visitor refers to a person who visits a site at least once during the reporting period. A unique visitor is also known as a unique user.

Unstructured Data: Unstructured data is data that either does not have a pre-defined data model or is not organized in a pre-defined manner.

Unstructured Data Analysis: Unstructured data analysis refers to the process of analyzing data objects that do not follow a predefined data model or architecture, or that are otherwise unorganized. It refers to the analysis of any data that is stored over time within an organizational data repository without any intent for its orchestration, pattern, or categorization.

Unstructured Data Mining: Unstructured data mining refers to the practice of looking at relatively unstructured data and trying to get more refined data sets out of it. It often consists of extracting data from sources not traditionally used for data mining activities.

V

Vector Graphics Rendering: Vector graphics rendering is the process of generating models from geometrical primitives such as lines, points, curves, and shapes to represent images in computer graphics.

Vector Markup Language (VML): Vector Markup Language (VML) refers to an application of XML 1.0 that defines the encoding of vector graphics in HTML. It was submitted to the W3C in 1998 but never gained traction. Instead, a working group at the W3C created Scalable Vector Graphics (SVG), which became a W3C Recommendation in 2001.

Vertical Scalability: Vertical scalability is defined as the addition of resources to a single system node, such as a single computer or network station, typically in the form of additional CPUs or memory. Vertical scalability provides more shared resources for the operating system and applications.

View-Based Conversions: View-based conversions are the result of a tracking method employed by Google AdWords.

Virtual Address Extension (VAX): A Virtual Address Extension (VAX) was a midrange server computer developed in the late 1970s by Digital Equipment Corporation (DEC). The VAX was introduced as mainframe computers were being developed. The VAX computer had a 32-bit processor and a virtual memory setup.

Visualization: Visualization refers to applications that are used for graphical representation of meaningful data.

W

WannaCry: WannaCry is a kind of ransomware attack that developed in the spring of 2017 and brought the idea of ransomware threats further into the mainstream. This global attack disabled many systems, including public-service systems such as those supporting hospitals and law-enforcement officers. Experts classified WannaCry as a cryptoworm.

Web Application Security: Web application security refers to the process of securing confidential data stored online from unauthorized access and modification.

Wearable Robot: A wearable robot is a specific type of wearable device that is used to enhance a person’s motion and/or physical abilities. Wearable robots are also known as bionic robots or exoskeletons.

Whiteboarding: Whiteboarding refers to the manipulation of digital files on a visual digital whiteboard. It is used for different kinds of collaborative projects and represents a useful form of data visualization in which a sequential series of files can be shown on a screen as a visual object-based model.

Wireless Access Point: A wireless access point (WAP) refers to a hardware device or configured node on a local area network (LAN) that allows wireless-capable devices and wired networks to connect through a wireless standard, including Wi-Fi or Bluetooth. WAPs feature radio transmitters and antennae, which facilitate connectivity between devices and the Internet or a network.

X

Xanadu: Xanadu refers to a hypertext/hypermedia project first conceptualized by Ted Nelson.

X Client: X client refers to an application program that is displayed on an X server, although the application program is otherwise separate from that server. All application programs that run in a GUI delivered by the X Window System, which underlies virtually any GUI employed on Linux and other Unix-like operating systems, are considered X clients. Therefore Apache, OpenOffice, gFTP, gedit, GIMP, Xpdf, and rCalc are typically X clients if employed on such operating systems.

XML Databases: XML databases allow data to be stored in XML format. XML databases are often associated with document-oriented databases. The data stored in an XML database can be queried, exported, and serialized into any format needed.

XML-Query Language: XML Query Language, or XQuery, refers to a specific query and programming language for processing XML documents and data. XML data and other databases that store data in a format analogous to XML can be processed with XQuery.

X Server: X server refers to a server program that connects X terminals running on the X Window System, whether locally or in a distributed network. The X server is installed with the X Window System, which is a cross-platform and complete client-server system for managing graphical user interfaces on a single computer or networked ones.

X Terminal: An X terminal is an input terminal with a display, keyboard, mouse, and touchpad that uses X server software to render images. Used with the open-source windowing system known as the X Window System, the X terminal does not perform application processing; this is handled by the network server.

Y

YMODEM: YMODEM refers to an asynchronous communication protocol for modems developed by Chuck Forsberg as a successor to Xmodem and Modem7. It supports batch file transfers and increases transfer block size, enabling the transmission of a whole list or batch of files at one time.

Yoda Condition: A Yoda condition is a conditional expression whose two sides are written in reverse of the usual order; for example, instead of testing whether a variable is equal to a constant, the programmer tests whether a constant is equal to the variable. A key characteristic of Yoda conditions is that they do not impair the function of the code in any way.
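
A short sketch may help here. The constant `MAX_RETRIES` and both function names below are hypothetical. The two functions differ only in the order of the operands, and they always return the same result.

```python
MAX_RETRIES = 3  # hypothetical constant used only for this illustration


def should_stop_conventional(retries: int) -> bool:
    # Conventional order: variable on the left, constant on the right.
    return retries == MAX_RETRIES


def should_stop_yoda(retries: int) -> bool:
    # Yoda order: constant on the left, variable on the right.
    return MAX_RETRIES == retries


if __name__ == "__main__":
    for retries in range(5):
        print(retries, should_stop_conventional(retries), should_stop_yoda(retries))
```

The style is mostly seen in C-like languages, where writing `=` instead of `==` inside a condition compiles as an assignment; with the constant on the left, that typo becomes a compile error. Python already rejects a plain `=` inside a condition, so the Yoda form brings little benefit there.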

Yoyo Mode: Yoyo mode refers to a situation wherein a computer or a similar device seems stuck in a loop — turning on briefly, then turning off again. The idea is that the rapid restart and shut off patterns can be compared to the down and up cycles of a yo-yo.

Yottabytes: A Yottabyte is a data unit. It is equivalent to approximately 1000 Zettabytes, or 250 trillion DVDs. The entire digital universe is estimated to be 1 Yottabyte, and this is expected to double every 18 months.

Z

Zachman Framework: Zachman Framework refers to a visual aid for organizing ideas about enterprise technology. It is attributed to IBM professional John Zachman, as presented in the article “A Framework for Information Systems Architecture” published in the IBM Systems Journal in 1987.

Zend Optimizer: Zend Optimizer refers to a free runtime application used with files encoded by Zend Encoder and Zend Safeguard to boost the overall PHP application runtime speed.

Zettabytes: A Zettabyte is another data unit, equal to approximately 1000 Exabytes or 1 billion Terabytes.
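
The arithmetic behind the Zettabyte and Yottabyte entries can be checked with a small sketch. It assumes decimal (SI) prefixes, where each unit is 1000 times the previous one; the names `UNITS`, `bytes_in`, and `ratio` are hypothetical helpers, not part of any library.

```python
# Decimal (SI) data units in ascending order; each is 1000x the previous one.
UNITS = [
    "kilobyte", "megabyte", "gigabyte", "terabyte",
    "petabyte", "exabyte", "zettabyte", "yottabyte",
]


def bytes_in(unit: str) -> int:
    """Return the number of bytes in one of the given unit."""
    return 1000 ** (UNITS.index(unit) + 1)


def ratio(larger: str, smaller: str) -> int:
    """Return how many of the smaller unit fit into one of the larger unit."""
    return bytes_in(larger) // bytes_in(smaller)


if __name__ == "__main__":
    print(ratio("zettabyte", "exabyte"))     # Exabytes per Zettabyte
    print(ratio("zettabyte", "terabyte"))    # Terabytes per Zettabyte
    print(ratio("yottabyte", "zettabyte"))   # Zettabytes per Yottabyte
```

Some tools use binary multiples of 1024 instead, which is one reason the figures in this glossary are stated as approximate.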

ZooKeeper: ZooKeeper is an open-source software project of the Apache Software Foundation. It is a service that provides centralized configuration management and a naming registry for large distributed systems. ZooKeeper began as a sub-project of Hadoop and is now a top-level Apache project.

Have we missed any essential Big Data terms? Let us know in the comments which terms you think we should add.
