HBase Interview Questions And Answers
Explain what HBase is.
HBase is a column-oriented database management system that runs on top of HDFS (Hadoop Distributed File System). HBase is not a relational data store, and it does not support a structured query language like SQL.
In HBase, a master node manages the cluster, while region servers store portions of the tables and perform the work on the data.
Explain why you would use HBase.
– High capacity storage system
– Distributed design to cater to large tables
– Column-Oriented Stores
– Horizontally Scalable
– High performance & Availability
– The base goal of HBase is millions of columns, thousands of versions and billions of rows
– Unlike HDFS (Hadoop Distributed File System), it supports random, real-time CRUD operations
Mention the key components of HBase.
ZooKeeper: It does the coordination work between the client and the HBase Master
HBase Master: The HBase Master monitors the RegionServers
RegionServer: A RegionServer monitors its Regions
Region: It contains an in-memory data store (MemStore) and HFiles
Catalog Tables: Catalog tables consist of ROOT and META (HBase 0.96 and later keep only the META table, named hbase:meta)
Explain what HBase consists of.
HBase consists of a set of tables
Each table contains rows and columns, like a traditional database
Each table must contain an element defined as a primary key (the row key)
An HBase column denotes an attribute of an object
Mention how many operational commands there are in HBase.
HBase has about five types of operational commands
Explain what WAL and HLog are in HBase.
WAL (Write Ahead Log) is similar to the MySQL binary log: it records all changes that occur in the data. It is a standard Hadoop sequence file and it stores HLogKey instances. These keys consist of a sequential number as well as the actual data, and they are used to replay data that was not yet persisted after a server crash. So, in case of a server failure, the WAL works as a lifeline and recovers the lost data.
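The replay idea can be illustrated with a small toy model. This is not HBase code: the `ToyRegionServer` class, the `replay` function and the Python list standing in for the log file are all hypothetical names invented for this sketch. It only shows the principle that every edit is logged with a sequence number before it is applied in memory, so unflushed edits can be rebuilt after a crash.

```python
class ToyRegionServer:
    """Toy model of write-ahead logging; not the real HBase implementation."""

    def __init__(self):
        self.wal = []          # stands in for the durable log file
        self.memstore = {}     # in-memory edits, lost if the server crashes
        self.store = {}        # stands in for data already flushed to disk
        self.flushed_seq = 0   # highest sequence number that has been flushed

    def put(self, key, value):
        seq = len(self.wal) + 1
        self.wal.append((seq, key, value))  # 1. append to the log first
        self.memstore[key] = value          # 2. then apply the edit in memory
        return seq

    def flush(self):
        # Persist the in-memory edits and remember how far the log is covered.
        self.store.update(self.memstore)
        self.memstore.clear()
        self.flushed_seq = len(self.wal)


def replay(wal, flushed_seq):
    """Rebuild the edits that were logged but never flushed."""
    recovered = {}
    for seq, key, value in wal:
        if seq > flushed_seq:
            recovered[key] = value
    return recovered


server = ToyRegionServer()
server.put("row1", "a")
server.flush()
server.put("row2", "b")
server.put("row1", "c")

# Simulate a crash: the memstore is gone, only the log and flushed data remain.
recovered = replay(server.wal, server.flushed_seq)
print(recovered)
```

Only the two edits written after the flush are replayed; the first edit is already in the flushed store and is skipped.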
When should you use HBase?
– Data size is huge: When you have millions of records to operate on
– Complete Redesign: When you move from an RDBMS to HBase, treat it as a complete redesign rather than merely changing the ports
– SQL-Less commands: You give up several RDBMS features such as transactions, inner joins, typed columns, etc.
– Infrastructure Investment: You need a large enough cluster for HBase to be really useful
What are column families in HBase?
Column families comprise the basic unit of physical storage in HBase, to which features like compression are applied.
Explain what the row key is.
The row key is defined by the application. As the combined key is prefixed by the row key, it enables the application to define the desired sort order. It also allows logical grouping of cells and makes sure that all cells with the same row key are co-located on the same server.
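Because row keys are compared as raw bytes, the way an application encodes them determines the sort order. The sketch below uses plain Python byte strings, whose ordering is also byte-wise lexicographic; the `user` prefix and the zero-padding width are assumptions chosen for illustration.

```python
# Row keys compare byte by byte, so numeric ids sort as text unless padded.
raw_keys = [b"user2", b"user10", b"user1"]

# Hypothetical fix: zero-pad the numeric part to a fixed width of 4 digits.
padded_keys = [b"user" + str(n).zfill(4).encode() for n in (2, 10, 1)]

sorted_raw = sorted(raw_keys)
sorted_padded = sorted(padded_keys)

print(sorted_raw)     # user10 sorts before user2
print(sorted_padded)  # numeric order is preserved
```

With unpadded keys, `user10` lands between `user1` and `user2`; with fixed-width keys the byte order matches the numeric order.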
Explain deletion in HBase. What are the three types of tombstone markers in HBase?
When you delete a cell in HBase, the data is not actually deleted; instead a tombstone marker is set, making the deleted cells invisible. The deleted cells are actually removed during compactions.
There are three types of tombstone markers:
Version delete marker: Marks a single version of a column for deletion
Column delete marker: Marks all versions of a column for deletion
Family delete marker: Marks all columns of a column family for deletion
Explain how HBase actually deletes a row.
In HBase, whatever you write is flushed from RAM to disk, and these disk writes are immutable barring compaction. During the deletion process, major compactions process delete markers while minor compactions don’t. A normal delete results in a delete tombstone marker; the deleted data it represents is removed during compaction.
Also, if you delete data and then add more data with an earlier timestamp than the tombstone timestamp, further Gets may be masked by the delete/tombstone marker, and hence you will not receive the inserted value until after the major compaction.
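The masking effect can be seen in a toy model of a single column. The `ToyColumn` class below is hypothetical and greatly simplified: it models only column delete markers, uses small integers as timestamps, and assumes that a major compaction discards the masked versions together with the marker, so that only a put issued after the compaction becomes visible.

```python
class ToyColumn:
    """Toy model of one column with versions and column delete markers."""

    def __init__(self):
        self.puts = {}         # timestamp -> value
        self.tombstones = []   # timestamps of column delete markers

    def put(self, ts, value):
        self.puts[ts] = value

    def delete(self, ts):
        # Nothing is removed here; only a marker is recorded.
        self.tombstones.append(ts)

    def get(self):
        # A marker hides every version at or below its timestamp.
        cutoff = max(self.tombstones, default=-1)
        visible = [ts for ts in self.puts if ts > cutoff]
        return self.puts[max(visible)] if visible else None

    def major_compact(self):
        # Drop masked versions and the markers themselves.
        cutoff = max(self.tombstones, default=-1)
        self.puts = {ts: v for ts, v in self.puts.items() if ts > cutoff}
        self.tombstones = []


col = ToyColumn()
col.put(10, "old")
col.delete(20)
col.put(15, "late")             # earlier timestamp than the tombstone
masked = col.get()              # hidden by the marker at timestamp 20

col.major_compact()
after_compaction = col.get()    # the masked put was discarded as well

col.put(15, "late")             # same put, issued after the compaction
visible_again = col.get()
print(masked, after_compaction, visible_again)
```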
Explain what happens if you alter the block size of a column family on an already occupied database.
When you alter the block size of a column family, new data uses the new block size while old data remains in the old block size. New files, as they are flushed, have the new block size, whereas existing data continues to be read correctly. During compaction, old data takes on the new block size, so all data should be converted to the new block size after the next major compaction.
Mention the difference between HBase and a relational database.
What is the history of HBase?
2006: BigTable paper published by Google.
2006 (end of year): HBase development starts.
2008: HBase becomes Hadoop sub-project.
2010: HBase becomes Apache top-level project.
What is Apache HBase?
Apache HBase is the Hadoop database: a distributed, scalable NoSQL big data store that started as a sub-project of Apache Hadoop. Use Apache HBase when you need random, realtime read/write access to your Big Data. It is designed to host very large tables (billions of rows by millions of columns) atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable. Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
What is NoSQL?
Apache HBase is a type of “NoSQL” database. “NoSQL” is a general term meaning that the database isn’t an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a “Data Store” than “Data Base” because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc.
What are the main features of Apache HBase?
Apache HBase has many features that support both linear and modular scaling. HBase tables are distributed across the cluster via regions, and regions are automatically split and redistributed as your data grows (automatic sharding). HBase also supports a Block Cache and Bloom Filters for high-volume query optimization.
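The idea of sharding a table into regions by row key range can be sketched with a sorted list of region start keys. Everything in this example is a hypothetical toy: the start keys, the server names and the `locate` and `split` helpers are invented for illustration and do not reflect how HBase actually tracks or assigns regions.

```python
import bisect

# Hypothetical layout: each region covers keys from its start key up to
# (but not including) the start key of the next region.
region_start_keys = [b"", b"g", b"n", b"t"]
region_servers = ["rs1", "rs2", "rs1", "rs3"]


def locate(row_key):
    """Return (region index, server) for the region that holds row_key."""
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return idx, region_servers[idx]


def split(idx, split_key, new_server):
    """Split region idx at split_key; the upper half moves to new_server."""
    region_start_keys.insert(idx + 1, split_key)
    region_servers.insert(idx + 1, new_server)


before = locate(b"dog")      # falls in the first region
split(0, b"c", "rs2")        # the first region grew too large
after = locate(b"dog")       # now served by the new upper half
unchanged = locate(b"apple") # still in the lower half
print(before, after, unchanged)
```

Clients only need the row key to find the right region, which is why splitting and moving regions can happen without changing how the table is addressed.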
When should we use HBase?
-Use HBase only when a table has millions or billions of rows and columns; with only thousands of rows and columns, an RDBMS is the better choice.
-An RDBMS typically runs on a single database server, whereas HBase is distributed and scalable and runs on commodity hardware.
-Typed columns, secondary indexes, transactions, advanced query languages, etc. are features provided by an RDBMS, not by HBase.
What is the difference between HDFS/Hadoop and HBase?
HDFS does not provide fast lookups of individual records in a file; HBase provides fast record lookups for large tables.
Is there any difference between the HBase data model and the RDBMS data model?
In HBase, data is stored in tables (which have rows and columns), similar to an RDBMS, but this is not a helpful analogy. Instead, it can be helpful to think of an HBase table as a multi-dimensional map.
What key terms are used when designing an HBase data model?
-table (an HBase table consists of rows)
-row (a row in HBase contains a row key and one or more columns with values associated with them)
-column (a column in HBase consists of a column family and a column qualifier, which are delimited by a : (colon) character)
-column family (a set of columns and their values; column families should be considered carefully during schema design)
-column qualifier (a column qualifier is added to a column family to provide the index for a given piece of data)
-cell (a cell is a combination of row, column family, and column qualifier, and contains a value and a timestamp, which represents the value’s version)
-timestamp (represents the time on the RegionServer when the data was written, but you can specify a different timestamp value when you put data into the cell)
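The terms above fit together as a nested map. The sketch below models a table as plain Python dictionaries; the row key, the `info` family, the qualifiers, the values and the `get_cell` helper are all made up for illustration, and integer timestamps stand in for real ones.

```python
# row key -> column family -> column qualifier -> timestamp -> value
table = {
    "row1": {
        "info": {
            "name": {2: "Alice B.", 1: "Alice"},
            "email": {1: "alice@example.com"},
        },
    },
}


def get_cell(table, row, column, timestamp=None):
    """Look up one cell; column is written as 'family:qualifier'."""
    family, qualifier = column.split(":", 1)
    versions = table[row][family][qualifier]
    # Without an explicit timestamp, return the newest version.
    ts = max(versions) if timestamp is None else timestamp
    return ts, versions[ts]


latest = get_cell(table, "row1", "info:name")
older = get_cell(table, "row1", "info:name", timestamp=1)
print(latest, older)
```

A cell is addressed by row, column family and column qualifier, and the timestamp selects which version of its value is returned.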
What are the data model operations in HBase?
-Get (returns attributes for a specified row; Gets are executed via HTable.get)
-Put (either adds new rows to a table, if the key is new, or updates existing rows, if the key already exists; Puts are executed via HTable.put (writeBuffer) or HTable.batch (non-writeBuffer))
-Scan (allows iteration over multiple rows for specified attributes)
-Delete (removes a row from a table; Deletes are executed via HTable.delete)
HBase does not modify data in place, and so deletes are handled by creating new markers called tombstones. These tombstones, along with the dead values, are cleaned up on major compaction.
How are filters useful in Apache HBase?
The Filter Language was introduced in Apache HBase 0.92. It allows you to perform server-side filtering when accessing HBase over Thrift or in the HBase shell.
How many filters are available in Apache HBase?
HBase supports 18 filters in total.
How can we use MapReduce with HBase?
Apache MapReduce is a software framework used to analyze large amounts of data, and is the framework used most often with Apache Hadoop. HBase can be used as a data source (TableInputFormat) and as a data sink (TableOutputFormat or MultiTableOutputFormat) for MapReduce jobs. When writing MapReduce jobs that read or write HBase, it is advisable to subclass TableMapper and/or TableReducer.
How do we back up an HBase cluster?
There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster. Each approach has pros and cons.
Full Shutdown Backup
Some environments can tolerate a periodic full shutdown of their HBase cluster, for example if it is being used in a back-end analytic capacity and not serving front-end web pages. The benefit is that the NameNode/Master and RegionServers are down, so there is no chance of missing any in-flight changes to either StoreFiles or metadata. The obvious con is that the cluster is down.
Live Cluster Backup
CopyTable: The CopyTable utility can be used either to copy data from one table to another on the same cluster, or to copy data to a table on another cluster.
Export: The Export approach dumps the content of a table to HDFS on the same cluster.
Does HBase support SQL?
Not really. SQL-ish support for HBase via Hive is in development; however, Hive is based on MapReduce, which is not generally suitable for low-latency requests. Apache Phoenix can also be used to retrieve data from HBase with SQL queries.
What is the maximum recommended cell size?
A rough rule of thumb, with little empirical validation, is to keep the data in HDFS and store pointers to the data in HBase if you expect the cell size to be consistently above 10 MB. If you do expect large cell values and you still plan to use HBase for the storage of cell contents, you’ll want to increase the block size and the maximum region size for the table to keep the index size reasonable and the split frequency acceptable.
Why can’t I iterate through the rows of a table in reverse order?
Because of the way HFile works: for efficiency, column values are put on disk with the length of the value written first and then the bytes of the actual value written second. To navigate through these values in reverse order, these length values would need to be stored twice (at the end as well) or in a side file. A robust secondary index implementation is the likely solution here to ensure the primary use case remains fast. Note that HBase 0.98 and later do support reversed scans through Scan.setReversed.
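A toy length-prefixed layout shows why such data is easy to walk forward and awkward to walk backward. The format below (a 4-byte big-endian length followed by the value bytes) and the `encode` and `iterate_forward` helpers are assumptions made for this sketch; the real HFile format is considerably more involved.

```python
import struct


def encode(values):
    """Write each value as a 4-byte big-endian length followed by its bytes."""
    out = bytearray()
    for v in values:
        out += struct.pack(">I", len(v))
        out += v
    return bytes(out)


def iterate_forward(buf):
    """Walk the buffer front to back, reading each length before its value."""
    pos = 0
    while pos < len(buf):
        (length,) = struct.unpack_from(">I", buf, pos)
        pos += 4
        yield buf[pos:pos + length]
        pos += length


buf = encode([b"a", b"bcd", b"ef"])
values = list(iterate_forward(buf))
print(values)
```

Starting from the end of `buf`, there is no way to tell where the last value begins without its length, which sits in front of it. That is why reverse navigation would need the lengths stored a second time after each value, or kept in a side file.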
What happens if we change the block size of a column family on an already populated database?
When we change the block size of a column family, new data takes the new block size while old data stays in the old block size. When compaction occurs, old data takes the new block size. New files, as they are flushed, will have the new block size, whereas existing data will continue to be read correctly. After the next major compaction, all data should be converted to the new block size.