Information retrieval is the process of extracting useful
information from a data source to gain insights that help resolve a business
problem. The process can involve mining any kind of data source
available. The data falls into two broad types: structured and unstructured.
Structured Data refers to information whose data model has an
organized structure and provides a straightforward way to perform searches
using traditional algorithms. Structured data mostly resides in a
relational database. Only about 10%-20% of the available data is in this form.
Common examples are the rows of a relational database table, where every record has the same named columns.
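As a minimal sketch of why structured data is easy to search, the following Python snippet loads a few records into an in-memory SQLite table and looks them up with a plain SQL query. The `customers` table, its columns, and the sample rows are illustrative assumptions, not data from any real system.

```python
import sqlite3

# Hypothetical customer table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO customers (id, name, city) VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ben", "Leeds"), (3, "Chen", "Pune")],
)


def customers_in(city):
    """Return the names of customers in a city, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM customers WHERE city = ? ORDER BY name", (city,)
    )
    return [name for (name,) in rows]


print(customers_in("Pune"))  # ['Asha', 'Chen']
```

Because every row has the same columns, the search is a one-line query and needs no prior cleanup of the data.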
Unstructured Data refers to all other data, which
is not held in an organized data model or database. Unstructured data usually
contains garbage data in addition to useful information. The challenge with
this type of data is processing it to separate the garbage from the useful information.
Roughly 80%-90% of the available data is unstructured. Some examples of
unstructured data include social media content, Word documents, and anything
recorded on paper by a person.
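To illustrate the challenge of separating useful information from garbage, here is a small Python sketch that pulls two fields out of a noisy free-text message with regular expressions. The message, the `ORD-<digits>` order-number format, and the simplified email pattern are all assumptions made for the example; real unstructured data usually needs far more robust handling.

```python
import re

# Hypothetical free-text note; the "ORD-<digits>" format is an assumption.
note = "hi!!! my order ORD-1042 arrived broken :( pls reply to sam@example.com thx"

ORDER_RE = re.compile(r"\bORD-\d+\b")
# Deliberately simplified email pattern, good enough for this illustration.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_fields(text):
    """Pull the few useful tokens out of noisy text; missing ones become None."""
    order = ORDER_RE.search(text)
    email = EMAIL_RE.search(text)
    return {
        "order_id": order.group(0) if order else None,
        "email": email.group(0) if email else None,
    }


print(extract_fields(note))
# {'order_id': 'ORD-1042', 'email': 'sam@example.com'}
```

Everything else in the message is discarded, which is the "garbage vs useful information" split in miniature.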
Growth of Unstructured Data Vs Structured Data: The volume of unstructured
data has grown exponentially, far faster than that of structured data. Two major
factors contribute to this uncontrollable growth: user experience is better with rich
content like pictures, videos, music, X-rays, etc., and the storage demands that
accompany such rich content.
To manage the wild growth of unstructured data generated
within an enterprise and to extract information from it, organizations have
adopted two main approaches: Big Data tools and Business Intelligence (BI)
tools. The more conventional approach is BI.
A data warehouse is a central repository of integrated data from disparate
operational systems. It gives structure to raw data by organizing it
into OLAP cubes or dimensional models. BI reporting tools then use the data
from these models.
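A dimensional model can be sketched as a small star schema: one fact table of measurements surrounded by dimension tables that describe them. The Python example below builds such a schema in an in-memory SQLite database and rolls sales up by category and year, the kind of aggregate a BI report would show. The table names (`fact_sales`, `dim_product`, `dim_date`) and all values are hypothetical, and a real warehouse would use a dedicated engine rather than SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales  (product_id INTEGER, date_id INTEGER, amount REAL);

INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Toys');
INSERT INTO dim_date    VALUES (10, 2022), (11, 2023);
INSERT INTO fact_sales  VALUES (1, 10, 20.0), (1, 11, 30.0),
                               (2, 11, 15.0), (2, 11, 5.0);
""")


def sales_by_category_and_year():
    """Roll the fact table up along the product and date dimensions."""
    query = """
        SELECT p.category, d.year, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date    AS d ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
    """
    return conn.execute(query).fetchall()


for row in sales_by_category_and_year():
    print(row)
# ('Books', 2022, 20.0)
# ('Books', 2023, 30.0)
# ('Toys', 2023, 20.0)
```

Each cell of the result corresponds to one cell of a two-dimensional OLAP cube over category and year.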
Advantages of data warehousing:
- Potentially high return on investment for organizations.
- Centralized, structured, and standardized data for easy interpretation and understanding.
- Provides a competitive advantage.
- Faster and better management decisions, because the right information is available at the right time.
- Better enterprise intelligence to enhance customer service.
- Improved reporting capabilities.
Limitations of data warehousing:
- Cost versus benefit is a major concern: a data warehouse may consume a lot of IT person-hours and budget.
- Extra reporting work may be a problem, because a data warehouse requires IT professionals to generate each data type.
- Time-consuming, as data must be extracted, cleaned, and then loaded.
- Data owners lose control over their data, which raises data security and privacy concerns.
- Data flexibility can be a problem, as a data warehouse tends to hold static data with minimal ability to drill down to specific solutions.
- A lot of time and money may be spent on training and on maintaining the data warehouse, especially in a large enterprise.
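The extract, clean, and load steps named in the limitations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV text, the `payments` table, and the cleaning rules (trim and capitalize names, drop rows with a non-numeric amount) are all hypothetical, and production pipelines involve far more validation, which is part of why the process is time-consuming.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with messy names and one bad amount.
RAW_CSV = """name,amount
 alice ,10.5
BOB,not-a-number
carol,7
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Clean: normalize names and skip rows whose amount is not numeric."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # discard rows that cannot be repaired
        cleaned.append((row["name"].strip().title(), amount))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS payments (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(clean(extract(RAW_CSV)), conn)
print(conn.execute("SELECT name, amount FROM payments ORDER BY name").fetchall())
# [('Alice', 10.5), ('Carol', 7.0)]
```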
Future of Data Warehouse:
Hadoop and the data warehouse will complement each other
and grow together as the business need to reap value from big data grows. A new
generation of data warehousing would come up to enhance analytics and reporting
and to integrate with the latest technology platforms that
support processing of unstructured data. Future data warehousing will be able
to provide a 360-degree view of an organization’s operations with a much broader
perspective. In addition, data warehousing in the cloud will become the
trend, and organizations will need to prepare for the transition. Compatibility
with any device, anywhere, will also become the norm: all data warehousing activities
should be supported through browser requests, so organizations will be able to
work from tablets and mobile phones without installing specialized applications.
One last capability that will be expected is the ability to transform the entire
body of web data into a data mesh and make connections as needed on the fly. This will
enable the data warehouse to handle any type of data that may emerge in the
future.