Wednesday, February 28, 2018
NTT Docomo: Secure and Scalable Data Warehouse on Redshift
An Overview of Adaptive Joins
I was surprised to find out that a lot of people hadn't heard about the new join type, Adaptive Join. So, I figured I could do a quick overview.
Adaptive Join Behavior
Currently, the Adaptive Join only works with columnstore indexes, but according to Microsoft, at some point, it will also work with rowstore. The concept is simple. For larger datasets, frequently (but not always; let's not try to cover every possible caveat, because it depends, right?), a hash join is much faster than a loops join. For smaller datasets, frequently, a loops join is faster. Wouldn't it be nice if we could change the join type on the fly so that the most effective join is used depending on the data in the query? Ta-da! Enter: the Adaptive Join.
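To make the idea concrete, here is a minimal Python sketch of picking a join strategy at run time from the number of rows that actually arrive. It is an illustration only, not how SQL Server implements the operator: the function names and the ROW_THRESHOLD value are hypothetical, and a real optimizer derives its threshold from cost estimates.

```python
from collections import defaultdict

# Hypothetical cut-off for illustration; not a SQL Server setting.
ROW_THRESHOLD = 3


def loops_join(outer, inner, key):
    """Nested loops join: compare every outer row with every inner row."""
    return [(o, i) for o in outer for i in inner if o[key] == i[key]]


def hash_join(outer, inner, key):
    """Hash join: build a hash table on the inner rows, probe it per outer row."""
    table = defaultdict(list)
    for i in inner:
        table[i[key]].append(i)
    return [(o, i) for o in outer for i in table.get(o[key], [])]


def adaptive_join(outer, inner, key, threshold=ROW_THRESHOLD):
    """Choose the strategy from the actual outer row count, then run it."""
    strategy = "hash" if len(outer) >= threshold else "loops"
    join = hash_join if strategy == "hash" else loops_join
    return strategy, join(outer, inner, key)
```

Both strategies return the same rows; only the amount of work differs, which is why deferring the choice until the row count is known can pay off.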
Tuesday, February 27, 2018
Troubleshooting Connection Problems: HANA Express From HANA Studio
After repeatedly seeing questions around failed connections to HANA Express, I'm compiling the basic troubleshooting steps I take when, for example, HANA Studio does not connect to my HXE instance. The most typical error is:
The system cannot be reached. logon data cannot be used.
Active-Active, Redis on Flash, Replication, and Clustering [Videos]
We've been busy working on a whole new round of videos about Redis Enterprise. Let's take a look at the topics we've covered.
Active-Active Geo Distribution With CRDTs
CRDTs (conflict-free replicated data types) are a fascinating, cutting-edge way of re-merging two pieces of data that have become out-of-sync. Redis Enterprise can be deployed in a geo distribution that uses CRDTs to resolve conflicts between pieces of data that have gone out-of-sync. In this video, we describe a scenario that causes out-of-sync data and show how Active/Active Redis Enterprise resolves the conflict at the database level.
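As a rough illustration of how a CRDT re-merges diverged data, the sketch below implements a grow-only counter (G-Counter), one of the simplest CRDTs. It is a generic textbook example written in Python, not Redis Enterprise's implementation; the class name and the node identifiers are assumptions made for the example.

```python
class GCounter:
    """Grow-only counter: each node only ever increments its own slot."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node id -> increments observed from that node

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        """Take the per-node maximum, so merging is order-independent and repeatable."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


# Two regions accept writes independently, then exchange state.
us, eu = GCounter("us"), GCounter("eu")
us.increment(2)
eu.increment(3)
us.merge(eu)
eu.merge(us)
```

Because the merge takes a per-node maximum, both replicas converge on the same total regardless of the order in which they exchange state.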
Monday, February 26, 2018
Moving Your Data From MongoDB to AWS Redshift for Analytical Processing
If you are using MongoDB as your database, have you ever considered how you are going to do analytics on top of the NoSQL database? This is a question I often hear raised as a limitation of MongoDB, and of NoSQL generally. The common complaint is that it is difficult to derive relationships between collections, whereas in relational databases the tables are already related, which helps when generating analytics.
However, moving further, it is important to understand the difference between Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP).
Sunday, February 25, 2018
When Postgres Blocks: 7 Tips for Dealing With Locks
Recently, I wrote about locking behavior in Postgres, which commands block each other, and how you can diagnose blocked commands. Of course, after the diagnosis, you may also want a cure. With Postgres, it is possible to shoot yourself in the foot, but Postgres also offers you a way to stay on target. These are some of the important dos and don'ts that we've seen as helpful when working with users to migrate from their single node Postgres database to Citus or when building new real-time analytics apps on Citus.
1. Never Add a Column With a Default Value
A golden rule of PostgreSQL is: when you add a column to a table in production, never specify a default.
Saturday, February 24, 2018
What Is the Elastic SQL Approach?
Editor's Note: This is an excerpt of a larger white paper written by Craig Mullins that explores database scaling options. You can download the full white paper here.
Elastic SQL: An Alternative, Services-Based Approach
So far, we have examined the status quo, wherein various architectural alternatives of achieving improved elasticity and scalability using existing solutions are offered. But each of these methods has some significant drawbacks: shared-disk requires expensive hardware and software to achieve, shared-nothing requires partitioning the data in a way that may not match the requirements of all applications, and NoSQL eliminates ACID and often, SQL. Clearly, an alternative approach is needed: one that does not try to work around the shortcomings of the existing solutions.
Friday, February 23, 2018
Simple Java Program to Append to a File in HDFS
In this article, I will present you with a Java program to append to a file in HDFS.
I will be using Maven as the build tool.
Thursday, February 22, 2018
High Throughput and Low Latency Master-Slave Replication
There are different kinds of data replication models — mainly, master-slave (Couchbase, MongoDB, Espresso), master-master (BDR for PostgreSQL, GoldenGate for Oracle), and masterless (Dynamo, Cassandra). This article only discusses the master-slave replication in key-value (KV) stores.
In the master-slave replication model, there is one master for a single data partition and one or more replicas, which are essentially slaves and follow the data in the master partition. The client applications send the key-values to the master and subsequently, the key-values are sent over to the replicas from the master.
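The write path described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the class names are hypothetical, and the master forwards each write synchronously, whereas real key-value stores commonly replicate asynchronously and must handle failures.

```python
class Replica:
    """Follows the master: applies whatever key-values it is sent."""

    def __init__(self):
        self.store = {}

    def apply(self, key, value):
        self.store[key] = value


class Master:
    """Owns one data partition and forwards every write to its replicas."""

    def __init__(self, replicas):
        self.store = {}
        self.replicas = list(replicas)

    def put(self, key, value):
        # Clients write to the master first...
        self.store[key] = value
        # ...and the master then sends the key-value on to each replica.
        for replica in self.replicas:
            replica.apply(key, value)


replicas = [Replica(), Replica()]
master = Master(replicas)
master.put("user:1", "alice")
```

After the put call, the master and both replicas hold the same value for the key.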
Wednesday, February 21, 2018
Apache Cassandra vs. Apache Ignite: Strong Consistency and Transactions
NoSQL databases such as Apache Cassandra are the best-known examples of eventually consistent systems. A contract of such systems is simple: if an application triggers a data change on one machine, then the update will be propagated to all the replicas at some point in time — in other words, eventually.
Until the change is fully replicated, the system as a whole will stay in an inconsistent state. And who knows where the application will end up if it tries to read the changed value from an out-of-sync replica — or even worse, update the value concurrently?
Tuesday, February 20, 2018
Developing Applications with Go and NoSQL [Video]
If you didn't know this, Go is one of my favorite programming technologies. It is fast, clean, and not too difficult to learn.
In the past I had created some content around using Go with Couchbase. For example, I demonstrated how to create a user profile store in a previous tutorial titled "Developing a User Profile Store with Golang and a NoSQL Database."
What Are the Database Scalability Methods?
Editor's Note: This is an excerpt of a larger white paper written by Craig Mullins that explores database scaling options. You can download the full white paper here.
Traditional Database Scalability Methods
Let's turn our attention to traditional methods for achieving scalability in database systems. Database scalability is often implemented by clustering. With clustering, multiple servers are used to serve database requests.
Monday, February 19, 2018
This Week in Data: MongoDB Transactions and Spectre/Meltdown Rumble On
Join Percona Chief Evangelist Colin Charles as he covers happenings, gives pointers and provides musings on the open source database community.
In case you missed last week’s column, don’t forget to read the fairly lengthy FOSDEM MySQL & Friends DevRoom summary.
Saturday, February 17, 2018
PostgreSQL Rocks, Except When it Blocks: Understanding Locks
At Citus Data, we engineers take an active role in helping our customers scale out their Postgres database, be it for migrating an existing application or building a new application from scratch. This means we help you with distributing your relational data model—and also with getting the most out of Postgres.
One problem I often see users struggle with when it comes to Postgres is locks. While Postgres is amazing at running multiple operations at the same time, there are a few cases in which Postgres needs to block an operation using a lock. You, therefore, have to be careful about which locks your transactions take, but with the high-level abstractions that PostgreSQL provides, it can be difficult to know exactly what will happen. This post aims to demystify the locking behaviors in Postgres, and to give advice on how to avoid common problems.
How to Debug Disk Full Errors in Redshift
When working with Amazon’s Redshift for the first time, it doesn’t take long to realize it’s different from other relational databases. You have new options like COPY and UNLOAD, and you lose familiar helpers like key constraints. You can work faster with larger sets of data than you ever could with a traditional database, but there’s a learning curve to get the most out of it.
One area we struggled with when getting started was unhelpful disk full errors, especially when we knew we had disk space to spare. Over the last year, we’ve collected a number of resources on how to manage disk space in Redshift. We’ll share what we’ve learned to help you quickly debug your own Redshift cluster and get the most out of it.
Friday, February 16, 2018
Containers, Kubernetes, and Redis Enterprise Explained
Containers are lightweight, stand-alone, portable, self-contained software execution environments. Containers have their own CPU, memory, I/O, and networking resources but they share the kernel of the host operating system. Containers are based on Linux namespaces and cgroups. Namespaces (developed by IBM) create resource isolation for a single process while cgroups (developed by Google) manage resources for a group of processes. Containers have low startup overhead compared to that of a virtual machine running on a hypervisor. Containers are quickly becoming the basic unit of development and software packaging because they decouple applications from operating systems.
Kubernetes is a popular open-source container orchestration engine used to deploy containerized applications. A Kubernetes cluster offers self-healing (restarts), scaling, scheduling and rolling updates of your containerized applications. These are some of the basic primitives that make up a Kubernetes cluster:
Thursday, February 15, 2018
IFQL and the Future of InfluxData
With the big news that came out recently about our Series C financing, I thought I should take some time to talk about what the future holds for the InfluxData platform and InfluxDB. For our CEO Evan Kaplan's perspective see here.
To understand the future direction, I should talk a little bit about the last five years of work and what we've learned in the process. In the beginning of the company I built a "time series API" with web services written in Scala using Cassandra as a backing data store and Redis as a caching and indexing layer. The initial version of this API was RESTful. About six months later, I implemented this as a single binary in Go using LevelDB as the underlying storage engine, but kept the same RESTful API with some key additions.
Wednesday, February 14, 2018
Using PMM With External Monitoring Services
Percona Monitoring and Management (PMM) is a free and open-source platform for managing and monitoring MySQL and MongoDB performance. You can run PMM in your own environment for maximum security and reliability. It provides thorough time-based analysis for MySQL and MongoDB servers to ensure that your data works as efficiently as possible.
Starting with version 1.4.0 and improved in 1.7.0, PMM supports external monitoring services. This means you can plug in Prometheus exporters for technologies not directly provided by Percona. For example, you can start monitoring the metrics of your PostgreSQL database host, Memcached or Redis.
UOL: Pagseguro Uses Mikrotik from AWS Marketplace for Container Network Portability
Hashing With SHA-256 in Oracle 11g R2
As you know, Oracle offers some support for encryption and hashing in the database. Taking advantage of this underlying infrastructure offered by Oracle allows us to accelerate our business considerably.
With Oracle 11g R2, we can see that there are many ways to look at the structure provided to the user.
Tuesday, February 13, 2018
Azure Data Lakes and U-SQL SELECT Transformation Rowsets
In this article, we are going to discuss the basic U-SQL SELECT query transformation rowset technique. I hope it will be informative.
U-SQL SELECT Query Transformation Rowsets
In a previous article, we retrieved data from a “SearchLog.tsv” file to “SearchLog-scalar-variables.csv”. It was just a simple file-to-file movement of data.
Monday, February 12, 2018
Auditing Linked Servers
Last month I noticed this tweet from @SQLPrincess on #sqlhelp, asking if there was a way to find out what happened to a linked server:
The short answer is that SQL Server does not track this information by default. You need to be auditing linked servers for modifications before they happen.
Sunday, February 11, 2018
Do You Really Need That SQL to Be Dynamic?
Dynamic SQL refers to a SQL statement that is constructed, parsed, and executed "dynamically" at runtime (vs. "statically" at compile time).
It's very easy to write static SQL in PL/SQL program units (one of the great joys of working with this database programming language). It's also quite easy to implement dynamic SQL requirements in PL/SQL.
Saturday, February 10, 2018
Event Analytics: How to Define User Sessions With SQL
Recently, we’ve built event analytics for our team and thought to share this experience with you in this post and in an upcoming free webinar. Ready to learn how to transform raw events data into events flow and user sessions?
Many out-of-the-box analytics solutions come with automatically defined user sessions. That’s a good place to start, but as your company grows, you’ll want to have your own session definitions based on your event data. Analyzing user sessions with SQL gives you flexibility and full control over how metrics are defined for your unique business.
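To show what a custom session definition can look like, here is a small Python sketch of one common rule: a new session starts whenever a user has been inactive for longer than a fixed gap. The 30-minute gap, the function name, and the (user_id, timestamp) event shape are all assumptions for illustration; the excerpt does not say which definition the post uses, and the post itself works in SQL.

```python
# Assumed inactivity gap; pick whatever fits your own event data.
SESSION_GAP_SECONDS = 30 * 60


def assign_sessions(events, gap=SESSION_GAP_SECONDS):
    """Label each (user_id, timestamp_seconds) event with a per-user session number.

    Returns (user_id, timestamp_seconds, session_number) tuples sorted by
    user and then by time.
    """
    last_seen = {}   # user -> timestamp of that user's previous event
    session_no = {}  # user -> current session number
    labelled = []
    for user, ts in sorted(events):
        # Start a new session on the first event or after a long pause.
        if user not in last_seen or ts - last_seen[user] > gap:
            session_no[user] = session_no.get(user, 0) + 1
        last_seen[user] = ts
        labelled.append((user, ts, session_no[user]))
    return labelled
```

Changing the gap argument is all it takes to try a different session definition against the same raw events.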
Friday, February 09, 2018
This Week in Neo4j: Data Lineage, Google Cloud, Thomson Reuters' OpenPermID
Welcome to this week in Neo4j, where we round up what's been happening in the world of graph databases in the last seven days.
This week, we have a graph of Thomson Reuters' OpenPermID dataset, running Neo4j on Google Cloud, migrating from MySQL to Neo4j, as well as a data lineage talk from GraphConnect NYC 2017.
AWS re:invent 2017: Case Study: Ola Cabs Uses Amazon EBS and Elastic Volumes to Maxi (DAT329)
Thursday, February 08, 2018
Quick Guide to User-Defined Types in Oracle PL/SQL
A Twitter follower recently asked for more information on user-defined types in the PL/SQL language, and I figured the best way to answer is to offer up this post.
PL/SQL is a strongly typed language. Before you can work with a variable or constant, it must be declared with a type (yes, PL/SQL also supports lots of implicit conversions from one type to another, but still, everything must be declared with a type).
Bringing DevOps to the Database, Part 1: Version Control
For some years now, DevOps practices have been exciting application developers with their promise of short iterations, fast releases, and features that get into the hands of users sooner. Those same practices are now entering the database space, but how can database development adapt, and where should it start?
DevOps has been claiming converts everywhere. No surprise there. Developers like it because it streamlines processes, improves software quality, automates repetitive tasks, and supports continuous delivery, freeing them to concentrate on what they're good at: coding.
Wednesday, February 07, 2018
Understanding Memory Utilization in RavenDB
This is a snapshot of our production server right this moment. As you can see, the system is now using every little bit of RAM that it has at its disposal. You can also see that the CPU is kind of busy and the network is quite active.
In most cases, an admin seeing this will immediately hit the alarm bell and start figuring out what is causing the system to use all available memory. This looks like a system that is on the precipice of doom, after all.
Tuesday, February 06, 2018
Customizing the SQL Prompt Built-In Snippets: A Better 'ata' Snippet
Snippets are a great feature of SQL Prompt. They save coding time and introduce standards and consistency to the way you build code modules. They have multiple replacement points (placeholders) for parameters, and you can invoke them easily from an SSMS query pane.
SQL Prompt also comes with many useful built-in snippets, but sometimes, we need to do some customization work to add the functionality we need. As an example, how might we improve the ALTER TABLE ADD snippet, 'ata'?
Monday, February 05, 2018
How to Retrieve Database Data for API Testing With JMeter
When performing API testing, we always have to go to the database to check the values that the tested API returns. The data samples that need to be tested in the database can be either simple or complex, which leads to an increase in the time required to run the tests. It also often happens that the database limits the number of connected users and the number of running tests that can access it. In such a case, we will get a negative result, because the database will not allow us to connect.
If such cases occur during testing, a possible solution is to retrieve data from the database on to our local machine just once and use it for all the tests you need to perform for the API. This blog post will show you how to do that by using Apache JMeter™.
Heavy Workloads: Our Use Cases of Tarantool
You have probably heard of Tarantool. It’s a super fast DBMS with an application server inside. Tarantool is an open-source project and has been around for more than eight years. At Mail.Ru Group, we’re using it in more than half of our services, such as email, cloud, MyWorld, and Mail.Ru Agent. Because Tarantool is open source, we’re giving all of our work on it back to the community so that our users have the same version of Tarantool as we do.
Tarantool has client libraries for nearly all popular languages, and many of these were partially written by the community, which we immensely appreciate. When we come across a really efficient library, we immediately include it in our public package repositories, as we’re trying hard to deliver the DBMS and libraries right out of the box.
Sunday, February 04, 2018
JSON Modeling for RDBMS Users [Video]
JSON modeling was covered in a previous blog post which was, in turn, based on a CSV import blog post that came before that. While writing the post on JSON modeling, it occurred to me that it might be useful to see all the data in motion: from relational to CSV to a staging bucket and finally assembled with the new model.
JSON Modeling Video
Saturday, February 03, 2018
Why Smart Caches Matter So Much
What makes us “run faster to keep the same pace?” Both data volume and complexity. One day you wake up and suddenly you have 100 times more petabytes to process and queries to handle than you had the day before. This article will help you address this inevitable data problem by exploring the NoSQL options that are available to you and by explaining the benefits of using a smart cache.
NoSQL DBMS Types
There are four main types of NoSQL databases. They differ in the respective data models they use for distribution and replication, and thus each has its own set of strengths and weaknesses in relation to various types of tasks.
Friday, February 02, 2018
When to Use Row or Page Compression in SQL Server
Introduced with SQL Server 2008, page and row compression are turning ten years old this year. In that time, the internet became littered with posts describing both features, how they work, the performance gains, etc.
Despite digesting all of that information, a colleague asked me a very simple question those posts did not answer:
MariaDB ColumnStore Distributed User-Defined Aggregate Functions
MariaDB ColumnStore 1.1 introduces the distributed user-defined aggregate functions (UDAF) C++ API. MariaDB Server has supported UDAF (a C API) for a while, but now, we have extended it to the ColumnStore Engine. This new feature allows anyone to create aggregate functions of arbitrary complexity for distributed execution in the ColumnStore Engine. These functions can also be used as Analytic (Window) functions just like any built-in aggregate. You should have a working understanding of C++ to use this API.
For use as analytic functions, all calls are on the UM.
Thursday, February 01, 2018
Reading Data From Oracle Database With Apache Spark
In this article, I will connect Apache Spark to Oracle DB, read the data directly, and write it in a DataFrame.
Following the rapid increase in the amount of data we produce in daily life, big data technology has entered our lives very quickly. Instead of traditional solutions, we are now using tools with the capacity to solve our business problems quickly and efficiently. Apache Spark is a common technology that can fulfill these needs.
Fun With SQL: Functions in Postgres