After millions of rows, thousands of files, and hundreds of hours, we've learned a thing or two about working with big data sources, especially files in fixed or CSV formats. These formats have been around for a very long time, and they are universally known and supported: almost any data storage solution has dedicated tools for dealing with CSV files. That's all well and good, but believe me, there's a limit. I'm talking hundreds of MB in a single file, millions of rows. Nice and beefy.
What are my options?
As I mentioned before: this is not a new problem, and there are several potential solutions:
- SQL Developer can load data from a CSV file into a table. It's actually quite nice because it lets you map the fields in the file to the table columns with a graphical interface. But if you've used SQL Developer long enough, you know it's not the quickest of tools. If you try to load a really big file, it will probably take hours and may never finish.
- SQL*Loader can handle CSV files. This is not a bad solution, but it's not very flexible: you have to set up a control (CTL) file, and that ties you to one specific file layout (see the sketch after this list). It's fine for a one-off load, but it gets tricky when you need to load several files every day, each with its own name and fields.
- If you're using Application Express (APEX), you can create a "Data Load Wizard." The cool thing about this is that it supports several file formats. You can paste data straight into the wizard, you can map columns and fields, and it lets you do the processing right there, with PL/SQL. Obviously, this is only an option if you already have APEX installed and it is already being used for something else. I wouldn't recommend installing the tool for this specific purpose. It seems like an unnecessarily long route to take.
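Just to show why SQL*Loader gets rigid, here's a rough sketch of what a control file for a CSV load might look like. The table name, file name, columns, and date format below are all made up for illustration; the point is that the file name and field list are baked right into the CTL file:

LOAD DATA
INFILE 'daily_sales_20240101.csv'
APPEND
INTO TABLE stage_sales
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
(
  sale_date   DATE "YYYY-MM-DD",
  product_id,
  amount
)

You'd then run something like sqlldr user/password control=daily_sales.ctl from the command line, and repeat the control-file surgery every time a new layout shows up.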
What's better than SQL*Loader?
What if humans could use 100% of their brain capacity*? Well, some do, and they came up with External Tables. All jokes aside, this is just one more tool at your disposal, depending on your specific use case. Sometimes you need a hammer, and sometimes you need a screwdriver.
External Tables are somewhat of a complement to SQL*Loader. They let you access data outside the database and treat it like just another table. Sounds cool, doesn't it? Kind of a mix between SQL*Loader and a pipelined function. Files need to be in a text format, not binary (not entirely true, but for our purposes, it is), with clearly delimited records: either a line break or a specific character. If you're dealing with highly structured data such as JSON or XML, this is not for you. Records in JSON or XML are hard to tell apart without at least some parsing, which defeats the purpose of this approach; parsing makes it slow(er). External tables are super fast, and they even support partitioning (since release 12.2.0.1), which lets you take advantage of the same performance improvements you get when you partition tables stored in the database.
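To give you an idea of the partitioning part (the table, files, and column names here are invented, and the exact syntax can vary a bit between releases), a partitioned external table can point each partition at its own file, so queries that filter on the partition key only read the files they need:

CREATE TABLE sales_ext
(
  sale_year  NUMBER,
  product_id NUMBER,
  amount     NUMBER
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY MISC_FILES
  ACCESS PARAMETERS (RECORDS DELIMITED BY NEWLINE FIELDS TERMINATED BY ',')
)
REJECT LIMIT UNLIMITED
PARTITION BY RANGE (sale_year)
(
  PARTITION p_2023 VALUES LESS THAN (2024) LOCATION ('sales_2023.csv'),
  PARTITION p_2024 VALUES LESS THAN (2025) LOCATION ('sales_2024.csv')
);

Treat it as a sketch, not gospel; check the documentation for your exact release before copying it.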
Just show me the goods already.
Fine! Here it is. When creating an External Table, you need to specify certain parameters:
- TYPE: the type of external table, corresponding to the access driver you'll be using:
  - ORACLE_LOADER: reads (loads) data from external files into internal tables. It does not work the other way around, not for this access driver.
  - ORACLE_DATAPUMP: lets you read and write (load and unload) binary dump files.
  - ORACLE_HDFS: reads out of the Hadoop Distributed File System (HDFS).
  - ORACLE_HIVE: reads out of Apache Hive.
- DEFAULT DIRECTORY: as you load the data, you can specify the DIRECTORY where the actual files are located. This is an actual DIRECTORY object, not a path, which means you must create the DIRECTORY object before creating the table (there's a minimal sketch of that right after this list). If you don't specify a DIRECTORY when you load or query the data, this is the one that gets used.
- ACCESS PARAMETERS: this is very specific to the access driver you're using. We'll show you examples of ORACLE_LOADER, but if you want to use the other access drivers, have a look at the documentation.
- LOCATION: similar to DEFAULT DIRECTORY, it specifies the location of the data files. For ORACLE_LOADER and ORACLE_DATAPUMP, the files are named in the form DIRECTORY:file. If no DIRECTORY is specified, it will use the default (see above).
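Since the DIRECTORY object keeps coming up, here's a minimal sketch of creating one. The path and the grantee (etl_user) are placeholders for whatever your environment uses, and you'll need the CREATE ANY DIRECTORY privilege, or a friendly DBA:

CREATE OR REPLACE DIRECTORY MISC_FILES AS '/data/incoming/csv';
GRANT READ, WRITE ON DIRECTORY MISC_FILES TO etl_user;

Remember that the path lives on the database server, not on your client machine, and the operating system user that runs the database needs access to it.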
We went for the most generic approach we could think of. Instead of creating an External Table for every format we had, we used a single table with generic columns such as A, B, C... AA, AB, AC, and so on. Think of an Excel spreadsheet.
CREATE TABLE stage_generic_csv_ext
(
  row_nr NUMBER,
  c_a    VARCHAR2(4000),
  c_b    VARCHAR2(4000),
  c_c    VARCHAR2(4000),
  ...
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY MISC_FILES
  ACCESS PARAMETERS
  (
    RECORDS DELIMITED BY NEWLINE
    READSIZE 1048576
    FIELDS CSV WITH EMBEDDED RECORD TERMINATORS
    MISSING FIELD VALUES ARE NULL
    NULLIF = '#missing'
    (
      row_nr RECNUM,
      "S_A" CHAR(10000),
      "S_B" CHAR(10000),
      "S_C" CHAR(10000),
      ...
    )
    COLUMN TRANSFORMS
    (
      C_A FROM STARTOF S_A (4000),
      C_B FROM STARTOF S_B (4000),
      C_C FROM STARTOF S_C (4000),
      ...
    )
  )
  LOCATION (MISC_FILES:'generic.csv')
)
REJECT LIMIT 0;
This doesn't make any sense to me.
Let's take it one step at a time. The first group of columns (ROW_NR, C_A, C_B, ...) defines the columns of the table itself, the ones you'll see when you query it; they are declared independently of whatever the file happens to contain.
ORGANIZATION EXTERNAL
This is simply telling Oracle you're creating an External Table.
TYPE, DEFAULT DIRECTORY
We discussed this a little earlier.
ACCESS PARAMETERS
This is the important section specific to our use case. You can basically copy what we did, but we also included a link to the official documentation if our use case does not fit yours. You start by defining the record properties:
- RECORDS DELIMITED BY NEWLINE: this is self-explanatory: the symbol that delimits records. In a CSV file, it should be a line break. Another option is fixed-size records, often used in banking, for example: all records have exactly the same length, and fields are padded with zeros or blanks (there's a sketch of that right after this list).
- READSIZE: The size of the read buffer must be at least as big as the largest input record the access driver will encounter. The default value is 512 KB.
- WITH EMBEDDED and WITHOUT EMBEDDED options specify whether record terminators are included (embedded) in the data.
- NULLIF: applies to all character fields and determines when a field is considered to be NULL. We match it to the literal string "#missing" because that's how our file provider represents nulls; adjust it to whatever yours uses. You can use wildcards (*).
- The next section ("S_A" CHAR(10000),"S_B" CHAR(10000)…) defines the max allowed size for the fields in the CSV.
- COLUMN TRANSFORMS: here you can map fields from the CSV file (defined in the previous section) to columns in the External Table. You have a few options, like concatenating (CONCAT), using static values (CONSTANT), nullifying (NULL), or, in our case, truncating the contents to a specified max length (STARTOF).
- LOCATION: this has very little importance in our use case, since the source file is always different and can have different field definitions. If you have a fixed structure, you can use an empty file (headers only), or you can just point to any existing CSV file. You can override this setting when you query the external table, which is what we do, and I'll show you how below.
- REJECT LIMIT: lets you specify how many conversion errors can occur during a query of the external data before an Oracle Database error is returned and the query is aborted. The default value is 0.
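And because the fixed-size case from the RECORDS bullet comes up a lot in banking feeds, here's a rough sketch of what those access parameters could look like. The record length, field names, and positions are invented for illustration (and note that a trailing newline, if present, typically counts toward the fixed length):

RECORDS FIXED 40
FIELDS
(
  account_no POSITION(1:10)  CHAR(10),
  trx_date   POSITION(11:18) CHAR(8),
  amount     POSITION(19:30) CHAR(12),
  currency   POSITION(31:33) CHAR(3),
  filler     POSITION(34:40) CHAR(7)
)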
The Query
Finally, what wraps everything together is the query:
SELECT c_a, c_b, c_c
  FROM stage_generic_csv_ext
       EXTERNAL MODIFY
       (
         ACCESS PARAMETERS (NOBADFILE NOLOGFILE NODISCARDFILE)
         LOCATION (MISC_FILES:'CUST_DW_DELTA_RPT.CSV')
       );
As I mentioned before, you have the option to override some of the settings you defined when creating the table. The most important one here is LOCATION: you can point to a different directory and file. Since our solution is generic, this changes all the time, so we specify it at query time and Oracle goes and opens that specific file. The only caveat is that the DIRECTORY needs to exist by the time you query.
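To close the loop, this is roughly how the data goes from the generic external table into a real, typed table. The target table, its columns, and the formats below are made up; the idea is simply that the conversion from generic VARCHAR2 columns to proper types happens in plain SQL:

INSERT /*+ APPEND */ INTO stage_customer_delta (customer_id, created_on, balance)
SELECT TO_NUMBER(c_a),
       TO_DATE(c_b, 'YYYY-MM-DD'),
       TO_NUMBER(c_c)
  FROM stage_generic_csv_ext
       EXTERNAL MODIFY
       (
         ACCESS PARAMETERS (NOBADFILE NOLOGFILE NODISCARDFILE)
         LOCATION (MISC_FILES:'CUST_DW_DELTA_RPT.CSV')
       )
 WHERE row_nr > 1;  -- skip the header row

That's also where the ROW_NR (RECNUM) column earns its keep: it makes skipping headers, or reporting the offending line number when a conversion fails, trivial.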
The more we progress as a species, the more data we produce and log. Everyone has a phone nowadays. Simple analog tools and devices, like tractors and traffic lights, now have GPS sensors and are network enabled. All this new data needs to be loaded into a centralized repository for us to analyze and make decisions on. CSV is a well-known format that you can use to bring together all those different sources, no matter how massive the dataset is. Now that you know how to load big files, that's one less thing to worry about. You can spend more time focusing on the activities that bring value to your business. Good luck!
*This is a myth, by the way. We do use our entire brain all the time. Well, some people do. I think. I'm not sure what my numbers are. I wonder how they calculate that. Should I see a doctor? Hmmm.